The term vocabulary and postings lists
Tokenization
Definitions
- Token
- A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing
- Type
- A type is the class of all tokens containing the same character sequence
- Term
- A term is a (perhaps normalized) type that is included in the IR system’s dictionary
Example:
to sleep perchance to dream
Here the tokens are to, sleep, perchance, to, dream. But the types are to, sleep, perchance, dream. If "to" is not to be indexed because it is a stop word, then the terms will be sleep, perchance, dream.
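The distinction between tokens, types and terms can be sketched in a few lines of Python. This is only an illustration, not how a real IR system is built: it assumes whitespace splitting is good enough for this sentence, and the function name tokens_types_terms and the one-word stop list are hypothetical.

```python
# Hypothetical stop list for illustration; real stop lists are much longer.
STOP_WORDS = {"to"}

def tokens_types_terms(text, stop_words=STOP_WORDS):
    """Return (tokens, types, terms) for a short piece of text."""
    # Tokens: every occurrence counts, so repeats are kept.
    tokens = text.split()
    # Types: distinct character sequences, kept in first-seen order.
    types = list(dict.fromkeys(tokens))
    # Terms: the types that survive stop word removal (no other normalization here).
    terms = [t for t in types if t not in stop_words]
    return tokens, types, terms

tokens, types, terms = tokens_types_terms("to sleep perchance to dream")
print(tokens)
print(types)
print(terms)
```

The sentence has five tokens but only four types, because "to" occurs twice; dropping the stop word leaves three terms.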
Problems in tokenization
- How should we tokenize words with apostrophes, such as O'neal and aren't?
- If we remove or expand punctuation, how would we recognize words like C# or M*A*S*H?
- We should also be able to recognize special structures such as websites (deebakkarthi.com), IP addresses (127.0.0.1) and package tracking numbers (1Z9999W99845399981).
- Hyphens also pose a complex issue, as we need to split on them in some cases but not in others.
- Simply splitting on white space is not as easy as one would think.
- Terms like Los Angeles or San Francisco should not be split
- Tokenization is language specific. The above problems mainly pertain to English. Each language has its own set of quirks
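Several of the problems above show up as soon as two naive tokenizers are tried on the examples. The sketch below is only a demonstration of the failure modes, not a recommended tokenizer; the function names whitespace_tokenize and strip_punct_tokenize are hypothetical, and the second one assumes ASCII letters and digits only.

```python
import re

def whitespace_tokenize(text):
    """Split on runs of white space and nothing else."""
    return text.split()

def strip_punct_tokenize(text):
    """Treat every non-alphanumeric character as a separator (ASCII only)."""
    return re.findall(r"[A-Za-z0-9]+", text)

# Apostrophes: the words are broken into meaningless pieces.
print(strip_punct_tokenize("O'neal aren't"))   # ['O', 'neal', 'aren', 't']

# Punctuation that is part of the name: C# and M*A*S*H are lost.
print(strip_punct_tokenize("C# M*A*S*H"))      # ['C', 'M', 'A', 'S', 'H']

# Special structures: an IP address falls apart into four numbers.
print(strip_punct_tokenize("127.0.0.1"))       # ['127', '0', '0', '1']

# Multiword names: whitespace splitting separates Los from Angeles.
print(whitespace_tokenize("Los Angeles"))      # ['Los', 'Angeles']
```

Neither approach is right on its own: keeping punctuation preserves C# and 127.0.0.1 but leaves trailing commas and periods attached to words, while stripping it destroys exactly the tokens a user is likely to search for.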