
The term vocabulary and postings lists

Tokenization

Definitions

  • Token
    • A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing
  • Type
    • A type is the class of all tokens containing the same character sequence
  • Term
    • A term is a (perhaps normalized) type that is included in the IR system’s dictionary

Example:

to sleep perchance to dream

Here the tokens are to, sleep, perchance, to, dream (five tokens, since to occurs twice). The types are to, sleep, perchance, dream. If to is treated as a stop word and therefore not indexed, the terms are sleep, perchance, dream.
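
The same example can be worked through in a few lines of Python. This is only a minimal sketch: it assumes whitespace splitting is good enough for this sentence, and the one-word STOP_WORDS set is a hypothetical stand-in for a real stop list.

```python
text = "to sleep perchance to dream"

# Hypothetical stop list used only for this example.
STOP_WORDS = {"to"}

# Tokens: every occurrence counts, so "to" appears twice.
tokens = text.split()

# Types: distinct character sequences, kept in first-seen order.
types = list(dict.fromkeys(tokens))

# Terms: the types that survive stop-word removal and enter the dictionary.
terms = [t for t in types if t not in STOP_WORDS]

print(tokens)
print(types)
print(terms)
```

No normalization (case folding, stemming) is applied here, so each term is identical to its type.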

Problems in tokenization

  • How should words with apostrophes, such as O'Neal and aren't, be tokenized?
  • If we remove or expand punctuation, how do we still recognize names like C# or M*A*S*H?
  • We should also be able to recognize special structures such as websites (deebakkarthi.com), IP addresses (127.0.0.1) and package tracking numbers (1Z9999W99845399981).
  • Hyphens also pose a complex issue, as we need to split on them in some cases but not in others
  • Simply splitting on white space is not as straightforward as one would think
  • Terms like Los Angeles or San Francisco should not be split
  • Tokenization is language specific. The problems above mainly pertain to English; each language has its own set of quirks
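
Several of the problems listed above can be seen by comparing two tokenizers on the same input. The sketch below is illustrative only: naive_tokenize and pattern_tokenize are hypothetical helper names, the regular expression is a hand-written assumption rather than a standard tokenizer, and the hard-coded city names stand in for what would normally be a phrase dictionary.

```python
import re

def naive_tokenize(text):
    # Treat every non-alphanumeric character as a separator.
    return re.sub(r"[^A-Za-z0-9]+", " ", text).split()

# Alternatives are tried in order, so the special structures come first.
TOKEN_PATTERN = re.compile(
    r"""
    \d{1,3}(?:\.\d{1,3}){3}               # IPv4-like address
  | [A-Za-z0-9-]+(?:\.[A-Za-z]{2,})+      # domain name
  | Los\ Angeles | San\ Francisco         # tiny hard-coded list of multiword names
  | [A-Za-z](?:\*[A-Za-z])+               # letters joined by asterisks
  | [A-Za-z]\#                            # a single letter followed by a hash sign
  | [A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*     # words with internal apostrophes or hyphens
    """,
    re.VERBOSE,
)

def pattern_tokenize(text):
    # All groups are non-capturing, so findall returns the full matches.
    return TOKEN_PATTERN.findall(text)

sample = "O'Neal aren't C# M*A*S*H deebakkarthi.com 127.0.0.1 Los Angeles state-of-the-art"
print(naive_tokenize(sample))
print(pattern_tokenize(sample))
```

The naive version breaks O'Neal, C#, M*A*S*H, the domain name, and the IP address into fragments, while the pattern-based version keeps them whole. It does so by making fixed choices (hyphenated words are never split, apostrophes are never expanded) that will be wrong for some inputs, and it is specific to English.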