..

impl-log

TODO

  • Preprocess
    • Lowercase
    • Tokenize
      • Separate out href links: turn each one into a special tuple of the form (word, trailing part of the Wikipedia link)
    • Stemming
    • Output: [word, word, ..., (word, link), (word, link), ...]
  • Term-Document Incidence Matrix
    • Vocabulary
    • tf
    • 1 + log(tf)
    • IDF
    • (1 + log(tf)) * IDF
    • [0.4, 0, 4.5, 0, …]
    • Normalize
  • Query
    • Preprocess and weight the query the same way as the documents, and represent it as a vector
    • Compute the cosine similarity between the query and every document
    • Sort the documents by similarity, highest first
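
A rough sketch of the whole plan above, end to end, just to check that the steps fit together. Everything named here is an assumption and not settled design: the link pattern (``LINK_RE`` assumes ``<a href="...">word</a>`` markup), the placeholder ``crude_stem`` (a real stemmer such as Porter would replace it), base-10 logs, IDF taken as log10(N / df), and keeping the link's anchor word among the ordinary tokens as well as in its tuple.

```python
import math
import re
from collections import Counter

# Assumption: links appear as <a href=".../Some_Page">word</a>.
LINK_RE = re.compile(r'<a\s+href="[^"]*/([^"/]+)"\s*>(.*?)</a>', re.IGNORECASE)


def crude_stem(word):
    """Placeholder suffix stripper; stands in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Return [stem, stem, ..., (word, link_tail), ...]."""
    # Pull out (word, link tail) tuples before lowercasing, so the tail keeps its case.
    links = [(word.lower(), tail) for tail, word in LINK_RE.findall(text)]
    # Replace each link by its anchor text, then lowercase and tokenize.
    plain = LINK_RE.sub(r" \2 ", text).lower()
    tokens = re.findall(r"[a-z0-9]+", plain)
    return [crude_stem(t) for t in tokens] + links


def terms(tokens):
    """Keep only the plain string tokens; link tuples are not weighted here."""
    return [t for t in tokens if isinstance(t, str)]


def vectorize(term_list, vocab, idf):
    """(1 + log10(tf)) * IDF per vocabulary term, then length-normalize."""
    tf = Counter(term_list)
    weights = [(1 + math.log10(tf[t])) * idf[t] if tf[t] else 0.0 for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights


def build_index(docs):
    """Return the vocabulary, the IDF table, and one normalized row per document."""
    doc_terms = [terms(preprocess(d)) for d in docs]
    vocab = sorted({t for ts in doc_terms for t in ts})
    df = Counter(t for ts in doc_terms for t in set(ts))
    idf = {t: math.log10(len(docs) / df[t]) for t in vocab}
    matrix = [vectorize(ts, vocab, idf) for ts in doc_terms]
    return vocab, idf, matrix


def rank(query, docs):
    """Return (doc_index, cosine_similarity) pairs, best match first."""
    vocab, idf, matrix = build_index(docs)
    q = vectorize(terms(preprocess(query)), vocab, idf)
    # Vectors are already unit length, so the dot product is the cosine.
    scores = [sum(a * b for a, b in zip(q, row)) for row in matrix]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    docs = [
        "The cat sat on the mat",
        "Dogs chased the cat",
        "Stock markets fell sharply",
    ]
    for doc_id, score in rank("cat", docs):
        print(doc_id, round(score, 3))
```

Query terms that never occur in any document fall outside the vocabulary and are simply ignored in this sketch. A term that occurs in every document gets an IDF of 0 and so contributes nothing to the score.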