2023-11-30

impl-log

TODO

Preprocess
- Lowercase
- Tokenize
  - Pull hyperlinks (href) out of the text. Turn each one into a special tuple of the form (anchor word, last path segment of the Wikipedia link)
- Stemming
- Output per document: [word, word, ..., (word, link), (word, link), (word, link)]
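
The Preprocess steps above can be sketched roughly as follows. This is a minimal sketch, not the final implementation: it assumes links arrive as HTML `<a href="...">` anchors, that the anchor text is removed from the running text once it has been turned into a tuple, and that a crude suffix stripper is good enough as a stand-in for a real stemmer (e.g. Porter). The names `LINK_RE`, `TOKEN_RE`, `toy_stem`, and `preprocess` are hypothetical.

```python
import re

# Assumed input format: <a href="...">anchor text</a>
LINK_RE = re.compile(r'<a\s+[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
                     re.IGNORECASE | re.DOTALL)
TOKEN_RE = re.compile(r"[a-z0-9]+")


def toy_stem(word):
    """Crude suffix stripper; a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters of the stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Return [stemmed words..., (anchor word, link tail), ...]."""
    links = []

    def pull_link(match):
        href, anchor = match.group(1), match.group(2)
        # "Back part" of the link: everything after the last slash.
        tail = href.rstrip("/").rsplit("/", 1)[-1]
        links.append((anchor.strip().lower(), tail))
        return " "  # assumption: anchor text leaves the running text

    plain = LINK_RE.sub(pull_link, text)
    words = [toy_stem(tok) for tok in TOKEN_RE.findall(plain.lower())]
    return words + links


tokens = preprocess('Ranking uses <a href="/wiki/Okapi_BM25">BM25</a> weights')
```

Here `tokens` has plain stemmed words first and link tuples at the end, matching the list shape noted above.
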
Term-Document Matrix (tf-idf weighted, not a binary incidence matrix)
- Vocabulary
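- tf: raw count of each term in each document
- Log-scaled tf: 1 + log(tf) when tf > 0, else 0
- IDF: log(N / df), where N is the number of documents and df is the number of documents containing the term
- Weight: (1 + log(tf)) * IDF
- Each document becomes a weight vector, e.g. [0.4, 0, 4.5, 0, …]
- Normalize each vector to unit length (L2 norm)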
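
A rough sketch of the weighting steps above, under a few assumptions: each document is already a list of hashable tokens (plain words or link tuples), logarithms are base 10, and IDF is log(N / df). The names `build_tfidf` and `normalize` are hypothetical.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a vector to unit L2 length; all-zero vectors are returned as-is."""
    norm = math.sqrt(sum(w * w for w in vec))
    return [w / norm for w in vec] if norm else vec


def build_tfidf(docs):
    """docs: list of token lists. Returns (vocab, idf, normalized doc vectors)."""
    # Vocabulary in first-seen order; tokens only need to be hashable.
    vocab = list(dict.fromkeys(t for doc in docs for t in doc))
    index = {term: i for i, term in enumerate(vocab)}

    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in docs for t in set(doc))
    n_docs = len(docs)
    idf = [math.log10(n_docs / df[term]) for term in vocab]

    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)  # terms absent from the doc keep weight 0
        for term, tf in Counter(doc).items():
            i = index[term]
            vec[i] = (1 + math.log10(tf)) * idf[i]
        vectors.append(normalize(vec))
    return vocab, idf, vectors


vocab, idf, doc_vectors = build_tfidf([["a", "b", "b"], ["a", "c"]])
```

Note that a term appearing in every document gets IDF 0, so it contributes nothing to any vector.
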
Query
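- Run the query through the same preprocessing and weighting as the documents and represent it as a vector
- Compute the cosine similarity between the query vector and every document vector
- Sort the documents by similarity, highest first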
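
The Query steps could look something like this. It is a sketch under assumptions: the vocabulary, IDF values, and unit-length document vectors already exist; query terms outside the vocabulary are ignored; logarithms are base 10. Because both sides are normalized, cosine similarity reduces to a dot product. The names `query_vector` and `rank` are hypothetical.

```python
import math
from collections import Counter


def query_vector(tokens, vocab, idf):
    """Weight the query like a document and scale it to unit length."""
    index = {term: i for i, term in enumerate(vocab)}
    vec = [0.0] * len(vocab)
    for term, tf in Counter(tokens).items():
        if term in index:  # assumption: unseen query terms are dropped
            i = index[term]
            vec[i] = (1 + math.log10(tf)) * idf[i]
    norm = math.sqrt(sum(w * w for w in vec))
    return [w / norm for w in vec] if norm else vec


def rank(query_vec, doc_vectors):
    """Return (doc_id, cosine score) pairs, best match first."""
    scores = [
        (doc_id, sum(q * d for q, d in zip(query_vec, doc)))
        for doc_id, doc in enumerate(doc_vectors)
    ]
    # Highest score first; ties broken by lower doc_id.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


# Toy values standing in for a real index.
vocab = ["a", "b", "c"]
idf = [0.0, 0.3, 0.3]
doc_vectors = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

ranked = rank(query_vector(["c", "zzz"], vocab, idf), doc_vectors)
```

In this toy setup the second document (doc_id 1) comes out on top, since it is the only one containing "c".
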