..
impl-log
TODO
- Preprocess
- Lowercase
- Tokenize
- Separate out the href links. Turn each one into a special tuple of the form
(word, trailing part of the Wikipedia link)
- Stemming
- [words, words, ..., (), (), ()]
- Term-Document Incidence Matrix
- Vocabulary
- tf
- 1 + log(tf)
- IDF
- (1 + log(tf)) * IDF
- [0.4, 0, 4.5, 0, …]
- Normalize
- Query
- Follow the same procedure as the document and represent it as a vector
- Find cosine similarity of all docs and query
- Sort them by similarity, highest first
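
The steps above can be sketched end to end in plain Python. This is a minimal sketch under several assumptions, not the planned implementation: links are assumed to arrive as HTML ``<a href="...">`` anchors, the link text is removed from the ordinary words and kept only in its tuple, stemming is skipped, IDF is taken as log(N / df), and every function name here (``preprocess``, ``build_idf``, ``tfidf_vector``, ``cosine``, ``rank``) is hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: links appear as HTML anchors; the notes do not specify the markup.
LINK_RE = re.compile(r'<a\s+href="([^"]*)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def preprocess(text):
    """Return [words, ..., (word, link tail), ...]. Stemming is omitted here."""
    # (anchor text lowercased, part of the href after the last "/")
    links = [(m.group(2).lower(), m.group(1).rsplit("/", 1)[-1])
             for m in LINK_RE.finditer(text)]
    plain = LINK_RE.sub(" ", text)  # separate the links out of the running text
    words = re.findall(r"[a-z0-9]+", plain.lower())  # lowercase, then tokenize
    return words + links


def build_idf(docs_tokens):
    """IDF per vocabulary term, assumed here to be log(N / df)."""
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse, length-normalized vector with weights (1 + log(tf)) * IDF."""
    tf = Counter(t for t in tokens if t in idf)  # ignore out-of-vocabulary terms
    weights = {t: (1 + math.log(c)) * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {}
    return {t: w / norm for t, w in weights.items()}


def cosine(u, v):
    """Dot product of two already-normalized sparse vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def rank(query, docs):
    """Return (score, doc_index) pairs sorted by descending cosine similarity."""
    docs_tokens = [preprocess(d) for d in docs]
    idf = build_idf(docs_tokens)
    doc_vecs = [tfidf_vector(tokens, idf) for tokens in docs_tokens]
    q_vec = tfidf_vector(preprocess(query), idf)  # same procedure as the documents
    scores = [(cosine(q_vec, dv), i) for i, dv in enumerate(doc_vecs)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


if __name__ == "__main__":
    docs = [
        "The cat sat on the mat",
        "Dogs chase the cat",
        "Stock markets fell today",
    ]
    for score, index in rank("cat mat", docs):
        print(index, round(score, 3), docs[index])
```

Because the vectors are normalized up front, the cosine similarity reduces to a dot product. The sparse dictionaries stand in for the rows of the term-document matrix; a dense matrix would work equally well for a small vocabulary.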