Permuterm Index
- B-tree can handle prefix queries such as
mon* - They can also handle suffix queries like
*ed. We can do this by creating a new inverted index but with all the words backwards - The part that they struggle with is queries like
an*l. The naive way is to process the query asan* AND *l - But this is very expensive
- The solution is the Permuterm Index
- It is a separate index that we create with all the permutations of the each of the term in the regular inverted index
Procedure to create all the permutations of a term
- Suppose we have a term
hello - First add the special character
$to end to make ithello$ - The perform Left Circular Shift until the
$character is the first
[!example]
hello$ ello$h llo$he lo$hel o$hell $hello
Creation of the permuterm index
[!example] Consider a very small vocabulary of
[hello, world]The permutations ofhelloarehello$, ello$h ... $helloThe permutations ofworldareworld$, orld$w, rld$wo ... $worldThe permuterm index will have[hello$, ello$h, llo$he, lo$hel, o$hell, $hello, world$, orld$w, rld$wo, ld$wor, d$worl, $world]The associated posting list of these will be the original word that they were derived from So the final index will look like{
“hello$”: “hello”, “ello$h”: “hello”, “llo$he”: “hello”, “lo$hel”: “hello”, “o$hell”: “hello”, “$hello”: “hello”, “world$”: “world”, “orld$w”: “world”, “rld$wo”: “world”, “ld$wor”: “world”, “d$worl”: “world”, “$world”: “world” }
We store these as a *B-tree* - the same way an inverted index is stored.
How to process wild card queries
Suppose we have a wild card query like X*Y where X and Y are some strings. How do we process them? First we add $ to the end of the query to make it X*Y$ then we do circular left shift till the * is at the end. So X*Y$ -> *Y$X -> Y$X*. Then we perform a prefix search on Y$X.
Why does this work?
Remember that prefix search works well on B-trees. If we can turn any arbitrary wild card into a prefix search then we are golden.