
N-Grams

Definition

  • N-Grams are probabilistic models that predict the next word based on the last $N-1$ words.
  • They are also called Language Models
  • By chaining the conditional probabilities of successive words, we can also find the joint probability of the entire sentence
  • Conditional probability helps us predict the next word
  • Joint probability tells us how likely a given sentence is to occur, based on the training corpus

all of a sudden I notice three guys standing on the sidewalk

This is a coherent English sentence that follows the rules of grammar, so it has a non-zero probability of occurring.

on guys all I of notice sidewalk three a sudden standing the

This has almost no chance of occurring.

  • N-gram probabilities are used in applications such as spell checking, where improbable word sequences signal likely errors.

Naive approach

  • Our goal is to calculate the probability $P(word | history)$. We need to find the probability of $word$ occurring given that $history$ already occurred
  • A simple formula is $\frac{C(history\ word)}{C(history)}$
  • We count up all the times that $history$ has occurred, and count the number of times it was followed by $word$
  • The main issue is that language is very creative: new sentences are invented all the time, so many perfectly valid sequences never appear even in datasets containing millions of words
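
A minimal sketch of this naive counting, assuming a pre-tokenized corpus; the corpus, history, and function name here are illustrative:

```python
def naive_next_word_prob(corpus_tokens, history, word):
    """Estimate P(word | history) by raw counting: C(history word) / C(history)."""
    n = len(history)
    history_count = 0
    history_word_count = 0
    # Slide over the corpus looking for exact matches of the full history.
    for i in range(len(corpus_tokens) - n):
        if corpus_tokens[i:i + n] == history:
            history_count += 1
            if corpus_tokens[i + n] == word:
                history_word_count += 1
    if history_count == 0:
        return 0.0  # history never seen at all: the sparsity problem in action
    return history_word_count / history_count

# Toy corpus; even corpora of millions of words leave most histories unseen.
tokens = "all of a sudden i notice three guys standing on the sidewalk".split()
print(naive_next_word_prob(tokens, ["of", "a"], "sudden"))  # 1.0
```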

Better approach

  • We can model this as a joint probability
  • $P(w_1^n) = P(w_1)P(w_2|w_1)P(w_3|w_1^2)\ldots P(w_n|w_1^{n-1})$ where $w_1^n$ is the sequence $(w_1, w_2, w_3, \ldots, w_n)$
  • We have broken down the joint probability into a product of conditional probabilities
  • But we still don’t know how to compute these conditional probabilities. Our original problem was estimating $P(w_n | w_1^{n-1})$. Rewriting it in terms of $P(w_{n-1} | w_1^{n-2})$ doesn’t help much, as we are still plagued by the original issue: we can’t use naive counting to estimate a probability conditioned on a long history
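
The decomposition itself is easy to express in code; a small sketch where cond_prob is a hypothetical placeholder for exactly the quantity we don’t yet know how to estimate:

```python
def chain_rule_joint_prob(sentence_tokens, cond_prob):
    """P(w_1^n) = product over k of P(w_k | w_1^{k-1}), by the chain rule.

    cond_prob(word, history) is a hypothetical placeholder; estimating it
    for long histories is the unsolved part that motivates the bigram model.
    """
    prob = 1.0
    for k, word in enumerate(sentence_tokens):
        prob *= cond_prob(word, sentence_tokens[:k])  # history = all earlier words
    return prob
```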

Bigram model

  • Instead of conditioning on the entire history, the bigram model uses only the preceding word (a Markov assumption)
  • Using this we can approximate the probability that we want $P(w_n | w_1^{n-1}) \approx P(w_n| w_{n-1})$
  • So the entire equation becomes $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k | w_{k-1})$
  • How do we compute this $P(w_n | w_{n-1})$ term? We use something known as Maximum Likelihood Estimation
  • That is given by $P(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n)}{C(w_{n-1})}$. This can be thought of as “of all the times that $w_{n-1}$ occurred, how many times was it followed by $w_n$”.
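
A minimal sketch of this MLE computation, assuming a pre-tokenized corpus; sentence-boundary markers and smoothing are omitted, and all names are illustrative:

```python
from collections import Counter

def train_bigram_mle(corpus_tokens):
    """Return P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}) for every seen bigram."""
    context_counts = Counter(corpus_tokens[:-1])                    # C(w_{n-1})
    bigram_counts = Counter(zip(corpus_tokens, corpus_tokens[1:]))  # C(w_{n-1} w_n)
    return {
        (prev, cur): count / context_counts[prev]
        for (prev, cur), count in bigram_counts.items()
    }

tokens = "all of a sudden i notice three guys standing on the sidewalk".split()
probs = train_bigram_mle(tokens)
print(probs[("of", "a")])  # 1.0: every occurrence of "of" here is followed by "a"
```

Multiplying these per-bigram probabilities (in practice, summing log probabilities to avoid underflow) gives the sentence probability from the product above.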