
N-Grams

Definition

  • N-Grams are probabilistic models that predict the next word based on the last $N-1$ words.
  • They are also called Language Models
  • By chaining the conditional probabilities of successive words, we can also find the joint probability of the entire sentence
  • Conditional probability helps us predict the next word
  • Joint probability tells us how likely a given sentence is to occur, based on the training corpus

all of a sudden I notice three guys standing on the sidewalk

This is a coherent English sentence that follows the rules of grammar, so it has a non-zero probability of occurring.

on guys all I of notice sidewalk three a sudden standing the

This has almost no chance of occurring.

  • N-gram probabilities are used in applications such as spell checking, where improbable word sequences signal likely errors.

Naive approach

  • Our goal is to calculate the probability $P(word | history)$. We need to find the probability of $word$ occurring given that $history$ already occurred
  • A simple formula is $\frac{C(history\ word)}{C(history)}$
  • We count up all the times that $history$ has occurred, and count the number of times it was followed by $word$
  • The main issue is that language is very creative: new sentences are invented all the time, so many perfectly valid sequences never appear even in datasets containing millions of words
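
A minimal sketch of this naive counting, assuming a pre-tokenized corpus; the corpus, history, and function name here are illustrative:

```python
def naive_next_word_prob(corpus_tokens, history, word):
    """Estimate P(word | history) by raw counting: C(history word) / C(history)."""
    n = len(history)
    history_count = 0
    history_word_count = 0
    # Slide over the corpus looking for exact matches of the full history.
    for i in range(len(corpus_tokens) - n):
        if corpus_tokens[i:i + n] == history:
            history_count += 1
            if corpus_tokens[i + n] == word:
                history_word_count += 1
    if history_count == 0:
        return 0.0  # history never seen at all: the sparsity problem in action
    return history_word_count / history_count

# Toy corpus; even corpora of millions of words leave most histories unseen.
tokens = "all of a sudden i notice three guys standing on the sidewalk".split()
print(naive_next_word_prob(tokens, ["of", "a"], "sudden"))  # 1.0
```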

Better approach

  • We can model this as a joint probability
  • $P(w_1^n) = P(w_1)P(w_2|w_1)P(w_3|w_1^2)\ldots P(w_n|w_1^{n-1})$ where $w_1^n$ is the sequence $(w_1, w_2, w_3, \ldots, w_n)$
  • We have broken down the joint probability into a product of conditional probabilities
  • But we still don’t know how to compute these conditional probabilities. Our original problem was estimating $P(w_n | w_1^{n-1})$. Rewriting it in terms of $P(w_{n-1} | w_1^{n-2})$ doesn’t help much, as we are still plagued by the original issue: we can’t use naive counting to estimate a probability conditioned on a long history
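
The decomposition itself is easy to express in code; a small sketch where cond_prob is a hypothetical placeholder for exactly the quantity we don’t yet know how to estimate:

```python
def chain_rule_joint_prob(sentence_tokens, cond_prob):
    """P(w_1^n) = product over k of P(w_k | w_1^{k-1}), by the chain rule.

    cond_prob(word, history) is a hypothetical placeholder; estimating it
    for long histories is the unsolved part that motivates the bigram model.
    """
    prob = 1.0
    for k, word in enumerate(sentence_tokens):
        prob *= cond_prob(word, sentence_tokens[:k])  # history = all earlier words
    return prob
```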

Bigram model

  • Instead of conditioning on the entire history, the bigram model uses only the preceding word (a Markov assumption)
  • Using this we can approximate the probability that we want $P(w_n | w_1^{n-1}) \approx P(w_n| w_{n-1})$
  • So the entire equation becomes $P(w_1^n) \approx \prod_{k=1}^{n} P(w_k | w_{k-1})$
  • How do we compute this $P(w_n | w_{n-1})$ term? We use something known as Maximum Likelihood Estimation
  • That is given by $P(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n)}{C(w_{n-1})}$. This can be thought of as “of all the times that $w_{n-1}$ occurred, how many times was it followed by $w_n$”.
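
A minimal sketch of this MLE computation, assuming a pre-tokenized corpus; sentence-boundary markers and smoothing are omitted, and all names are illustrative:

```python
from collections import Counter

def train_bigram_mle(corpus_tokens):
    """Return P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}) for every seen bigram."""
    context_counts = Counter(corpus_tokens[:-1])                    # C(w_{n-1})
    bigram_counts = Counter(zip(corpus_tokens, corpus_tokens[1:]))  # C(w_{n-1} w_n)
    return {
        (prev, cur): count / context_counts[prev]
        for (prev, cur), count in bigram_counts.items()
    }

tokens = "all of a sudden i notice three guys standing on the sidewalk".split()
probs = train_bigram_mle(tokens)
print(probs[("of", "a")])  # 1.0: every occurrence of "of" here is followed by "a"
```

Multiplying these per-bigram probabilities (in practice, summing log probabilities to avoid underflow) gives the sentence probability from the product above.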