
Language Models

Refer to Chapter 4. N-Grams for more details

[!question]- Why is it appropriate to model language as a probabilistic phenomenon? Human cognition is probabilistic, and language is an integral part of human cognition, so it is only appropriate to model it probabilistically too. Our world is filled with uncertainty and incomplete information, and language is the same: the meaning of a word depends on its circumstances, i.e. there is inherent uncertainty

[!question]- What are language models? They are a probability distribution over strings from an alphabet; they try to estimate the probability of a word given some context

[!question]- What are the two main probabilities given by a language model?

  • Probability of a word occurring given some history
  • Probability of a given sentence occurring

[!question]- What are some of the uses of language models?

  • Context sensitive spelling correction
  • Natural language generation
  • Next word prediction in search engines

[!question]- Mathematically describe the two probabilities calculated by a language model $P(W) = P(w_1 w_2 \ldots w_n)$ and $P(w_{n+1} | w_1 w_2 \ldots w_n)$

[!question]- Explain the connection between the joint and conditional probabilities in LM By the chain rule, the joint probability of the sentence is the product of the conditional probabilities of each word given the words before it
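
For example, for a three-word sentence the chain rule gives $P(w_1 w_2 w_3) = P(w_1)\, P(w_2 | w_1)\, P(w_3 | w_1 w_2)$, and an n-gram model approximates each conditional factor.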

[!question]- Explain why we can't just calculate those probabilities Language is too expressive; we can never really get reliable counts for every possible sequence of words

[!question]- What is the Markov assumption? $P(w_{n+1} | w_1^n) \approx P(w_{n+1} | w_n)$ We assume that only the immediate predecessor matters. This makes it much simpler to calculate the conditional probabilities

[!question]- Give the formula for conditional probability under Markov’s assumption $P(w_{n+1} | w_n)= \frac{C(w_n w_{n+1})}{C(w_n)}$
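
A minimal sketch of how these MLE bigram probabilities could be computed from raw counts (the toy corpus and the function name are just for illustration):

```python
from collections import Counter

def bigram_mle(tokens):
    """Estimate P(w_{n+1} | w_n) = C(w_n w_{n+1}) / C(w_n) from a token list."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {
        (w1, w2): count / unigram_counts[w1]
        for (w1, w2), count in bigram_counts.items()
    }

tokens = "the cat sat on the mat".split()
probs = bigram_mle(tokens)
print(probs[("the", "cat")])  # C(the cat) / C(the) = 1 / 2 = 0.5
```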

[!question]- How would you treat unknown words? Create a new token called <UNK>. Any word that is not part of the vocabulary is mapped to <UNK>, which is then treated like any other word
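
A minimal sketch of that preprocessing step, assuming a fixed vocabulary set (the names `vocab` and `replace_oov` are illustrative):

```python
UNK = "<UNK>"

def replace_oov(tokens, vocab):
    """Map any token outside the known vocabulary to the <UNK> token."""
    return [tok if tok in vocab else UNK for tok in tokens]

vocab = {"the", "cat", "sat"}
print(replace_oov("the dog sat".split(), vocab))  # ['the', '<UNK>', 'sat']
```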

[!question]- What are the two ways to evaluate a model? Intrinsic and extrinsic evaluation

[!question]- Define perplexity A function of the probabilities assigned by the model: $PP(w_1 w_2 \ldots w_N) = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$ where $N$ is the number of words in the sequence

[!question]- Give the perplexity formula for a bigram model $PP(w_1 w_2 \ldots w_N) = \sqrt[N]{\frac{1}{\prod_{i=1}^{N} P(w_i | w_{i-1})}}$
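
A rough sketch of computing bigram perplexity in log space to avoid numerical underflow (it assumes a dict of bigram probabilities like the one sketched earlier, and that every bigram in the test sequence has a non-zero probability, e.g. after smoothing):

```python
import math

def bigram_perplexity(tokens, bigram_probs):
    """PP = (product of 1 / P(w_i | w_{i-1}))^(1/N), computed via log probabilities."""
    log_prob = 0.0
    n = 0
    for w1, w2 in zip(tokens, tokens[1:]):
        log_prob += math.log(bigram_probs[(w1, w2)])
        n += 1
    return math.exp(-log_prob / n)
```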

[!question]- How should the perplexity be for a good model - high or low? low

[!question]- What is a branching factor? The number of possible next words that can follow a given word or context

[!question]- Explain the relationship between branching factor and perplexity Perplexity can be interpreted as the (weighted average) branching factor of the language. A high perplexity means that many words can plausibly come after the context, so the model is confused about which word to choose

[!question]- What is the need for smoothing? Sometimes we encounter n-grams in the test set that never appeared in the training data. Their MLE probability is zero, which zeroes out the whole product and breaks the perplexity calculation, so zero-frequency n-grams are a problem

[!question]- What is the Laplace smoothing formula for bigram models? $P(w_{n+1} | w_n) = \frac{C(w_n w_{n+1}) + 1}{C(w_n) + V}$

[!question]- How to get the reconstituted counts? $c^*(w_n w_{n+1}) = \frac{C(w_n w_{n+1}) + 1}{C(w_n) + V} C(w_n)$
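
A minimal sketch of add-one (Laplace) smoothing for bigrams, following the two formulas above (the count dictionaries are assumed to come from training counts, as in the earlier sketch):

```python
def laplace_bigram_prob(w1, w2, unigram_counts, bigram_counts, vocab_size):
    """P(w2 | w1) = (C(w1 w2) + 1) / (C(w1) + V)."""
    return (bigram_counts.get((w1, w2), 0) + 1) / (unigram_counts.get(w1, 0) + vocab_size)

def reconstituted_count(w1, w2, unigram_counts, bigram_counts, vocab_size):
    """c* = P_Laplace(w2 | w1) * C(w1)."""
    p = laplace_bigram_prob(w1, w2, unigram_counts, bigram_counts, vocab_size)
    return p * unigram_counts.get(w1, 0)
```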

[!question]- Is Laplace smoothing used in N-Gram? No

[!question]- Where is it appropriate to use Laplace smoothing In tasks like text classification, where the number of zero counts is small

[!question]- What is backoff? Using a lower order N-gram when needed

[!question]- What is interpolation? Mixing estimates from n-grams of different orders

[!question]- What is the formula for simple interpolation $\hat{P}(w_n | w_{n-1}w_{n-2}) = \lambda_1 P(w_n | w_{n-1}w_{n-2}) + \lambda_2 P(w_n | w_{n-1}) + \lambda_3 P(w_n)$ Here we are combining trigram, bigram and unigram probabilities in a certain ratio
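
A minimal sketch of this mixture, assuming the three component estimators are already available as functions (all the names here are illustrative):

```python
def interpolated_prob(word, prev2, prev1, p_trigram, p_bigram, p_unigram,
                      lambdas=(0.6, 0.3, 0.1)):
    """P_hat(word | prev2 prev1) as a weighted mix of trigram, bigram and
    unigram estimates; the lambdas must sum to 1 to keep it a probability."""
    l1, l2, l3 = lambdas
    return (l1 * p_trigram(word, prev2, prev1)
            + l2 * p_bigram(word, prev1)
            + l3 * p_unigram(word))
```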

[!question]- What should be the sum of the $\lambda$s? They should sum to 1, so that the interpolated value remains a probability

[!question]- What is the formula for interpolation with context-conditioned lambdas? Same as simple interpolation, but the lambdas are parameterized by the context, i.e. there are different lambda values for different contexts: $\lambda(w_{n-2}^{n-1})$

[!question]- How do you set the $\lambda$s? Using a held-out set: train the n-grams on a training set, then pick the values of $\lambda$ that make the held-out set most likely (essentially MLE on the held-out data)
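
A rough sketch of one way to pick the $\lambda$s: a simple grid search over weights that sum to 1, keeping the triple with the best held-out log-likelihood. The component probability functions, the held-out trigram list, and the probability floor (used only to avoid log of zero) are all assumptions for illustration:

```python
import itertools
import math

def pick_lambdas(heldout_trigrams, p_trigram, p_bigram, p_unigram, step=0.1):
    """Return the (l1, l2, l3) with the highest held-out log-likelihood."""
    best, best_ll = None, float("-inf")
    grid = [round(i * step, 10) for i in range(int(1 / step) + 1)]
    for l1, l2 in itertools.product(grid, repeat=2):
        l3 = round(1.0 - l1 - l2, 10)
        if l3 < 0:
            continue
        ll = 0.0
        for prev2, prev1, word in heldout_trigrams:
            p = (l1 * p_trigram(word, prev2, prev1)
                 + l2 * p_bigram(word, prev1)
                 + l3 * p_unigram(word))
            ll += math.log(max(p, 1e-12))  # floor avoids log(0)
        if ll > best_ll:
            best, best_ll = (l1, l2, l3), ll
    return best
```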

[!question]- How do you deal with the massive scale of web language models?

  • Threshold on the count of n-grams (see the sketch after this list)
    • If the count is lower than some value $k$, drop that n-gram
  • Entropy-based pruning
  • Efficient storage using tries
  • Bloom filters
  • Store words as indices instead of strings
  • Quantize probabilities into a few bits instead of 4-byte floats
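
A minimal sketch of two of these tricks, count thresholding and storing words as integer ids (the threshold $k$ and the function names are illustrative):

```python
def prune_by_count(ngram_counts, k=2):
    """Drop n-grams whose count is below the threshold k."""
    return {ngram: c for ngram, c in ngram_counts.items() if c >= k}

def build_word_ids(vocab):
    """Map each word to a small integer id so n-grams can be stored as
    tuples of ints instead of tuples of strings."""
    return {word: idx for idx, word in enumerate(sorted(vocab))}
```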

[!question]- What is stupid backoff? A smoothing technique used for large web-scale n-grams. It is a form of backoff, i.e. using a lower-order n-gram in some situations. It is "stupid" because the rule for when to back off is very simple: if the n-gram has a non-zero count, use its relative frequency, otherwise use $0.4 \cdot S(w_n | w_{n-k+2}^{n-1})$. Notice the $+2$ there (it is usually $+1$): we are going down one order of n-gram. It is denoted by $S$

[!question]- Does stupid backoff produce probabilities? No. Since it is "stupid", it does not redistribute the probability mass appropriately, so it produces a score rather than a true probability. This is still good enough for large-scale n-gram models
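
A rough sketch of stupid backoff as a recursive score, assuming a single dict holding counts for all n-gram orders, keyed by token tuples (the data structures and names are illustrative; the $0.4$ discount follows the formula above):

```python
ALPHA = 0.4  # fixed backoff weight from the formula above

def stupid_backoff_score(word, context, counts, total_words):
    """S(word | context): relative frequency if the n-gram was seen,
    otherwise ALPHA times the score with the context shortened by one word.
    Returns a score, not a probability -- no mass is redistributed."""
    if not context:
        # base case: unigram relative frequency
        return counts.get((word,), 0) / total_words
    ngram = tuple(context) + (word,)
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[tuple(context)]
    # back off: drop the earliest context word and apply the 0.4 discount
    return ALPHA * stupid_backoff_score(word, context[1:], counts, total_words)
```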

[!question]- Give some advanced modeling techniques Discriminative models, parsing-based models, and caching models