Expert NLP Talk
Word Embeddings to Transformers
- The main problem is sequence-to-sequence learning
- How to learn and represent sequences
History
- Markov models
- Only need the last few elements to predict the next element in a sequence
- Don’t need the whole history
- Shannon’s theory
- Alan Turing
- Georgetown experiment
- John McCarthy coins the term AI
- CNN
- RNN
- Transformers
Language Model
- Probabilistic model to predict the next word given some history
- Also gives the probability that a sequence occurs
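A minimal sketch of what "probability of a sequence" means, via the chain rule. The `next_word_prob` function and the toy word probabilities are illustrative stand-ins, not anything from the talk.

```python
# Chain rule: P(w1, ..., wn) = product over i of P(w_i | w_1, ..., w_{i-1})
def sequence_prob(sentence, next_word_prob):
    words = sentence.split()
    prob = 1.0
    for i, word in enumerate(words):
        history = words[:i]
        prob *= next_word_prob(word, history)  # P(w_i | history)
    return prob

# Toy word probabilities so the sketch runs end to end.
toy_probs = {"the": 0.4, "cat": 0.3, "sat": 0.3}
prob = sequence_prob("the cat sat", lambda w, h: toy_probs.get(w, 1e-6))
print(prob)  # 0.4 * 0.3 * 0.3 = 0.036
```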
Neural LM
- CNNs and MLPs are not suitable for learning sequences
- Sequence dependencies can’t be captured
- RNN used here
- Exploding and vanishing gradient problem
- LSTM and GRU introduced as solution
- Selective reading of history
- Can be thought of as gates or filters
- LSTM - 3 gates
- GRU - 2 gates
- Selective read, write and forget
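A minimal PyTorch-style sketch of an LSTM language model. The vocabulary size, embedding size, and hidden size are illustrative assumptions, not values from the talk.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Predicts the next token given the tokens seen so far."""
    def __init__(self, vocab_size=10_000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # The LSTM's input/forget/output gates perform the "selective
        # read, write and forget" over the hidden state.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)          # (batch, seq_len, embed_dim)
        hidden_states, _ = self.lstm(x)    # (batch, seq_len, hidden_dim)
        return self.out(hidden_states)     # next-token logits at each position

# Usage with a toy batch of token ids.
model = LSTMLanguageModel()
tokens = torch.randint(0, 10_000, (2, 12))   # batch of 2, length 12
logits = model(tokens)                       # (2, 12, 10_000)
```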
Encoder-Decoder Model
- Input -> intermediate representation (IR) -> Output
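A sketch of the Input -> IR -> Output idea, assuming GRU-based encoder and decoder; the vocabulary and hidden sizes are hypothetical.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder-decoder sketch: input -> fixed-size IR -> output."""
    def __init__(self, src_vocab=8000, tgt_vocab=8000, dim=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoder compresses the whole input into its final hidden state (the IR).
        _, ir = self.encoder(self.src_embed(src_ids))
        # Decoder starts from the IR and generates the output sequence.
        dec_states, _ = self.decoder(self.tgt_embed(tgt_ids), ir)
        return self.out(dec_states)  # logits for each target position

# Usage with toy source/target batches.
model = Seq2Seq()
logits = model(torch.randint(0, 8000, (2, 9)), torch.randint(0, 8000, (2, 7)))
```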
Attention
- Captures context
- Uses a context vector to focus more on the relevant parts of the input
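A minimal sketch of the context vector: attention scores each encoder state against the current decoder state, and the context vector is their weighted sum. Dot-product scoring is assumed here; the talk does not specify the scoring function.

```python
import torch
import torch.nn.functional as F

def context_vector(decoder_state, encoder_states):
    """Weighted sum of encoder states, weighted by relevance to the decoder state.

    decoder_state:  (hidden_dim,)
    encoder_states: (src_len, hidden_dim)
    """
    scores = encoder_states @ decoder_state   # (src_len,) relevance scores
    weights = F.softmax(scores, dim=0)        # attention distribution over the input
    return weights @ encoder_states           # (hidden_dim,) context vector

# Usage with toy tensors.
ctx = context_vector(torch.randn(256), torch.randn(7, 256))
```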
Transformers
- Self-attention instead of global attention
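A minimal sketch of scaled dot-product self-attention, where queries, keys, and values all come from the same sequence. The projection matrices and sizes (d_model=512 as in the notes, d_k=64) are illustrative assumptions.

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Each position attends to every position of the same sequence.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / math.sqrt(k.shape[-1])   # (seq_len, seq_len) similarities
    weights = F.softmax(scores, dim=-1)         # each row sums to 1
    return weights @ v                          # (seq_len, d_k) attended values

# Usage with toy shapes.
x = torch.randn(10, 512)
out = self_attention(x, torch.randn(512, 64), torch.randn(512, 64), torch.randn(512, 64))
```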
Encoder Stack
- Convert word to embedding
- Add positional encoding
- Use sin() and cos() to do this (see the sketch below)
- Each word is a vector of length 512
- Use
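A minimal sketch of sinusoidal positional encoding for 512-dimensional embeddings. The exact formula is assumed from the original Transformer paper rather than spelled out in the notes.

```python
import torch

def positional_encoding(max_len, d_model=512):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    positions = torch.arange(max_len).unsqueeze(1).float()                # (max_len, 1)
    div_terms = 10000 ** (torch.arange(0, d_model, 2).float() / d_model)  # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(positions / div_terms)  # even dimensions use sin()
    pe[:, 1::2] = torch.cos(positions / div_terms)  # odd dimensions use cos()
    return pe

# Added to the 512-dimensional word embeddings so the model sees token order.
pe = positional_encoding(max_len=50)   # shape (50, 512)
```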