Natural Language Modeling
- Neural networks operate on vectors.
- How to represent language as a vector or a collection of vectors?
- Naive solution: map each letter of the alphabet to a vector; the embedding vectors themselves can be learned by the model.
  - Pro: works for any language that shares that alphabet
  - Pro: small alphabet, so the embedding table stays tiny
  - Con: produces long sequences of embedding vectors (tokens)
- Other extreme: encode each word as its own token (short sequences, but a huge vocabulary and no way to represent unseen words)
- Compromise: merge frequently occurring combinations of tokens into a single token, analogous to compression algorithms such as byte-pair encoding (see the sketch below)
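
A minimal sketch of this compromise in the style of byte-pair encoding (BPE): start from characters and repeatedly merge the most frequent adjacent pair. The toy corpus and the number of merges are illustrative, not from the talk.

```python
from collections import Counter

def learn_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn merge rules: repeatedly fuse the most frequent adjacent pair."""
    words = [list(w) for w in corpus]  # start from characters (the "naive" end)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        words = [apply_merge(w, best) for w in words]
    return merges

def apply_merge(toks: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every occurrence of the adjacent pair with one fused token."""
    out, i = [], 0
    while i < len(toks):
        if i + 1 < len(toks) and (toks[i], toks[i + 1]) == pair:
            out.append(toks[i] + toks[i + 1])
            i += 2
        else:
            out.append(toks[i])
            i += 1
    return out

def tokenize(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Segment a new word by applying the learned merges in order."""
    toks = list(word)
    for pair in merges:
        toks = apply_merge(toks, pair)
    return toks

corpus = ["low", "lower", "lowest", "newest", "widest"]
merges = learn_bpe(corpus, num_merges=5)
print(tokenize("lowest", merges))  # ['low', 'est']
```

Frequent fragments like "low" and "est" end up as single tokens, while rare words still decompose into characters, which is exactly the trade-off between the two extremes above.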
Side Effects
- LLMs trained on tokens that are frankensteined together arbitrarily are not great at counting words (or letters), since token boundaries need not align with those units.
- Not mentioned: sub-word regularization, i.e. dealing with the fact that multiple encodings of the same string are possible (e.g. by sampling among the segmentations during training; see the sketch below).
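
For context: one common form of sub-word regularization is BPE-dropout, which randomly skips merges so the model sees many segmentations of the same string during training. A minimal sketch, reusing `merges`, `apply_merge`, and the imports from the BPE example above; `p_drop` is an illustrative hyperparameter.

```python
import random

def tokenize_with_dropout(word: str, merges: list[tuple[str, str]],
                          p_drop: float = 0.1) -> list[str]:
    """Like tokenize(), but each applicable merge is skipped with
    probability p_drop, so repeated calls can segment the same word
    differently (BPE-dropout-style sub-word regularization)."""
    toks = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(toks):
            if (i + 1 < len(toks) and toks[i] == a and toks[i + 1] == b
                    and random.random() >= p_drop):
                out.append(a + b)  # merge applied
                i += 2
            else:
                out.append(toks[i])  # merge dropped or no match
                i += 1
        toks = out
    return toks

random.seed(0)
for _ in range(3):
    print(tokenize_with_dropout("lowest", merges, p_drop=0.3))
# Possible outputs: ['low', 'est'], ['lo', 'w', 'est'], ['low', 'e', 's', 't'], ...
```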
Biology