Language Models
TL;DR
A language model encodes statistical information about one or more languages. It predicts the next token given a context. Self-supervision let language models scale from toy experiments in the 1950s to the LLMs that power ChatGPT today.
The historical problem
The statistical nature of language has been exploited for centuries. In Arthur Conan Doyle's 1903 story "The Adventure of the Dancing Men", Sherlock Holmes uses letter frequency to decode stick-figure messages. In 1951, Claude Shannon published "Prediction and Entropy of Printed English", laying out the formal math (entropy, probability distributions over symbols) that still underlies modern language models.
For 60 years the ceiling was data, not ideas. Training any serious model required labeled data, which humans had to produce. Even ImageNet (1.2M labeled images, ~$50K in labeling costs) was tiny compared to what was needed to learn open-ended tasks like translation or summarization.
The unlock was self-supervision.
How it works
The base unit: the token
A token is the smallest unit a model processes. It can be a character, a word, or a part of a word (cook + ing). GPT-4 breaks "I can't wait to build AI applications" into 9 tokens.
For English, an average token is about 3/4 of a word. 100 tokens is roughly 75 words.
The set of all tokens a model can emit is its vocabulary. Mixtral 8x7B has a vocabulary of 32,000 tokens; GPT-4's has 100,256.
Why tokens and not characters or words?
- Characters are too fine: meaning is lost and sequences get long
- Words are too coarse: the vocab explodes and typos break everything
- Subwords are the sweet spot: compact vocab + meaning preserved (cook, -ing)
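A minimal sketch of subword splitting, using a hand-written vocabulary and greedy longest-match. This is illustrative only: real tokenizers (BPE, WordPiece) learn their vocabulary from data, and the `VOCAB` set here is made up.

```python
# Toy greedy longest-match subword tokenizer. The vocabulary is hand-written
# for illustration; real tokenizers (BPE, WordPiece) learn their vocab from data.
VOCAB = {"cook", "ing", "build", "er", "s"}

def tokenize(word: str) -> list[str]:
    """Split a word into the longest vocabulary pieces, left to right."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown: fall back to a single character
            i += 1
    return tokens

print(tokenize("cooking"))   # ['cook', 'ing']
print(tokenize("builders"))  # ['build', 'er', 's']
```

Note how "cooking" splits into two meaningful pieces instead of seven characters or one opaque word: that is the compactness/meaning trade-off subwords buy.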
Autoregressive language modeling
A language model is a completion machine. Given a prompt, it predicts one token at a time, feeding its own output back in. Prompt "To be or not to be" completes to ", that is the question."
Many tasks can be framed as completion: translation ("How are you" in French is ... -> Comment ça va), classification (Is this email spam? Answer: -> Likely spam), and summarization.
The outputs are open-ended (finite vocab, infinite possible sequences) and probabilistic (predictions, not guaranteed facts).
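The completion loop can be sketched with a toy bigram model. The probability table below is invented for illustration, and a real LM conditions on the entire context rather than just the last token, but the feed-the-output-back-in loop is the same.

```python
import random

# Toy bigram "language model": P(next token | current token) as a lookup
# table. All probabilities are invented for illustration; a real LM
# conditions on the entire context, not just the last token.
BIGRAMS = {
    "<BOS>": {"to": 1.0},
    "to": {"be": 0.9, "go": 0.1},
    "go": {"<EOS>": 1.0},
    "be": {",": 0.6, "<EOS>": 0.4},
    ",": {"that": 1.0},
    "that": {"is": 1.0},
    "is": {"the": 1.0},
    "the": {"question": 1.0},
    "question": {"<EOS>": 1.0},
}

def generate(prompt: list[str], max_tokens: int = 10, seed: int = 0) -> list[str]:
    """Autoregressive loop: sample a token, append it, feed it back in."""
    rng = random.Random(seed)
    out = list(prompt)
    for _ in range(max_tokens):
        dist = BIGRAMS[out[-1]]
        token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if token == "<EOS>":
            break
        out.append(token)
    return out

print(" ".join(generate(["<BOS>", "to"])))
```

The sampling step is where the probabilistic nature comes from: the same prompt can take different paths through the table on different seeds.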
Self-supervision: the unlock
Supervised learning needs labeled examples. You show the model a cat picture labeled "cat" and it learns to map pixels to labels.
Self-supervised learning infers labels from the data itself. From the sentence "I love street food", the model derives 6 training pairs:
| Context | Target |
|---|---|
| <BOS> | I |
| <BOS> I | love |
| <BOS> I love | street |
| <BOS> I love street | food |
| <BOS> I love street food | . |
| <BOS> I love street food . | <EOS> |
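Constructing these pairs is a few lines of code. This sketch assumes the text has already been tokenized into words:

```python
def make_training_pairs(tokens: list[str]) -> list[tuple[list[str], str]]:
    """Build (context, target) pairs for next-token prediction."""
    seq = ["<BOS>"] + tokens + ["<EOS>"]
    # every prefix of the sequence predicts the token that follows it
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]

for context, target in make_training_pairs(["I", "love", "street", "food", "."]):
    print(" ".join(context), "->", target)
```

This is the whole "labeling" pipeline: the targets come for free from the text's own structure, which is why any text on the internet is usable training data.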
No human labels needed. Any text on the internet becomes training data. This let models scale from millions of parameters (GPT-1, 2018, 117M params) to a trillion or more (GPT-4, 2023, rumored 1T+ params).
Self-supervision is different from unsupervised learning. Unsupervised uses no labels at all (clustering). Self-supervised builds labels from the structure of the data.
Why bigger models need more data
A model's size is measured in parameters (trainable weights). Bigger models have more capacity to learn, but trained on too little data they overfit and the extra compute is wasted. The rule of thumb (Chinchilla scaling laws, 2022): compute-optimal training uses roughly 20 tokens per parameter.
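The rule of thumb as arithmetic. The 20-tokens-per-parameter constant is the Chinchilla estimate, not a law, and the Llama 3 figures below come from the section on over-training later in these notes:

```python
def chinchilla_optimal_tokens(params: int, tokens_per_param: int = 20) -> int:
    """Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return params * tokens_per_param

# A 70B-parameter model is compute-optimal at ~1.4T tokens.
# Llama 3 70B was actually trained on ~15T tokens (~214 tokens/param),
# trading extra training compute for better inference economics.
print(f"{chinchilla_optimal_tokens(70_000_000_000):.2e}")  # 1.40e+12
```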
Relevance today (2026)
Huyen's history is accurate but her 2024 framing is already stale on several points:
- "Large" is a moving target. GPT-1 with 117M params was large in 2018. In 2026, that's a mobile model. Frontier models sit at 1T-10T params. Open-source models (Llama 4, DeepSeek-V3, Mistral Large 2) hit 300B-700B.
- Tokens per parameter is being rethought. Post-Chinchilla research (2024-2026) shows you can over-train smaller models with way more tokens (Llama 3 was trained on 15T tokens for only 70B params) and get better inference economics.
- Multimodal is now the default, not an extension. Huyen treats LMMs as a later step. In 2026, GPT-5, Claude Opus 4.x, and Gemini 2 Ultra are natively multimodal from the start of training.
- Self-supervision's limits are visible. Most of the internet's text has already been scraped. Synthetic data, RLHF, and reasoning traces are the new frontier for scaling (see Chapter 7 on finetuning and Chapter 10 on data).
Question to keep in mind: is self-supervision still the main bottleneck, or is it quality of data and alignment signal?
Critical questions
- Why did self-supervision win over supervised learning for language? What's special about text?
- Tokens per parameter: 20 (Chinchilla) vs 200 (Llama 3 over-training). Which is right for your use case?
- If a model is autoregressive, what fails when you want to fill a blank in the middle of a sequence? (Answer: you need masked language models like BERT or bidirectional setups.)
- Why do we never see "LM" for "little model"? Is "large" in LLM meaningful, or marketing?
- Vocab size matters: bigger vocab = fewer tokens per word but bigger embedding matrix. Where's the trade-off?
Production pitfalls
- Tokenization mismatch. If your app splits text with a different tokenizer than the model's, your token-count budget breaks. Always use the model's official tokenizer (`tiktoken` for OpenAI, `transformers.AutoTokenizer` for HF).
- Non-English pricing trap. One Chinese character can be 2-3 tokens in GPT-4. French and Spanish take ~1.5x more tokens than English. Per-token costs can double for non-English apps.
- Forgotten stop tokens. If you don't send the right `<EOS>` or `stop` parameter, models can run on until hitting `max_tokens`, wasting compute and latency.
- Probabilistic != deterministic. The same prompt can yield different outputs. If you need determinism, set `temperature=0` AND a fixed `seed`, and even then drift can occur across model versions.
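A toy illustration of the tokenization-mismatch pitfall: two different splitters give very different counts for the same text, so a budget computed with the wrong one is meaningless. Both "tokenizers" here are crude stand-ins, not real model tokenizers.

```python
# Two stand-in tokenizers: counting with the wrong one gives a wrong budget.
# Real apps should count with the model's own tokenizer (e.g. tiktoken for
# OpenAI models); these splitters are illustrative only.
def word_tokens(text: str) -> int:
    return len(text.split())   # crude word-level count

def char_tokens(text: str) -> int:
    return len(text)           # crude character-level count

text = "I can't wait to build AI applications"
print(word_tokens(text), char_tokens(text))  # very different "token" budgets
```

A real model's count (9 tokens for this sentence under GPT-4, per the tokens section above) falls between these extremes, which is exactly why you must measure with the model's own tokenizer.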
Alternatives / Comparisons
| Approach | What it is | When to prefer |
|---|---|---|
| Autoregressive LM (GPT-style) | Predict next token left to right | General-purpose generation |
| Masked LM (BERT-style) | Fill blanks in a sequence | Classification, embeddings, NER |
| Encoder-decoder (T5-style) | Seq-in, seq-out with separate encoder and decoder | Translation, summarization when input and output differ |
| Diffusion LMs (LLaDA, 2024) | Generate by iterative denoising instead of token-by-token | Research, potential for parallelism |
In 2026, autoregressive dominates. Diffusion LMs are a live research area but not yet in production.
Mini-lab
labs/language-models-from-scratch/ (to create) - train a tiny character-level LM on one of your own text files, watch it learn from pure self-supervision. Goal: feel how training pairs are constructed.
Suggested starter: Karpathy's nanoGPT or makemore repos.
Further reading
- Claude Shannon, "Prediction and Entropy of Printed English" (1951) - origin paper
- Chinchilla paper "Training Compute-Optimal Large Language Models" (Hoffmann et al., 2022) - scaling laws
- Llama 3 technical report (Meta, 2024) - modern training-to-saturation recipe
- Andrej Karpathy, "Let's build GPT from scratch" (YouTube) - hands-on explanation
- Huyen, Chapter 2 of AI Engineering - deeper dive on training data and model design