CUSO WS on Large Language Models
2025-09-03
“national”, “nationalism”, “nationalist”, “nationalize”
Source: 🤗 LLM Course
Note that subword tokenization works better for some languages than for others.
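For illustration, a minimal sketch of how a subword tokenizer splits related word forms such as the ones above (assumes the bert-base-uncased checkpoint from 🤗 Transformers; the exact splits depend on the learned vocabulary):

```python
from transformers import AutoTokenizer

# Load a pretrained WordPiece tokenizer (assumption: bert-base-uncased)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Related word forms share a stem and differ in their subword pieces
for word in ["national", "nationalism", "nationalist", "nationalize"]:
    print(word, "->", tokenizer.tokenize(word))
```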
[CLS]: Represents the entire input sequence
[SEP]: Separates different segments of text
[PAD]: Used for padding sequences to the same length
[UNK]: Represents unknown or out-of-vocabulary words
[MASK]: Used for masked language modeling tasks
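A quick way to see these special tokens in practice (a minimal sketch, again assuming bert-base-uncased; other checkpoints use different token strings, e.g. <s> and </s>):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Each special token is exposed as an attribute of the tokenizer
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token,
      tokenizer.unk_token, tokenizer.mask_token)

# Encoding a pair of segments shows where [CLS] and [SEP] are inserted
encoded = tokenizer("How are you?", "I am fine.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# -> ['[CLS]', 'how', 'are', 'you', '?', '[SEP]', 'i', 'am', 'fine', '.', '[SEP]']
```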
The following slides are based on Tunstall, Von Werra, and Wolf (2022), Ch. 2.
Source: Tunstall, Von Werra, and Wolf (2022), Ch. 2
\[x'_i = \sum_{j=1}^{n} w_{ij} x_j\]
Weighted average of all input embeddings
The dot product of the query and key vectors gives us the attention scores/weights
What does the dot product of two vectors indicate?
The attention scores are normalized using the “softmax” function to ensure they sum to 1
\[x'_i = \sum_{j=1}^{n} w_{ij} x_j\]
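A minimal sketch of this computation on toy tensors (plain PyTorch rather than any specific model implementation; in a full transformer layer the scores come from learned query/key projections of the embeddings):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d = 4, 8                  # sequence length, embedding dimension
x = torch.randn(n, d)        # toy input embeddings x_1, ..., x_n

# Dot products between embeddings give the raw attention scores
# (scaled by sqrt(d) as in scaled dot-product attention)
scores = x @ x.T / d ** 0.5  # shape (n, n)

# Softmax turns each row into weights w_ij that sum to 1
weights = F.softmax(scores, dim=-1)
print(weights.sum(dim=-1))   # each row sums to 1

# x'_i = sum_j w_ij x_j: each output is a weighted average of all embeddings
x_new = weights @ x
print(x_new.shape)           # -> torch.Size([4, 8])
```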
Tokenization, Attention & Inference
Encoder (e.g. BERT)
Decoder (e.g. GPT) - TOMORROW!
Source: The Illustrated Transformer
Source: Tunstall, Von Werra, and Wolf (2022), Ch. 2
Source: Devlin et al. (2019)
Source: 🤗 LLM Course Ch. 2
Output: probability distribution across tokens (“Virtual” (87%), “good” (8%), “helpful” (3%))
Source: Tunstall, Von Werra, and Wolf (2022), Fig. 1-7
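A minimal sketch of obtaining such a distribution with the 🤗 fill-mask pipeline (the checkpoint and sentence are placeholders, so the predicted tokens and scores will differ from the numbers above):

```python
from transformers import pipeline

# Assumption: any masked-language-model checkpoint works; bert-base-uncased is used here
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The model assigns a probability to every candidate token at the [MASK] position
for pred in unmasker("She works as a [MASK] assistant.", top_k=3):
    print(f"{pred['token_str']}: {pred['score']:.0%}")
```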
Essentially any aspect of a model that isn’t learned and needs to be chosen by the researcher.
Note: prompts are also hyperparameters!
Weights and Biases - “experiment” tracking and visualization
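A minimal sketch of tracking hyperparameters and metrics with wandb (project name and values are placeholders; a wandb account and login are required):

```python
import wandb

# Hyperparameters go into the run's config so experiments stay comparable
run = wandb.init(
    project="llm-workshop-demo",  # placeholder project name
    config={"learning_rate": 2e-5, "batch_size": 16, "epochs": 3},
)

for epoch in range(run.config.epochs):
    # ... training step would go here ...
    wandb.log({"epoch": epoch, "train_loss": 0.0})  # dummy metric

run.finish()
```

With the 🤗 Trainer, setting report_to="wandb" in TrainingArguments sends the same kind of information automatically.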
Fine-tuning a Transformer Model