Crash-course LLMs for Social Science
2025-09-12
“national”, “nationalism”, “nationalist”, “nationalize”
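Word families like this are what subword tokenization exploits: a shared stem plus short suffix pieces. Below is a minimal sketch of WordPiece-style greedy longest-match tokenization; the toy vocabulary and the `##` continuation convention mirror BERT's tokenizer, but real vocabularies are learned from a corpus, not hand-written.

```python
# Toy vocabulary (an assumption for illustration); "##" marks a
# continuation piece that attaches to the preceding subword.
VOCAB = {"nation", "##al", "##ism", "##ist", "##ize", "##s"}

def wordpiece(word: str, vocab=VOCAB) -> list[str]:
    """Split a word into subwords via greedy longest-match-first."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate  # longest match found
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no subword matches: out-of-vocabulary
        tokens.append(piece)
        start = end
    return tokens

# "nationalism" splits into the shared stem plus a suffix piece:
# wordpiece("nationalism") -> ["nation", "##al", "##ism"]
```

All four words above share the `nation` piece, so the model's embedding for the stem is reused across the whole family.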
Source: 🤗 LLM Course
Note that this works better for some languages than for others
[CLS]: Represents the entire input sequence
[SEP]: Separates different segments of text
[PAD]: Used for padding sequences to the same length
[UNK]: Represents unknown or out-of-vocabulary words
[MASK]: Used for masked language modeling tasks
The following slides are based on Tunstall, Von Werra, and Wolf (2022), Ch. 2.
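The first three special tokens come together when a batch is prepared for the model. A minimal sketch (a real tokenizer such as BERT's handles this internally, e.g. via `tokenizer(..., padding=True)`):

```python
def build_inputs(batch: list[list[str]]):
    """Frame each token list with [CLS]...[SEP], pad to a common length,
    and return (tokens, attention masks); 1 = real token, 0 = padding."""
    framed = [["[CLS]"] + toks + ["[SEP]"] for toks in batch]
    max_len = max(len(seq) for seq in framed)
    tokens, masks = [], []
    for seq in framed:
        pad = max_len - len(seq)
        tokens.append(seq + ["[PAD]"] * pad)   # pad shorter sequences
        masks.append([1] * len(seq) + [0] * pad)
    return tokens, masks

tokens, masks = build_inputs([["hello"], ["how", "are", "you"]])
```

The attention mask tells the model to ignore `[PAD]` positions, so padding changes the tensor shape but not the result.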
vs.
Source: Tunstall, Von Werra, and Wolf (2022), Ch. 2
\[x'_i = \sum_{j=1}^{n} w_{ij} x_j\]
Weighted average of all input embeddings
The dot product of the query and key vectors gives us the attention scores/weights
What does the dot product of two vectors indicate? It measures similarity: it is large when the vectors point in similar directions and near zero when they are orthogonal.
The attention scores are normalized using the “softmax” function to ensure they sum to 1
\[x'_i = \sum_{j=1}^{n} w_{ij} x_j\]
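The formula above can be sketched end to end in a few lines of NumPy. For simplicity this assumes Q = K = V = X (a real attention layer first projects X with learned matrices W_Q, W_K, W_V):

```python
import numpy as np

def self_attention(X: np.ndarray):
    """Return (weights, output): softmax-normalized dot-product scores,
    and the weighted average x'_i = sum_j w_ij x_j."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                 # query-key dot products
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    W = np.exp(scores)
    W /= W.sum(axis=-1, keepdims=True)            # softmax: each row sums to 1
    return W, W @ X                               # weights, weighted averages

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy embeddings
W, out = self_attention(X)
```

Each output row is a convex combination of the input embeddings, with rows of W summing to 1 exactly as the softmax step requires.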
Tokenization, Attention & Inference
Encoder (e.g. BERT)

Decoder (e.g. GPT) - TOMORROW!

Source: The Illustrated Transformer
Source: Tunstall, Von Werra, and Wolf (2022), Ch. 2
Source: Devlin et al. (2019)
Source: 🤗 LLM Course Ch. 2

Output: probability distribution across tokens (“Virtual” (87%), “good” (8%), “helpful” (3%))
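Those percentages are produced by applying softmax to the model's raw output scores (logits) over the vocabulary. A toy illustration; the logit values here are invented for illustration, not taken from a real model:

```python
import math

def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Convert raw scores into a probability distribution summing to 1."""
    m = max(logits.values())                      # subtract max for stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits for three candidate tokens
probs = softmax({"Virtual": 4.0, "good": 1.6, "helpful": 0.7})
```

Whatever the logit values, softmax preserves their ordering and normalizes them into probabilities, which is how a ranked list like the one above is produced.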
Tunstall, Von Werra, and Wolf (2022), Fig. 1-7
Essentially, any aspect of a model that is not learned from the data and must instead be chosen by the researcher.
Note: prompts are also hyperparameters!
Weights and Biases - “experiment” tracking and visualization
Fine-tuning a Transformer Model