Session 4: The Transformer

CUSO WS on Large Language Models

Nicolai Berk

2025-09-03

Today: Afternoon

  • Part I: Advanced NLP
    • Advanced Tokenization
    • Contextualized Embeddings
  • Part II: Intro to Transformers
    • The Transformer Architecture
    • The Huggingface Ecosystem
    • Fine-tuning a Transformer Model

Tokenization

Issues with simple word tokenization


Consider the following words:

“national”, “nationalism”, “nationalist”, “nationalize”

What are potential issues with simple word tokenization here?


Any ideas how to solve these issues?

Solution: Subword Tokenization

Source: 🤗 LLM Course

Which advantages do you see with subword tokenization?

Different Tokenizers


Source: 🤗 LLM Course

How does it work? (roughly)

  1. Start with a small vocabulary of characters and special tokens
  2. Split the words in the training data into these units (e.g. “w ##o ##r ##d”)
  3. Merge the most frequent pairs (BPE) or pairs that occur more often than expected given their constituent parts (WordPiece)
  4. Stop when you reach your desired vocabulary size

Note that this works better for some languages than for others
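A minimal sketch of what this looks like in practice, using a pretrained WordPiece tokenizer via 🤗 Transformers (the checkpoint “bert-base-uncased” is just an illustrative choice):

```python
# Inspect WordPiece subword splits for the words from the earlier slide
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["national", "nationalism", "nationalist", "nationalize"]:
    # tokenize() returns the subword pieces; "##" marks word-internal pieces
    print(word, "->", tokenizer.tokenize(word))
```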

Special Tokens


  • [CLS]: Represents the entire input sequence
  • [SEP]: Separates different segments of text
  • [PAD]: Used for padding sequences to the same length
  • [UNK]: Represents unknown or out-of-vocabulary words
  • [MASK]: Used for masked language modeling tasks
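A quick sketch of how these tokens show up in a tokenized input (checkpoint choice is illustrative):

```python
# [CLS], [SEP], and [PAD] are added automatically by the tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encode a sentence pair; padding to a fixed length adds [PAD] tokens
enc = tokenizer("Time flies", "like an arrow", padding="max_length", max_length=12)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# e.g. ['[CLS]', 'time', 'flies', '[SEP]', 'like', 'an', 'arrow', '[SEP]', '[PAD]', ...]
```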

Contextualized Embeddings

The following slides are based on Tunstall, Von Werra, and Wolf (2022), Ch. 2.

Issues with Classic Embeddings


Contextual Meaning


Sentence a: Time flies like an arrow

vs.

Sentence b: Fruit flies like a banana


How could we represent the different meanings?

Contextualized Embeddings


Contextual Meaning


Sentence a: Time flies like an arrow

vs.

Sentence b: Fruit flies like a banana


How do you infer the different meanings?

Attention Mechanism: Intuition

Source: Tunstall, Von Werra, and Wolf (2022), Ch. 2

Solution: Weighted Average

\[x'_i = \sum_{j=1}^{n} w_{ij} x_j\]

Weighted average of all input embeddings

  • \(x'_i\): Contextualized embedding of token \(i\)
  • \(x_j\): Embedding of token \(j\)
  • \(w_{ij}\): Attention weight for token \(j\) with respect to token \(i\)
  • \(n\): Number of tokens in the input sequence
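A toy illustration of this formula with made-up attention weights (all numbers are purely illustrative):

```python
import numpy as np

x = np.array([[1.0, 0.0],        # embedding of token 1
              [0.0, 1.0],        # embedding of token 2
              [1.0, 1.0]])       # embedding of token 3
w_i = np.array([0.7, 0.2, 0.1])  # attention weights w_ij for token i (sum to 1)

x_prime_i = (w_i[:, None] * x).sum(axis=0)  # x'_i = sum_j w_ij * x_j
print(x_prime_i)                            # contextualized embedding of token i
```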

Step 1: Create query, key, and value vectors


  • Query: Represents the token itself
  • Key: Represents the context of the token
  • Value: Again represents the token itself (more on this later)

Step 2: Calculate the attention scores


The dot product of the query and key vectors gives us the attention scores/weights


What does the dot product of two vectors indicate?
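One way to build intuition: the dot product grows when two vectors point in similar directions (toy vectors, purely illustrative):

```python
import numpy as np

query = np.array([1.0, 0.5])
similar_key = np.array([0.9, 0.6])
unrelated_key = np.array([-0.5, 1.0])

print(query @ similar_key)    # larger score -> more attention
print(query @ unrelated_key)  # smaller score -> less attention
```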

Finalize: Normalize & take average

Step 3: Normalize the attention scores

The attention scores are normalized using the “softmax” function to ensure they sum to 1

Step 4: Multiply the normalized scores with the value vectors and sum them up

\[x'_i = \sum_{j=1}^{n} w_{ij} x_j\]
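Putting steps 1-4 together, a minimal single-head self-attention sketch in NumPy (the projection matrices are random stand-ins for learned weights; the scaling by the square root of the dimension follows the original Transformer):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                      # 5 tokens, embedding dimension 8
X = rng.normal(size=(n, d))      # input token embeddings

W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v            # step 1: queries, keys, values

scores = Q @ K.T / np.sqrt(d)                  # step 2: (scaled) dot products
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)  # step 3: softmax over each row

X_context = weights @ V                        # step 4: weighted sum of values
print(X_context.shape)                         # (5, 8): one contextualized vector per token
```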

Attention Mechanism: Example

Attention in Practice


  • Query, key, and value vectors are learned representations
  • Attention calculated for each hidden layer
  • Multiple attention ‘heads’ are used in parallel
  • Outputs combined using another learned linear transformation
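In practice this is handled by library layers; a sketch using PyTorch’s built-in multi-head attention (sizes follow BERT-base, chosen for illustration):

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 768, 12            # BERT-base sizes
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, 10, embed_dim)         # (batch, sequence length, embedding dim)
out, attn_weights = attn(x, x, x)         # self-attention: query = key = value = x
print(out.shape, attn_weights.shape)      # torch.Size([1, 10, 768]) torch.Size([1, 10, 10])
```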

Positional Encodings


  • We can also capture the position of each token in the sequence
  • Similar approach:
    • Create a vector for each position in the sequence
    • Add these vectors to the token embeddings
    • This allows the model to understand the order of tokens
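A sketch of one common scheme, the sinusoidal encodings from the original Transformer paper (BERT instead learns its position embeddings):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model)[None, :]                   # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])             # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])             # odd dimensions
    return pe

token_embeddings = np.random.randn(10, 16)                # 10 tokens, dim 16
inputs = token_embeddings + positional_encoding(10, 16)   # add position information
```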

Breaktime

Tutorial I

Tokenization, Attention & Inference

Notebook

The Transformer

Some terminology

Encoder (e.g. BERT)

  • Converts an input sequence of tokens into a sequence of embedding vectors
  • Trained on masked language modeling
  • Tasks: Text classification, named entity recognition, extractive question answering, …

Decoder (e.g. GPT) - TOMORROW!

  • Uses a sequence of embedding vectors to iteratively generate an output sequence of tokens, one token at a time
  • Trained to predict next word
  • Tasks: Mainly text generation (e.g. Chatbot responses)
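A quick contrast between the two families with the 🤗 pipeline API (checkpoint names are illustrative choices):

```python
from transformers import pipeline

# Encoder (BERT-style): fill in a masked token
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("Paris is the [MASK] of France.")[0])

# Decoder (GPT-style): generate text one token at a time
generate = pipeline("text-generation", model="gpt2")
print(generate("The transformer architecture", max_new_tokens=20)[0]["generated_text"])
```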

Transformers Overview

Encoder Block

Source: The Illustrated Transformer

The Encoder Architecture (e.g. BERT)

Source: Tunstall, Von Werra, and Wolf (2022), Ch. 2

The Encoder Architecture (e.g. BERT)

Source: Devlin et al. (2019)

Classification Heads

Source: 🤗 LLM Course Ch. 2

Classification Heads

Many Tasks, One Model

  • Sequence Classification
  • Masked Language Modeling
  • Multiple Choice
  • Token Classification
  • Question Answering
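A sketch of how the same pretrained body is loaded with different task heads via the 🤗 Auto classes (checkpoint and label counts are illustrative):

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    AutoModelForQuestionAnswering,
)

checkpoint = "bert-base-uncased"
clf = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)  # sequence classification
ner = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=9)     # token classification
qa = AutoModelForQuestionAnswering.from_pretrained(checkpoint)                      # extractive QA
```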

Additional Considerations


  • Dropout: prevent overfitting by randomly setting a fraction of input units to 0 during training
  • Layer normalization: stabilize and accelerate training by normalizing the inputs to each layer
  • Residual connections: add input to output (bypass layers)
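A compressed sketch of how these pieces fit together in one encoder block (simplified PyTorch, not the exact BERT implementation):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, p_drop=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(p_drop)                  # dropout against overfitting

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.drop(attn_out))         # residual connection + layer norm
        x = self.norm2(x + self.drop(self.ff(x)))       # second residual + layer norm
        return x
```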

Training a Transformer

What is “Training” ?


  • Remember from the ML course: Training is the process of optimizing a model’s parameters on a specific task using labeled data.
  • In a regression framework, this is called fitting the model to the training data.

Masked Language Modeling (MLM)


  • Training objective used for models like BERT
  • Randomly masks a percentage of input tokens
  • Model learns to predict the masked tokens based on context

Output: a probability distribution over the vocabulary for each masked position (e.g. “Virtual” (87%), “good” (8%), “helpful” (3%))
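A sketch of the random masking step as done during MLM training, using the 🤗 data collator (the sentence and checkpoint are illustrative; 15% is BERT’s default masking rate):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

batch = [tokenizer("Virtual reality is a good example.")]
masked = collator(batch)

# Some tokens are randomly replaced by [MASK] (output varies across runs)
print(tokenizer.convert_ids_to_tokens(masked["input_ids"][0].tolist()))
print(masked["labels"][0])  # original ids at masked positions, -100 elsewhere
```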

Three Essential Steps


  • Forward pass: The process of passing input data through the model to obtain predictions.
  • Loss: A measure of how well the model’s predictions match the ground truth.
  • Backward pass: The process of computing gradients of the loss and updating the model’s parameters accordingly.
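A minimal sketch of the three steps in PyTorch (the model and data are toy stand-ins for a transformer and a real batch):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # stand-in for a transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(8, 10)                    # a toy batch of 8 examples
labels = torch.randint(0, 2, (8,))

logits = model(inputs)                         # forward pass: obtain predictions
loss = loss_fn(logits, labels)                 # loss: compare predictions to ground truth
loss.backward()                                # backward pass: compute gradients
optimizer.step()                               # update the parameters
optimizer.zero_grad()                          # reset gradients for the next batch
```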

Full Training vs. Fine-Tuning

  • Read this short explainer
  • Form groups of 2-3
  • Take turns explaining to each other:
    • the difference between pre-training and fine-tuning
    • how training a transformer from scratch works
    • how fine-tuning works, and how the task and the model differ in this process
  • Collect two benefits and drawbacks of each approach.

Transfer Learning/Fine-Tuning

Tunstall, Von Werra, and Wolf (2022), Fig. 1-7

Hyperparameters



Essentially, any aspect of the model or training process that is not learned from the data and must be chosen by the researcher.

Hyperparameters


Many different ones; important for us:

  • Number of Epochs: The number of complete passes through the training dataset.
  • Batch Size: The number of training examples utilized in one iteration before updating weights.
  • Learning Rate: How strongly the model updates its weights in response to the loss.

More Hyperparameters


  • Weight Decay: Prevents overfitting by penalizing large weights.
  • Dropout: Ignores randomly selected neurons during training (prevents overfitting).
  • Seed: A value used to initialize the random number generator for reproducibility.

Note: prompts are also hyperparameters!
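Where these hyperparameters are typically set when fine-tuning with the 🤗 Trainer (all values are illustrative, not recommendations):

```python
from transformers import TrainingArguments, set_seed

set_seed(42)  # seed the random number generators for reproducibility

args = TrainingArguments(
    output_dir="finetuned-model",
    num_train_epochs=3,              # number of epochs
    per_device_train_batch_size=16,  # batch size
    learning_rate=2e-5,              # learning rate
    weight_decay=0.01,               # weight decay
    seed=42,                         # seed used by the Trainer itself
)
# Dropout is set in the model's config (e.g. hidden_dropout_prob for BERT), not here.
```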

Understanding Training Dynamics


Tracking Training Progress


Weights and Biases

Weights and Biases: experiment tracking and visualization
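A minimal sketch of tracking a run with Weights and Biases (project name and logged values are illustrative):

```python
import wandb

wandb.init(project="cuso-llm-workshop", config={"learning_rate": 2e-5, "epochs": 3})
for step in range(100):
    loss = 1.0 / (step + 1)            # stand-in for the real training loss
    wandb.log({"train/loss": loss})    # shows up as a live curve in the dashboard
wandb.finish()
```

With the 🤗 Trainer, setting report_to="wandb" in TrainingArguments logs training metrics automatically.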

The Huggingface Ecosystem


The Model Hub

Datasets

Tutorial II

Fine-tuning a Transformer Model

Notebook

Resources

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–86.
Tunstall, Lewis, Leandro Von Werra, and Thomas Wolf. 2022. Natural Language Processing with Transformers. O’Reilly Media.