Lecture 2: Embeddings

CUSO WS on Large Language Models

Nicolai Berk

2025-09-03

Introduction to Embeddings

Recap: Bag-of-Words


Dance like nobody’s watching

dance  move  steal  like  nobody's  watching
  1     0      0     1       1         1

Recap: Bag-of-Words


Move like nobody’s watching

dance  move  steal  like  nobody's  watching
  1     0      0     1       1         1
  0     1      0     1       1         1

Recap: Bag-of-Words


Steal like nobody’s watching

dance  move  steal  like  nobody's  watching
  1     0      0     1       1         1
  0     1      0     1       1         1
  0     0      1     1       1         1

Recap: Bag-of-Words


Nobody’s watching: steal!

dance  move  steal  like  nobody's  watching
  1     0      0     1       1         1
  0     1      0     1       1         1
  0     0      1     1       1         1
  0     0      1     0       1         1
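
A minimal sketch of building such a document-term matrix in Python, using scikit-learn's CountVectorizer (the library choice is an assumption, not part of the slides; the default tokenizer treats "nobody's" slightly differently):

```python
# A minimal sketch: bag-of-words matrix for the example sentences with scikit-learn.
# Note: the default tokenizer splits "nobody's" differently from the slides.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Dance like nobody's watching",
    "Move like nobody's watching",
    "Steal like nobody's watching",
    "Nobody's watching: steal!",
]

vectorizer = CountVectorizer()              # default tokenization and lowercasing
X = vectorizer.fit_transform(docs)          # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # vocabulary (columns)
print(X.toarray())                          # counts per document (rows)
```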


What do we miss with this representation?

Limitations of BoW & TF-IDF


  • Ignores semantic meaning of words/tokens
  • Ignores order of words/tokens
  • High dimensionality & sparsity

Distributional Hypothesis


Challenge: How can we measure meaning?


Distributional Hypothesis

A word’s meaning can be inferred from the context it appears in.


Firth (1957): “You shall know a word by the company it keeps”

Example: what is “ziti”?

Source: Your Dictionary

Source: Wikipedia

How can we conceptualize this?

Semantic Space

How do we put this into numbers?

Two (and a half) approaches:

  1. Train a model to predict (word2vec)
    1. the context given a word (skip-gram)
    2. a word given its context (CBOW)
  2. Co-occurrence matrix (GloVe)
    • assess how often words appear together
    • minimize the difference in representations of words that appear in similar contexts

Identical result: word embeddings

word2vec (skip-gram)

Mikolov et al. (2013)

1. Identify word context

2. Build training data

3. Train a model

Result

Weight matrix = word embeddings!
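
A minimal sketch of training such a model with gensim (toy corpus and hyperparameters are illustrative assumptions):

```python
# A minimal sketch: skip-gram word embeddings with gensim
# (toy corpus and hyperparameters are illustrative choices).
from gensim.models import Word2Vec

sentences = [
    ["dance", "like", "nobody's", "watching"],
    ["move", "like", "nobody's", "watching"],
    ["steal", "like", "nobody's", "watching"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the embedding space
    window=2,         # context window around the target word
    min_count=1,      # keep every token in this tiny corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
)

print(model.wv["dance"])   # one row of the weight matrix = the word's embedding
```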

Training Data and Bias

Thinking about the data we use is essential!

Working with Embeddings

Similar context \(\rightarrow\) similar embeddings

Vector Basics

Source: Math Insight

Vector Basics


Magnitude: length of a vector


\[||\mathbf{a}|| = \sqrt{a_1^2 + a_2^2 + \ldots + a_n^2}\]

Vector Addition


\[\mathbf{a} + \mathbf{b} = (a_1 + b_1, a_2 + b_2, \ldots, a_n + b_n)\]


\[ \mathbf{a} + \mathbf{b} = [1,3] + [3,2] = [4,5] \]
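
The same operations in NumPy, for the example vectors (a minimal sketch):

```python
# A minimal sketch: magnitude and addition of the example vectors in NumPy.
import numpy as np

a = np.array([1, 3])
b = np.array([3, 2])

print(np.linalg.norm(a))   # magnitude ||a|| = sqrt(1^2 + 3^2) ≈ 3.16
print(a + b)               # element-wise addition -> [4 5]
```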

Vector Addition

Visually


Basic Operations: Examples


Paris is to France as X is to Germany

\(WV_{Paris} - WV_{France} + WV_{Germany} = WV_{X}\)
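
With pretrained vectors this analogy can be solved directly in gensim; a minimal sketch (the model name "glove-wiki-gigaword-100" is an illustrative assumption):

```python
# A minimal sketch: solving the analogy with pretrained GloVe vectors in gensim.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # pretrained vectors (illustrative choice)

# Paris - France + Germany ≈ ?
print(wv.most_similar(positive=["paris", "germany"], negative=["france"], topn=3))
# "berlin" typically appears among the top candidates
```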

Short break

Tutorial: Word Embeddings

Notebook

Advanced Embeddings

Advanced Operations: Projection

Example from Kozlowski, Taddy, and Evans (2019)

Advanced Operations: Projection

Visually

Neat interactive visualization: Math Insight

Advanced Operations: Projection

Dot Product

\[\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \ldots + a_n b_n\]

\[\mathbf{a} \cdot \mathbf{b} = 1 \cdot 3 + 3 \cdot 2 = 3 + 6 = 9\]

Advanced Operations: Projection

Normalize by the magnitude of \(\mathbf{b}\)

(to get the magnitude of the projection of \(\mathbf{a}\) onto \(\mathbf{b}\))


\[\frac{9}{||\mathbf{b}||} = \frac{9}{\sqrt{3^2 + 2^2}} = \frac{9}{\sqrt{13}} \approx 2.5\]
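
A minimal sketch of the dot product and the projection in NumPy, for the same example vectors:

```python
# A minimal sketch: dot product and scalar projection of a onto b in NumPy.
import numpy as np

a = np.array([1, 3])
b = np.array([3, 2])

dot = a @ b                      # 1*3 + 3*2 = 9
proj = dot / np.linalg.norm(b)   # 9 / sqrt(13) ≈ 2.5
print(dot, proj)
```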

Advanced Operations: Projection

Visually

Cosine Similarity

\[\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}|| \, ||\mathbf{b}||} \]

Looks familiar? Recall the magnitude of the projection:

\[\frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{b}||} \]

The difference: normalization by the magnitude of both vectors.

Hence cosine similarity is bounded to the interval \([-1, 1]\).

Cosine Similarity


\[\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}|| \, ||\mathbf{b}||} \]

What is \(\theta\)?

Neat feature: \(\theta\) is the angle between our two vectors!
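
A minimal sketch in NumPy, again for the example vectors:

```python
# A minimal sketch: cosine similarity and the angle between the example vectors.
import numpy as np

a = np.array([1, 3])
b = np.array([3, 2])

cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_theta)                         # ≈ 0.79
print(np.degrees(np.arccos(cos_theta)))  # the angle theta, ≈ 38 degrees
```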

Cosine Similarity: Intuition


Defining Semantic Axes

A nice feature of all this algebra: you can define semantic axes simply by subtracting the vectors of polar opposites!

\[WV_{frenchness} = WV_{french} - WV_{german}\]

The projection of a term on this vector tells us which pole it is more associated with!

Term       Frenchness
Paris           0.298
Lausanne        0.073
Bern           -0.177
Berlin         -0.333
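
A minimal sketch of such a projection with pretrained vectors (the model name is an illustrative assumption; the resulting scores will differ from the table above):

```python
# A minimal sketch: a "Frenchness" axis from pretrained vectors and the projection
# of city terms onto it (model name is illustrative; scores differ from the slide).
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")

axis = wv["french"] - wv["german"]   # semantic axis between polar opposites
axis = axis / np.linalg.norm(axis)   # normalize to unit length

for term in ["paris", "lausanne", "bern", "berlin"]:
    score = float(wv[term] @ axis)   # projection onto the axis
    print(term, round(score, 3))
```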

Document Embeddings

“The paragraph token can be thought of as another word.” (p.3)

Figure 2 from Le and Mikolov (2014)
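
A minimal sketch of paragraph vectors in gensim, following the idea in Le and Mikolov (2014) (toy corpus and hyperparameters are illustrative assumptions):

```python
# A minimal sketch: paragraph vectors (doc2vec) with gensim on a toy corpus.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["dance", "like", "nobody's", "watching"], tags=["doc1"]),
    TaggedDocument(words=["nobody's", "watching", "steal"], tags=["doc2"]),
]

model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, epochs=40)

print(model.dv["doc1"])               # embedding of the first document
print(model.dv.most_similar("doc1"))  # documents with similar embeddings
```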

Other Embeddings


Any collection of texts can be a document!

Short break

Embeddings: Tutorial 2

Notebook

Closing Remarks

Tracking Changes in Meaning over Time

Chronologically trained embeddings (Rodman 2020)

  • Train new model on each time slice
  • Initialize training for time slice \(t\) with embeddings from time slice \(t-1\) (see the sketch below)

Embedding regression (Rodriguez, Spirling, and Stewart 2023)

  • Uses a multivariate regression framework to model embeddings.
  • Makes efficient use of scarce data.
  • Allows for hypothesis testing \(\rightarrow\) many use cases!
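
A minimal sketch of the chronological approach with gensim (toy slices and hyperparameters are illustrative; this is one way to implement the idea, not the original code from Rodman 2020):

```python
# A minimal sketch: chronologically trained embeddings, where training on each
# time slice starts from the previous slice's weights (illustrative only).
from copy import deepcopy
from gensim.models import Word2Vec

# toy time slices: lists of tokenized documents (real corpora would go here)
slices = [
    [["dance", "like", "nobody's", "watching"]],
    [["steal", "like", "nobody's", "watching"]],
]

# train on the first slice, then continue training on each later slice,
# so slice t is initialized with the weights learned up to slice t-1
model = Word2Vec(slices[0], vector_size=50, window=2, min_count=1, sg=1)
snapshots = [deepcopy(model.wv)]

for corpus_t in slices[1:]:
    model.build_vocab(corpus_t, update=True)   # add newly appearing tokens
    model.train(corpus_t, total_examples=len(corpus_t), epochs=model.epochs)
    snapshots.append(deepcopy(model.wv))       # embeddings after this slice
```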

State of the Art Embeddings


Sentence Transformers

  • Based on the Transformer architecture.
  • Pre-trained on large corpora.
  • Fine-tuned for specific tasks.
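
A minimal sketch with the sentence-transformers library (the model name "all-MiniLM-L6-v2" is an illustrative assumption):

```python
# A minimal sketch: sentence embeddings with the sentence-transformers library.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # pretrained model (illustrative)

sentences = ["Dance like nobody's watching", "Nobody's watching: steal!"]
embeddings = model.encode(sentences)              # one vector per sentence

# cosine similarity between the two sentence embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))
```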

Note of Caution: Validation

  • Many researcher degrees of freedom: seed words, hyperparameters, etc.
  • Risk of accommodating one's own biases and cherry-picking

\(\rightarrow\) Validation is essential!

  • The precise method depends on the task at hand.
  • Correlate with established measures.
  • The gold standard remains human assessment.

See you tomorrow!

References

Firth, John Rupert. 1957. Studies in Linguistic Analysis. Oxford: Wiley-Blackwell.
Kozlowski, Austin C., Matt Taddy, and James A. Evans. 2019. “The Geometry of Culture: Analyzing the Meanings of Class Through Word Embeddings.” American Sociological Review 84 (5): 905–49.
Kroon, Anne C., Damian Trilling, and Tamara Raats. 2021. “Guilty by Association: Using Word Embeddings to Measure Ethnic Stereotypes in News Coverage.” Journalism & Mass Communication Quarterly 98 (2): 451–77.
Le, Quoc, and Tomas Mikolov. 2014. “Distributed Representations of Sentences and Documents.” In International Conference on Machine Learning, 1188–96. PMLR.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” arXiv Preprint arXiv:1301.3781.
Rheault, Ludovic, and Christopher Cochrane. 2020. “Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora.” Political Analysis 28 (1): 112–33.
Rodman, Emma. 2020. “A Timely Intervention: Tracking the Changing Meanings of Political Concepts with Word Vectors.” Political Analysis 28 (1): 87–111.
Rodriguez, Pedro L., Arthur Spirling, and Brandon M. Stewart. 2023. “Embedding Regression: Models for Context-Specific Description and Inference.” American Political Science Review 117 (4): 1255–74.