Lecture 2: Embeddings

CUSO WS on Large Language Models

Nicolai Berk

2025-09-03

Introduction to Embeddings

Recap: Bag-of-Words


Dance like nobody’s watching

dance  move  steal  like  nobody's  watching
  1     0      0     1       1         1

Recap: Bag-of-Words


Move like nobody’s watching

dance  move  steal  like  nobody's  watching
  1     0      0     1       1         1
  0     1      0     1       1         1

Recap: Bag-of-Words


Steal like nobody’s watching

dance  move  steal  like  nobody's  watching
  1     0      0     1       1         1
  0     1      0     1       1         1
  0     0      1     1       1         1

Recap: Bag-of-Words


Nobody’s watching: steal!

dance  move  steal  like  nobody's  watching
  1     0      0     1       1         1
  0     1      0     1       1         1
  0     0      1     1       1         1
  0     0      1     0       1         1
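
A minimal sketch of building such a document-term matrix in Python, using scikit-learn's CountVectorizer (the library choice is an assumption, not part of the slides; the default tokenizer treats "nobody's" slightly differently):

```python
# A minimal sketch: bag-of-words matrix for the example sentences with scikit-learn.
# Note: the default tokenizer splits "nobody's" differently from the slides.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Dance like nobody's watching",
    "Move like nobody's watching",
    "Steal like nobody's watching",
    "Nobody's watching: steal!",
]

vectorizer = CountVectorizer()              # default tokenization and lowercasing
X = vectorizer.fit_transform(docs)          # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # vocabulary (columns)
print(X.toarray())                          # counts per document (rows)
```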


What do we miss with this representation?

Limitations of BoW & TF-IDF


  • Ignores semantic meaning of words/tokens
  • Ignores order of words/tokens
  • High dimensionality & sparsity

Distributional Hypothesis


Challenge: How can we measure meaning?


Distributional Hypothesis

A word’s meaning can be inferred from the context it appears in.


Firth (1957): “You shall know a word by the company it keeps”

Example: what is “ziti”?

Source: Your Dictionary

Source: Wikipedia

How can we conceptualize this?

Semantic Space

How do we put this into numbers?

Two (and a half) approaches:

  1. Train a model to predict (word2vec)
    1. the context given a word (skip-gram)
    2. a word given its context (CBOW)
  2. Co-occurrence matrix (GloVe)
    • assess how often words appear together
    • minimize the difference in representations of words that appear in similar contexts

Identical result: word embeddings

word2vec (skip-gram)

Mikolov et al. (2013)

1. Identify word context

2. Build training data

3. Train a model

Result

Weight matrix = word embeddings!
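
A minimal sketch of training such a model with gensim (toy corpus and hyperparameters are illustrative assumptions):

```python
# A minimal sketch: skip-gram word embeddings with gensim
# (toy corpus and hyperparameters are illustrative choices).
from gensim.models import Word2Vec

sentences = [
    ["dance", "like", "nobody's", "watching"],
    ["move", "like", "nobody's", "watching"],
    ["steal", "like", "nobody's", "watching"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the embedding space
    window=2,         # context window around the target word
    min_count=1,      # keep every token in this tiny corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
)

print(model.wv["dance"])   # one row of the weight matrix = the word's embedding
```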

Training Data and Bias

Thinking about the data we use is essential!

Working with Embeddings

Similar context \(\rightarrow\) similar embeddings

Vector Basics

Source: Math Insight

Vector Basics


Magnitude: length of a vector


\[||\mathbf{a}|| = \sqrt{a_1^2 + a_2^2 + \ldots + a_n^2}\]

Vector Addition


\[\mathbf{a} + \mathbf{b} = (a_1 + b_1, a_2 + b_2, \ldots, a_n + b_n)\]


\[ \mathbf{a} + \mathbf{b} = [1,3] + [3,2] = [4,5] \]
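
The same operations in NumPy, for the example vectors (a minimal sketch):

```python
# A minimal sketch: magnitude and addition of the example vectors in NumPy.
import numpy as np

a = np.array([1, 3])
b = np.array([3, 2])

print(np.linalg.norm(a))   # magnitude ||a|| = sqrt(1^2 + 3^2) ≈ 3.16
print(a + b)               # element-wise addition -> [4 5]
```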

Vector Addition

Visually


Basic Operations: Examples


Paris is to France as X is to Germany

\(WV_{Paris} - WV_{France} + WV_{Germany} = WV_{X}\)
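
With pretrained vectors this analogy can be solved directly in gensim; a minimal sketch (the model name "glove-wiki-gigaword-100" is an illustrative assumption):

```python
# A minimal sketch: solving the analogy with pretrained GloVe vectors in gensim.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # pretrained vectors (illustrative choice)

# Paris - France + Germany ≈ ?
print(wv.most_similar(positive=["paris", "germany"], negative=["france"], topn=3))
# "berlin" typically appears among the top candidates
```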

Short break

Tutorial: Word Embeddings

Notebook

Advanced Embeddings

Advanced Operations: Projection

Example from Kozlowski, Taddy, and Evans (2019)

Advanced Operations: Projection

Visually

Neat interactive visualization: Math Insight

Advanced Operations: Projection

Dot Product

\[\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \ldots + a_n b_n\]

\[\mathbf{a} \cdot \mathbf{b} = 1 \cdot 3 + 3 \cdot 2 = 3 + 6 = 9\]

Advanced Operations: Projection

Normalize by the magnitude of \(\mathbf{b}\)

(to get the magnitude of the projection of \(\mathbf{a}\) onto \(\mathbf{b}\))


\[\frac{9}{||\mathbf{b}||} = \frac{9}{\sqrt{3^2 + 2^2}} = \frac{9}{\sqrt{13}} \approx 2.5\]
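
A minimal sketch of the dot product and the projection in NumPy, for the same example vectors:

```python
# A minimal sketch: dot product and scalar projection of a onto b in NumPy.
import numpy as np

a = np.array([1, 3])
b = np.array([3, 2])

dot = a @ b                      # 1*3 + 3*2 = 9
proj = dot / np.linalg.norm(b)   # 9 / sqrt(13) ≈ 2.5
print(dot, proj)
```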

Advanced Operations: Projection

Visually

Cosine Similarity

\[\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}|| \, ||\mathbf{b}||} \]

Looks familiar? Recall the magnitude of the projection:

\[\frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{b}||} \]

The difference: normalization by the magnitude of both vectors.

Hence cosine similarity is bounded to the interval \([-1, 1]\).

Cosine Similarity


\[\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}|| \, ||\mathbf{b}||} \]

What is \(\theta\)?

Neat feature: \(\theta\) is the angle between our two vectors!
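
A minimal sketch in NumPy, again for the example vectors:

```python
# A minimal sketch: cosine similarity and the angle between the example vectors.
import numpy as np

a = np.array([1, 3])
b = np.array([3, 2])

cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_theta)                         # ≈ 0.79
print(np.degrees(np.arccos(cos_theta)))  # the angle theta, ≈ 38 degrees
```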

Cosine Similarity: Intuition


Defining Semantic Axes

A nice feature of all this algebra: you can define semantic axes simply by subtracting the vectors of polar opposites!

\[WV_{frenchness} = WV_{french} - WV_{german}\]

The projection of a term on this vector tells us which pole it is more associated with!

Term       Frenchness
Paris           0.298
Lausanne        0.073
Bern           -0.177
Berlin         -0.333
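
A minimal sketch of such a projection with pretrained vectors (the model name is an illustrative assumption; the resulting scores will differ from the table above):

```python
# A minimal sketch: a "Frenchness" axis from pretrained vectors and the projection
# of city terms onto it (model name is illustrative; scores differ from the slide).
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")

axis = wv["french"] - wv["german"]   # semantic axis between polar opposites
axis = axis / np.linalg.norm(axis)   # normalize to unit length

for term in ["paris", "lausanne", "bern", "berlin"]:
    score = float(wv[term] @ axis)   # projection onto the axis
    print(term, round(score, 3))
```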

Document Embeddings

“The paragraph token can be thought of as another word.” (p.3)

Figure 2 from Le and Mikolov (2014)
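
A minimal sketch of paragraph vectors in gensim, following the idea in Le and Mikolov (2014) (toy corpus and hyperparameters are illustrative assumptions):

```python
# A minimal sketch: paragraph vectors (doc2vec) with gensim on a toy corpus.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["dance", "like", "nobody's", "watching"], tags=["doc1"]),
    TaggedDocument(words=["nobody's", "watching", "steal"], tags=["doc2"]),
]

model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, epochs=40)

print(model.dv["doc1"])               # embedding of the first document
print(model.dv.most_similar("doc1"))  # documents with similar embeddings
```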

Other Embeddings


Any collection of texts can be a document!

Short break

Embeddings: Tutorial 2

Notebook

Closing Remarks

Tracking Changes in Meaning over Time

Chronologically trained embeddings (Rodman 2020)

  • Train new model on each time slice
  • Initialize training for time slice \(t\) with embeddings from time slice \(t-1\) (see the sketch below)

Embedding regression (Rodriguez, Spirling, and Stewart 2023)

  • Uses a multivariate regression framework to model embeddings.
  • Makes efficient use of scarce data.
  • Allows for hypothesis testing \(\rightarrow\) many use cases!
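
A minimal sketch of the chronological approach with gensim (toy slices and hyperparameters are illustrative; this is one way to implement the idea, not the original code from Rodman 2020):

```python
# A minimal sketch: chronologically trained embeddings, where training on each
# time slice starts from the previous slice's weights (illustrative only).
from copy import deepcopy
from gensim.models import Word2Vec

# toy time slices: lists of tokenized documents (real corpora would go here)
slices = [
    [["dance", "like", "nobody's", "watching"]],
    [["steal", "like", "nobody's", "watching"]],
]

# train on the first slice, then continue training on each later slice,
# so slice t is initialized with the weights learned up to slice t-1
model = Word2Vec(slices[0], vector_size=50, window=2, min_count=1, sg=1)
snapshots = [deepcopy(model.wv)]

for corpus_t in slices[1:]:
    model.build_vocab(corpus_t, update=True)   # add newly appearing tokens
    model.train(corpus_t, total_examples=len(corpus_t), epochs=model.epochs)
    snapshots.append(deepcopy(model.wv))       # embeddings after this slice
```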

State of the Art Embeddings


Sentence Transformers

  • Based on the Transformer architecture.
  • Pre-trained on large corpora.
  • Fine-tuned for specific tasks.
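
A minimal sketch with the sentence-transformers library (the model name "all-MiniLM-L6-v2" is an illustrative assumption):

```python
# A minimal sketch: sentence embeddings with the sentence-transformers library.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # pretrained model (illustrative)

sentences = ["Dance like nobody's watching", "Nobody's watching: steal!"]
embeddings = model.encode(sentences)              # one vector per sentence

# cosine similarity between the two sentence embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))
```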

Note of Caution: Validation

  • Many researcher degrees of freedom: seed words, hyperparameters, etc.
  • Risk of accommodating one's own biases and cherry-picking

\(\rightarrow\) Validation is essential!

  • The precise method depends on the task at hand.
  • Correlate with established measures.
  • The gold standard remains human assessment.

See you tomorrow!

References

Firth, John Rupert. 1957. Studies in Linguistic Analysis. Oxford: Wiley-Blackwell.
Kozlowski, Austin C., Matt Taddy, and James A. Evans. 2019. “The Geometry of Culture: Analyzing the Meanings of Class Through Word Embeddings.” American Sociological Review 84 (5): 905–49.
Kroon, Anne C., Damian Trilling, and Tamara Raats. 2021. “Guilty by Association: Using Word Embeddings to Measure Ethnic Stereotypes in News Coverage.” Journalism & Mass Communication Quarterly 98 (2): 451–77.
Le, Quoc, and Tomas Mikolov. 2014. “Distributed Representations of Sentences and Documents.” In International Conference on Machine Learning, 1188–96. PMLR.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” arXiv Preprint arXiv:1301.3781.
Rheault, Ludovic, and Christopher Cochrane. 2020. “Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora.” Political Analysis 28 (1): 112–33.
Rodman, Emma. 2020. “A Timely Intervention: Tracking the Changing Meanings of Political Concepts with Word Vectors.” Political Analysis 28 (1): 87–111.
Rodriguez, Pedro L., Arthur Spirling, and Brandon M. Stewart. 2023. “Embedding Regression: Models for Context-Specific Description and Inference.” American Political Science Review 117 (4): 1255–74.