CUSO WS on Large Language Models
2025-09-03
Gentzkow and Shapiro (2010) measuring media bias with word prevalence
Monroe, Colaresi, and Quinn (2008) exploring differences in partisan language use
Kroon, Trilling, and Raats (2021) studying stereotypes with word embeddings
Rheault and Cochrane (2020) scaling politicians with document embeddings
Berk (2025) studying migration coverage with BERT models
Le Mens and Gallego (2025) positioning political texts with large language models
Today
Thursday
Friday
Only relevant source: github.com/nicolaiberk/llm_ws
Intro to Colab & Python Basics
dance | like | nobody | is | watching |
---|---|---|---|---|
1 | 1 | 1 | 1 | 1 |
dance | like | nobody | is | watching | move |
---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 0 |
0 | 1 | 1 | 1 | 1 | 1 |
dance | like | nobody | is | watching | move | I |
---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 0 | 0 |
0 | 1 | 1 | 1 | 1 | 1 | 0 |
0 | 1 | 1 | 0 | 0 | 0 | 1 |
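A minimal sketch of how such a document-term matrix could be built by hand. The three sentences are assumptions chosen to resemble the tables above (the slides do not state them), and tokens are split on whitespace only.

```python
# Assumed example sentences; the slides only show the resulting counts.
docs = [
    "dance like nobody is watching",
    "move like nobody is watching",
    "I like nobody",
]

# Naive tokenization: split on whitespace.
tokenized = [doc.split() for doc in docs]

# Vocabulary in order of first appearance, so columns grow as documents are added.
vocab = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocab:
            vocab.append(token)

# One row per document, one column per vocabulary term, cells are raw counts.
dtm = [[tokens.count(term) for term in vocab] for tokens in tokenized]

print(" | ".join(vocab))
for row in dtm:
    print(" | ".join(str(count) for count in row))
```

Each new document can add columns to the matrix, which is why the vocabulary (and the number of zeros) grows quickly on real corpora.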
Intuition: same words - similar meaning
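One way to make this intuition concrete is to compare the count vectors of two documents. The sketch below uses cosine similarity, which is an assumption here (the slides do not name a similarity measure), and the example sentences are likewise assumed.

```python
import math
from collections import Counter


def cosine(doc_a, doc_b):
    """Cosine similarity between the word-count vectors of two strings."""
    counts_a = Counter(doc_a.split())
    counts_b = Counter(doc_b.split())
    # Dot product over shared terms (missing terms count as 0).
    dot = sum(counts_a[term] * counts_b[term] for term in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)


# Assumed example sentences.
sim_12 = cosine("dance like nobody is watching", "move like nobody is watching")
sim_13 = cosine("dance like nobody is watching", "I like nobody")

print(round(sim_12, 3), round(sim_13, 3))
```

The first pair shares four of five words and scores about 0.8; the second pair shares only two words and scores lower.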
Problem: raw counts give very common terms like ‘the’, ‘news’, or ‘politics’ the same weight as distinctive terms like ‘climate’ or ‘Gaza’.
Term Frequency-Inverse Document Frequency
Idea: a word's specificity is given by its inverse document frequency: the fewer documents contain the word, the more specific it is
\[\mathrm{TFIDF}_{t,d} = f_{t,d} \cdot \log\frac{N}{n_t} \]
\(f_{t,d}\): Frequency of Term in Document
\(n_t\): Total Number of Documents Containing Term
\(N\): Total Number of Documents
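A small worked example, assuming the common definition \(f_{t,d} \cdot \log(N / n_t)\) with the natural logarithm and two assumed example sentences. Libraries often use smoothed variants, so their numbers will differ.

```python
import math

# Assumed example sentences (not given in the slides).
docs = [
    "dance like nobody is watching",
    "move like nobody is watching",
]

tokenized = [doc.split() for doc in docs]
N = len(tokenized)  # total number of documents

# Vocabulary in order of first appearance.
vocab = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocab:
            vocab.append(token)

# n_t: number of documents that contain each term.
n_t = {term: sum(term in tokens for tokens in tokenized) for term in vocab}

# TF-IDF weight per document and term: f_{t,d} * log(N / n_t).
tfidf = [
    {term: tokens.count(term) * math.log(N / n_t[term]) for term in vocab}
    for tokens in tokenized
]

for weights in tfidf:
    print({term: round(value, 3) for term, value in weights.items()})
```

Terms that occur in every document ("like", "nobody", "is", "watching") get weight 0, while "dance" and "move" keep a positive weight in the one document where they appear.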
dance | like | nobody | is | watching | move |
---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 0 |
0 | 1 | 1 | 1 | 1 | 1 |
What is a word?
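The answer depends on how text is split into tokens. The sketch below compares three simple tokenization choices on an assumed example sentence; the regular expressions are illustrative choices, not a recommended standard.

```python
import re

# Assumed example text.
text = "Dance, like nobody's watching. Dance!"

# 1) Split on whitespace: punctuation stays attached to words.
whitespace_tokens = text.split()

# 2) Runs of word characters: punctuation is dropped, "nobody's" is split in two.
word_char_tokens = re.findall(r"\w+", text)

# 3) Lowercase first, then keep letters and apostrophes together.
lowercased_tokens = re.findall(r"[a-z']+", text.lower())

for name, tokens in [
    ("whitespace", whitespace_tokens),
    ("word characters", word_char_tokens),
    ("lowercased", lowercased_tokens),
]:
    print(name, tokens, "unique:", len(set(tokens)))
```

Each choice yields a different vocabulary, so the columns of the document-term matrix depend on the tokenizer.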
Pandas & basic text representation