Lecture 1: Introduction to NLP & Text Representation

CUSO WS on Large Language Models

Nicolai Berk

2025-09-03

Introductions

Hi, my name is Nico

  • Postdoc at ETH Public Policy Group & IPL
  • Main focus: media and democracy
  • Working a lot with political text, especially news
  • Research areas:
    • Effects of media framing
    • Content moderation and hate speech
    • Media pluralism & democratic backsliding
  • Heavy use of NLP & LLMs in all areas

And you?


Please introduce yourself:

  • What is your name and affiliation?
  • What are your research interests?
  • What is your experience with NLP & programming?
  • Why are you interested in LLMs?
  • Do you have a specific application in mind?

Motivation

Why Natural Language Processing (NLP)?

Social interactions happen through text:

  • Political campaigns
  • Legislation
  • News articles
  • Social media interactions

Why NLP?


  • Research questions often require measurement from millions of documents
  • Reading & annotating would take a lifetime
  • Our (or our RA’s) definitions might be inconsistent
  • Not a great use of our time

We need scalable methods!

Examples


Gentzkow and Shapiro (2010) measuring media bias with word prevalence

Examples

Monroe, Colaresi, and Quinn (2008) exploring differences in partisan language use

Examples

Kroon, Trilling, and Raats (2021) studying stereotypes with word embeddings

Examples

Rheault and Cochrane (2020) scaling politicians with document embeddings

Examples

Berk (2025) studying migration coverage with BERT models

Examples

Le Mens and Gallego (2025) positioning political texts with large language models

Summary


  • NLP methods offer powerful ways to study language at scale
  • Many different methods and tasks
  • Focus of this course:
    • Text representation (today)
    • Machine learning (tomorrow)
    • Transformer models (tomorrow & Friday)

Course Overview

NLP? I signed up for LLMs!


  • LLMs are highly complex
  • Intuition about them requires understanding of embeddings, machine learning, and neural networks
  • Course introduces each
  • These methods are powerful in their own right
  • We will also cover use of LLMs

Course Structure

Today

Morning: Intro to Python & Text Representation

Afternoon: Embeddings

Thursday

Morning: Machine Learning

Afternoon: Intro to Transformer Models

Friday

Morning: Generative Transformers

Afternoon: Using LLMs in your research/tbd

Course Structure and Conduct

  • Each session consists of a lecture and a hands-on coding tutorial.
  • Feel free to discuss coding problems with your neighbor.
  • Use of AI is explicitly encouraged: the course should teach you to understand the code, not make you an expert programmer.
  • Please ask lots of questions & interrupt me!
  • Be nice!

Course Materials

Only relevant source: github.com/nicolaiberk/llm_ws

Contains

  • Syllabus
  • Links to slides
  • Notebooks for each session
  • Additional materials

Content will be added every day

Intro to Python

Why Python?


  • Python is a general-purpose programming language with many applications.
  • Simple syntax, versatile tool.
  • Less of a statistics focus than R (more ‘practical’)
  • Rich ecosystem of libraries for text processing, machine learning, and LLMs.

Google Colab


  • You can download Python and use it locally on your computer
  • But it is simpler if we all use the same (better) infrastructure
  • Google Colab is a free cloud service where we can execute Python code.
  • Crucially, it provides access to GPUs for training LLMs.

Break

Tutorial I

Intro to Colab & Python Basics

Notebook

Representing Text

Some lingo


  • Corpus: a collection of texts used for analysis
  • Document: a single text within a corpus
  • Token: a single unit of text (e.g. a word or punctuation mark)
  • Vocabulary: the set of unique tokens in a corpus
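
A toy example (hypothetical, not from the tutorial notebook) may help fix these terms in Python:

```python
# A toy corpus of two documents (made-up example)
corpus = ["Dance like nobody is watching",
          "Move like nobody is watching"]

document = corpus[0]                    # a single text within the corpus
tokens = document.lower().split()       # naive tokenization on spaces
vocabulary = {t for doc in corpus for t in doc.lower().split()}

print(tokens)       # ['dance', 'like', 'nobody', 'is', 'watching']
print(vocabulary)   # the set of unique tokens across the corpus
```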

Text analysis: how?


  • For most research questions, our unit of analysis is the document (though we explore alternatives)
  • But computers cannot read
  • We need to transform our text into numbers

Quantifying Text

Most Basic: Count Words


  • Dictionary approach: count occurrences of predefined terms and use the counts directly as an outcome measure
  • Bag-of-Words (BoW): represents text as a vector of token counts
    • Each unique token in the text becomes a feature
    • Allows comparing text similarity and using it in models

Bag-of-Words


How would you encode the following sentence?

Dance like nobody is watching

Bag-of-Words


Dance like nobody is watching

dance  like  nobody  is  watching
    1     1       1   1         1

Bag-of-Words


Move like nobody is watching

                               dance  like  nobody  is  watching  move
Dance like nobody is watching      1     1       1   1         1     0
Move like nobody is watching       0     1       1   1         1     1

Bag-of-Words


I like nobody

                               dance  like  nobody  is  watching  move  I
Dance like nobody is watching      1     1       1   1         1     0  0
Move like nobody is watching       0     1       1   1         1     1  0
I like nobody                      0     1       1   0         0     0  1
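
In Python, a document-term matrix like the one above can be built with scikit-learn’s CountVectorizer. A minimal sketch (the tutorial notebook may use a different implementation; note that the columns come out in alphabetical order rather than order of appearance):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Dance like nobody is watching",
        "Move like nobody is watching",
        "I like nobody"]

# token_pattern keeps one-letter tokens such as "I";
# the default pattern would silently drop them
vectorizer = CountVectorizer(lowercase=True, token_pattern=r"\b\w+\b")
X = vectorizer.fit_transform(docs)           # sparse document-term matrix

print(vectorizer.get_feature_names_out())    # vocabulary (columns, alphabetical)
print(X.toarray())                           # one row of token counts per document
```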

Bag-of-Words: use cases

Intuition: texts that share the same words have similar meaning

  • Sentiment analysis: compare use of negative and positive terms (e.g. VADER; see the sketch below)
  • Compare similarity of texts
  • Clustering/topic modelling
  • Can be used as input for machine learning models (tomorrow)
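
As an illustration of the dictionary approach, a minimal sketch of a VADER sentiment score, assuming the standalone vaderSentiment package (NLTK also ships a version):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Lexicon-based scores: negative, neutral, positive, and a compound summary
print(analyzer.polarity_scores("I love this great article!"))
print(analyzer.polarity_scores("This is a terrible, boring read."))
```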

Which limitations do you see with this approach?

One limitation of BoW

  • BoW assigns similar importance to all words
  • But: not all words are equally informative!

Example: news topics

The importance of very common terms like ‘the’, ‘news’, or ‘politics’ is the same as that of ‘climate’ or ‘Gaza’.

Solution: down-weight terms that are prevalent across many documents

TF-IDF

Term Frequency-Inverse Document Frequency

Idea: a term’s specificity is captured by its inverse document frequency (how many documents contain the term)

\[TFIDF_{t,d} = f_{t,d} \cdot \log\frac{N}{n_t} \]

\(f_{t,d}\): Frequency of term \(t\) in document \(d\)

\(n_t\): Number of documents containing term \(t\)

\(N\): Total number of documents

Intuition

Dance like nobody is watching

Move like nobody is watching


                               dance  like  nobody  is  watching  move
Dance like nobody is watching      1     1       1   1         1     0
Move like nobody is watching       0     1       1   1         1     1
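
Under this weighting, ‘like’, ‘nobody’, ‘is’, and ‘watching’ occur in both documents and receive a weight of zero, while ‘dance’ and ‘move’ each occur in only one document and keep a positive weight. A minimal worked example in plain Python (libraries such as scikit-learn implement smoothed variants of this formula):

```python
from math import log

docs = [["dance", "like", "nobody", "is", "watching"],
        ["move", "like", "nobody", "is", "watching"]]
N = len(docs)                                # total number of documents

def tfidf(term, doc):
    f_td = doc.count(term)                   # frequency of the term in this document
    n_t = sum(term in d for d in docs)       # number of documents containing the term
    return f_td * log(N / n_t)

print(round(tfidf("dance", docs[0]), 2))     # 0.69: appears in only one document
print(round(tfidf("like", docs[0]), 2))      # 0.0:  appears in every document
```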

Tokenization

What is a word?

How can we split our text?

  • Natural unit usually is the word
  • In some research areas this is far harder (e.g. genetics)
  • Most basic form: split on spaces and punctuation
  • Refinements and alternatives:
    • stemming/lemmatization
    • selection
    • n-grams: include two- or three-word phrases
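
A minimal sketch of basic tokenization and n-grams using only Python’s standard library (the tutorial may rely on a dedicated tokenizer instead):

```python
import re

text = "Dance like nobody is watching!"

# Most basic form: split on anything that is not a word character
tokens = re.findall(r"\w+", text.lower())
print(tokens)    # ['dance', 'like', 'nobody', 'is', 'watching']

# Bigrams: all pairs of adjacent tokens
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
print(bigrams)   # ['dance like', 'like nobody', 'nobody is', 'is watching']
```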

Stemming


  • Reduces words to their base form by removing common suffixes and prefixes
  • This reduces vocabulary size
  • Usually based on strict rules (e.g. “running” -> “run”)
  • More advanced: lemmatization (maps words to their dictionary form)
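
A minimal sketch using NLTK, one common choice (the tutorial may use a different library; the lemmatizer additionally requires the WordNet data via nltk.download('wordnet')):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
print(stemmer.stem("running"))    # 'run'   (rule-based suffix stripping)
print(stemmer.stem("studies"))    # 'studi' (stems need not be real words)

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))           # 'study' (dictionary form)
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'   (needs the part of speech)
```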

Selection


Several ways to reduce vocabulary size further

  • Stopword removal (e.g. “the”, “is”, “in”)
  • Removal of very rare words
  • Sometimes very frequent words are removed as well
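
These selection steps are often just arguments to the vectorizer. A minimal sketch with scikit-learn’s CountVectorizer (parameter values are purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Dance like nobody is watching",
        "Move like nobody is watching",
        "I like nobody"]

vectorizer = CountVectorizer(
    stop_words="english",  # drop common English function words
    min_df=2,              # drop tokens appearing in fewer than 2 documents
    max_df=0.9,            # drop tokens appearing in more than 90% of documents
)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # in this tiny corpus, only 'watching' survives
```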

Why is it useful to reduce our vocabulary size?

Why might it nevertheless be useful to retain common n-grams?

Tutorial II

Pandas & basic text representation

Notebook

Resources

Berk, Nicolai. 2025. “The Impact of Media Framing in Complex Information Environments.” Political Communication, 1–17.
Gentzkow, Matthew, and Jesse M Shapiro. 2010. “What Drives Media Slant? Evidence from US Daily Newspapers.” Econometrica 78 (1): 35–71.
Kroon, Anne C, Damian Trilling, and Tamara Raats. 2021. “Guilty by Association: Using Word Embeddings to Measure Ethnic Stereotypes in News Coverage.” Journalism & Mass Communication Quarterly 98 (2): 451–77.
Le Mens, Gaël, and Aina Gallego. 2025. “Positioning Political Texts with Large Language Models by Asking and Averaging.” Political Analysis 33 (3): 274–82.
Monroe, Burt L, Michael P Colaresi, and Kevin M Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16 (4): 372–403.
Rheault, Ludovic, and Christopher Cochrane. 2020. “Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora.” Political Analysis 28 (1): 112–33.