Lecture 1: Introduction to NLP & Text Representation

CUSO WS on Large Language Models

Nicolai Berk

2025-09-03

Introductions

Hi, my name is Nico

  • Postdoc at ETH Public Policy Group & IPL
  • Main focus: media and democracy
  • Working a lot with political text, especially news
  • Research areas:
    • Effects of media framing
    • Content moderation and hate speech
    • Media pluralism & democratic backsliding
  • Heavy use of NLP & LLMs in all areas

And you?


Please introduce yourself:

  • What is your name and affiliation?
  • What are your research interests?
  • What is your experience with NLP & programming?
  • Why are you interested in LLMs?
  • Do you have a specific application in mind?

Motivation

Why Natural Language Processing (NLP)?

Social interactions happen through text:

  • Political campaigns
  • Legislation
  • News articles
  • Social media interactions

Why NLP?


  • Research questions often require measurement from millions of documents
  • Reading & annotating would take a lifetime
  • Our (or our RA’s) definitions might be inconsistent
  • Not a great use of our time

We need scalable methods!

Examples


Gentzkow and Shapiro (2010) measuring media bias with word prevalence

Examples

Monroe, Colaresi, and Quinn (2008) exploring differences in partisan language use

Examples

Kroon, Trilling, and Raats (2021) studying stereotypes with word embeddings

Examples

Rheault and Cochrane (2020) scaling politicians with document embeddings

Examples

Berk (2025) studying migration coverage with BERT models

Examples

Le Mens and Gallego (2025) positioning political texts with large language models

Summary


  • NLP methods offer powerful ways to study language at scale
  • Many different methods and tasks
  • Focus of this course:
    • Text representation (today)
    • Machine learning (tomorrow)
    • Transformer models (tomorrow & Friday)

Course Overview

NLP? I signed up for LLMs!


  • LLMs are highly complex
  • Intuition about them requires understanding of embeddings, machine learning, and neural networks
  • Course introduces each
  • These methods are powerful in their own right
  • We will also cover use of LLMs

Course Structure

Today

Morning: Intro to Python & Text Representation

Afternoon: Embeddings

Thursday

Morning: Machine Learning

Afternoon: Intro to Transformer Models

Friday

Morning: Generative Transformers

Afternoon: Using LLMs in your research/tbd

Course Structure and Conduct

  • Each session consists of a lecture and a hands-on coding tutorial.
  • Feel free to discuss coding problems with your neighbor.
  • Use of AI is explicitly encouraged: the course should teach you to understand the code, not make you an expert programmer.
  • Please ask lots of questions & interrupt me!
  • Be nice!

Course Materials

Only relevant source: github.com/nicolaiberk/llm_ws

Contains

  • Syllabus
  • Links to slides
  • Notebooks for each session
  • Additional materials

Content will be added every day

Intro to Python

Why Python?


  • Python is a general-purpose programming language with many applications.
  • Simple syntax, versatile tool.
  • Less of a statistics focus than R (more ‘practical’)
  • Rich ecosystem of libraries for text processing, machine learning, and LLMs.

Google Colab


  • You can download Python and use it locally on your computer
  • But it is simpler if we all use the same (better) infrastructure
  • Google Colab is a free cloud service where we can execute Python code.
  • Crucially, it provides access to GPUs for training LLMs.

Break

Tutorial I

Intro to Colab & Python Basics

Notebook

Representing Text

Some lingo


  • Corpus: a collection of texts used for analysis
  • Document: a single text within a corpus
  • Token: a single unit of text (e.g. a word or punctuation mark)
  • Vocabulary: the set of unique tokens in a corpus
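
A toy example (hypothetical, not from the tutorial notebook) may help fix these terms in Python:

```python
# A toy corpus of two documents (made-up example)
corpus = ["Dance like nobody is watching",
          "Move like nobody is watching"]

document = corpus[0]                    # a single text within the corpus
tokens = document.lower().split()       # naive tokenization on spaces
vocabulary = {t for doc in corpus for t in doc.lower().split()}

print(tokens)       # ['dance', 'like', 'nobody', 'is', 'watching']
print(vocabulary)   # the set of unique tokens across the corpus
```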

Text analysis: how?


  • For most research questions, our unit of analysis is the document (though we explore alternatives)
  • But computers cannot read
  • We need to transform our text into numbers

Quantifying Text

Most Basic: Count Words


  • Dictionary approach: count occurrences of predefined terms and use the counts directly as an outcome measure
  • Bag-of-Words (BoW): represents text as a vector of token counts
    • Each unique token in the text becomes a feature
    • Allows comparing text similarity and using it in models

Bag-of-Words


How would you encode the following sentence?

Dance like nobody is watching

Bag-of-Words


Dance like nobody is watching

dance  like  nobody  is  watching
    1     1       1   1         1

Bag-of-Words


Move like nobody is watching

                               dance  like  nobody  is  watching  move
Dance like nobody is watching      1     1       1   1         1     0
Move like nobody is watching       0     1       1   1         1     1

Bag-of-Words


I like nobody

                               dance  like  nobody  is  watching  move  I
Dance like nobody is watching      1     1       1   1         1     0  0
Move like nobody is watching       0     1       1   1         1     1  0
I like nobody                      0     1       1   0         0     0  1
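
In Python, a document-term matrix like the one above can be built with scikit-learn’s CountVectorizer. A minimal sketch (the tutorial notebook may use a different implementation; note that the columns come out in alphabetical order rather than order of appearance):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Dance like nobody is watching",
        "Move like nobody is watching",
        "I like nobody"]

# token_pattern keeps one-letter tokens such as "I";
# the default pattern would silently drop them
vectorizer = CountVectorizer(lowercase=True, token_pattern=r"\b\w+\b")
X = vectorizer.fit_transform(docs)           # sparse document-term matrix

print(vectorizer.get_feature_names_out())    # vocabulary (columns, alphabetical)
print(X.toarray())                           # one row of token counts per document
```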

Bag-of-Words: use cases

Intuition: texts that share the same words have similar meaning

  • Sentiment analysis: compare use of negative and positive terms (e.g. VADER; see the sketch below)
  • Compare similarity of texts
  • Clustering/topic modelling
  • Can be used as input for machine learning models (tomorrow)
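
As an illustration of the dictionary approach, a minimal sketch of a VADER sentiment score, assuming the standalone vaderSentiment package (NLTK also ships a version):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Lexicon-based scores: negative, neutral, positive, and a compound summary
print(analyzer.polarity_scores("I love this great article!"))
print(analyzer.polarity_scores("This is a terrible, boring read."))
```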

Which limitations do you see with this approach?

One limitation of BoW

  • BoW assigns similar importance to all words
  • But: not all words are equally informative!

Example: news topics

The importance of very common terms like ‘the’, ‘news’, or ‘politics’ is the same as that of ‘climate’ or ‘Gaza’.

Solution: down-weight terms that are prevalent across many documents

TF-IDF

Term Frequency-Inverse Document Frequency

Idea: a term’s specificity is captured by its inverse document frequency (how many documents contain the term)

\[TFIDF_{t,d} = f_{t,d} \cdot \log\frac{N}{n_t} \]

\(f_{t,d}\): Frequency of term \(t\) in document \(d\)

\(n_t\): Number of documents containing term \(t\)

\(N\): Total number of documents

Intuition

Dance like nobody is watching

Move like nobody is watching


                               dance  like  nobody  is  watching  move
Dance like nobody is watching      1     1       1   1         1     0
Move like nobody is watching       0     1       1   1         1     1
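
Under this weighting, ‘like’, ‘nobody’, ‘is’, and ‘watching’ occur in both documents and receive a weight of zero, while ‘dance’ and ‘move’ each occur in only one document and keep a positive weight. A minimal worked example in plain Python (libraries such as scikit-learn implement smoothed variants of this formula):

```python
from math import log

docs = [["dance", "like", "nobody", "is", "watching"],
        ["move", "like", "nobody", "is", "watching"]]
N = len(docs)                                # total number of documents

def tfidf(term, doc):
    f_td = doc.count(term)                   # frequency of the term in this document
    n_t = sum(term in d for d in docs)       # number of documents containing the term
    return f_td * log(N / n_t)

print(round(tfidf("dance", docs[0]), 2))     # 0.69: appears in only one document
print(round(tfidf("like", docs[0]), 2))      # 0.0:  appears in every document
```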

Tokenization

What is a word?

How can we split our text?

  • Natural unit usually is the word
  • In some research areas this is far harder (e.g. genetics)
  • Most basic form: split on spaces and punctuation
  • Refinements and alternatives:
    • stemming/lemmatization
    • selection
    • n-grams: include two- or three-word phrases
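
A minimal sketch of basic tokenization and n-grams using only Python’s standard library (the tutorial may rely on a dedicated tokenizer instead):

```python
import re

text = "Dance like nobody is watching!"

# Most basic form: split on anything that is not a word character
tokens = re.findall(r"\w+", text.lower())
print(tokens)    # ['dance', 'like', 'nobody', 'is', 'watching']

# Bigrams: all pairs of adjacent tokens
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
print(bigrams)   # ['dance like', 'like nobody', 'nobody is', 'is watching']
```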

Stemming


  • Reduces words to their base form by removing common suffixes and prefixes
  • This reduces vocabulary size
  • Usually based on strict rules (e.g. “running” -> “run”)
  • More advanced: lemmatization (maps words to their dictionary form)
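
A minimal sketch using NLTK, one common choice (the tutorial may use a different library; the lemmatizer additionally requires the WordNet data via nltk.download('wordnet')):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
print(stemmer.stem("running"))    # 'run'   (rule-based suffix stripping)
print(stemmer.stem("studies"))    # 'studi' (stems need not be real words)

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))           # 'study' (dictionary form)
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'   (needs the part of speech)
```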

Selection


Several ways to reduce vocabulary size further

  • Stopword removal (e.g. “the”, “is”, “in”)
  • Removal of very rare words
  • Sometimes very frequent words are removed as well
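
These selection steps are often just arguments to the vectorizer. A minimal sketch with scikit-learn’s CountVectorizer (parameter values are purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Dance like nobody is watching",
        "Move like nobody is watching",
        "I like nobody"]

vectorizer = CountVectorizer(
    stop_words="english",  # drop common English function words
    min_df=2,              # drop tokens appearing in fewer than 2 documents
    max_df=0.9,            # drop tokens appearing in more than 90% of documents
)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # in this tiny corpus, only 'watching' survives
```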

Why is it useful to reduce our vocabulary size?

Why might it nevertheless be useful to retain common n-grams?

Tutorial II

Pandas & basic text representation

Notebook

Resources

Berk, Nicolai. 2025. “The Impact of Media Framing in Complex Information Environments.” Political Communication, 1–17.
Gentzkow, Matthew, and Jesse M Shapiro. 2010. “What Drives Media Slant? Evidence from US Daily Newspapers.” Econometrica 78 (1): 35–71.
Kroon, Anne C, Damian Trilling, and Tamara Raats. 2021. “Guilty by Association: Using Word Embeddings to Measure Ethnic Stereotypes in News Coverage.” Journalism & Mass Communication Quarterly 98 (2): 451–77.
Le Mens, Gaël, and Aina Gallego. 2025. “Positioning Political Texts with Large Language Models by Asking and Averaging.” Political Analysis 33 (3): 274–82.
Monroe, Burt L, Michael P Colaresi, and Kevin M Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16 (4): 372–403.
Rheault, Ludovic, and Christopher Cochrane. 2020. “Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora.” Political Analysis 28 (1): 112–33.