CUSO WS on LLMs

Session 3: Intro to Supervised Machine Learning

Nicolai Berk

2025-09-03

Recap: Text Representation


  • Capture Semantic Meaning with Embeddings
  • Document Representations:
    • Bag of Words (BoW)
    • Term Frequency-Inverse Document Frequency (TF-IDF)
    • Document Embeddings

Today: how can we use these representations to measure concepts of interest?

Today


  • The basics of supervised machine learning
  • Model evaluation
  • Model selection
  • Intro to neural networks
  • Training neural networks

Examples

Netflix recommendations

Examples

Drug Development (Source: Catacutan et al. 2024)

Examples

ChatGPT

ML Overview

Many different forms!

  • Regression
  • Classification
  • Clustering/Unsupervised ML
  • Generative Models

Focus here on classification/supervised ML!


Supervised learning (A very precise definition):



We know stuff about some documents and want to know the same stuff about other documents.

Core Challenge: Prediction

Some Lingo


Term Meaning
Classifier A statistical model fitted to some data to make predictions about new, unseen data.
Training The process of fitting the classifier to the data.
Train and test set Datasets used to train and evaluate the classifier.
Vectorizer A tool used to translate text into numbers.

The Classic Pipeline for Text Classification


  1. Annotate subset.
  2. Divide into training- and test-set.
  3. Transform text into numerical features.
  4. Fit model.
  5. Predict.
  6. Evaluate.
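The six steps above can be sketched end to end with scikit-learn; the toy corpus, labels, and the choice of logistic regression are illustrative placeholders:

```python
# Sketch of the classic pipeline with scikit-learn.
# The toy corpus, labels, and logistic regression are placeholders.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# 1. annotated subset (1 = positive, 0 = negative)
texts = ["great movie", "awful film", "loved it", "terrible plot",
         "fantastic acting", "boring and bad", "wonderful story", "worst ever"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# 2. divide into training and test set
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42)

# 3. transform text into numerical features
vec = CountVectorizer()
X_train_dfm = vec.fit_transform(X_train)
X_test_dfm = vec.transform(X_test)      # reuse the fitted vocabulary!

# 4. fit model
clf = LogisticRegression().fit(X_train_dfm, y_train)

# 5. predict
y_pred = clf.predict(X_test_dfm)

# 6. evaluate
print(f1_score(y_test, y_pred))
```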

The Classic Pipeline

0. Annotation


  • We need data from which to learn.
  • Assign labels to documents.
  • Usually randomly sampled.


1. Divide into train and test (and val)


  • Usually randomly sampled (not always!)
  • Customary: 90-10 (train-test) or 80-10-10 (train-validation-test) split
  • The only consideration for test/validation set size is the precision of the evaluation metrics; for very large datasets, a 10% test split is larger than needed.
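Assuming scikit-learn, an 80-10-10 split can be done with two calls to `train_test_split`; the 100 numbered "documents" below are stand-ins for annotated texts:

```python
# Illustrative 80-10-10 split via two calls to train_test_split.
from sklearn.model_selection import train_test_split

docs = [f"document {i}" for i in range(100)]   # placeholder documents
labels = [i % 2 for i in range(100)]           # placeholder labels

# split off 20%, then halve it into validation and test
train, rest, y_train, y_rest = train_test_split(
    docs, labels, test_size=0.2, random_state=0)
val, test, y_val, y_test = train_test_split(
    rest, y_rest, test_size=0.5, random_state=0)

print(len(train), len(val), len(test))  # 80 10 10
```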

2. Transformation


Statistical models can only read numbers

\(\rightarrow\) we need to translate!

Classic DFM

ID  Text
1   This is a text
2   This is no text

ID  this  is  a  text  no
1      1   1  1     1   0
2      1   1  0     1   1

3. Fit model.


  • Many different models: OLS, logistic regression
  • Other common models: Naïve Bayes, SVM, XGBoost

Not delving into different models today!

4. Predict.

Use trained model to generate labels for unlabeled cases.

review label
great movie! ?
what a bunch of cr*p ?
I lost all faith in humanity after watching this ?
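A sketch of this step with scikit-learn; the tiny training corpus here is made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# tiny made-up training corpus (placeholder; 1 = positive, 0 = negative)
train_texts = ["wonderful film", "great story", "loved it",
               "awful movie", "terrible acting", "complete garbage"]
train_labels = [1, 1, 1, 0, 0, 0]

vec = CountVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(train_texts), train_labels)

# generate labels for the unlabeled reviews from the table above
new_reviews = ["great movie!",
               "what a bunch of cr*p",
               "I lost all faith in humanity after watching this"]
preds = clf.predict(vec.transform(new_reviews))
print(preds)
```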

5. Evaluation


Confusion Matrix

        FALSE  TRUE
FALSE     688     9
TRUE       37   266
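A matrix like this can be produced with scikit-learn's `confusion_matrix`, which puts actual labels in rows and predictions in columns (toy labels below):

```python
from sklearn.metrics import confusion_matrix

# toy labels; rows of the output are actual classes, columns predictions
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
# [[1 1]
#  [1 2]]
```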

5. Evaluation


Term Meaning
Accuracy How much does it get right overall?
Recall How much of the relevant cases does it find?
Precision How many of the found cases are relevant?
F1 Score Harmonic mean of precision and recall.
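Applied to the confusion matrix shown earlier, reading rows as actual labels and columns as predictions (so TN = 688, FP = 9, FN = 37, TP = 266):

```python
# Metrics from the confusion matrix above, assuming rows = actual
# labels and columns = predictions.
tn, fp, fn, tp = 688, 9, 37, 266

accuracy  = (tp + tn) / (tp + tn + fp + fn)
recall    = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(recall, 3), round(precision, 3), round(f1, 3))
# 0.954 0.878 0.967 0.92
```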

Issues with Common Metrics


Best Practice Evaluation

  • Compare against informative baselines
    • Random prediction at prevalence rate
    • Compare classifiers of varying complexity
  • Think about metric of interest (cancer detection vs. ad targeting)
  • Use prevalence-insensitive metrics:
    • Matthews correlation coefficient (MCC)
    • Youden's J / bookmaker informedness (BM)
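Both metrics can be computed directly from the confusion-matrix counts (again assuming rows are actual labels and columns predictions):

```python
import math

# counts from the confusion matrix shown earlier
# (assuming rows = actual labels, columns = predictions)
tn, fp, fn, tp = 688, 9, 37, 266

# Matthews correlation coefficient
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# Youden's J / bookmaker informedness: sensitivity + specificity - 1
j = tp / (tp + fn) + tn / (tn + fp) - 1

print(round(mcc, 3), round(j, 3))  # 0.89 0.865
```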

Break

Tutorial I

Training a simple text classifier

Finding the perfect match

Mean Squared Error (MSE)




\[MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]
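The formula translates directly into code; the observed and predicted values below are made up:

```python
import numpy as np

y = np.array([3.0, 1.0, 4.0])       # observed values (made up)
y_hat = np.array([2.5, 1.5, 4.0])   # predictions (made up)

mse = np.mean((y - y_hat) ** 2)
print(mse)  # 0.16666... = (0.25 + 0.25 + 0) / 3
```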

The bias-variance tradeoff

  • Read this short explainer (stop when you reach LOESS)
  • Together with 1-2 colleagues, explain to each other
    • What is underfitting?
    • What is overfitting?
    • What is bias?
    • What is variance?
    • How do you balance bias and variance in ML?


What is a Neural Network?

Deep Neural Networks

Source: IBM

A Single Node


\[\hat{y} = g(w_0 + \sum_{i=1}^{m} w_i x_i)\]
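The node's computation in numpy, using the sigmoid as one common choice for the activation \(g\); the weights and inputs are arbitrary toy values:

```python
import numpy as np

def sigmoid(z):                      # one common choice for g
    return 1 / (1 + np.exp(-z))

w0 = 0.0                             # bias (toy value)
w = np.array([1.0, -1.0])            # weights w_1 ... w_m (toy values)
x = np.array([2.0, 2.0])             # inputs x_1 ... x_m (toy values)

y_hat = sigmoid(w0 + np.dot(w, x))   # g(w_0 + sum_i w_i x_i)
print(y_hat)  # 0.5, since the weighted sum is exactly 0 here
```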



The Activation Function


Important: non-linear! (why do you think that is?)
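One way to see why: without a non-linear activation, any stack of linear layers collapses into a single linear transformation, so depth adds no expressive power. A quick numpy check with random toy matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # random input
W1 = rng.normal(size=(4, 3))    # "layer 1" weights
W2 = rng.normal(size=(2, 4))    # "layer 2" weights

deep = W2 @ (W1 @ x)            # two stacked linear layers ...
shallow = (W2 @ W1) @ x         # ... equal one single linear layer

print(np.allclose(deep, shallow))  # True
```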

Training Deep Learning Models

What is “Training” ?


  • Remember from ML course: Training is the process of optimizing a model’s parameters on a specific task using labeled data.
  • In regression framework, this is called fitting the model to the training data.

Training Deep Learning Models


  • Read the section on backpropagation (black background) in this brief explainer
  • Together with your partner, explain to each other:
    • What is a forward pass?
    • What is loss?
    • What is a backward pass?
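The three steps can be sketched for a single sigmoid neuron with a squared loss, computing the gradients by hand via the chain rule (all values are toy choices):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x, y = 2.0, 1.0      # one training example (toy values)
w, b = 0.1, 0.0      # initial parameters
lr = 0.5             # learning rate

losses = []
for step in range(5):
    # forward pass: run the input through the network
    y_hat = sigmoid(w * x + b)

    # loss: how far off is the prediction?
    losses.append((y_hat - y) ** 2)

    # backward pass: chain rule gives each parameter's gradient
    grad_z = 2 * (y_hat - y) * y_hat * (1 - y_hat)
    w -= lr * grad_z * x   # dz/dw = x
    b -= lr * grad_z       # dz/db = 1

print([round(l, 3) for l in losses])  # loss shrinks each step
```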

Hackathon!

Who gets the best F1 score?