CUSO WS on LLMs

Session 3: Intro to Supervised Machine Learning

Nicolai Berk

2025-09-03

Recap: Text Representation


  • Capture Semantic Meaning with Embeddings
  • Document Representations:
    • Bag of Words (BoW)
    • Term Frequency-Inverse Document Frequency (TF-IDF)
    • Document Embeddings

Today: how can we use these representations to measure concepts of interest?

Today


  • The basics of supervised machine learning
  • Model evaluation
  • Model selection
  • Intro to neural networks
  • Training neural networks

Examples

Netflix recommendations

Examples

Drug Development (Source: Catacutan et al. 2024)

Examples

ChatGPT

ML Overview

Many different forms!

  • Regression
  • Classification
  • Clustering/Unsupervised ML
  • Generative Models

Focus here on classification/supervised ML!


Supervised learning (A very precise definition):



We know stuff about some documents and want to know the same stuff about other documents.

Core Challenge: Prediction

Some Lingo


Term Meaning
Classifier A statistical model fitted to some data to make predictions about new, unseen data.
Training The process of fitting the classifier to the data.
Train and test set Datasets used to train and evaluate the classifier.
Vectorizer A tool used to translate text into numbers.

The Classic Pipeline for Text Classification


  1. Annotate subset.
  2. Divide into training- and test-set.
  3. Transform text into numerical features.
  4. Fit model.
  5. Predict.
  6. Evaluate.
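The six steps above can be sketched end to end with scikit-learn; the toy corpus, labels, and the choice of logistic regression are illustrative placeholders:

```python
# Sketch of the classic pipeline with scikit-learn.
# The toy corpus, labels, and logistic regression are placeholders.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# 1. annotated subset (1 = positive, 0 = negative)
texts = ["great movie", "awful film", "loved it", "terrible plot",
         "fantastic acting", "boring and bad", "wonderful story", "worst ever"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# 2. divide into training and test set
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42)

# 3. transform text into numerical features
vec = CountVectorizer()
X_train_dfm = vec.fit_transform(X_train)
X_test_dfm = vec.transform(X_test)      # reuse the fitted vocabulary!

# 4. fit model
clf = LogisticRegression().fit(X_train_dfm, y_train)

# 5. predict
y_pred = clf.predict(X_test_dfm)

# 6. evaluate
print(f1_score(y_test, y_pred))
```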

The Classic Pipeline

0. Annotation


  • We need data from which to learn.
  • Assign labels to documents.
  • Usually randomly sampled.


1. Divide into train and test (and val)


  • Usually randomly sampled (not always!)
  • Customary: 90-10 (train-test) or 80-10-10 (train-validation-test) split
  • The only consideration for test/validation set size is the precision of the evaluation metrics; for very large datasets, a 10% test split is larger than needed.
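Assuming scikit-learn, an 80-10-10 split can be done with two calls to `train_test_split`; the 100 numbered "documents" below are stand-ins for annotated texts:

```python
# Illustrative 80-10-10 split via two calls to train_test_split.
from sklearn.model_selection import train_test_split

docs = [f"document {i}" for i in range(100)]   # placeholder documents
labels = [i % 2 for i in range(100)]           # placeholder labels

# split off 20%, then halve it into validation and test
train, rest, y_train, y_rest = train_test_split(
    docs, labels, test_size=0.2, random_state=0)
val, test, y_val, y_test = train_test_split(
    rest, y_rest, test_size=0.5, random_state=0)

print(len(train), len(val), len(test))  # 80 10 10
```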

2. Transformation


Statistical models can only read numbers

\(\rightarrow\) we need to translate!

Classic DFM

ID  Text
1   This is a text
2   This is no text

ID  this  is  a  text  no
1      1   1  1     1   0
2      1   1  0     1   1

3. Fit model.


  • Many different models: OLS, logistic regression
  • Other common models: Naïve Bayes, SVM, XGBoost

Not delving into different models today!

4. Predict.

Use trained model to generate labels for unlabeled cases.

review label
great movie! ?
what a bunch of cr*p ?
I lost all faith in humanity after watching this ?
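A sketch of this step with scikit-learn; the tiny training corpus here is made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# tiny made-up training corpus (placeholder; 1 = positive, 0 = negative)
train_texts = ["wonderful film", "great story", "loved it",
               "awful movie", "terrible acting", "complete garbage"]
train_labels = [1, 1, 1, 0, 0, 0]

vec = CountVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(train_texts), train_labels)

# generate labels for the unlabeled reviews from the table above
new_reviews = ["great movie!",
               "what a bunch of cr*p",
               "I lost all faith in humanity after watching this"]
preds = clf.predict(vec.transform(new_reviews))
print(preds)
```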

5. Evaluation


Confusion Matrix

        FALSE  TRUE
FALSE     688     9
TRUE       37   266
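A matrix like this can be produced with scikit-learn's `confusion_matrix`, which puts actual labels in rows and predictions in columns (toy labels below):

```python
from sklearn.metrics import confusion_matrix

# toy labels; rows of the output are actual classes, columns predictions
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
# [[1 1]
#  [1 2]]
```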

5. Evaluation


Term Meaning
Accuracy How much does it get right overall?
Recall How much of the relevant cases does it find?
Precision How many of the found cases are relevant?
F1 Score Harmonic mean of precision and recall.
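Applied to the confusion matrix shown earlier, reading rows as actual labels and columns as predictions (so TN = 688, FP = 9, FN = 37, TP = 266):

```python
# Metrics from the confusion matrix above, assuming rows = actual
# labels and columns = predictions.
tn, fp, fn, tp = 688, 9, 37, 266

accuracy  = (tp + tn) / (tp + tn + fp + fn)
recall    = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(recall, 3), round(precision, 3), round(f1, 3))
# 0.954 0.878 0.967 0.92
```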

Issues with Common Metrics


Best Practice Evaluation

  • Compare against informative baselines
    • Random prediction at prevalence rate
    • Compare classifiers of varying complexity
  • Think about metric of interest (cancer detection vs. ad targeting)
  • Use prevalence-insensitive metrics:
    • Matthews correlation coefficient (MCC)
    • Youden's J / bookmaker informedness (BM)
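Both metrics can be computed directly from the confusion-matrix counts (again assuming rows are actual labels and columns predictions):

```python
import math

# counts from the confusion matrix shown earlier
# (assuming rows = actual labels, columns = predictions)
tn, fp, fn, tp = 688, 9, 37, 266

# Matthews correlation coefficient
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# Youden's J / bookmaker informedness: sensitivity + specificity - 1
j = tp / (tp + fn) + tn / (tn + fp) - 1

print(round(mcc, 3), round(j, 3))  # 0.89 0.865
```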

Break

Tutorial I

Training a simple text classifier

Finding the perfect match

Mean Squared Error (MSE)




\[MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]
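The formula translates directly into code; the observed and predicted values below are made up:

```python
import numpy as np

y = np.array([3.0, 1.0, 4.0])       # observed values (made up)
y_hat = np.array([2.5, 1.5, 4.0])   # predictions (made up)

mse = np.mean((y - y_hat) ** 2)
print(mse)  # 0.16666... = (0.25 + 0.25 + 0) / 3
```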

The bias-variance tradeoff

  • Read this short explainer (stop when you reach LOESS)
  • Together with 1-2 colleagues, explain to each other
    • What is underfitting?
    • What is overfitting?
    • What is bias?
    • What is variance?
    • How do you balance bias and variance in ML?


What is a Neural Network?

Deep Neural Networks

Source: IBM

A Single Node


\[\hat{y} = g(w_0 + \sum_{i=1}^{m} w_i x_i)\]
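The node's computation in numpy, using the sigmoid as one common choice for the activation \(g\); the weights and inputs are arbitrary toy values:

```python
import numpy as np

def sigmoid(z):                      # one common choice for g
    return 1 / (1 + np.exp(-z))

w0 = 0.0                             # bias (toy value)
w = np.array([1.0, -1.0])            # weights w_1 ... w_m (toy values)
x = np.array([2.0, 2.0])             # inputs x_1 ... x_m (toy values)

y_hat = sigmoid(w0 + np.dot(w, x))   # g(w_0 + sum_i w_i x_i)
print(y_hat)  # 0.5, since the weighted sum is exactly 0 here
```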



The Activation Function


Important: non-linear! (why do you think that is?)
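One way to see why: without a non-linear activation, any stack of linear layers collapses into a single linear transformation, so depth adds no expressive power. A quick numpy check with random toy matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # random input
W1 = rng.normal(size=(4, 3))    # "layer 1" weights
W2 = rng.normal(size=(2, 4))    # "layer 2" weights

deep = W2 @ (W1 @ x)            # two stacked linear layers ...
shallow = (W2 @ W1) @ x         # ... equal one single linear layer

print(np.allclose(deep, shallow))  # True
```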

Training Deep Learning Models

What is “Training” ?


  • Remember from ML course: Training is the process of optimizing a model’s parameters on a specific task using labeled data.
  • In regression framework, this is called fitting the model to the training data.

Training Deep Learning Models


  • Read the section on backpropagation (black background) in this brief explainer
  • Together with your partner, explain to each other:
    • What is a forward pass?
    • What is loss?
    • What is a backward pass?
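The three steps can be sketched for a single sigmoid neuron with a squared loss, computing the gradients by hand via the chain rule (all values are toy choices):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x, y = 2.0, 1.0      # one training example (toy values)
w, b = 0.1, 0.0      # initial parameters
lr = 0.5             # learning rate

losses = []
for step in range(5):
    # forward pass: run the input through the network
    y_hat = sigmoid(w * x + b)

    # loss: how far off is the prediction?
    losses.append((y_hat - y) ** 2)

    # backward pass: chain rule gives each parameter's gradient
    grad_z = 2 * (y_hat - y) * y_hat * (1 - y_hat)
    w -= lr * grad_z * x   # dz/dw = x
    b -= lr * grad_z       # dz/db = 1

print([round(l, 3) for l in losses])  # loss shrinks each step
```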

Hackathon!

Who gets the best F1 score?