Session 6: LLMs for Social Science

CUSO WS on Large Language Models

Nicolai Berk

2025-09-03

Social Science With LLMs

Efficiency


You don’t always need the biggest model

  • Start with simple things
  • See how well they work
  • Use more advanced tools if necessary

Which tool when? (Annotation)


  • <10k observations: few-shot annotation
  • Depending on your budget, this number could be much higher
  • >>100k: train your own model, e.g. with synthetic annotation

Always validate with human, ideally expert annotations
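A few-shot annotation prompt (the first regime above) might be built like this. This is a minimal sketch: the label set, example texts, and prompt wording are all illustrative, not from the slides.

```python
# Illustrative few-shot prompt for topic annotation.
# Labels and example texts are made up for this sketch.
examples = [
    ("The minister promised lower taxes.", "economy"),
    ("A new hospital wing opened yesterday.", "health"),
]

def build_prompt(text):
    # Each labeled example becomes one "shot" in the prompt
    shots = "\n".join(f"Text: {t}\nLabel: {l}" for t, l in examples)
    return (
        "Classify each text into one topic label.\n\n"
        f"{shots}\n\nText: {text}\nLabel:"
    )

prompt = build_prompt("Unemployment fell to 3 percent.")
print(prompt)
```

The prompt ends with an open "Label:" so the model's completion is the annotation.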

Validation best practices

  • Always validate with human annotations
  • Use large samples, quantify uncertainty (e.g. bootstrap)
  • Use informed sampling when the data is imbalanced to keep the test set small (R package)
  • Think about relevant metrics, use multiple
  • Compare to benchmarks (random, simpler measures)
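Quantifying uncertainty with a bootstrap can be sketched as follows. The labels here are synthetic; in practice `gold` would be your human annotations and `pred` the LLM output.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins: gold = human labels, pred = LLM annotations
gold = rng.integers(0, 2, size=500)
pred = np.where(rng.random(500) < 0.85, gold, 1 - gold)  # ~85% agreement

def f1(y_true, y_pred):
    # Binary F1 from true positives, false positives, false negatives
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Bootstrap: resample annotation pairs to get a confidence interval
scores = []
for _ in range(1000):
    idx = rng.integers(0, len(gold), size=len(gold))
    scores.append(f1(gold[idx], pred[idx]))

lo, hi = np.percentile(scores, [2.5, 97.5])
print(f"F1 = {f1(gold, pred):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

The same loop works for any metric; computing several (accuracy, precision, recall, F1) follows the slide's advice to use multiple metrics.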

Reproducibility

  • Reproducibility is often very hard with LLMs
  • Proprietary models get deprecated and disappear over time - prefer open-source models
  • Try to stick to the same infrastructure
  • Explicitly define the relevant parameters to ensure reproducibility (next slide)

Do you think reproducibility is important for LLMs?

Reproducibility Parameters


  • Seeds
  • Model version (where possible)
  • Temperature (should be 0)
  • top-k (should be 1)
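These parameters can be pinned explicitly, e.g. as a settings dictionary passed to an OpenAI-style API. The model name is illustrative, and not every provider exposes `top_k` or `seed`; check your API's documentation.

```python
# Pin every parameter that affects generation.
# Model name is an illustrative example; prefer a dated snapshot
# over a floating alias so the version cannot silently change.
generation_params = {
    "model": "gpt-4o-2024-08-06",  # explicit model version (where possible)
    "temperature": 0,              # greedy decoding: no sampling randomness
    "top_k": 1,                    # single most likely token (where supported)
    "seed": 2025,                  # fixes residual randomness (where supported)
}
```

Recording this dictionary alongside your results documents exactly how the annotations were produced.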

Research Ethics and AI


  • Please read this Reddit post
  • Get together in groups of 2-3 and discuss
    • Do you think this research was unethical?
    • If so? Why? What were the major issues with the study?
    • Collect 2-3 arguments for your position.

Climate Impact

  • LLMs use up substantial amounts of energy and water and contribute to carbon emissions.
  • Training GPT-3 alone is estimated to have produced 552 tons of carbon dioxide. That's more than 500 flights Zurich-New York.
  • Data centers may already consume 2.5 times as much energy as France (MIT News).
  • Use smaller models when possible - this also saves money.
  • codecarbon can track the emissions of your own training.
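Tracking your own footprint with codecarbon might look like this minimal sketch. The project name is illustrative and the training loop is a placeholder.

```python
# Sketch: wrap a training run with codecarbon's EmissionsTracker
# (pip install codecarbon). The training code itself is a placeholder.
emissions_kg = None
try:
    from codecarbon import EmissionsTracker

    tracker = EmissionsTracker(project_name="my-finetuning-run")  # illustrative name
    tracker.start()
    # ... your fine-tuning / annotation loop goes here ...
    emissions_kg = tracker.stop()  # estimated CO2-equivalent in kg
    print(f"Estimated emissions: {emissions_kg} kg CO2eq")
except ImportError:
    print("codecarbon not installed; run `pip install codecarbon`")
```

The estimate lands in a local `emissions.csv` by default, which you can report alongside your results.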

Security


  • Do not send your sensitive data to OpenAI
  • Talk to your university IT about provided infrastructure
  • Use your own endpoints
  • Ensure compliance with data protection regulations (e.g. servers in Europe)

Special use cases


Experimental Interventions

As you saw with the Reddit study, you can use AI models in behavioral experiments. (Maybe do it better.)

RAG for Archival Research

Researchers working with large amounts of archival data can digitize content and search it with LLMs.
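A toy retrieval step might look like the sketch below. In practice you would embed documents with a sentence-embedding model; here the "embedding" is a simple bag-of-words count vector so the sketch stays self-contained, and the archival snippets are invented.

```python
import numpy as np

# Invented archival snippets standing in for digitized documents
documents = [
    "Letter from the mayor, 1893, regarding the new railway line.",
    "Parish register of births, 1870-1880.",
    "Minutes of the city council meeting on water supply, 1901.",
]

def embed(text, vocab):
    # Bag-of-words count vector; a real pipeline would use dense embeddings
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

vocab = sorted({w for d in documents for w in d.lower().split()})
doc_vecs = np.array([embed(d, vocab) for d in documents])

query = "When did the council discuss water supply?"
q = embed(query, vocab)

# Cosine similarity between query and each document
sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * (np.linalg.norm(q) or 1.0))
best = int(np.argmax(sims))

# The retrieved passage then becomes context in the LLM prompt
prompt = f"Context: {documents[best]}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```

Swapping the count vectors for real sentence embeddings and adding an LLM call to answer from the retrieved context turns this into a basic RAG pipeline.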

Break

Bias

Bias


  • ML annotations are often inherently biased
  • If we use biased measures in our statistical models, our estimates will be biased as well
  • This issue is even bigger for LLMs, where the training data is often not known

Bias: Example

Do employers favor certain nationalities, holding skills constant?

  • You have a dataset of candidate profiles and whether they got an offer for a position or not.
  • You measure skill level using a GPT annotation of the candidate profiles.
  • You regress hiring decisions on applicants’ nationality and skill level.

What might be the issue here?

Bias: Example


  • Let’s assume GPT annotates Croatians as less skilled, other things equal.
  • At the same time, employers are discriminating against Croatians.
  • Depending on the strength of each bias, Croatians might be estimated to be treated equally or even less discriminated against!
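The masking effect above can be demonstrated in a small simulation. All numbers are invented for illustration: the true hiring penalty is -0.5, and the hypothetical annotator under-rates the discriminated group by the same amount.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

croatian = rng.integers(0, 2, size=n)  # nationality indicator
skill = rng.normal(size=n)             # true (unobserved) skill

# Employers discriminate: a -0.5 penalty for Croatian applicants
hiring_score = skill - 0.5 * croatian + rng.normal(scale=0.3, size=n)

# Hypothetical biased annotator: under-rates Croatians by the same amount
gpt_skill = skill - 0.5 * croatian + rng.normal(scale=0.3, size=n)

# OLS of hiring on nationality and the *measured* skill
X = np.column_stack([np.ones(n), croatian, gpt_skill])
coef, *_ = np.linalg.lstsq(X, hiring_score, rcond=None)

# True penalty is -0.5, but the biased skill measure absorbs it,
# so the estimated "discrimination" coefficient is close to zero.
print(f"Estimated discrimination coefficient: {coef[1]:.3f}")
```

With these particular (invented) parameters the discrimination almost vanishes from the estimate; with a stronger annotation bias the sign could even flip.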

Summary


  • When we use machine predictions, especially from LLMs, we might introduce unknown bias
  • This will bias our estimates and lead to faulty hypothesis tests
  • In the worst case, this produces wrong scientific conclusions

Thankfully, someone had an idea.

Design-based supervised learning (DSL)

Egami et al. (2024)


  • Corrects statistical estimates based on ML predictions by using annotated gold-standards
  • Two major components: a sample of expert (gold-standard) annotations with known sampling probabilities, and a correction of the predictions based on these labels

DSL - Simplified Intuition

  • In a regression, DSL corrects the estimates by replacing model annotations with expert labels wherever the two deviate
  • These corrections are weighted by their sampling probability
  • The corrected outcome is then regressed on our predictors
  • This also works when ML estimates are used as predictors

DSL - Core Adjustment
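The intuition above can be written as a corrected pseudo-outcome. The notation here is mine, a sketch of the design-based correction; see Egami et al. (2024) for the exact estimator. For each document $i$, let $\hat{Y}_i$ be the LLM annotation, $Y_i$ the expert label, $R_i \in \{0,1\}$ indicate whether $i$ was expert-annotated, and $\pi_i$ its known sampling probability:

$$
\tilde{Y}_i = \hat{Y}_i + \frac{R_i}{\pi_i}\left(Y_i - \hat{Y}_i\right)
$$

Because $\mathbb{E}[R_i/\pi_i] = 1$, the weighted correction term is unbiased for the prediction error, so regressing $\tilde{Y}_i$ on the predictors removes the annotation bias in expectation.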

Conclusion

Takeaways

  • General understanding of how LLMs work
  • Window-shopping many tools
    • Embeddings for assessment of word and sentence meaning
    • Bag-of-words models for interpretable machine learning
    • Encoder models for (zero-shot) classification, similarity assessment, …
    • Decoder models for text generation and completion

Fin!

I learned a lot, thank you very much!

Resources

Egami, Naoki, Musashi Hinck, Brandon M. Stewart, and Hanying Wei. 2024. "Using Large Language Model Annotations for the Social Sciences: A General Framework of Using Predicted Variables in Downstream Analyses." Preprint, November 17, 2024.