Session 6: LLMs for Social Science

CUSO WS on Large Language Models

Nicolai Berk

2025-09-03

Social Science With LLMs

Efficiency


You don’t always need the biggest model

  • Start with simple things
  • See how well they work
  • Use more advanced tools if necessary

Which tool when? (Annotation)


  • <10k observations: few-shot annotation
  • Depending on your budget, this number could be much higher
  • >>100k: train your own model, e.g. with synthetic annotation

Always validate with human, ideally expert annotations
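A few-shot annotation prompt (the first regime above) might be built like this. This is a minimal sketch: the label set, example texts, and prompt wording are all illustrative, not from the slides.

```python
# Illustrative few-shot prompt for topic annotation.
# Labels and example texts are made up for this sketch.
examples = [
    ("The minister promised lower taxes.", "economy"),
    ("A new hospital wing opened yesterday.", "health"),
]

def build_prompt(text):
    # Each labeled example becomes one "shot" in the prompt
    shots = "\n".join(f"Text: {t}\nLabel: {l}" for t, l in examples)
    return (
        "Classify each text into one topic label.\n\n"
        f"{shots}\n\nText: {text}\nLabel:"
    )

prompt = build_prompt("Unemployment fell to 3 percent.")
print(prompt)
```

The prompt ends with an open "Label:" so the model's completion is the annotation.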

Validation best practices

  • Always validate with human annotations
  • Use large samples, quantify uncertainty (e.g. bootstrap)
  • Use informed sampling when the data is imbalanced to keep the test set small (R package)
  • Think about relevant metrics, use multiple
  • Compare to benchmarks (random, simpler measures)
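Quantifying uncertainty with a bootstrap can be sketched as follows. The labels here are synthetic; in practice `gold` would be your human annotations and `pred` the LLM output.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins: gold = human labels, pred = LLM annotations
gold = rng.integers(0, 2, size=500)
pred = np.where(rng.random(500) < 0.85, gold, 1 - gold)  # ~85% agreement

def f1(y_true, y_pred):
    # Binary F1 from true positives, false positives, false negatives
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Bootstrap: resample annotation pairs to get a confidence interval
scores = []
for _ in range(1000):
    idx = rng.integers(0, len(gold), size=len(gold))
    scores.append(f1(gold[idx], pred[idx]))

lo, hi = np.percentile(scores, [2.5, 97.5])
print(f"F1 = {f1(gold, pred):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

The same loop works for any metric; computing several (accuracy, precision, recall, F1) follows the slide's advice to use multiple metrics.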

Reproducibility

  • Reproducibility is often very hard with LLMs
  • Proprietary models get deprecated and disappear over time - prefer open-source models
  • Try to stick to the same infrastructure
  • Explicitly define the relevant parameters to ensure reproducibility (next slide)

Do you think reproducibility is important for LLMs?

Reproducibility Parameters


  • Seeds
  • Model version (where possible)
  • Temperature (should be 0)
  • top-k (should be 1)
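These parameters can be pinned explicitly, e.g. as a settings dictionary passed to an OpenAI-style API. The model name is illustrative, and not every provider exposes `top_k` or `seed`; check your API's documentation.

```python
# Pin every parameter that affects generation.
# Model name is an illustrative example; prefer a dated snapshot
# over a floating alias so the version cannot silently change.
generation_params = {
    "model": "gpt-4o-2024-08-06",  # explicit model version (where possible)
    "temperature": 0,              # greedy decoding: no sampling randomness
    "top_k": 1,                    # single most likely token (where supported)
    "seed": 2025,                  # fixes residual randomness (where supported)
}
```

Recording this dictionary alongside your results documents exactly how the annotations were produced.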

Research Ethics and AI


  • Please read this Reddit post
  • Get together in groups of 2-3 and discuss
    • Do you think this research was unethical?
    • If so? Why? What were the major issues with the study?
    • Collect 2-3 arguments for your position.

Climate Impact

  • LLMs use up substantial amounts of energy and water and contribute to carbon emissions.
  • Training GPT-3 alone is estimated to have produced 552 tons of carbon dioxide. That's more than 500 flights Zurich-New York.
  • Data centers may already consume 2.5 times as much energy as France (MIT News).
  • Use smaller models when possible - this also saves money.
  • codecarbon can track the emissions of your own training.
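Tracking your own footprint with codecarbon might look like this minimal sketch. The project name is illustrative and the training loop is a placeholder.

```python
# Sketch: wrap a training run with codecarbon's EmissionsTracker
# (pip install codecarbon). The training code itself is a placeholder.
emissions_kg = None
try:
    from codecarbon import EmissionsTracker

    tracker = EmissionsTracker(project_name="my-finetuning-run")  # illustrative name
    tracker.start()
    # ... your fine-tuning / annotation loop goes here ...
    emissions_kg = tracker.stop()  # estimated CO2-equivalent in kg
    print(f"Estimated emissions: {emissions_kg} kg CO2eq")
except ImportError:
    print("codecarbon not installed; run `pip install codecarbon`")
```

The estimate lands in a local `emissions.csv` by default, which you can report alongside your results.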

Security


  • Do not send your sensitive data to OpenAI
  • Talk to your university IT about provided infrastructure
  • Use your own endpoints
  • Ensure compliance with data protection regulations (e.g. servers in Europe)

Special use cases


Experimental Interventions

As you saw with the Reddit study, you can use AI models in behavioral experiments. (Maybe do it better.)

RAG for Archival Research

Researchers working with large amounts of archival data can digitize content and search it with LLMs.
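A toy retrieval step might look like the sketch below. In practice you would embed documents with a sentence-embedding model; here the "embedding" is a simple bag-of-words count vector so the sketch stays self-contained, and the archival snippets are invented.

```python
import numpy as np

# Invented archival snippets standing in for digitized documents
documents = [
    "Letter from the mayor, 1893, regarding the new railway line.",
    "Parish register of births, 1870-1880.",
    "Minutes of the city council meeting on water supply, 1901.",
]

def embed(text, vocab):
    # Bag-of-words count vector; a real pipeline would use dense embeddings
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

vocab = sorted({w for d in documents for w in d.lower().split()})
doc_vecs = np.array([embed(d, vocab) for d in documents])

query = "When did the council discuss water supply?"
q = embed(query, vocab)

# Cosine similarity between query and each document
sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * (np.linalg.norm(q) or 1.0))
best = int(np.argmax(sims))

# The retrieved passage then becomes context in the LLM prompt
prompt = f"Context: {documents[best]}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```

Swapping the count vectors for real sentence embeddings and adding an LLM call to answer from the retrieved context turns this into a basic RAG pipeline.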

Break

Bias

Bias


  • ML annotations are often inherently biased
  • If we use biased measures in our statistical models, our estimates will be biased as well
  • This issue is even bigger for LLMs, where the training data is often not known

Bias: Example

Do employers favor certain nationalities, holding skills constant?

  • You have a dataset of candidate profiles and whether they got an offer for a position or not.
  • You measure skill level using a GPT annotation of the candidate profiles.
  • You regress hiring decisions on applicants’ nationality and skill level.

What might be the issue here?

Bias: Example


  • Let’s assume GPT annotates Croatians as less skilled, other things equal.
  • At the same time, employers are discriminating against Croatians.
  • Depending on the strength of each bias, Croatians might be estimated to be treated equally or even less discriminated against!
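The masking effect above can be demonstrated in a small simulation. All numbers are invented for illustration: the true hiring penalty is -0.5, and the hypothetical annotator under-rates the discriminated group by the same amount.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

croatian = rng.integers(0, 2, size=n)  # nationality indicator
skill = rng.normal(size=n)             # true (unobserved) skill

# Employers discriminate: a -0.5 penalty for Croatian applicants
hiring_score = skill - 0.5 * croatian + rng.normal(scale=0.3, size=n)

# Hypothetical biased annotator: under-rates Croatians by the same amount
gpt_skill = skill - 0.5 * croatian + rng.normal(scale=0.3, size=n)

# OLS of hiring on nationality and the *measured* skill
X = np.column_stack([np.ones(n), croatian, gpt_skill])
coef, *_ = np.linalg.lstsq(X, hiring_score, rcond=None)

# True penalty is -0.5, but the biased skill measure absorbs it,
# so the estimated "discrimination" coefficient is close to zero.
print(f"Estimated discrimination coefficient: {coef[1]:.3f}")
```

With these particular (invented) parameters the discrimination almost vanishes from the estimate; with a stronger annotation bias the sign could even flip.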

Summary


  • When we use machine predictions, especially from LLMs, we might introduce unknown bias
  • This will bias our estimates and lead to faulty hypothesis tests
  • In the worst case, this produces wrong scientific conclusions

Thankfully, someone had an idea.

Design-based supervised learning (DSL)

Egami et al. (2024)


  • Corrects statistical estimates based on ML predictions by using annotated gold-standards
  • Two major components: a sample of expert (gold-standard) annotations with known sampling probabilities, and a correction of the predictions based on these labels

DSL - Simplified Intuition

  • In a regression, DSL corrects the estimates by replacing model annotations with expert labels wherever the two deviate
  • These corrections are weighted by their sampling probability
  • The corrected outcome is then regressed on our predictors
  • This also works when ML estimates are used as predictors

DSL - Core Adjustment
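The intuition above can be written as a corrected pseudo-outcome. The notation here is mine, a sketch of the design-based correction; see Egami et al. (2024) for the exact estimator. For each document $i$, let $\hat{Y}_i$ be the LLM annotation, $Y_i$ the expert label, $R_i \in \{0,1\}$ indicate whether $i$ was expert-annotated, and $\pi_i$ its known sampling probability:

$$
\tilde{Y}_i = \hat{Y}_i + \frac{R_i}{\pi_i}\left(Y_i - \hat{Y}_i\right)
$$

Because $\mathbb{E}[R_i/\pi_i] = 1$, the weighted correction term is unbiased for the prediction error, so regressing $\tilde{Y}_i$ on the predictors removes the annotation bias in expectation.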

Conclusion

Takeaways

  • General understanding of how LLMs work
  • Window-shopping many tools
    • Embeddings for assessment of word and sentence meaning
    • Bag-of-words models for interpretable machine learning
    • Encoder models for (zero-shot) classification, similarity assessment, …
    • Decoder models for text generation and completion

Fin!

I learned a lot, thank you very much!

Resources

Egami, Naoki, Musashi Hinck, Brandon M. Stewart, and Hanying Wei. 2024. "Using Large Language Model Annotations for the Social Sciences: A General Framework of Using Predicted Variables in Downstream Analyses." Preprint, November 17, 2024.