10. LLMs

Data Science for Economists

Irene Iodice

2026-03-01

Learning objectives

By the end of today you should be able to:

  1. Explain why economists moved from bag-of-words to embeddings to LLMs.
  2. Describe what word embeddings capture and how transformers improve on them.
  3. Use LLMs via APIs from R (the ellmer package) for classification, extraction, and annotation tasks.
  4. Design structured outputs (JSON schemas) for reproducible measurement.
  5. Apply prompt-engineering strategies to economics research problems.
  6. Describe at least two published economics papers that leverage modern NLP/LLM representations.

Motivation

Why move beyond Bag-of-Words?

  • Order and context: BoW ignores word order – “bank raises rates” vs “rates raise bank”.
  • Synonymy: “job”, “employment”, “work” treated as independent features.
  • Sparsity: high-dimensional DTM leads to curse of dimensionality.
  • Modern NLP compresses text into dense, low-dimensional vectors that encode semantic information.

Economic upside

Better text representations often improve predictive power (e.g. central-bank tone \(\to\) yields) and enable causal designs that exploit semantic shifts (e.g. narrative shocks).

Economic use cases

  • Forecasting & Nowcasting: Use news or earnings call transcripts with embeddings to predict macro variables (e.g. GDP, inflation).
  • Policy uncertainty: Dictionary vs embedding-based measures of uncertainty (Baker et al. 2016).
  • Industry competition: Similarity of 10-K filings to gauge product-market overlap (Hoberg & Phillips 2016).
  • Sentiment analysis: Gauge financial sentiment in firm filings or tweet corpora to predict returns (Ke et al. 2019).
  • Labor-market skill maps: Match job-posting text to O*NET tasks using embeddings (Hansen et al. 2021).

Why economists now mine text

  • Digital text – news, patents, 10-Ks, job ads – is exploding; numbers alone miss these “soft” signals (Gentzkow et al. 2019; Ash & Hansen 2023).
  • NLP lets us turn unstructured words into structured variables we can graph, regress, and test.
  • Early tools (bag-of-words, \(n\)-grams) count words but ignore context.
  • Modern models capture meaning and nuance \(\to\) richer measures of innovation, policy tone, skills, etc.

From bag-of-words to LLMs: a leap in capability

  • Then: bag-of-words \(\approx\) word counts; good for frequency, weak on order and synonyms.
  • Now: transformer-based large language models (BERT, GPT) use attention to weigh each word in context (Vaswani et al. 2017).
  • Fine-tuned variants (FinBERT, ClimateBERT) already boost accuracy in financial and climate text (Yang et al. 2020; Webersinke et al. 2021).
  • Instruction-tuned assistants (ChatGPT, Claude) make advanced NLP tasks almost turn-key for economists.

From counts to dense vectors

The problem with counting words

  • BoW ignores word order; documents are represented by token counts.
  • Simple, interpretable, widely used in economics.
  • Major limitations:
    • Synonymy: “weak” \(\neq\) “tepid”
    • Polysemy: “statistics lie” vs “cats lie”
    • No understanding of word order.

Fixes on top of BoW (topic models, n-grams, dependency parsing) help but remain brittle, high-dimensional, and context-insensitive.

Word embeddings: distributional semantics

“You shall know a word by the company it keeps.” – J. R. Firth

  • Word2Vec (Mikolov et al. 2013): CBOW & Skip-Gram architectures.
  • GloVe (Pennington et al. 2014): factorises global co-occurrence matrix.
  • Embedding dimension typically 100–300.
  • Cosine similarity captures analogies: \(\text{king} - \text{man} + \text{woman} \approx \text{queen}\).

Key idea

Move beyond which words occur to where words live in a low-dimensional space. Build a co-occurrence matrix and factorise it so that similar words sit close together.

Vector arithmetic and analogies

  • Embeddings support meaningful linear operations: \(\text{"king"} - \text{"man"} + \text{"woman"} \approx \text{"queen"}\).
  • Latent dimensions pick up interpretable traits (royalty, masculinity, age …).
  • Helps quantify bias: the “gender” direction separates stereotypically male vs female occupations, enabling post-processing debiasing (Bolukbasi et al. 2016).
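A toy illustration in base R shows how cosine similarity recovers the analogy (the 3-dimensional vectors below are invented for illustration; real embeddings have 100–300 dimensions):

```r
# Toy "embeddings": values invented, 3 dims instead of the usual 100-300
emb <- rbind(
  king  = c(0.9, 0.8, 0.1),
  man   = c(0.1, 0.9, 0.2),
  woman = c(0.1, 0.1, 0.9),
  queen = c(0.9, 0.1, 0.8)
)

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# king - man + woman lands closest to queen
target <- emb["king", ] - emb["man", ] + emb["woman", ]
sims <- apply(emb, 1, cosine, b = target)
sort(sims, decreasing = TRUE)
```

With real vectors (e.g. pre-trained GloVe) the same arithmetic works unchanged in 300 dimensions.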

Dimensionality reduction: topic models

  • LDA (Latent Dirichlet Allocation): each document is a mixture of topics; each topic a distribution over words. \[\theta_d \sim \text{Dir}(\alpha), \quad \beta_k \sim \text{Dir}(\eta)\]
  • Bayesian inference reduces overfitting in sparse, high-dimensional spaces.
  • Produces interpretable, human-readable topics.
  • Widely used in economics: e.g. Hansen et al. (2018) on Fed communications.

Other approaches

LSA (PCA on DTM), pLSA (probabilistic LSA), NMF (non-negative matrix factorisation) all share the same goal: reduce dimensionality from \(V\) (vocab size) to \(K\) (topic count). LDA adds Dirichlet priors and is the most widely adopted.
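The shared mechanics can be sketched in base R: LSA is essentially an SVD of the document-term matrix, keeping only the top \(K\) singular dimensions (the tiny DTM below uses invented counts):

```r
# Tiny DTM: 3 documents x 4 terms (counts invented for illustration)
dtm <- matrix(c(2, 0, 1, 0,
                1, 0, 2, 0,
                0, 3, 0, 2),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("d1", "d2", "d3"),
                              c("rate", "tariff", "inflation", "quota")))

s <- svd(dtm)
K <- 2
doc_loadings <- s$u[, 1:K] %*% diag(s$d[1:K])  # documents in K-dim latent space
dim(doc_loadings)  # 3 x 2: V = 4 terms reduced to K = 2 dimensions
```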

Transformers and Attention

From static to contextual embeddings

  • Static embeddings (Word2Vec, GloVe): each word \(v\) has a single vector \(\rho_v\) regardless of context.
  • Contextual embeddings (BERT, GPT): each token’s vector depends on surrounding tokens.
  • Self-attention mechanism: \[\rho'_{d,n} = \sum_{n'=1}^{N_d} w_{n,n'}\,\rho^0_{d,n'}, \quad \sum_{n'} w_{n,n'} = 1,\] where attention weights \(w_{n,n'}\) are learned.
  • Transformer stacks multiple attention layers to capture deep interactions.
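The attention formula above can be computed numerically in base R (random toy vectors; real models learn query/key/value projections rather than using raw dot products):

```r
set.seed(1)
N <- 4; d <- 3                       # 4 tokens, 3-dim embeddings
rho0 <- matrix(rnorm(N * d), N, d)   # initial embeddings rho^0

scores <- rho0 %*% t(rho0) / sqrt(d)       # pairwise similarity scores
W <- exp(scores) / rowSums(exp(scores))    # softmax: each row sums to 1
rho1 <- W %*% rho0                         # contextual embeddings rho'

rowSums(W)  # all equal to 1, as the constraint on w_{n,n'} requires
```

GPT-style models additionally mask future tokens, so each position attends only to its predecessors.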

Masked language models: BERT

Masked word prediction

Given “As a leading firm in the [MASK] sector, we hire highly skilled …”

  • BERT infers “technology” with high probability if context words include “software engineers”.
  • In “As a leading firm in the [MASK] sector, we hire highly skilled … petroleum engineers,” BERT instead predicts “energy” or “oil”.
  • Fine-tuned variants: FinBERT (financial text), ClimateBERT (climate policy), SciBERT (scientific papers).
  • Economics use-case: masked-LM fine-tuning detects monetary-policy events far better than dictionaries (Droubi et al. 2022).

Neural language models: architecture

  • Objective: maximise \(\sum_t \log P_\theta(w_t \mid \text{context})\) where \(\theta\) are neural weights.
  • Learns contextual embeddings for every token \(\Rightarrow\) meaning adapts to its surroundings.
  • Architectures:
    • RNN/LSTM: sequential recurrence (older, mostly superseded).
    • Transformer: self-attention layers (BERT, GPT family).
  • Scale: trained on billions of tokens; parameters range from millions to >100B.

The LLM landscape (early 2026)

Model                          Context           Provider          Notes
GPT-4.1                        1M tokens         OpenAI            Current flagship
Claude Opus 4.6 / Sonnet 4.6   200k tokens       Anthropic         Strong on reasoning + code
Llama 4                        128k–10M tokens   Meta (open)       Open weights, many sizes
DeepSeek V3 / R1               128k tokens       DeepSeek (open)   Reasoning at low cost
Gemini 2.5                     1M+ tokens        Google            Multimodal

Key trend: open-weight models are closing the gap; costs collapsed 60x since 2023. Check current benchmarks — this table goes stale fast.

The ellmer package

Why API-based LLMs for economists?

  • No need to train or fine-tune – use instruction-tuned models directly.
  • Reproducible measurement at scale: classify thousands of documents with consistent prompts.
  • Cost collapse: GPT-4 (March 2023) cost ~$30 per 1M input tokens; by 2026 equivalent models cost <$0.50 per 1M tokens – a 60x reduction in 3 years.
  • DeepSeek R1 (open-weight, Jan 2025): reasoning-capable model at ~10% the cost of GPT-4o.

But watch out

API outputs are stochastic (set temperature = 0 for near-deterministic results), models update without notice, and costs can surprise at scale.

The ellmer package: tidyverse-native LLM interface

# install.packages("ellmer")
library(ellmer)

# Connect to any provider via OpenRouter, OpenAI, Anthropic, etc.
chat <- chat_openai(
  model = "gpt-4.1-mini",
  system_prompt = "You are a helpful economics research assistant."
)

# Simple text query
chat$chat("Summarize the main argument of Acemoglu et al. 2001 in two sentences.")

ellmer supports OpenAI, Anthropic, OpenRouter (access to 100+ models), Ollama (local models), and more – all with the same interface.

Batch classification with ellmer

library(ellmer)
library(purrr)

classify_sentiment <- function(text) {
  chat <- chat_openai(model = "gpt-4.1-mini")
  chat$chat(paste0(
    "Classify the sentiment of this central bank statement as ",
    "'hawkish', 'dovish', or 'neutral'. ",
    "Return ONLY the label.\n\nText: ", text
  ))
}

# Apply to a data frame of speeches
speeches$sentiment <- map_chr(speeches$text, classify_sentiment)

Cost estimate

10,000 short paragraphs with gpt-4.1-mini \(\approx\) $0.30. With DeepSeek R1 via OpenRouter \(\approx\) $0.05.
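The arithmetic behind such estimates is simple (the per-token price below is an assumption; check current rates before budgeting):

```r
docs <- 10000
tokens_per_doc <- 200   # a short paragraph
price_per_1m <- 0.15    # assumed $ per 1M input tokens

docs * tokens_per_doc / 1e6 * price_per_1m  # total cost in dollars: 0.3
```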

JSON schemas for reproducible measurement

Why structured outputs?

  • Free-text LLM responses are hard to parse programmatically.
  • Structured outputs constrain the model to return valid JSON matching a schema.
  • Benefits for economists:
    • Consistent variable types across thousands of documents.
    • Direct integration into data pipelines (no regex parsing).
    • Reproducible and auditable measurement.

Structured output with ellmer: sentiment + entities

library(ellmer)

# Define the output type
type_analysis <- type_object(
  sentiment = type_enum("hawkish", "dovish", "neutral",
    .description = "Overall monetary policy sentiment"),
  confidence = type_number(.description = "Confidence score 0-1"),
  key_entities = type_array(items = type_string(),
    .description = "Named entities mentioned (central banks, countries, etc.)"),
  summary = type_string(.description = "One-sentence summary")
)

chat <- chat_openai(model = "gpt-4.1-mini")
result <- chat$extract_data(
  "The ECB kept rates unchanged but signaled that inflation risks remain tilted
   to the upside, suggesting further tightening may be needed in Q3.",
  type = type_analysis
)

What you get back

str(result)
#> List of 4
#>  $ sentiment    : chr "hawkish"
#>  $ confidence   : num 0.85
#>  $ key_entities : chr [1:2] "ECB" "Q3"
#>  $ summary      : chr "The ECB held rates steady but signaled upside
#>                         inflation risks and potential further tightening."

This is a native R list – no parsing needed. Bind rows across documents to get a tidy data frame.

Entity extraction at scale

library(ellmer)
library(purrr)
library(dplyr)  # for bind_rows()

type_trade_event <- type_object(
  event_type = type_enum("tariff", "sanction", "quota", "subsidy", "other"),
  countries = type_array(items = type_string()),
  products = type_array(items = type_string()),
  direction = type_enum("restrictive", "liberalizing", "neutral"),
  date_mentioned = type_string(.description = "Date if mentioned, else 'NA'")
)

# Process 5,000 news articles
results <- map(articles$text, \(txt) {
  chat <- chat_openai(model = "gpt-4.1-mini")
  chat$extract_data(txt, type = type_trade_event)
})

trade_events <- bind_rows(results)

This replaces weeks of manual coding with hours of API calls and pennies of cost.

Designing prompts that measure what you mean

Prompt engineering principles

  1. Be specific about the task: “Classify sentiment” is vague; “Classify the monetary policy stance as hawkish, dovish, or neutral based on forward guidance language” is precise.
  2. Provide examples (few-shot): include 2–5 labelled examples in the prompt to anchor the model’s calibration.
  3. Define edge cases: what counts as “neutral”? What if the text discusses both tightening and easing?
  4. Request structured output: always constrain the response format (JSON, enum, etc.).
  5. Set temperature to 0: for measurement tasks, minimise randomness.

Few-shot prompting for economics

system_prompt <- "
You classify FOMC statements by monetary policy stance.

Examples:
- 'The Committee decided to raise the target range for the federal funds
  rate to 5 to 5-1/4 percent.' -> hawkish
- 'The Committee decided to lower the target range by 50 basis points.'
  -> dovish
- 'The Committee decided to maintain the target range.' -> neutral

Classify the following statement. Return ONLY the label.
"

chat <- chat_openai(model = "gpt-4.1", system_prompt = system_prompt)
chat$chat(new_statement)

Chain-of-thought for complex coding tasks

system_prompt <- "
You are an expert trade policy analyst. For each news article:
1. Identify whether a trade policy event is described.
2. If yes, determine the type (tariff, sanction, quota, subsidy, other).
3. Identify affected countries and products.
4. Assess whether the measure is restrictive or liberalizing.

Think step-by-step before providing your final answer.
Return your analysis as JSON.
"

Chain-of-thought prompting improves accuracy on multi-step reasoning tasks by 10–30% (Wei et al. 2022). The model “shows its work” before committing to an answer.

Validation: LLM labels vs human coders

  • Always validate on a human-coded holdout set (100–500 documents).
  • Report inter-rater agreement (Cohen’s \(\kappa\)) between LLM and human labels.
  • Compare against dictionary-based baselines (e.g. Loughran-McDonald for finance sentiment).
  • Recent findings: GPT-4-class models match or exceed median crowd-worker accuracy on many classification tasks (Gilardi et al. 2023).

Reproducibility

Model versions change. Always log the exact model ID, prompt text, temperature, and date of API calls. Pin model versions where possible (e.g. gpt-4o-2024-08-06).

Four measurement problems

Problem I: Measuring document similarity

  • Goal: Quantify how “close” two documents are in meaning.
  • Approaches:
    1. BoW-based: raw counts or tf-idf \(\to\) cosine.
    2. Embeddings: average word vectors \(\to\) cosine.
    3. Topic model sharing: cosine on topic loadings (LDA).
    4. LLM embeddings: use text-embedding-3-large (OpenAI, 3,072 dims) or sentence-transformers models (typically 768 dims).
  • Economics examples:
    • Industry overlap: Hoberg & Phillips (2010, 2016) use BoW & tf-idf on product descriptions in 10-Ks.
    • Patent novelty: Kelly et al. (2021).
    • Syllabi vs research: Biasi & Ma (2022).
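Approach 1 can be sketched in base R with toy product descriptions (invented text; real applications use full 10-K sections):

```r
docs <- c(a = "oil gas drilling exploration",
          b = "oil gas pipeline transport",
          c = "software cloud data analytics")
toks  <- strsplit(docs, " ")
vocab <- sort(unique(unlist(toks)))

# Document-term matrix, tf-idf weighting, cosine similarity
tf    <- t(sapply(toks, function(w) table(factor(w, levels = vocab))))
idf   <- log(length(docs) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, "*")
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

cosine(tfidf["a", ], tfidf["b", ])  # positive: overlapping product space
cosine(tfidf["a", ], tfidf["c", ])  # 0: no shared vocabulary
```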

Problem II: Concept detection

  • Goal: Detect presence or intensity of an economic concept in text (e.g. policy uncertainty, sentiment, skills).
  • Methods:
    1. Dictionaries / pattern matching: Baker et al. (2016) – economic + uncertainty + policy term sets.
    2. Embedding-augmented lexicons: Seed sets \(\to\) nearest neighbors in embedding space \(\to\) expanded term set (Hanley & Hoberg 2019).
    3. LLM zero-shot classification: Ask the model directly – “Does this paragraph discuss economic policy uncertainty? Yes/No.” No training data needed.
    4. Supervised classification: Human-annotated sample \(\to\) train BERT or fine-tune \(\to\) scale up (Hansen et al. 2023).
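Method 2 in miniature, with toy 2-d vectors standing in for real embeddings: start from a seed term and pull in its nearest neighbours.

```r
# Invented 2-d "embeddings" for illustration only
emb <- rbind(uncertainty = c(0.90, 0.10),
             doubt       = c(0.85, 0.15),
             risk        = c(0.80, 0.20),
             growth      = c(0.10, 0.90),
             exports     = c(0.20, 0.80))
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

seed <- "uncertainty"
sims <- apply(emb, 1, cosine, b = emb[seed, ])
# Expand the lexicon with the two nearest neighbours of the seed term
expanded <- names(sort(sims[names(sims) != seed], decreasing = TRUE))[1:2]
expanded  # "doubt" "risk"
```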

Problem III: How concepts relate

  • Goal: Quantify co-occurrence or semantic association between concepts (e.g. gender and emotion, class and politics).
  • Methods:
    1. Local co-occurrence counts: Count windows with terms from two dictionaries.
    2. WEAT (Embedding association test): Project words onto axes defined by attribute sets (Caliskan et al. 2017).
    3. Syntactic patterns: Extract dependency triples (actor–verb–patient) to capture directed relationships.
    4. LLM extraction: Ask the model to identify relationships and output structured triples.
  • Economics examples:
    • Gender attitudes: Use WEAT on judge opinions – Ash et al. (2020b).
    • Narrative networks: Dependency triples among Congressional speeches (Ash et al. 2023).
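The WEAT logic of method 2 reduces to projecting target words onto an attribute axis; a toy 2-d sketch in base R (invented vectors, not real embeddings):

```r
emb <- rbind(male     = c(1.0, 0.0),
             female   = c(0.0, 1.0),
             engineer = c(0.8, 0.2),
             nurse    = c(0.2, 0.8))

axis <- emb["male", ] - emb["female", ]          # "gender" direction
proj <- emb[c("engineer", "nurse"), ] %*% axis   # signed association

proj  # engineer projects toward the "male" pole, nurse toward "female"
```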

Problem IV: Associating text with metadata

  • Goal: Use document text to predict or explain metadata (e.g. political bias, firm returns).
  • Methods:
    1. Supervised BoW: LASSO/logistic on term counts (Gentzkow & Shapiro 2010).
    2. Topical regression: Structural topic models regressing topics on covariates (Roberts et al. 2014).
    3. LLM annotation: Use LLMs to generate labels, then regress as usual – separates measurement from inference.
  • Economics examples:
    • Media slant: Train on Congressional speeches \(\to\) predict newspaper article ideology (Widmer et al. 2020).
    • Wage prediction: BERT on job postings to predict salaries (Bana 2022).

Econometric considerations

LLM-generated variables introduce unique challenges for causal inference:

  • Measurement error: Text-derived measures carry sampling & model uncertainty that downstream regressions ignore. Bootstrap over prompt variants to estimate sensitivity.
  • Prompt sensitivity: Small wording changes shift classification distributions by 5–15%. Always run sensitivity analyses.
  • Model drift: gpt-4 in 2024 \(\neq\) gpt-4.1 in 2026. Pin model versions and log everything.
  • Validation: Human-annotated holdouts remain essential.

For a thorough treatment, see Ash & Hansen (2023), Text Algorithms in Economics, Annual Review of Economics.

Application: AI-Generated Production Networks

AIPNET: AI-Generated Production Networks (Fetzer et al. 2024)

  • Goal: Construct a granular input-output network over 5,000 product nodes using LLMs
  • Method: Two-step “build-prune” pipeline:
    1. Build: Use LLMs to classify whether product \(i\) is an input to product \(j\) (25M pairs)
    2. Prune: Re-evaluate candidates, enforce consistency, threshold at calibrated \(\tau\)
  • Output: Directed adjacency matrix with ~1.2M edges — much finer than traditional sector-level IO tables

AIPNET: Application to the 2017 Qatar Blockade

  • Trace downstream “neighbors” of Qatar’s top exports in the AIPNET product space
  • Construct exposure index → regress on downstream export changes
  • Result: 1 SD higher exposure → 2.3% decline in downstream exports next quarter (\(p<0.01\))
  • Policy relevance: simulate counterfactual shock propagation at the product level

AIPNET: Key Takeaways

  • LLMs can construct economic networks that traditional IO tables miss
  • Fine-grained product linkages enable micro-level shock tracing
  • Extends to sanctions, supply chain disruptions, industrial policy analysis

Hallucination, Reproducibility, Cost

LLMs hallucinate

  • LLMs generate plausible-sounding but factually incorrect text – hallucination.
  • In economics research context:
    • A model might invent citations, statistics, or causal relationships.
    • Classification labels may be confident but wrong for domain-specific jargon.
  • Mitigation:
    • Use structured outputs to constrain responses to valid categories.
    • Validate against human-coded samples.
    • Use retrieval-augmented generation (RAG) to ground responses in source documents.

Reproducibility challenges

  • Model drift: OpenAI, Anthropic, and others update models without notice. gpt-4 in January 2024 \(\neq\) gpt-4 in January 2025.
  • Stochasticity: Even at temperature = 0, outputs may vary slightly across API calls.
  • Prompt sensitivity: Small wording changes can shift classification distributions by 5–15%.
  • Best practices:
    • Pin model versions (e.g. gpt-4o-2024-08-06).
    • Log all prompts, model IDs, timestamps, and raw outputs.
    • Run sensitivity analyses across prompt variants.
    • Consider open-weight models (Llama, DeepSeek) for full reproducibility.

The price collapse: making LLMs accessible

Date       Model             Cost per 1M input tokens
Mar 2023   GPT-4             ~$30.00
Nov 2023   GPT-4 Turbo       ~$10.00
May 2024   GPT-4o            ~$2.50
Jul 2024   GPT-4o-mini       ~$0.15
Jan 2025   DeepSeek R1       ~$0.55
Apr 2025   GPT-4.1-mini      ~$0.10
Feb 2026   Frontier models   ~$0.10–$2.00
  • 60x cost reduction in 3 years for frontier-quality models.
  • Processing 100,000 documents (500 tokens each) now costs $5–$25 instead of $1,500.
  • Open-weight models (DeepSeek R1, Llama) can run locally for zero marginal cost.

When not to use LLMs

  • Simple keyword counting: If a dictionary works, it is cheaper, faster, and fully reproducible.
  • High-stakes causal inference: LLM-generated variables introduce opaque measurement error.
  • Sensitive data: Sending data to external APIs may violate data protection agreements (GDPR, IRB).
  • When you need a paper trail: Regulators and reviewers may question black-box measurements.

Rule of thumb

Use LLMs for annotation and measurement (replacing human coders), not as a substitute for econometric identification.

RAG: Retrieval-Augmented Generation

What is RAG?

Problem: LLMs hallucinate facts and can’t access your private data.

Solution: Retrieval-Augmented Generation — retrieve relevant source documents, then pass them to the LLM alongside your question.

  1. Index your corpus (e.g., ECB speeches, 10-K filings) into a vector database
  2. Retrieve the top-\(k\) most similar chunks for each query
  3. Generate an answer grounded in the retrieved text

Why this matters for economists:

  • LLM answers are traceable to specific source paragraphs
  • Reduces hallucination — the model “quotes” rather than invents
  • Works with private, proprietary, or very recent data that the LLM wasn’t trained on
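The retrieval step (step 2) in miniature, with word-count vectors standing in for learned embeddings and a plain matrix for the vector database:

```r
chunks <- c("The ECB raised rates by 25 basis points.",
            "Inflation expectations remain anchored.",
            "The blockade disrupted Qatari exports.")
query <- "Why did the ECB raise interest rates?"

tokenize <- function(txt) tolower(strsplit(gsub("[[:punct:]]", "", txt), " ")[[1]])
vocab <- unique(unlist(lapply(c(chunks, query), tokenize)))
vec <- function(txt) as.numeric(table(factor(tokenize(txt), levels = vocab)))
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Score every chunk against the query, retrieve the best match
scores <- sapply(chunks, function(ch) cosine(vec(ch), vec(query)))
chunks[which.max(scores)]  # the chunk to pass to the LLM as context
```

A production pipeline would embed chunks with a model (e.g. via an embeddings API) and store them in a vector database, but the ranking logic is the same.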

Validation: Cohen’s Kappa in R

Always validate LLM labels against human coders. Cohen’s \(\kappa\) measures inter-rater agreement beyond chance:

library(irr)

# human_labels and llm_labels are character/factor vectors of same length
human_labels <- c("hawkish", "dovish", "neutral", "hawkish", "dovish")
llm_labels   <- c("hawkish", "dovish", "hawkish", "hawkish", "dovish")

# Compute Cohen's kappa
kappa_result <- kappa2(cbind(human_labels, llm_labels))
kappa_result
#> Cohen's Kappa for 2 Raters (Weights: unweighted)
#>  Subjects = 5
#>    Raters = 2
#>     Kappa = 0.667

\(\kappa\)    Interpretation
< 0.20        Poor
0.21–0.40     Fair
0.41–0.60     Moderate
0.61–0.80     Substantial
> 0.80        Almost perfect

Understanding LLMs by building one

The idea: demystifying the black box

  • The API you called in the previous exercises? Inside it’s just next-token prediction on text, with more parameters and more data.
  • We’ll train a tiny character-level transformer on ECB speeches using R + torch
  • Same architecture as GPT — just 1000x smaller
  • Inspired by Karpathy’s nanochat: “understand by building”

What happens inside an LLM?

  1. Tokenize: split text into tokens (here: individual characters)
  2. Embed: map each token to a dense vector
  3. Transform: stack attention blocks — each token attends to previous tokens to build context
  4. Predict: linear layer outputs probability distribution over next token
  5. Sample: pick the next token, append, repeat

Training = adjust weights so the model gets better at predicting the next character in ECB speeches.
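Step 5 – the generation loop – in miniature, with a fixed toy distribution standing in for the trained model's softmax output:

```r
set.seed(42)
# Hypothetical stand-in for the model: a real next_char_probs would run the
# transformer on `context` and return a softmax over the full alphabet
next_char_probs <- function(context) c(a = 0.5, b = 0.3, c = 0.2)

generate_toy <- function(start, n_tokens) {
  out <- start
  for (i in seq_len(n_tokens)) {
    p <- next_char_probs(out)                          # predict
    out <- paste0(out, sample(names(p), 1, prob = p))  # sample + append
  }
  out
}

generate_toy("x", 10)  # 11 characters: the start plus 10 sampled
```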

Mini-LLM architecture

Input: "The ECB has decided to maintain intere"
  → Character embedding (128 dimensions)
  → 2 Transformer blocks (4 attention heads each)
  → Linear projection → softmax
  → Predict next character: "s" (for "interest")
  • ~500K parameters (vs ~175B for GPT-3)
  • Trains in ~5–10 minutes on a laptop CPU
  • After training, generates vaguely ECB-sounding text

Running the exercise

# Prerequisites (install once):
# install.packages("torch")
# torch::install_torch()

# Run the mini-LLM training script:
source("code/05-mini_llm.R")

# After training, generate text:
generate(model, start_str = "The ECB has decided",
         max_tokens = 200, temperature = 0.8)
#> "The ECB has decided to maintain interest rates at their
#>  present levels. The current monetary policy stance remains
#>  accommodative and the inflation expectations..."

See code/05-mini_llm.R for the full implementation.

Pedagogical payoff

  • The API you called earlier? Inside it’s this same loop — next-token prediction, with more parameters and more data
  • The transformer architecture is identical whether you have 500K or 175B parameters
  • Understanding the mechanics helps you:
    • Design better prompts (you know what the model is actually doing)
    • Evaluate when LLMs will struggle (rare tokens, out-of-distribution text)
    • Appreciate why RAG and fine-tuning work

Recap & Further Directions

Key takeaways

  1. Modern NLP moved from BoW \(\to\) static embeddings \(\to\) contextual embeddings \(\to\) instruction-tuned LLMs.
  2. LLMs via APIs (ellmer, OpenRouter) make classification, extraction, and annotation almost turn-key.
  3. Structured outputs (JSON schemas) ensure reproducible, parseable measurements at scale.
  4. Prompt engineering is the new “feature engineering” – specificity, few-shot examples, and chain-of-thought matter.
  5. Four core measurement tasks: document similarity, concept detection, concept relationships, text-metadata mapping.
  6. Costs have collapsed 60x in 3 years; open-weight models enable local, reproducible deployment.
  7. Validation against human-coded samples remains essential.

Further reading and resources

  • Review articles:
    • Ash & Hansen (2023), Text Algorithms in Economics, Annual Review of Economics.
    • Gentzkow et al. (2019), Text as Data, JEL.
  • Software:
    • ellmer – tidyverse-native LLM interface for R.
    • text2vec, quanteda – traditional NLP in R.
    • spacyr, udpipe – linguistic annotation.
  • LLM access:
    • OpenRouter – single API for 100+ models.
    • Ollama – run open-weight models locally.
  • Key application paper:
    • Fetzer et al. (2024), AI-Generated Production Networks.
  • Ludwig, Mullainathan & Rambachan (2025), “Large Language Models: An Applied Econometric Framework”, NBER WP 33344.