10. LLMs

Data Science for Economists

Irene Iodice

2026-03-01

Learning objectives

By the end of today you should be able to:

  1. Explain why economists moved from bag-of-words to embeddings to LLMs.
  2. Describe what word embeddings capture and how transformers improve on them.
  3. Use LLMs via APIs from R (the ellmer package) for classification, extraction, and annotation tasks.
  4. Design structured outputs (JSON schemas) for reproducible measurement.
  5. Apply prompt-engineering strategies to economics research problems.
  6. Describe at least two published economics papers that leverage modern NLP/LLM representations.

Motivation

Why move beyond Bag-of-Words?

  • Order and context: BoW ignores word order – “bank raises rates” vs “rates raise bank”.
  • Synonymy: “job”, “employment”, “work” treated as independent features.
  • Sparsity: high-dimensional DTM leads to curse of dimensionality.
  • Modern NLP compresses text into dense, low-dimensional vectors that encode semantic information.

Economic upside

Better text representations often improve predictive power (e.g. central-bank tone \(\to\) yields) and enable causal designs that exploit semantic shifts (e.g. narrative shocks).

Economic use cases

  • Forecasting & Nowcasting: Use news or earnings call transcripts with embeddings to predict macro variables (e.g. GDP, inflation).
  • Policy uncertainty: Dictionary vs embedding-based measures of uncertainty (Baker et al. 2016).
  • Industry competition: Similarity of 10-K filings to gauge product-market overlap (Hoberg & Phillips 2016).
  • Sentiment analysis: Gauge financial sentiment in firm filings or tweet corpora to predict returns (Ke et al. 2019).
  • Labor-market skill maps: Match job-posting text to O*NET tasks using embeddings (Hansen et al. 2021).

Why economists now mine text

  • Digital text – news, patents, 10-Ks, job ads – is exploding; numbers alone miss these “soft” signals (Gentzkow et al. 2019; Ash & Hansen 2023).
  • NLP lets us turn unstructured words into structured variables we can graph, regress, and test.
  • Early tools (bag-of-words, \(n\)-grams) count words but ignore context.
  • Modern models capture meaning and nuance \(\to\) richer measures of innovation, policy tone, skills, etc.

From bag-of-words to LLMs: a leap in capability

  • Then: bag-of-words \(\approx\) word counts; good for frequency, weak on order and synonyms.
  • Now: transformer-based large language models (BERT, GPT) use attention to weigh each word in context (Vaswani et al. 2017).
  • Fine-tuned variants (FinBERT, ClimateBERT) already boost accuracy in financial and climate text (Yang et al. 2020; Webersinke et al. 2021).
  • Instruction-tuned assistants (ChatGPT, Claude) make advanced NLP tasks almost turn-key for economists.

From counts to dense vectors

The problem with counting words

  • BoW ignores word order; documents are represented by token counts.
  • Simple, interpretable, widely used in economics.
  • Major limitations:
    • Synonymy: “weak” \(\neq\) “tepid”
    • Polysemy: “statistics lie” vs “cats lie”
    • No understanding of word order.

Fixes on top of BoW (topic models, n-grams, dependency parsing) help but remain brittle, high-dimensional, and context-insensitive.

Word embeddings: distributional semantics

“You shall know a word by the company it keeps.” – J. R. Firth

  • Word2Vec (Mikolov et al. 2013): CBOW & Skip-Gram architectures.
  • GloVe (Pennington et al. 2014): factorises global co-occurrence matrix.
  • Embedding dimension typically 100–300.
  • Cosine similarity captures analogies: \(\text{king} - \text{man} + \text{woman} \approx \text{queen}\).

Key idea

Move beyond which words occur to where words live in a low-dimensional space. Build a co-occurrence matrix and factorise it so that similar words sit close together.

Vector arithmetic and analogies

  • Embeddings support meaningful linear operations: \(\text{"king"} - \text{"man"} + \text{"woman"} \approx \text{"queen"}\).
  • Latent dimensions pick up interpretable traits (royalty, masculinity, age …).
  • Helps quantify bias: the “gender” direction separates stereotypically male vs female occupations, enabling post-processing debiasing (Bolukbasi et al. 2016).
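A toy illustration in base R shows how cosine similarity recovers the analogy (the 3-dimensional vectors below are invented for illustration; real embeddings have 100–300 dimensions):

```r
# Toy "embeddings": values invented, 3 dims instead of the usual 100-300
emb <- rbind(
  king  = c(0.9, 0.8, 0.1),
  man   = c(0.1, 0.9, 0.2),
  woman = c(0.1, 0.1, 0.9),
  queen = c(0.9, 0.1, 0.8)
)

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# king - man + woman lands closest to queen
target <- emb["king", ] - emb["man", ] + emb["woman", ]
sims <- apply(emb, 1, cosine, b = target)
sort(sims, decreasing = TRUE)
```

With real vectors (e.g. pre-trained GloVe) the same arithmetic works unchanged in 300 dimensions.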

Dimensionality reduction: topic models

  • LDA (Latent Dirichlet Allocation): each document is a mixture of topics; each topic a distribution over words. \[\theta_d \sim \text{Dir}(\alpha), \quad \beta_k \sim \text{Dir}(\eta)\]
  • Bayesian inference reduces overfitting in sparse, high-dimensional spaces.
  • Produces interpretable, human-readable topics.
  • Widely used in economics: e.g. Hansen et al. (2018) on Fed communications.

Other approaches

LSA (PCA on DTM), pLSA (probabilistic LSA), NMF (non-negative matrix factorisation) all share the same goal: reduce dimensionality from \(V\) (vocab size) to \(K\) (topic count). LDA adds Dirichlet priors and is the most widely adopted.
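The shared mechanics can be sketched in base R: LSA is essentially an SVD of the document-term matrix, keeping only the top \(K\) singular dimensions (the tiny DTM below uses invented counts):

```r
# Tiny DTM: 3 documents x 4 terms (counts invented for illustration)
dtm <- matrix(c(2, 0, 1, 0,
                1, 0, 2, 0,
                0, 3, 0, 2),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("d1", "d2", "d3"),
                              c("rate", "tariff", "inflation", "quota")))

s <- svd(dtm)
K <- 2
doc_loadings <- s$u[, 1:K] %*% diag(s$d[1:K])  # documents in K-dim latent space
dim(doc_loadings)  # 3 x 2: V = 4 terms reduced to K = 2 dimensions
```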

Transformers and Attention

From static to contextual embeddings

  • Static embeddings (Word2Vec, GloVe): each word \(v\) has a single vector \(\rho_v\) regardless of context.
  • Contextual embeddings (BERT, GPT): each token’s vector depends on surrounding tokens.
  • Self-attention mechanism: \[\rho'_{d,n} = \sum_{n'=1}^{N_d} w_{n,n'}\,\rho^0_{d,n'}, \quad \sum_{n'} w_{n,n'} = 1,\] where attention weights \(w_{n,n'}\) are learned.
  • Transformer stacks multiple attention layers to capture deep interactions.
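The attention formula above can be computed numerically in base R (random toy vectors; real models learn query/key/value projections rather than using raw dot products):

```r
set.seed(1)
N <- 4; d <- 3                       # 4 tokens, 3-dim embeddings
rho0 <- matrix(rnorm(N * d), N, d)   # initial embeddings rho^0

scores <- rho0 %*% t(rho0) / sqrt(d)       # pairwise similarity scores
W <- exp(scores) / rowSums(exp(scores))    # softmax: each row sums to 1
rho1 <- W %*% rho0                         # contextual embeddings rho'

rowSums(W)  # all equal to 1, as the constraint on w_{n,n'} requires
```

GPT-style models additionally mask future tokens, so each position attends only to its predecessors.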

Masked language models: BERT

Masked word prediction

Given “As a leading firm in the [MASK] sector, we hire highly skilled …”

  • BERT infers “technology” with high probability if context words include “software engineers”.
  • In “As a leading firm in the [MASK] sector, we hire highly skilled … petroleum engineers,” BERT instead predicts “energy” or “oil”.
  • Fine-tuned variants: FinBERT (financial text), ClimateBERT (climate policy), SciBERT (scientific papers).
  • Economics use-case: masked-LM fine-tuning detects monetary-policy events far better than dictionaries (Droubi et al. 2022).

Neural language models: architecture

  • Objective: maximise \(\sum_t \log P_\theta(w_t \mid \text{context})\) where \(\theta\) are neural weights.
  • Learns contextual embeddings for every token \(\Rightarrow\) meaning adapts to its surroundings.
  • Architectures:
    • RNN/LSTM: sequential recurrence (older, mostly superseded).
    • Transformer: self-attention layers (BERT, GPT family).
  • Scale: trained on billions of tokens; parameters range from millions to >100B.

The LLM landscape (early 2026)

Model                          Context           Provider          Notes
GPT-4.1                        1M tokens         OpenAI            Current flagship
Claude Opus 4.6 / Sonnet 4.6   200k tokens       Anthropic         Strong on reasoning + code
Llama 4                        128k–10M tokens   Meta (open)       Open weights, many sizes
DeepSeek V3 / R1               128k tokens       DeepSeek (open)   Reasoning at low cost
Gemini 2.5                     1M+ tokens        Google            Multimodal

Key trend: open-weight models are closing the gap; costs collapsed 60x since 2023. Check current benchmarks — this table goes stale fast.

The ellmer package

Why API-based LLMs for economists?

  • No need to train or fine-tune – use instruction-tuned models directly.
  • Reproducible measurement at scale: classify thousands of documents with consistent prompts.
  • Cost collapse: GPT-4 (March 2023) cost ~$30 per 1M input tokens; by 2026 equivalent models cost <$0.50 per 1M tokens – a 60x reduction in 3 years.
  • DeepSeek R1 (open-weight, Jan 2025): reasoning-capable model at ~10% the cost of GPT-4o.

But watch out

API outputs are stochastic (set temperature = 0 for near-deterministic results), models update without notice, and costs can surprise at scale.

The ellmer package: tidyverse-native LLM interface

# install.packages("ellmer")
library(ellmer)

# Connect to any provider via OpenRouter, OpenAI, Anthropic, etc.
chat <- chat_openai(
  model = "gpt-4.1-mini",
  system_prompt = "You are a helpful economics research assistant."
)

# Simple text query
chat$chat("Summarize the main argument of Acemoglu et al. 2001 in two sentences.")

ellmer supports OpenAI, Anthropic, OpenRouter (access to 100+ models), Ollama (local models), and more – all with the same interface.

Batch classification with ellmer

library(ellmer)
library(purrr)

classify_sentiment <- function(text) {
  chat <- chat_openai(model = "gpt-4.1-mini")
  chat$chat(paste0(
    "Classify the sentiment of this central bank statement as ",
    "'hawkish', 'dovish', or 'neutral'. ",
    "Return ONLY the label.\n\nText: ", text
  ))
}

# Apply to a data frame of speeches
speeches$sentiment <- map_chr(speeches$text, classify_sentiment)

Cost estimate

10,000 short paragraphs with gpt-4.1-mini \(\approx\) $0.30. With DeepSeek R1 via OpenRouter \(\approx\) $0.05.
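The arithmetic behind such estimates is simple (the per-token price below is an assumption; check current rates before budgeting):

```r
docs <- 10000
tokens_per_doc <- 200   # a short paragraph
price_per_1m <- 0.15    # assumed $ per 1M input tokens

docs * tokens_per_doc / 1e6 * price_per_1m  # total cost in dollars: 0.3
```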

JSON schemas for reproducible measurement

Why structured outputs?

  • Free-text LLM responses are hard to parse programmatically.
  • Structured outputs constrain the model to return valid JSON matching a schema.
  • Benefits for economists:
    • Consistent variable types across thousands of documents.
    • Direct integration into data pipelines (no regex parsing).
    • Reproducible and auditable measurement.

Structured output with ellmer: sentiment + entities

library(ellmer)

# Define the output type
type_analysis <- type_object(
  sentiment = type_enum("hawkish", "dovish", "neutral",
    .description = "Overall monetary policy sentiment"),
  confidence = type_number(.description = "Confidence score 0-1"),
  key_entities = type_array(items = type_string(),
    .description = "Named entities mentioned (central banks, countries, etc.)"),
  summary = type_string(.description = "One-sentence summary")
)

chat <- chat_openai(model = "gpt-4.1-mini")
result <- chat$extract_data(
  "The ECB kept rates unchanged but signaled that inflation risks remain tilted
   to the upside, suggesting further tightening may be needed in Q3.",
  type = type_analysis
)

What you get back

str(result)
#> List of 4
#>  $ sentiment    : chr "hawkish"
#>  $ confidence   : num 0.85
#>  $ key_entities : chr [1:2] "ECB" "Q3"
#>  $ summary      : chr "The ECB held rates steady but signaled upside
#>                         inflation risks and potential further tightening."

This is a native R list – no parsing needed. Bind rows across documents to get a tidy data frame.

Entity extraction at scale

library(ellmer)
library(purrr)
library(dplyr)  # for bind_rows()

type_trade_event <- type_object(
  event_type = type_enum("tariff", "sanction", "quota", "subsidy", "other"),
  countries = type_array(items = type_string()),
  products = type_array(items = type_string()),
  direction = type_enum("restrictive", "liberalizing", "neutral"),
  date_mentioned = type_string(.description = "Date if mentioned, else 'NA'")
)

# Process 5,000 news articles
results <- map(articles$text, \(txt) {
  chat <- chat_openai(model = "gpt-4.1-mini")
  chat$extract_data(txt, type = type_trade_event)
})

trade_events <- bind_rows(results)

This replaces weeks of manual coding with hours of API calls and pennies of cost.

Designing prompts that measure what you mean

Prompt engineering principles

  1. Be specific about the task: “Classify sentiment” is vague; “Classify the monetary policy stance as hawkish, dovish, or neutral based on forward guidance language” is precise.
  2. Provide examples (few-shot): include 2–5 labelled examples in the prompt to anchor the model’s calibration.
  3. Define edge cases: what counts as “neutral”? What if the text discusses both tightening and easing?
  4. Request structured output: always constrain the response format (JSON, enum, etc.).
  5. Set temperature to 0: for measurement tasks, minimise randomness.

Few-shot prompting for economics

system_prompt <- "
You classify FOMC statements by monetary policy stance.

Examples:
- 'The Committee decided to raise the target range for the federal funds
  rate to 5 to 5-1/4 percent.' -> hawkish
- 'The Committee decided to lower the target range by 50 basis points.'
  -> dovish
- 'The Committee decided to maintain the target range.' -> neutral

Classify the following statement. Return ONLY the label.
"

chat <- chat_openai(model = "gpt-4.1", system_prompt = system_prompt)
chat$chat(new_statement)

Chain-of-thought for complex coding tasks

system_prompt <- "
You are an expert trade policy analyst. For each news article:
1. Identify whether a trade policy event is described.
2. If yes, determine the type (tariff, sanction, quota, subsidy, other).
3. Identify affected countries and products.
4. Assess whether the measure is restrictive or liberalizing.

Think step-by-step before providing your final answer.
Return your analysis as JSON.
"

Chain-of-thought prompting improves accuracy on multi-step reasoning tasks by 10–30% (Wei et al. 2022). The model “shows its work” before committing to an answer.

Validation: LLM labels vs human coders

  • Always validate on a human-coded holdout set (100–500 documents).
  • Report inter-rater agreement (Cohen’s \(\kappa\)) between LLM and human labels.
  • Compare against dictionary-based baselines (e.g. Loughran-McDonald for finance sentiment).
  • Recent findings: GPT-4-class models match or exceed median crowd-worker accuracy on many classification tasks (Gilardi et al. 2023).

Reproducibility

Model versions change. Always log the exact model ID, prompt text, temperature, and date of API calls. Pin model versions where possible (e.g. gpt-4o-2024-08-06).

Four measurement problems

Problem I: Measuring document similarity

  • Goal: Quantify how “close” two documents are in meaning.
  • Approaches:
    1. BoW-based: raw counts or tf-idf \(\to\) cosine.
    2. Embeddings: average word vectors \(\to\) cosine.
    3. Topic model sharing: cosine on topic loadings (LDA).
    4. LLM embeddings: use text-embedding-3-large (OpenAI, 3,072 dims) or sentence-transformers models (typically 768 dims).
  • Economics examples:
    • Industry overlap: Hoberg & Phillips (2010, 2016) use BoW & tf-idf on product descriptions in 10-Ks.
    • Patent novelty: Kelly et al. (2021).
    • Syllabi vs research: Biasi & Ma (2022).
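Approach 1 can be sketched in base R with toy product descriptions (invented text; real applications use full 10-K sections):

```r
docs <- c(a = "oil gas drilling exploration",
          b = "oil gas pipeline transport",
          c = "software cloud data analytics")
toks  <- strsplit(docs, " ")
vocab <- sort(unique(unlist(toks)))

# Document-term matrix, tf-idf weighting, cosine similarity
tf    <- t(sapply(toks, function(w) table(factor(w, levels = vocab))))
idf   <- log(length(docs) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, "*")
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

cosine(tfidf["a", ], tfidf["b", ])  # positive: overlapping product space
cosine(tfidf["a", ], tfidf["c", ])  # 0: no shared vocabulary
```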

Problem II: Concept detection

  • Goal: Detect presence or intensity of an economic concept in text (e.g. policy uncertainty, sentiment, skills).
  • Methods:
    1. Dictionaries / pattern matching: Baker et al. (2016) – economic + uncertainty + policy term sets.
    2. Embedding-augmented lexicons: Seed sets \(\to\) nearest neighbors in embedding space \(\to\) expanded term set (Hanley & Hoberg 2019).
    3. LLM zero-shot classification: Ask the model directly – “Does this paragraph discuss economic policy uncertainty? Yes/No.” No training data needed.
    4. Supervised classification: Human-annotated sample \(\to\) train BERT or fine-tune \(\to\) scale up (Hansen et al. 2023).
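Method 2 in miniature, with toy 2-d vectors standing in for real embeddings: start from a seed term and pull in its nearest neighbours.

```r
# Invented 2-d "embeddings" for illustration only
emb <- rbind(uncertainty = c(0.90, 0.10),
             doubt       = c(0.85, 0.15),
             risk        = c(0.80, 0.20),
             growth      = c(0.10, 0.90),
             exports     = c(0.20, 0.80))
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

seed <- "uncertainty"
sims <- apply(emb, 1, cosine, b = emb[seed, ])
# Expand the lexicon with the two nearest neighbours of the seed term
expanded <- names(sort(sims[names(sims) != seed], decreasing = TRUE))[1:2]
expanded  # "doubt" "risk"
```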

Problem III: How concepts relate

  • Goal: Quantify co-occurrence or semantic association between concepts (e.g. gender and emotion, class and politics).
  • Methods:
    1. Local co-occurrence counts: Count windows with terms from two dictionaries.
    2. WEAT (Embedding association test): Project words onto axes defined by attribute sets (Caliskan et al. 2017).
    3. Syntactic patterns: Extract dependency triples (actor–verb–patient) to capture directed relationships.
    4. LLM extraction: Ask the model to identify relationships and output structured triples.
  • Economics examples:
    • Gender attitudes: Use WEAT on judge opinions – Ash et al. (2020b).
    • Narrative networks: Dependency triples among Congressional speeches (Ash et al. 2023).
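The WEAT logic of method 2 reduces to projecting target words onto an attribute axis; a toy 2-d sketch in base R (invented vectors, not real embeddings):

```r
emb <- rbind(male     = c(1.0, 0.0),
             female   = c(0.0, 1.0),
             engineer = c(0.8, 0.2),
             nurse    = c(0.2, 0.8))

axis <- emb["male", ] - emb["female", ]          # "gender" direction
proj <- emb[c("engineer", "nurse"), ] %*% axis   # signed association

proj  # engineer projects toward the "male" pole, nurse toward "female"
```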

Problem IV: Associating text with metadata

  • Goal: Use document text to predict or explain metadata (e.g. political bias, firm returns).
  • Methods:
    1. Supervised BoW: LASSO/logistic on term counts (Gentzkow & Shapiro 2010).
    2. Topical regression: Structural topic models regressing topics on covariates (Roberts et al. 2014).
    3. LLM annotation: Use LLMs to generate labels, then regress as usual – separates measurement from inference.
  • Economics examples:
    • Media slant: Train on Congressional speeches \(\to\) predict newspaper article ideology (Widmer et al. 2020).
    • Wage prediction: BERT on job postings to predict salaries (Bana 2022).

Econometric considerations

LLM-generated variables introduce unique challenges for causal inference:

  • Measurement error: Text-derived measures carry sampling & model uncertainty that downstream regressions ignore. Bootstrap over prompt variants to estimate sensitivity.
  • Prompt sensitivity: Small wording changes shift classification distributions by 5–15%. Always run sensitivity analyses.
  • Model drift: gpt-4 in 2024 \(\neq\) gpt-4.1 in 2026. Pin model versions and log everything.
  • Validation: Human-annotated holdouts remain essential.

For a thorough treatment, see Ash & Hansen (2023), Text Algorithms in Economics, Annual Review of Economics.

Application: AI-Generated Production Networks

AIPNET: AI-Generated Production Networks (Fetzer et al. 2024)

  • Goal: Construct a granular input-output network over 5,000 product nodes using LLMs
  • Method: Two-step “build-prune” pipeline:
    1. Build: Use LLMs to classify whether product \(i\) is an input to product \(j\) (25M pairs)
    2. Prune: Re-evaluate candidates, enforce consistency, threshold at calibrated \(\tau\)
  • Output: Directed adjacency matrix with ~1.2M edges — much finer than traditional sector-level IO tables

AIPNET: Application to the 2017 Qatar Blockade

  • Trace downstream “neighbors” of Qatar’s top exports in the AIPNET product space
  • Construct exposure index → regress on downstream export changes
  • Result: 1 SD higher exposure → 2.3% decline in downstream exports next quarter (\(p<0.01\))
  • Policy relevance: simulate counterfactual shock propagation at the product level

AIPNET: Key Takeaways

  • LLMs can construct economic networks that traditional IO tables miss
  • Fine-grained product linkages enable micro-level shock tracing
  • Extends to sanctions, supply chain disruptions, industrial policy analysis

Hallucination, Reproducibility, Cost

LLMs hallucinate

  • LLMs generate plausible-sounding but factually incorrect text – hallucination.
  • In economics research context:
    • A model might invent citations, statistics, or causal relationships.
    • Classification labels may be confident but wrong for domain-specific jargon.
  • Mitigation:
    • Use structured outputs to constrain responses to valid categories.
    • Validate against human-coded samples.
    • Use retrieval-augmented generation (RAG) to ground responses in source documents.

Reproducibility challenges

  • Model drift: OpenAI, Anthropic, and others update models without notice. gpt-4 in January 2024 \(\neq\) gpt-4 in January 2025.
  • Stochasticity: Even at temperature = 0, outputs may vary slightly across API calls.
  • Prompt sensitivity: Small wording changes can shift classification distributions by 5–15%.
  • Best practices:
    • Pin model versions (e.g. gpt-4o-2024-08-06).
    • Log all prompts, model IDs, timestamps, and raw outputs.
    • Run sensitivity analyses across prompt variants.
    • Consider open-weight models (Llama, DeepSeek) for full reproducibility.

The price collapse: making LLMs accessible

Date       Model             Cost per 1M input tokens
Mar 2023   GPT-4             ~$30.00
Nov 2023   GPT-4 Turbo       ~$10.00
May 2024   GPT-4o            ~$2.50
Jul 2024   GPT-4o-mini       ~$0.15
Jan 2025   DeepSeek R1       ~$0.55
Apr 2025   GPT-4.1-mini      ~$0.10
Feb 2026   Frontier models   ~$0.10–$2.00
  • 60x cost reduction in 3 years for frontier-quality models.
  • Processing 100,000 documents (500 tokens each) now costs $5–$25 instead of $1,500.
  • Open-weight models (DeepSeek R1, Llama) can run locally for zero marginal cost.

When not to use LLMs

  • Simple keyword counting: If a dictionary works, it is cheaper, faster, and fully reproducible.
  • High-stakes causal inference: LLM-generated variables introduce opaque measurement error.
  • Sensitive data: Sending data to external APIs may violate data protection agreements (GDPR, IRB).
  • When you need a paper trail: Regulators and reviewers may question black-box measurements.

Rule of thumb

Use LLMs for annotation and measurement (replacing human coders), not as a substitute for econometric identification.

RAG: Retrieval-Augmented Generation

What is RAG?

Problem: LLMs hallucinate facts and can’t access your private data.

Solution: Retrieval-Augmented Generation — retrieve relevant source documents, then pass them to the LLM alongside your question.

  1. Index your corpus (e.g., ECB speeches, 10-K filings) into a vector database
  2. Retrieve the top-\(k\) most similar chunks for each query
  3. Generate an answer grounded in the retrieved text

Why this matters for economists:

  • LLM answers are traceable to specific source paragraphs
  • Reduces hallucination — the model “quotes” rather than invents
  • Works with private, proprietary, or very recent data that the LLM wasn’t trained on
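The retrieval step (step 2) in miniature, with word-count vectors standing in for learned embeddings and a plain matrix for the vector database:

```r
chunks <- c("The ECB raised rates by 25 basis points.",
            "Inflation expectations remain anchored.",
            "The blockade disrupted Qatari exports.")
query <- "Why did the ECB raise interest rates?"

tokenize <- function(txt) tolower(strsplit(gsub("[[:punct:]]", "", txt), " ")[[1]])
vocab <- unique(unlist(lapply(c(chunks, query), tokenize)))
vec <- function(txt) as.numeric(table(factor(tokenize(txt), levels = vocab)))
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Score every chunk against the query, retrieve the best match
scores <- sapply(chunks, function(ch) cosine(vec(ch), vec(query)))
chunks[which.max(scores)]  # the chunk to pass to the LLM as context
```

A production pipeline would embed chunks with a model (e.g. via an embeddings API) and store them in a vector database, but the ranking logic is the same.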

Validation: Cohen’s Kappa in R

Always validate LLM labels against human coders. Cohen’s \(\kappa\) measures inter-rater agreement beyond chance:

library(irr)

# human_labels and llm_labels are character/factor vectors of same length
human_labels <- c("hawkish", "dovish", "neutral", "hawkish", "dovish")
llm_labels   <- c("hawkish", "dovish", "hawkish", "hawkish", "dovish")

# Compute Cohen's kappa
kappa_result <- kappa2(cbind(human_labels, llm_labels))
kappa_result
#> Cohen's Kappa for 2 Raters (Weights: unweighted)
#>  Subjects = 5
#>    Raters = 2
#>     Kappa = 0.667

\(\kappa\)    Interpretation
< 0.20        Poor
0.21–0.40     Fair
0.41–0.60     Moderate
0.61–0.80     Substantial
> 0.80        Almost perfect

Understanding LLMs by building one

The idea: demystifying the black box

  • The API you called in the previous exercises? Inside it’s just next-token prediction on text, with more parameters and more data.
  • We’ll train a tiny character-level transformer on ECB speeches using R + torch
  • Same architecture as GPT — just 1000x smaller
  • Inspired by Karpathy’s nanochat: “understand by building”

What happens inside an LLM?

  1. Tokenize: split text into tokens (here: individual characters)
  2. Embed: map each token to a dense vector
  3. Transform: stack attention blocks — each token attends to previous tokens to build context
  4. Predict: linear layer outputs probability distribution over next token
  5. Sample: pick the next token, append, repeat

Training = adjust weights so the model gets better at predicting the next character in ECB speeches.
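Step 5 – the generation loop – in miniature, with a fixed toy distribution standing in for the trained model's softmax output:

```r
set.seed(42)
# Hypothetical stand-in for the model: a real next_char_probs would run the
# transformer on `context` and return a softmax over the full alphabet
next_char_probs <- function(context) c(a = 0.5, b = 0.3, c = 0.2)

generate_toy <- function(start, n_tokens) {
  out <- start
  for (i in seq_len(n_tokens)) {
    p <- next_char_probs(out)                          # predict
    out <- paste0(out, sample(names(p), 1, prob = p))  # sample + append
  }
  out
}

generate_toy("x", 10)  # 11 characters: the start plus 10 sampled
```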

Mini-LLM architecture

Input: "The ECB has decided to maintain intere"
  → Character embedding (128 dimensions)
  → 2 Transformer blocks (4 attention heads each)
  → Linear projection → softmax
  → Predict next character: "s" (for "interest")
  • ~500K parameters (vs ~175B for GPT-3)
  • Trains in ~5–10 minutes on a laptop CPU
  • After training, generates vaguely ECB-sounding text

Running the exercise

# Prerequisites (install once):
# install.packages("torch")
# torch::install_torch()

# Run the mini-LLM training script:
source("code/05-mini_llm.R")

# After training, generate text:
generate(model, start_str = "The ECB has decided",
         max_tokens = 200, temperature = 0.8)
#> "The ECB has decided to maintain interest rates at their
#>  present levels. The current monetary policy stance remains
#>  accommodative and the inflation expectations..."

See code/05-mini_llm.R for the full implementation.

Pedagogical payoff

  • The API you called earlier? Inside it’s this same loop — next-token prediction, with more parameters and more data
  • The transformer architecture is identical whether you have 500K or 175B parameters
  • Understanding the mechanics helps you:
    • Design better prompts (you know what the model is actually doing)
    • Evaluate when LLMs will struggle (rare tokens, out-of-distribution text)
    • Appreciate why RAG and fine-tuning work

Recap & Further Directions

Key takeaways

  1. Modern NLP moved from BoW \(\to\) static embeddings \(\to\) contextual embeddings \(\to\) instruction-tuned LLMs.
  2. LLMs via APIs (ellmer, OpenRouter) make classification, extraction, and annotation almost turn-key.
  3. Structured outputs (JSON schemas) ensure reproducible, parseable measurements at scale.
  4. Prompt engineering is the new “feature engineering” – specificity, few-shot examples, and chain-of-thought matter.
  5. Four core measurement tasks: document similarity, concept detection, concept relationships, text-metadata mapping.
  6. Costs have collapsed 60x in 3 years; open-weight models enable local, reproducible deployment.
  7. Validation against human-coded samples remains essential.

Further reading and resources

  • Review articles:
    • Ash & Hansen (2023), Text Algorithms in Economics, Annual Review of Economics.
    • Gentzkow et al. (2019), Text as Data, JEL.
  • Software:
    • ellmer – tidyverse-native LLM interface for R.
    • text2vec, quanteda – traditional NLP in R.
    • spacyr, udpipe – linguistic annotation.
  • LLM access:
    • OpenRouter – single API for 100+ models.
    • Ollama – run open-weight models locally.
  • Key application paper:
    • Fetzer et al. (2024), AI-Generated Production Networks.
  • Ludwig, Mullainathan & Rambachan (2025), “Large Language Models: An Applied Econometric Framework”, NBER WP 33344.