Data Science for Economists
2026-03-01
By the end of today you should be able to:
It has become increasingly affordable to store and process vast quantities of digital text, triggering an explosion of empirical research that leverages text as data.
Historical cost of computer memory and storage
Finance — Tetlock (2007): pessimism in the WSJ “Abreast of the Market” column predicts next-day stock market declines and subsequent reversals
Labour — Hershbein & Kahn (2018): job posting text shows skill requirements rose faster in cities hit hardest by the Great Recession
Political economy — Gentzkow & Shapiro (2010): compare newspaper text to congressional speech to measure media slant; find strong demand-side pressure from readers
Macroeconomics — Baker, Bloom & Davis (2016): newspaper keyword counts measure Economic Policy Uncertainty \(\rightarrow\) Application 1
Macroeconomics / finance — Bybee, Kelly, Manela & Xiu (2024): topic model applied to WSJ articles tracks business-cycle themes in real time
Public finance / surveys — Ferrario & Stantcheva (2022): open-ended survey responses reveal people’s first-order concerns about tax policy
Industrial Organisation — Hoberg & Phillips (2016): cosine similarity of 10-K product descriptions defines dynamic industry boundaries \(\rightarrow\) Application 2
Google Trends: Ukraine
Google Trends: US abortion
Google Flu Trends – overprediction due to media-driven search behaviour
Why is text data hard?
Imagine a document with \(w\) tokens, each drawn from a vocabulary of \(p\) distinct words.
How many unique documents exist?
\[p^w\]
This is an astronomically large number (a combinatorial explosion!).
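To get a feel for the magnitude, the count can be evaluated on a log scale. The values of `p` and `w` below are illustrative assumptions, not figures from the lecture.

```r
# Illustrative values (assumptions): a 10,000-word vocabulary, 100-token documents
p <- 10000
w <- 100

# p^w overflows double precision, so work with its base-10 logarithm instead
log10_docs <- w * log10(p)
log10_docs        # number of decimal digits in p^w, minus one

is.finite(p^w)    # the raw count cannot be stored as a finite double
```

Under these assumed values the number of possible documents has roughly 400 digits, which is why some form of dimension reduction is unavoidable.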
Implication: Text data is high-dimensional. To analyse it, we must:
Source: Kenneth Benoit, Course on Quantitative Text Analysis (TCD 2016)
How we represent text has evolved from simple counts to rich contextual meaning:
| Representation | Idea | What it captures |
|---|---|---|
| Bag-of-Words | Word counts per document | Presence / frequency |
| tf–idf | Reweight by cross-doc rarity | Distinctiveness |
| Word embeddings (Word2Vec, GloVe) | Dense vector from co-occurrence | Semantic similarity |
| Contextual embeddings (ELMo, BERT) | Vector depends on sentence context | Polysemy, syntax |
| Transformers / LLMs (GPT, LLaMA) | Attention over full context | Language generation, reasoning |
Today: BoW and tf–idf – the workhorses of economics text analysis.
In future classes: word embeddings and LLMs – why they dominate NLP but require more care in causal research.
Each step gains expressiveness but trades off interpretability and computational cost.
Representing Text as Data
What would you have done differently?
To turn off either behaviour explicitly:
Without lowercasing, “Economies” and “economies” would be counted as different vocabulary items – inflating the DTM unnecessarily.
Note on punctuation: stripping it is safe for most BoW applications, but in some domains punctuation carries meaning – e.g. $, %, and # in financial filings, or emoticons (:), :() in social media sentiment analysis. Always check whether removing it discards signals relevant to your research question.
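A minimal base-R sketch of what lowercasing and punctuation stripping do to the vocabulary. The example sentence is invented, and base R is used here instead of `unnest_tokens()` purely to keep the sketch dependency-free.

```r
# Invented example text (an assumption for illustration)
txt <- "Economies grow. Some economies shrink by 2% annually."

# Split on whitespace only: case and punctuation are kept
raw <- strsplit(txt, "\\s+")[[1]]

# Lowercase and strip punctuation before splitting
clean <- strsplit(gsub("[[:punct:]]", "", tolower(txt)), "\\s+")[[1]]

length(unique(raw))     # distinct raw tokens
length(unique(clean))   # distinct tokens after normalisation
"2%" %in% raw           # the percent sign survives in the raw tokens ...
"2%" %in% clean         # ... but is lost after stripping punctuation
```

The vocabulary shrinks because "Economies" and "economies" merge, but the `%` sign is discarded along the way, which is exactly the kind of signal to check for in financial text.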
Tokenisation breaks text into smaller units – words, characters, or n-grams.
Stop-word removal strips high-frequency, low-information words using pre-built dictionaries:
Note: “capital gains tax” is a trigram – detecting multi-word expressions requires collocation methods (statistical tests of independence).
library(dplyr)     # tibble(), semi_join(), anti_join(), pull()
library(tidytext)
doc <- tibble(text = "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.")
tokens <- doc |> unnest_tokens(word, text)
nrow(tokens)
#> [1] 23
# Which tokens are stop words?
tokens |> semi_join(stop_words, by = "word") |> pull(word)
#> [1] "it" "is" "a" "that" "a" "in" "of" "a"
#> [9] "must" "be" "in" "want" "of" "a"
# What remains after anti_join?
tokens |> anti_join(stop_words, by = "word") |> pull(word)
#> [1] "truth" "universally" "acknowledged" "single"
#> [5] "man" "possession" "good" "fortune"
#> [9] "wife"
23 tokens \(\rightarrow\) 14 stop words removed \(\rightarrow\) 9 content words. The signal-to-noise ratio improves substantially.
Instead of single words (unigrams), tokenise into sequences of \(n\) consecutive words:
doc <- tibble(text = "capital gains tax reduces investment and economic growth")
doc |> unnest_tokens(bigram, text, token = "ngrams", n = 2) |> pull(bigram)
#> [1] "capital gains" "gains tax" "tax reduces"
#> [4] "reduces investment" "investment and" "and economic"
#> [7] "economic growth"
doc |> unnest_tokens(trigram, text, token = "ngrams", n = 3) |> pull(trigram)
#> [1] "capital gains tax" "gains tax reduces"
#> [3] "tax reduces investment" ...
Tradeoff: larger \(n\) captures more context but multiplies vocabulary size \(\rightarrow\) sparser DTM and slower computation.
Multi-word expressions like “capital gains tax”: not all trigrams are meaningful – “tax reduces investment” is not a concept. Use collocation statistics to keep only sequences that co-occur more than chance predicts:
library(quanteda); library(quanteda.textstats)
# Score candidate 2- and 3-word sequences that occur at least 5 times
# (corp is assumed to be an existing quanteda corpus)
collocs <- tokens(corp) |> textstat_collocations(min_count = 5, size = 2:3)
collocs
# returns λ (an association score estimated from a log-linear model) and a z-score for each candidate collocation
# lambda ≈ log(observed co-occurrence / expected co-occurrence under independence)
# z = lambda / standard error(lambda)
# collocation count count_nested length lambda z
# 1 free trade 85 0 2 9.42156 18.302
# 2 market access 72 0 2 8.73421 16.947
# Join the detected collocations into single tokens
toks <- tokens(corp) |>
tokens_compound(pattern = collocs)
# capital_return_tax
Rule of thumb: start with unigrams; add bigrams/trigrams only for well-attested collocations (e.g. \(z > 3\)).
Both reduce words to a base form:
Stemming – a crude heuristic that chops off word endings (Porter, 1980):
library(dplyr)     # tibble(), mutate(), select(), pull()
library(tidytext)
library(textstem)
doc <- tibble(text = "The economists were studying the economies of countries that had been economically restructured.")
tokens <- doc |> unnest_tokens(word, text)
tokens |> pull(word)
#> [1] "the" "economists" "were" "studying"
#> [5] "the" "economies" "of" "countries"
#> [9] "that" "had" "been" "economically"
#> [13] "restructured"
# Lemmatise: collapse inflected forms to their dictionary root
tokens |> mutate(lemma = lemmatize_words(word)) |> select(word, lemma)
#> word lemma
#> economists economist # plural → singular
#> were be # past → base verb
#> studying study # gerund → base verb
#> economies economy # plural → singular
#> countries country # plural → singular
#> economically economic # adverb → adjective
#> restructured restructure # past participle → base verb
7 of 13 tokens change form – the vocabulary shrinks and variants of the same concept are now counted together.
When text is represented as the bag (multiset) of its words:
Example: Two movie reviews
BoW_R1 = {This:1, movie:1, is:2, spooky:1, and:1, original:1}
BoW_R2 = {This:1, movie:1, is:1, original:1, but:1, long:1}
| | This | movie | is | spooky | and | original | but | long |
|---|---|---|---|---|---|---|---|---|
| BoW_R1 | 1 | 1 | 2 | 1 | 1 | 1 | 0 | 0 |
| BoW_R2 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 |
New words \(\Rightarrow\) larger vocabulary \(\Rightarrow\) higher dimensionality \(\Rightarrow\) pre-processing matters.
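The two-review table can be rebuilt in a few lines of base R. This is a sketch: the wording of the two reviews is an assumption, chosen only so that the word counts match the bags shown above.

```r
# Assumed review texts, reconstructed to match the counts in the table
reviews <- c(R1 = "This movie is spooky and is original",
             R2 = "This movie is original but long")

# Tokenise on single spaces and collect the vocabulary in order of first appearance
tokens_list <- strsplit(reviews, " ", fixed = TRUE)
vocab       <- unique(unlist(tokens_list))

# Count each vocabulary word in each review: rows = documents, columns = words
dtm <- t(vapply(tokens_list,
                function(tok) as.integer(table(factor(tok, levels = vocab))),
                integer(length(vocab))))
dimnames(dtm) <- list(names(reviews), vocab)
dtm
```

Adding a third review with unseen words would add columns to `dtm`, which is the dimensionality problem the slide points to.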
Goal: down-weight words common across all documents; up-weight distinctive words.
For word \(j\) in document \(i\):
\[\text{tf-idf}_{ij} = tf_{ij} \times idf_j\]
Worked example: 100 party manifestos, 1 000 words each. Document 1 contains “inequality” 16 times; 40 manifestos mention it.
High tf-idf \(\Leftrightarrow\) high within-document frequency and low cross-document frequency \(\rightarrow\) filters out common terms.
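The manifesto example can be checked numerically. The sketch below assumes that tf is the within-document relative frequency and that idf is the natural log of (number of documents / documents containing the word), which is the convention used by `tidytext::bind_tf_idf()`; other conventions (raw counts, log base 10) give different numbers.

```r
n_docs         <- 100    # manifestos in the corpus
doc_length     <- 1000   # words in document 1
term_count     <- 16     # occurrences of "inequality" in document 1
docs_with_term <- 40     # manifestos that mention "inequality"

tf     <- term_count / doc_length        # assumed: relative frequency
idf    <- log(n_docs / docs_with_term)   # assumed: natural log of N / n_j
tf_idf <- tf * idf

round(c(tf = tf, idf = idf, tf_idf = tf_idf), 4)
```

Under these assumptions the score is about 0.0147. If all 100 manifestos mentioned the word, idf would be log(1) = 0 and the score would vanish regardless of how often document 1 uses it.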
library(tidytext)
library(dplyr)
# From a tidy text data frame with columns: document, word
word_counts <- tidy_books |>
count(document, word, sort = TRUE)
# Compute tf-idf
tf_idf <- word_counts |>
bind_tf_idf(word, document, n)
# Top distinctive words per document
tf_idf |> group_by(document) |> slice_max(tf_idf, n = 5)
After tokenisation, stop-word removal, stemming/lemmatisation, and (optionally) tf-idf weighting, your corpus becomes a document–term matrix:
\[\mathbf{C}_{n \times p} = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1p} \\ c_{21} & c_{22} & \cdots & c_{2p} \\ \vdots & & \ddots & \vdots \\ c_{n1} & c_{n2} & \cdots & c_{np} \end{pmatrix}\]
This matrix is the starting point for all downstream analysis: dictionaries, regression, clustering, topic models.
The quanteda package (Benoit et al.) provides an integrated, high-performance pipeline for text analysis in R:
Key features:
See: quanteda.io
Statistical Methods
Three families of methods – today covers 1 and 2; last class covers 3:
\(\rightarrow\) Methods 3 and 4 from Gentzkow et al. (generative models, word embeddings) are covered in the last class.
Classifying documents when categories are known:
Regex = algebraic notation for pattern-matching in strings. Essential for counting keyword occurrences and text cleaning. Key patterns: [0-9] (digits), ^ (start), $ (end), .* (any characters).
library(stringr)
x <- c("GDP grew by 2.3% in Q1 2020",
"Unemployment rate: 4.1%",
"Federal Reserve raised rates",
"Vote on the bill: 234 in favour")
str_detect(x, "\\d+\\.?\\d*%") # find percentage figures
#> [1] TRUE TRUE FALSE FALSE
str_extract(x, "\\d{4}") # extract 4-digit years
#> [1] "2020" NA NA NA
str_replace(x, "Federal Reserve", "Fed") # normalise an entity name
#> [1] "GDP grew by 2.3% in Q1 2020"
#> [2] "Unemployment rate: 4.1%"
#> [3] "Fed raised rates"
#> [4] "Vote on the bill: 234 in favour"
Reference: MIT regex cheatsheet (PDF)
| Dictionary | Best for | Output |
|---|---|---|
| Harvard IV / General Inquirer | Classic content analysis | sentiment, social, political categories |
| Loughran–McDonald | Financial / corporate text | negative, uncertainty, litigious… |
| Bing / Liu | General sentiment | positive / negative |
| NRC Emotion Lexicon | Emotions | anger, joy, fear, trust… |
| LIWC | Psychology / social science | psycholinguistic categories |
Some dictionaries are available directly in R:
# install.packages("textdata") # needed for afinn and nrc
library(tidytext); library(janeaustenr)
library(dplyr); library(tidyr)  # count()/mutate() and pivot_wider()
get_sentiments("bing") # bundled with tidytext
get_sentiments("afinn") # requires textdata
get_sentiments("nrc") # requires textdata
# Apply with a join
austen_books() |>
unnest_tokens(word, text) |>
inner_join(get_sentiments("bing"), by = "word") |>
count(book, sentiment) |>
pivot_wider(names_from = sentiment, values_from = n,
values_fill = 0) |>
mutate(net_sentiment = positive - negative)
\(\Rightarrow\) Always match the dictionary to your domain – see next slide.
Loughran and McDonald (2011) applied the Harvard-IV-4 TagNeg (H4N) dictionary to firms’ 10-K filings and found that almost three-quarters of the “negative” words were typically not negative in a financial context – e.g. cancer, tax, cost, capital, board, liability, foreign.
\(\Rightarrow\) Always validate dictionaries against your specific domain.
Example: Contingency tables on keyword use in Parliament meetings
| | Government | Opposition |
|---|---|---|
| labor flexibility | 100 | 20 |
| environment | 115 | 25 |
Test independence using \(\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\), where expected frequencies assume keywords are independent of group.
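As a sketch, the statistic for the Parliament table can be computed both directly from the formula and with base R's `chisq.test()`. The continuity correction is switched off so that the two numbers agree; that choice is an assumption made for the comparison, not something the slide specifies.

```r
# Keyword counts by group, taken from the table above
counts <- matrix(c(100, 20,
                   115, 25),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(keyword = c("labor flexibility", "environment"),
                                 group   = c("Government", "Opposition")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson chi-squared statistic computed directly from the formula
chi2_manual <- sum((counts - expected)^2 / expected)

# Built-in test; correct = FALSE disables the Yates continuity correction
chi2_test <- chisq.test(counts, correct = FALSE)

c(manual  = chi2_manual,
  builtin = unname(chi2_test$statistic),
  p_value = chi2_test$p.value)
```

The statistic is small (well below the 5% critical value of about 3.84 for one degree of freedom), so these counts give no evidence that keyword use differs between Government and Opposition.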
Text-based variables are rarely “ground truth” — they are constructed measures.
Does the text measure capture the concept we claim it captures?
Face validity – inspect top words / top-scoring documents. Do they actually look like the concept?
Human validation – hand-code a random sample; compare machine labels to human labels. Report accuracy, precision, recall, or F1.
Robustness – vary dictionaries, stop-word lists, stemming, n-grams, thresholds. Do substantive conclusions change?
External validation – compare the text measure to an independent benchmark: surveys, expert ratings, known events, market responses, official statistics.
Example: Baker, Bloom & Davis (2016) validate EPU in two ways:
- Human coders re-read a sample of articles and confirm whether they match the keyword criteria (human validation)
- EPU spikes around recessions, elections, and policy events (external validation)
\(\Rightarrow\) Validation should be tied to the research question, not just to prediction accuracy.
Compare machine labels to human labels using a confusion matrix:
| | Human: positive | Human: negative |
|---|---|---|
| Machine: positive | TP | FP |
| Machine: negative | FN | TN |
Key tradeoff: a more permissive keyword list \(\uparrow\) recall but \(\downarrow\) precision (more false alarms); a stricter list does the opposite. Which matters more depends on your research question.
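The standard metrics follow directly from the four cells of the confusion matrix. The counts below are hypothetical, invented only to make the arithmetic concrete.

```r
# Hypothetical confusion-matrix counts (invented for illustration)
TP <- 40; FP <- 10; FN <- 20; TN <- 130

accuracy  <- (TP + TN) / (TP + FP + FN + TN)
precision <- TP / (TP + FP)   # share of machine positives that humans agree with
recall    <- TP / (TP + FN)   # share of human positives that the machine finds
f1        <- 2 * precision * recall / (precision + recall)  # harmonic mean

round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 3)
```

With these numbers accuracy looks comfortable at 0.85, yet the classifier misses a third of the true positives (recall of about 0.67), which is why accuracy alone can be misleading when the positive class is rare.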
Dictionary methods use researcher-defined keywords. Supervised methods learn from labelled documents:
Common models:
Workflow:
label documents \(\rightarrow\) split train/test \(\rightarrow\) estimate \(\hat{f}\) \(\rightarrow\) validate on held-out set \(\rightarrow\) apply to unlabelled corpus
\(\Rightarrow\) The model learns which words predict the labels, but the labels still come from humans.
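The workflow can be sketched end to end in base R. Multinomial naive Bayes is used here purely as an illustrative classifier (it may or may not be among the models the lecture lists), and the tiny labelled corpus is invented.

```r
# Hypothetical labelled mini-corpus (invented for illustration)
docs <- data.frame(
  text  = c("great original movie", "spooky but great", "original and fun",
            "long and boring", "boring long movie", "dull but long",
            "great fun movie", "dull boring plot"),
  label = c("pos", "pos", "pos", "neg", "neg", "neg", "pos", "neg"),
  stringsAsFactors = FALSE
)

# Split: first six documents for training, last two held out for validation
train    <- docs[1:6, ]
held_out <- docs[7:8, ]

tokenise <- function(x) strsplit(tolower(x), "\\s+")
classes  <- c("pos", "neg")
vocab    <- unique(unlist(tokenise(train$text)))

# Estimate log P(word | class) with add-one (Laplace) smoothing
word_logprob <- sapply(classes, function(cl) {
  toks   <- factor(unlist(tokenise(train$text[train$label == cl])), levels = vocab)
  counts <- as.numeric(table(toks)) + 1
  log(counts / sum(counts))
})
rownames(word_logprob) <- vocab

# Log prior: share of training documents in each class
log_prior <- log(sapply(classes, function(cl) mean(train$label == cl)))

# Score a new document; words outside the training vocabulary are ignored
predict_nb <- function(text) {
  toks   <- tokenise(text)[[1]]
  toks   <- toks[toks %in% vocab]
  scores <- log_prior + colSums(word_logprob[toks, , drop = FALSE])
  classes[which.max(scores)]
}

predicted <- vapply(held_out$text, predict_nb, character(1), USE.NAMES = FALSE)
mean(predicted == held_out$label)   # held-out accuracy
```

The model only reweights words it saw in the labelled training set, so the quality of the human labels and the coverage of the training vocabulary bound what it can learn.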
| | This | movie | is | spooky | and | original | but | long |
|---|---|---|---|---|---|---|---|---|
| BoW_R1 | 1 | 1 | 2 | 1 | 1 | 1 | 0 | 0 |
| BoW_R2 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 |
Define \(a = |BoW_{R1} \cap BoW_{R2}|\), \(b = |BoW_{R1}| - a\), \(c = |BoW_{R2}| - a\).
Cosine similarity: normalises by length – robust when documents differ in size \[s_{\cos} = \frac{a}{\sqrt{(a+b)(a+c)}}\]
Jaccard similarity: treats each unique word equally – penalises length differences \[s_{\text{jacc}} = \frac{a}{a+b+c}\]
How much in our example? What changes after stemming?
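One way to work through the example is sketched below. It assumes each bag is treated as a set of unique words, so the repeated "is" in review 1 counts once, in line with the definitions of \(a\), \(b\) and \(c\).

```r
# Unique words in each review (taken from the BoW table above)
bow_r1 <- c("This", "movie", "is", "spooky", "and", "original")
bow_r2 <- c("This", "movie", "is", "original", "but", "long")

a  <- length(intersect(bow_r1, bow_r2))  # words in both reviews
b  <- length(bow_r1) - a                 # words only in review 1
cc <- length(bow_r2) - a                 # words only in review 2 (cc avoids masking c())

s_cos  <- a / sqrt((a + b) * (a + cc))
s_jacc <- a / (a + b + cc)

c(cosine = s_cos, jaccard = s_jacc)
```

Here the reviews share 4 of their 6 words each, giving a cosine similarity of 2/3 and a Jaccard similarity of 1/2; Jaccard is lower because the non-shared words of both documents enter the denominator in full.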
Goal: find latent topics in a corpus without pre-defined categories.
Key idea (LDA): assume each document is a mixture of topics, and each topic is a probability distribution over words. Infer the hidden structure from the observed word counts.
Example: given a corpus of news articles, LDA might discover:
Each article gets a vector of topic proportions, e.g. 70% monetary policy, 20% politics, 10% trade.
In economics: track what central banks talk about over time, characterise media coverage, discover policy themes without imposing categories.
Key limitation: topics are just word distributions – labelling them requires human judgment.
LDA simultaneously estimates two matrices from word co-occurrence patterns alone:
\(\boldsymbol{\beta}\): topic–word matrix – the probability of each word within each topic (each row sums to 1 across the full vocabulary; only five words are shown):
| | trade | tariff | bank | rate | election |
|---|---|---|---|---|---|
| Topic 1 | 0.042 | 0.031 | 0.001 | 0.002 | 0.001 |
| Topic 2 | 0.003 | 0.002 | 0.038 | 0.029 | 0.001 |
| Topic 3 | 0.002 | 0.001 | 0.003 | 0.002 | 0.041 |
You label each topic by inspecting its top words – this step requires human judgment.
\(\boldsymbol{\gamma}\): document–topic matrix – the proportion of each topic in each document:
| | Topic 1 (trade) | Topic 2 (finance) | Topic 3 (politics) |
|---|---|---|---|
| Article 1 | 0.70 | 0.20 | 0.10 |
| Article 2 | 0.05 | 0.85 | 0.10 |
| Article 3 | 0.15 | 0.10 | 0.75 |
\(\boldsymbol{\gamma}\) is your new feature matrix – use it as an input to regression or to track topic prevalence over time.
library(quanteda); library(topicmodels); library(tidytext)
library(dplyr)  # group_by(), slice_max()
# Build DTM
dfmat <- corpus(my_texts) |>
tokens(remove_punct = TRUE) |>
tokens_remove(stopwords("en")) |>
dfm() |> dfm_trim(min_termfreq = 5)
# Fit LDA -- k is the number of topics (your choice)
lda_model <- LDA(convert(dfmat, to = "topicmodels"), k = 10,
control = list(seed = 42))
# Top words per topic
terms(lda_model, 10)
# Or extract into a tidy tibble for ggplot2
tidy(lda_model, matrix = "beta") |> # word–topic probabilities
group_by(topic) |>
slice_max(beta, n = 5)
tidy(lda_model, matrix = "gamma") # document–topic mixtures
cast_dtm(document, word, n) converts a tidy word-count tibble directly to a DTM if you prefer the tidytext pipeline over quanteda.
Applications
Two examples of text-based classification in economics:
Can we measure policy uncertainty in the US? How does it look, and does it matter?
Baker, Bloom & Davis (2016)
Let \(i\) be a country–month pair, \(j\) a newspaper, and \(a\) an article (\(j = 1, \ldots, n\); \(a = 1, \ldots, m_j\)).
Computer-coded vs. human-coded EPU
Theory: \(\uparrow\) Uncertainty \(\rightarrow\) \(\uparrow\) real option to wait \(\rightarrow\) \(\downarrow\) investment (Bernanke 1983, Dixit & Pindyck 1994)
Regression: Firm-level estimates exploit differences in industry exposure to government:
\[Y_{it} = \text{FE}_i + \text{FE}_t + \beta\, (INT_i \times EPU_t) + \alpha\, (INT_i \times GS_t) + \varepsilon_{it}\]
where \(INT_i\) measures firm \(i\)’s dependence on government contracts, \(GS_t\) is a government-spending control, and \(Y_{it}\) is investment or hiring.
How to define a product market that is endogenous to firms’ choices?
Data: Web-crawl of 50,673 firm annual 10-K filings; use the product description section.
Pre-processing:
Let \(P^i \in \{0,1\}^K\) be a binary vector representation of the product description for firm \(i\), where \(K = |BoW_1 \cup \ldots \cup BoW_{50{,}673}|\).
Pairwise cosine similarity:
\[S_C(P^i, P^j) = \cos(\theta) = \frac{\sum_{k=1}^K P^i_k P^j_k}{\sqrt{\sum_{k=1}^K (P^i_k)^2}\;\sqrt{\sum_{k=1}^K (P^j_k)^2}} = \frac{\text{words in common}}{\text{normalised by length}}\]
Alternative: define \(P^i \in \mathbb{R}^K\) using tf-idf weights.
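A minimal sketch of the pairwise computation for binary vectors. The three "firms" and their five-word vocabulary are hypothetical; with real 10-K text, \(K\) would be the size of the full vocabulary.

```r
# Hypothetical binary word-presence vectors for three firms over K = 5 words
P <- rbind(firm_a = c(1, 1, 0, 1, 0),
           firm_b = c(1, 0, 0, 1, 1),
           firm_c = c(0, 0, 1, 0, 0))

norms <- sqrt(rowSums(P^2))                   # Euclidean length of each vector
S     <- tcrossprod(P) / outer(norms, norms)  # dot products divided by length products

round(S, 3)
```

For binary vectors the numerator is simply the number of shared words, so `firm_a` and `firm_b` (two shared words, three words each) score 2/3, while `firm_c` shares nothing with either and scores 0. Replacing the 0/1 entries with tf-idf weights leaves the code unchanged.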
Text-based product similarity allows Hoberg & Phillips to test whether firms change their product-market position over time.
Sutton-style mechanism:
Advertising and R&D are endogenous sunk costs that differentiate products and soften competition.
Schumpeterian mechanism:
R&D creates new products and technologies, allowing firms to escape existing competitive pressure.
Prediction:
\[ \text{Advertising/R\&D today} \rightarrow \downarrow \text{future similarity} \rightarrow \downarrow \text{number of close rivals} \rightarrow \uparrow \text{profitability} \]
Can business news measure the state of the economy in real time?
Bybee, Kelly, Manela & Xiu (2024)
Data:
around 800,000 Wall Street Journal articles, 1984–2017
Method:
estimate a topic model and measure monthly news attention to each topic.
\(\Rightarrow\) Unstructured news becomes macroeconomic time series.
Figure: Topic taxonomy from Bybee et al. (2024).
LDA produces:
Topic–word distributions
e.g. Fed, rates, inflation
\(\rightarrow\) monetary policy
Article–topic distributions
e.g. article = 70% recession, 20% credit, 10% trade
Aggregate article-topic shares by month:
\[ \text{Attention}_{kt} = \frac{1}{N_t} \sum_{i \in t} \gamma_{ik} \]
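The aggregation step is a grouped average of the document–topic matrix. The sketch below reuses the three-article \(\boldsymbol{\gamma}\) values from the illustrative table earlier in these slides; the publication months are hypothetical.

```r
# Document-topic shares (gamma) for three articles, from the illustrative table
gamma <- matrix(c(0.70, 0.20, 0.10,
                  0.05, 0.85, 0.10,
                  0.15, 0.10, 0.75),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("article_", 1:3),
                                c("trade", "finance", "politics")))

# Hypothetical publication months (an assumption for illustration)
month <- c("2020-01", "2020-01", "2020-02")

# Sum gamma within each month, then divide by the number of articles N_t
n_t       <- as.vector(table(month))
attention <- rowsum(gamma, group = month) / n_t

attention
```

Each row of `attention` is one month and still sums to 1, so the columns can be read as monthly time series of news attention to each topic.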
\(\Rightarrow\) Topic modelling measures what the news is about, not only whether it is positive or negative.