06. Text as Data

Data Science for Economists

Irene Iodice

2026-03-01

Learning objectives

By the end of today you should be able to:

Tokenise raw text and build a Bag-of-Words document–term matrix in R.
Compute and interpret tf–idf weights.
Apply a domain-specific sentiment dictionary and evaluate its accuracy.
Explain two research designs that use dictionary counts or cosine similarity in economics.

Why text?

It has become increasingly affordable to store and process vast quantities of digital text, triggering an explosion of empirical research that leverages text as data.

Historical cost of computer memory and storage

Examples in economics

Finance — Tetlock (2007): pessimism in the WSJ “Abreast of the Market” column predicts next-day stock market declines and subsequent reversals
Labour — Hershbein & Kahn (2018): job posting text shows skill requirements rose faster in cities hit hardest by the Great Recession
Political economy — Gentzkow & Shapiro (2010): compare newspaper text to congressional speech to measure media slant; find strong demand-side pressure from readers
Macroeconomics — Baker, Bloom & Davis (2016): newspaper keyword counts measure Economic Policy Uncertainty $\rightarrow$ Application 1
Macroeconomics / finance — Bybee, Kelly, Manela & Xiu (2024): topic model applied to WSJ articles tracks business-cycle themes in real time
Public finance / surveys — Ferrario & Stantcheva (2022): open-ended survey responses reveal people’s first-order concerns about tax policy
Industrial Organisation — Hoberg & Phillips (2016): cosine similarity of 10-K product descriptions defines dynamic industry boundaries $\rightarrow$ Application 2

Text as Data – Strengths

“Always on”

Google Trends: Ukraine

Text as Data – Strengths (cont.)

“Non-Reactive”

Google Trends: US abortion

Text as Data – Weaknesses

Incomplete
Inaccessible or sensitive
Non-representative
Confounding

Google Flu Trends – overprediction due to media-driven search behaviour

Check here the Nature article

Check here the article

Dimensionality challenge

Why is text data hard?

Imagine a document with $w$ tokens, each drawn from a vocabulary of $p$ distinct words.

How many unique documents exist?

\[p^w\]

This is an astronomically large number (a combinatorial explosion!).
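To get a feel for the magnitude, here is a minimal sketch with illustrative values. The numbers p = 10,000 and w = 100 are assumptions chosen for the example, not figures from the lecture. Working in logs avoids numeric overflow:

```r
# Illustrative values (assumptions): vocabulary size and document length
p <- 10000   # distinct words in the vocabulary
w <- 100     # tokens in one document

# p^w overflows double precision, so count decimal digits via logs instead
log10_docs <- w * log10(p)
log10_docs
#> [1] 400

# The direct computation is too large to represent as a double
is.infinite(p^w)
#> [1] TRUE
```

Under these assumptions there are 10^400 possible documents, which is why we never model raw token sequences directly.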

Implication: Text data is high-dimensional. To analyse it, we must:

engineer features (e.g., BoW, tf–idf);
reduce noise (stop words, stemming);
select relevant signals.

Texts $\rightarrow$ Feature matrix $\rightarrow$ Analysis

Source: Kenneth Benoit, Course on Quantitative Text Analysis (TCD 2016)

Roadmap

Representing Text
- Pre-processing: tokenisation, normalisation, stop words, n-grams, lemmatisation
- Bag-of-Words, tf-idf, Document-Term Matrix
Statistical Methods
- Dictionary-based classification
- Supervised text classification
- Unsupervised: similarity, clustering, topic modelling
Applications in Economics
- Baker, Bloom & Davis (2016): Economic Policy Uncertainty
- Hoberg & Phillips (2016): Product markets from 10-K filings

From words to meaning: a progression

How we represent text has evolved from simple counts to rich contextual meaning:

Representation	Idea	What it captures
Bag-of-Words	Word counts per document	Presence / frequency
tf–idf	Reweight by cross-doc rarity	Distinctiveness
Word embeddings (Word2Vec, GloVe)	Dense vector from co-occurrence	Semantic similarity
Contextual embeddings (ELMo, BERT)	Vector depends on sentence context	Polysemy, syntax
Transformers / LLMs (GPT, LLaMA)	Attention over full context	Language generation, reasoning

Today: BoW and tf–idf – the workhorses of economics text analysis.

In future classes: word embeddings and LLMs – why they dominate NLP but require more care in causal research.

Each step gains expressiveness but trades off interpretability and computational cost.

Representing Text as Data

How to represent news?

library(wordcloud)

Source: Financial Times Blog, 24 March 2020

What would you have done differently?

Processing pipeline

Split corpus into documents (e.g. one article per day).
Tokenise documents (words, n-grams, or characters).
Normalise and reduce: lower-case, remove punctuation/stop words, stemming or lemmatisation.

Normalisation: lower-case and punctuation removal

library(tidytext)
library(dplyr)   # tibble(), pull()

raw <- tibble(text = "The ECONOMISTS were studying the Economies of Countries!")

# unnest_tokens() lowercases and strips punctuation automatically
raw |> unnest_tokens(word, text) |> pull(word)
#> [1] "the"        "economists" "were"       "studying"
#> [5] "the"        "economies"  "of"         "countries"

To turn off either behaviour explicitly:

raw |> unnest_tokens(word, text, to_lower = FALSE)   # preserve case
raw |> unnest_tokens(word, text, strip_punct = FALSE) # keep punctuation

Without lowercasing, “Economies” and “economies” would be counted as different vocabulary items – inflating the DTM unnecessarily.

Note on punctuation: stripping it is safe for most BoW applications, but in some domains punctuation carries meaning – e.g. $, %, and # in financial filings, or emoticons such as ":)" and ":(" in social media sentiment analysis. Always check whether removing it discards signals relevant to your research question.

Tokenisation and stop-word removal

Tokenisation breaks text into smaller units – words, characters, or n-grams.

library(gutenbergr); library(tidytext)
gutenberg_download(1184)[1, ] |>
  unnest_tokens(input = text, output = word, token = "words")
#> # A tibble: 5 x 2
#>   gutenberg_id word
#>          <int> <chr>
#> 1         1184 the
#> 2         1184 count
#> 3         1184 of
#> 4         1184 monte
#> 5         1184 cristo

Stop-word removal strips high-frequency, low-information words using pre-built dictionaries:

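library(hcandersenr); library(tidytext)
library(dplyr); library(stopwords)   # filter(); stopwords()
tidy_fir_tree <- hca_fairytales() |>
  # the dataset holds several translations of each tale, so keep one language
  filter(book == "The fir tree", language == "English") |>
  unnest_tokens(word, text) |>
  filter(!(word %in% stopwords(source = "snowball")))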

Note: “capital gains tax” is a trigram – detecting multi-word expressions requires collocation methods (statistical tests of independence).

Tokenisation and stop-word removal in practice

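library(tidytext)
library(dplyr)   # tibble(), filter(), semi_join(), anti_join(), pull()

doc <- tibble(text = "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.")

tokens <- doc |> unnest_tokens(word, text)
nrow(tokens)
#> [1] 23

# stop_words pools three lexicons (onix, SMART, snowball); pick one explicitly,
# because the pooled list is more aggressive than any single lexicon
smart_stops <- stop_words |> filter(lexicon == "SMART")

# Which tokens are stop words?
tokens |> semi_join(smart_stops, by = "word") |> pull(word)
#>  [1] "it"   "is"   "a"    "that" "a"    "in"   "of"   "a"
#>  [9] "must" "be"   "in"   "want" "of"   "a"

# What remains after anti_join?
tokens |> anti_join(smart_stops, by = "word") |> pull(word)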
#> [1] "truth"        "universally"  "acknowledged" "single"
#> [5] "man"          "possession"   "good"         "fortune"
#> [9] "wife"

23 tokens $\rightarrow$ 14 stop words removed $\rightarrow$ 9 content words. The signal-to-noise ratio improves substantially.

N-grams

Instead of single words (unigrams), tokenise into sequences of $n$ consecutive words:

doc <- tibble(text = "capital gains tax reduces investment and economic growth")

doc |> unnest_tokens(bigram,  text, token = "ngrams", n = 2) |> pull(bigram)
#> [1] "capital gains"      "gains tax"          "tax reduces"
#> [4] "reduces investment" "investment and"     "and economic"
#> [7] "economic growth"

doc |> unnest_tokens(trigram, text, token = "ngrams", n = 3) |> pull(trigram)
#> [1] "capital gains tax"      "gains tax reduces"
#> [3] "tax reduces investment" ...

Tradeoff: larger $n$ captures more context but multiplies vocabulary size $\rightarrow$ sparser DTM and slower computation.

Multi-word expressions like “capital gains tax”: not all trigrams are meaningful – “tax reduces investment” is not a concept. Use collocation statistics to keep only sequences that co-occur more than chance predicts:

library(quanteda); library(quanteda.textstats)

# corp is assumed to be an existing quanteda corpus, e.g. corp <- corpus(my_texts)
toks <- tokens(corp)
collocs <- textstat_collocations(toks, min_count = 5, size = 2:3)
# returns lambda and a z-score for each candidate collocation
# lambda: interaction parameter of a log-linear model of the word counts
#         (for bigrams, the log odds ratio of the two words co-occurring)
# z = lambda / standard error(lambda)
#     collocation count count_nested length  lambda      z
# 1    free trade    85            0      2 9.42156 18.302
# 2 market access    72            0      2 8.73421 16.947

# Keep well-attested collocations and join each into a single token
toks <- tokens_compound(toks, pattern = collocs[collocs$z > 3, ])

# compounds are joined with "_", e.g. capital_gains_tax

Rule of thumb: start with unigrams; add bigrams/trigrams only for well-attested collocations (e.g. $z > 3$).

Stemming vs. lemmatisation

Both reduce words to a base form:

am, are $\Rightarrow$ be
car, cars, car’s, cars’ $\Rightarrow$ car

Stemming – a crude heuristic that chops off word endings (Porter, 1980):

library(textstem)
x <- c('doggies', ',', 'they', "aren't", 'Joyfully', 'running', '.')
stem_words(x)
#> [1] "doggi"  ","  "thei"  "aren't"  "Joyfulli"  "run"  "."

Lemmatisation – uses vocabulary and morphological analysis:

lemmatize_words(x)
#> [1] "doggy"  ","  "they"  "aren't"  "Joyfully"  "run"  "."

Lemmatisation is more accurate but slower; stemming suffices for many BoW applications.

Lemmatisation in practice

library(tidytext)
library(textstem)
library(dplyr)   # tibble(), mutate(), select(), pull()

doc <- tibble(text = "The economists were studying the economies of countries that had been economically restructured.")

tokens <- doc |> unnest_tokens(word, text)
tokens |> pull(word)
#> [1] "the"          "economists"   "were"         "studying"
#> [5] "the"          "economies"    "of"           "countries"
#> [9] "that"         "had"          "been"         "economically"
#> [13] "restructured"

# Lemmatise: collapse inflected forms to their dictionary root
tokens |> mutate(lemma = lemmatize_words(word)) |> select(word, lemma)
# Selected rows whose lemma differs from the token:
#>           word         lemma
#>    economists    economist    # plural → singular
#>          were            be    # past → base verb
#>      studying         study    # present participle → base verb
#>      economies       economy    # plural → singular
#>      countries       country    # plural → singular
#>  economically     economic    # adverb → adjective
#>  restructured   restructure    # past participle → base verb

In the rows shown, 7 of the 13 tokens change form – the vocabulary shrinks and variants of the same concept are now counted together.

Bag-of-Words representation

When text is represented as the bag (multiset) of its words:

disregard grammar and word order
keep multiplicity (multiset)

Example: Two movie reviews

“This movie is spooky and is original” $\rightarrow$ BoW_R1 = {This:1, movie:1, is:2, spooky:1, and:1, original:1}
“This movie is original but long” $\rightarrow$ BoW_R2 = {This:1, movie:1, is:1, original:1, but:1, long:1}

	This	movie	is	spooky	and	original	but	long
BoW_R1	1	1	2	1	1	1	0	0
BoW_R2	1	1	1	0	0	1	1	1

New words $\Rightarrow$ larger vocabulary $\Rightarrow$ higher dimensionality $\Rightarrow$ pre-processing matters.
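The table above can be rebuilt in a few lines of base R. This is a minimal sketch that splits on whitespace only and keeps case as-is to match the table; a real pipeline would use unnest_tokens() or quanteda instead.

```r
# The two reviews from the example above
reviews <- c(R1 = "This movie is spooky and is original",
             R2 = "This movie is original but long")

# Tokenise on single spaces (no lower-casing, to match the table above)
tokens <- strsplit(reviews, " ", fixed = TRUE)

# Vocabulary in order of first appearance
vocab <- unique(unlist(tokens))

# Count each vocabulary term in each document: rows = documents, columns = terms
dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
dtm
#>    This movie is spooky and original but long
#> R1    1     1  2      1   1        1   0    0
#> R2    1     1  1      0   0        1   1    1
```

Each new review can only add columns, never remove them, which is why the vocabulary (and hence the dimensionality) grows with the corpus.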

tf–idf: Feature weighting

Goal: down-weight words common across all documents; up-weight distinctive words.

For word $j$ in document $i$:

\[\text{tf-idf}_{ij} = tf_{ij} \times idf_j\]

$tf_{ij}$: count (or relative frequency) of word $j$ in document $i$
$idf_j = \ln\!\left(\frac{N}{\lvert\{i : tf_{ij} > 0\}\rvert}\right)$ – log-inverse share of documents containing $j$

Worked example: 100 party manifestos, 1 000 words each. Document 1 contains “inequality” 16 times; 40 manifestos mention it.

$tf = 16/1000 = 0.016$
$idf = \ln(100/40) = \ln(2.5) \approx 0.916$
tf-idf $= 0.016 \times 0.916 \approx 0.0147$

High tf-idf $\Leftrightarrow$ high within-document frequency and low cross-document frequency $\rightarrow$ filters out common terms.
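The worked example can be checked directly in R. This sketch uses only the numbers given above; the variable names are illustrative.

```r
# Numbers from the worked example above
N       <- 100    # manifestos in the corpus
doc_len <- 1000   # words in document 1
n_word  <- 16     # occurrences of "inequality" in document 1
df_word <- 40     # manifestos that mention "inequality"

tf     <- n_word / doc_len   # relative term frequency
idf    <- log(N / df_word)   # natural log, as in the formula above
tf_idf <- tf * idf

round(c(tf = tf, idf = idf, tf_idf = tf_idf), 4)
#>     tf    idf tf_idf
#> 0.0160 0.9163 0.0147

# A word that appears in every manifesto gets idf = log(1) = 0
log(N / N)
#> [1] 0
```

The last line shows the filtering effect: a term used in all documents receives zero weight regardless of how often it occurs.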

Computing tf-idf in R

library(tidytext)
library(dplyr)

# tidy_books is assumed to exist: a tidy data frame with one row per token
# and columns document, word
word_counts <- tidy_books |>
  count(document, word, sort = TRUE)

# Compute tf-idf
tf_idf <- word_counts |>
  bind_tf_idf(word, document, n)

# Top distinctive words per document
tf_idf |> group_by(document) |> slice_max(tf_idf, n = 5)

The Document–Term Matrix (DTM)

After tokenisation, stop-word removal, stemming/lemmatisation, and (optionally) tf-idf weighting, your corpus becomes a document–term matrix:

\[\mathbf{C}_{n \times p} = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1p} \\ c_{21} & c_{22} & \cdots & c_{2p} \\ \vdots & & \ddots & \vdots \\ c_{n1} & c_{n2} & \cdots & c_{np} \end{pmatrix}\]

$n$ documents (rows), $p$ vocabulary terms (columns)
Entries are raw counts, relative frequencies, or tf-idf weights
Typically very sparse – most entries are zero

This matrix is the starting point for all downstream analysis: dictionaries, regression, clustering, topic models.

quanteda 4.0 – a modern R text toolkit

The quanteda package (Benoit et al.) provides an integrated, high-performance pipeline for text analysis in R:

library(quanteda)

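# Build a corpus, tokenise, create a weighted DTM in three steps
# (my_texts: a character vector of documents, assumed to exist)
corp  <- corpus(my_texts)
toks  <- tokens(corp, remove_punct = TRUE) |>
           tokens_remove(stopwords("en"))
dfmat <- dfm(toks) |> dfm_tfidf()   # defaults: raw-count tf, log base 10 idf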

Key features:

Fast C++ back-end; handles millions of tokens
Built-in dictionaries, collocations, n-grams
Seamless integration with quanteda.textplots, quanteda.textstats, and quanteda.textmodels
Works naturally with tidy workflows via tidytext conversion

See: quanteda.io

Statistical Methods

Methods overview

Three families of methods – today covers 1 and 2; last class covers 3:

Dictionary-based classification – researcher-defined keyword lists; no labelled data required
- Existing dictionaries, building your own, regex rules, validation
Supervised text classification – learn from labelled documents; BoW/tf-idf as features
- Logistic regression, LASSO, random forests
Unsupervised text analysis – discover structure without labels
- Document similarity and clustering
- Topic modelling (LDA)

$\rightarrow$ Methods 3 and 4 from Gentzkow et al. (generative models, word embeddings) are covered in the last class.

Dictionary-based classification

Dictionary-based methods

Classifying documents when categories are known:

Identify a set of words that correspond to each category
- thesaurus: vote = {poll, suffrage, franchis*, ballot*, ^vot}
- sentiment: positive or negative
- emotions: sad, happy, angry, anxious
- topics: economics, culture, etc.
Count how many times these words appear in each document
Normalise by document length
Validate:
- Code a few documents manually and check alignment
- Check sensitivity to exclusion of specific words
- Decide sample size based on the power of your test
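The counting and normalisation steps above can be sketched in base R. The three documents are hypothetical, and reading the * wildcards and ^vot as "any word starting with this stem" is an assumption about the thesaurus notation.

```r
# Hypothetical toy documents (not from the lecture)
docs <- c(d1 = "Voters went to the poll and cast a ballot",
          d2 = "The bank raised interest rates again",
          d3 = "Suffrage and the franchise: who may vote")

# Thesaurus entry from the slide as one regular expression:
# \\b marks a word boundary; \\w* lets a stem match any continuation
vote_pattern <- "\\b(poll|suffrage|franchis\\w*|ballot\\w*|vot\\w*)\\b"

# Count dictionary hits per document (after lower-casing)
lower <- tolower(docs)
hits  <- lengths(regmatches(lower, gregexpr(vote_pattern, lower, perl = TRUE)))

# Normalise by document length (number of whitespace-separated tokens)
n_tokens <- lengths(strsplit(docs, "\\s+"))
data.frame(hits, n_tokens, share = round(hits / n_tokens, 3))
```

Here d1 and d3 each contain three dictionary terms and d2 none, so the normalised share separates the voting-related documents from the unrelated one.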

Regular expressions

Regex = algebraic notation for pattern-matching in strings. Essential for counting keyword occurrences and text cleaning. Key patterns: [0-9] (digits), ^ (start), $ (end), .* (any characters).

library(stringr)

x <- c("GDP grew by 2.3% in Q1 2020",
       "Unemployment rate: 4.1%",
       "Federal Reserve raised rates",
       "Vote on the bill: 234 in favour")

str_detect(x, "\\d+\\.?\\d*%")       # find percentage figures
#> [1]  TRUE  TRUE FALSE FALSE

str_extract(x, "\\d{4}")             # extract 4-digit years
#> [1] "2020" NA     NA     NA

str_replace(x, "Federal Reserve", "Fed")  # normalise an entity name
#> [1] "GDP grew by 2.3% in Q1 2020"
#> [2] "Unemployment rate: 4.1%"
#> [3] "Fed raised rates"
#> [4] "Vote on the bill: 234 in favour"

Reference: MIT regex cheatsheet (PDF)

Existing dictionaries

Dictionary	Best for	Output
Harvard IV / General Inquirer	Classic content analysis	sentiment, social, political categories
Loughran–McDonald	Financial / corporate text	negative, uncertainty, litigious…
Bing / Liu	General sentiment	positive / negative
NRC Emotion Lexicon	Emotions	anger, joy, fear, trust…
LIWC	Psychology / social science	psycholinguistic categories

Some dictionaries are available directly in R:

# install.packages("textdata")  # needed for afinn and nrc
library(tidytext); library(janeaustenr)
library(dplyr); library(tidyr)   # count(), inner_join(), mutate(); pivot_wider()

get_sentiments("bing")           # bundled with tidytext
get_sentiments("afinn")          # requires textdata
get_sentiments("nrc")            # requires textdata

# Apply with a join
austen_books() |>
  unnest_tokens(word, text) |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(book, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n,
              values_fill = 0) |>
  mutate(net_sentiment = positive - negative)

$\Rightarrow$ Always match the dictionary to your domain – see next slide.

Dictionaries are context-specific

Loughran and McDonald (2011) applied the Harvard-IV-4 TagNeg (H4N) dictionary to firms’ 10-K filings and found that about three-quarters of the “negative” words were typically not negative in a financial context – e.g. cancer, tax, cost, capital, board, liability, foreign.

Polysemes – words with multiple meanings cause misclassification
H4N also lacked negative financial words: felony, litigation, restated, misstatement, unanticipated

$\Rightarrow$ Always validate dictionaries against your specific domain.

Building your own dictionary

Identify “extreme texts” with known positions
Search for differentially occurring words using word frequencies
Use these words (or their lemmas) as category markers

Example: Contingency tables on keyword use in Parliament meetings

	Government	Opposition
labor flexibility	100	20
environment	115	25

Test independence using $\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, where expected frequencies assume keywords are independent of group.
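For the table above, the test is one call to chisq.test() from base R. This is a sketch; the continuity correction is switched off so that the statistic matches the formula exactly.

```r
# Contingency table from the example above
tab <- matrix(c(100, 20,
                115, 25),
              nrow = 2, byrow = TRUE,
              dimnames = list(keyword = c("labor flexibility", "environment"),
                              group   = c("Government", "Opposition")))

# Pearson chi-squared test without continuity correction
res <- chisq.test(tab, correct = FALSE)

res$expected    # E_ij = row total * column total / grand total
res$statistic   # sum over cells of (O - E)^2 / E
res$p.value
```

In this toy table the statistic is roughly 0.06 on 1 degree of freedom, far below the 5% critical value of 3.84: both groups use the two keywords in nearly the same proportions, so neither keyword would be a useful category marker.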

Validation: text measures are measurements

Text-based variables are rarely “ground truth” — they are constructed measures.

Does the text measure capture the concept we claim it captures?

Face validity – inspect top words / top-scoring documents. Do they actually look like the concept?
Human validation – hand-code a random sample; compare machine labels to human labels. Report accuracy, precision, recall, or F1.
Robustness – vary dictionaries, stop-word lists, stemming, n-grams, thresholds. Do substantive conclusions change?
External validation – compare the text measure to an independent benchmark: surveys, expert ratings, known events, market responses, official statistics.

Example: Baker, Bloom & Davis (2016) validate EPU in two ways:
- Human coders re-read a sample of articles and confirm whether they match the keyword criteria (human validation)
- EPU spikes around recessions, elections, and policy events (external validation)

$\Rightarrow$ Validation should be tied to the research question, not just to prediction accuracy.

Evaluating classifier performance

Compare machine labels to human labels using a confusion matrix:

	Human: positive	Human: negative
Machine: positive	TP	FP
Machine: negative	FN	TN

Accuracy $= \frac{TP + TN}{TP + TN + FP + FN}$ – share correctly classified
Precision $= \frac{TP}{TP + FP}$ – of documents flagged positive, how many actually are?
Recall $= \frac{TP}{TP + FN}$ – of actual positives, how many did we catch?
F1 $= 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ – harmonic mean; useful when classes are imbalanced

Key tradeoff: a more permissive keyword list $\uparrow$ recall but $\downarrow$ precision (more false alarms); a stricter list does the opposite. Which matters more depends on your research question.
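The four metrics follow mechanically from the confusion matrix. The sketch below uses ten hypothetical hand-coded documents, invented for illustration only.

```r
# Hypothetical labels for 10 hand-coded documents (illustration only)
human   <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)   # 1 = positive, 0 = negative
machine <- c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

TP <- sum(machine == 1 & human == 1)   # 3
FP <- sum(machine == 1 & human == 0)   # 1
FN <- sum(machine == 0 & human == 1)   # 1
TN <- sum(machine == 0 & human == 0)   # 5

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
```

Accuracy is 0.80 while precision, recall and F1 are all 0.75: accuracy is flattered by the six easy negatives, which is the reason to report F1 when classes are imbalanced.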

Supervised text classification

Supervised text classification

Dictionary methods use researcher-defined keywords. Supervised methods learn from labelled documents:

$y_i$: human-coded label, e.g. relevant / not relevant, positive / negative
$X_i$: BoW or tf-idf feature vector
Learn $\hat{f}(X_i)$ to predict $y_i$ on new documents

Common models:

Logistic regression / multinomial logit
LASSO or elastic net when $p \gg n$
Random forests, support vector machines

Workflow:

label documents $\rightarrow$ split train/test $\rightarrow$ estimate $\hat{f}$ $\rightarrow$ validate on held-out set $\rightarrow$ apply to unlabelled corpus

$\Rightarrow$ The model learns which words predict the labels, but the labels still come from humans.
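The workflow can be sketched end to end with base R. Everything here is simulated: the two word-count features ("growth", "crisis"), the labels, and the coefficients are assumptions made for illustration, and a real application would use a full DTM with LASSO rather than plain logistic regression.

```r
set.seed(1)   # simulated data, for illustration only

# Simulated features: counts of two hypothetical words per document
n <- 400
X <- data.frame(growth = rpois(n, 2), crisis = rpois(n, 2))

# Simulated human label: positive is more likely with "growth", less with "crisis"
y   <- rbinom(n, 1, plogis(0.8 * X$growth - 0.8 * X$crisis))
dat <- cbind(y = y, X)

# Split into training and held-out test sets
train_id <- sample(n, size = 300)
train <- dat[train_id, ]
test  <- dat[-train_id, ]

# Estimate f-hat with logistic regression on the training set
fit <- glm(y ~ growth + crisis, data = train, family = binomial)

# Validate on the held-out set
pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
acc  <- mean(pred == test$y)   # held-out accuracy
acc
```

The fitted coefficients recover the signs built into the simulation, and accuracy is measured only on documents the model never saw during estimation.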

Unsupervised text analysis

Similarity across documents

	This	movie	is	spooky	and	original	but	long
BoW_R1	1	1	2	1	1	1	0	0
BoW_R2	1	1	1	0	0	1	1	1

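Treat each review as the set of its unique words and define $a = |BoW_{R1} \cap BoW_{R2}|$ (words in common), $b = |BoW_{R1}| - a$ and $c = |BoW_{R2}| - a$ (words unique to each review).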

Cosine similarity: normalises by length – robust when documents differ in size \[s_{\cos} = \frac{a}{\sqrt{(a+b)(a+c)}}\]
Jaccard similarity: treats each unique word equally – penalises length differences \[s_{\text{jacc}} = \frac{a}{a+b+c}\]

How much in our example? What changes after stemming?
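One way to answer the first question, as a sketch in base R. The set-based formulas use unique words only; the last line adds the general cosine on raw counts for comparison, which is not the formula given above.

```r
# Count vectors from the table above
# (columns: This, movie, is, spooky, and, original, but, long)
r1 <- c(1, 1, 2, 1, 1, 1, 0, 0)
r2 <- c(1, 1, 1, 0, 0, 1, 1, 1)

# Set-based quantities
a  <- sum(r1 > 0 & r2 > 0)   # shared unique words: 4
b  <- sum(r1 > 0) - a        # unique to review 1: 2
c_ <- sum(r2 > 0) - a        # unique to review 2: 2 (c_ avoids masking base::c)

cos_binary <- a / sqrt((a + b) * (a + c_))   # 4 / 6
jaccard    <- a / (a + b + c_)               # 4 / 8

# General cosine on raw counts (keeps the multiplicity of "is")
cos_counts <- sum(r1 * r2) / (sqrt(sum(r1^2)) * sqrt(sum(r2^2)))

round(c(cos_binary = cos_binary, jaccard = jaccard, cos_counts = cos_counts), 3)
#> cos_binary    jaccard cos_counts
#>      0.667      0.500      0.680
```

Jaccard is lower than cosine here because its denominator counts every non-shared word in full.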

Topic modelling: discovering themes automatically

Goal: find latent topics in a corpus without pre-defined categories.

Key idea (LDA): assume each document is a mixture of topics, and each topic is a probability distribution over words. Infer the hidden structure from the observed word counts.

Example: given a corpus of news articles, LDA might discover:

Topic 1: bank, rate, inflation, Fed, interest $\rightarrow$ monetary policy
Topic 2: election, vote, party, candidate, poll $\rightarrow$ politics
Topic 3: trade, tariff, export, deficit, China $\rightarrow$ trade

Each article gets a vector of topic proportions, e.g. 70% monetary policy, 20% politics, 10% trade.

In economics: track what central banks talk about over time, characterise media coverage, discover policy themes without imposing categories.

Key limitation: topics are just word distributions – labelling them requires human judgment.

LDA: two distributions learned from data

LDA simultaneously estimates two matrices from word co-occurrence patterns alone:

$\boldsymbol{\beta}$: topic–word matrix – the probability of each word within each topic (each row sums to 1 over the full vocabulary; only five words are shown):

	trade	tariff	bank	rate	election
Topic 1	0.042	0.031	0.001	0.002	0.001
Topic 2	0.003	0.002	0.038	0.029	0.001
Topic 3	0.002	0.001	0.003	0.002	0.041

You label each topic by inspecting its top words – this step requires human judgment.

$\boldsymbol{\gamma}$: document–topic matrix – the proportion of each topic in each document:

	Topic 1 (trade)	Topic 2 (finance)	Topic 3 (politics)
Article 1	0.70	0.20	0.10
Article 2	0.05	0.85	0.10
Article 3	0.15	0.10	0.75

$\boldsymbol{\gamma}$ is your new feature matrix – use it as an input to regression or to track topic prevalence over time.

Topic modelling in R

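library(quanteda); library(topicmodels); library(tidytext)
library(dplyr)   # group_by(), slice_max()

# Build DTM (my_texts: a character vector of documents, assumed to exist)
dfmat <- corpus(my_texts) |>
  tokens(remove_punct = TRUE) |>
  tokens_remove(stopwords("en")) |>
  dfm() |> dfm_trim(min_termfreq = 5)

# LDA() fails on documents left with no terms after trimming, so drop them
dfmat <- dfm_subset(dfmat, ntoken(dfmat) > 0)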

# Fit LDA -- k is the number of topics (your choice)
lda_model <- LDA(convert(dfmat, to = "topicmodels"), k = 10,
                 control = list(seed = 42))

# Top words per topic
terms(lda_model, 10)

# Or extract into a tidy tibble for ggplot2
tidy(lda_model, matrix = "beta") |>        # word–topic probabilities
  group_by(topic) |>
  slice_max(beta, n = 5)

tidy(lda_model, matrix = "gamma")          # document–topic mixtures

cast_dtm(document, word, n) converts a tidy word-count tibble directly to a DTM if you prefer the tidytext pipeline over quanteda.

Applications

Applications overview

Three examples of text-based methods in economics:

Dictionary-based methods – Baker, Bloom & Davis (2016): Economic Policy Uncertainty Index
Clustering by text similarity – Hoberg & Phillips (2016): Product market definitions from 10-K filings
Topic modelling – Bybee, Kelly, Manela & Xiu (2024): Business news and business cycles

Application 1: Measuring Economic Policy Uncertainty

Can we measure policy uncertainty in the US? How does it look, and does it matter?

Baker, Bloom & Davis (2016)

EPU methodology

Let $i$ be a country–month pair, $j$ a newspaper, and $a$ an article ($j = 1, \ldots, n$; $a = 1, \ldots, m_{ij}$, where $m_{ij}$ is the number of articles newspaper $j$ published in $i$).

$c_{ij} = \frac{1}{m_{ij}} \sum_a \mathbf{1}\!\left[\sum_{t \in \{E, P, U\}} \mathbf{1}[BoW_{ija} \cap K_t \neq \emptyset] = 3\right]$ – share of articles containing at least one keyword from each of:
- $K_E = \{\text{"economic"}, \text{"economy"}\}$
- $K_U = \{\text{"uncertain"}, \text{"uncertainty"}\}$
- $K_P = \{\text{"congress"}, \text{"deficit"}, \text{"federal reserve"}, \text{"legislation"}, \text{"regulation"}, \text{"white house"}\}$
$c_i = \frac{1}{n} \sum_j c_{ij}$ – average across newspapers
$\hat{v}_i = c_i$ is the Economic Policy Uncertainty (EPU) Index (in the paper, each newspaper’s series is also standardised to unit standard deviation before averaging, and the result is normalised to a mean of 100)
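The article-level rule can be sketched in base R for a single newspaper–month. The four headlines are hypothetical, and the keyword sets are deliberately abbreviated to a few single-word terms from the lists above, so this illustrates the counting logic rather than replicating the index.

```r
# Hypothetical articles from one newspaper in one month (illustration only)
articles <- c(
  "Uncertainty over new regulation weighs on the economy",
  "The economy grew strongly last quarter",
  "Deficit legislation leaves the economy facing an uncertain year",
  "Local team wins the championship"
)

# Abbreviated keyword sets (assumption: a subset of the terms listed above)
K <- list(E = "\\beconomy\\b",
          U = "\\b(uncertain|uncertainty)\\b",
          P = "\\b(regulation|deficit|legislation)\\b")

# For each category t, flag articles with at least one keyword from K_t
has_kw <- sapply(K, function(pat) {
  grepl(pat, articles, ignore.case = TRUE, perl = TRUE)
})

# An article counts only if all three categories are present
is_epu <- rowSums(has_kw) == 3

# c_ij: share of this newspaper-month's articles that qualify
c_ij <- mean(is_epu)
c_ij
#> [1] 0.5
```

Articles 1 and 3 satisfy all three conditions; article 2 mentions the economy but neither uncertainty nor policy, so it does not count.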

The EPU Index

Validation of the EPU Index

Computer-coded vs. human-coded EPU

Testing economic hypotheses

Theory: $\uparrow$ Uncertainty $\rightarrow$ $\uparrow$ real option to wait $\rightarrow$ $\downarrow$ investment (Bernanke 1983, Dixit & Pindyck 1994)

Regression: Firm-level estimates exploit differences in industry exposure to government:

\[Y_{it} = \text{FE}_i + \text{FE}_t + \beta\, (INT_i \times EPU_t) + \alpha\, (INT_i \times GS_t) + \varepsilon_{it}\]

where $INT_i$ measures firm $i$’s dependence on government contracts, $GS_t$ is a government-spending control, and $Y_{it}$ is investment or hiring.

Results: EPU and firm-level outcomes

Application 2: Product markets from text (Hoberg & Phillips 2016)

How to define a product market that is endogenous to firms’ choices?

Methodology: Text-based industry classification

Data: Web-crawl of 50,673 firm annual 10-K filings; use the product description section.

Pre-processing:

Keep only nouns (as defined by Webster.com)
Remove words appearing in more than 25% of documents (document share $\lvert\{i : tf_{ij} > 0\}\rvert / N > 25\%$, equivalently $idf_j < \ln 4$)
Tokenise text and generate BoW

Let $P^i \in \{0,1\}^K$ be a binary vector representation of the product description for firm $i$, where $K = |BoW_1 \cup \ldots \cup BoW_{50{,}673}|$.

Pairwise cosine similarity:

\[S_C(P^i, P^j) = \cos(\theta) = \frac{\sum_{k=1}^K P^i_k P^j_k}{\sqrt{\sum_{k=1}^K (P^i_k)^2}\;\sqrt{\sum_{k=1}^K (P^j_k)^2}} = \frac{\text{words in common}}{\text{normalised by length}}\]

Alternative: define $P^i \in \mathbb{R}^K$ using tf-idf weights.
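The pairwise similarity matrix is a single matrix product once each row is scaled to unit length. A minimal sketch in base R, with three hypothetical firms and invented noun lists rather than data from the paper:

```r
# Hypothetical noun lists for three firms (illustration only)
nouns <- list(
  firmA = c("software", "cloud", "database", "server"),
  firmB = c("software", "cloud", "analytics"),
  firmC = c("oil", "gas", "pipeline")
)

# Vocabulary = union of all firms' words (length K in the notation above)
vocab <- sort(unique(unlist(nouns)))

# Binary matrix P: one row per firm, P[i, k] = 1 if firm i uses word k
P <- t(sapply(nouns, function(w) as.integer(vocab %in% w)))
colnames(P) <- vocab

# Scale each row to unit length; the cross-product gives all pairwise cosines
P_unit <- P / sqrt(rowSums(P^2))
S <- P_unit %*% t(P_unit)
round(S, 3)
```

firmA and firmB share 2 of their 4 and 3 words, giving 2/√12 ≈ 0.577, while firmC shares none and scores 0 against both.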

Validation: Text-based vs. traditional classification

Testing Sutton and Schumpeter with text

Text-based product similarity allows Hoberg & Phillips to test whether firms change their product-market position over time.

Sutton-style mechanism: advertising and R&D are endogenous sunk costs that differentiate products and soften competition.

Schumpeterian mechanism: R&D creates new products and technologies, allowing firms to escape existing competitive pressure.

Prediction:

\[ \text{Advertising/R\&D today} \rightarrow \downarrow \text{future similarity} \rightarrow \downarrow \text{number of close rivals} \rightarrow \uparrow \text{profitability} \]

Results: Testing Sutton’s predictions

Application 3: Business news and business cycles

Can business news measure the state of the economy in real time?

Bybee, Kelly, Manela & Xiu (2024)

Data: around 800,000 Wall Street Journal articles, 1984–2017

Method: estimate a topic model and measure monthly news attention to each topic.

$\Rightarrow$ Unstructured news becomes macroeconomic time series.

Figure: Topic taxonomy from Bybee et al. (2024).

From articles to topic attention

LDA produces:

Topic–word distributions
- e.g. Fed, rates, inflation $\rightarrow$ monetary policy
Article–topic distributions
- e.g. article = 70% recession, 20% credit, 10% trade

Aggregate article-topic shares by month:

\[ \text{Attention}_{kt} = \frac{1}{N_t} \sum_{i \in t} \gamma_{ik} \]
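The aggregation is a group mean of the article–topic shares. A sketch in base R, where the five articles, their months and their topic shares are all hypothetical:

```r
# Hypothetical article-topic shares gamma_ik (each row sums to 1)
gamma <- data.frame(
  month     = c("2008-09", "2008-09", "2008-10", "2008-10", "2008-10"),
  recession = c(0.70, 0.50, 0.80, 0.60, 0.70),
  credit    = c(0.20, 0.30, 0.10, 0.30, 0.20),
  trade     = c(0.10, 0.20, 0.10, 0.10, 0.10)
)

# Attention_kt: average share of topic k across the N_t articles in month t
attention <- aggregate(cbind(recession, credit, trade) ~ month,
                       data = gamma, FUN = mean)
attention
```

Each row of the result is one month and each column one topic, so the output is already a panel of monthly time series that can enter a regression or a forecasting model.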

$\Rightarrow$ Topic modelling measures what the news is about, not only whether it is positive or negative.

Key take-aways

Text is high-dimensional; representation choices matter.
Simple pre-processing + tf-idf already provides useful features.
Dictionary methods are quick but require context-specific validation.
Similarity metrics unlock clustering and network applications.
Modern toolkits like quanteda make the full pipeline accessible in R.
LLMs complement (but do not replace) interpretable bag-of-words approaches.
Note: LLM-based classification is increasingly replacing hand-built dictionaries for many tasks — we’ll see this next.