Tokenise raw text and build a Bag-of-Words document–term matrix in R.
Compute and interpret tf–idf weights.
Apply a domain-specific sentiment dictionary and evaluate its accuracy.
Explain two research designs that use dictionary counts or cosine similarity in economics.
Why text?
It has become increasingly affordable to store and process vast quantities of digital text, triggering an explosion of empirical research that leverages text as data.
Historical cost of computer memory and storage
Examples in economics
Finance — predict asset price movements from news (Frank 2004; Tetlock 2007)
Macroeconomics — forecast variation in inflation and unemployment from Google searches
Industrial Organization — product reviews used to study the drivers of consumer decision making
Text as Data – Strengths
“Always on”
Google Trends: Ukraine
Text as Data – Strengths (cont.)
“Non-Reactive”
Google Trends: US abortion
Text as Data – Weaknesses
Incomplete
Inaccessible or sensitive
Non-representative
Confounding
Google Flu Trends – overprediction due to media-driven search behaviour
Note: “capital gains tax” is a trigram – detecting multi-word expressions requires collocation methods (statistical tests of independence).
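One simple collocation statistic is pointwise mutual information (PMI), which compares a bigram's observed frequency with what independence would predict. The sketch below uses base R and a made-up toy corpus; in practice you would use a dedicated function such as `quanteda.textstats::textstat_collocations()`.

```r
# Toy collocation detection via pointwise mutual information (PMI).
# The corpus is invented for illustration.
corpus   <- c("capital gains tax reform", "capital gains tax cut",
              "human capital investment", "income tax reform")
tokens   <- strsplit(corpus, " ")
unigrams <- unlist(tokens)
bigrams  <- unlist(lapply(tokens, function(w)
  paste(head(w, -1), tail(w, -1))))

pmi <- function(bigram) {
  w    <- strsplit(bigram, " ")[[1]]
  p_xy <- mean(bigrams == bigram)    # observed bigram probability
  p_x  <- mean(unigrams == w[1])     # marginal probabilities
  p_y  <- mean(unigrams == w[2])
  log2(p_xy / (p_x * p_y))
}
pmi("capital gains")  # positive PMI: the pair co-occurs more than chance
```

A positive PMI suggests the two words appear together more often than their individual frequencies would predict; formal tests (e.g. likelihood-ratio tests) refine this idea.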
Computing tf-idf in R
```r
library(tidytext)
library(dplyr)

# From a tidy text data frame with columns: document, word
word_counts <- tidy_books |>
  count(document, word, sort = TRUE)

# Compute tf-idf
tf_idf <- word_counts |>
  bind_tf_idf(word, document, n)

# Top distinctive words per document
tf_idf |>
  group_by(document) |>
  slice_max(tf_idf, n = 5)
```
Bag-of-Words representation
When text is represented as the bag (multiset) of its words:
disregard grammar and word order
keep multiplicity (multiset)
Example: Two movie reviews
“This movie is spooky and is original” \(\rightarrow\) BoW_R1 = {This:1, movie:1, is:2, spooky:1, and:1, original:1}

“This movie is original but long” \(\rightarrow\) BoW_R2 = {This:1, movie:1, is:1, original:1, but:1, long:1}

|        | This | movie | is | spooky | and | original | but | long |
|--------|------|-------|----|--------|-----|----------|-----|------|
| BoW_R1 | 1    | 1     | 2  | 1      | 1   | 1        | 0   | 0    |
| BoW_R2 | 1    | 1     | 1  | 0      | 0   | 1        | 1   | 1    |
New words \(\Rightarrow\) larger vocabulary \(\Rightarrow\) higher dimensionality \(\Rightarrow\) pre-processing matters.
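The two-review document–term matrix above can be built in a few lines of base R (a sketch; in practice `tidytext` or `quanteda` would handle tokenisation and sparsity):

```r
# Build a Bag-of-Words document-term matrix for the two reviews (base R sketch)
reviews <- c(R1 = "This movie is spooky and is original",
             R2 = "This movie is original but long")
tokens  <- lapply(reviews, function(x) strsplit(x, " ")[[1]])
vocab   <- unique(unlist(tokens))   # union of all words across documents
dtm     <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
dtm["R1", "is"]  # 2: "is" appears twice in the first review
```

Each row is a document, each column a vocabulary word, and entries are counts, so multiplicity is kept while word order is discarded.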
Stemming vs. lemmatisation
Both reduce words to a base form:
am, are \(\Rightarrow\) be
car, cars, car’s, cars’ \(\Rightarrow\) car
Stemming – a crude heuristic that chops off word endings (Porter, 1980): e.g. ponies \(\Rightarrow\) poni
Lemmatisation – uses vocabulary and morphological analysis to return the dictionary form (lemma): e.g. better \(\Rightarrow\) good
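A toy suffix-stripper conveys the flavour of stemming. The function below roughly mimics the first step of the Porter algorithm; it is NOT the full algorithm (packages such as `SnowballC` implement the real thing):

```r
# Crude suffix-stripping stemmer (illustrative only; use SnowballC::wordStem
# for the actual Porter algorithm)
crude_stem <- function(word) {
  word <- sub("ies$", "i", word)        # ponies   -> poni
  word <- sub("sses$", "ss", word)      # caresses -> caress
  word <- sub("([^s])s$", "\\1", word)  # cars -> car (but leaves "caress")
  word
}
crude_stem(c("cars", "ponies", "caresses"))  # "car" "poni" "caress"
```

Note that stems like "poni" need not be real words; that is expected behaviour for a stemmer, whereas a lemmatiser would return dictionary forms.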
VADER – Valence Aware Dictionary and Sentiment Reasoner
LexiCoder (media text), SentiStrength (social media text)
Dictionaries are context-specific
Loughran and McDonald (2010) used the Harvard-IV-4 TagNeg (H4N) dictionary to classify sentiment for firms’ 10-K filings: three-quarters of the “negative” words were typically not negative in a financial context – e.g. cancer, tax, cost, capital, board, liability, foreign.
Polysemes – words with multiple meanings cause misclassification
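Dictionary-based scoring itself is simple: count the share of a document's tokens that match the word list. The sketch below uses a tiny hand-made negative-word list for illustration; it is not the actual Loughran–McDonald list, which is available via `tidytext::get_sentiments("loughran")`.

```r
# Dictionary-based sentiment: share of negative-word tokens per document.
# The word list and filings below are illustrative, not real data.
lm_negative <- c("loss", "impairment", "litigation", "default")
filings <- c(A = "litigation risk and impairment of assets",
             B = "revenue growth exceeded expectations")
neg_share <- sapply(strsplit(tolower(filings), "\\s+"), function(w)
  mean(w %in% lm_negative))
neg_share  # share of tokens classified negative, per filing
```

Filing A scores 2/6 negative tokens, filing B zero; the hard part is choosing a dictionary whose words actually carry the intended meaning in the target domain.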
Wordscores (Laver et al. 2003): a supervised scaling method.
Assign known scores to reference texts (e.g., left=−1, right=+1 for party manifestos)
Each word inherits a weighted average score based on its frequency across reference texts
Score new documents by averaging their words’ scores
See quanteda.textmodels::textmodel_wordscores() for implementation.
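The three steps above can be sketched directly in base R. This is a minimal illustration with made-up reference texts, not a substitute for `textmodel_wordscores()`:

```r
# Minimal Wordscores sketch (reference texts and scores are invented)
refs <- list(left  = c("welfare", "welfare", "tax", "union"),
             right = c("market", "market", "tax", "enterprise"))
a <- c(left = -1, right = +1)  # known scores of the reference texts

vocab <- unique(unlist(refs))
# Relative frequency of each word within each reference text
F <- sapply(refs, function(w) table(factor(w, levels = vocab)) / length(w))
# Word score = average reference score, weighted by P(reference | word)
S <- as.vector((F / rowSums(F)) %*% a)
names(S) <- vocab

# Score a new document by the frequency-weighted average of its words' scores
score_text <- function(words) {
  w <- words[words %in% vocab]
  f <- table(factor(w, levels = vocab)) / length(w)
  sum(f * S)
}
score_text(c("welfare", "union", "tax"))  # negative: closer to "left"
```

Words used only by one reference text inherit its score exactly (e.g. "welfare" scores −1), while words used equally by both (e.g. "tax") score zero and are uninformative.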
A note on LLMs and bag-of-words
Large Language Models (GPT, Claude, LLaMA, etc.) have transformed NLP since ~2018:
They capture word order, context, and semantics – everything BoW discards
Zero-shot and few-shot classification can replace hand-built dictionaries
Pre-trained embeddings provide richer document representations than tf-idf
But bag-of-words methods remain valuable:
Transparent, reproducible, and easy to audit
Low computational cost – no GPU required
Ideal when the research design needs interpretable features (e.g., which words drive the EPU index?)
Many landmark studies (Baker et al., Hoberg & Phillips) pre-date LLMs and their results hold up
\(\Rightarrow\) Think of BoW and LLMs as complements, not substitutes. Use the simpler tool when it suffices; reach for LLMs when context and nuance matter.
Applications
Applications overview
Two examples of text-based classification in economics:
Dictionary-based methods – Baker, Bloom & Davis (2016): Economic Policy Uncertainty Index
Clustering by text similarity – Hoberg & Phillips (2016): Product market definitions from 10-K filings
Can we measure policy uncertainty in the US? How does it look, and does it matter?
Baker, Bloom & Davis (2016)
EPU methodology
Let \(i\) be a country–month pair, \(j\) a newspaper, and \(a\) an article (\(j = 1, \ldots, n\); \(a = 1, \ldots, m_j\)).
\(c_{ij} = \frac{1}{m_j} \sum_a \mathbf{1}\!\left[\sum_{t \in \{E, P, U\}} \mathbf{1}[BoW_{ija} \cap K_t \neq \emptyset] = 3\right]\) – share of newspaper \(j\)’s articles containing at least one keyword from each of the three sets \(K_E\) (economy), \(K_P\) (policy), and \(K_U\) (uncertainty)
\(c_i = \frac{1}{n} \sum_j c_{ij}\) – average across newspapers
\(\hat{v}_i = c_i\) is the Economic Policy Uncertainty (EPU) Index
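The article-level classification behind \(c_{ij}\) is a triple dictionary hit. The base R sketch below uses illustrative keyword sets and invented headlines, not the actual BBD lists or corpus:

```r
# Toy EPU-style classifier: an article counts only if it contains at least
# one term from EACH of the E, P and U sets (keywords illustrative).
K <- list(E = c("economy", "economic"),
          U = c("uncertain", "uncertainty"),
          P = c("congress", "regulation", "deficit"))
articles <- c("economic uncertainty over new regulation",
              "economy grows despite congress gridlock",
              "uncertainty in weather forecasts")
hits <- sapply(articles, function(a) {
  w <- strsplit(tolower(a), "\\s+")[[1]]
  all(sapply(K, function(k) any(w %in% k)))
})
mean(hits)  # share of E&P&U articles: only the first qualifies here
```

Requiring a hit in all three sets is what filters out articles like the third one, which mentions uncertainty but has nothing to do with economic policy.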
The EPU Index
Validation of the EPU Index
Computer-coded vs. human-coded EPU
Testing economic hypotheses
Theory: \(\uparrow\) uncertainty \(\rightarrow\) \(\uparrow\) value of the real option to wait \(\rightarrow\) \(\downarrow\) investment (Bernanke 1983; Dixit & Pindyck 1994)
Regression: Firm-level estimates exploit differences in industry exposure to government, via an interaction term of the form
\[Y_{it} = \alpha_i + \delta_t + \beta \,(EPU_t \times INT_i) + \varepsilon_{it}\]
where \(INT_i\) measures firm \(i\)’s dependence on government contracts, and \(Y_{it}\) is investment or hiring.
Results: EPU and firm-level outcomes
Application 2: Product markets from text (Hoberg & Phillips 2016)
How to define a product market that is endogenous to firms’ choices?
Methodology: Text-based industry classification
Data: Web-crawl of 50,673 firm annual 10-K filings; use the product description section.
Pre-processing:
Keep only nouns (as defined by Webster.com)
Remove words appearing in more than 25% of documents (document frequency \(df_k/N > 0.25\))
Tokenise text and generate BoW
Let \(P^i \in \{0,1\}^K\) be a binary vector representation of the product description for firm \(i\), where \(K = |BoW_1 \cup \ldots \cup BoW_{50{,}673}|\).
Pairwise cosine similarity:
\[S_C(P^i, P^j) = \cos(\theta) = \frac{\sum_{k=1}^K P^i_k P^j_k}{\sqrt{\sum_{k=1}^K (P^i_k)^2}\;\sqrt{\sum_{k=1}^K (P^j_k)^2}} = \frac{\text{words in common}}{\text{normalised by length}}\]
Alternative: define \(P^i \in \mathbb{R}^K\) using tf-idf weights.
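The cosine formula above is a one-liner in base R. The two binary vectors below are invented, standing in for firms' product-description vocabularies:

```r
# Cosine similarity between binary BoW vectors (illustrative vectors)
cos_sim <- function(p, q) sum(p * q) / (sqrt(sum(p^2)) * sqrt(sum(q^2)))

P1 <- c(1, 1, 0, 1, 0)  # firm 1: word present (1) / absent (0) over vocabulary
P2 <- c(1, 1, 1, 0, 0)  # firm 2
cos_sim(P1, P2)         # 2 shared words / (sqrt(3) * sqrt(3)) = 2/3
```

For binary vectors the numerator counts shared words and each norm is the square root of a firm's vocabulary size, so longer descriptions are not mechanically scored as more similar. At scale, `quanteda.textstats::textstat_simil(x, method = "cosine")` computes all pairs on a sparse matrix.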
Validation: Text-based vs. traditional classification
Results: Testing Sutton’s predictions
Topic Modeling: Discovering Themes Automatically
LDA (Latent Dirichlet Allocation) discovers topics without pre-defined categories:
```r
library(quanteda)
library(topicmodels)

# Build a document-feature matrix
dfmat <- corpus(my_texts) |>
  tokens(remove_punct = TRUE) |>
  tokens_remove(stopwords("en")) |>
  dfm() |>
  dfm_trim(min_termfreq = 5)

# Fit LDA with 10 topics
lda_model <- LDA(convert(dfmat, to = "topicmodels"), k = 10)

# Top words per topic
terms(lda_model, 10)
```
Each document is a mixture of topics; each topic is a mixture of words. Useful for exploring large corpora without pre-defined categories.
Key take-aways
Text is high-dimensional; representation choices matter.