06. Text as Data

Data Science for Economists

Irene Iodice

2026-03-01

Learning objectives

By the end of today you should be able to:

  1. Tokenise raw text and build a Bag-of-Words document–term matrix in R.
  2. Compute and interpret tf–idf weights.
  3. Apply a domain-specific sentiment dictionary and evaluate its accuracy.
  4. Explain two research designs that use dictionary counts or cosine similarity in economics.

Why text?

It has become increasingly affordable to store and process vast quantities of digital text, triggering an explosion of empirical research that leverages text as data.

Historical cost of computer memory and storage

Examples in economics

  1. Finance — predict asset price movements from news (Frank 2004; Tetlock 2007)
  2. Macroeconomics — forecast variation in inflation and unemployment from Google searches
  3. Industrial Organization — product reviews used to study the drivers of consumer decision making

Text as Data – Strengths

  • “Always on”

Google Trends: Ukraine

Text as Data – Strengths (cont.)

  • “Non-Reactive”

Google Trends: US abortion

Text as Data – Weaknesses

  • Incomplete
  • Inaccessible or sensitive
  • Non-representative
  • Confounding

Google Flu Trends – overprediction due to media-driven search behaviour

Read the article here

Dimensionality challenge

Why is text data hard?

Imagine a document with \(w\) tokens, each drawn from a vocabulary of \(p\) distinct words.

How many unique documents exist?

\[p^w\]

This is an astronomically large number (a combinatorial explosion!).

Implication: Text data is high-dimensional. To analyse it, we must:

  • engineer features (e.g., BoW, tf–idf);
  • reduce noise (stop words, stemming);
  • select relevant signals.

Texts \(\rightarrow\) Feature matrix \(\rightarrow\) Analysis

Source: Kenneth Benoit, Course on Quantitative Text Analysis (TCD 2016)

Roadmap

  1. Representing Text
    • Bag-of-Words and beyond
    • Pre-processing: tokenisation, stop-word removal, stemming/lemmatisation
    • Feature weighting (tf-idf)
  2. Analysing Text
    • Dictionary-based classification
    • Similarity measures and simple clustering
  3. Applications in Economics

Representing Text as Data

How to represent news?

library(wordcloud)

Source: Financial Times Blog, 24 March 2020

What would you have done differently?
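One way to build such a cloud with the wordcloud package; the FT article text is not reproduced here, so a toy stand-in vector is used:

```r
library(wordcloud)

# Toy stand-in for the article text (the FT corpus is not included)
ft_text <- c("coronavirus hits global markets",
             "markets fall as coronavirus spreads",
             "central banks respond to coronavirus shock")

# Crude tokenisation: split on non-letters, lower-case, drop short tokens
words <- tolower(unlist(strsplit(ft_text, "[^A-Za-z']+")))
words <- words[nchar(words) > 2]
freqs <- sort(table(words), decreasing = TRUE)

# Word size proportional to raw frequency
wordcloud(names(freqs), as.numeric(freqs),
          max.words = 100, random.order = FALSE)
```

Note that raw frequencies over-weight common words; weighting by tf-idf (covered below) is one answer to "what would you have done differently?".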

Processing pipeline

  1. Split corpus into documents (e.g. one article per day).
  2. Tokenise documents (words, n-grams, or characters).
  3. Normalise and reduce: lower-case, remove punctuation/stop words, stemming or lemmatisation.

Tokenisation and stop-word removal

Tokenisation breaks text into smaller units – words, characters, or n-grams.

library(gutenbergr); library(tidytext)
gutenberg_download(1184)[1, ] |>
  unnest_tokens(input = text, output = word, token = "words")
#> # A tibble: 5 x 2
#>   gutenberg_id word
#>          <int> <chr>
#> 1         1184 the
#> 2         1184 count
#> 3         1184 of
#> 4         1184 monte
#> 5         1184 cristo

Stop-word removal strips high-frequency, low-information words using pre-built dictionaries:

library(hcandersenr); library(tidytext)
library(dplyr); library(stopwords)
tidy_fir_tree <- hca_fairytales() |>
  filter(book == "The fir tree") |>
  unnest_tokens(word, text) |>
  filter(!(word %in% stopwords(source = "snowball")))

Note: “capital gains tax” is a trigram – detecting multi-word expressions requires collocation methods (statistical tests of independence).

Computing tf-idf in R

library(tidytext)
library(dplyr)

# From a tidy text data frame with columns: document, word
word_counts <- tidy_books |>
  count(document, word, sort = TRUE)

# Compute tf-idf
tf_idf <- word_counts |>
  bind_tf_idf(word, document, n)

# Top distinctive words per document
tf_idf |> group_by(document) |> slice_max(tf_idf, n = 5)

Bag-of-Words representation

When text is represented as the bag (multiset) of its words:

  • disregard grammar and word order
  • keep multiplicity (multiset)

Example: Two movie reviews

  1. “This movie is spooky and is original” \(\rightarrow\) BoW_R1 = {This:1, movie:1, is:2, spooky:1, and:1, original:1}
  2. “This movie is original but long” \(\rightarrow\) BoW_R2 = {This:1, movie:1, is:1, original:1, but:1, long:1}
         This  movie  is  spooky  and  original  but  long
BoW_R1      1      1   2       1    1         1    0     0
BoW_R2      1      1   1       0    0         1    1     1

New words \(\Rightarrow\) larger vocabulary \(\Rightarrow\) higher dimensionality \(\Rightarrow\) pre-processing matters.

Stemming vs. lemmatisation

Both reduce words to a base form:

  • am, are \(\Rightarrow\) be
  • car, cars, car’s, cars’ \(\Rightarrow\) car

Stemming – a crude heuristic that chops off word endings (Porter, 1980):

library(textstem)
x <- c('doggies', ',', 'they', "aren't", 'Joyfully', 'running', '.')
stem_words(x)
#> [1] "doggi"  ","  "thei"  "aren't"  "Joyfulli"  "run"  "."

Lemmatisation – uses vocabulary and morphological analysis:

lemmatize_words(x)
#> [1] "doggy"  ","  "they"  "aren't"  "Joyfully"  "run"  "."

Lemmatisation is more accurate but slower; stemming suffices for many BoW applications.

tf–idf: Feature weighting

Goal: down-weight words common across all documents; up-weight distinctive words.

For word \(j\) in document \(i\):

\[\text{tf-idf}_{ij} = tf_{ij} \times idf_j\]

  • \(tf_{ij}\): count (or relative frequency) of word \(j\) in document \(i\)
  • \(idf_j = \ln\!\left(\frac{N}{\lvert\{i : tf_{ij} > 0\}\rvert}\right)\) – log-inverse share of documents containing \(j\)

Worked example: 100 party manifestos, 1 000 words each. Document 1 contains “inequality” 16 times; 40 manifestos mention it.

  • \(tf = 16/1000 = 0.016\)
  • \(idf = \ln(100/40) = \ln(2.5) \approx 0.916\)
  • tf-idf \(= 0.016 \times 0.916 \approx 0.0147\)

High tf-idf \(\Leftrightarrow\) high within-document frequency and low cross-document frequency \(\rightarrow\) filters out common terms.
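The worked example above can be checked in two lines of base R:

```r
# "inequality" appears 16 times in a 1,000-word manifesto,
# and 40 of the 100 manifestos mention it at least once
tf  <- 16 / 1000        # relative term frequency in document 1
idf <- log(100 / 40)    # natural-log inverse document frequency
tf * idf                # approximately 0.0147
```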

The Document–Term Matrix (DTM)

After tokenisation, stop-word removal, stemming/lemmatisation, and (optionally) tf-idf weighting, your corpus becomes a document–term matrix:

\[\mathbf{C}_{n \times p} = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1p} \\ c_{21} & c_{22} & \cdots & c_{2p} \\ \vdots & & \ddots & \vdots \\ c_{n1} & c_{n2} & \cdots & c_{np} \end{pmatrix}\]

  • \(n\) documents (rows), \(p\) vocabulary terms (columns)
  • Entries are raw counts, relative frequencies, or tf-idf weights
  • Typically very sparse – most entries are zero

This matrix is the starting point for all downstream analysis: dictionaries, regression, clustering, topic models.

Similarity across documents

         This  movie  is  spooky  and  original  but  long
BoW_R1      1      1   2       1    1         1    0     0
BoW_R2      1      1   1       0    0         1    1     1

Treating each bag as a set of word types, define \(a = |BoW_{R1} \cap BoW_{R2}|\), \(b = |BoW_{R1}| - a\), \(c = |BoW_{R2}| - a\).

  1. Cosine similarity: \[s_{\cos} = \frac{a}{\sqrt{(a+b)(a+c)}}\]

  2. Jaccard similarity: \[s_{\text{jacc}} = \frac{a}{a+b+c}\]

How much in our example? What changes after stemming?
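For the two reviews above (as sets of word types, so a = 4 shared types and b = c = 2), both measures can be computed directly:

```r
r1 <- c("this", "movie", "is", "spooky", "and", "original")
r2 <- c("this", "movie", "is", "original", "but", "long")

a <- length(intersect(r1, r2))   # shared word types: 4
b <- length(setdiff(r1, r2))     # types only in review 1: 2
c <- length(setdiff(r2, r1))     # types only in review 2: 2

cos_sim  <- a / sqrt((a + b) * (a + c))   # 4/6, about 0.667
jacc_sim <- a / (a + b + c)               # 4/8 = 0.5
```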

quanteda 4.0 – a modern R text toolkit

The quanteda package (Benoit et al.) provides an integrated, high-performance pipeline for text analysis in R:

library(quanteda)

# Build a corpus, tokenise, create DTM in three lines
# (my_texts: a character vector of raw documents)
corp  <- corpus(my_texts)
toks  <- tokens(corp, remove_punct = TRUE) |>
           tokens_remove(stopwords("en"))
dfmat <- dfm(toks) |> dfm_tfidf()

Key features:

  • Fast C++ back-end; handles millions of tokens
  • Built-in dictionaries, collocations, n-grams
  • Seamless integration with quanteda.textplots, quanteda.textstats, and quanteda.textmodels
  • Works naturally with tidy workflows via tidytext conversion

See: quanteda.io

Statistical Methods

Overview of methods

Grimmer and Stewart (2013), expanded by Kenneth Benoit

Classifying and scaling documents

Two types of measurement schemes:

  1. Classification of documents – involves categorical (often binary) measures
  2. Scaling of documents – involves a continuous measure

Common goal: assign a text to a particular category, or a particular position on a scale.

From text tokens to attributes

Let \(\mathbf{C}\) be the document–token matrix and \(\mathbf{V}\) the matrix of attributes.

  • \(\mathbf{C}^{\text{train}}\): documents for which we observe \(\mathbf{V}^{\text{train}}\)
  • \(\mathbf{C}^{\text{test}}\): documents for which \(\mathbf{V}\) is unobserved
  • \(\mathbf{C}^{\text{train}}\) is \(n^{\text{train}} \times p\); \(\mathbf{V}^{\text{train}}\) is \(n^{\text{train}} \times k\)

How to map \(\mathbf{C}\) to predictions \(\hat{\mathbf{V}}\)?

Main methods in the economics literature

From “Text as Data” by Gentzkow et al. (link):

  1. Dictionary-Based Methods – a pre-specified dictionary characterises \(f(\cdot)\), s.t. \(\hat{v}_i = f(c_i)\)
  2. Text Regression Methods – model \(p(v_i \mid c_i)\), use \(\mathbf{C}^{\text{train}}\), \(\mathbf{V}^{\text{train}}\) to estimate \(E(v_i \mid c_i)\)
  3. Generative Models – model \(p(c_i \mid v_i)\), e.g. fit \(f_\theta(c_i, v_i)\) and invert to predict \(v_i\)
  4. Word Embeddings – representation of words in vector space, e.g. Word2Vec

Dictionary-based methods

Classifying documents when categories are known:

  1. Identify a set of words that correspond to each category
    • thesaurus: vote = {poll, suffrage, franchis*, ballot*, ^vot}
    • sentiment: positive or negative
    • emotions: sad, happy, angry, anxious
    • topics: economics, culture, etc.
  2. Count how many times these words appear in each document
  3. Normalise by document length
  4. Validate:
    • Code a few documents manually and check alignment
    • Check sensitivity to exclusion of specific words
    • Decide sample size based on the power of your test
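Steps 2 and 3 (count dictionary hits, normalise by document length) can be sketched in base R; the dictionary and documents below are toy examples:

```r
# Hypothetical sentiment dictionary and toy tokenised documents
positive <- c("good", "great", "excellent", "happy")
docs <- list(d1 = c("a", "great", "and", "good", "film"),
             d2 = c("a", "long", "film"))

# Share of each document's tokens matching the dictionary
score <- sapply(docs, function(w) mean(w %in% positive))
score   # d1 = 2/5 = 0.4, d2 = 0
```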

Existing dictionaries

Existing lists of words associated with sentiment, emotions, topics, etc.

  • General Inquirer (Stone et al., 1966): proprietary, but a sample is accessible via:
library("qdapDictionaries")
data(power.words)
power.words[1:8]
#> [1] "abolish"        "accomplish"     "accomplishment" "accord"
#> [5] "achievement"    "adjudication"   "administer"     "administration"

Open-source alternatives:

  • VADER – Valence Aware Dictionary and Sentiment Reasoner
  • LexiCoder (media text), SentiStrength (social media text)

Dictionaries are context-specific

Loughran and McDonald (2010) used the Harvard-IV-4 TagNeg (H4N) dictionary to classify sentiment for firms’ 10-K filings: three-quarters of the “negative” words were typically not negative in a financial context – e.g. cancer, tax, cost, capital, board, liability, foreign.

  • Polysemes – words with multiple meanings cause misclassification
  • H4N also lacked negative financial words: felony, litigation, restated, misstatement, unanticipated

\(\Rightarrow\) Always validate dictionaries against your specific domain.

Building your own dictionary

  1. Identify “extreme texts” with known positions
  2. Search for differentially occurring words using word frequencies
  3. Use these words (or their lemmas) as category markers

Example: Contingency tables on keyword use in Parliament meetings

                    Government  Opposition
labor flexibility          100          20
environment                115          25

Test independence using \(\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\), where expected frequencies assume keywords are independent of group.
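This test can be run directly with base R's chisq.test(); setting correct = FALSE matches the formula above (no Yates continuity correction):

```r
tab <- matrix(c(100, 20,
                115, 25),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("labor flexibility", "environment"),
                              c("Government", "Opposition")))

chi <- chisq.test(tab, correct = FALSE)
chi$statistic   # about 0.064: no evidence keyword use differs by group
chi$p.value     # well above 0.05
```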

Regular expressions

Regex = algebraic notation for pattern-matching in strings. Essential for text cleaning and dictionary construction. Key patterns: [0-9] (digits), ^ (start), $ (end), .* (any characters).

Reference: MIT regex cheatsheet (PDF)
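The key patterns in action with base R's grepl() and sub():

```r
x <- c("vote2024", "voting", "devote", "ballot")

grepl("[0-9]", x)      # contains a digit:  TRUE FALSE FALSE FALSE
grepl("^vot", x)       # starts with "vot": TRUE TRUE FALSE FALSE
grepl("ot$", x)        # ends with "ot":    FALSE FALSE FALSE TRUE
sub("[0-9]+$", "", x)  # strip trailing digits: "vote2024" becomes "vote"
```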

Scaling using Wordscores

Wordscores (Laver et al. 2003): a supervised scaling method.

  • Assign known scores to reference texts (e.g., left=−1, right=+1 for party manifestos)
  • Each word inherits a weighted average score based on its frequency across reference texts
  • Score new documents by averaging their words’ scores

See quanteda.textmodels::textmodel_wordscores() for implementation.
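A minimal sketch with quanteda.textmodels; the reference texts and scores here are purely illustrative:

```r
library(quanteda)
library(quanteda.textmodels)

# Two reference texts with known positions and one "virgin" text to score
txts <- c(left  = "tax the rich and expand public healthcare",
          right = "cut taxes deregulate and shrink government",
          new   = "expand healthcare and cut taxes")
dfmat <- dfm(tokens(corpus(txts)))

# NA marks the unscored text; words inherit frequency-weighted scores
ws    <- textmodel_wordscores(dfmat, y = c(-1, 1, NA))
preds <- predict(ws, newdata = dfmat)
preds["new"]   # estimated position of the new text on the [-1, 1] scale
```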

A note on LLMs and bag-of-words

Large Language Models (GPT, Claude, LLaMA, etc.) have transformed NLP since ~2018:

  • They capture word order, context, and semantics – everything BoW discards
  • Zero-shot and few-shot classification can replace hand-built dictionaries
  • Pre-trained embeddings provide richer document representations than tf-idf

But bag-of-words methods remain valuable:

  • Transparent, reproducible, and easy to audit
  • Low computational cost – no GPU required
  • Ideal when the research design needs interpretable features (e.g., which words drive the EPU index?)
  • Many landmark studies (Baker et al., Hoberg & Phillips) pre-date LLMs and their results hold up

\(\Rightarrow\) Think of BoW and LLMs as complements, not substitutes. Use the simpler tool when it suffices; reach for LLMs when context and nuance matter.

Applications

Applications overview

Two examples of text-based classification in economics:

  1. Dictionary-based methods – Baker, Bloom & Davis (2016): Economic Policy Uncertainty Index
  2. Clustering by text similarity – Hoberg & Phillips (2016): Product market definitions from 10-K filings

Application 1: Measuring Economic Policy Uncertainty

Can we measure policy uncertainty in the US? How does it look, and does it matter?

Baker, Bloom & Davis (2016)

EPU methodology

Let \(i\) be a country–month pair, \(j\) a newspaper, and \(a\) an article (\(j = 1, \ldots, n\); \(a = 1, \ldots, m_j\)).

  • \(c_{ij} = \frac{1}{m_j} \sum_a \mathbf{1}\!\left[\sum_{t \in \{E, P, U\}} \mathbf{1}[BoW_{ija} \cap K_t \neq \emptyset] = 3\right]\) – share of articles containing at least one keyword from each of:
    • \(K_E = \{\text{"economy"}, \text{"economics"}\}\)
    • \(K_U = \{\text{"uncertain"}, \text{"uncertainty"}\}\)
    • \(K_P = \{\text{"regulation"}, \text{"deficit"}, \text{"federal reserve"}, \text{"legislation"}, \text{"white house"}\}\)
  • \(c_i = \frac{1}{n} \sum_j c_{ij}\) – average across newspapers
  • \(\hat{v}_i = c_i\) is the Economic Policy Uncertainty (EPU) Index
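The article-level rule can be sketched in base R, using the keyword sets above; articles are assumed lower-cased, and the two sample articles are invented:

```r
K_E <- c("economy", "economics")
K_U <- c("uncertain", "uncertainty")
K_P <- c("regulation", "deficit", "federal reserve",
         "legislation", "white house")

# TRUE if the article contains at least one keyword from the set
hits   <- function(text, keys) any(sapply(keys, grepl, x = text, fixed = TRUE))
# EPU article: at least one hit from each of the three sets
is_epu <- function(text) hits(text, K_E) && hits(text, K_U) && hits(text, K_P)

articles <- c("uncertainty about the deficit weighs on the economy",
              "local team wins championship")
mean(sapply(articles, is_epu))   # share of EPU articles: 0.5
```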

The EPU Index

Validation of the EPU Index

Computer-coded vs. human-coded EPU

Testing economic hypotheses

Theory: \(\uparrow\) Uncertainty \(\rightarrow\) \(\uparrow\) real option to wait \(\rightarrow\) \(\downarrow\) investment (Bernanke 1983, Dixit & Pindyck 1994)

Regression: Firm-level estimates exploit differences in industry exposure to government:

\[Y_{it} = \text{FE}_i + \text{FE}_t + \beta\, (INT_i \times EPU_t) + \alpha\, (INT_i \times GS_t) + \varepsilon_{it}\]

where \(INT_i\) measures firm \(i\)’s dependence on government contracts, and \(Y_{it}\) is investment or hiring.

Results: EPU and firm-level outcomes

Application 2: Product markets from text (Hoberg & Phillips 2016)

How to define a product market that is endogenous to firms’ choices?

Methodology: Text-based industry classification

Data: Web-crawl of 50,673 firm annual 10-K filings; use the product description section.

Pre-processing:

  • Keep only nouns (as defined by Webster.com)
  • Remove words appearing in more than 25% of documents (document frequency \(> 25\%\))
  • Tokenise text and generate BoW

Let \(P^i \in \{0,1\}^K\) be a binary vector representation of the product description for firm \(i\), where \(K = |BoW_1 \cup \ldots \cup BoW_{50{,}673}|\).

Pairwise cosine similarity:

\[S_C(P^i, P^j) = \cos(\theta) = \frac{\sum_{k=1}^K P^i_k P^j_k}{\sqrt{\sum_{k=1}^K (P^i_k)^2}\;\sqrt{\sum_{k=1}^K (P^j_k)^2}} = \frac{\text{words in common}}{\text{normalised by length}}\]

Alternative: define \(P^i \in \mathbb{R}^K\) using tf-idf weights.
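All pairwise cosine similarities can be computed at once by row-normalising the binary matrix and taking a single matrix product; a toy sketch with three hypothetical firms and a five-term vocabulary:

```r
# Binary term-indicator matrix: rows are firms, columns are vocabulary terms
P <- rbind(firm1 = c(1, 1, 0, 1, 0),
           firm2 = c(1, 1, 1, 0, 0),
           firm3 = c(0, 0, 0, 1, 1))

# Divide each row by its Euclidean length, then one product gives all cosines
Pn  <- P / sqrt(rowSums(P^2))
sim <- Pn %*% t(Pn)
round(sim, 3)   # e.g. sim["firm1", "firm2"] = 2/3; diagonal entries are 1
```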

Validation: Text-based vs. traditional classification

Results: Testing Sutton’s predictions

Topic Modeling: Discovering Themes Automatically

LDA (Latent Dirichlet Allocation) discovers topics without pre-defined categories:

library(quanteda)
library(topicmodels)

# Build a document-feature matrix
dfmat <- corpus(my_texts) |>
  tokens(remove_punct = TRUE) |>
  tokens_remove(stopwords("en")) |>
  dfm() |>
  dfm_trim(min_termfreq = 5)

# Fit LDA with 10 topics
lda_model <- LDA(convert(dfmat, to = "topicmodels"), k = 10)

# Top words per topic
terms(lda_model, 10)

Each document is a mixture of topics; each topic is a mixture of words. Useful for exploring large corpora without pre-defined categories.

Key take-aways

  • Text is high-dimensional; representation choices matter.
  • Simple pre-processing + tf-idf already provides useful features.
  • Dictionary methods are quick but require context-specific validation.
  • Similarity metrics unlock clustering and network applications.
  • Modern toolkits like quanteda make the full pipeline accessible in R.
  • LLMs complement (but do not replace) interpretable bag-of-words approaches.
  • Note: LLM-based classification is increasingly replacing hand-built dictionaries for many tasks — we’ll see this next.

Next: Large Language Models

From bag-of-words to transformers — and building one yourself

Further reading

  • Jurafsky & Martin, Speech and Language Processing (draft 3e).
  • Gentzkow, Kelly & Taddy (2019), “Text as Data” (JEL).
  • Benoit et al., quanteda documentation and tutorials.
  • Ferrario and Stancheva (2022), “Eliciting People’s First-Order Concerns: Text Analysis of Open-Ended Survey Questions” (link).
  • Ash & Hansen (2023), “Text Algorithms in Economics”, Annual Review of Economics.
  • Ash & Hansen (2024), “Large Language Models in Economics”, CEPR DP19479.