06. Text as Data

Data Science for Economists

Irene Iodice

2026-03-01

Learning objectives

By the end of today you should be able to:

  1. Tokenise raw text and build a Bag-of-Words document–term matrix in R.
  2. Compute and interpret tf–idf weights.
  3. Apply a domain-specific sentiment dictionary and evaluate its accuracy.
  4. Explain two research designs that use dictionary counts or cosine similarity in economics.

Why text?

Storing and processing large volumes of digital text has become far cheaper over time, which has fuelled rapid growth in empirical research that uses text as data.

Figure: Historical cost of computer memory and storage

Examples in economics

  • Finance — Tetlock (2007): pessimism in the WSJ “Abreast of the Market” column predicts next-day stock market declines and subsequent reversals

  • Labour — Hershbein & Kahn (2018): job posting text shows skill requirements rose faster in cities hit hardest by the Great Recession

  • Political economy — Gentzkow & Shapiro (2010): compare newspaper text to congressional speech to measure media slant; find strong demand-side pressure from readers

  • Macroeconomics — Baker, Bloom & Davis (2016): newspaper keyword counts measure Economic Policy Uncertainty → Application 1

  • Macroeconomics / finance — Bybee, Kelly, Manela & Xiu (2024): topic model applied to WSJ articles tracks business-cycle themes in real time

  • Public finance / surveys — Ferrario & Stantcheva (2022): open-ended survey responses reveal people’s first-order concerns about tax policy

  • Industrial Organisation — Hoberg & Phillips (2016): cosine similarity of 10-K product descriptions defines dynamic industry boundaries → Application 2

Text as Data – Strengths

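  • “Always on”: text is produced and recorded continuously, so unexpected events can be studied without planning data collection in advance.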

Figure: Google Trends (Ukraine)

Text as Data – Strengths (cont.)

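  • “Non-reactive”: people generate the text without knowing they are being studied, so measurement is less likely to change their behaviour.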

Figure: Google Trends (US abortion)

Text as Data – Weaknesses

  • Incomplete
  • Inaccessible or sensitive
  • Non-representative
  • Confounding