06. Text as Data

Data Science for Economists

Irene Iodice

2026-03-01

Learning objectives

By the end of today you should be able to:

  1. Tokenise raw text and build a Bag-of-Words document–term matrix in R.
  2. Compute and interpret tf–idf weights.
  3. Apply a domain-specific sentiment dictionary and evaluate its accuracy.
  4. Explain two research designs that use dictionary counts or cosine similarity in economics.

Why text?

It has become increasingly affordable to store and process vast quantities of digital text, triggering an explosion of empirical research that leverages text as data.

Figure: Historical cost of computer memory and storage

Examples in economics

  1. Finance: predicting asset price movements from news (Antweiler and Frank (2004); Tetlock (2007))
  2. Macroeconomics: forecasting variation in inflation and unemployment from Google searches
  3. Industrial Organization: using product reviews to study the drivers of consumer decision-making

Text as Data – Strengths

  • “Always on”: text is recorded continuously, so data exist from before, during, and after unexpected events

Figure: Google Trends: Ukraine

Text as Data – Strengths (cont.)

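  • “Non-reactive”: people are typically not aware of being studied, so they are less likely to change their behaviour in response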

Figure: Google Trends: US abortion

Text as Data – Weaknesses

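  • Incomplete: the text rarely contains every variable the research question needs (e.g. author characteristics or outcomes)
  • Inaccessible or sensitive: much text is held by firms or governments, or contains private information
  • Non-representative: the people who write or search are not a random sample of the population
  • Confounding: what is written, and what we observe, is shaped by other factors such as platform design and algorithms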