04. Web Scraping & APIs

Data Science for Economists

2026-03-01

Session roadmap

  1. Why web data? Applications in economics
  2. APIs: accessing structured data
  3. Web scraping: extracting data from HTML
  4. Legal and ethical considerations
  5. Hands-on examples

Why Web Data?

When to use web data

Find data that does not exist in conventional databases:

  • Data generated by users at scale (Amazon, Uber, Reddit)
  • Data that measures online activity (job postings, prices, reviews)
  • Data aggregated on a single site from many sources (Wikipedia lists)

Before you scrape, always ask:

  • Does this data already exist in a conventional source?
  • Would the site be willing to share data or partner with you?
  • Do they have an API? (Much easier than scraping.)

The Billion Prices Project

“By 2010, we were collecting 5 million prices every day.”

— Alberto Cavallo and Roberto Rigobon (2016)

  • Daily web-scraped prices since 2008 from hundreds of retailers in 60+ countries
  • Detected Argentina’s inflation manipulation: official CPI < 10%, online prices > 25%
  • Online prices captured the post-Lehman Brothers price declines about two months before the official CPI did

Online vs official CPI

  • ~60% of CPI expenditure weights can be found online (goods, food, fuel; fewer services)
  • Short-run discrepancies (mainly developing countries) but strong medium/long-term co-movement

Other applications of web data

Job postings

  • Kuhn & Shen (2013). Gender discrimination in job ads. QJE
  • Deming & Kahn (2018). Skill requirements across firms and labor markets. JOLE
  • Acemoglu et al. (2020). AI and jobs: Evidence from online vacancies.

Rental and real estate

  • Horn & Merante (2017). Is home sharing driving up rents? JHE

Online prices

  • Cavallo (2017). Are online and offline prices similar? AER
  • Gorodnichenko & Talavera (2017). Price setting in online markets. AER
  • Cavallo (2018). Scraped data and sticky prices. RESTAT

Consumption behaviors

  • Davis & Dingel (2016). Yelp and ethnic segregation in consumption

Conventional data sources

Key databases for economists (non-exhaustive):

Source                  Coverage
IMF / World Bank WDI    Macro aggregates, development indicators
FRED                    US economy: GDP, interest rates, monetary aggregates
OECD / Eurostat         Developed countries, EU-level data
CEPII (BACI, MACMAP)    Trade flows, tariffs, gravity variables
Penn World Table        Real national accounts, cross-country comparisons

Web data complements these — it fills gaps in timeliness, granularity, and coverage.

APIs

What is an API?

Application Programming Interface: a structured way to request data from a server

API data formats

  1. XML (Extensible Markup Language) – nested tag structure
<job-offers>
    <job>
       <job-title> Data Analyst </job-title>
       <location> Berlin </location>
    </job>
</job-offers>
  2. JSON (JavaScript Object Notation) – key-value pairs
{"job-offers": [{
    "job-title": "Data Analyst",
    "location": "Berlin"
}]}
  3. CSV, RSS, etc.
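Both formats can be parsed into R objects in a couple of lines. A minimal sketch, assuming the jsonlite and xml2 packages are installed, reusing the snippets above:

```r
library(jsonlite)  # JSON parsing
library(xml2)      # XML parsing

# JSON: an array of key-value objects becomes a data frame
json_txt <- '{"job-offers": [{"job-title": "Data Analyst", "location": "Berlin"}]}'
jobs <- fromJSON(json_txt)$`job-offers`
jobs$`job-title`   # "Data Analyst"

# XML: navigate the tag tree with XPath
xml_txt <- "<job-offers><job><job-title>Data Analyst</job-title>
            <location>Berlin</location></job></job-offers>"
doc <- read_xml(xml_txt)
xml_text(xml_find_first(doc, ".//location"))   # "Berlin"
```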

API key setup

Many APIs require an API key for authentication. Store it securely in .Renviron:

# Add key to .Renviron (one-time setup)
usethis::edit_r_environ()
# Add line: MY_API_KEY=your_key_here

# Access in your code
Sys.getenv("MY_API_KEY")

Never hardcode API keys in scripts that you share or commit to Git.

Example: World Bank API with httr2

library(httr2)
library(data.table)

# GDP for Germany, 2015-2023 (no API key needed)
resp <- request("https://api.worldbank.org/v2") |>
  req_url_path_append("country", "DEU", "indicator", "NY.GDP.PCAP.CD") |>
  req_url_query(date = "2015:2023", format = "json", per_page = 50) |>
  req_perform()

body <- resp |> resp_body_json()

# body[[1]] = metadata, body[[2]] = data records
dt <- rbindlist(lapply(body[[2]], function(x) {
  data.table(year = as.integer(x$date), gdp_pc = as.numeric(x$value))
}))
print(dt[order(year)])

httr2: retry and error handling

library(httr2)

# Built-in retry, rate limiting, and error suppression
resp <- request("https://api.worldbank.org/v2") |>
  req_url_path_append("country", "DEU", "indicator", "NY.GDP.PCAP.CD") |>
  req_url_query(date = "2022", format = "json") |>
  req_retry(max_tries = 3) |>                  # retries use exponential backoff by default
  req_throttle(rate = 10 / 60) |>              # max 10 requests/min
  req_error(is_error = \(resp) FALSE) |>       # don't throw on HTTP errors
  req_perform()

if (resp_status(resp) == 200) {
  data <- resp |> resp_body_json()
} else {
  message("HTTP ", resp_status(resp), ": ", resp_status_desc(resp))
}

Web Scraping

Scraping, parsing, crawling

  • Scraping: using tools to gather data you can see on a webpage
  • Parsing: analyzing the HTML text to collect the data you need
  • Crawling: moving across multiple pages/URLs to gather data at scale

Two technologies that build every webpage:

  • HTML (Hypertext Markup Language) – structure of the page
  • CSS (Cascading Style Sheets) – visual layout and styling

HTML basics

HTML uses tags to structure content — similar to XML:

\[\underbrace{\text{<p>}}_{\text{start tag}} \underbrace{\text{this is a paragraph}}_{\text{content}} \underbrace{\text{</p>}}_{\text{close tag}}\]

A minimal HTML page:

<!DOCTYPE html>
<html>
<body>
  <h1>Page Title</h1>
  <p class="intro">Hello World!</p>
  <table class="wikitable">
    <tr><td>Cell 1</td><td>Cell 2</td></tr>
  </table>
</body>
</html>

CSS selectors

CSS selectors target HTML elements by tag, class, or ID:

Selector           Matches
"table"            all <table> elements
".wikitable"       elements with class="wikitable"
"#content"         the element with id="content"
"div.mw-body p"    <p> inside <div class="mw-body">

You use these selectors in R to extract specific pieces of a webpage.
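To see the selectors in action without touching the network, rvest can parse an HTML string directly. A sketch using the minimal page from the previous slide:

```r
library(rvest)

html_txt <- '
<!DOCTYPE html>
<html><body>
  <h1>Page Title</h1>
  <p class="intro">Hello World!</p>
  <table class="wikitable">
    <tr><td>Cell 1</td><td>Cell 2</td></tr>
  </table>
</body></html>'

page <- read_html(html_txt)

intro <- page |> html_element("p.intro") |> html_text()   # "Hello World!"
tbl   <- page |> html_element("table.wikitable") |> html_table()
intro
tbl
```

The same two calls work unchanged when `read_html()` is given a URL instead of a string.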

Scraping a Wikipedia table with rvest

library(rvest)

# Read the page
url <- "https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue"
page <- read_html(url)

# Extract the first table matching the CSS selector
companies <- page |>
  html_element("table.wikitable") |>
  html_table()

head(companies)

That’s it — three lines to go from URL to data frame.

Scraping multiple pages in a loop

library(rvest)

# Scrape population from several Wikipedia country pages
countries <- c("Germany", "France", "Japan", "Brazil")
results <- list()

for (i in seq_along(countries)) {
  url <- paste0("https://en.wikipedia.org/wiki/", countries[i])
  page <- tryCatch(read_html(url), error = function(e) NULL)

  if (!is.null(page)) {
    results[[i]] <- page |>
      html_elements("table.infobox") |>   # country infobox
      html_table() |>
      (\(x) if (length(x) > 0) x[[1]] else NULL)()
  }
  Sys.sleep(runif(1, 1, 3))  # polite delay between requests
}
  • Sys.sleep(runif(1, 1, 3)): random 1–3 second delay mimics human browsing
  • tryCatch(): keeps the loop running even if one page fails

URL anatomy

\[\underbrace{\text{https:}}_{\text{protocol}} \underbrace{\text{//de.indeed.com}}_{\text{host}} \underbrace{\text{/Jobs?q=Analyst\&l=Berlin\&start=10}}_{\text{path + query}}\]

  • q= the search query; terms are separated with +
  • &l= location parameter
  • &start= pagination offset

Understanding URL structure lets you construct URLs programmatically in a scraping loop.
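Paging through results then just means incrementing start. A sketch in base R, where the host and parameter names follow the Indeed URL above:

```r
# Construct the first five result-page URLs programmatically
base  <- "https://de.indeed.com/Jobs"
query <- utils::URLencode("Data Analyst", reserved = TRUE)  # spaces -> %20
loc   <- utils::URLencode("Berlin", reserved = TRUE)

urls <- paste0(base, "?q=", query, "&l=", loc,
               "&start=", seq(0, 40, by = 10))
urls[1:2]
```

Indeed itself joins search terms with +; %20 is the equivalent percent-encoding, and query strings generally accept either.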

Legal & Ethical Considerations

robots.txt

Every well-configured website publishes a robots.txt file that tells crawlers what is allowed:

# Example: https://www.ikea.com/robots.txt
User-agent: *
Disallow: /checkout/
Disallow: /profile/
Crawl-delay: 10

User-agent: Googlebot
Allow: /
  • Always check robots.txt before scraping
  • Respect Disallow directives and Crawl-delay
  • Courts in some jurisdictions have treated scraping that ignores robots.txt as evidence of unauthorized access

Rate limiting and the polite package

  • Check the website’s Terms of Service
  • Add random pauses between requests
  • The polite package automates responsible scraping:
library(polite)

# 1. Introduce yourself and check robots.txt
session <- bow("https://en.wikipedia.org",
               user_agent = "DSfE course bot (academic use)",
               delay = 5)
print(session)  # shows which paths are allowed/disallowed

# 2. Propose a specific path
page <- nod(session, path = "/wiki/List_of_countries_by_GDP_(nominal)")

# 3. Scrape respectfully (rate-limited, robots.txt-aware)
result <- scrape(page)

LLM-based data extraction

Use large language models to parse unstructured web content into structured data:

  • Feed raw HTML or rendered text to an LLM → ask it to extract fields as JSON
  • Useful when page structure is irregular or frequently changes
  • Models: GPT-4.1-mini, Claude Haiku 4.5 — fast and cheap enough for bulk extraction
library(ellmer)

html_snippet <- "<div class='product'><h2>BILLY Bookcase</h2><span>€49.99</span></div>"
chat <- chat_openai(model = "gpt-4.1-mini")
result <- chat$chat(paste(
  "Extract product name and price as JSON from this HTML:", html_snippet
))
cat(result)

Trade-off: more flexible than CSS selectors, but slower and costlier at scale.

Wrap Up

Summary

  1. Web data fills gaps that conventional sources miss — prices, job postings, real-time activity
  2. APIs provide structured access; use httr2 to build requests with retry and rate limiting
  3. Web scraping with rvest extracts data from HTML using CSS selectors
  4. Legal and ethical: always check robots.txt, respect rate limits, use the polite package
  5. LLMs can parse messy HTML when CSS selectors break down

Further reading

  • Cavallo & Rigobon (2016), “The Billion Prices Project”, JEP: bpp.mit.edu
  • Brown et al. (2025), “Web Scraping for Research: Legal, Ethical, Institutional, and Scientific Considerations”, Big Data & Society