Data Science for Economists
2026-03-01
Why Web Data?
Find data that does not exist in conventional databases:
- Job postings
- Rental and real estate listings
- Online prices
- Consumption behaviors

> “By 2010, we were collecting 5 million prices every day.”
> — Alberto Cavallo and Roberto Rigobon (2016)
Key databases for economists (non-exhaustive):
| Source | Coverage |
|---|---|
| IMF / World Bank WDI | Macro aggregates, development indicators |
| FRED | US economy: GDP, interest rates, monetary aggregates |
| OECD / Eurostat | Developed countries, EU-level data |
| CEPII (BACI, MACMAP) | Trade flows, tariffs, gravity variables |
| Penn World Table | Real national accounts, cross-country comparisons |
Web data complements these — it fills gaps in timeliness, granularity, and coverage.
APIs
Application Programming Interface: a structured way to request data from a server
Many APIs require an API key for authentication. Store it securely in .Renviron:
Never hardcode API keys in scripts that you share or commit to Git.
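A minimal sketch of the pattern — the key name `MY_API_KEY` is illustrative; `usethis::edit_r_environ()` opens the file for you:

```r
# In ~/.Renviron (one key per line, no quotes):
# MY_API_KEY=paste-your-key-here

# After restarting R, read the key from the environment:
api_key <- Sys.getenv("MY_API_KEY")   # returns "" if the variable is unset

# Fail early with a clear message if the key is missing
if (!nzchar(api_key)) {
  message("MY_API_KEY not set; add it to ~/.Renviron and restart R")
}
```

This keeps the key out of your code, so scripts can be shared or committed to Git safely.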
httr2

```r
library(httr2)
library(data.table)

# GDP per capita for Germany, 2015-2023 (no API key needed)
resp <- request("https://api.worldbank.org/v2") |>
  req_url_path_append("country", "DEU", "indicator", "NY.GDP.PCAP.CD") |>
  req_url_query(date = "2015:2023", format = "json", per_page = 50) |>
  req_perform()

body <- resp |> resp_body_json()

# body[[1]] = metadata, body[[2]] = data records
dt <- rbindlist(lapply(body[[2]], function(x) {
  data.table(year = as.integer(x$date), gdp_pc = as.numeric(x$value))
}))
print(dt[order(year)])
```

httr2: retry and error handling
```r
library(httr2)

# Built-in retry, rate limiting, and error suppression
resp <- request("https://api.worldbank.org/v2") |>
  req_url_path_append("country", "DEU", "indicator", "NY.GDP.PCAP.CD") |>
  req_url_query(date = "2022", format = "json") |>
  req_retry(max_tries = 3, backoff = ~ 2) |>   # exponential backoff
  req_throttle(rate = 10 / 60) |>              # max 10 requests/min
  req_error(is_error = \(resp) FALSE) |>       # don't throw on HTTP errors
  req_perform()

if (resp_status(resp) == 200) {
  data <- resp |> resp_body_json()
} else {
  message("HTTP ", resp_status(resp), ": ", resp_status_desc(resp))
}
```

Web Scraping
Two technologies build every webpage:

- HTML structures the content
- CSS styles and positions it
HTML uses tags to structure content — similar to XML:
\[\underbrace{\text{<p>}}_{\text{start tag}} \underbrace{\text{this is a paragraph}}_{\text{content}} \underbrace{\text{</p>}}_{\text{close tag}}\]
CSS selectors target HTML elements by tag, class, or ID:
| Selector | Matches |
|---|---|
| `"table"` | all `<table>` elements |
| `".wikitable"` | elements with `class="wikitable"` |
| `"#content"` | the element with `id="content"` |
| `"div.mw-body p"` | `<p>` inside `<div class="mw-body">` |
You use these selectors in R to extract specific pieces of a webpage.
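To see the selectors in action without touching the network, you can parse an inline HTML snippet — the fragment below is made up for illustration:

```r
library(rvest)

# A tiny hand-written HTML fragment to practice selectors on
doc <- minimal_html('
  <div class="mw-body">
    <p id="intro">First paragraph</p>
    <p>Second paragraph</p>
  </div>
  <p>Outside the div</p>
')

# Tag + class selector: only <p> inside <div class="mw-body">
doc |> html_elements("div.mw-body p") |> html_text2()
#> [1] "First paragraph"  "Second paragraph"

# ID selector: the single element with id="intro"
doc |> html_elements("#intro") |> html_text2()
#> [1] "First paragraph"
```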
rvest

Three function calls take you from URL to data frame: `read_html()`, `html_elements()`, and `html_table()`.
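A minimal sketch of that three-call pipeline — the Wikipedia page is just an example target; any page containing a table works:

```r
library(rvest)

page   <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)")
tables <- html_elements(page, "table.wikitable")  # select tables by CSS class
gdp    <- html_table(tables[[1]])                 # parse the first into a tibble
```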
```r
library(rvest)

# Scrape population from several Wikipedia country pages
countries <- c("Germany", "France", "Japan", "Brazil")
results <- list()

for (i in seq_along(countries)) {
  url  <- paste0("https://en.wikipedia.org/wiki/", countries[i])
  page <- tryCatch(read_html(url), error = function(e) NULL)
  if (!is.null(page)) {
    results[[i]] <- page |>
      html_elements("table.infobox") |>   # country infobox
      html_table() |>
      (\(x) if (length(x) > 0) x[[1]] else NULL)()
  }
  Sys.sleep(runif(1, 1, 3))               # polite delay between requests
}
```

- `Sys.sleep(runif(1, 1, 3))`: a random 1–3 second delay mimics human browsing
- `tryCatch()`: keeps the loop running even if one page fails

Anatomy of a search URL:

\[\underbrace{\text{https:}}_{\text{protocol}} \underbrace{\text{//de.indeed.com}}_{\text{host}} \underbrace{\text{/Jobs?q=Analyst\&l=Berlin\&start=10}}_{\text{path + query}}\]
- `q=`: query for the type of job, with search terms separated by `+`
- `l=`: location parameter
- `start=`: pagination offset

Understanding URL structure lets you construct URLs programmatically in a scraping loop.
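For instance, a hypothetical sketch that builds a sequence of paginated search URLs — the host and parameter names follow the example above:

```r
base  <- "https://de.indeed.com/Jobs"
pages <- seq(0, 40, by = 10)   # start = 0, 10, 20, 30, 40

# sprintf() is vectorised over `pages`, giving one URL per offset
urls <- sprintf("%s?q=%s&l=%s&start=%d", base, "Data+Analyst", "Berlin", pages)

urls[1]
#> [1] "https://de.indeed.com/Jobs?q=Data+Analyst&l=Berlin&start=0"
```

Each element of `urls` can then be fed to `read_html()` inside a polite, rate-limited loop.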
Legal & Ethical Considerations
Every well-configured website publishes a robots.txt file that tells crawlers what is allowed:
- Check `robots.txt` before scraping
- Respect `Disallow` directives and `Crawl-delay`
- Scraping against `robots.txt` may be treated as unauthorized access

polite package

The `polite` package automates responsible scraping:

```r
library(polite)

# 1. Introduce yourself and check robots.txt
session <- bow("https://en.wikipedia.org",
               user_agent = "DSfE course bot (academic use)",
               delay = 5)
print(session)  # shows which paths are allowed/disallowed

# 2. Propose a specific path
page <- nod(session, path = "/wiki/List_of_countries_by_GDP_(nominal)")

# 3. Scrape respectfully (rate-limited, robots.txt-aware)
result <- scrape(page)
```

Use large language models to parse unstructured web content into structured data:
Trade-off: more flexible than CSS selectors, but slower and costlier at scale.
Wrap Up
- `httr2` builds requests with retry and rate limiting
- `rvest` extracts data from HTML using CSS selectors
- Check `robots.txt`, respect rate limits, use the `polite` package