Data Science for Economists
2026-03-01
Why Web Data?
Find data that does not exist in conventional databases:
- Job postings
- Rental and real estate
- Online prices
- Consumption behaviors
“By 2010, we were collecting 5 million prices every day.”
— Alberto Cavallo and Roberto Rigobon (2016)
Before you scrape, always ask whether a conventional database already covers your question.
Key databases for economists (non-exhaustive):
| Source | Coverage |
|---|---|
| IMF / World Bank WDI | Macro aggregates, development indicators |
| FRED | US economy: GDP, interest rates, monetary aggregates |
| OECD / Eurostat | Developed countries, EU-level data |
| CEPII (BACI, MACMAP) | Trade flows, tariffs, gravity variables |
| Penn World Table | Real national accounts, cross-country comparisons |
Web data complements these — it fills gaps in timeliness, granularity, and coverage.
Web data is not a random sample — always ask who is visible online:
| Dataset | Who’s missing or over-represented |
|---|---|
| Amazon reviews | Silent majority of buyers is missing; fake and extreme reviews are over-represented |
| Job postings | Internal hires, referrals, and informal labor markets are missing |
| Airbnb listings | Traditional rentals and regulated housing are missing |
| Twitter/X data | Younger, urban, politically active users are over-represented |
For causal inference: selection into being online is often correlated with the outcome of interest. Treat web data the same way you’d treat any non-random sample — identify the selection mechanism before making population claims.
APIs
Application Programming Interface: a structured way to request data from a server
Many APIs require an API key for authentication. Store it securely in .Renviron:
Never hardcode API keys in scripts that you share or commit to Git.
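A minimal sketch of that workflow, under stated assumptions: the variable name MY_API_KEY and the helper get_api_key() are hypothetical and chosen only for illustration. The key lives in ~/.Renviron as a single NAME=value line, and the script reads it with Sys.getenv() instead of containing the key itself.

```r
# ~/.Renviron would contain one line per key, for example:
#   MY_API_KEY=abc123
# (MY_API_KEY is a hypothetical name used only for this sketch.)

# Read a key from the environment and fail loudly if it is missing.
get_api_key <- function(name) {
  key <- Sys.getenv(name, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", name,
         " is not set; add it to .Renviron and restart R.")
  }
  key
}

# Simulate a configured key for this session so the sketch runs as-is.
Sys.setenv(MY_API_KEY = "abc123")
api_key <- get_api_key("MY_API_KEY")
```

Failing with an explicit error is a design choice: an empty key would otherwise be sent silently and only show up later as an authentication failure.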
httr2
library(httr2)
library(data.table)
# GDP per capita (current US$) for Germany, 2015-2023 (no API key needed)
resp <- request("https://api.worldbank.org/v2") |>
  req_url_path_append("country", "DEU", "indicator", "NY.GDP.PCAP.CD") |>
  req_url_query(date = "2015:2023", format = "json", per_page = 50) |>
  req_perform()
body <- resp |> resp_body_json()
# body[[1]] = metadata, body[[2]] = data records
dt <- rbindlist(lapply(body[[2]], function(x) {
  # value is NULL when the API has no observation for that year
  gdp_pc <- if (is.null(x$value)) NA_real_ else as.numeric(x$value)
  data.table(year = as.integer(x$date), gdp_pc = gdp_pc)
}))
print(dt[order(year)])
resp_body_json() returns a nested R list that mirrors the JSON structure:
body
├── [[1]] metadata (page info, total records, last updated)
└── [[2]] data records
    ├── [[1]] first year (list: date, value, country, indicator, ...)
    ├── [[2]] second year
    ├── [[3]] third year
    └── ...
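To make the tree concrete, here is a small offline sketch that navigates a list of the same shape. The JSON string and its numbers are hypothetical, and jsonlite::fromJSON(simplifyVector = FALSE) stands in for resp_body_json() so that no request is needed.

```r
library(jsonlite)

# Hypothetical two-record response in the same [metadata, records] shape.
json <- '[
  {"page": 1, "total": 2},
  [
    {"date": "2023", "value": 200.5, "country": {"value": "Germany"}},
    {"date": "2022", "value": null,  "country": {"value": "Germany"}}
  ]
]'
body <- fromJSON(json, simplifyVector = FALSE)

body[[1]]$total               # a metadata field
body[[2]][[1]]$date           # date of the first record
body[[2]][[1]]$country$value  # a nested field inside a record

# JSON null arrives as NULL, so guard before converting to numeric.
values <- vapply(body[[2]], function(x) {
  if (is.null(x$value)) NA_real_ else as.numeric(x$value)
}, numeric(1))
```

Each level of [[ ]] or $ steps one level down the tree, which is why inspecting the list with str(body, max.level = 2) before writing extraction code tends to save time.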
httr2: retry and error handling
library(httr2)
# Built-in retry, rate limiting, and error suppression
resp <- request("https://api.worldbank.org/v2") |>
  req_url_path_append("country", "DEU", "indicator", "NY.GDP.PCAP.CD") |>
  req_url_query(date = "2022", format = "json") |>
  req_retry(max_tries = 3, backoff = \(i) 2^i) |> # exponential backoff: wait 2^i seconds after the i-th failure
  req_throttle(rate = 10 / 60) |> # max 10 requests/min
  req_error(is_error = \(resp) FALSE) |> # don't throw on HTTP errors
  req_perform()
if (resp_status(resp) == 200) {
  data <- resp |> resp_body_json()
} else {
  message("HTTP ", resp_status(resp), ": ", resp_status_desc(resp))
}
- req_retry(): retries automatically on temporary failures such as server overload (by default, HTTP 429 and 503 responses)
- req_throttle(): slows the request rate, which avoids getting blocked and is polite to the server
- req_error(): prevents R from stopping on HTTP errors and lets you handle them yourself
Web Scraping
Three technologies behind most webpages: HTML (structure), CSS (styling), and JavaScript (interactivity).
For basic scraping, we usually start with HTML and CSS selectors.
HTML uses tags to structure content — similar to XML:
\(\underbrace{\text{<p>}}_{\text{start tag}} \underbrace{\text{this is a paragraph}}_{\text{content}} \underbrace{\text{</p>}}_{\text{close tag}}\)
CSS selectors target HTML elements by tag, class, or ID:
| Selector | Matches |
|---|---|
| "table" | all <table> elements |
| ".wikitable" | elements with class="wikitable" |
| "#content" | the element with id="content" |
| "div.mw-body p" | <p> inside <div class="mw-body"> |
You use these selectors in R to extract specific pieces of a webpage.
Right-click any element → Inspect (or F12) to open DevTools:
Practical workflow:
- Note the element's class or id
- Use it in html_elements(page, "your.selector")
rvest
That’s it: three lines to go from URL to data frame.
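A minimal sketch of the read, select, convert pattern, assuming a page that contains a table with class wikitable. To keep it runnable offline, it parses a made-up HTML snippet with rvest's minimal_html(); for a live page you would replace that call with read_html(url).

```r
library(rvest)

# Made-up snippet standing in for a downloaded page.
page <- minimal_html('
  <table class="wikitable">
    <tr><th>country</th><th>population_m</th></tr>
    <tr><td>Germany</td><td>83</td></tr>
    <tr><td>France</td><td>68</td></tr>
  </table>')

tbl <- page |>
  html_element("table.wikitable") |>  # select the first matching table
  html_table()                        # convert it to a tibble
```

html_element() (singular) returns only the first match, which is convenient when a page holds several tables and the first one is the target.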
\[\underbrace{\text{https:}}_{\text{protocol}} \underbrace{\text{//de.indeed.com}}_{\text{host}} \underbrace{\text{/Jobs?q=Analyst\&l=Berlin\&start=10}}_{\text{path + query}}\]
- q= query for the type of job; separate search terms with +
- &l= location parameter
- &start= pagination offset
Understanding URL structure lets you construct URLs programmatically in a scraping loop.
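As a sketch of what constructing URLs programmatically can look like, the helper below assembles the three parameters from the example URL. The function name build_search_url() and the page size of 10 are assumptions for illustration; the code only builds strings and sends no requests.

```r
# Build a search URL from its query parameters.
# Host and path follow the example URL; the helper name is hypothetical.
build_search_url <- function(query, location, start = 0) {
  # Encode each search term, then join the terms with "+"
  terms <- strsplit(query, " +")[[1]]
  q <- paste(
    vapply(terms, utils::URLencode, character(1), reserved = TRUE),
    collapse = "+"
  )
  paste0(
    "https://de.indeed.com/Jobs",
    "?q=", q,
    "&l=", utils::URLencode(location, reserved = TRUE),
    "&start=", start
  )
}

# URLs for the first three result pages, assuming 10 results per page
urls <- vapply(
  seq(0, 20, by = 10),
  function(s) build_search_url("Data Analyst", "Berlin", s),
  character(1)
)
urls
```

Encoding each term separately keeps the + separators literal while still escaping characters such as umlauts or ampersands inside a term.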
library(rvest)
# Scrape the infobox (which includes population) from several Wikipedia country pages
countries <- c("Germany", "France", "Japan", "Brazil")
results <- list()
for (i in seq_along(countries)) {
  url <- paste0("https://en.wikipedia.org/wiki/", countries[i])
  page <- tryCatch(read_html(url), error = function(e) NULL)
  if (!is.null(page)) {
    # store the result under the country name
    results[[countries[i]]] <- page |>
      html_elements("table.infobox") |> # country infobox
      html_table() |>
      (\(x) if (length(x) > 0) x[[1]] else NULL)()
  }
  Sys.sleep(runif(1, 1, 3)) # polite delay between requests
}
- Sys.sleep(runif(1, 1, 3)): random 1–3 second delay mimics human browsing
- tryCatch(): keeps the loop running even if one page fails
- html_elements("table.infobox") |> html_table(): finds the country infobox table and converts it from HTML into an R table
- \(x) ...: an anonymous (lambda) function; here it keeps the first table if one exists and otherwise returns NULL
Many websites embed machine-readable product metadata inside the HTML for SEO: search engines use it to read product names, prices, availability, and ratings.
Instead of scraping the visible webpage:
we can scrape structured metadata:
JSON-LD is often more stable than CSS classes, which may change when the frontend is redesigned.
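To illustrate the contrast, the sketch below parses a made-up page that carries the same price twice: once in a visible element and once in a JSON-LD block. The class name price-tag, the product name, and the price are all hypothetical; real sites use their own class names and may structure the offers field differently.

```r
library(rvest)
library(jsonlite)

# Made-up page: a visible price plus schema.org Product metadata.
page <- minimal_html('
  <span class="price-tag">49.99</span>
  <script type="application/ld+json">
  {"@type": "Product", "name": "Example shelf",
   "offers": {"@type": "Offer", "price": "49.99", "priceCurrency": "EUR"}}
  </script>')

# Route 1: visible content, tied to a CSS class that may change
visible_price <- page |>
  html_element(".price-tag") |>
  html_text2()

# Route 2: structured metadata, tied to schema.org field names
meta <- page |>
  html_element("script[type='application/ld+json']") |>
  html_text() |>
  fromJSON()

meta_price <- as.numeric(meta$offers$price)
meta_currency <- meta$offers$priceCurrency
```

The second route also returns the currency as a separate field, which the visible text usually mixes into a formatted string.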
library(rvest)
library(jsonlite)
url <- "https://www.ikea.com/de/de/p/billy-buecherregal-weiss-00263850/"
page <- read_html(url)
scripts <- page |>
  html_elements("script[type='application/ld+json']") |>
  html_text()
product <- fromJSON(scripts[2]) # second block is the Product schema
product$offers$price
product$offers$priceCurrency
| Method | Example | Pros | Cons |
|---|---|---|---|
| API | World Bank API | Clean, structured, stable | Not always available |
| CSS scraping | Wikipedia tables | Easy for visible content | Breaks when layout changes |
| JSON-LD scraping | IKEA prices | Structured and robust | Only if metadata is provided |
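Rule of thumb: prefer an API when one exists, then JSON-LD metadata, and fall back to CSS scraping when neither is available.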
rvest downloads the raw HTML sent by the server.
But some websites load the actual data after the page opens, using JavaScript.
The raw HTML contains text like:
Quotes to Scrape
Login
Loading...
but not the quote cards.
Open DevTools → Network.
On this page, JavaScript requests:
https://quotes.toscrape.com/api/quotes?page=1
Clues that this is an API request:
- Type: json
- Status: 200
- Content-Type: application/json
- Initiator: jquery.js (xhr)
So the page gets quote data from an API, then inserts it into the HTML.
library(httr2)
library(data.table)
resp <- request("https://quotes.toscrape.com/api/quotes") |>
  req_url_query(page = 1) |>
  req_perform()
data <- resp |> resp_body_json()
quotes <- rbindlist(lapply(data$quotes, function(q) {
  data.table(
    text = q$text,
    author = q$author$name,
    tags = paste(q$tags, collapse = ", ")
  )
}))
quotes
The browser does not just send a URL; it may also send cookies with the request:
This means two users visiting the same page may receive different data:
For research: document and, where possible, control the session state.
A VPN can help control apparent location, but it does not solve all personalization problems.
Better practice:
The key point: document the browsing conditions under which the data were collected.
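One way to document collection conditions is to set the request headers explicitly and store them next to the data. The sketch below is an illustration under assumptions: the user-agent string, the language value, and the columns of collection_log are hypothetical choices, not a prescribed format. It only builds the request and does not send it.

```r
library(httr2)

# Conditions we choose to fix and record (hypothetical values)
ua   <- "DSfE course bot (academic use)"
lang <- "de-DE"

req <- request("https://quotes.toscrape.com/api/quotes") |>
  req_url_query(page = 1) |>
  req_user_agent(ua) |>
  req_headers(`Accept-Language` = lang)

# One-row log to save alongside the scraped data
collection_log <- data.frame(
  url             = req$url,
  user_agent      = ua,
  accept_language = lang,
  cookies         = "none (httr2 does not keep cookies between requests by default)",
  collected_at    = format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z")
)
```

Saving this row with every batch makes it possible to check later whether two samples were collected under comparable conditions.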
| Strategy | What it does | When to use |
|---|---|---|
| rvest | reads raw HTML | data is already in the source |
| API request | downloads JSON data directly | data is loaded through Network requests |
| chromote | runs JavaScript like a browser | no accessible API or complex interaction |
Rule of thumb: before using a headless browser, inspect the Network tab.
Use a headless browser only when the data cannot be easily accessed through an API.
library(chromote)
library(rvest)
url <- "https://quotes.toscrape.com/scroll"
b <- ChromoteSession$new()
b$Page$navigate(url)
Sys.sleep(3) # give the page's JavaScript time to load the quotes
html <- b$Runtime$evaluate(
  "document.documentElement.outerHTML"
)$result$value
b$close()
page_rendered <- read_html(html)
page_rendered |>
  html_elements(".quote") |>
  html_text2()
Legal & Ethical Considerations
Every well-configured website publishes a robots.txt file that tells crawlers what is allowed:
Reading this file: for most crawlers, do not crawl /checkout/ or /profile/, and wait 10 seconds between requests. For Googlebot, crawling is broadly allowed.
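The file described above might look like the string in the sketch below. This is a reconstruction from the description, not any real site's robots.txt. The base-R parsing is deliberately simplified: it assumes one User-agent line per rule group and ignores comments and wildcards.

```r
# Illustrative robots.txt matching the description (not a real site's file)
robots_txt <- "
User-agent: *
Disallow: /checkout/
Disallow: /profile/
Crawl-delay: 10

User-agent: Googlebot
Allow: /
"

lines <- trimws(strsplit(robots_txt, "\n")[[1]])
lines <- lines[nzchar(lines)]

# Split each 'Field: value' line into its two parts
field <- tolower(sub(":.*$", "", lines))
value <- trimws(sub("^[^:]*:", "", lines))

# A new rule group starts at every User-agent line
group  <- cumsum(field == "user-agent")
agents <- value[field == "user-agent"]
rules  <- data.frame(agent = agents[group], field = field, value = value)
rules  <- rules[rules$field != "user-agent", ]

# Rules that apply to generic crawlers
disallowed  <- rules$value[rules$agent == "*" & rules$field == "disallow"]
crawl_delay <- as.numeric(rules$value[rules$agent == "*" & rules$field == "crawl-delay"])
```

For real scraping, a dedicated parser is a safer choice than this sketch, since actual files contain comments, wildcards, and grouped user agents.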
- Check robots.txt before scraping
- Respect Disallow directives and Crawl-delay
- Ignoring robots.txt may be treated as unauthorized access
polite package
The polite package automates responsible scraping:
library(polite)
# 1. Introduce yourself and check robots.txt
session <- bow("https://en.wikipedia.org",
               user_agent = "DSfE course bot (academic use)",
               delay = 5)
print(session) # shows which paths are allowed/disallowed
# 2. Propose a specific path
page <- nod(session, path = "/wiki/List_of_countries_by_GDP_(nominal)")
# 3. Scrape respectfully (rate-limited, robots.txt-aware)
result <- scrape(page)
Use large language models to parse unstructured web content into structured data:
Trade-off: more flexible than CSS selectors, but slower and costlier at scale.
Wrap Up
- Use httr2 to build requests with retry and rate limiting
- rvest extracts data from HTML using CSS selectors
- Check robots.txt, respect rate limits, use the polite package
Case Study: The IKEA Index
The Law of One Price says that identical goods should sell for the same price across countries once converted to a common currency:
\[P_{\text{local}} = P_{\text{EUR}} \times S\]
where \(S\) is the exchange rate: local currency per EUR.
The Big Mac Index applies this idea to a standardized McDonald’s product.
Our case study asks:
Can we build a similar index using IKEA’s BILLY bookcase?
Like the Big Mac Index, the idea is simple:
compare the local price of the same good across countries after converting currencies.
CSS class names can break when websites redesign.
Instead, many e-commerce sites embed JSON-LD structured data for SEO.
library(rvest)
library(jsonlite)
library(data.table)
# Return the first price/currency pair found in the page's JSON-LD blocks
extract_price_jsonld <- function(page) {
  scripts <- page |>
    html_elements("script[type='application/ld+json']") |>
    html_text()
  for (s in scripts) {
    parsed <- tryCatch(
      fromJSON(s, simplifyVector = FALSE),
      error = function(e) NULL
    )
    if (is.null(parsed)) next # skip blocks that are not valid JSON
    offers <- parsed[["offers"]]
    if (!is.null(offers)) {
      return(list(
        price = as.numeric(offers[["price"]]),
        currency = offers[["priceCurrency"]]
      ))
    }
  }
  # no block with an offers field was found
  list(price = NA_real_, currency = NA_character_)
}
countries <- c(
"de/de", "fr/fr", "it/it", "es/es",
"us/en", "gb/en", "at/de", "ch/fr"
)
rows <- list()
for (cc in countries) {
  # Assumption: the German product slug is reused for every country site;
  # pages that fail to load are skipped.
  url <- paste0(
    "https://www.ikea.com/",
    cc,
    "/p/billy-buecherregal-weiss-00263850/"
  )
  page <- tryCatch(read_html(url), error = function(e) NULL)
  if (is.null(page)) next
  result <- extract_price_jsonld(page)
  rows[[cc]] <- data.table(
    country = cc,
    price = result$price,
    currency = result$currency
  )
  Sys.sleep(1 + runif(1, 0, 2)) # polite 1-3 second delay
}
# Bind once at the end instead of growing the table inside the loop
dt <- rbindlist(rows)
This combines two tools from today:
Use Germany as the benchmark price.
\[\text{LoP ratio} = \frac{P_{\text{local}}}{P_{\text{DE}} \times S_{\text{local/EUR}}}\]
| Ratio | Interpretation |
|---|---|
| = 1 | price parity with Germany |
| > 1 | BILLY is more expensive than LoP predicts |
| < 1 | BILLY is cheaper than LoP predicts |
If markets were fully integrated and arbitrage were frictionless, all ratios would be close to 1.
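A worked example of the ratio, using made-up numbers: the prices and exchange rates below are hypothetical and serve only to show the arithmetic. Exchange rates are quoted as local currency per EUR, matching the definition of S above.

```r
library(data.table)

# Hypothetical local prices of the same product
prices <- data.table(
  country  = c("de", "ch", "us"),
  price    = c(50, 60, 55),
  currency = c("EUR", "CHF", "USD")
)

# Hypothetical exchange rates: local currency per EUR
fx <- c(EUR = 1, CHF = 0.95, USD = 1.10)

# Benchmark: the German price in EUR
p_de <- prices[country == "de", price]

# LoP ratio = P_local / (P_DE * S_local/EUR)
prices[, lop_ratio := price / (p_de * fx[currency])]
prices
```

With these numbers the Swiss ratio is 60 / (50 × 0.95) ≈ 1.26, so the product would be about 26% more expensive in Switzerland than the Law of One Price predicts, while the US ratio is 55 / (50 × 1.10) = 1.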
Even for an identical product, prices may differ because of:
So deviations from LoP are not necessarily “mistakes” — they reveal market frictions.