04. Web Scraping & APIs

Data Science for Economists

2026-03-01

Session roadmap

  1. Why web data? Applications in economics
  2. APIs: accessing structured data
  3. Web scraping: extracting data from HTML
  4. Legal and ethical considerations
  5. Hands-on examples

Why Web Data?

When to use web data

Find data that does not exist in conventional databases:

  • Data generated by users at scale (Amazon, Uber, Reddit)
  • Data that directly measures online activity (job postings, prices, reviews)
  • Data aggregated on a single site from many sources (Wikipedia lists)

Before you scrape, always ask:

  • Does this data already exist in a conventional source?
  • Would the site be willing to share data or partner with you?
  • Do they have an API? (Much easier than scraping.)

The Billion Prices Project

“By 2010, we were collecting 5 million prices every day.”

— Alberto Cavallo and Roberto Rigobon (2016)

  • Daily web-scraped prices since 2008 from hundreds of retailers in 60+ countries
  • Detected Argentina’s inflation manipulation: official CPI < 10%, online prices > 25%
  • Online prices caught the Lehman Brothers price drop two months before official CPI

Online vs official CPI

  • ~60% of CPI expenditure weights can be found online (goods, food, fuel; fewer services)
  • Short-run discrepancies (mainly developing countries) but strong medium/long-term co-movement

Other applications of web data

Job postings

  • Kuhn & Shen (2013). Gender discrimination in job ads. QJE
  • Deming & Kahn (2018). Skill requirements across firms and labor markets. JOLE
  • Acemoglu et al. (2020). AI and jobs: Evidence from online vacancies.

Rental and real estate

  • Horn & Merante (2017). Is home sharing driving up rents? JHE

Online prices

  • Cavallo (2017). Are online and offline prices similar? AER
  • Gorodnichenko & Talavera (2017). Price setting in online markets. AER
  • Cavallo (2018). Scraped data and sticky prices. RESTAT

Consumption behaviors

  • Davis, Dingel, et al. (2016): Yelp and ethnic segregation in consumption

Conventional data sources

Key databases for economists (non-exhaustive):

Source | Coverage
IMF / World Bank WDI | Macro aggregates, development indicators
FRED | US economy: GDP, interest rates, monetary aggregates
OECD / Eurostat | Developed countries, EU-level data
CEPII (BACI, MACMAP) | Trade flows, tariffs, gravity variables
Penn World Table | Real national accounts, cross-country comparisons

Web data complements these — it fills gaps in timeliness, granularity, and coverage.

Caveats: representativeness and selection

Web data is not a random sample — always ask who is visible online:

Dataset | Who’s missing
Amazon reviews | Silent majority; fake reviews; skewed toward extremes
Job postings | Internal hires, referrals, informal labor markets
Airbnb listings | Traditional rentals, regulated housing
Twitter/X data | Younger, urban, politically active users

For causal inference: selection into being online is often correlated with the outcome of interest. Treat web data the same way you’d treat any non-random sample — identify the selection mechanism before making population claims.
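
A small simulation can make the selection problem concrete. The sketch below is purely illustrative: the income distribution and the rule linking income to being visible online are invented assumptions, not estimates from any dataset.

```r
set.seed(42)

# Hypothetical population: log-normal incomes (parameters are made up)
n <- 100000
income <- rlnorm(n, meanlog = 10, sdlog = 0.5)

# Assumed selection rule: higher earners are more likely to be visible online
p_online <- plogis(2 * (log(income) - 10))
online <- rbinom(n, size = 1, prob = p_online) == 1

pop_mean <- mean(income)
web_mean <- mean(income[online])

# The "web sample" overstates average income because selection
# depends on the outcome itself
round(c(population = pop_mean, web_sample = web_mean))
```

A larger sample does not fix this: the gap comes from the selection rule, not from sampling noise.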

APIs

What is an API?

Application Programming Interface: a structured way to request data from a server

API data formats

  1. XML (Extensible Markup Language) – nested tag structure
<job-offers>
    <job>
       <job-title> Data Analyst </job-title>
       <location> Berlin </location>
    </job>
</job-offers>
  2. JSON (JavaScript Object Notation) – key-value pairs
{"job-offers": [{
    "job-title": "Data Analyst",
    "location": "Berlin"
}]}
  3. CSV, RSS, etc.
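
To see how the two main formats map to R objects, the sketch below parses the job-offer example in both encodings. It assumes the jsonlite and xml2 packages are installed; the variable names are illustrative.

```r
library(jsonlite)
library(xml2)

json_txt <- '{"job-offers": [{"job-title": "Data Analyst", "location": "Berlin"}]}'
xml_txt <- "<job-offers><job><job-title>Data Analyst</job-title><location>Berlin</location></job></job-offers>"

# JSON: an array of objects is simplified to a data frame
jobs_json <- fromJSON(json_txt)[["job-offers"]]

# XML: locate each <job> node, then read its children
job_nodes <- xml_find_all(read_xml(xml_txt), "//job")
jobs_xml <- data.frame(
  `job-title` = xml_text(xml_find_first(job_nodes, "job-title")),
  location    = xml_text(xml_find_first(job_nodes, "location")),
  check.names = FALSE
)
```

JSON usually needs less code because its arrays and objects map directly onto R lists and data frames, whereas XML requires explicit navigation of the node tree.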

API key setup

Many APIs require an API key for authentication. Store it securely in .Renviron:

install.packages("usethis")
# Open the .Renviron file
usethis::edit_r_environ()
# or:
file.edit("~/.Renviron")

# Add one line manually:
# MY_API_KEY=your_key_here

# Restart R, then access the key:
Sys.getenv("MY_API_KEY")

Never hardcode API keys in scripts that you share or commit to Git.
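
One way to make a missing key obvious is a small wrapper around Sys.getenv(). The helper name get_api_key below is hypothetical, not part of any package; it is a sketch of the pattern rather than a required step.

```r
# Hypothetical helper: fetch a key from the environment, fail loudly if unset
get_api_key <- function(name = "MY_API_KEY") {
  key <- Sys.getenv(name, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", name, " is not set; ",
         "add it to .Renviron and restart R.")
  }
  key
}
```

Calling get_api_key() at the top of a script stops it immediately with a readable message instead of sending requests with an empty key.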

Example: World Bank API with httr2

library(httr2)
library(data.table)

# GDP per capita (current US$) for Germany, 2015-2023 (no API key needed)
resp <- request("https://api.worldbank.org/v2") |>
  req_url_path_append("country", "DEU", "indicator", "NY.GDP.PCAP.CD") |>
  req_url_query(date = "2015:2023", format = "json", per_page = 50) |>
  req_perform()

body <- resp |> resp_body_json()

# body[[1]] = metadata, body[[2]] = data records
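dt <- rbindlist(lapply(body[[2]], function(x) {
  # value is NULL when the API has no observation for that year
  value <- if (is.null(x$value)) NA_real_ else as.numeric(x$value)
  data.table(year = as.integer(x$date), gdp_pc = value)
}))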
print(dt[order(year)])

Understanding the API response structure

resp_body_json() returns a nested R list that mirrors the JSON structure:

body
├── [[1]]  metadata       (page info, total records, last updated)
└── [[2]]  data records
      ├── [[1]]  first year   (list: date, value, country, indicator, ...)
      ├── [[2]]  second year
      ├── [[3]]  third year
      └── ...

lapply(body[[2]], ...) iterates over each year record and pulls out the fields you need:

body[[2]][[1]]$date    # year as a string, e.g. "2023" (records usually arrive newest first)
body[[2]][[1]]$value   # GDP per capita in current US$, e.g. 41075.96
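
The same flattening logic can be tried offline on a mock response. The JSON below only imitates the two-element layout described above; all values are made up, and the helper or_na is a hypothetical name. Parsing with simplifyVector = FALSE is intended to mirror the nested list that resp_body_json() returns.

```r
library(jsonlite)
library(data.table)

# Mock response with the same two-element layout (values are made up)
mock_json <- '[
  {"page": 1, "pages": 1, "per_page": 50, "total": 3},
  [
    {"date": "2023", "value": null,  "country": {"id": "DE", "value": "Germany"}},
    {"date": "2022", "value": 200.5, "country": {"id": "DE", "value": "Germany"}},
    {"date": "2021", "value": 100.0, "country": {"id": "DE", "value": "Germany"}}
  ]
]'
body <- fromJSON(mock_json, simplifyVector = FALSE)

# Replace a missing (NULL) field with NA so every row has the same length
or_na <- function(x) if (is.null(x)) NA_real_ else as.numeric(x)

dt <- rbindlist(lapply(body[[2]], function(x) {
  data.table(year = as.integer(x$date), value = or_na(x$value))
}))
dt[order(year)]
```

Handling NULL explicitly matters because a JSON null becomes a NULL list element in R, which would otherwise yield a zero-length column.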

httr2: retry and error handling

library(httr2)

# Built-in retry, rate limiting, and error suppression
resp <- request("https://api.worldbank.org/v2") |>
  req_url_path_append("country", "DEU", "indicator", "NY.GDP.PCAP.CD") |>
  req_url_query(date = "2022", format = "json") |>
  req_retry(max_tries = 3, backoff = \(attempt) 2^attempt) |>  # exponential backoff
  req_throttle(rate = 10 / 60) |>              # max 10 requests/min
  req_error(is_error = \(resp) FALSE) |>       # don't throw on HTTP errors
  req_perform()

if (resp_status(resp) == 200) {
  data <- resp |> resp_body_json()
} else {
  message("HTTP ", resp_status(resp), ": ", resp_status_desc(resp))
}
  • req_retry(): retries automatically on transient failures (by default, HTTP 429 and 503 responses)
  • req_throttle(): slows request rate — avoids getting blocked and is polite to the server
  • req_error(): prevents R from stopping on HTTP errors; lets you handle them yourself
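
To make the retry idea concrete, here is a base-R sketch of what a retry helper with exponential backoff does conceptually. The function with_retry is hypothetical and is not how httr2 implements req_retry(); it only illustrates the pattern of waiting longer after each failure.

```r
# Hypothetical helper: call f() up to max_tries times, doubling the wait each time
with_retry <- function(f, max_tries = 3, base_delay = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    # Wait base_delay, 2 * base_delay, 4 * base_delay, ... before the next try
    if (attempt < max_tries) Sys.sleep(base_delay * 2^(attempt - 1))
  }
  stop("All ", max_tries, " attempts failed: ", conditionMessage(result))
}

# Demo: a function that fails twice, then succeeds
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  "ok"
}
out <- with_retry(flaky, max_tries = 5, base_delay = 0.01)
```

In practice the built-in httr2 verbs are preferable; the sketch is only meant to show why the waits grow after each failed attempt.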

Web Scraping

Scraping, parsing, crawling

  • Scraping: using tools to gather data you can see on a webpage
  • Parsing: analyzing the HTML text to collect the data you need
  • Crawling: moving across multiple pages/URLs to gather data at scale

Three technologies behind most webpages:

  • HTML — structure and content
  • CSS — visual layout and styling
  • JavaScript — interactivity and dynamic content

For basic scraping, we usually start with HTML and CSS selectors.

HTML basics

HTML uses tags to structure content — similar to XML:

\(\underbrace{\text{<p>}}_{\text{start tag}} \underbrace{\text{this is a paragraph}}_{\text{content}} \underbrace{\text{</p>}}_{\text{close tag}}\)

A minimal HTML page:

<!DOCTYPE html>
<html>
<body>
  <h1>Page Title</h1>
  <p class="intro">Hello World!</p>
  <table class="wikitable">
    <tr><td>Cell 1</td><td>Cell 2</td></tr>
  </table>
</body>
</html>

CSS selectors

CSS selectors target HTML elements by tag, class, or ID:

Selector | Matches
"table" | all <table> elements
".wikitable" | elements with class="wikitable"
"#content" | the element with id="content"
"div.mw-body p" | <p> inside <div class="mw-body">

You use these selectors in R to extract specific pieces of a webpage.

CSS selectors: example

<html>
  <body>
    <div class='mw-body'>
      <h1>Article title</h1>
      <p>This paragraph is inside mw-body.</p>
      <p>This paragraph is also inside mw-body.</p>
    </div>
    <div class='footer'>
      <p>This paragraph is inside the footer.</p>
    </div>
  </body>
</html>
page |> html_elements("p")              # all 3 paragraphs
page |> html_elements("div.mw-body p") # only the 2 inside mw-body
page |> html_element("h1")             # the article title
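
The selectors above can be tested without touching a live site. The sketch below builds the same page in memory with rvest's minimal_html() and counts the matches; the variable names are illustrative.

```r
library(rvest)

# Build the example page from a string instead of downloading it
page <- minimal_html("
  <div class='mw-body'>
    <h1>Article title</h1>
    <p>This paragraph is inside mw-body.</p>
    <p>This paragraph is also inside mw-body.</p>
  </div>
  <div class='footer'>
    <p>This paragraph is inside the footer.</p>
  </div>
")

n_all  <- page |> html_elements("p") |> length()              # every <p>
n_body <- page |> html_elements("div.mw-body p") |> length()  # <p> inside mw-body
title  <- page |> html_element("h1") |> html_text2()          # first <h1>
```

Testing selectors on a saved or in-memory snippet first avoids sending repeated requests to the server while debugging.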

Finding selectors: browser DevTools

Right-click any element → Inspect (or F12) to open DevTools:

  • Elements tab: navigate the live HTML tree; hover to highlight elements on the page
  • Right-click an element → Copy → Copy selector: get the CSS selector automatically
  • Network tab: see all requests the page makes — useful for spotting hidden APIs

Practical workflow:

  1. Open the target page in Chrome/Firefox
  2. Right-click the data you want → Inspect
  3. Find the enclosing tag and its class or id
  4. Test in R: html_elements(page, "your.selector")

Scraping a Wikipedia table with rvest

library(rvest)

# Read the page
url <- "https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue"
page <- read_html(url)

# Extract the first table matching the CSS selector
companies <- page |>
  html_element("table.wikitable") |>
  html_table()

head(companies)

That’s it — three lines to go from URL to data frame.

URL anatomy

\[\underbrace{\text{https:}}_{\text{protocol}} \underbrace{\text{//de.indeed.com}}_{\text{host}} \underbrace{\text{/Jobs?q=Analyst\&l=Berlin\&start=10}}_{\text{path + query}}\]

  • q= query for type of job, separating search terms with +
  • &l= location parameter
  • &start= pagination offset

Understanding URL structure lets you construct URLs programmatically in a scraping loop.
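
As a sketch of programmatic URL construction, the helper below assembles the example URL from its parts and generates one URL per results page. The function name build_search_url is hypothetical, and the code only builds strings: it sends no requests, and any real use would first need to check the site's terms and robots.txt.

```r
# Hypothetical helper that assembles the search URL shown above
build_search_url <- function(terms, location, start = 0) {
  paste0(
    "https://de.indeed.com/Jobs",
    "?q=", paste(terms, collapse = "+"),           # search terms joined with +
    "&l=", URLencode(location, reserved = TRUE),   # percent-encode the location
    "&start=", start                               # pagination offset
  )
}

# One URL per results page: offsets 0, 10, 20
urls <- vapply(
  seq(0, 20, by = 10),
  function(s) build_search_url(c("Data", "Analyst"), "Berlin", start = s),
  character(1)
)
```

Only the start parameter changes between pages, which is what makes a scraping loop over pages possible.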

Scraping multiple pages in a loop

library(rvest)

# Scrape population from several Wikipedia country pages
countries <- c("Germany", "France", "Japan", "Brazil")
results <- list()

for (i in seq_along(countries)) {
  url <- paste0("https://en.wikipedia.org/wiki/", countries[i])
  page <- tryCatch(read_html(url), error = function(e) NULL)

  if (!is.null(page)) {
    results[[i]] <- page |>
      html_elements("table.infobox") |>   # country infobox
      html_table() |>
      (\(x) if (length(x) > 0) x[[1]] else NULL)()
  }
  Sys.sleep(runif(1, 1, 3))  # polite delay between requests
}
  • Sys.sleep(runif(1, 1, 3)): random 1–3 second delay mimics human browsing
  • tryCatch(): keeps the loop running even if one page fails
  • html_elements("table.infobox") |> html_table(): finds the country infobox table and converts it from HTML into an R table
  • \(x) ...: a lambda function / anonymous function; here it keeps the first table if one exists, otherwise returns NULL

Scraping structured metadata: JSON-LD

Many websites embed machine-readable product metadata inside the HTML for SEO — search engines use it to read product names, prices, availability, and ratings.

Instead of scraping the visible webpage:

<span class="some-changing-price-class">€59.99</span>

we can scrape structured metadata:

<script type="application/ld+json">
{
  "@type": "Product",
  "name": "BILLY bookcase",
  "offers": { "price": "59.99", "priceCurrency": "EUR" }
}
</script>

JSON-LD is often more stable than CSS classes, which may change when the frontend is redesigned.

Example: extracting price from JSON-LD

library(rvest)
library(jsonlite)

url  <- "https://www.ikea.com/de/de/p/billy-buecherregal-weiss-00263850/"
page <- read_html(url)

scripts <- page |>
  html_elements("script[type='application/ld+json']") |>
  html_text()

product <- fromJSON(scripts[2])  # assumes the second block is the Product schema; inspect scripts to confirm

product$offers$price
product$offers$priceCurrency

fromJSON(scripts[2]) turns the raw JSON string:

{
  "@type": "Product",
  "name": "BILLY bookcase",
  "offers": { "price": "59.99", "priceCurrency": "EUR" }
}

into an R object you can navigate with $:

product$name                  # "BILLY bookcase"
product$offers$price          # "59.99"
product$offers$priceCurrency  # "EUR"
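
Relying on the position of a block can be fragile, since a page may add or reorder its JSON-LD scripts. A possible alternative, sketched here on an in-memory page with illustrative contents, is to select the block by its @type field.

```r
library(rvest)
library(jsonlite)

# Inline page with two JSON-LD blocks (contents are illustrative)
page <- minimal_html('
  <script type="application/ld+json">
  {"@type": "Organization", "name": "IKEA"}
  </script>
  <script type="application/ld+json">
  {"@type": "Product", "name": "BILLY bookcase",
   "offers": {"price": "59.99", "priceCurrency": "EUR"}}
  </script>
')

blocks <- page |>
  html_elements("script[type='application/ld+json']") |>
  html_text() |>
  lapply(fromJSON, simplifyVector = FALSE)

# Keep the first block whose @type is "Product", wherever it appears
is_product <- vapply(
  blocks,
  function(b) identical(b[["@type"]], "Product"),
  logical(1)
)
product <- blocks[is_product][[1]]
```

The last line errors if no Product block exists, so production code would check any(is_product) first.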

Three ways to collect web data

Method | Example | Pros | Cons
API | World Bank API | Clean, structured, stable | Not always available
CSS scraping | Wikipedia tables | Easy for visible content | Breaks when layout changes
JSON-LD scraping | IKEA prices | Structured and robust | Only if metadata is provided

Rule of thumb:

  1. Use an API if one exists.
  2. Use structured metadata (JSON-LD) if available.
  3. Use CSS selectors when data only appears in the visible HTML.

JavaScript-rendered pages

rvest downloads the raw HTML sent by the server.

But some websites load the actual data after the page opens, using JavaScript.

Example:

https://quotes.toscrape.com/scroll

The browser shows quotes, but rvest only sees the initial page shell.

library(rvest)

url <- "https://quotes.toscrape.com/scroll"
page <- read_html(url)

page |> html_text2()

Raw HTML vs rendered page

The raw HTML contains text like:

Quotes to Scrape
Login
Loading...

but not the quote cards.

page |>
  html_elements(".quote") |>
  length()

This returns:

0

because .quote elements are created later by JavaScript.

DevTools: finding the API endpoint

Open DevTools → Network.

On this page, JavaScript requests:

https://quotes.toscrape.com/api/quotes?page=1

Clues that this is an API request:

  • Type: json
  • Status: 200
  • Content-Type: application/json
  • Initiator: jquery.js (xhr)

So the page gets quote data from an API, then inserts it into the HTML.

Scrape the API directly

library(httr2)
library(data.table)

resp <- request("https://quotes.toscrape.com/api/quotes") |>
  req_url_query(page = 1) |>
  req_perform()

data <- resp |> resp_body_json()

quotes <- rbindlist(lapply(data$quotes, function(q) {
  data.table(
    text = q$text,
    author = q$author$name,
    tags = paste(q$tags, collapse = ", ")
  )
}))

quotes
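
The request above fetches a single page. A pagination loop could look like the sketch below, which assumes each response carries a logical has_next flag (an assumption about this API's response, worth confirming in the Network tab). The helper collect_quotes and the stand-in fake_api are hypothetical, so the logic can be checked without sending requests.

```r
# Hypothetical helper: fetch_page(p) must return the parsed JSON for page p
collect_quotes <- function(fetch_page, max_pages = 50) {
  out <- list()
  for (p in seq_len(max_pages)) {
    data <- fetch_page(p)
    out <- c(out, data$quotes)
    # Assumption: the response carries a logical has_next flag
    if (!isTRUE(data$has_next)) break
  }
  out
}

# Offline stand-in for the API: three pages with one quote each
fake_api <- function(p) {
  list(quotes = list(list(text = paste("quote", p))), has_next = p < 3)
}
all_quotes <- collect_quotes(fake_api)
```

With httr2, fetch_page could be a function that performs the request shown above with page = p and returns resp_body_json(); adding req_throttle() keeps the loop polite. The max_pages cap guards against an endless loop if the flag is missing.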

A caveat: cookies and session state

The browser does not just send a URL — it may also send cookies with the request:

  • Session identity or login status
  • Location and language
  • A/B test group
  • Recently viewed products or user history

This means two users visiting the same page may receive different data:

  • Different prices, discounts, or availability
  • Different product rankings or recommendations
  • Different content due to A/B tests or login status

For research: document and, where possible, control the session state.

Controlling session state

A VPN can help control apparent location, but it does not solve all personalization problems.

Better practice:

  • Scrape as a logged-out user
  • Clear cookies or use a fresh browser profile
  • Fix country and language settings where possible
  • Record request headers, timestamps, and collection location
  • Repeat collection across locations if geographic variation matters

The key point: document the browsing conditions under which the data were collected.

Why this is better than scraping the rendered page

Strategy | What it does | When to use
rvest | reads raw HTML | data is already in the source
API request | downloads JSON data directly | data is loaded through Network requests
chromote | runs JavaScript like a browser | no accessible API or complex interaction

Rule of thumb: before using a headless browser, inspect the Network tab.

Headless browser fallback

Use a headless browser only when the data cannot be easily accessed through an API.

library(chromote)
library(rvest)

url <- "https://quotes.toscrape.com/scroll"

b <- ChromoteSession$new()
b$Page$navigate(url)
Sys.sleep(3)

html <- b$Runtime$evaluate(
  "document.documentElement.outerHTML"
)$result$value

b$close()

page_rendered <- read_html(html)

page_rendered |>
  html_elements(".quote") |>
  html_text2()

Legal & Ethical Considerations

robots.txt

Every well-configured website publishes a robots.txt file that tells crawlers what is allowed:

# Example: https://www.ikea.com/robots.txt
User-agent: *
Disallow: /checkout/
Disallow: /profile/
Crawl-delay: 10

User-agent: Googlebot
Allow: /

Reading this file: for most crawlers, do not crawl /checkout/ or /profile/, and wait 10 seconds between requests. For Googlebot, crawling is broadly allowed.

  • Always check robots.txt before scraping
  • Respect Disallow directives and Crawl-delay
  • robots.txt is a voluntary convention, not a law, but ignoring it can weigh against you in disputes over unauthorized access or terms-of-service violations
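
To illustrate how the Disallow rules are read, the sketch below checks paths against the example file using base R. It is deliberately simplified: it uses plain prefix matching and ignores Allow lines, wildcards, and rule precedence. The helper names disallowed_for and is_allowed are hypothetical.

```r
robots_txt <- c(
  "User-agent: *",
  "Disallow: /checkout/",
  "Disallow: /profile/",
  "Crawl-delay: 10",
  "",
  "User-agent: Googlebot",
  "Allow: /"
)

# Hypothetical helper: collect Disallow prefixes for one user agent
disallowed_for <- function(lines, agent = "*") {
  current <- NA_character_
  rules <- character()
  for (ln in trimws(lines)) {
    if (grepl("^user-agent:", ln, ignore.case = TRUE)) {
      current <- trimws(sub("^[^:]*:", "", ln))
    } else if (grepl("^disallow:", ln, ignore.case = TRUE) &&
               identical(current, agent)) {
      rules <- c(rules, trimws(sub("^[^:]*:", "", ln)))
    }
  }
  rules[nzchar(rules)]
}

# A path is allowed if it starts with none of the disallowed prefixes
is_allowed <- function(path, lines, agent = "*") {
  !any(startsWith(path, disallowed_for(lines, agent)))
}
```

For real projects a dedicated tool such as the polite package is the safer choice, since it handles the rules this sketch leaves out.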

Rate limiting and the polite package

  • Check the website’s Terms of Service
  • Add random pauses between requests
  • The polite package automates responsible scraping:
library(polite)

# 1. Introduce yourself and check robots.txt
session <- bow("https://en.wikipedia.org",
               user_agent = "DSfE course bot (academic use)",
               delay = 5)
print(session)  # shows which paths are allowed/disallowed

# 2. Propose a specific path
page <- nod(session, path = "/wiki/List_of_countries_by_GDP_(nominal)")

# 3. Scrape respectfully (rate-limited, robots.txt-aware)
result <- scrape(page)

LLM-based data extraction

Use large language models to parse unstructured web content into structured data:

  • Feed raw HTML or rendered text to an LLM → ask it to extract fields as JSON
  • Useful when page structure is irregular or frequently changes
  • Models: GPT-4.1-mini, Claude Haiku 4.5 — fast and cheap enough for bulk extraction
library(ellmer)

html_snippet <- "<div class='product'><h2>BILLY Bookcase</h2><span>€49.99</span></div>"
chat <- chat_openai(model = "gpt-4.1-mini")
result <- chat$chat(paste(
  "Extract product name and price as JSON from this HTML:", html_snippet
))
cat(result)

Trade-off: more flexible than CSS selectors, but slower and costlier at scale.
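
Because model output is free text, it helps to validate it before it enters a dataset. The sketch below uses a hard-coded string as a stand-in for a model reply; the helper parse_product and the required field names are hypothetical choices for illustration.

```r
library(jsonlite)

# Stand-in for a model reply; real replies may wrap JSON in extra text
llm_reply <- '{"name": "BILLY Bookcase", "price": "€49.99"}'

# Hypothetical validator: parse, check required fields, coerce the price
parse_product <- function(reply, required = c("name", "price")) {
  parsed <- tryCatch(fromJSON(reply), error = function(e) NULL)
  if (is.null(parsed) || !all(required %in% names(parsed))) return(NULL)
  # Strip currency symbols and other non-numeric characters
  price <- as.numeric(gsub("[^0-9.]", "", parsed$price))
  list(name = parsed$name, price = price)
}

product <- parse_product(llm_reply)
```

Returning NULL for unusable replies lets the calling loop log the failure and retry or skip, rather than silently storing malformed records.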

Wrap Up

Summary

  1. Web data fills gaps that conventional sources miss — prices, job postings, real-time activity
  2. APIs provide structured access; use httr2 to build requests with retry and rate limiting
  3. Web scraping with rvest extracts data from HTML using CSS selectors
  4. Legal and ethical: always check robots.txt, respect rate limits, use the polite package
  5. LLMs can parse messy HTML when CSS selectors break down

Further reading

  • Cavallo & Rigobon (2016), “The Billion Prices Project”, JEP: bpp.mit.edu
  • Brown et al. (2025), “Web Scraping for Research: Legal, Ethical, Institutional, and Scientific Considerations”, Big Data & Society

Case Study: The IKEA Index

From the Law of One Price to the Big Mac Index

The Law of One Price says that identical goods should sell for the same price across countries once converted to a common currency:

\[P_{\text{local}} = P_{\text{EUR}} \times S\]

where \(S\) is the exchange rate: local currency per EUR.

The Big Mac Index applies this idea to a standardized McDonald’s product.

Our case study asks:

Can we build a similar index using IKEA’s BILLY bookcase?

Why IKEA BILLY?

  • Highly standardized product sold across many countries
  • Same product family and similar design across markets
  • Easy to observe online through national IKEA websites
  • Useful for comparing prices across countries

Like the Big Mac Index, the idea is simple:

compare the local price of the same good across countries after converting currencies.

Scraping IKEA prices

CSS class names can break when websites redesign.

Instead, many e-commerce sites embed JSON-LD structured data for SEO.

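library(rvest)
library(jsonlite)

# Return price and currency from the first JSON-LD block that has an offer
extract_price_jsonld <- function(page) {
  scripts <- page |>
    html_elements("script[type='application/ld+json']") |>
    html_text()

  for (s in scripts) {
    # Skip blocks that are not valid JSON
    parsed <- tryCatch(
      fromJSON(s, simplifyVector = FALSE),
      error = function(e) NULL
    )

    if (!is.list(parsed)) next

    offers <- parsed[["offers"]]

    # Require both fields so the caller always gets length-one values
    if (is.list(offers) && !is.null(offers[["price"]]) &&
        !is.null(offers[["priceCurrency"]])) {
      return(list(
        price = as.numeric(offers[["price"]]),
        currency = offers[["priceCurrency"]]
      ))
    }
  }

  # No usable offer found
  list(price = NA_real_, currency = NA_character_)
}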

Scraping across countries

library(rvest)
library(stringr)
library(data.table)

countries <- c(
  "de/de", "fr/fr", "it/it", "es/es",
  "us/en", "gb/en", "at/de", "ch/fr"
)

# Collect one row per country, then bind once at the end
rows <- list()

for (cc in countries) {
  # Assumes the German product slug resolves on every country site;
  # pages that fail to load are skipped
  url <- str_c(
    "https://www.ikea.com/",
    cc,
    "/p/billy-buecherregal-weiss-00263850/"
  )

  page <- tryCatch(read_html(url), error = function(e) NULL)

  if (!is.null(page)) {
    result <- extract_price_jsonld(page)
    rows[[cc]] <- data.table(
      country = cc,
      price = result$price,
      currency = result$currency
    )
  }

  Sys.sleep(1 + runif(1, 0, 2))  # polite delay, also after a failed request
}

dt <- rbindlist(rows)

Data pipeline

  1. Scrape IKEA country pages
  2. Extract price and currency from JSON-LD
  3. Collect exchange rates from an API
  4. Convert local prices into EUR
  5. Compare each country to Germany
  6. Compute Law of One Price ratios

This combines two tools from today:

  • web scraping for IKEA prices
  • APIs for exchange rates

Testing the Law of One Price

Use Germany as the benchmark price.

\[\text{LoP ratio} = \frac{P_{\text{local}}}{P_{\text{DE}} \times S_{\text{local/EUR}}}\]

Ratio | Interpretation
= 1 | price parity with Germany
> 1 | BILLY is more expensive than LoP predicts
< 1 | BILLY is cheaper than LoP predicts

If markets were fully integrated and arbitrage were frictionless, all ratios would be close to 1.
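
The ratio can be computed in a few lines once prices and exchange rates are in one table. All numbers below are hypothetical placeholders, neither scraped prices nor official exchange rates, and the column names are illustrative.

```r
# Hypothetical prices and exchange rates (local currency per EUR);
# none of these numbers are scraped or official
prices <- data.frame(
  country   = c("DE", "US", "GB", "CH"),
  p_local   = c(60, 70, 50, 66),
  s_per_eur = c(1, 1.10, 0.85, 0.95)
)

# German benchmark price in EUR
p_de <- prices$p_local[prices$country == "DE"]

# Convert local prices to EUR, then compare with the German benchmark
prices$p_eur     <- prices$p_local / prices$s_per_eur
prices$lop_ratio <- prices$p_local / (p_de * prices$s_per_eur)
prices
```

The ratio equals the EUR-converted local price divided by the German price, so a value above 1 means the bookcase costs more than in Germany once currencies are converted.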

Why might prices differ?

Even for an identical product, prices may differ because of:

  • transport costs
  • tariffs and taxes
  • exchange-rate pass-through
  • local competition
  • market power
  • VAT differences
  • distribution and warehousing costs
  • local pricing strategies

So deviations from LoP are not necessarily “mistakes” — they reveal market frictions.