04. Web Scraping & APIs

Data Science for Economists

Julian Hinz & Irene Iodice

2026-03-01

Session roadmap

Why web data? Applications in economics
APIs: accessing structured data
Web scraping: extracting data from HTML
Legal and ethical considerations
Hands-on examples

Why Web Data?

When to use web data

Find data that does not exist in conventional databases:

Data generated by users at scale (Amazon, Uber, Reddit)
Data that directly measure online activity (job postings, prices, reviews)
Data aggregated on a single site from many sources (Wikipedia lists)

Before you scrape, always ask:

Does this data already exist in a conventional source?
Would the site be willing to share data or partner with you?
Do they have an API? (Much easier than scraping.)

The Billion Prices Project

“By 2010, we were collecting 5 million prices every day.”

— Alberto Cavallo and Roberto Rigobon (2016)

Daily web-scraped prices since 2008 from hundreds of retailers in 60+ countries
Detected Argentina’s inflation manipulation: official CPI < 10%, online prices > 25%
Online prices registered the price drop after the Lehman Brothers collapse two months before the official CPI

Online vs official CPI

~60% of CPI expenditure weights can be found online (goods, food, fuel; fewer services)
Short-run discrepancies (mainly developing countries) but strong medium/long-term co-movement

Other applications of web data

Job postings

Kuhn & Shen (2013). Gender discrimination in job ads. QJE
Deming & Kahn (2018). Skill requirements across firms and labor markets. JOLE
Acemoglu et al. (2020). AI and jobs: Evidence from online vacancies.

Rental and real estate

Horn & Merante (2017). Is home sharing driving up rents? JHE

Online prices

Cavallo (2017). Are online and offline prices similar? AER
Gorodnichenko & Talavera (2017). Price setting in online markets. AER
Cavallo (2018). Scraped data and sticky prices. RESTAT

Consumption behaviors

Davis & Dingel (2016): Yelp and ethnic segregation in consumption

Conventional data sources

Key databases for economists (non-exhaustive):

Source	Coverage
IMF / World Bank WDI	Macro aggregates, development indicators
FRED	US economy: GDP, interest rates, monetary aggregates
OECD / Eurostat	Developed countries, EU-level data
CEPII (BACI, MACMAP)	Trade flows, tariffs, gravity variables
Penn World Table	Real national accounts, cross-country comparisons

Web data complements these — it fills gaps in timeliness, granularity, and coverage.

Caveats: representativeness and selection

Web data is not a random sample — always ask who is visible online:

Dataset	Who’s missing
Amazon reviews	Silent majority; fake reviews; skewed toward extremes
Job postings	Internal hires, referrals, informal labor markets
Airbnb listings	Traditional rentals, regulated housing
Twitter/X data	Younger, urban, politically active users

For causal inference: selection into being online is often correlated with the outcome of interest. Treat web data the same way you’d treat any non-random sample — identify the selection mechanism before making population claims.

APIs

What is an API?

Application Programming Interface: a structured way to request data from a server

API data formats

XML (Extensible Markup Language) – nested tag structure

<job-offers>
    <job>
       <job-title> Data Analyst </job-title>
       <location> Berlin </location>
    </job>
</job-offers>

JSON (JavaScript Object Notation) – key-value pairs

{"job-offers": [{
    "job-title": "Data Analyst",
    "location": "Berlin"
}]}

CSV, RSS, etc.
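A minimal sketch of how the two snippets above could be read into R, assuming the jsonlite and xml2 packages are installed. The variable names are illustrative; both routes should end up with the same one-row table.

```r
library(jsonlite)
library(xml2)

json_txt <- '{"job-offers": [{"job-title": "Data Analyst", "location": "Berlin"}]}'
xml_txt  <- "<job-offers><job><job-title> Data Analyst </job-title><location> Berlin </location></job></job-offers>"

# JSON: an array of objects is simplified to a data frame, one row per job
jobs_json <- fromJSON(json_txt)[["job-offers"]]

# XML: locate each <job> node, then read its child tags (trimming whitespace)
job_nodes <- xml_find_all(read_xml(xml_txt), ".//job")
jobs_xml <- data.frame(
  `job-title` = xml_text(xml_find_first(job_nodes, "./job-title"), trim = TRUE),
  location    = xml_text(xml_find_first(job_nodes, "./location"), trim = TRUE),
  check.names = FALSE
)

print(jobs_json)
print(jobs_xml)
```

JSON maps onto R lists and data frames almost directly, while XML usually needs explicit XPath queries, which is one reason most modern APIs default to JSON.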

API key setup

Many APIs require an API key for authentication. Store it securely in .Renviron:

install.packages("usethis")
# Open the .Renviron file
usethis::edit_r_environ()
# or:
file.edit("~/.Renviron")

# Add one line manually:
# MY_API_KEY=your_key_here

# Restart R, then access the key:
Sys.getenv("MY_API_KEY")

Never hardcode API keys in scripts that you share or commit to Git.
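One possible helper for reading the key inside a script, sketched under the assumption that the variable is called MY_API_KEY as above. The function name get_api_key is hypothetical. It fails early with a readable message instead of sending an empty key to the API.

```r
# Hypothetical helper: read an API key from the environment or stop early
get_api_key <- function(name = "MY_API_KEY") {
  key <- Sys.getenv(name, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", name,
         " is not set; add it to .Renviron and restart R.")
  }
  key
}

# Demo only: set a placeholder for this session so the call below succeeds.
# In real work the value comes from .Renviron, never from the script itself.
Sys.setenv(MY_API_KEY = "placeholder_not_a_real_key")
key <- get_api_key()
```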

Example: World Bank API with `httr2`

library(httr2)
library(data.table)

# GDP per capita (current US$) for Germany, 2015-2023 (no API key needed)
resp <- request("https://api.worldbank.org/v2") |>
  req_url_path_append("country", "DEU", "indicator", "NY.GDP.PCAP.CD") |>
  req_url_query(date = "2015:2023", format = "json", per_page = 50) |>
  req_perform()

body <- resp |> resp_body_json()

# body[[1]] = metadata, body[[2]] = data records
dt <- rbindlist(lapply(body[[2]], function(x) {
  data.table(year = as.integer(x$date), gdp_pc = as.numeric(x$value))
}))
print(dt[order(year)])

Understanding the API response structure

resp_body_json() returns a nested R list that mirrors the JSON structure:

body
├── [[1]]  metadata       (page info, total records, last updated)
└── [[2]]  data records
      ├── [[1]]  first year   (list: date, value, country, indicator, ...)
      ├── [[2]]  second year
      ├── [[3]]  third year
      └── ...

lapply(body[[2]], ...) iterates over each year record and pulls out the fields you need:

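body[[2]][[1]]$date    # a year as a string; the API returns the most recent year first, e.g. "2023"
body[[2]][[1]]$value   # GDP per capita (current US$) for that year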
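The same extraction can be tried offline on a hand-built list. The sketch below uses a hypothetical stand-in for body with invented values, and adds one safeguard that real responses may need: a record whose value is missing arrives as NULL and is turned into NA explicitly.

```r
library(data.table)

# Hypothetical stand-in for resp_body_json(): same shape, invented values
body <- list(
  list(page = 1, pages = 1, total = 3),      # metadata
  list(                                      # data records
    list(date = "2023", value = NULL),       # value not yet published
    list(date = "2022", value = 200.5),
    list(date = "2021", value = 100.25)
  )
)

dt <- rbindlist(lapply(body[[2]], function(x) {
  data.table(
    year   = as.integer(x$date),
    # NULL would produce a zero-length column, so map it to NA explicitly
    gdp_pc = if (is.null(x$value)) NA_real_ else as.numeric(x$value)
  )
}))

setorder(dt, year)  # sort by year, whatever order the records arrived in
print(dt)
```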

`httr2`: retry and error handling

library(httr2)

# Built-in retry, rate limiting, and error suppression
resp <- request("https://api.worldbank.org/v2") |>
  req_url_path_append("country", "DEU", "indicator", "NY.GDP.PCAP.CD") |>
  req_url_query(date = "2022", format = "json") |>
  req_retry(max_tries = 3, backoff = \(i) 2^i) |>  # exponential backoff: wait 2^i seconds after the i-th failure
  req_throttle(rate = 10 / 60) |>              # max 10 requests/min
  req_error(is_error = \(resp) FALSE) |>       # don't throw on HTTP errors
  req_perform()

if (resp_status(resp) == 200) {
  data <- resp |> resp_body_json()
} else {
  message("HTTP ", resp_status(resp), ": ", resp_status_desc(resp))
}

req_retry(): retries automatically on transient failures (by default, HTTP 429 and 503 responses)
req_throttle(): slows request rate — avoids getting blocked and is polite to the server
req_error(): prevents R from stopping on HTTP errors; lets you handle them yourself

Web Scraping

Scraping, parsing, crawling

Scraping: using tools to gather data you can see on a webpage
Parsing: analyzing the HTML text to collect the data you need
Crawling: moving across multiple pages/URLs to gather data at scale

Three technologies behind most webpages:

HTML — structure and content
CSS — visual layout and styling
JavaScript — interactivity and dynamic content

For basic scraping, we usually start with HTML and CSS selectors.

HTML basics

HTML uses tags to structure content — similar to XML:

$\underbrace{\text{<p>}}_{\text{start tag}} \underbrace{\text{this is a paragraph}}_{\text{content}} \underbrace{\text{</p>}}_{\text{close tag}}$

A minimal HTML page:

<!DOCTYPE html>
<html>
<body>
  <h1>Page Title</h1>
  <p class="intro">Hello World!</p>
  <table class="wikitable">
    <tr><td>Cell 1</td><td>Cell 2</td></tr>
  </table>
</body>
</html>

CSS selectors

CSS selectors target HTML elements by tag, class, or ID:

Selector	Matches
`"table"`	all `<table>` elements
`".wikitable"`	elements with `class="wikitable"`
`"#content"`	the element with `id="content"`
`"div.mw-body p"`	`<p>` inside `<div class="mw-body">`

You use these selectors in R to extract specific pieces of a webpage.

CSS selectors: example

<html>
  <body>
    <div class='mw-body'>
      <h1>Article title</h1>
      <p>This paragraph is inside mw-body.</p>
      <p>This paragraph is also inside mw-body.</p>
    </div>
    <div class='footer'>
      <p>This paragraph is inside the footer.</p>
    </div>
  </body>
</html>

page |> html_elements("p")              # all 3 paragraphs
page |> html_elements("div.mw-body p") # only the 2 inside mw-body
page |> html_element("h1")             # the article title
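To try these selectors without a live website, the HTML above can be parsed from a string. This is a small sketch using rvest's read_html(); the object names are illustrative.

```r
library(rvest)

html <- "<html><body>
  <div class='mw-body'>
    <h1>Article title</h1>
    <p>This paragraph is inside mw-body.</p>
    <p>This paragraph is also inside mw-body.</p>
  </div>
  <div class='footer'><p>This paragraph is inside the footer.</p></div>
</body></html>"

page <- read_html(html)

n_all  <- page |> html_elements("p") |> length()              # every <p>
n_body <- page |> html_elements("div.mw-body p") |> length()  # <p> within mw-body
title  <- page |> html_element("h1") |> html_text2()          # first <h1>

print(c(all = n_all, in_body = n_body))
print(title)
```

Testing a selector on a saved or inline copy of the page is a cheap way to debug it before sending any requests to the server.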

Finding selectors: browser DevTools

Right-click any element → Inspect (or F12) to open DevTools:

Elements tab: navigate the live HTML tree; hover to highlight elements on the page
Right-click an element → Copy → Copy selector: get the CSS selector automatically
Network tab: see all requests the page makes — useful for spotting hidden APIs

Practical workflow:

Open the target page in Chrome/Firefox
Right-click the data you want → Inspect
Find the enclosing tag and its class or id
Test in R: html_elements(page, "your.selector")

Scraping a Wikipedia table with `rvest`

library(rvest)

# Read the page
url <- "https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue"
page <- read_html(url)

# Extract the first table matching the CSS selector
companies <- page |>
  html_element("table.wikitable") |>
  html_table()

head(companies)

That’s it — three lines to go from URL to data frame.

URL anatomy

\[\underbrace{\text{https:}}_{\text{protocol}} \underbrace{\text{//de.indeed.com}}_{\text{host}} \underbrace{\text{/Jobs?q=Analyst\&l=Berlin\&start=10}}_{\text{path + query}}\]

q= query for type of job, separating search terms with +
&l= location parameter
&start= pagination offset

Understanding URL structure lets you construct URLs programmatically in a scraping loop.
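As an illustration, the URL pattern above can be generated for several result pages. The helper build_search_url is hypothetical, and the step of 10 results per page is an assumption taken from the start=10 example. The code only builds the strings and sends no requests.

```r
# Hypothetical helper: assemble a search URL from its query parameters
build_search_url <- function(query, location, start) {
  paste0(
    "https://de.indeed.com/Jobs",
    "?q=", gsub(" ", "+", query, fixed = TRUE),   # search terms joined by +
    "&l=", location,
    "&start=", start
  )
}

# Offsets 0, 10, 20 correspond to the first three result pages (assumed step of 10)
offsets <- seq(0, 20, by = 10)
urls <- vapply(
  offsets,
  function(s) build_search_url("Data Analyst", "Berlin", s),
  character(1)
)

print(urls)
```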

Scraping multiple pages in a loop

library(rvest)

# Scrape population from several Wikipedia country pages
countries <- c("Germany", "France", "Japan", "Brazil")
results <- list()

for (i in seq_along(countries)) {
  url <- paste0("https://en.wikipedia.org/wiki/", countries[i])
  page <- tryCatch(read_html(url), error = function(e) NULL)

  if (!is.null(page)) {
    results[[i]] <- page |>
      html_elements("table.infobox") |>   # country infobox
      html_table() |>
      (\(x) if (length(x) > 0) x[[1]] else NULL)()
  }
  Sys.sleep(runif(1, 1, 3))  # polite delay between requests
}

Sys.sleep(runif(1, 1, 3)): random 1–3 second delay mimics human browsing
tryCatch(): keeps the loop running even if one page fails
html_elements("table.infobox") |> html_table(): finds the country infobox table and converts it from HTML into an R table
\(x) ...: a lambda function / anonymous function; here it keeps the first table if one exists, otherwise returns NULL

Scraping structured metadata: JSON-LD

Many websites embed machine-readable product metadata inside the HTML for SEO — search engines use it to read product names, prices, availability, and ratings.

Instead of scraping the visible webpage:

<span class="some-changing-price-class">€59.99</span>

we can scrape structured metadata:

<script type="application/ld+json">
{
  "@type": "Product",
  "name": "BILLY bookcase",
  "offers": { "price": "59.99", "priceCurrency": "EUR" }
}
</script>

JSON-LD is often more stable than CSS classes, which may change when the frontend is redesigned.

Example: extracting price from JSON-LD

library(rvest)
library(jsonlite)

url  <- "https://www.ikea.com/de/de/p/billy-buecherregal-weiss-00263850/"
page <- read_html(url)

scripts <- page |>
  html_elements("script[type='application/ld+json']") |>
  html_text()

product <- fromJSON(scripts[2])  # second block is the Product schema

product$offers$price
product$offers$priceCurrency

fromJSON(scripts[2]) turns the raw JSON string:

{
  "@type": "Product",
  "name": "BILLY bookcase",
  "offers": { "price": "59.99", "priceCurrency": "EUR" }
}

into an R object you can navigate with $:

product$name                  # "BILLY bookcase"
product$offers$price          # "59.99"
product$offers$priceCurrency  # "EUR"
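The position of the Product block can differ between pages, so a slightly more defensive variant selects the block by its @type field instead of by index. The sketch below runs offline on an invented HTML string with two JSON-LD blocks. The page content is hypothetical and only mimics the structure shown above.

```r
library(rvest)
library(jsonlite)

# Invented page with two JSON-LD blocks; only the second describes a product
html <- '<html><head>
<script type="application/ld+json">{"@type": "BreadcrumbList", "itemListElement": []}</script>
<script type="application/ld+json">{"@type": "Product", "name": "BILLY bookcase",
  "offers": {"price": "59.99", "priceCurrency": "EUR"}}</script>
</head><body></body></html>'

# Parse every JSON-LD block into a nested R list
blocks <- read_html(html) |>
  html_elements("script[type='application/ld+json']") |>
  html_text() |>
  lapply(fromJSON, simplifyVector = FALSE)

# Keep the block whose @type is "Product", wherever it appears
is_product <- vapply(blocks, function(b) identical(b[["@type"]], "Product"), logical(1))
product <- blocks[is_product][[1]]

price    <- as.numeric(product$offers$price)
currency <- product$offers$priceCurrency
```

On real pages, @type can also be an array or sit inside an @graph wrapper, which this sketch does not handle.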

Three ways to collect web data

Method	Example	Pros	Cons
API	World Bank API	Clean, structured, stable	Not always available
CSS scraping	Wikipedia tables	Easy for visible content	Breaks when layout changes
JSON-LD scraping	IKEA prices	Structured and robust	Only if metadata is provided

Rule of thumb:

Use an API if one exists.
Use structured metadata (JSON-LD) if available.
Use CSS selectors when data only appears in the visible HTML.

JavaScript-rendered pages

rvest downloads the raw HTML sent by the server.

But some websites load the actual data after the page opens, using JavaScript.

Example:

https://quotes.toscrape.com/scroll

The browser shows quotes, but rvest only sees the initial page shell.

library(rvest)

url <- "https://quotes.toscrape.com/scroll"
page <- read_html(url)

page |> html_text2()

Raw HTML vs rendered page

The raw HTML contains text like:

Quotes to Scrape
Login
Loading...

but not the quote cards.

page |>
  html_elements(".quote") |>
  length()

This returns:

[1] 0

because .quote elements are created later by JavaScript.

DevTools: finding the API endpoint

Open DevTools → Network.

On this page, JavaScript requests:

https://quotes.toscrape.com/api/quotes?page=1

Clues that this is an API request:

Type: json
Status: 200
Content-Type: application/json
Initiator: jquery.js (xhr)

So the page gets quote data from an API, then inserts it into the HTML.

Scrape the API directly

library(httr2)
library(data.table)

resp <- request("https://quotes.toscrape.com/api/quotes") |>
  req_url_query(page = 1) |>
  req_perform()

data <- resp |> resp_body_json()

quotes <- rbindlist(lapply(data$quotes, function(q) {
  data.table(
    text = q$text,
    author = q$author$name,
    tags = paste(q$tags, collapse = ", ")
  )
}))

quotes
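The request above fetches only the first page. A possible way to collect all pages is sketched below, assuming the response carries a has_next flag that signals whether another page exists (worth confirming in the Network tab). The function names are hypothetical. The loop takes the fetching function as an argument, so it can be tried offline with a fake fetcher before pointing it at the live API.

```r
library(httr2)

# Live fetcher (defined here, not called): one request per page
fetch_live <- function(page) {
  request("https://quotes.toscrape.com/api/quotes") |>
    req_url_query(page = page) |>
    req_perform() |>
    resp_body_json()
}

# Keep requesting pages until the API reports that no further page exists
collect_quotes <- function(fetch_page, max_pages = 50, delay = 1) {
  out <- list()
  page <- 1
  repeat {
    data <- fetch_page(page)
    out <- c(out, data$quotes)
    if (!isTRUE(data$has_next) || page >= max_pages) break
    page <- page + 1
    Sys.sleep(delay)  # polite pause between requests
  }
  out
}

# Offline stand-in with invented content, mimicking a two-page API
fetch_fake <- function(page) {
  list(has_next = page < 2, quotes = list(list(text = paste("quote", page))))
}

all_quotes <- collect_quotes(fetch_fake, delay = 0)
length(all_quotes)
```

Replacing fetch_fake with fetch_live would run the same loop against the real endpoint; the max_pages cap guards against an endless loop if the flag never turns false.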

A caveat: cookies and session state

The browser does not just send a URL — it may also send cookies with the request:

Session identity or login status
Location and language
A/B test group
Recently viewed products or user history

This means two users visiting the same page may receive different data:

Different prices, discounts, or availability
Different product rankings or recommendations
Different content due to A/B tests or login status

For research: document and, where possible, control the session state.

Controlling session state

A VPN can help control apparent location, but it does not solve all personalization problems.

Better practice:

Scrape as a logged-out user
Clear cookies or use a fresh browser profile
Fix country and language settings where possible
Record request headers, timestamps, and collection location
Repeat collection across locations if geographic variation matters

The key point: document the browsing conditions under which the data were collected.
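One way to make these conditions explicit is to set the relevant headers yourself and store them next to the data. The sketch below only builds a request and a log entry; nothing is sent. The URL, language, and location values are placeholders. By default httr2 does not carry cookies over from one request to the next, which already approximates a fresh, logged-out visit.

```r
library(httr2)

# Conditions we choose to fix for every request (placeholder values)
user_agent <- "DSfE course bot (academic use)"
language   <- "de-DE"

req <- request("https://example.com/product") |>
  req_user_agent(user_agent) |>
  req_headers(`Accept-Language` = language)

# One log row per request, saved alongside the scraped data
collection_log <- data.frame(
  url                 = req$url,
  collected_at_utc    = format(Sys.time(), "%Y-%m-%d %H:%M:%S", tz = "UTC"),
  user_agent          = user_agent,
  accept_language     = language,
  logged_in           = FALSE,
  collection_location = "DE (no VPN)"   # hypothetical; record your actual setup
)

print(collection_log)
```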

Why this is better than scraping the rendered page

Strategy	What it does	When to use
`rvest`	reads raw HTML	data is already in the source
API request	downloads JSON data directly	data is loaded through Network requests
`chromote`	runs JavaScript like a browser	no accessible API or complex interaction

Rule of thumb: before using a headless browser, inspect the Network tab.

Headless browser fallback

Use a headless browser only when the data cannot be easily accessed through an API.

library(chromote)
library(rvest)

url <- "https://quotes.toscrape.com/scroll"

b <- ChromoteSession$new()
b$Page$navigate(url)
Sys.sleep(3)

html <- b$Runtime$evaluate(
  "document.documentElement.outerHTML"
)$result$value

b$close()

page_rendered <- read_html(html)

page_rendered |>
  html_elements(".quote") |>
  html_text2()

Legal & Ethical Considerations

robots.txt

Most websites publish a robots.txt file at the site root that tells crawlers which paths they may visit:

# Example: https://www.ikea.com/robots.txt
User-agent: *
Disallow: /checkout/
Disallow: /profile/
Crawl-delay: 10

User-agent: Googlebot
Allow: /

Reading this file: for most crawlers, do not crawl /checkout/ or /profile/, and wait 10 seconds between requests. For Googlebot, crawling is broadly allowed.

Always check robots.txt before scraping
Respect Disallow directives and Crawl-delay
robots.txt is not legally binding on its own, but ignoring it can count against you in disputes over unauthorized access or Terms of Service violations
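To make the reading of the file concrete, here is a deliberately simplified base-R sketch that pulls out the rules for all crawlers ("*") from the example above and checks a path against them. It ignores Allow lines and wildcard patterns, and the helper is_allowed is hypothetical; for real projects a dedicated package such as polite is the safer choice.

```r
robots_txt <- c(
  "User-agent: *",
  "Disallow: /checkout/",
  "Disallow: /profile/",
  "Crawl-delay: 10",
  "",
  "User-agent: Googlebot",
  "Allow: /"
)

# Find the rule block that applies to all crawlers ("*")
agent_lines <- grep("^User-agent:", robots_txt)
start <- agent_lines[grepl("*", robots_txt[agent_lines], fixed = TRUE)][1]
later <- agent_lines[agent_lines > start]
end   <- if (length(later) > 0) later[1] - 1 else length(robots_txt)
block <- robots_txt[start:end]

# Extract the disallowed path prefixes and the requested delay in seconds
disallowed  <- sub("^Disallow:[[:space:]]*", "", grep("^Disallow:", block, value = TRUE))
crawl_delay <- as.numeric(sub("^Crawl-delay:[[:space:]]*", "",
                              grep("^Crawl-delay:", block, value = TRUE)))

# A path is allowed if it does not start with any disallowed prefix
is_allowed <- function(path) !any(startsWith(path, disallowed))

is_allowed("/de/de/p/some-product/")
is_allowed("/checkout/cart")
```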

Rate limiting and the `polite` package

Check the website’s Terms of Service
Add random pauses between requests
The polite package automates responsible scraping:

library(polite)

# 1. Introduce yourself and check robots.txt
session <- bow("https://en.wikipedia.org",
               user_agent = "DSfE course bot (academic use)",
               delay = 5)
print(session)  # shows which paths are allowed/disallowed

# 2. Propose a specific path
page <- nod(session, path = "/wiki/List_of_countries_by_GDP_(nominal)")

# 3. Scrape respectfully (rate-limited, robots.txt-aware)
result <- scrape(page)

LLM-based data extraction

Use large language models to parse unstructured web content into structured data:

Feed raw HTML or rendered text to an LLM → ask it to extract fields as JSON
Useful when page structure is irregular or frequently changes
Models: GPT-4.1-mini, Claude Haiku 4.5 — fast and cheap enough for bulk extraction

library(ellmer)

html_snippet <- "<div class='product'><h2>BILLY Bookcase</h2><span>€49.99</span></div>"
chat <- chat_openai(model = "gpt-4.1-mini")
result <- chat$chat(paste(
  "Extract product name and price as JSON from this HTML:", html_snippet
))
cat(result)

Trade-off: more flexible than CSS selectors, but slower and costlier at scale.
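Model output is free text, so it is worth validating before it enters a dataset. The sketch below checks that a reply parses as JSON and contains the expected fields. The reply string is a hypothetical example of what a model might return, and parse_extraction is an illustrative helper, not part of ellmer.

```r
library(jsonlite)

# Hypothetical model reply; real replies may include extra text or omit fields
reply <- '{"name": "BILLY Bookcase", "price": 49.99, "currency": "EUR"}'

# Return the required fields as a list, or NULL if the reply is unusable
parse_extraction <- function(reply, required = c("name", "price")) {
  parsed <- tryCatch(fromJSON(reply), error = function(e) NULL)
  if (is.null(parsed) || !all(required %in% names(parsed))) {
    return(NULL)
  }
  parsed[required]
}

good <- parse_extraction(reply)
bad  <- parse_extraction("Sorry, I could not find a price.")

str(good)
is.null(bad)
```

Replies that fail validation can be logged and retried or checked by hand, which keeps silent extraction errors out of the final data.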

Wrap Up

Summary

Web data fills gaps that conventional sources miss — prices, job postings, real-time activity
APIs provide structured access; use httr2 to build requests with retry and rate limiting
Web scraping with rvest extracts data from HTML using CSS selectors
Legal and ethical: always check robots.txt, respect rate limits, use the polite package
LLMs can parse messy HTML when CSS selectors break down

From the Law of One Price to the Big Mac Index

The Law of One Price says that identical goods should sell for the same price across countries once converted to a common currency:

\[P_{\text{local}} = P_{\text{EUR}} \times S\]

where $S$ is the exchange rate: local currency per EUR.

The Big Mac Index applies this idea to a standardized McDonald’s product.

Our case study asks:

Can we build a similar index using IKEA’s BILLY bookcase?

Why IKEA BILLY?

Highly standardized product sold across many countries
Same product family and similar design across markets
Easy to observe online through national IKEA websites
Useful for comparing prices across countries

Like the Big Mac Index, the idea is simple:

compare the local price of the same good across countries after converting currencies.

Scraping IKEA prices

CSS class names can break when websites redesign.

Instead, many e-commerce sites embed JSON-LD structured data for SEO.

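library(rvest)
library(jsonlite)

# Return the first price/currency pair found in the page's JSON-LD blocks
extract_price_jsonld <- function(page) {
  scripts <- page |>
    html_elements("script[type='application/ld+json']") |>
    html_text()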

  for (s in scripts) {
    parsed <- tryCatch(
      fromJSON(s, simplifyVector = FALSE),
      error = function(e) NULL
    )

    if (is.null(parsed)) next

    offers <- parsed[["offers"]]

    if (!is.null(offers)) {
      return(list(
        price = as.numeric(offers[["price"]]),
        currency = offers[["priceCurrency"]]
      ))
    }
  }

  list(price = NA_real_, currency = NA_character_)
}

Scraping across countries

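library(rvest)
library(data.table)
library(stringr)

# Country/language path segments of the national IKEA sites
countries <- c(
  "de/de", "fr/fr", "it/it", "es/es",
  "us/en", "gb/en", "at/de", "ch/fr"
)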

dt <- data.table(
  country = character(),
  price = numeric(),
  currency = character()
)

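for (cc in countries) {
  # Assumption: the German product slug and article number resolve on every
  # national site; pages that fail to load are skipped below
  url <- str_c(
    "https://www.ikea.com/",
    cc,
    "/p/billy-buecherregal-weiss-00263850/"
  )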

  page <- tryCatch(read_html(url), error = function(e) NULL)

  if (is.null(page)) next

  result <- extract_price_jsonld(page)

  dt <- rbind(
    dt,
    data.table(
      country = cc,
      price = result$price,
      currency = result$currency
    )
  )

  Sys.sleep(1 + runif(1, 0, 2))
}

Data pipeline

Scrape IKEA country pages
Extract price and currency from JSON-LD
Collect exchange rates from an API
Convert local prices into EUR
Compare each country to Germany
Compute Law of One Price ratios

This combines two tools from today:

web scraping for IKEA prices
APIs for exchange rates

Testing the Law of One Price

Use Germany as the benchmark price.

\[\text{LoP ratio} = \frac{P_{\text{local}}}{P_{\text{DE}} \times S_{\text{local/EUR}}}\]

Ratio	Interpretation
= 1	price parity with Germany
> 1	BILLY is more expensive than LoP predicts
< 1	BILLY is cheaper than LoP predicts

If markets were fully integrated and arbitrage were frictionless, all ratios would be close to 1.
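A worked example of the ratio, using invented prices and exchange rates purely for illustration (none of these numbers are scraped or official). The column per_eur plays the role of $S$, local currency per EUR.

```r
library(data.table)

# Hypothetical local prices (illustrative only, not real IKEA data)
prices <- data.table(
  country  = c("de", "us", "gb", "ch"),
  price    = c(60, 70, 50, 66),
  currency = c("EUR", "USD", "GBP", "CHF")
)

# Hypothetical exchange rates: local currency per EUR
rates <- data.table(
  currency = c("EUR", "USD", "GBP", "CHF"),
  per_eur  = c(1, 1.10, 0.85, 0.95)
)

dt <- merge(prices, rates, by = "currency")
p_de <- dt[country == "de", price]           # benchmark price in EUR

dt[, price_eur := price / per_eur]           # local price converted to EUR
dt[, lop_ratio := price / (p_de * per_eur)]  # LoP ratio relative to Germany

print(dt[order(lop_ratio)])
```

The ratio equals price_eur / p_de, so it can be read as the EUR price abroad relative to the German price: in this invented example 70 / (60 × 1.10) ≈ 1.06 for the US row.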

Why might prices differ?

Even for an identical product, prices may differ because of:

transport costs
tariffs and taxes
exchange-rate pass-through
local competition
market power
VAT differences
distribution and warehousing costs
local pricing strategies

So deviations from LoP are not necessarily “mistakes” — they reveal market frictions.