04. Web Scraping & APIs

Data Science for Economists

2026-03-01

What skills are demanded in Germany?

You can use conventional databases, for example:

And look at the skill content of newly created jobs

Job offers online

You can create your own data starting from job postings online

Right-click the page and select “Inspect”

Type of occupations

HTML source of this link

<div class="heading4 color-text-primary singleLineTitle tapItem-gutter">
  <h2 class="jobTitle jobTitle-color-purple">
    <a aria-label="full details of RN Specialty Practice Clinic"
       class="jcs-JobTitle" data-jk="64208e3673f6e103"
       href="/rc/clk?jk=64208e3673f6e103" role="button" target="_blank">
      <span title="RN Specialty Practice Clinic">
        RN Specialty Practice Clinic
      </span>
    </a>
  </h2>
</div>
<div class="heading6 company_location tapItem-gutter companyInfo">
  <span class="companyName">
    Androscoggin Valley Hospital - NURSING - PHYS...
  </span>
  <div class="companyLocation">Berlin, NH 03570</div>
  <div class="heading6 tapItem-gutter metadataContainer">
    <div class="metadata estimated-salary-container">
      <span class="estimated-salary">...</span>
    </div>
  </div>
</div>

Extracting job titles

Look across the <div> elements; within each <h2>, match the attribute class="jobTitle jobTitle-color-purple" and extract the content of the nested <span>

> jobs |> head()
[1] "RN Specialty Practice Clinic"
[2] "Communications Assistant"
[3] "Stocker"
[4] "Director of Rehabilitation/DOR - Gorham, NH"
[5] "Nursing Unit Aide"
[6] "Administrative Assistant to the Superintendent of Schools (2..."
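The extraction described above can be sketched with rvest. The Indeed URL is an assumption, the live page structure may differ from the snippet shown, and Indeed may block automated requests:

```r
library(rvest)

# Read a results page (assumed URL for illustration)
page <- read_html("https://www.indeed.com/jobs?l=Berlin%2C+NH")

# Match the <h2 class="jobTitle jobTitle-color-purple"> nodes and
# take the title attribute of the nested <span>
jobs <- page |>
  html_elements("h2.jobTitle span[title]") |>
  html_attr("title")

jobs |> head()
```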

New Data vs Conventional Sources

Strengths

  • You can better identify demand (realized job matches confound demand with supply characteristics)
  • Rich and customizable

Weaknesses

  • Incomplete and non-representative coverage
  • Limited time dimension (archives of past postings rarely reach far back)
  • Unless you are Google, scraping at scale might not be easy

When to web scrape online info?

Find data that does not exist elsewhere

  1. Data is generated by contributions from large user base (Uber, Amazon)
  2. Data is itself measurement of activity on website (Reddit) or network (Facebook, LinkedIn, Weibo)

Re-arrange data in more convenient format

  1. Data from many sources aggregated on one site (Wiki on Civil Wars)
  2. Parsing techniques from web scraping are also useful when a data provider delivers data in an inefficient form (e.g. max 1000 spreadsheets)

Literature using online data

Job postings

  • Kuhn & Shen (2013). Gender discrimination in job ads. QJE
  • Deming et al. (2018). Skill requirements across firms and labor markets.
  • Javorcik et al. (2019). The Brexit vote and labour demand
  • Acemoglu et al. (2020). AI and jobs: Evidence from online vacancies.

Rental and real estate

  • Halket & Pignatti (2015): scrape Craigslist to study US rental market
  • Horn & Merante (2017). Is home sharing driving up rents? JHE
  • Yilmaz, Talavera & Jia (2020). Liquidity, seasonality, and distance to universities.

Online vs offline prices

  • Chevalier et al. (2003). Measuring prices online: Amazon vs Barnes & Noble.
  • Ellison & Ellison (2009). Search, obfuscation, and price elasticities. Econometrica
  • Cavallo & Rigobon (2015): “Billion Prices Project”
  • Cavallo (2017). Are online and offline prices similar? AER
  • Gorodnichenko & Talavera (2017). Price setting in online markets. AER
  • Cavallo (2018). Scraped data and sticky prices. RESTAT

Consumption behaviors

  • Baye & Morgan (2009). Brand and price advertising online. Management Science
  • Davis & Dingel (2016): Yelp and ethnic segregation in consumption

Where to start

  • Does this data already exist?
  • Would the site be willing to give you the data or partner with you?
  • Do they have an API?
    • …if not, extract website data yourself!

How to continue

  • If the data is user-contributed, who are the users? Is selection bias a big problem?
  • Does the site customize the data based on browser characteristics (IP, time, cookies)?
  • Do you need a panel? If the website changes or is pulled down, could you still write a paper?
  • How much measurement error can you tolerate in your research design?

Overview of today

  1. Main economic data sources available
  2. How to use an API to access data
  3. Basics of web scraping
  4. Legal and ethical considerations

Applications

  1. The Billion Prices Project
  2. IKEA scraper as practical example

Conventional Sources of Data in Economics

Key databases for economists (non-exhaustive):

Source                   Coverage
IMF / World Bank WDI     Macro aggregates, development indicators
FRED                     US economy: GDP, interest rates, monetary aggregates
OECD / Eurostat          Developed countries, EU-level data
CEPII (BACI, MACMAP)     Trade flows, tariffs, gravity variables
Penn World Table         Real national accounts, cross-country comparisons
Baker-Bloom-Davis EPU    Economic Policy Uncertainty index

Full list with links on the course website.

APIs

API

Application Programming Interface: a structured way to request information or download raw data from an online service

API data formats

  1. XML (Extensible Markup Language) – Node structure
<job-offers>
    <job>
       <job-title> Data Analyst </job-title>
       <location> Berlin </location>
       <benefits>
           <salary> 50k </salary>
           <remote> yes </remote>
           <type> full-time </type>
       </benefits>
    </job>
</job-offers>

\[\underbrace{\text{<job-title>}}_{\text{opening tag}} \underbrace{\text{Data Analyst}}_{\text{value}} \underbrace{\text{</job-title>}}_{\text{closing tag}}\]
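In R, a node structure like this can be parsed with the xml2 package; this sketch embeds the snippet above directly and walks it with XPath queries:

```r
library(xml2)

doc <- read_xml('
<job-offers>
  <job>
    <job-title>Data Analyst</job-title>
    <location>Berlin</location>
    <benefits>
      <salary>50k</salary>
      <remote>yes</remote>
      <type>full-time</type>
    </benefits>
  </job>
</job-offers>')

# XPath queries walk the node structure
xml_text(xml_find_first(doc, "//job-title"))  # "Data Analyst"
xml_text(xml_find_first(doc, "//salary"))     # "50k"
```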

API data formats (cont.)

  2. JSON – Key-value pairs structure
{"job-offers": [{
    "job-title": "Data Analyst",
    "location": "Berlin",
    "benefits": {
      "salary": "50k",
      "remote": "yes",
      "type": "full time"
    }
}]}

\[\underbrace{\text{"job-title":}}_{\text{Key}} \underbrace{\text{"Data Analyst"}}_{\text{value}}\]
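jsonlite turns a key-value structure like this into R objects; by default the array of job offers is simplified into a data frame:

```r
library(jsonlite)

json <- '{"job-offers": [{
    "job-title": "Data Analyst",
    "location": "Berlin",
    "benefits": {"salary": "50k", "remote": "yes", "type": "full time"}
}]}'

offers <- fromJSON(json)$`job-offers`
offers$`job-title`       # "Data Analyst"
offers$benefits$salary   # "50k"
```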

  3. CSV, RSS, etc.

APIs: where to find them

Field    Description
API      Link to API documentation
Auth     Does this API require authentication? (OAuth, apiKey, no)
HTTPS    Does the API support HTTPS?
CORS     Does the API support CORS? Without proper CORS configuration an API is only usable server-side.

API key setup

Many APIs require an API key for authentication. Store it securely in .Renviron:

# Add key to .Renviron (one-time setup)
usethis::edit_r_environ()
# Add line: MY_API_KEY=your_key_here

# Access in your code
Sys.getenv("MY_API_KEY")

Never hardcode API keys in scripts that you share or commit to Git.

Example: World Bank API with httr2

library(httr2)
library(jsonlite)

# GDP per capita for Germany, 2015-2023
resp <- request("https://api.worldbank.org/v2") |>
  req_url_path_append("country", "DEU", "indicator", "NY.GDP.PCAP.CD") |>
  req_url_query(date = "2015:2023", format = "json", per_page = 50) |>
  req_perform()

data <- resp |> resp_body_json()
# data[[2]] contains the actual records
  • httr2 is the modern replacement for httr
  • Build requests step-by-step with pipes
  • Automatic retry, rate limiting, and error handling built in
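The JSON body arrives as nested lists. A short sketch of flattening the records in data[[2]] into a data frame, assuming the World Bank response format with $date and $value fields per record:

```r
# Flatten the list of records into a data frame (field names are
# from the World Bank v2 response format)
gdp <- do.call(rbind, lapply(data[[2]], \(rec) data.frame(
  year   = rec$date,
  gdp_pc = if (is.null(rec$value)) NA_real_ else rec$value
)))
head(gdp)
```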

Web Scraping

Definitions

  • Scraping: Using tools to gather data you can see on a webpage
    • Parsing: The act of analyzing the text (HTML, …) to collect the data you need
    • Crawling: Moving across or through a website to gather data from multiple URLs/pages
  • Two main technologies to build a webpage:
    • HTML (Hypertext Markup Language) – structure of the page
    • CSS (Cascading Style Sheets) – visual and aural layout for a variety of devices

Web Crawling

\[\underbrace{\text{https:}}_{\text{protocol}} \underbrace{\text{//de.indeed.com}}_{\text{host}} \underbrace{\text{/Jobs}}_{\text{path}} \underbrace{\text{?q=Analyst\&l=Berlin\&start=10}}_{\text{query string}}\]

  • q= sets the search query; separate search terms with + (e.g. school+teacher)
  • &l= sets the location string, again joined with + (e.g. North+Westphalia)
  • &start= sets the offset into the result list, used to page through results
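Putting these pieces together, a crawl list over result pages can be built by varying start=; the step of 10 assumes 10 results per page:

```r
# Build URLs for the first five result pages (assumed 10 results per page)
base <- "https://de.indeed.com/Jobs"
urls <- paste0(base, "?q=Analyst&l=Berlin&start=", seq(0, 40, by = 10))
urls[1]
# [1] "https://de.indeed.com/Jobs?q=Analyst&l=Berlin&start=0"
```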

HTML

  • HTML: similar to XML, both use tags and node structure
  • Different functions: HTML displays content on a web page, XML represents data in a hierarchical structure
  • XML is case-sensitive while HTML is not
  • Link here to tags in HTML

\[\underbrace{\text{<p>}}_{\text{start tag}} \underbrace{\text{this is a paragraph}}_{\text{HTML element}} \underbrace{\text{</p>}}_{\text{close tag}}\]

  • HTML tolerates some unclosed tags; in XML every tag must be closed

HTML vs CSS

Save the following in a file with a .html extension and open it in a browser:

<!DOCTYPE html>
<html>
<head>
<!-- We define the layout of paragraph using CSS -->
<style>
p {
  color: purple;
  text-align: center;
}
</style>
</head>
<body>

<p>Hello World!</p>
<p>These paragraphs are styled with CSS.</p>

</body>
</html>

Legal & Ethical Considerations

robots.txt

Every well-configured website publishes a robots.txt file that tells crawlers what is allowed:

# Example: https://www.ikea.com/robots.txt
User-agent: *
Disallow: /checkout/
Disallow: /profile/
Crawl-delay: 10

User-agent: Googlebot
Allow: /
  • Always check robots.txt before scraping
  • Respect Disallow directives and Crawl-delay
  • Violating robots.txt may be treated as unauthorized access in some jurisdictions
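From R, the robotstxt package can check permissions programmatically before any request is sent; the path below refers to the example rules shown above:

```r
library(robotstxt)

# Does the site's robots.txt allow a generic bot to fetch this path?
paths_allowed("https://www.ikea.com/checkout/")
```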

Rate limiting and politeness

  • Check the website’s Terms of Service: limits on number of requests? data usage?
  • Add random pauses between requests to emulate human-like browsing behavior
  • Use proxies from different IP addresses only when necessary and permissible

Common scraping tools/libraries:

Language   Libraries
R          rvest, httr2, polite, chromote (for JS-heavy pages)
Python     requests, BeautifulSoup, Scrapy, Playwright

The polite package in R

The polite package automates responsible scraping:

library(polite)

# 1. Introduce yourself and check robots.txt
session <- bow("https://www.example.com",
               user_agent = "MyResearchBot (me@uni.edu)",
               delay = 5)

# 2. Propose a specific path -- polite checks if it's allowed
page <- nod(session, path = "/data/prices")

# 3. Scrape respectfully (rate-limited, robots.txt-aware)
result <- scrape(page)
  • Automatically reads and respects robots.txt
  • Enforces a crawl delay between requests
  • Identifies your bot with a user-agent string

LLM-based data extraction

Use large language models to parse unstructured web content into structured data.

  • Feed raw HTML or rendered text to an LLM and ask it to extract fields
  • Particularly useful when page structure is irregular or frequently changes
  • Models: GPT-4.1-mini, Claude Haiku 4.5 — fast and cheap enough for bulk extraction
library(ellmer)

# Use an LLM to extract structured data from messy HTML
chat <- chat_openai(model = "gpt-4.1-mini")
result <- chat$chat(paste(
  "Extract product name and price as JSON from this HTML:",
  html_snippet
))

Trade-off: more flexible than CSS selectors, but slower and costlier at scale. Best for irregular pages or one-off extraction tasks.

The Billion Prices Project

“By 2010, we were collecting 5 million prices every day.”

— Alberto Cavallo and Roberto Rigobon (2016)

  • Institutions: Inflacion Verdadera, MIT, PriceStats
  • Daily price data since 2008
  • From hundreds of large multi-channel retailers
  • In over 60 countries

Motivation: Argentina’s inflation crisis

  • Argentina’s CPI (2007–2015) was widely questioned: official < 10%, expectations > 25%
  • Online prices offered an independent cross-check

Online vs official CPI

  • ~60% of CPI expenditure weights can be found online (goods, food, fuel; fewer services)
  • Short-run discrepancies (mainly developing countries) but strong medium/long-term co-movement

Key insights from online prices

  1. Speed: Online prices detected the Lehman Brothers price drop two months before official CPI
  2. Price stickiness: New evidence on menu pricing (Cavallo 2018)
  3. Law of One Price: LOOP holds within same-currency zones (IKEA, Zara comparisons)

Access to the project

  • Data and results: bpp.mit.edu
  • Raw micro data available to academic researchers with data-access agreement

Hands-On Scraping

CSS Selectors with rvest

library(rvest)

# Read the page
page <- read_html("https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue")

# Extract the first table using CSS selectors
table <- page |>
  html_element("table.wikitable") |>
  html_table()

head(table)

CSS selectors target elements by tag, class, or ID:

  • "table" — all <table> elements
  • ".wikitable" — elements with class="wikitable"
  • "#content" — the element with id="content"
  • "div.mw-body p" — <p> tags inside <div class="mw-body">
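These selector types can be tried offline on a small document built with rvest::minimal_html():

```r
library(rvest)

# A tiny page exercising each selector type above
html <- minimal_html('
  <div class="mw-body" id="content">
    <p>First paragraph</p>
    <table class="wikitable"><tr><td>x</td></tr></table>
  </div>')

html |> html_element("table") |> html_name()            # "table"
html |> html_element(".wikitable") |> html_name()       # "table"
html |> html_element("#content") |> html_name()         # "div"
html |> html_elements("div.mw-body p") |> html_text()   # "First paragraph"
```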

Rate Limiting: Be a Good Citizen

library(rvest)

urls <- paste0("https://example.com/page/", 1:100)

results <- list()
for (i in seq_along(urls)) {
  results[[i]] <- tryCatch({
    page <- read_html(urls[i])
    page |> html_element("h1") |> html_text()
  }, error = function(e) {
    message("Failed: ", urls[i], " - ", e$message)
    NA_character_
  })
  Sys.sleep(runif(1, 1, 3))  # random delay between requests
}
  • Sys.sleep(runif(1, 1, 3)): random 1–3 second delay mimics human browsing
  • tryCatch(): keeps the loop running even if one page fails
  • Or use the polite package (shown earlier) which automates this

Error Handling with httr2

library(httr2)

resp <- request("https://api.example.com/data") |>
  req_retry(max_tries = 3, backoff = \(t) 2^t) |> # retry with exponential backoff
  req_throttle(rate = 10 / 60) |>              # max 10 requests per minute
  req_error(is_error = \(resp) FALSE) |>       # don't error on HTTP failures
  req_perform()

if (resp_status(resp) == 200) {
  data <- resp |> resp_body_json()
} else {
  message("HTTP ", resp_status(resp), ": ", resp_status_desc(resp))
}

httr2 has built-in retry logic, throttling, and error handling — use it for API work.

Further reading

  • Brown et al. (2025), “Web Scraping for Research: Legal, Ethical, Institutional, and Scientific Considerations”, Big Data & Society