Data Science for Economists
2026-03-01
You can use conventional databases, for example:
You can also create your own data, starting from job postings online, and look at the skill content of newly created jobs.
Right-click the page and select “Inspect” to view its HTML source.
HTML source of this link
```html
<div class="heading4 color-text-primary singleLineTitle tapItem-gutter">
  <h2 class="jobTitle jobTitle-color-purple">
    <a aria-label="full details of RN Specialty Practice Clinic"
       class="jcs-JobTitle" data-jk="64208e3673f6e103"
       href="/rc/clk?jk=64208e3673f6e103" role="button" target="_blank">
      <span title="RN Specialty Practice Clinic">
        RN Specialty Practice Clinic
      </span>
    </a>
  </h2>
</div>
<div class="heading6 company_location tapItem-gutter companyInfo">
  <span class="companyName">
    Androscoggin Valley Hospital - NURSING - PHYS...
  </span>
  <div class="companyLocation">Berlin, NH 03570</div>
  <div class="heading6 tapItem-gutter metadataContainer">
    <div class="metadata estimated-salary-container">
      <span class="estimated-salary">...</span>
    </div>
  </div>
</div>
```

Look across the `div`s; within the `h2`, match the attribute `class="jobTitle jobTitle-color-purple"` and extract the content of the inner `span`.
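As a sketch, rvest can apply exactly that kind of selector to the snippet above (here stored inline as a string for illustration):

```r
library(rvest)

# Parse the job-card snippet shown above (stored here as a string)
html <- minimal_html('
<div class="heading4 color-text-primary singleLineTitle tapItem-gutter">
  <h2 class="jobTitle jobTitle-color-purple">
    <a class="jcs-JobTitle" data-jk="64208e3673f6e103"
       href="/rc/clk?jk=64208e3673f6e103">
      <span title="RN Specialty Practice Clinic">RN Specialty Practice Clinic</span>
    </a>
  </h2>
</div>')

# Match the h2 by its class, then pull the text of the inner span
html |>
  html_element("h2.jobTitle span") |>
  html_text2()
#> [1] "RN Specialty Practice Clinic"
```

In a real scrape you would call `read_html()` on the page URL instead of `minimal_html()`.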
Strengths
Weaknesses

- Find data that does not exist elsewhere
- Re-arrange data in a more convenient format

Applications

- Job postings
- Rental and real estate
- Online vs offline prices
- Consumption behaviors

How to continue
Key databases for economists (non-exhaustive):
| Source | Coverage |
|---|---|
| IMF / World Bank WDI | Macro aggregates, development indicators |
| FRED | US economy: GDP, interest rates, monetary aggregates |
| OECD / Eurostat | Developed countries, EU-level data |
| CEPII (BACI, MACMAP) | Trade flows, tariffs, gravity variables |
| Penn World Table | Real national accounts, cross-country comparisons |
| Baker-Bloom-Davis EPU | Economic Policy Uncertainty index |
Full list with links on the course website.
APIs
Application Programming Interface: a structured online interface for requesting information or downloading raw data programmatically
\[\underbrace{\text{<job-title>}}_{\text{opening tag}} \underbrace{\text{Data Analyst}}_{\text{value}} \underbrace{\text{</job-title>}}_{\text{closing tag}}\]
\[\underbrace{\text{"job-title":}}_{\text{Key}} \underbrace{\text{"Data Analyst"}}_{\text{value}}\]
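For instance, a minimal sketch with the jsonlite package: the JSON key–value pair above parses directly into a named R list.

```r
library(jsonlite)

# Parse the JSON record into an R list and look up the value by key
record <- fromJSON('{"job-title": "Data Analyst"}')
record[["job-title"]]
#> [1] "Data Analyst"
```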
| Field | Description |
|---|---|
| API | Link to API documentation |
| Auth | Does this API require authentication? (OAuth, apiKey, no) |
| HTTPS | Does the API support HTTPS? |
| CORS | Does the API support CORS? Without proper CORS configuration an API will only be usable server side. |
Many APIs require an API key for authentication. Store it securely in .Renviron:
Never hardcode API keys in scripts that you share or commit to Git.
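A minimal sketch (the key name `WORLDBANK_KEY` is a made-up placeholder): put the key in your `.Renviron` file, restart R, and read it at runtime with `Sys.getenv()` instead of typing it into the script.

```r
# In ~/.Renviron (one line, no quotes needed):
# WORLDBANK_KEY=abc123yourkeyhere

# In your R script, retrieve the key at runtime:
api_key <- Sys.getenv("WORLDBANK_KEY")

# Fail early if the key is missing
if (api_key == "") stop("Set WORLDBANK_KEY in ~/.Renviron and restart R")
```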
httr2

```r
library(httr2)
library(jsonlite)

# GDP per capita for Germany, 2015-2023
resp <- request("https://api.worldbank.org/v2") |>
  req_url_path_append("country", "DEU", "indicator", "NY.GDP.PCAP.CD") |>
  req_url_query(date = "2015:2023", format = "json", per_page = 50) |>
  req_perform()

data <- resp |> resp_body_json()
# data[[2]] contains the actual records
```

httr2 is the modern replacement for httr.

Web Scraping
\[\underbrace{\text{https:}}_{\text{protocol}} \underbrace{\text{//de.indeed.com}}_{\text{host}} \underbrace{\text{/Jobs}}_{\text{path}} \underbrace{\text{?q=Analyst\&l=Berlin\&start=10}}_{\text{query string}}\]
- `q=`: query for the type of job, separating search terms with `+` (e.g. `school+teacher`)
- `&l=`: begins the string for the location, again with `+` (e.g. `North+Westphalia`)
- `&start=`: indexes which results to show (pagination offset)

\[\underbrace{\text{<p>}}_{\text{start tag}} \underbrace{\text{this is a paragraph}}_{\text{content}} \underbrace{\text{</p>}}_{\text{close tag}}\]
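As an illustrative sketch (the search terms are arbitrary), such a URL can be assembled with httr2's URL helpers, which handle the encoding of spaces and special characters for you:

```r
library(httr2)

# Build the search URL piece by piece instead of pasting strings together
req <- request("https://de.indeed.com") |>
  req_url_path_append("Jobs") |>
  req_url_query(q = "school teacher", l = "North Westphalia", start = 10)

# Inspect the assembled, properly encoded URL before performing the request
req$url
```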
Save the following in a file with a .html extension and open it in your browser.
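For example, a made-up minimal page (reusing the paragraph element from above):

```html
<!DOCTYPE html>
<html>
  <head><title>My first page</title></head>
  <body>
    <p>this is a paragraph</p>
  </body>
</html>
```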
Legal & Ethical Considerations
Every well-configured website publishes a robots.txt file that tells crawlers what is allowed:
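For example, a hypothetical robots.txt might look like this:

```text
User-agent: *        # applies to all crawlers
Disallow: /account/  # do not crawl these paths
Disallow: /api/
Crawl-delay: 10      # wait 10 seconds between requests
```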
- Check robots.txt before scraping
- Respect Disallow directives and Crawl-delay settings
- Some sites treat ignoring robots.txt as unauthorized access

Common scraping tools/libraries:
| Language | Libraries |
|---|---|
| R | rvest, httr2, polite, chromote (for JS-heavy pages) |
| Python | requests, BeautifulSoup, Scrapy, Playwright |
The polite package in R

The polite package automates responsible scraping:
```r
library(polite)

# 1. Introduce yourself and check robots.txt
session <- bow("https://www.example.com",
               user_agent = "MyResearchBot (me@uni.edu)",
               delay = 5)

# 2. Propose a specific path -- polite checks if it's allowed
page <- nod(session, path = "/data/prices")

# 3. Scrape respectfully (rate-limited, robots.txt-aware)
result <- scrape(page)
```

Use large language models to parse unstructured web content into structured data.
Trade-off: more flexible than CSS selectors, but slower and costlier at scale. Best for irregular pages or one-off extraction tasks.
The Billion Prices Project
“By 2010, we were collecting 5 million prices every day.”
— Alberto Cavallo and Roberto Rigobon (2016)


Hands-On Scraping
rvest

CSS selectors target elements by tag, class, or ID:

- `"table"` — all `<table>` elements
- `".wikitable"` — elements with `class="wikitable"`
- `"#content"` — the element with `id="content"`
- `"div.mw-body p"` — `<p>` tags inside `<div class="mw-body">`

```r
library(rvest)

urls <- paste0("https://example.com/page/", 1:100)
results <- list()

for (i in seq_along(urls)) {
  results[[i]] <- tryCatch({
    page <- read_html(urls[i])
    page |> html_element("h1") |> html_text()
  }, error = function(e) {
    message("Failed: ", urls[i], " - ", e$message)
    NA_character_
  })
  Sys.sleep(runif(1, 1, 3))  # random delay between requests
}
```

- `Sys.sleep(runif(1, 1, 3))`: a random 1–3 second delay mimics human browsing
- `tryCatch()`: keeps the loop running even if one page fails
- Alternatively, the polite package (shown earlier) automates all of this

httr2

```r
library(httr2)

resp <- request("https://api.example.com/data") |>
  req_retry(max_tries = 3, backoff = ~ 2) |>  # retry up to 3 times, waiting 2s between tries
  req_throttle(rate = 10 / 60) |>             # max 10 requests per minute
  req_error(is_error = \(resp) FALSE) |>      # don't error on HTTP failures
  req_perform()

if (resp_status(resp) == 200) {
  data <- resp |> resp_body_json()
} else {
  message("HTTP ", resp_status(resp), ": ", resp_status_desc(resp))
}
```

httr2 has built-in retry logic, throttling, and error handling — use it for API work.