Data Science for Economists
2026-03-01
Large structured datasets are everywhere in economics:
This session:
LARGE DATA IN ECONOMICS
Challenge: these datasets are too large for naive workflows (Excel, read.csv into RAM)
\(\Rightarrow\) You need the right tools and habits
The workhorse: BACI (CEPII) — harmonised bilateral trade flows
Other trade datasets at scale:
“Countries don’t trade. Firms trade.”
— Hallak and Levinsohn, 2005
Firm-level customs data (available for many countries since the 1990s) revealed:
Melitz (2003): Fixed costs of exporting → only the most productive firms select into exporting. Trade liberalisation raises average productivity through firm selection.
Colombian firm export values — massively skewed, heavy-tailed. A handful of firms account for most of total exports.
Matched employer-employee data link workers to firms over time — millions of records per country
Tax and census records at scale:
Scanner / price data — item-level prices scraped or scanned daily
Financial transaction data — credit/debit card spending, real-time
MEMORY MANAGEMENT
Data processed in (base) R must be fully loaded into RAM
How much free RAM do I have?
- Windows: systeminfo | find "Available Physical Memory"
- macOS/Linux: top or free -h
- Size of a single R object: object.size(my_data)
chr_vect <- c("12", "11", "33")
dbl_vect <- c(12, 11, 33)
format(object.size(chr_vect), units = "auto")
# [1] "248 bytes"
format(object.size(dbl_vect), units = "auto")
# [1] "80 bytes"
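The same logic applies to numeric types: integers use 4 bytes per element, doubles 8, so storing IDs and counts as integers roughly halves memory. A minimal sketch (my example, not from the slides):

```r
# Declare integer vectors with the L suffix or as.integer()
# when the values are whole numbers.
int_vect <- rep(1L, 1e6)  # integer: 4 bytes per element
dbl_vect <- rep(1,  1e6)  # double:  8 bytes per element
format(object.size(int_vect), units = "auto")
format(object.size(dbl_vect), units = "auto")  # roughly twice the size
```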
memory.profile()
# NULL symbol pairlist closure environment promise
# 1 37875 1250989 24936 7117 34670
# language special builtin char logical integer
# 338840 45 685 122872 36574 196015
# double complex character ... any list
# 18644 40 353876 11 0 91695
# expression bytecode externalptr weakref raw S4
# 1 76205 7455 2137 2187 3573

Encode variables efficiently (e.g., factor instead of character):
gender <- c("female", "male", "other")
format(object.size(gender), units = "auto")
# [1] "272 bytes"
format(object.size(as.factor(gender)), units = "auto")
# [1] "672 bytes"
gender <- rep(c("female", "male", "other"), 100)
format(object.size(gender), units = "auto")
# [1] "2.6 Kb"
format(object.size(as.factor(gender)), units = "auto")
# [1] "1.8 Kb"

(Factors carry a fixed overhead for the levels table, so they only pay off once the vector is long enough.)

A few other memory-management tips in R:

- Drop redundant columns: dataframe$redundant <- NULL
- Remove objects you no longer need with rm()
- Call gc() manually only sparingly, e.g. in loops (automatic garbage collection is enough most of the time)

CHUNK AND PULL
French SIREN data: stock of all French firms (23M+ records since 1973)
Idea: data too large to load entirely → split by year in the shell, process each chunk in R, combine results
library(data.table)
# Step 2: function to process one chunk
process_chunk <- function(file) {
dt <- fread(file)
dt[, .(n_rows = .N, mean_val = mean(value, na.rm = TRUE),
file = basename(file))]
}
# Step 3: apply to all chunks and bind
my_files <- list.files("temp/yearly_data", full.names = TRUE)
results <- rbindlist(lapply(my_files, process_chunk))

TIDYVERSE

Key dplyr verbs:

- filter: Filter (i.e. subset) rows based on their values
- arrange: Arrange (i.e. reorder) rows based on their values
- select: Select (i.e. subset) columns by their names
- mutate: Create new columns

starwars |>
filter(
species == "Human",
height >= 190
)
# A tibble: 4 x 14
# name height mass hair_color skin_color eye_color birth_year sex gender
# <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
# 1 Darth Va~ 202 136 none white yellow 41.9 male mascu~
# 2 Qui-Gon ~ 193 89 brown fair blue 92 male mascu~
# 3 Dooku 193 80 white fair brown 102 male mascu~
# 4 Bail Pre~ 191 NA black tan brown 67 male mascu~

starwars |>
arrange(birth_year)
# A tibble: 87 x 14
# name height mass hair_color skin_color eye_color birth_year sex gender
# <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
# 1 Wicket ~ 88 20 brown brown brown 8 male mascu~
# 2 IG-88 200 140 none metal red 15 none mascu~
# 3 Luke Sk~ 172 77 blond fair blue 19 male mascu~
# 4 Leia Or~ 150 49 brown light brown 19 fema~ femin~
# 5 Wedge A~ 170 77 brown fair hazel 21 male mascu~
# ...

starwars |>
select(name:skin_color, species, -height)
# A tibble: 87 x 5
# name mass hair_color skin_color species
# <chr> <dbl> <chr> <chr> <chr>
# 1 Luke Skywalker 77 blond fair Human
# 2 C-3PO 75 NA gold Droid
# 3 R2-D2 32 NA white, blue Droid
# 4 Darth Vader 136 none white Human
# 5 Leia Organa 49 brown light Human
# ...

starwars |>
select(name, birth_year) |>
mutate(dog_years = birth_year * 7) |>
mutate(comment = paste0(name, " is ", dog_years, " in dog years."))
# A tibble: 87 x 4
# name birth_year dog_years comment
# <chr> <dbl> <dbl> <chr>
# 1 Luke Skywalker 19 133 Luke Skywalker is 133 in dog years.
# 2 C-3PO 112 784 C-3PO is 784 in dog years.
# 3 R2-D2 33 231 R2-D2 is 231 in dog years.
# 4 Darth Vader 41.9 293. Darth Vader is 293.3 in dog years.
# ...

starwars |>
group_by(species, gender) |>
summarise(mean_height = mean(height, na.rm = TRUE))
# `summarise()` has grouped output by 'species'.
# A tibble: 42 x 3
# Groups: species [38]
# species gender mean_height
# <chr> <chr> <dbl>
# 1 Aleena masculine 79
# 2 Besalisk masculine 198
# 3 Cerean masculine 198
# 4 Chagrian masculine 196
# 5 Clawdite feminine 168
# ...

DATA.TABLE

Benchmark: the same group-and-summarise, written in tidyverse and in data.table.

collapse_dplyr <- function() {
storms |>
group_by(name, year, month, day) |>
summarize(wind = mean(wind),
pressure = mean(pressure),
category = dplyr::first(category))
}
storms_dt <- as.data.table(storms)
collapse_dt <- function() {
storms_dt[, .(wind = mean(wind),
pressure = mean(pressure),
category = first(category)),
by = .(name, year, month, day)]
}
microbenchmark(collapse_dplyr(), collapse_dt(), times = 1)
# Unit: milliseconds
# expr min median max neval
# collapse_dplyr() 67.061792 67.061792 67.061792 1
# collapse_dt() 1.943042 1.943042 1.943042 1

The general data.table form is DT[i, j, by]:

- i: which rows — covers filter(), slice(), arrange()
- j: what to do with columns — covers select(), mutate() (the latter via the LHS := RHS form)
- by: grouping — covers group_by()

.() is just a data.table shortcut for list().

| tidyverse | data.table | base |
|---|---|---|
| readr::read_csv | data.table::fread | utils::read.csv |
| tibble::tibble | data.table::data.table | base::data.frame |
| dplyr::if_else | – | base::ifelse |
\(\rightarrow\) mix and win.
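To make the DT[i, j, by] correspondence concrete, here is a small sketch on the built-in mtcars data (my example, not from the slides):

```r
library(data.table)
dt <- as.data.table(mtcars)

dt[cyl == 6]                             # i: filter rows
dt[order(-mpg)]                          # i: arrange rows
dt[, .(mpg, wt)]                         # j: select columns
dt[, kpl := mpg * 0.425]                 # j: mutate in place with :=
dt[, .(mean_mpg = mean(mpg)), by = cyl]  # by: group_by + summarise
```

Note that := modifies dt by reference (no copy), which is part of why data.table is so fast on large data.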
MODERN WORKFLOWS: PARQUET AND DUCKPLYR
CSV files are the default format for tabular data, but they have limitations at scale:
Apache Parquet is a columnar storage format that solves all of these problems.
library(arrow)
data(storms, package = "dplyr")
# Write both formats
write.csv(storms, "storms.csv", row.names = FALSE)
write_parquet(storms, "storms.parquet")
# Compare file sizes
cat("CSV: ", round(file.size("storms.csv") / 1e6, 2), "MB\n")
cat("Parquet:", round(file.size("storms.parquet") / 1e6, 2), "MB\n")
# Read it back
df <- read_parquet("storms.parquet")
# Read only specific columns (much faster than CSV)
df_small <- read_parquet("storms.parquet",
col_select = c("name", "year", "wind", "pressure"))

Key advantages:

- Much smaller files thanks to columnar compression
- Read only the columns you need, without parsing the whole file
- Column types are stored in the file (no col_types headaches)
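Parquet also enables lazy queries: with arrow::open_dataset() the filter and column selection are pushed down into the file scan, so only matching data ever reaches RAM. A sketch, assuming the storms.parquet file from above:

```r
library(arrow)
library(dplyr)

write_parquet(dplyr::storms, "storms.parquet")

ds <- open_dataset("storms.parquet")  # lazy: no rows loaded yet
res <- ds |>
  filter(year >= 2010) |>             # pushed down to the Parquet scan
  select(name, year, wind) |>
  collect()                           # rows only materialise here
```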
duckplyr uses DuckDB as a backend for dplyr – same syntax, dramatically faster on large data:
library(arrow)
library(duckplyr)
# 1. Convert CSV to Parquet (once)
data(storms, package = "dplyr")
write_parquet(storms, "storms.parquet")
# 2. Work with the Parquet file directly
df <- read_parquet("storms.parquet")
# 3. Analyze with dplyr syntax, DuckDB speed
result <- df |>
filter(wind > 50) |>
group_by(status, year) |>
summarise(
mean_wind = mean(wind),
n = n(),
.groups = "drop"
) |>
collect()

\(\rightarrow\) Same tidyverse code you already know, but scales to billions of rows.
DuckDB supports SQL natively — useful if you already know SQL or need complex queries:
library(DBI)
library(duckdb)
con <- dbConnect(duckdb())
# Query a Parquet file directly with SQL --- no loading step
dbGetQuery(con, "
SELECT status, year,
AVG(wind) AS mean_wind,
COUNT(*) AS n_obs
FROM read_parquet('storms.parquet')
WHERE year >= 2010
GROUP BY status, year
ORDER BY mean_wind DESC
")
dbDisconnect(con, shutdown = TRUE)

A quick benchmark of reading and summarising across tools:

library(arrow); library(data.table)
data(storms, package = "dplyr")
# Write both formats
write.csv(storms, "storms.csv", row.names = FALSE)
write_parquet(storms, "storms.parquet")
# Reading: CSV vs Parquet
system.time(read.csv("storms.csv")) # base R CSV
system.time(data.table::fread("storms.csv")) # fread
system.time(read_parquet("storms.parquet")) # Parquet
# Summarising: dplyr vs data.table
storms_dt <- as.data.table(storms)
system.time(storms |> dplyr::group_by(name, year) |>
dplyr::summarise(w = mean(wind), .groups = "drop"))
system.time(storms_dt[, .(w = mean(wind)), by = .(name, year)])

Rule of thumb: Start with duckplyr + Parquet. Reach for data.table when you need maximum speed or in-place modification.
Wrap up