Data Science for Economists
2026-05-01
Large structured datasets are everywhere in economics:
Question: In what dimension are these datasets “large”?
LARGE DATA IN ECONOMICS
Challenge: these datasets are too large for naive workflows (Excel, read.csv into RAM)
\(\Rightarrow\) You need the right tools and habits
The workhorse: BACI (CEPII) — harmonised bilateral trade flows
Other trade datasets at scale:
“Countries don’t trade. Firms trade.”
— Hallak and Levinsohn, 2005
Firm-level customs data (available for many countries since the 1990s) revealed:
Melitz (2003): Fixed costs of exporting → only the most productive firms select into exporting. Trade liberalisation raises average productivity through firm selection.
Colombian firm export values — massively skewed, heavy-tailed. A handful of firms account for most of total exports.
Does this ring a bell? Have you come across Zipf’s law before — city sizes, word frequencies in a language?
On an arithmetic (linear) scale, the top firms dominate the picture.
Suppose firm exports have a Pareto-like upper tail:
\[P(Y > y) = C y^{-\alpha}\]
Taking logs:
\[\log P(Y > y) = \log C - \alpha \log y\]
So on a log-log plot:
\(\Rightarrow\) Log-log plots help us see whether very large firms are rare exceptions or part of a systematic tail pattern. \(\Rightarrow\) The slope tells us how heavy the tail is.
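To see what this looks like in practice, here is a minimal sketch with simulated Pareto draws (not real export data); the sample size and tail parameter are arbitrary choices for illustration:

# Simulated example only: Pareto(1, alpha)-distributed "export values"
set.seed(1)
alpha <- 1.5
y <- (1 / runif(10000))^(1 / alpha)

# Empirical counter-CDF: share of firms with exports above each value
y_sorted <- sort(y, decreasing = TRUE)
ccdf <- seq_along(y_sorted) / length(y_sorted)

# Log-log plot: an approximately straight line signals a Pareto tail
plot(log(y_sorted), log(ccdf),
     xlab = "log export value", ylab = "log P(Y > y)")

# Slope of the upper tail is roughly -alpha
coef(lm(log(ccdf) ~ log(y_sorted), subset = y_sorted > quantile(y_sorted, 0.9)))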
Aggregate exports, sales, employment, or wealth are sums of micro units:
\[Y = \sum_{i=1}^{N} y_i\]
The macro weight of unit \(i\) is:
\[s_i = \frac{y_i}{Y}\]
Large microdata let us study the full distribution of \(s_i\).
In representative-agent thinking, no single unit matters:
\[\max_i s_i \to 0\]
In granular economies, some units remain large:
\[\max_i s_i \not\approx 0\]
\(\Rightarrow\) Aggregates depend on the distribution of micro units, not only on their average.
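A rough illustration with simulated data (the distributions and parameters are arbitrary): compare the weight of the largest unit under a thin-tailed and a heavy-tailed size distribution.

set.seed(1)
N <- 1e5

# Thin-tailed sizes: the largest unit is a negligible share of the total
y_thin <- rlnorm(N, meanlog = 0, sdlog = 0.5)
max(y_thin / sum(y_thin))

# Heavy-tailed (Pareto-like) sizes: the largest unit keeps a visible share
y_fat <- (1 / runif(N))^(1 / 1.05)
max(y_fat / sum(y_fat))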
Matched employer-employee data link workers to firms over time — millions of records per country
Tax and census records at scale:
Scanner / price data — item-level prices scraped or scanned daily
Financial transaction data — credit/debit card spending, real-time
MEMORY MANAGEMENT
Data processed in R has to be fully loaded into RAM
How much free RAM do I have?
- Windows: systeminfo | find "Available Physical Memory"
- macOS / Linux: top
- Linux: free -h
- Memory used by an R object: object.size(my_data)
chr_vect <- c("12", "11", "33")
dbl_vect <- c(12, 11, 33)
format(object.size(chr_vect), units = "auto")
# [1] "248 bytes"
format(object.size(dbl_vect), units = "auto")
# [1] "80 bytes"
memory.profile()
# NULL symbol pairlist closure environment promise
# 1 37875 1250989 24936 7117 34670
# language special builtin char logical integer
# 338840 45 685 122872 36574 196015
# double complex character ... any list
# 18644 40 353876 11 0 91695
# expression bytecode externalptr weakref raw S4
# 1 76205 7455 2137 2187 3573

Encode variables efficiently (e.g., factor instead of character):
gender <- c("female", "male", "other")
format(object.size(gender), units = "auto")
# [1] "272 bytes"
format(object.size(as.factor(gender)), units = "auto")
# [1] "672 bytes"
gender <- rep(c("female", "male", "other"), 100)
format(object.size(gender), units = "auto")
# [1] "2.6 Kb"
format(object.size(as.factor(gender)), units = "auto")
# [1] "1.8 Kb"A few other memory-management tips in R:
- Drop columns you no longer need: dataframe$redundant <- NULL
- Remove objects from the workspace with rm()
- Don't call gc() manually in loops (automatic gc is enough most of the time)

CHUNK AND PULL
French SIREN data: stock of all French firms (23M+ records since 1973)
Idea: data too large to load entirely → split by year in the shell, process each chunk in R, combine results
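The slides do Step 1 (splitting by year) in the shell. As an alternative sketch in R, assuming a hypothetical big_file.csv with a year column, you can stream the file in chunks so the full dataset never sits in memory:

library(readr)
library(data.table)

# Step 1 (sketch): stream the raw file in chunks and append rows to one CSV per year
# "big_file.csv", the `year` column, and the chunk size are placeholder choices
dir.create("temp/yearly_data", recursive = TRUE, showWarnings = FALSE)

split_chunk <- function(chunk, pos) {
  for (yr in unique(chunk$year)) {
    out <- file.path("temp/yearly_data", paste0("year_", yr, ".csv"))
    fwrite(chunk[chunk$year == yr, ], out, append = file.exists(out))
  }
}

read_csv_chunked("big_file.csv",
                 callback = SideEffectChunkCallback$new(split_chunk),
                 chunk_size = 1e6)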
library(data.table)
# Step 2: function to process one chunk
process_chunk <- function(file) {
dt <- fread(file)
dt[, .(n_rows = .N, mean_val = mean(value, na.rm = TRUE),
file = basename(file))]
}
# Step 3: apply to all chunks and bind
my_files <- list.files("temp/yearly_data", full.names = TRUE)
results <- rbindlist(lapply(my_files, process_chunk))

tidyverse
- filter: Filter (i.e. subset) rows based on their values
- arrange: Arrange (i.e. reorder) rows based on their values
- select: Select (i.e. subset) columns by their names
- mutate: Create new columns

starwars |>
filter(
species == "Human",
height >= 190
)
# A tibble: 4 x 14
# name height mass hair_color skin_color eye_color birth_year sex gender
# <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
# 1 Darth Va~ 202 136 none white yellow 41.9 male mascu~
# 2 Qui-Gon ~ 193 89 brown fair blue 92 male mascu~
# 3 Dooku 193 80 white fair brown 102 male mascu~
# 4 Bail Pre~ 191 NA black tan brown 67 male mascu~

starwars |>
arrange(birth_year)
# A tibble: 87 x 14
# name height mass hair_color skin_color eye_color birth_year sex gender
# <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
# 1 Wicket ~ 88 20 brown brown brown 8 male mascu~
# 2 IG-88 200 140 none metal red 15 none mascu~
# 3 Luke Sk~ 172 77 blond fair blue 19 male mascu~
# 4 Leia Or~ 150 49 brown light brown 19 fema~ femin~
# 5 Wedge A~ 170 77 brown fair hazel 21 male mascu~
# ...

starwars |>
select(name:skin_color, species, -height)
# A tibble: 87 x 5
# name mass hair_color skin_color species
# <chr> <dbl> <chr> <chr> <chr>
# 1 Luke Skywalker 77 blond fair Human
# 2 C-3PO 75 NA gold Droid
# 3 R2-D2 32 NA white, blue Droid
# 4 Darth Vader 136 none white Human
# 5 Leia Organa 49 brown light Human
# ...

starwars |>
select(name, birth_year) |>
mutate(dog_years = birth_year * 7) |>
mutate(comment = paste0(name, " is ", dog_years, " in dog years."))
# A tibble: 87 x 4
# name birth_year dog_years comment
# <chr> <dbl> <dbl> <chr>
# 1 Luke Skywalker 19 133 Luke Skywalker is 133 in dog years.
# 2 C-3PO 112 784 C-3PO is 784 in dog years.
# 3 R2-D2 33 231 R2-D2 is 231 in dog years.
# 4 Darth Vader 41.9 293. Darth Vader is 293.3 in dog years.
# ...

starwars |>
group_by(species, gender) |>
summarise(mean_height = mean(height, na.rm = TRUE))
# `summarise()` has grouped output by 'species'.
# A tibble: 42 x 3
# Groups: species [38]
# species gender mean_height
# <chr> <chr> <dbl>
# 1 Aleena masculine 79
# 2 Besalisk masculine 198
# 3 Cerean masculine 198
# 4 Chagrian masculine 196
# 5 Clawdite feminine 168
# ...

data.table
tidyverse vs. data.table: benchmark the same group-and-summarise operation in both.

library(dplyr)
library(data.table)
library(microbenchmark)

collapse_dplyr <- function() {
storms |>
group_by(name, year, month, day) |>
summarize(wind = mean(wind),
pressure = mean(pressure),
category = dplyr::first(category))
}
storms_dt <- as.data.table(storms)
collapse_dt <- function() {
storms_dt[, .(wind = mean(wind),
pressure = mean(pressure),
category = first(category)),
by = .(name, year, month, day)]
}
microbenchmark(collapse_dplyr(), collapse_dt(), times = 1)
# Unit: milliseconds
# expr min median max neval
# collapse_dplyr() 67.061792 67.061792 67.061792 1
# collapse_dt() 1.943042 1.943042 1.943042 1

data.table syntax follows the general form DT[i, j, by]:

- i: which rows to keep (like filter(), slice(), arrange())
- j: what to compute on the columns (like select(), mutate())
- by: how to group (like group_by())
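For example (a sketch reusing storms_dt from the benchmark above; the wind threshold is arbitrary), the three slots line up with a familiar dplyr pipeline:

# dplyr: storms |> filter(wind > 50) |> group_by(year) |> summarise(mean_wind = mean(wind))
# data.table: i = row filter, j = computation, by = grouping
storms_dt[wind > 50, .(mean_wind = mean(wind)), by = year]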
data.table uses := to add or modify columns by reference.
This means that an expression like DT[, new_col := ...] changes DT directly: no copy is made and there is nothing to reassign.
Add one new column:
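For instance, a minimal sketch reusing storms_dt from the benchmark above (the new column name is made up; wind is recorded in knots, and 1 knot = 1.852 km/h):

# := modifies storms_dt in place; no copy, nothing to reassign
storms_dt[, wind_kmh := wind * 1.852]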
.() is just a data.table shortcut for list().

.SD

.SD means Subset of Data.
In a grouped data.table operation, .SD is the subset of the data for the current group, excluding the grouping variable.
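For concreteness, a minimal sketch of such a grouped operation, using the starwars data from earlier converted to a data.table (the height and mass columns are our choice of illustration):

library(data.table)
starwars_dt <- as.data.table(dplyr::starwars)

# Group-wise means, written out column by column
starwars_dt[, .(height = mean(height, na.rm = TRUE),
                mass   = mean(mass,   na.rm = TRUE)),
            by = species]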
Here, the groups are defined by species.
The same operation can be written more compactly with .SD:
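A sketch of the compact form, reusing starwars_dt and the same assumed columns as above:

# .SD holds the per-group columns listed in .SDcols
starwars_dt[, lapply(.SD, mean, na.rm = TRUE),
            by = species, .SDcols = c("height", "mass")]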
Here:
- .SD = the selected columns inside each species group
- .SDcols = the columns to include in .SD
- lapply(.SD, mean, na.rm = TRUE) = apply mean() to each selected column

| tidyverse | data.table | base |
|---|---|---|
| readr::read_csv | data.table::fread | utils::read.csv |
| tibble::tibble | data.table::data.table | base::data.frame |
| dplyr::if_else | – | base::ifelse |
\(\rightarrow\) mix and win.
Example: aggregate millions of rows quickly with data.table, then plot the smaller result with ggplot2.
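A minimal sketch of that pattern, using the storms data as a stand-in for a genuinely large table:

library(data.table)
library(ggplot2)

# Heavy lifting in data.table: collapse many rows into a small summary table
storms_dt <- as.data.table(dplyr::storms)
yearly <- storms_dt[, .(mean_wind = mean(wind)), by = year]

# Plot the small result with ggplot2
ggplot(yearly, aes(x = year, y = mean_wind)) +
  geom_line()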
MODERN WORKFLOWS: PARQUET AND DUCKPLYR
CSV files are the default format for tabular data, but they have limitations at scale: files are large (no compression), column types are not stored, and the whole file must be read even if you only need a few columns.
Apache Parquet is a columnar storage format that solves all of these problems.
library(arrow)
data(storms, package = "dplyr")
# Write both formats
write.csv(storms, "storms.csv", row.names = FALSE)
write_parquet(storms, "storms.parquet")
# Compare file sizes
cat("CSV: ", round(file.size("storms.csv") / 1e6, 2), "MB\n")
cat("Parquet:", round(file.size("storms.parquet") / 1e6, 2), "MB\n")
# Read it back
df <- read_parquet("storms.parquet")
# Read only specific columns (much faster than CSV)
df_small <- read_parquet("storms.parquet",
    col_select = c("name", "year", "wind", "pressure"))

Key advantages:
- Much smaller files thanks to compression
- Column types are stored in the file (no col_types headaches)
- Read only the columns you need, without parsing the rest

duckplyr uses DuckDB (a database engine designed for data analysis) as a backend for dplyr – same syntax, dramatically faster on large data:
library(arrow)
library(duckplyr)
# 1. Convert CSV to Parquet (once)
data(storms, package = "dplyr")
write_parquet(storms, "storms.parquet")
# 2. Work with the Parquet file directly
df <- read_parquet("storms.parquet")
# 3. Analyze with dplyr syntax, DuckDB speed
result <- df |>
filter(wind > 50) |>
group_by(status, year) |>
summarise(
mean_wind = mean(wind),
n = n(),
.groups = "drop"
) |>
  collect()

\(\rightarrow\) Same tidyverse code you already know, but scales to billions of rows.
DuckDB supports SQL natively — useful if you already know SQL or need complex queries:
library(DBI)
library(duckdb)
con <- dbConnect(duckdb())
# Query a Parquet file directly with SQL --- no loading step
dbGetQuery(con, "
SELECT status, year,
AVG(wind) AS mean_wind,
COUNT(*) AS n_obs
FROM read_parquet('storms.parquet')
WHERE year >= 2010
GROUP BY status, year
ORDER BY mean_wind DESC
")
dbDisconnect(con, shutdown = TRUE)

library(arrow); library(data.table)
data(storms, package = "dplyr")
# Write both formats
write.csv(storms, "storms.csv", row.names = FALSE)
write_parquet(storms, "storms.parquet")
# Reading: CSV vs Parquet
system.time(read.csv("storms.csv")) # base R CSV
system.time(data.table::fread("storms.csv")) # fread
system.time(read_parquet("storms.parquet")) # Parquet
# Summarising: dplyr vs data.table
storms_dt <- as.data.table(storms)
system.time(storms |> dplyr::group_by(name, year) |>
dplyr::summarise(w = mean(wind), .groups = "drop"))
system.time(storms_dt[, .(w = mean(wind)), by = .(name, year)])

Rule of thumb: Start with duckplyr + Parquet. Reach for data.table when you need maximum speed or in-place modification.
Wrap up