03. Large Structured Data

Data Science for Economists

2026-03-01

Overview of topics

Trade data: Large data

  • Firm level data
  • Fitting power-law distributions

Technical questions:

  • Memory management with R
  • Good practices with large data
  • tidyverse and data.table
  • Modern workflows: Parquet and duckplyr

FIRM TRADE DATA

“Countries don’t trade. Firms trade.”

— Hallak and Levinsohn, 2005

Traditional trade theories

  1. H-O model: countries trade because of different factor endowments
  2. Ricardian model: countries trade because of different technologies

Data that became available in the late 1970s and early 1980s provided evidence of facts these theories could not explain:

  1. Similar countries trade extensively
  2. Intra-industry trade is prominent:
    • Japan exports Toyota vehicles to Germany and imports Mercedes-Benz automobiles from Germany

Intra-industry Trade

(Source: OECD, 2002)

Firms in the New Trade Theories

Toyota and Mercedes-Benz offer different varieties of the same good

Krugman (1979 and 1980) introduces:

  • Monopolistic Competition: firms produce differentiated products, and this differentiation caters to consumers’ love of variety.
  • Economies of Scale: producing under economies of scale lets firms supply a wide variety of goods more cost-effectively, increasing international trade.

Firms in the New Trade Theories

Implications

  • The love of variety leads to increased trade volumes, with countries importing many different types of goods rather than producing them domestically.
  • In these models, all firms export

How frequent is exporting?

(Source: Bernard et al., 2007)

Stylized facts on exporters

Increasing availability of firm/plant-level data since the 1990s showed:

  • Exporting is extremely rare.
  • Exporters differ from non-exporters:
    • They are larger.
    • They are more productive.
    • They use factors differently.
    • They pay higher wages.
  • Even among exporters a large heterogeneity persists…

Melitz Model: Key Intuition

Melitz (2003): Firms differ in productivity, and this heterogeneity drives trade patterns

  • Selection: Fixed costs of exporting → only the most productive firms export
  • Trade liberalization → more competition → least productive firms exit → average productivity rises
  • Explains why exporters are bigger, more productive, and pay higher wages

Trade theories vs stylized facts

Facts                       Old trade (Ricardo, H-O)   New trade (Krugman)   Hetero. firms (Melitz)
Intra-industry trade        No                         Yes                   Yes
Exporters ≠ non-exporters   No                         No                    Yes
Exporters more productive   No                         No                    Yes

Firm heterogeneity and the Pareto distribution

  • Firm-level productivity heterogeneity is well approximated by the Pareto distribution
  • Analytically tractable and empirically validated
  • Is this a good approximation of what we observe in the real world?

Colombian firms’ export value

[Figures: histogram of firms’ export value on an arithmetic scale and on a log-log scale]

A matter of scales

  1. Arithmetic scale: the histogram is highly right-skewed
    • The bulk of the distribution sits at fairly small sizes (in terms of export value), but a small number of firms have much higher values than the typical one (the origin of the long tail)
  2. Log-log scale: if we replot the same histogram with logarithmic horizontal and vertical axes, it follows a straight line quite closely.

What does it mean?

POWER LAWS AND PARETO DISTRIBUTIONS

Power Laws

Let us define \(p(x)dx\) as the fraction of firms with export value between \(x\) and \(x + dx\). Then observing a straight line in a log-log scale means

\[\log p(x) = c - (\alpha + 1) \log x\]

where \(c\) and \(-(\alpha + 1)\) are the intercept and the slope of the line. Exponentiating both sides gives

\[p(x) = C x^{-(\alpha+1)}\]

Probability distributions with this functional form are said to follow a power law and \(-(\alpha + 1)\) is the exponent of the power law.

Understanding Pareto Distribution Parameters

  • Scale Parameter (\(x_m\)):
    • Minimum possible value or “location” parameter.
    • Distribution begins at \(x_m\) and extends to infinity.
    • Must be a positive real number (\(x_m > 0\)).
  • Shape Parameter (\(\alpha\)):
    • Also known as the Pareto Index or “shape” parameter.
    • Determines the shape of the distribution curve, particularly the “tail”.
    • Must be a positive real number (\(\alpha > 0\)).

Estimating the power law exponent

Two practical methods (+ MLE for the technically minded):

  1. Binning: histogram on log-log axes → straight line → slope gives \(-(\alpha+1)\)
  2. CCDF: complementary CDF on log-log axes → slope gives \(-\alpha\) (less sensitive to bin choice)
  3. MLE: maximum likelihood — most rigorous but details in Newman (2005)

Remark: Estimation typically applied above a minimum threshold \(x_m\).
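For the technically minded, the MLE has a closed form: \(\hat{\alpha} = n / \sum_i \log(x_i / x_m)\) for observations above \(x_m\) (see Newman, 2005). A minimal sketch on simulated data (all parameter values illustrative):

```r
# MLE for the Pareto shape: alpha_hat = n / sum(log(x_i / x_m))
set.seed(42)
alpha_true <- 1.5
x_m        <- 1
# Inverse-transform sampling: if U ~ Uniform(0, 1), x_m * U^(-1/alpha) is Pareto
x <- x_m * runif(1e5)^(-1 / alpha_true)
alpha_hat <- length(x) / sum(log(x / x_m))
alpha_hat  # close to 1.5
```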

Full vs top sample

[Figures: power-law fit on the full sample vs. only the top 50% of firms]

CCDF: Visual Test for Power Laws

The CCDF \(P(x)\) = fraction of firms with export value \(\geq x\). On log-log axes, a power law appears as a straight line:

\[\log P(x_i) = c - \alpha \log(x_i) + \epsilon_i\]
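As a sketch, this regression can be run on the empirical CCDF of simulated Pareto data (shape value illustrative):

```r
# Estimate alpha from the CCDF slope on log-log axes
set.seed(1)
x <- sort(runif(5000)^(-1 / 1.5))                    # Pareto(x_m = 1, alpha = 1.5)
ccdf <- (length(x) - seq_along(x) + 1) / length(x)   # P(X >= x_i) for sorted x
fit <- lm(log(ccdf) ~ log(x))
alpha_hat <- -coef(fit)[["log(x)"]]
alpha_hat  # roughly 1.5
```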

References

Newman, M.E.J., 2005. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46, pp. 323-351.

MEMORY MANAGEMENT

How large is large?

Data processed in R must be fully loaded into RAM

  • The maximum size of data you can process depends on the amount of free RAM available
    • Rule of thumb: free RAM ≈ 2–3× the size of your data

How much free RAM do I have?

  • Windows (PowerShell/CMD): systeminfo | find "Available Physical Memory"
  • Mac (Terminal): top -l 1 | grep PhysMem
  • Linux: free -h

Object size

object.size(my_data)

chr_vect <- c("12", "11", "33")
dbl_vect <- c(12, 11, 33)
format(object.size(chr_vect), units = "auto")
# [1] "248 bytes"
format(object.size(dbl_vect), units = "auto")
# [1] "80 bytes"
memory.profile()
#        NULL      symbol    pairlist     closure environment     promise
#           1       37875     1250989       24936        7117       34670
#    language     special     builtin        char     logical     integer
#      338840          45         685      122872       36574      196015
#      double     complex   character         ...         any        list
#       18644          40      353876          11           0       91695
#  expression    bytecode externalptr     weakref         raw          S4
#           1       76205        7455        2137        2187        3573

Factors vs Characters

Encode variables efficiently (e.g., factor instead of character). For short vectors the factor’s levels and attributes add overhead, so the savings only appear once values repeat:

gender <- c("female", "male", "other")
format(object.size(gender), units = "auto")
# [1] "272 bytes"
format(object.size(as.factor(gender)), units = "auto")
# [1] "672 bytes"

gender <- rep(c("female", "male", "other"), 100)
format(object.size(gender), units = "auto")
# [1] "2.6 Kb"
format(object.size(as.factor(gender)), units = "auto")
# [1] "1.8 Kb"

Memory management tips

A few other memory-management tips in R:

  1. Sessions continue – and memory is occupied – until you log out.
  2. Manage sessions efficiently by tidying the R session workspace:
    • Load only the data you need
    • Remove redundant dataframe columns: dataframe$redundant <- NULL
    • Remove data objects from the workspace once you don’t need them: rm()
    • Force Garbage Collection gc() in loops (automatic gc is enough most of the time)
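A minimal sketch of the rm()/gc() pattern from tip 2 (object names illustrative):

```r
# Free a large intermediate object as soon as only its summary is needed
big    <- matrix(rnorm(1e6), ncol = 100)  # ~8 MB of doubles
result <- colMeans(big)                   # keep only the 100 column means
rm(big)                                   # drop the reference...
invisible(gc())                           # ...and prompt collection sooner
length(result)
```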

CHUNK AND PULL

Chunk and Pull

Chunk and Pull – Example

French SIREN data: stock of all French firms (23M+ records since 1973)

Idea: data too large to load entirely → split by year in the shell, process each chunk in R, combine results

Chunk and Pull in 3 Steps

  1. Chunk (shell): split large file into yearly CSVs using AWK or similar
  2. Process (R): write a function that reads one chunk and computes what you need
  3. Pull (R): apply the function to all chunks and bind the results
library(tidyverse)  # for read_csv, summarise/mutate, and map_df

# Step 2: function to process one chunk
compute_share_F <- function(file) {
  year <- as.numeric(gsub(".*?([0-9]+).*", "\\1", file))
  read_csv(file) |>
    summarise(F_share = mean(gender_founder == "F", na.rm = TRUE)) |>
    mutate(year = year)
}

# Step 3: apply to all chunks
my_files <- list.files("temp/yearly_data", full.names = TRUE)
pull_data <- map_df(my_files, compute_share_F)
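Step 1 can also stay inside R: readr::read_csv_chunked() streams the file in pieces, so the full CSV is never in memory at once. A sketch with a small stand-in file (paths and column names hypothetical):

```r
library(readr)

# Tiny stand-in for the real multi-GB SIREN file
dir.create("data", showWarnings = FALSE)
write_csv(data.frame(id = 1:4,
                     gender_founder = c("F", "M", "F", "M"),
                     year = c(1990, 1990, 1991, 1991)),
          "data/siren_demo.csv")

# Step 1 (R variant): stream the CSV in chunks, appending rows to yearly files
dir.create("temp/yearly_data", showWarnings = FALSE, recursive = TRUE)
read_csv_chunked(
  "data/siren_demo.csv",
  callback = SideEffectChunkCallback$new(function(chunk, pos) {
    for (y in unique(chunk$year)) {
      f <- file.path("temp/yearly_data", paste0("year_", y, ".csv"))
      write_csv(chunk[chunk$year == y, ], f, append = file.exists(f))
    }
  }),
  chunk_size = 2
)
list.files("temp/yearly_data")
```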

tidyverse

Tidy data

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms a table

tidyverse

library(tidyverse)
# -- Attaching core tidyverse packages --- tidyverse 2.0.0 --
# dplyr     1.1.4     readr     2.1.5
# forcats   1.0.0     stringr   1.5.1
# ggplot2   3.5.1     tibble    3.2.1
# lubridate 1.9.4     tidyr     1.3.1
# purrr     1.0.2

Pipes!

mpg |>
  filter(manufacturer == "audi") |>
  group_by(model) |>
  summarise(hwy_mean = mean(hwy))
# # A tibble: 3 x 2
#   model      hwy_mean
#   <chr>         <dbl>
# 1 a4             28.3
# 2 a4 quattro     25.8
# 3 a6 quattro     24

dplyr verbs

  1. filter: Filter (i.e. subset) rows based on their values
  2. arrange: Arrange (i.e. reorder) rows based on their values
  3. select: Select (i.e. subset) columns by their names
  4. mutate: Create new columns

dplyr::filter

starwars |>
  filter(
    species == "Human",
    height >= 190
  )
# A tibble: 4 x 14
#   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
# 1 Darth Va~    202   136 none       white      yellow          41.9 male  mascu~
# 2 Qui-Gon ~    193    89 brown      fair       blue            92   male  mascu~
# 3 Dooku        193    80 white      fair       brown          102   male  mascu~
# 4 Bail Pre~    191    NA black      tan        brown           67   male  mascu~

dplyr::arrange

starwars |>
  arrange(birth_year)
# A tibble: 87 x 14
#   name     height  mass hair_color skin_color eye_color birth_year sex   gender
#   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
#  1 Wicket ~     88  20   brown      brown      brown            8   male  mascu~
#  2 IG-88       200 140   none       metal      red             15   none  mascu~
#  3 Luke Sk~    172  77   blond      fair       blue            19   male  mascu~
#  4 Leia Or~    150  49   brown      light      brown           19   fema~ femin~
#  5 Wedge A~    170  77   brown      fair       hazel           21   male  mascu~
#  ...

dplyr::select

starwars |>
  select(name:skin_color, species, -height)
# A tibble: 87 x 5
#   name                mass hair_color    skin_color  species
#   <chr>              <dbl> <chr>         <chr>       <chr>
#  1 Luke Skywalker        77 blond         fair        Human
#  2 C-3PO                 75 NA            gold        Droid
#  3 R2-D2                 32 NA            white, blue Droid
#  4 Darth Vader          136 none          white       Human
#  5 Leia Organa           49 brown         light       Human
#  ...

dplyr::mutate

starwars |>
  select(name, birth_year) |>
  mutate(dog_years = birth_year * 7) |>
  mutate(comment = paste0(name, " is ", dog_years, " in dog years."))
# A tibble: 87 x 4
#   name               birth_year dog_years comment
#   <chr>                   <dbl>     <dbl> <chr>
#  1 Luke Skywalker           19        133  Luke Skywalker is 133 in dog years.
#  2 C-3PO                   112        784  C-3PO is 784 in dog years.
#  3 R2-D2                    33        231  R2-D2 is 231 in dog years.
#  4 Darth Vader              41.9      293. Darth Vader is 293.3 in dog years.
#  ...

dplyr::summarise

starwars |>
  group_by(species, gender) |>
  summarise(mean_height = mean(height, na.rm = TRUE))
# `summarise()` has grouped output by 'species'.
# A tibble: 42 x 3
# Groups:   species [38]
#   species   gender    mean_height
#   <chr>     <chr>           <dbl>
#  1 Aleena    masculine          79
#  2 Besalisk  masculine         198
#  3 Cerean    masculine         198
#  4 Chagrian  masculine         196
#  5 Clawdite  feminine          168
#  ...

data.table

data.table

  • Concise
  • Insanely fast
  • Memory efficient
  • Feature rich (and stable)
  • Dependency free

Concise

  • tidyverse
data(starwars, package = "dplyr")
starwars |>
  filter(species == "Human") |>
  group_by(homeworld) |>
  summarise(mean_height = mean(height))
  • data.table
starwars_dt <- as.data.table(starwars)
starwars_dt[species == "Human", mean(height), by = homeworld]

Fast

library(microbenchmark)  # dplyr and data.table assumed loaded

collapse_dplyr <- function() {
  storms |>
    group_by(name, year, month, day) |>
    summarize(wind = mean(wind),
              pressure = mean(pressure),
              category = dplyr::first(category))
}
storms_dt <- as.data.table(storms)
collapse_dt <- function() {
  storms_dt[, .(wind = mean(wind),
                pressure = mean(pressure),
                category = first(category)),
            by = .(name, year, month, day)]
}
microbenchmark(collapse_dplyr(), collapse_dt(), times = 1)
# Unit: milliseconds
#              expr       min    median       max neval
#  collapse_dplyr() 67.061792 67.061792 67.061792     1
#     collapse_dt()  1.943042  1.943042  1.943042     1

Syntax

DT[i, j, by]

  • i: On which rows?
    • tidyverse: filter(), slice(), arrange()
  • j: What to do?
    • tidyverse: select(), mutate(), summarise()
  • by: Grouped by what?
    • tidyverse: group_by()
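Putting the three slots together (the dplyr equivalent shown as a comment):

```r
library(data.table)
starwars_dt <- as.data.table(dplyr::starwars)

# dplyr:  starwars |> filter(species == "Human") |>
#         group_by(gender) |> summarise(n = n())
humans_by_gender <- starwars_dt[species == "Human",  # i:  which rows?
                                .(n = .N),           # j:  what to do?
                                by = gender]         # by: grouped by what?
humans_by_gender
```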

Subsetting

  • Subset to rows where variable x equals “string”
DT[x == "string", ]
  • Subset to rows where variable y is greater than 5
DT[y > 5, ]
  • Subset to the first 10 rows
DT[1:10, ]

Ordering

  • Sort by youngest to oldest
starwars_dt[order(birth_year)]
  • Sort by youngest to oldest, by reference
setorder(starwars_dt, birth_year, na.last = TRUE)

Modifying columns

  • LHS := RHS form
DT[, c("var1", "var2") := .(val1, val2)]
  • Functional form
DT[, ':=' (var1 = val1, var2 = val2)]
  • Chaining
DT[, z := 5:6][, z_sq := z^2][]

DT |>
  _[, xyz := x + y + z] |>
  _[, xyz_sq := xyz^2] |>
  _[]  # note: the _ placeholder with [ requires R >= 4.3

Subsetting columns

  • Subset by column position
starwars_dt[1:2, c(1:3, 10)]
  • Subset by name
starwars_dt[1:2, .(name, height, mass, homeworld)]
  • .() is just a data.table shortcut for list()

by

  • Collapse by variable
starwars_dt[, .(species_height = mean(height, na.rm = TRUE)), by = species]
  • …and by reference
starwars_dt[, species_n := .N, by = species][]

Efficient subsetting with .SD

starwars_dt[,
            .(mean(height, na.rm = TRUE),
              mean(mass, na.rm = TRUE),
              mean(birth_year, na.rm = TRUE)),
            by = species]

starwars_dt[,
            lapply(.SD, mean, na.rm = TRUE),
            .SDcols = c("height", "mass", "birth_year"),
            by = species]

tidyverse vs. data.table vs. base R

tidyverse        data.table              base
readr::read_csv  data.table::fread       utils::read.csv
tibble::tibble   data.table::data.table  base::data.frame
dplyr::if_else   data.table::fifelse     base::ifelse

\(\rightarrow\) mix and win.

Combine

storms_dt[, .(wind = mean(wind),
              pressure = mean(pressure),
              category = first(category)),
          by = .(name, year, month, day)] |>
  ggplot(aes(x = pressure, y = wind, col = category)) +
  geom_point(alpha = 0.3) +
  theme_minimal()

MODERN WORKFLOWS: PARQUET AND DUCKPLYR

The problem with CSV

CSV files are the default format for tabular data, but they have limitations at scale:

  • Slow to read/write – text parsing is expensive
  • No type information – everything is a string on disk
  • No built-in compression – files are much larger than necessary
  • Full scan required – cannot read a subset of columns efficiently

Apache Parquet is a columnar storage format that solves all of these problems.

Writing and reading Parquet

library(arrow)

# Write a data frame to Parquet
write_parquet(my_large_data, "data/my_large_data.parquet")

# Read it back
df <- read_parquet("data/my_large_data.parquet")

# Read only specific columns (much faster than CSV)
df_small <- read_parquet(
  "data/my_large_data.parquet",
  col_select = c("firm_id", "export_value", "year")
)

Key advantages:

  • Typically 2–10x smaller than gzipped CSV
  • Columns are stored separately – reading a subset of columns is very fast
  • Types are preserved (no more col_types headaches)
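A quick way to see the size difference on your own machine (demo data and file names hypothetical; the exact ratio depends on the data):

```r
library(arrow)
library(readr)

# Demo table: an id, a low-cardinality group, and a numeric column
df <- data.frame(id  = 1:100000,
                 grp = sample(letters, 100000, replace = TRUE),
                 x   = rnorm(100000))
write_csv(df, "demo.csv")
write_parquet(df, "demo.parquet")

file.size("demo.csv") / file.size("demo.parquet")  # > 1: Parquet is smaller
```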

duckplyr: a modern backend for dplyr

duckplyr uses DuckDB as a backend for dplyr – same syntax, dramatically faster on large data:

library(duckplyr)

# duckplyr overrides dplyr verbs transparently
df <- read_parquet("data/firms.parquet")

df |>
  filter(year >= 2010) |>
  group_by(country, year) |>
  summarise(
    total_exports = sum(export_value, na.rm = TRUE),
    n_firms = n()
  ) |>
  arrange(desc(total_exports))
  • Queries are lazily evaluated and optimized before execution
  • Works directly on Parquet files without loading everything into RAM
  • Falls back to dplyr when DuckDB cannot handle an operation

A complete modern pipeline

library(arrow)
library(duckplyr)

# 1. Convert CSV to Parquet (once)
read_csv_arrow("data/large_trade_data.csv") |>
  write_parquet("data/large_trade_data.parquet")

# 2. Work with the Parquet file directly
trade <- read_parquet("data/large_trade_data.parquet")

# 3. Analyze with dplyr syntax, DuckDB speed
result <- trade |>
  filter(export_value > 1000) |>
  group_by(country, year) |>
  summarise(
    mean_exports = mean(export_value),
    n_exporters = n(),
    .groups = "drop"
  ) |>
  collect()

\(\rightarrow\) Same tidyverse code you already know, but scales to billions of rows.

DuckDB also speaks SQL

DuckDB supports SQL natively — useful if you already know SQL or need complex queries:

library(DBI)
library(duckdb)

con <- dbConnect(duckdb())

# Query a Parquet file directly with SQL --- no loading step
dbGetQuery(con, "
  SELECT country, year,
         SUM(export_value) AS total_exports,
         COUNT(*) AS n_firms
  FROM read_parquet('data/firms.parquet')
  WHERE year >= 2010
  GROUP BY country, year
  ORDER BY total_exports DESC
")

Benchmarking: Why format and tool choice matter

# Reading 10M rows: CSV vs Parquet
system.time(read_csv("data/large.csv"))            #  ~12 sec
system.time(read_parquet("data/large.parquet"))     #  ~0.8 sec

# Summarising: dplyr vs duckplyr vs data.table
# (first two lines are identical on purpose: duckplyr transparently
#  overrides the dplyr verbs, so only the attached package differs)
system.time(df |> group_by(x) |> summarise(m = mean(y)))     # dplyr:    ~3.2 sec
system.time(df |> group_by(x) |> summarise(m = mean(y)))     # duckplyr: ~0.4 sec
system.time(dt[, .(m = mean(y)), by = x])                    # data.table: ~0.3 sec

Rule of thumb: Start with duckplyr + Parquet. Reach for data.table when you need maximum speed or in-place modification.

Wrap up

Summary

  • Firm trade data reveals heavy-tailed distributions (Pareto / power laws)
  • Memory management matters: know your RAM, encode efficiently, chunk and pull
  • tidyverse for readable, expressive data wrangling
  • data.table for speed and memory efficiency
  • Parquet + duckplyr for modern large-data workflows without leaving R

Further reading