03. Large Structured Data

Data Science for Economists

2026-03-01

Overview

Large structured datasets are everywhere in economics:

  • International trade flows (billions of transactions)
  • Firm-level administrative records (millions of firms)
  • Matched employer-employee data, tax records, scanner prices

This session:

  1. Examples of large data in economics
  2. Memory management in R
  3. tidyverse and data.table
  4. Modern workflows: Parquet and duckplyr

LARGE DATA IN ECONOMICS

Why large datasets matter

  • Key economic questions require disaggregated data: inequality, misallocation, welfare effects of trade, firm dynamics
  • Aggregate data hides heterogeneity — the action is in the tails and cross-sections
  • Datasets keep growing: customs records, credit registries, scanner prices, web-scraped data

Challenge: these datasets are too large for naive workflows (Excel, read.csv into RAM)

\(\Rightarrow\) You need the right tools and habits

International trade data

The workhorse: BACI (CEPII) — harmonised bilateral trade flows

  • ~12 million rows per year (reporter × partner × product at HS6)
  • Coverage: 200+ countries, 5,000+ products, 1995–present
  • Source: UN COMTRADE, cleaned and reconciled by CEPII

Other trade datasets at scale:

  • WIOD / ICIO: inter-country input-output tables
  • AIS / shipping data: vessel-level tracking (billions of pings)

Firms in trade

“Countries don’t trade. Firms trade.”

— Hallak and Levinsohn, 2005

Firm-level customs data (available for many countries since the 1990s) revealed:

  • Exporting is rare — only a small share of firms export at all
  • Exporters are bigger, more productive, and pay higher wages
  • Even among exporters, a few firms dominate total export value

Melitz (2003): Fixed costs of exporting → only the most productive firms select into exporting. Trade liberalisation raises average productivity through firm selection.

Firm-level trade data: what it looks like

[Figure: Colombian firm export values, shown on arithmetic and log-log scales]

Colombian firm export values — massively skewed, heavy-tailed. A handful of firms account for most of total exports.

How frequent is exporting?

[Figure omitted — source: Bernard et al. (2007)]

Administrative and matched data

Matched employer-employee data link workers to firms over time — millions of records per country

  • Abowd, Kramarz & Margolis (1999): decompose wages into worker and firm effects (French DADS data)
  • Card, Heining & Kline (2013): rising wage inequality in Germany driven by firm heterogeneity (IAB data, 25M+ worker-year obs.)

Tax and census records at scale:

  • Chetty et al. (2014): intergenerational mobility across US commuting zones using IRS tax records (billions of observations)
  • IPUMS: harmonised census microdata across countries and decades

Scanner and transaction data

Scanner / price data — item-level prices scraped or scanned daily

  • Cavallo & Rigobon (2016): Billion Prices Project — web-scraped online prices for many stores and countries
  • Nielsen / IRI scanner data: weekly item-level sales for US retail (used in IO, trade, macro)

Financial transaction data — credit/debit card spending, real-time

  • Chetty et al. (2024): Opportunity Insights Economic Tracker — credit card, payroll, and job-posting data to measure COVID-19 impacts in real time

MEMORY MANAGEMENT

How large is too large?

Data processed in R must be fully loaded into RAM

  • The maximum size of data you can process depends on the amount of free RAM available
    • Rule of thumb: free RAM ≈ 2–3x the size of the data
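A quick back-of-envelope check (pure arithmetic, assuming a BACI-sized table of ~12 million rows with six numeric columns stored as 8-byte doubles):

```r
rows  <- 12e6   # ~one year of BACI
cols  <- 6      # hypothetical: all numeric (double) columns
bytes <- rows * cols * 8
bytes / 1e9       # ~0.58 GB in memory
3 * bytes / 1e9   # ~1.7 GB of free RAM by the 2-3x rule of thumb
```

Strings, factors, and R's object overhead push the real footprint higher, so treat this as a lower bound.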

How much free RAM do I have?

  • Windows (PowerShell/CMD): systeminfo | find "Available Physical Memory"
  • Mac: top
  • Linux: free -h

Object size

object.size(my_data)

chr_vect <- c("12", "11", "33")
dbl_vect <- c(12, 11, 33)
format(object.size(chr_vect), units = "auto")
# [1] "248 bytes"
format(object.size(dbl_vect), units = "auto")
# [1] "80 bytes"
memory.profile()
#        NULL      symbol    pairlist     closure environment     promise
#           1       37875     1250989       24936        7117       34670
#    language     special     builtin        char     logical     integer
#      338840          45         685      122872       36574      196015
#      double     complex   character         ...         any        list
#       18644          40      353876          11           0       91695
#  expression    bytecode externalptr     weakref         raw          S4
#           1       76205        7455        2137        2187        3573

Factors vs Characters

Encode variables efficiently (e.g., factor instead of character):

gender <- c("female", "male", "other")
format(object.size(gender), units = "auto")
# [1] "272 bytes"
format(object.size(as.factor(gender)), units = "auto")
# [1] "672 bytes"

gender <- rep(c("female", "male", "other"), 100)
format(object.size(gender), units = "auto")
# [1] "2.6 Kb"
format(object.size(as.factor(gender)), units = "auto")
# [1] "1.8 Kb"

A factor stores each unique string once plus integer codes, so it only saves memory when values repeat — for a short vector of unique strings it is actually larger.

Memory management tips

A few other memory-management tips in R:

  1. Sessions continue – and memory is occupied – until you log out.
  2. Manage sessions efficiently by tidying the R session workspace:
    • Load only the data you need
    • Remove redundant dataframe columns: dataframe$redundant <- NULL
    • Remove data objects from the workspace once you don’t need them: rm()
    • Force garbage collection with gc() inside long loops (automatic GC is enough most of the time)
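A minimal sketch combining these tips, assuming a hypothetical data/ folder of yearly CSVs with value and redundant columns:

```r
totals <- numeric(0)
for (f in list.files("data", pattern = "\\.csv$", full.names = TRUE)) {
  chunk <- read.csv(f)                 # load only the data you need
  chunk$redundant <- NULL              # drop columns you won't use
  totals <- c(totals, sum(chunk$value, na.rm = TRUE))
  rm(chunk)                            # remove the chunk from the workspace
  gc()                                 # force garbage collection in the loop
}
```

This keeps at most one chunk in RAM at a time — the same idea the next section formalises as "chunk and pull".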

CHUNK AND PULL

Chunk and Pull – Example

French SIREN data: stock of all French firms (23M+ records since 1973)

Idea: data too large to load entirely → split by year in the shell, process each chunk in R, combine results

Chunk and Pull in 3 Steps

  1. Chunk (shell): split large file into yearly CSVs using AWK or similar
  2. Process (R): write a function that reads one chunk and computes what you need
  3. Pull (R): apply the function to all chunks and bind the results

library(data.table)

# Step 2: function to process one chunk
process_chunk <- function(file) {
  dt <- fread(file)
  dt[, .(n_rows = .N, mean_val = mean(value, na.rm = TRUE),
         file = basename(file))]
}

# Step 3: apply to all chunks and bind
my_files <- list.files("temp/yearly_data", full.names = TRUE)
results <- rbindlist(lapply(my_files, process_chunk))
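Step 1 can be sketched with AWK — here on a toy stand-in for the large file, assuming the year sits in the third column:

```shell
# Toy stand-in for the large file (the real data has 23M+ rows); year in column 3
printf 'id,value,year\n1,10,1995\n2,20,1996\n3,30,1995\n' > big.csv
mkdir -p temp/yearly_data

# Step 1: write one CSV per year, repeating the header in each chunk
awk -F',' 'NR == 1 { hdr = $0; next }
           { f = "temp/yearly_data/" $3 ".csv"
             if (!(f in seen)) { print hdr > f; seen[f] = 1 }
             print >> f }' big.csv
```

AWK streams the file line by line, so this step never needs the whole dataset in memory.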

tidyverse

Tidy data

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms a table

tidyverse

library(tidyverse)
# -- Attaching core tidyverse packages --- tidyverse 2.0.0 --
# dplyr     1.1.4     readr     2.1.5
# forcats   1.0.0     stringr   1.5.1
# ggplot2   3.5.1     tibble    3.2.1
# lubridate 1.9.4     tidyr     1.3.1
# purrr     1.0.2

Pipes!

mpg |>
  filter(manufacturer == "audi") |>
  group_by(model) |>
  summarise(hwy_mean = mean(hwy))
# # A tibble: 3 x 2
#   model      hwy_mean
#   <chr>         <dbl>
# 1 a4             28.3
# 2 a4 quattro     25.8
# 3 a6 quattro     24

dplyr verbs

  1. filter: Filter (i.e. subset) rows based on their values
  2. arrange: Arrange (i.e. reorder) rows based on their values
  3. select: Select (i.e. subset) columns by their names
  4. mutate: Create new columns
  5. summarise: Collapse multiple rows into grouped summaries

dplyr::filter

starwars |>
  filter(
    species == "Human",
    height >= 190
  )
# A tibble: 4 x 14
#   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
# 1 Darth Va~    202   136 none       white      yellow          41.9 male  mascu~
# 2 Qui-Gon ~    193    89 brown      fair       blue            92   male  mascu~
# 3 Dooku        193    80 white      fair       brown          102   male  mascu~
# 4 Bail Pre~    191    NA black      tan        brown           67   male  mascu~

dplyr::arrange

starwars |>
  arrange(birth_year)
# A tibble: 87 x 14
#   name     height  mass hair_color skin_color eye_color birth_year sex   gender
#   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
#  1 Wicket ~     88  20   brown      brown      brown            8   male  mascu~
#  2 IG-88       200 140   none       metal      red             15   none  mascu~
#  3 Luke Sk~    172  77   blond      fair       blue            19   male  mascu~
#  4 Leia Or~    150  49   brown      light      brown           19   fema~ femin~
#  5 Wedge A~    170  77   brown      fair       hazel           21   male  mascu~
#  ...

dplyr::select

starwars |>
  select(name:skin_color, species, -height)
# A tibble: 87 x 5
#   name                mass hair_color    skin_color  species
#   <chr>              <dbl> <chr>         <chr>       <chr>
#  1 Luke Skywalker        77 blond         fair        Human
#  2 C-3PO                 75 NA            gold        Droid
#  3 R2-D2                 32 NA            white, blue Droid
#  4 Darth Vader          136 none          white       Human
#  5 Leia Organa           49 brown         light       Human
#  ...

dplyr::mutate

starwars |>
  select(name, birth_year) |>
  mutate(dog_years = birth_year * 7) |>
  mutate(comment = paste0(name, " is ", dog_years, " in dog years."))
# A tibble: 87 x 4
#   name               birth_year dog_years comment
#   <chr>                   <dbl>     <dbl> <chr>
#  1 Luke Skywalker           19        133  Luke Skywalker is 133 in dog years.
#  2 C-3PO                   112        784  C-3PO is 784 in dog years.
#  3 R2-D2                    33        231  R2-D2 is 231 in dog years.
#  4 Darth Vader              41.9      293. Darth Vader is 293.3 in dog years.
#  ...

dplyr::summarise

starwars |>
  group_by(species, gender) |>
  summarise(mean_height = mean(height, na.rm = TRUE))
# `summarise()` has grouped output by 'species'.
# A tibble: 42 x 3
# Groups:   species [38]
#   species   gender    mean_height
#   <chr>     <chr>           <dbl>
#  1 Aleena    masculine          79
#  2 Besalisk  masculine         198
#  3 Cerean    masculine         198
#  4 Chagrian  masculine         196
#  5 Clawdite  feminine          168
#  ...

data.table

data.table

  • Concise
  • Insanely fast
  • Memory efficient
  • Feature rich (and stable)
  • Dependency free

Concise

  • tidyverse
data(starwars, package = "dplyr")
starwars |>
  filter(species == "Human") |>
  group_by(homeworld) |>
  summarise(mean_height = mean(height))
  • data.table
starwars_dt <- as.data.table(starwars)
starwars_dt[species == "Human", mean(height), by = homeworld]

Fast

library(dplyr)           # storms data + verbs
library(data.table)
library(microbenchmark)

collapse_dplyr <- function() {
  storms |>
    group_by(name, year, month, day) |>
    summarize(wind = mean(wind),
              pressure = mean(pressure),
              category = dplyr::first(category))
}
storms_dt <- as.data.table(storms)
collapse_dt <- function() {
  storms_dt[, .(wind = mean(wind),
                pressure = mean(pressure),
                category = first(category)),
            by = .(name, year, month, day)]
}
microbenchmark(collapse_dplyr(), collapse_dt(), times = 1)
# Unit: milliseconds
#              expr       min    median       max neval
#  collapse_dplyr() 67.061792 67.061792 67.061792     1
#     collapse_dt()  1.943042  1.943042  1.943042     1

Syntax: DT[i, j, by]

  • i: On which rows?
    • tidyverse: filter(), slice(), arrange()
  • j: What to do?
    • tidyverse: select(), mutate()
  • by: Grouped by what?
    • tidyverse: group_by()
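The three slots can be combined in a single call (using the starwars data shipped with dplyr):

```r
library(data.table)
data(starwars, package = "dplyr")
starwars_dt <- as.data.table(starwars)

starwars_dt[species == "Human",                          # i:  which rows?
            .(mean_height = mean(height, na.rm = TRUE)), # j:  what to do?
            by = homeworld]                              # by: grouped by what?
```

One bracket does the work of a filter + group_by + summarise pipeline.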

Subsetting

  • Subset to rows where variable x equals “string”
DT[x == "string", ]
  • Subset to rows where variable y is greater than 5
DT[y > 5, ]
  • Subset to the first 10 rows
DT[1:10, ]

Ordering

  • Sort by youngest to oldest
starwars_dt[order(birth_year)]
  • Sort by youngest to oldest, by reference
setorder(starwars_dt, birth_year, na.last = TRUE)

Modifying columns

  • LHS := RHS form
DT[, c("var1", "var2") := .(val1, val2)]
  • Functional form
DT[, ':=' (var1 = val1, var2 = val2)]
  • Chaining
DT[, z := 5:6][, z_sq := z^2][]

  • Chaining with the native pipe and the _ placeholder (requires R ≥ 4.3)
DT |>
  _[, xyz := x + y + z] |>
  _[, xyz_sq := xyz^2] |>
  _[]

Subsetting columns

  • Subset by column position
starwars_dt[1:2, c(1:3, 10)]
  • Subset by name
starwars_dt[1:2, .(name, height, mass, homeworld)]
  • .() is just a data.table shortcut for list()

by

  • Collapse by variable
starwars_dt[, .(species_height = mean(height, na.rm = TRUE)), by = species]
  • …and by reference
starwars_dt[, species_n := .N, by = species][]

Efficient subsetting with .SD

starwars_dt[,
            .(mean(height, na.rm = TRUE),
              mean(mass, na.rm = TRUE),
              mean(birth_year, na.rm = TRUE)),
            by = species]

starwars_dt[,
            lapply(.SD, mean, na.rm = TRUE),
            .SDcols = c("height", "mass", "birth_year"),
            by = species]

tidyverse vs. data.table vs. base R

tidyverse          data.table              base
readr::read_csv    data.table::fread       utils::read.csv
tibble::tibble     data.table::data.table  base::data.frame
dplyr::if_else     data.table::fifelse     base::ifelse

\(\rightarrow\) mix and win.

Combine

storms_dt[, .(wind = mean(wind),
              pressure = mean(pressure),
              category = first(category)),
          by = .(name, year, month, day)] |>
  ggplot(aes(x = pressure, y = wind, col = category)) +
  geom_point(alpha = 0.3) +
  theme_minimal()

MODERN WORKFLOWS: PARQUET AND DUCKPLYR

The problem with CSV

CSV files are the default format for tabular data, but they have limitations at scale:

  • Slow to read/write – text parsing is expensive
  • No type information – everything is a string on disk
  • No compression – files are much larger than necessary
  • Full scan required – cannot read a subset of columns efficiently

Apache Parquet is a columnar storage format that solves all of these problems.

Writing and reading Parquet

library(arrow)
data(storms, package = "dplyr")

# Write both formats
write.csv(storms, "storms.csv", row.names = FALSE)
write_parquet(storms, "storms.parquet")

# Compare file sizes
cat("CSV:    ", round(file.size("storms.csv") / 1e6, 2), "MB\n")
cat("Parquet:", round(file.size("storms.parquet") / 1e6, 2), "MB\n")

# Read it back
df <- read_parquet("storms.parquet")

# Read only specific columns (much faster than CSV)
df_small <- read_parquet("storms.parquet",
  col_select = c("name", "year", "wind", "pressure"))

Key advantages:

  • Typically 2–10x smaller than gzipped CSV
  • Columns are stored separately – reading a subset of columns is very fast
  • Types are preserved (no more col_types headaches)
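Parquet files can also be partitioned on disk via arrow — a sketch writing one folder per year and querying lazily:

```r
library(arrow)
library(dplyr)
data(storms, package = "dplyr")

# Hive-style partitioning: storms_by_year/year=1975/..., one folder per year
write_dataset(storms, "storms_by_year", partitioning = "year")

open_dataset("storms_by_year") |>
  filter(year == 2015) |>                # only the year=2015 folder is scanned
  summarise(max_wind = max(wind)) |>
  collect()
```

Filtering on the partition variable skips whole folders, so queries touch only a fraction of the data on disk.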

duckplyr: a modern backend for dplyr

duckplyr uses DuckDB as a backend for dplyr – same syntax, dramatically faster on large data:

library(duckplyr)

# Register the Parquet file as a lazy DuckDB relation -- nothing is loaded into RAM
df <- read_parquet_duckdb("storms.parquet")

df |>
  filter(year >= 2010) |>
  group_by(name, year) |>
  summarise(
    max_wind = max(wind, na.rm = TRUE),
    min_pressure = min(pressure, na.rm = TRUE),
    n_obs = n()
  ) |>
  arrange(desc(max_wind))

  • Queries are lazily evaluated and optimized before execution
  • Works directly on Parquet files without loading everything into RAM
  • Falls back to dplyr when DuckDB cannot handle an operation

A complete modern pipeline

library(arrow)
library(duckplyr)

# 1. Convert CSV to Parquet (once)
data(storms, package = "dplyr")
write_parquet(storms, "storms.parquet")

# 2. Register the Parquet file lazily with DuckDB
df <- read_parquet_duckdb("storms.parquet")

# 3. Analyze with dplyr syntax, DuckDB speed
result <- df |>
  filter(wind > 50) |>
  group_by(status, year) |>
  summarise(
    mean_wind = mean(wind),
    n = n(),
    .groups = "drop"
  ) |>
  collect()

\(\rightarrow\) Same tidyverse code you already know, but scales to billions of rows.

DuckDB also speaks SQL

DuckDB supports SQL natively — useful if you already know SQL or need complex queries:

library(DBI)
library(duckdb)

con <- dbConnect(duckdb())

# Query a Parquet file directly with SQL --- no loading step
dbGetQuery(con, "
  SELECT status, year,
         AVG(wind) AS mean_wind,
         COUNT(*) AS n_obs
  FROM read_parquet('storms.parquet')
  WHERE year >= 2010
  GROUP BY status, year
  ORDER BY mean_wind DESC
")

dbDisconnect(con, shutdown = TRUE)
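DuckDB can also query an in-memory R data frame directly via duckdb_register() — a zero-copy view, with no table-writing step:

```r
library(DBI)
library(duckdb)
data(storms, package = "dplyr")

con <- dbConnect(duckdb())
duckdb_register(con, "storms", storms)   # zero-copy: the data is not duplicated
dbGetQuery(con, "SELECT status, AVG(wind) AS mean_wind
                 FROM storms GROUP BY status")
dbDisconnect(con, shutdown = TRUE)
```

Handy for mixing SQL and dplyr on the same object without round-tripping through files.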

Benchmarking: Why format and tool choice matter

library(arrow); library(data.table)
data(storms, package = "dplyr")

# Write both formats
write.csv(storms, "storms.csv", row.names = FALSE)
write_parquet(storms, "storms.parquet")

# Reading: CSV vs Parquet
system.time(read.csv("storms.csv"))              # base R CSV
system.time(data.table::fread("storms.csv"))     # fread
system.time(read_parquet("storms.parquet"))       # Parquet

# Summarising: dplyr vs data.table
storms_dt <- as.data.table(storms)
system.time(storms |> dplyr::group_by(name, year) |>
              dplyr::summarise(w = mean(wind), .groups = "drop"))
system.time(storms_dt[, .(w = mean(wind)), by = .(name, year)])

Rule of thumb: Start with duckplyr + Parquet. Reach for data.table when you need maximum speed or in-place modification.

Wrap up

Summary

  • Large structured data is central to modern empirical economics: trade, firms, admin records, prices
  • Memory management matters: know your RAM, encode efficiently, chunk and pull
  • tidyverse for readable, expressive data wrangling
  • data.table for speed and memory efficiency
  • Parquet + duckplyr for modern large-data workflows without leaving R

Further reading

  • Bernard, Jensen, Redding & Schott (2007), “Firms in International Trade”, JEP
  • Melitz (2003), “The Impact of Trade on Intra-Industry Reallocations”, Econometrica
  • Card, Heining & Kline (2013), “Workplace Heterogeneity and the Rise of West German Wage Inequality”, QJE
  • Chetty, Hendren, Kline & Saez (2014), “Where is the Land of Opportunity?”, QJE
  • NCEAS (2025), Parquet + DuckDB + Arrow tutorial: learning.nceas.ucsb.edu
  • DuckDB docs on Parquet tips: duckdb.org/docs/stable/data/parquet/tips