03. Large Structured Data

Data Science for Economists

2026-03-01

Overview of topics

Trade data: Large data

  • Firm level data
  • Fitting power-law distributions

Technical questions:

  • Memory management with R
  • Good practices with large data
  • tidyverse and data.table
  • Modern workflows: Parquet and duckplyr

FIRM TRADE DATA

“Countries don’t trade. Firms trade.”

— Hallak and Levinsohn, 2005

Traditional trade theories

  1. H-O model: countries trade because of different factor endowments
  2. Ricardian model: countries trade because of different technologies

Data that became available in the late 1970s and early 1980s provided evidence of facts these theories could not explain:

  1. Similar countries trade extensively
  2. Intra-industry trade is prominent:
    • Japan exports Toyota vehicles to Germany and imports Mercedes-Benz automobiles from Germany

Intra-industry Trade

(Source: OECD, 2002)

Firms in the New Trade Theories

Toyota and Mercedes-Benz offer different varieties of the same good

Krugman (1979 and 1980) introduces:

  • Monopolistic Competition: firms produce differentiated products, and this differentiation caters to consumers’ love of variety.
  • Economies of Scale: producing under economies of scale lets firms supply a wide variety of goods more cost-effectively, increasing international trade.

Firms in the New Trade Theories

Implications

  • The love of variety leads to increased trade volumes, with countries importing many different types of goods rather than producing them domestically.
  • In these models, all firms export

How frequent is exporting?

(Source: Bernard et al., 2007)

Stylized facts on exporters

Increasing availability of firm/plant-level data since the 1990s showed:

  • Exporting is extremely rare.
  • Exporters differ from non-exporters:
    • They are larger.
    • They are more productive.
    • They use factors differently.
    • They pay higher wages.
  • Even among exporters a large heterogeneity persists…

Melitz Model: Key Intuition

Melitz (2003): Firms differ in productivity, and this heterogeneity drives trade patterns

  • Selection: Fixed costs of exporting → only the most productive firms export
  • Trade liberalization → more competition → least productive firms exit → average productivity rises
  • Explains why exporters are bigger, more productive, and pay higher wages

Trade theories vs stylized facts

Facts                       Old trade (Ricardo, H-O)   New trade (Krugman)   Hetero. firms (Melitz)
Intra-industry trade        No                         Yes                   Yes
Exporters ≠ non-exporters   No                         No                    Yes
Exporters more productive   No                         No                    Yes

Firm heterogeneity and the Pareto distribution

  • Firm-level productivity heterogeneity is well approximated by the Pareto distribution
  • Analytically tractable and empirically validated
  • Is this a good approximation of what we observe in the real world?

Colombian firms’ export value

[Figures: histogram of firms’ export value on an arithmetic scale and on a log-log scale]

A matter of scales

  1. Arithmetic scale: the histogram is highly right-skewed
    • The bulk of the distribution sits at fairly small sizes (in terms of export value), but a small number of firms have much higher values than the typical one (the origin of the long tail)
  2. Log-log scale: if we replot the same histogram with logarithmic horizontal and vertical axes, it follows a straight line quite closely.

What does it mean?

POWER LAWS AND PARETO DISTRIBUTIONS

Power Laws

Let us define \(p(x)dx\) as the fraction of firms with export value between \(x\) and \(x + dx\). Then observing a straight line in a log-log scale means

\[\log p(x) = c - (\alpha + 1) \log x\]

where \(c\) and \(-(\alpha + 1)\) are the intercept and the slope of the line. Exponentiating both sides gives

\[p(x) = C x^{-(\alpha+1)}\]

Probability distributions with this functional form are said to follow a power law and \(-(\alpha + 1)\) is the exponent of the power law.

Understanding Pareto Distribution Parameters

  • Scale Parameter (\(x_m\)):
    • Minimum possible value or “location” parameter.
    • Distribution begins at \(x_m\) and extends to infinity.
    • Must be a positive real number (\(x_m > 0\)).
  • Shape Parameter (\(\alpha\)):
    • Also known as the Pareto Index or “shape” parameter.
    • Determines the shape of the distribution curve, particularly the “tail”.
    • Must be a positive real number (\(\alpha > 0\)).

Estimating the power law exponent

Two practical methods (+ MLE for the technically minded):

  1. Binning: histogram on log-log axes → straight line → slope gives \(-(\alpha+1)\)
  2. CCDF: complementary CDF on log-log axes → slope gives \(-\alpha\) (less sensitive to bin choice)
  3. MLE: maximum likelihood — most rigorous but details in Newman (2005)

Remark: Estimation typically applied above a minimum threshold \(x_m\).
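For the technically minded, the MLE has a closed form: \(\hat{\alpha} = n / \sum_i \log(x_i / x_m)\) for observations above \(x_m\) (see Newman, 2005). A minimal sketch on simulated data (all parameter values illustrative):

```r
# MLE for the Pareto shape: alpha_hat = n / sum(log(x_i / x_m))
set.seed(42)
alpha_true <- 1.5
x_m        <- 1
# Inverse-transform sampling: if U ~ Uniform(0, 1), x_m * U^(-1/alpha) is Pareto
x <- x_m * runif(1e5)^(-1 / alpha_true)
alpha_hat <- length(x) / sum(log(x / x_m))
alpha_hat  # close to 1.5
```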

Full vs top sample

[Figures: power-law fit on the full sample vs. only the top 50% of firms]

CCDF: Visual Test for Power Laws

The CCDF \(P(x)\) = fraction of firms with export value \(\geq x\). On log-log axes, a power law appears as a straight line:

\[\log P(x_i) = c - \alpha \log(x_i) + \epsilon_i\]
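As a sketch, this regression can be run on the empirical CCDF of simulated Pareto data (shape value illustrative):

```r
# Estimate alpha from the CCDF slope on log-log axes
set.seed(1)
x <- sort(runif(5000)^(-1 / 1.5))                    # Pareto(x_m = 1, alpha = 1.5)
ccdf <- (length(x) - seq_along(x) + 1) / length(x)   # P(X >= x_i) for sorted x
fit <- lm(log(ccdf) ~ log(x))
alpha_hat <- -coef(fit)[["log(x)"]]
alpha_hat  # roughly 1.5
```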

References

Newman, M.E.J., 2005. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46, pp. 323-351.

MEMORY MANAGEMENT

How large is large?

Data processed in R must be fully loaded into RAM

  • The maximum size of data you can process depends on the amount of free RAM available
    • Rule of thumb: free RAM ≈ 2–3× the size of your data

How much free RAM do I have?

  • Windows (PowerShell/CMD): systeminfo | find "Available Physical Memory"
  • Mac (Terminal): top -l 1 | grep PhysMem
  • Linux: free -h

Object size

object.size(my_data)

chr_vect <- c("12", "11", "33")
dbl_vect <- c(12, 11, 33)
format(object.size(chr_vect), units = "auto")
# [1] "248 bytes"
format(object.size(dbl_vect), units = "auto")
# [1] "80 bytes"
memory.profile()
#        NULL      symbol    pairlist     closure environment     promise
#           1       37875     1250989       24936        7117       34670
#    language     special     builtin        char     logical     integer
#      338840          45         685      122872       36574      196015
#      double     complex   character         ...         any        list
#       18644          40      353876          11           0       91695
#  expression    bytecode externalptr     weakref         raw          S4
#           1       76205        7455        2137        2187        3573

Factors vs Characters

Encode variables efficiently (e.g., factor instead of character). For short vectors the factor’s levels and attributes add overhead, so the savings only appear once values repeat:

gender <- c("female", "male", "other")
format(object.size(gender), units = "auto")
# [1] "272 bytes"
format(object.size(as.factor(gender)), units = "auto")
# [1] "672 bytes"

gender <- rep(c("female", "male", "other"), 100)
format(object.size(gender), units = "auto")
# [1] "2.6 Kb"
format(object.size(as.factor(gender)), units = "auto")
# [1] "1.8 Kb"

Memory management tips

A few other memory-management tips in R:

  1. Sessions continue – and memory is occupied – until you log out.
  2. Manage sessions efficiently by tidying the R session workspace:
    • Load only the data you need
    • Remove redundant dataframe columns: dataframe$redundant <- NULL
    • Remove data objects from the workspace once you don’t need them: rm()
    • Force Garbage Collection gc() in loops (automatic gc is enough most of the time)
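A minimal sketch of the rm()/gc() pattern from tip 2 (object names illustrative):

```r
# Free a large intermediate object as soon as only its summary is needed
big    <- matrix(rnorm(1e6), ncol = 100)  # ~8 MB of doubles
result <- colMeans(big)                   # keep only the 100 column means
rm(big)                                   # drop the reference...
invisible(gc())                           # ...and prompt collection sooner
length(result)
```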

CHUNK AND PULL

Chunk and Pull

Chunk and Pull – Example

French SIREN data: stock of all French firms (23M+ records since 1973)

Idea: data too large to load entirely → split by year in the shell, process each chunk in R, combine results

Chunk and Pull in 3 Steps

  1. Chunk (shell): split large file into yearly CSVs using AWK or similar
  2. Process (R): write a function that reads one chunk and computes what you need
  3. Pull (R): apply the function to all chunks and bind the results
library(tidyverse)  # for read_csv, summarise/mutate, and map_df

# Step 2: function to process one chunk
compute_share_F <- function(file) {
  year <- as.numeric(gsub(".*?([0-9]+).*", "\\1", file))
  read_csv(file) |>
    summarise(F_share = mean(gender_founder == "F", na.rm = TRUE)) |>
    mutate(year = year)
}

# Step 3: apply to all chunks
my_files <- list.files("temp/yearly_data", full.names = TRUE)
pull_data <- map_df(my_files, compute_share_F)
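Step 1 can also stay inside R: readr::read_csv_chunked() streams the file in pieces, so the full CSV is never in memory at once. A sketch with a small stand-in file (paths and column names hypothetical):

```r
library(readr)

# Tiny stand-in for the real multi-GB SIREN file
dir.create("data", showWarnings = FALSE)
write_csv(data.frame(id = 1:4,
                     gender_founder = c("F", "M", "F", "M"),
                     year = c(1990, 1990, 1991, 1991)),
          "data/siren_demo.csv")

# Step 1 (R variant): stream the CSV in chunks, appending rows to yearly files
dir.create("temp/yearly_data", showWarnings = FALSE, recursive = TRUE)
read_csv_chunked(
  "data/siren_demo.csv",
  callback = SideEffectChunkCallback$new(function(chunk, pos) {
    for (y in unique(chunk$year)) {
      f <- file.path("temp/yearly_data", paste0("year_", y, ".csv"))
      write_csv(chunk[chunk$year == y, ], f, append = file.exists(f))
    }
  }),
  chunk_size = 2
)
list.files("temp/yearly_data")
```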

tidyverse

Tidy data

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms a table

tidyverse

library(tidyverse)
# -- Attaching core tidyverse packages --- tidyverse 2.0.0 --
# dplyr     1.1.4     readr     2.1.5
# forcats   1.0.0     stringr   1.5.1
# ggplot2   3.5.1     tibble    3.2.1
# lubridate 1.9.4     tidyr     1.3.1
# purrr     1.0.2

Pipes!

mpg |>
  filter(manufacturer == "audi") |>
  group_by(model) |>
  summarise(hwy_mean = mean(hwy))
# # A tibble: 3 x 2
#   model      hwy_mean
#   <chr>         <dbl>
# 1 a4             28.3
# 2 a4 quattro     25.8
# 3 a6 quattro     24

dplyr verbs

  1. filter: Filter (i.e. subset) rows based on their values
  2. arrange: Arrange (i.e. reorder) rows based on their values
  3. select: Select (i.e. subset) columns by their names
  4. mutate: Create new columns

dplyr::filter

starwars |>
  filter(
    species == "Human",
    height >= 190
  )
# A tibble: 4 x 14
#   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
# 1 Darth Va~    202   136 none       white      yellow          41.9 male  mascu~
# 2 Qui-Gon ~    193    89 brown      fair       blue            92   male  mascu~
# 3 Dooku        193    80 white      fair       brown          102   male  mascu~
# 4 Bail Pre~    191    NA black      tan        brown           67   male  mascu~

dplyr::arrange

starwars |>
  arrange(birth_year)
# A tibble: 87 x 14
#   name     height  mass hair_color skin_color eye_color birth_year sex   gender
#   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
#  1 Wicket ~     88  20   brown      brown      brown            8   male  mascu~
#  2 IG-88       200 140   none       metal      red             15   none  mascu~
#  3 Luke Sk~    172  77   blond      fair       blue            19   male  mascu~
#  4 Leia Or~    150  49   brown      light      brown           19   fema~ femin~
#  5 Wedge A~    170  77   brown      fair       hazel           21   male  mascu~
#  ...

dplyr::select

starwars |>
  select(name:skin_color, species, -height)
# A tibble: 87 x 5
#   name                mass hair_color    skin_color  species
#   <chr>              <dbl> <chr>         <chr>       <chr>
#  1 Luke Skywalker        77 blond         fair        Human
#  2 C-3PO                 75 NA            gold        Droid
#  3 R2-D2                 32 NA            white, blue Droid
#  4 Darth Vader          136 none          white       Human
#  5 Leia Organa           49 brown         light       Human
#  ...

dplyr::mutate

starwars |>
  select(name, birth_year) |>
  mutate(dog_years = birth_year * 7) |>
  mutate(comment = paste0(name, " is ", dog_years, " in dog years."))
# A tibble: 87 x 4
#   name               birth_year dog_years comment
#   <chr>                   <dbl>     <dbl> <chr>
#  1 Luke Skywalker           19        133  Luke Skywalker is 133 in dog years.
#  2 C-3PO                   112        784  C-3PO is 784 in dog years.
#  3 R2-D2                    33        231  R2-D2 is 231 in dog years.
#  4 Darth Vader              41.9      293. Darth Vader is 293.3 in dog years.
#  ...

dplyr::summarise

starwars |>
  group_by(species, gender) |>
  summarise(mean_height = mean(height, na.rm = TRUE))
# `summarise()` has grouped output by 'species'.
# A tibble: 42 x 3
# Groups:   species [38]
#   species   gender    mean_height
#   <chr>     <chr>           <dbl>
#  1 Aleena    masculine          79
#  2 Besalisk  masculine         198
#  3 Cerean    masculine         198
#  4 Chagrian  masculine         196
#  5 Clawdite  feminine          168
#  ...

data.table

data.table

  • Concise
  • Insanely fast
  • Memory efficient
  • Feature rich (and stable)
  • Dependency free

Concise

  • tidyverse
data(starwars, package = "dplyr")
starwars |>
  filter(species == "Human") |>
  group_by(homeworld) |>
  summarise(mean_height = mean(height))
  • data.table
starwars_dt <- as.data.table(starwars)
starwars_dt[species == "Human", mean(height), by = homeworld]

Fast

library(microbenchmark)  # dplyr and data.table assumed loaded

collapse_dplyr <- function() {
  storms |>
    group_by(name, year, month, day) |>
    summarize(wind = mean(wind),
              pressure = mean(pressure),
              category = dplyr::first(category))
}
storms_dt <- as.data.table(storms)
collapse_dt <- function() {
  storms_dt[, .(wind = mean(wind),
                pressure = mean(pressure),
                category = first(category)),
            by = .(name, year, month, day)]
}
microbenchmark(collapse_dplyr(), collapse_dt(), times = 1)
# Unit: milliseconds
#              expr       min    median       max neval
#  collapse_dplyr() 67.061792 67.061792 67.061792     1
#     collapse_dt()  1.943042  1.943042  1.943042     1

Syntax

DT[i, j, by]

  • i: On which rows?
    • tidyverse: filter(), slice(), arrange()
  • j: What to do?
    • tidyverse: select(), mutate(), summarise()
  • by: Grouped by what?
    • tidyverse: group_by()
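Putting the three slots together (the dplyr equivalent shown as a comment):

```r
library(data.table)
starwars_dt <- as.data.table(dplyr::starwars)

# dplyr:  starwars |> filter(species == "Human") |>
#         group_by(gender) |> summarise(n = n())
humans_by_gender <- starwars_dt[species == "Human",  # i:  which rows?
                                .(n = .N),           # j:  what to do?
                                by = gender]         # by: grouped by what?
humans_by_gender
```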

Subsetting

  • Subset to rows where variable x equals “string”
DT[x == "string", ]
  • Subset to rows where variable y is greater than 5
DT[y > 5, ]
  • Subset to the first 10 rows
DT[1:10, ]

Ordering

  • Sort by youngest to oldest
starwars_dt[order(birth_year)]
  • Sort by youngest to oldest, by reference
setorder(starwars_dt, birth_year, na.last = TRUE)

Modifying columns

  • LHS := RHS form
DT[, c("var1", "var2") := .(val1, val2)]
  • Functional form
DT[, ':=' (var1 = val1, var2 = val2)]
  • Chaining
DT[, z := 5:6][, z_sq := z^2][]

DT |>
  _[, xyz := x + y + z] |>
  _[, xyz_sq := xyz^2] |>
  _[]  # note: the _ placeholder with [ requires R >= 4.3

Subsetting columns

  • Subset by column position
starwars_dt[1:2, c(1:3, 10)]
  • Subset by name
starwars_dt[1:2, .(name, height, mass, homeworld)]
  • .() is just a data.table shortcut for list()

by

  • Collapse by variable
starwars_dt[, .(species_height = mean(height, na.rm = TRUE)), by = species]
  • …and by reference
starwars_dt[, species_n := .N, by = species][]

Efficient subsetting with .SD

starwars_dt[,
            .(mean(height, na.rm = TRUE),
              mean(mass, na.rm = TRUE),
              mean(birth_year, na.rm = TRUE)),
            by = species]

starwars_dt[,
            lapply(.SD, mean, na.rm = TRUE),
            .SDcols = c("height", "mass", "birth_year"),
            by = species]

tidyverse vs. data.table vs. base R

tidyverse        data.table              base
readr::read_csv  data.table::fread       utils::read.csv
tibble::tibble   data.table::data.table  base::data.frame
dplyr::if_else   data.table::fifelse     base::ifelse

\(\rightarrow\) mix and win.

Combine

storms_dt[, .(wind = mean(wind),
              pressure = mean(pressure),
              category = first(category)),
          by = .(name, year, month, day)] |>
  ggplot(aes(x = pressure, y = wind, col = category)) +
  geom_point(alpha = 0.3) +
  theme_minimal()

MODERN WORKFLOWS: PARQUET AND DUCKPLYR

The problem with CSV

CSV files are the default format for tabular data, but they have limitations at scale:

  • Slow to read/write – text parsing is expensive
  • No type information – everything is a string on disk
  • No built-in compression – files are much larger than necessary
  • Full scan required – cannot read a subset of columns efficiently

Apache Parquet is a columnar storage format that solves all of these problems.

Writing and reading Parquet

library(arrow)

# Write a data frame to Parquet
write_parquet(my_large_data, "data/my_large_data.parquet")

# Read it back
df <- read_parquet("data/my_large_data.parquet")

# Read only specific columns (much faster than CSV)
df_small <- read_parquet(
  "data/my_large_data.parquet",
  col_select = c("firm_id", "export_value", "year")
)

Key advantages:

  • Typically 2–10x smaller than gzipped CSV
  • Columns are stored separately – reading a subset of columns is very fast
  • Types are preserved (no more col_types headaches)
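A quick way to see the size difference on your own machine (demo data and file names hypothetical; the exact ratio depends on the data):

```r
library(arrow)
library(readr)

# Demo table: an id, a low-cardinality group, and a numeric column
df <- data.frame(id  = 1:100000,
                 grp = sample(letters, 100000, replace = TRUE),
                 x   = rnorm(100000))
write_csv(df, "demo.csv")
write_parquet(df, "demo.parquet")

file.size("demo.csv") / file.size("demo.parquet")  # > 1: Parquet is smaller
```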

duckplyr: a modern backend for dplyr

duckplyr uses DuckDB as a backend for dplyr – same syntax, dramatically faster on large data:

library(duckplyr)

# duckplyr overrides dplyr verbs transparently
df <- read_parquet("data/firms.parquet")

df |>
  filter(year >= 2010) |>
  group_by(country, year) |>
  summarise(
    total_exports = sum(export_value, na.rm = TRUE),
    n_firms = n()
  ) |>
  arrange(desc(total_exports))
  • Queries are lazily evaluated and optimized before execution
  • Works directly on Parquet files without loading everything into RAM
  • Falls back to dplyr when DuckDB cannot handle an operation

A complete modern pipeline

library(arrow)
library(duckplyr)

# 1. Convert CSV to Parquet (once)
read_csv_arrow("data/large_trade_data.csv") |>
  write_parquet("data/large_trade_data.parquet")

# 2. Work with the Parquet file directly
trade <- read_parquet("data/large_trade_data.parquet")

# 3. Analyze with dplyr syntax, DuckDB speed
result <- trade |>
  filter(export_value > 1000) |>
  group_by(country, year) |>
  summarise(
    mean_exports = mean(export_value),
    n_exporters = n(),
    .groups = "drop"
  ) |>
  collect()

\(\rightarrow\) Same tidyverse code you already know, but scales to billions of rows.

DuckDB also speaks SQL

DuckDB supports SQL natively — useful if you already know SQL or need complex queries:

library(DBI)
library(duckdb)

con <- dbConnect(duckdb())

# Query a Parquet file directly with SQL --- no loading step
dbGetQuery(con, "
  SELECT country, year,
         SUM(export_value) AS total_exports,
         COUNT(*) AS n_firms
  FROM read_parquet('data/firms.parquet')
  WHERE year >= 2010
  GROUP BY country, year
  ORDER BY total_exports DESC
")

Benchmarking: Why format and tool choice matter

# Reading 10M rows: CSV vs Parquet
system.time(read_csv("data/large.csv"))            #  ~12 sec
system.time(read_parquet("data/large.parquet"))     #  ~0.8 sec

# Summarising: dplyr vs duckplyr vs data.table
# (first two lines are identical on purpose: duckplyr transparently
#  overrides the dplyr verbs, so only the attached package differs)
system.time(df |> group_by(x) |> summarise(m = mean(y)))     # dplyr:    ~3.2 sec
system.time(df |> group_by(x) |> summarise(m = mean(y)))     # duckplyr: ~0.4 sec
system.time(dt[, .(m = mean(y)), by = x])                    # data.table: ~0.3 sec

Rule of thumb: Start with duckplyr + Parquet. Reach for data.table when you need maximum speed or in-place modification.

Wrap up

Summary

  • Firm trade data reveals heavy-tailed distributions (Pareto / power laws)
  • Memory management matters: know your RAM, encode efficiently, chunk and pull
  • tidyverse for readable, expressive data wrangling
  • data.table for speed and memory efficiency
  • Parquet + duckplyr for modern large-data workflows without leaving R

Further reading