Data Science for Economists
2026-03-01
Trade data: Large data
Technical questions:
FIRM TRADE DATA
“Countries don’t trade. Firms trade.”
— Hallak and Levinsohn, 2005
Firm-level data that became available in the late 1970s and early 1980s provided evidence of facts the existing theories left unexplained:

Toyota and Mercedes-Benz offer different varieties of the same good
Krugman (1979, 1980) introduces increasing returns, monopolistic competition, and consumers' love of variety:
Implications

The increasing availability of firm- and plant-level data since the 1990s showed:
Melitz (2003): Firms differ in productivity, and this heterogeneity drives trade patterns
| Facts | Old trade (Ricardo, H-O) | New trade (Krugman) | Hetero. firms (Melitz) |
|---|---|---|---|
| Intra-industry trade | No | Yes | Yes |
| Exporters ≠ non-exporters | No | No | Yes |
| Exporters more productive | No | No | Yes |


What does it mean?
POWER LAWS AND PARETO DISTRIBUTIONS
Let us define \(p(x)dx\) as the fraction of firms with export value between \(x\) and \(x + dx\). Then observing a straight line in a log-log scale means
\[\log p(x) = c - (\alpha + 1) \log x\]
where \(c\) and \(-(\alpha + 1)\) are the intercept and the slope of the line. Exponentiating both sides gives
\[p(x) = C x^{-(\alpha+1)}\]
Probability distributions with this functional form are said to follow a power law and \(-(\alpha + 1)\) is the exponent of the power law.

Two practical methods (+ MLE for the technically minded):
Remark: Estimation typically applied above a minimum threshold \(x_m\).


The CCDF \(P(x)\) is the fraction of firms with export value \(\geq x\). Integrating the density, \(P(x) = \int_x^{\infty} C t^{-(\alpha+1)}\,dt = \frac{C}{\alpha}\, x^{-\alpha}\), so on log-log axes a power law again appears as a straight line, now with slope \(-\alpha\):
\[\log P(x_i) = c - \alpha \log(x_i) + \epsilon_i\]
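As a sketch of how this estimation works in practice (the simulated data and all names below are illustrative, not from the lecture's dataset): draw Pareto samples, build the empirical CCDF, and compare the OLS slope with the MLE (the Hill estimator):

```r
set.seed(42)
alpha <- 1.5; x_m <- 1                     # true tail exponent and threshold
# Pareto draws via the inverse CDF: x = x_m * U^(-1/alpha)
x <- x_m * runif(1e4)^(-1 / alpha)

# Empirical CCDF at the sorted values: share of observations >= x_(i)
x_sorted <- sort(x)
ccdf <- (length(x_sorted) - seq_along(x_sorted) + 1) / length(x_sorted)

# OLS on the log-log scale: the slope should be close to -alpha
fit <- lm(log(ccdf) ~ log(x_sorted))
coef(fit)[2]

# MLE (Hill estimator): alpha_hat = 1 / mean(log(x / x_m))
alpha_hat <- 1 / mean(log(x / x_m))
alpha_hat
```

The OLS fit on the CCDF is known to be biased and to understate standard errors relative to MLE; Newman (2005) discusses why the likelihood-based estimate is preferable.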

Newman, M.E.J., 2005. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46, pp. 323-351.
MEMORY MANAGEMENT
Data processed in R must be fully loaded into RAM.
How much free RAM do I have?

- Windows: systeminfo | find "Available Physical Memory"
- macOS / Linux: top
- Linux: free -h

And in R, object.size(my_data) reports how much memory an object uses.
chr_vect <- c("12", "11", "33")
dbl_vect <- c(12, 11, 33)
format(object.size(chr_vect), units = "auto")
# [1] "248 bytes"
format(object.size(dbl_vect), units = "auto")
# [1] "80 bytes"
memory.profile()
# NULL symbol pairlist closure environment promise
# 1 37875 1250989 24936 7117 34670
# language special builtin char logical integer
# 338840 45 685 122872 36574 196015
# double complex character ... any list
# 18644 40 353876 11 0 91695
# expression bytecode externalptr weakref raw S4
# 1 76205 7455 2137 2187 3573

Encode variables efficiently (e.g., factor instead of character):
gender <- c("female", "male", "other")
format(object.size(gender), units = "auto")
# [1] "272 bytes"
format(object.size(as.factor(gender)), units = "auto")
# [1] "672 bytes"
gender <- rep(c("female", "male", "other"), 100)
format(object.size(gender), units = "auto")
# [1] "2.6 Kb"
format(object.size(as.factor(gender)), units = "auto")
# [1] "1.8 Kb"

A few other memory-management tips in R:
- Drop redundant columns: dataframe$redundant <- NULL
- Remove objects you no longer need with rm()
- Calling gc() in loops is rarely worth it (automatic gc is enough most of the time)

CHUNK AND PULL

French SIREN data: stock of all French firms (23M+ records since 1973)
Idea: data too large to load entirely → split by year in the shell, process each chunk in R, combine results
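The split itself (step 1) is done in the shell in the lecture; as an R-only sketch under assumed column names (year, gender_founder — the toy file below is hypothetical), readr's chunked reader can stream the file and append each chunk's rows to a per-year CSV without ever loading everything:

```r
library(readr)
library(dplyr)
library(purrr)

# Toy stand-in for the raw file (hypothetical columns)
dir.create("temp/yearly_data", recursive = TRUE, showWarnings = FALSE)
write_csv(
  tibble(year = rep(2019:2021, each = 4),
         gender_founder = rep(c("F", "M"), 6)),
  "temp/siren_raw.csv"
)

# Step 1 (sketch): stream the file chunk by chunk,
# appending each chunk's rows to the matching yearly file
read_csv_chunked(
  "temp/siren_raw.csv",
  SideEffectChunkCallback$new(function(chunk, pos) {
    chunk |>
      group_split(year) |>
      walk(function(d) {
        f <- file.path("temp/yearly_data", paste0("siren_", d$year[1], ".csv"))
        write_csv(d, f, append = file.exists(f))  # header only on first write
      })
  }),
  chunk_size = 5   # tiny on purpose; use ~1e6 rows in practice
)

list.files("temp/yearly_data")
```

Because append = file.exists(f), the first write of each yearly file includes the header and later chunks only append rows.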
# Step 2: function to process one chunk
library(readr)
library(dplyr)
library(purrr)

compute_share_F <- function(file) {
  year <- as.numeric(gsub(".*?([0-9]+).*", "\\1", file))
  read_csv(file) |>
    summarise(F_share = mean(gender_founder == "F", na.rm = TRUE)) |>
    mutate(year = year)
}
# Step 3: apply to all chunks
my_files <- list.files("temp/yearly_data", full.names = TRUE)
pull_data <- map_df(my_files, compute_share_F)

tidyverse
- filter: Filter (i.e. subset) rows based on their values
- arrange: Arrange (i.e. reorder) rows based on their values
- select: Select (i.e. subset) columns by their names
- mutate: Create new columns

starwars |>
  filter(
    species == "Human",
    height >= 190
  )
# A tibble: 4 x 14
# name height mass hair_color skin_color eye_color birth_year sex gender
# <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
# 1 Darth Va~ 202 136 none white yellow 41.9 male mascu~
# 2 Qui-Gon ~ 193 89 brown fair blue 92 male mascu~
# 3 Dooku 193 80 white fair brown 102 male mascu~
# 4 Bail Pre~ 191 NA black tan brown 67 male mascu~

starwars |>
  arrange(birth_year)
# A tibble: 87 x 14
# name height mass hair_color skin_color eye_color birth_year sex gender
# <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
# 1 Wicket ~ 88 20 brown brown brown 8 male mascu~
# 2 IG-88 200 140 none metal red 15 none mascu~
# 3 Luke Sk~ 172 77 blond fair blue 19 male mascu~
# 4 Leia Or~ 150 49 brown light brown 19 fema~ femin~
# 5 Wedge A~ 170 77 brown fair hazel 21 male mascu~
# ...

starwars |>
  select(name:skin_color, species, -height)
# A tibble: 87 x 5
# name mass hair_color skin_color species
# <chr> <dbl> <chr> <chr> <chr>
# 1 Luke Skywalker 77 blond fair Human
# 2 C-3PO 75 NA gold Droid
# 3 R2-D2 32 NA white, blue Droid
# 4 Darth Vader 136 none white Human
# 5 Leia Organa 49 brown light Human
# ...

starwars |>
  select(name, birth_year) |>
  mutate(dog_years = birth_year * 7) |>
  mutate(comment = paste0(name, " is ", dog_years, " in dog years."))
# A tibble: 87 x 4
# name birth_year dog_years comment
# <chr> <dbl> <dbl> <chr>
# 1 Luke Skywalker 19 133 Luke Skywalker is 133 in dog years.
# 2 C-3PO 112 784 C-3PO is 784 in dog years.
# 3 R2-D2 33 231 R2-D2 is 231 in dog years.
# 4 Darth Vader 41.9 293. Darth Vader is 293.3 in dog years.
# ...

starwars |>
  group_by(species, gender) |>
  summarise(mean_height = mean(height, na.rm = TRUE))
# `summarise()` has grouped output by 'species'.
# A tibble: 42 x 3
# Groups: species [38]
# species gender mean_height
# <chr> <chr> <dbl>
# 1 Aleena masculine 79
# 2 Besalisk masculine 198
# 3 Cerean masculine 198
# 4 Chagrian masculine 196
# 5 Clawdite feminine 168
# ...

data.table

How much speed does data.table buy over the tidyverse? A quick benchmark of the same collapse:

collapse_dplyr <- function() {
  storms |>
    group_by(name, year, month, day) |>
    summarize(wind = mean(wind),
              pressure = mean(pressure),
              category = dplyr::first(category))
}
library(data.table)

storms_dt <- as.data.table(storms)
collapse_dt <- function() {
  storms_dt[, .(wind = mean(wind),
                pressure = mean(pressure),
                category = first(category)),
            by = .(name, year, month, day)]
}
library(microbenchmark)
microbenchmark(collapse_dplyr(), collapse_dt(), times = 1)
# Unit: milliseconds
# expr min median max neval
# collapse_dplyr() 67.061792 67.061792 67.061792 1
# collapse_dt() 1.943042 1.943042 1.943042 1

data.table's general form is DT[i, j, by]:
- i ↔ filter(), slice(), arrange()
- j ↔ select(), mutate() (the latter in LHS := RHS form)
- by ↔ group_by()
- .() is just a data.table shortcut for list()

| tidyverse | data.table | base |
|---|---|---|
| readr::read_csv | data.table::fread | utils::read.csv |
| tibble::tibble | data.table::data.table | base::data.frame |
| dplyr::if_else | – | base::ifelse |
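The DT[i, j, by] correspondence above can be sketched side by side (using dplyr's bundled starwars data; only an illustration):

```r
library(data.table)
library(dplyr)

sw <- dplyr::starwars[, c("name", "species", "height")]
sw_dt <- as.data.table(sw)

# data.table: i filters rows, j computes, by groups
res_dt <- sw_dt[species == "Human",                          # i  ~ filter()
                .(mean_height = mean(height, na.rm = TRUE)), # j  ~ summarise()
                by = species]                                # by ~ group_by()

# dplyr equivalent
res_dp <- sw |>
  filter(species == "Human") |>
  group_by(species) |>
  summarise(mean_height = mean(height, na.rm = TRUE))
```

Both return one row with the mean height of humans; the only difference is where each clause lives.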
\(\rightarrow\) mix and win.
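"Mix and win" in practice — a hedged sketch (the big CSV in the comment is hypothetical) combining data.table's fast I/O and in-place updates with tidyverse verbs:

```r
library(data.table)
library(dplyr)

# Fast I/O and in-place work with data.table...
dt <- as.data.table(mtcars)        # in practice: dt <- fread("big_file.csv")
dt[, kpl := mpg * 0.425]           # in-place column creation (no copy)

# ...then tidyverse verbs where they read better
result <- as_tibble(dt) |>
  mutate(efficient = if_else(kpl > 8, "yes", "no")) |>
  count(efficient)
```

Nothing forces an either/or choice: conversions between data.table and tibble are cheap relative to the I/O and grouping work.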
MODERN WORKFLOWS: PARQUET AND DUCKPLYR
CSV files are the default format for tabular data, but they have limitations at scale:

- no stored column types: every read re-parses and re-guesses
- row-oriented: the whole file is scanned even if you need only a few columns
- uncompressed by default, so files are large and slow to read
Apache Parquet is a columnar storage format that solves all of these problems.
library(arrow)
# Write a data frame to Parquet
write_parquet(my_large_data, "data/my_large_data.parquet")
# Read it back
df <- read_parquet("data/my_large_data.parquet")
# Read only specific columns (much faster than CSV)
df_small <- read_parquet(
  "data/my_large_data.parquet",
  col_select = c("firm_id", "export_value", "year")
)

Key advantages:

- Column types are stored in the file (no col_types headaches)
- Columnar layout: read just the columns you need
- Compressed by default: much smaller files on disk

duckplyr uses DuckDB as a backend for dplyr – same syntax, dramatically faster on large data:
library(arrow)
library(duckplyr)
# 1. Convert CSV to Parquet (once)
read_csv_arrow("data/large_trade_data.csv") |>
  write_parquet("data/large_trade_data.parquet")
# 2. Work with the Parquet file directly
trade <- read_parquet("data/large_trade_data.parquet")
# 3. Analyze with dplyr syntax, DuckDB speed
result <- trade |>
  filter(export_value > 1000) |>
  group_by(country, year) |>
  summarise(
    mean_exports = mean(export_value),
    n_exporters = n(),
    .groups = "drop"
  ) |>
  collect()

\(\rightarrow\) Same tidyverse code you already know, but scales to billions of rows.
DuckDB supports SQL natively — useful if you already know SQL or need complex queries:
library(DBI)
library(duckdb)
con <- dbConnect(duckdb())
# Query a Parquet file directly with SQL --- no loading step
dbGetQuery(con, "
  SELECT country, year,
         SUM(export_value) AS total_exports,
         COUNT(*) AS n_firms
  FROM read_parquet('data/firms.parquet')
  WHERE year >= 2010
  GROUP BY country, year
  ORDER BY total_exports DESC
")

# Reading 10M rows: CSV vs Parquet
system.time(read_csv("data/large.csv")) # ~12 sec
system.time(read_parquet("data/large.parquet")) # ~0.8 sec
# Summarising: dplyr vs duckplyr vs data.table
system.time(df |> group_by(x) |> summarise(m = mean(y)))  # dplyr: ~3.2 sec
# identical code, but with duckplyr loaded:
system.time(df |> group_by(x) |> summarise(m = mean(y)))  # duckplyr: ~0.4 sec
system.time(dt[, .(m = mean(y)), by = x])                 # data.table: ~0.3 sec

Rule of thumb: Start with duckplyr + Parquet. Reach for data.table when you need maximum speed or in-place modification.
Wrap up