01. Getting Started

Data Science for Economists

2026-04-01

What this course is about

  • Data science skills complementary to standard econometrics
  • Data cleaning and wrangling, visualization, databases, machine learning, etc.
  • Research in (broadly defined) International Economics is shifting towards empirics
    • I never had this course but wish I did.

WHO WE ARE

name / program / coding background

This Course

  • Weekly sessions over the summer semester (Apr–Jul)
  • Each session: mix of slides and hands-on coding
  • Goal: leave with a working toolkit, not just slides
  • with Irene Iodice and Hendrik Malko

Course overview

Schedule

Week Module
Apr 15 Getting Started – Reproducibility, Git, Docker, IDE setup
Apr 22 Toolkit – Shell basics, R fundamentals, Quarto
Apr 29 Large Structured Data – Millions of rows: data.table, Parquet, duckplyr
May 06 Web Scraping & APIs – HTML parsing, APIs, online prices
May 13 Text as Data – Tokenization, bag-of-words, policy uncertainty
May 20 Spatial & Satellite Data – CRS, nightlights, satellite imagery in R

Schedule (cont.)

Week Module
May 27 TBD
Jun 03 Time as Data – Event studies, diff-in-diff, causal inference
Jun 10 Machine Learning – Model selection, regularization, causal forests
Jun 17 Large Language Models – LLM APIs, structured output, training a mini-LLM
Jun 24 no class
Jul 01 AI-Assisted Research – CLAUDE.md, agents, skills, LLM workflows

Web Scraping

  • “The Billion Prices Project”: Web-scraped prices for many stores and countries (Cavallo & Rigobon, 2016)
  • Are online prices different from offline prices?

Satellite Imagery

Large Structured Data

[Figures: the same data on an arithmetic scale vs. a log-log scale]

Text as Data


Time as Data

Machine Learning

LLMs and Historical Data

Göttlich, Jiang, Loibner & Voth (2025): Ranke-4B — language models trained exclusively on time-stamped historical text

  • 4-billion-parameter models with knowledge cutoffs at 1913, 1929, 1933, 1939, 1946
  • Trained on ~80 billion tokens of curated historical sources
  • “Time-locked”: the model literally cannot know what happens after its cutoff
  • Avoids hindsight contamination of modern LLMs roleplaying a historical period

GitHub: DGoettlich/history-llms

Course logistics

Format

  • Weekly sessions, ~3 hours each
  • Mix of slides + live-coding
  • All code available on GitHub — clone the repo and follow along

Resources

Questions?

GOOD RESEARCH PRACTICE

Session Roadmap

  • Reproducibility
  • Git and GitHub
  • Docker and Dev Containers
  • Testing
  • Setting up your machines

Reproducibility

Reproducibility

  • Mostly for your future self!
  • but of course also: Science.

“Trying to replicate the estimates from an early draft of a paper, we discover that the code that produced the estimates no longer works because it calls files that have since been moved.”

“Between two regressions, the number of observations falls. After much sleuthing, we find that many observations were dropped in a merge because they had missing values for the county identifier we were merging on. When we correct the mistake and include the dropped observations, the results change dramatically.”

“My coauthor and I write code that refers to a common set of data files stored in a shared folder. Our work is constantly interrupted because changes one of us makes to the data files cause the other’s code to break.”

8 building blocks of reproducibility

Code and Data in the Social Sciences (Gentzkow and Shapiro):

  1. Automation — Automate everything; write a single script that executes all code from beginning to end. Use a “master” file or, even better, use make.
  2. Version Control — Store code and data under version control. Run the whole directory before checking it back in. Use Git.
  3. Directories — Separate directories by function. Separate files into inputs and outputs. Make directories portable. Use code, input, output and temp folders.
  4. Keys — Store cleaned data in tables with unique, non-missing keys. Keep data normalized as far into your code pipeline as you can.
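Building blocks 1 (Automation) and 3 (Directories) fit together in a single “master” script. A minimal sketch — the step scripts named in the comments are made up for illustration:

```shell
#!/usr/bin/env bash
# run.sh -- hypothetical master script: one entry point for the whole pipeline
set -euo pipefail   # stop at the first error instead of silently continuing

# Building block 3: one directory per function
mkdir -p demo_project/code demo_project/input demo_project/output demo_project/temp

# Building block 1: each step is its own script, executed in a fixed order, e.g.
#   Rscript demo_project/code/01_clean.R      # reads input/,  writes temp/
#   Rscript demo_project/code/02_estimate.R   # reads temp/,   writes output/
#   Rscript demo_project/code/03_figures.R    # reads output/, writes output/figures/
ls demo_project
```

Running `bash run.sh` from the project root then reproduces everything from raw inputs to final outputs in one command.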

8 building blocks (cont.)

  5. Abstraction — Abstract to eliminate redundancy and improve clarity. Otherwise, don’t abstract.
  6. Documentation — Don’t write documentation you will not maintain. Code should be self-documenting.
  7. Management — Manage tasks with a task management system. E-mail is not a task management system.
  8. Code Style — Keep it short and purposeful. Use descriptive names. Be consistent. Stick to a style guide.

Quick aside: Style guides

# bad
f=function(x,y){
  d=x[x$v>3,]
  m=lm(y~x,data=d)
  return(m)}
# good
fit_model <- function(data, threshold) {
  filtered <- data[value > threshold]
  model <- lm(outcome ~ treatment,
              data = filtered)
  return(model)
}

Git and GitHub

Git

  • Git is a distributed version control system
    • “Dropbox and the ‘Track changes’ feature in MS Word have a baby: Git”
  • Optimized for code (not data, actually)

Git’s Data Model: Snapshots, not Diffs

  • Each commit is a snapshot of every file at that point in time
  • A commit stores: the snapshot, author, message, and pointer(s) to parent commit(s)
  • History is a directed acyclic graph (DAG):
o <-- o <-- o <-- o          (main)
            ^
             \
              --- o <-- o    (feature branch)

What is a DAG?

  • Directed: edges have a direction (each commit points to its parent)
  • Acyclic: no loops — you can never walk from a commit back to itself
  • Graph: nodes (commits) connected by edges (parent links)
  • Merging = a commit with two parents (two parent pointers leaving the merge commit, one per branch)
o <-- o <-- o <-- o <---- o   (merge commit)
            ^            /
             \          v
              --- o <-- o

GitHub

  • Online hosting platform that provides services built on top of the Git system
    • Similar: Bitbucket and GitLab
  • Makes Git a lot more user friendly
  • Seamless integration with lots of other software, e.g. VS Code

4 Main Git Operations

  1. Stage (or “add”): Mark changes to be included in the next commit
    • file edits, additions, deletions, etc.
  2. Commit: Yes, you are sure these changes should be part of the repo history
    • need to add a message (and optionally a description)
  3. Pull: Download new changes made on the GitHub repo (i.e. the upstream remote)
    • either by your collaborators or you on another machine
  4. Push: Upload any (committed) local changes to the GitHub repo
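The four operations can be sketched in a throwaway local repo (names and messages are made up; pull and push are commented out because they need a remote):

```shell
repo="demo_repo"
git init -q "$repo"                        # create an empty local repo
git -C "$repo" config user.email "you@example.com"
git -C "$repo" config user.name "You"

echo "first draft" > "$repo/notes.txt"
git -C "$repo" add notes.txt               # 1. stage the new file
git -C "$repo" commit -q -m "Add notes"    # 2. commit with a message
# git -C "$repo" pull                      # 3. download changes from the remote
# git -C "$repo" push                      # 4. upload committed local changes
git -C "$repo" log --oneline               # the history now has one commit
```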

Merge Conflicts

# README
Some text here.
<<<<<<< HEAD
Text added by Partner 2.
=======
Text added by Partner 1.
>>>>>>> 814e09178910383c128045ce67a58c9c1df3f558
More text here.
  • Delete the lines you don’t want, along with the special Git conflict markers (<<<<<<<, =======, >>>>>>>)
  • Then: stage, commit, pull and push
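If both partners’ lines are worth keeping, the resolved file (before staging) would read:

```
# README
Some text here.
Text added by Partner 2.
Text added by Partner 1.
More text here.
```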

Branches and Forks

Branch

  • Take snapshot of existing repo and try out a whole new idea without affecting your main branch
  • If new idea works, merge back into main branch
    • fix bugs
    • implement new empirical strategies, robustness checks, …
  • If it doesn’t work, just delete the experimental branch

Branches and Forks

Fork

  • Forking a repo is similar to branching, but it creates a full copy of the repo under your own account
  • A pull request lets you merge your changes back into the original (upstream) repo
    • Easy to do on GitHub

.gitignore

  • Tells Git which files to ignore
    • exclude whole folders or classes of files (e.g. by extension or name pattern)
  • Simply add the names (or patterns) of files and folders that should be ignored
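For example, a project .gitignore might contain (entries illustrative):

```
# scratch and generated files
temp/
*.log
output/*.pdf

# large raw inputs kept out of version control
input/*.csv

# OS junk
.DS_Store
```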

Useful Git Commands

Command                            What it does
git log --all --graph --oneline    Visualize the DAG
git diff                           Show unstaged changes
git diff --staged                  Show what will be committed
git stash                          Temporarily shelve changes
git blame <file>                   Who changed each line, and when
git bisect                         Binary search for the commit that introduced a bug

Docker and Dev Containers

Docker: Why?

  • “It works on my machine” — Docker ensures it works on every machine
  • Packages your code + dependencies + OS into a portable container
  • Reproducibility: collaborators (and your future self) get the exact same environment

Images vs Containers

  • Image = a blueprint — frozen snapshot of an OS + installed software
    • Think: a recipe, or a class definition
    • Examples: rocker/tidyverse:4.4, python:3.12
  • Container = a running instance of an image
    • Think: a dish made from the recipe, or an object from the class
    • You can run many containers from the same image
  • docker pull downloads an image; docker run creates a container from it
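A terminal sketch of the image/container distinction (the image tag is the one above; container names are made up):

```
$ docker pull rocker/tidyverse:4.4                       # download the image (blueprint)
$ docker run --rm -it rocker/tidyverse:4.4 R             # run one container, dropping into R
$ docker run -d --name projectA rocker/tidyverse:4.4     # two containers...
$ docker run -d --name projectB rocker/tidyverse:4.4     # ...from the same image
```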

Dev Containers: The Modern Workflow

  • Dev Containers = Docker + your IDE, seamlessly integrated
  • VSCode supports Dev Containers natively
  • Open a project → IDE detects .devcontainer/ config → launches container automatically
  • You code inside the container as if it were your local machine
// .devcontainer/devcontainer.json (minimal example)
{
  "name": "R Data Science",
  "image": "rocker/tidyverse:4.4",
  "customizations": {
    "vscode": { "extensions": ["REditorSupport.r"] }
  }
}

Testing

Why Test?

  • Tests catch bugs before they reach your results
  • They let you refactor with confidence — change code, re-run tests, know nothing broke
  • The simplest test: an assertion
gravity_estimate <- -1.2

# does our estimate have the expected sign?
stopifnot(gravity_estimate < 0)

# does the merge keep all observations?
stopifnot(nrow(merged_dt) == nrow(original_dt))

testthat: The R Testing Framework

library(testthat)

# a function we want to test
calc_trade_share <- function(exports, gdp) {
  if (gdp <= 0) stop("GDP must be positive")
  return(exports / gdp)
}

test_that("trade share is computed correctly", {
  expect_equal(calc_trade_share(50, 200), 0.25)
  expect_equal(calc_trade_share(0, 100), 0)
})

test_that("invalid GDP raises an error", {
  expect_error(calc_trade_share(50, 0), "GDP must be positive")
  expect_error(calc_trade_share(50, -10), "GDP must be positive")
})

Setting up your machines

What you need installed

  • R (4.4+) and Quarto (1.6+)
  • Git — we’ll use it throughout the course
  • VSCode with the R extension

macOS / Linux: Use Homebrew

# Install Homebrew (if not already installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install R, Git, and Quarto (note: casks are macOS-only)
brew install --cask r quarto
brew install git
  • Homebrew: brew.sh — package manager for macOS and Linux
  • Keeps everything up to date with brew upgrade

R packages

install.packages(c("tidyverse", "data.table", "quarto"))
# For the LLM module later in the semester:
install.packages(c("ellmer", "torch"))
torch::install_torch()  # one-time download of the C++ backend