01. Getting Started

Data Science for Economists

2026-04-01

What this course is about

  • Data science skills complementary to standard econometrics
  • Data cleaning and wrangling, visualization, databases, machine learning, etc.
  • Research in (broadly defined) International Economics is shifting towards empirics
    • I never had this course but wish I did.

WHO WE ARE

name / program / coding background

This Course

  • Weekly sessions over the summer semester (Apr–Jul)
  • Each session: mix of slides and hands-on coding
  • Goal: leave with a working toolkit, not just slides
  • with Irene Iodice and Hendrik Malko

Course overview

Schedule

Week Module
Apr 15 Getting Started – Reproducibility, Git, Docker, IDE setup
Apr 22 Toolkit – Shell basics, R fundamentals, Quarto
Apr 29 Large Structured Data – Millions of rows: data.table, Parquet, duckplyr
May 06 Web Scraping & APIs – HTML parsing, APIs, online prices
May 13 Text as Data – Tokenization, bag-of-words, policy uncertainty
May 20 Spatial & Satellite Data – CRS, nightlights, satellite imagery in R

Schedule (cont.)

Week Module
May 27 TBD
Jun 03 Time as Data – Event studies, diff-in-diff, causal inference
Jun 10 Machine Learning – Model selection, regularization, causal forests
Jun 17 Large Language Models – LLM APIs, structured output, training a mini-LLM
Jun 24 no class
Jul 01 AI-Assisted Research – CLAUDE.md, agents, skills, LLM workflows

Web Scraping

  • “The Billion Prices Project”: Web-scraped prices for many stores and countries (Cavallo & Rigobon, 2016)
  • Are online prices different from offline prices?

Satellite Imagery

Large Structured Data

[Figures: the same data on an arithmetic scale vs. a log-log scale]

Text as Data


Time as Data

Machine Learning

LLMs and Historical Data

Göttlich, Jiang, Loibner & Voth (2025): Ranke-4B — language models trained exclusively on time-stamped historical text

  • 4-billion-parameter models with knowledge cutoffs at 1913, 1929, 1933, 1939, 1946
  • Trained on ~80 billion tokens of curated historical sources
  • “Time-locked”: the model literally cannot know what happens after its cutoff
  • Avoids hindsight contamination of modern LLMs roleplaying a historical period

GitHub: DGoettlich/history-llms

Course logistics

Format

  • Weekly sessions, ~3 hours each
  • Mix of slides + live-coding
  • All code available on GitHub — clone the repo and follow along

Resources

Questions?

GOOD RESEARCH PRACTICE

Session Roadmap

  • Reproducibility
  • Git and GitHub
  • Docker and Dev Containers
  • Testing
  • Setting up your machines

Reproducibility

Reproducibility

  • Mostly for your future self!
  • but of course also: Science.

“Trying to replicate the estimates from an early draft of a paper, we discover that the code that produced the estimates no longer works because it calls files that have since been moved.”

“Between two regressions, the number of observations falls. After much sleuthing, we find that many observations were dropped in a merge because they had missing values for the county identifier we were merging on. When we correct the mistake and include the dropped observations, the results change dramatically.”

“My coauthor and I write code that refers to a common set of data files stored in a shared folder. Our work is constantly interrupted because changes one of us makes to the data files cause the other’s code to break.”

8 building blocks of reproducibility

Code and Data in the Social Sciences (Gentzkow and Shapiro):

  1. Automation — Automate everything; write a single script that executes all code from beginning to end. Use a “master” file or, even better, use make.
  2. Version Control — Store code and data under version control. Run the whole directory before checking it back in. Use Git.
  3. Directories — Separate directories by function. Separate files into inputs and outputs. Make directories portable. Use code, input, output and temp folders.
  4. Keys — Store cleaned data in tables with unique, non-missing keys. Keep data normalized as far into your code pipeline as you can.
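Building blocks 1 (Automation) and 3 (Directories) fit together in a single “master” script. A minimal sketch — the step scripts named in the comments are made up for illustration:

```shell
#!/usr/bin/env bash
# run.sh -- hypothetical master script: one entry point for the whole pipeline
set -euo pipefail   # stop at the first error instead of silently continuing

# Building block 3: one directory per function
mkdir -p demo_project/code demo_project/input demo_project/output demo_project/temp

# Building block 1: each step is its own script, executed in a fixed order, e.g.
#   Rscript demo_project/code/01_clean.R      # reads input/,  writes temp/
#   Rscript demo_project/code/02_estimate.R   # reads temp/,   writes output/
#   Rscript demo_project/code/03_figures.R    # reads output/, writes output/figures/
ls demo_project
```

Running `bash run.sh` from the project root then reproduces everything from raw inputs to final outputs in one command.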

8 building blocks (cont.)

  5. Abstraction — Abstract to eliminate redundancy and improve clarity. Otherwise, don’t abstract.
  6. Documentation — Don’t write documentation you will not maintain. Code should be self-documenting.
  7. Management — Manage tasks with a task management system. E-mail is not a task management system.
  8. Code Style — Keep it short and purposeful. Use descriptive names. Be consistent. Stick to a style guide.

Quick aside: Style guides

# bad
f=function(x,y){
  d=x[x$v>3,]
  m=lm(y~x,data=d)
  return(m)}
# good
fit_model <- function(data, threshold) {
  filtered <- data[value > threshold]
  model <- lm(outcome ~ treatment,
              data = filtered)
  return(model)
}

Git and GitHub

Git

  • Git is a distributed version control system
    • “Dropbox and the ‘Track changes’ feature in MS Word have a baby: Git”
  • Optimized for code (not data, actually)

Git’s Data Model: Snapshots, not Diffs

  • Each commit is a snapshot of every file at that point in time
  • A commit stores: the snapshot, author, message, and pointer(s) to parent commit(s)
  • History is a directed acyclic graph (DAG):
o <-- o <-- o <-- o          (main)
            ^
             \
              --- o <-- o    (feature branch)

What is a DAG?

  • Directed: edges have a direction (each commit points to its parent)
  • Acyclic: no loops — you can never walk from a commit back to itself
  • Graph: nodes (commits) connected by edges (parent links)
  • Merging = a commit with two parents (two parent pointers leaving the merge commit, one per branch)
o <-- o <-- o <-- o <---- o   (merge commit)
            ^            /
             \          v
              --- o <-- o

GitHub

  • Online hosting platform that provides services built on top of the Git system
    • Similar: Bitbucket and GitLab
  • Makes Git a lot more user friendly
  • Seamless integration with lots of other software, e.g. VS Code

4 Main Git Operations

  1. Stage (or “add”): Mark changes to be included in the next commit
    • file edits, additions, deletions, etc.
  2. Commit: Yes, you are sure these changes should be part of the repo history
    • need to add a message (and optionally a description)
  3. Pull: Download new changes made on the GitHub repo (i.e. the upstream remote)
    • either by your collaborators or you on another machine
  4. Push: Upload any (committed) local changes to the GitHub repo
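The four operations can be sketched in a throwaway local repo (names and messages are made up; pull and push are commented out because they need a remote):

```shell
repo="demo_repo"
git init -q "$repo"                        # create an empty local repo
git -C "$repo" config user.email "you@example.com"
git -C "$repo" config user.name "You"

echo "first draft" > "$repo/notes.txt"
git -C "$repo" add notes.txt               # 1. stage the new file
git -C "$repo" commit -q -m "Add notes"    # 2. commit with a message
# git -C "$repo" pull                      # 3. download changes from the remote
# git -C "$repo" push                      # 4. upload committed local changes
git -C "$repo" log --oneline               # the history now has one commit
```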

Merge Conflicts

# README
Some text here.
<<<<<<< HEAD
Text added by Partner 2.
=======
Text added by Partner 1.
>>>>>>> 814e09178910383c128045ce67a58c9c1df3f558
More text here.
  • Delete the lines you don’t want, along with the special Git conflict markers (<<<<<<<, =======, >>>>>>>)
  • Then: stage, commit, pull and push
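If both partners’ lines are worth keeping, the resolved file (before staging) would read:

```
# README
Some text here.
Text added by Partner 2.
Text added by Partner 1.
More text here.
```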

Branches and Forks

Branch

  • Take snapshot of existing repo and try out a whole new idea without affecting your main branch
  • If new idea works, merge back into main branch
    • fix bugs
    • implement new empirical strategies, robustness checks, …
  • If it doesn’t work, just delete the experimental branch

Branches and Forks

Fork

  • Forking a repo is similar to branching, but it creates a full copy of the repo under your own account
  • A pull request lets you merge your changes back into the original (upstream) repo
    • Easy to do on GitHub

.gitignore

  • Tells Git which files to ignore
    • exclude whole folders or classes of files (e.g. by extension or name pattern)
  • Simply add the names (or patterns) of files and folders that should be ignored
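For example, a project .gitignore might contain (entries illustrative):

```
# scratch and generated files
temp/
*.log
output/*.pdf

# large raw inputs kept out of version control
input/*.csv

# OS junk
.DS_Store
```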

Useful Git Commands

Command                            What it does
git log --all --graph --oneline    Visualize the DAG
git diff                           Show unstaged changes
git diff --staged                  Show what will be committed
git stash                          Temporarily shelve changes
git blame <file>                   Who changed each line, and when
git bisect                         Binary search for the commit that introduced a bug

Docker and Dev Containers

Docker: Why?

  • “It works on my machine” — Docker ensures it works on every machine
  • Packages your code + dependencies + OS into a portable container
  • Reproducibility: collaborators (and your future self) get the exact same environment

Images vs Containers

  • Image = a blueprint — frozen snapshot of an OS + installed software
    • Think: a recipe, or a class definition
    • Examples: rocker/tidyverse:4.4, python:3.12
  • Container = a running instance of an image
    • Think: a dish made from the recipe, or an object from the class
    • You can run many containers from the same image
  • docker pull downloads an image; docker run creates a container from it
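A terminal sketch of the image/container distinction (the image tag is the one above; container names are made up):

```
$ docker pull rocker/tidyverse:4.4                       # download the image (blueprint)
$ docker run --rm -it rocker/tidyverse:4.4 R             # run one container, dropping into R
$ docker run -d --name projectA rocker/tidyverse:4.4     # two containers...
$ docker run -d --name projectB rocker/tidyverse:4.4     # ...from the same image
```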

Dev Containers: The Modern Workflow

  • Dev Containers = Docker + your IDE, seamlessly integrated
  • VSCode supports Dev Containers natively
  • Open a project → IDE detects .devcontainer/ config → launches container automatically
  • You code inside the container as if it were your local machine
// .devcontainer/devcontainer.json (minimal example)
{
  "name": "R Data Science",
  "image": "rocker/tidyverse:4.4",
  "customizations": {
    "vscode": { "extensions": ["REditorSupport.r"] }
  }
}

Testing

Why Test?

  • Tests catch bugs before they reach your results
  • They let you refactor with confidence — change code, re-run tests, know nothing broke
  • The simplest test: an assertion
gravity_estimate <- -1.2

# does our estimate have the expected sign?
stopifnot(gravity_estimate < 0)

# does the merge keep all observations?
stopifnot(nrow(merged_dt) == nrow(original_dt))

testthat: The R Testing Framework

library(testthat)

# a function we want to test
calc_trade_share <- function(exports, gdp) {
  if (gdp <= 0) stop("GDP must be positive")
  return(exports / gdp)
}

test_that("trade share is computed correctly", {
  expect_equal(calc_trade_share(50, 200), 0.25)
  expect_equal(calc_trade_share(0, 100), 0)
})

test_that("invalid GDP raises an error", {
  expect_error(calc_trade_share(50, 0), "GDP must be positive")
  expect_error(calc_trade_share(50, -10), "GDP must be positive")
})

Setting up your machines

What you need installed

  • R (4.4+) and Quarto (1.6+)
  • Git — we’ll use it throughout the course
  • VSCode with the R extension

macOS / Linux: Use Homebrew

# Install Homebrew (if not already installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install R, Git, and Quarto (note: casks are macOS-only)
brew install --cask r quarto
brew install git
  • Homebrew: brew.sh — package manager for macOS and Linux
  • Keeps everything up to date with brew upgrade

R packages

install.packages(c("tidyverse", "data.table", "quarto"))
# For the LLM module later in the semester:
install.packages(c("ellmer", "torch"))
torch::install_torch()  # one-time download of the C++ backend