“Machine Learning is the science of getting computers to learn without being explicitly programmed.”
– Arthur Samuel, 1959
Samuel built a checkers program in the 1950s that played better the more games it had seen. No new code, just more data.
A Working Definition
Mitchell (1997)
A computer program learns from experience \(E\) with respect to a class of tasks \(T\) and a performance measure \(P\), if its performance on \(T\), measured by \(P\), improves with \(E\).
Handwriting recognition
\(T\): classify handwritten words in images
\(P\): share of correctly classified words
\(E\): labelled dataset of handwritten words
Autonomous driving
\(T\): drive a highway from camera input
\(P\): average distance driven before a human takes over
\(E\): sequence of (image, steering command) pairs from a human driver
Three Flavours of ML
(1) Supervised learning – learn \(f: X \to Y\) from labelled data
Regression (continuous \(Y\)): predict rent from features
Classification (discrete \(Y\)): spam vs ham, sepsis vs not
(2) Unsupervised learning – find structure in unlabelled data
Clustering, dimension reduction
(3) Reinforcement learning – an agent learns by trial-and-error from rewards
AlphaGo, robot locomotion
Today: mostly (1), a quick tour of (2), and (3) only by name.
Why economists care
Healthcare: predict sepsis risk from hundreds of sensor streams (Kleinberg et al., 2015).
Finance: predict loan default from mobile-phone metadata when credit history is missing (Bjorkegren & Grissen, 2017).
Urban policy: map poverty from satellite imagery with millions of pixel features (Naik et al., 2017).
Labour: identify minimum-wage workers from rich demographics (Cengiz et al., 2024).
Common pattern
Hundreds to millions of predictors, often \(p \gg n\). Classical tools wobble.
Supervised Learning
Regression vs Classification
Regression – continuous outcome
House prices from size, location, age
Wages from education, experience
GDP growth from macro indicators
Classification – discrete label
Spam vs ham email
Sepsis vs healthy patient
Iris setosa vs versicolor vs virginica
Same workflow
Split data into training and test sets.
Fit the model on training data.
Score performance on test data.
The test split protects against overfitting – a model that memorises the training set but generalises poorly.
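The three steps can be sketched in base R. This is a minimal illustration on simulated data, not part of the original analysis: the data-generating process, the 70/30 split, and the names `dat`, `fit_simple`, `fit_flexible`, and `mse` are all assumptions made for the example.

```r
set.seed(42)

# Simulated data (assumption): a smooth signal plus noise
n   <- 200
x   <- runif(n, -2, 2)
dat <- data.frame(x = x, y = sin(x) + rnorm(n, sd = 0.5))

# 1. Split data into training (70%) and test (30%) sets
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# 2. Fit a simple and a very flexible model on the training data only
fit_simple   <- lm(y ~ x, data = train)
fit_flexible <- lm(y ~ poly(x, 15), data = train)

# 3. Score both models on the training set and on the held-out test set
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

results <- data.frame(
  model     = c("linear", "degree-15 polynomial"),
  train_mse = c(mse(fit_simple, train), mse(fit_flexible, train)),
  test_mse  = c(mse(fit_simple, test),  mse(fit_flexible, test))
)
print(results)
```

The flexible model always fits the training data at least as well as the linear one, because the linear fit is nested in it. Only the test column says whether that extra flexibility generalises.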
Types of Classification
Binary
Spam / not spam
Multi-class
Iris species
Multi-label
Movie genres (one film can be both thriller and romance)
Barro-Lee data, 90 countries, 1960–85. The bivariate slope is essentially zero.
Estimated \(\hat\beta = +0.001\), \(p = 0.83\). No unconditional convergence. Did Solow fail?
Conditional Convergence (Barro, 1991)
Maybe countries converge – but each to its own steady state, which depends on:
Human capital: schooling, life expectancy.
Demographics: fertility, population growth.
Fiscal policy: government consumption, taxes.
Openness: trade share, FDI.
Institutions / stability: rule of law, inflation, political instability.
Add the right controls and the conditional coefficient turns negative and significant – the famous “roughly 2% per year” iron law.
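In regression form, a sketch in assumed notation: \(Y_i\) is country \(i\)'s average growth rate, \(D_i = \ln y_{0,i}\) its log initial income, and \(X_i\) the vector of controls. The symbols \(\alpha\), \(\gamma\), and \(\varepsilon_i\) are introduced here for illustration only.

\[ Y_i = \alpha + \beta D_i + X_i'\gamma + \varepsilon_i \]

The bivariate regression drops \(X_i\); conditional convergence is the hypothesis \(\beta < 0\) once \(X_i\) is included.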
Problem: the Barro-Lee dataset has ~60 candidate controls and only 90 countries. Which controls?
Why This Is a Machine-Learning Problem
With 60 candidate controls and \(n = 90\), classical OLS is nearly saturated and standard errors explode.
\(t\)-tests on individual controls are unreliable: multiple testing, multicollinearity.
“Pick the right controls and you get convergence” is uncomfortably close to specification searching.
PDS-LASSO (Belloni-Chernozhukov-Hansen, 2014)
LASSO of growth \(Y\) on the controls \(\rightarrow\) keep what predicts \(Y\).
LASSO of log initial income \(D = \ln y_0\) on the controls \(\rightarrow\) keep what predicts \(D\).
OLS of \(Y\) on \(D\) and the union of selected controls.
This gives valid inference on the convergence rate \(\beta\), even when \(p\) is comparable to \(n\).
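The three steps can be sketched with `glmnet` on simulated data. Everything here is an assumption made for illustration: the data-generating process, the helper `selected()`, and the cross-validated penalty, which is swapped in for simplicity (Belloni, Chernozhukov and Hansen use a theory-driven plug-in penalty instead). The final `lm()` also reports classical rather than heteroskedasticity-robust standard errors.

```r
library(glmnet)

set.seed(123)

# Simulated stand-in for the growth data (assumption): 90 "countries", 60 controls
n <- 90
p <- 60
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", seq_len(p))
D <- 0.8 * X[, 1] + 0.5 * X[, 2] + rnorm(n)                       # log initial income
Y <- -0.05 * D + 0.6 * X[, 1] + 0.4 * X[, 3] + rnorm(n, sd = 0.5)  # growth

# Hypothetical helper: names of controls with non-zero LASSO coefficients
selected <- function(x, y) {
  fit <- cv.glmnet(x, y, alpha = 1)                    # alpha = 1 is the LASSO
  b   <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]  # drop the intercept
  names(b)[b != 0]
}

keep_y <- selected(X, Y)   # step 1: controls that predict Y
keep_d <- selected(X, D)   # step 2: controls that predict D
keep   <- union(keep_y, keep_d)

# Step 3: OLS of Y on D and the union of selected controls
df   <- data.frame(Y = Y, D = D, X[, keep, drop = FALSE])
post <- lm(Y ~ ., data = df)
print(summary(post)$coefficients["D", ])
```

Taking the union is the point of the double selection: a control that strongly predicts \(D\) but only weakly predicts \(Y\) would be dropped by a single LASSO of \(Y\), and omitting it would bias \(\hat\beta\).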
PDS-LASSO Recovers Convergence
Barro-Lee 1960–85. Bivariate OLS sees nothing; full OLS is noisy; PDS-LASSO recovers \(\hat\beta \approx -0.05\), \(p \approx 0.002\).
Interpretation: controlling for human capital, demographics, fiscal policy and openness, poor countries did catch up at ~5% per year.
Does It Still Hold Today? PWT 1995–2019
We rebuild the analysis with Penn World Table 10.01 + World Bank WDI:
172 countries, 25 years (1995–2019).
11 base controls plus their squares and pairwise interactions \(\Rightarrow\) \(p = 11 + 11 + 55 = 77\).
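The count \(p = 77\) can be checked with a short base-R sketch. The placeholder data and the column names `x1`–`x11` are assumptions; the actual PWT/WDI variables are not listed on this slide.

```r
set.seed(1)

# Placeholder data (assumption): 172 countries, 11 base controls
n    <- 172
base <- as.data.frame(matrix(rnorm(n * 11), nrow = n))
names(base) <- paste0("x", 1:11)

# Main effects plus all pairwise interactions; drop the intercept column
main_int <- model.matrix(~ .^2, data = base)[, -1]

# Squared term for each base control
squares <- as.matrix(base)^2
colnames(squares) <- paste0(names(base), "_sq")

X <- cbind(main_int, squares)
ncol(X)  # 11 main effects + 55 interactions + 11 squares = 77
```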
This time even the bivariate slope is negative and significant: \(\hat\beta = -0.006\), \(p < 0.001\). Convergence is now visible in the raw data, consistent with the rise of China, India, and other emerging markets.
Does It Still Hold Today? Results
Bivariate OLS, full OLS, and PDS-LASSO across both periods.
Pedagogical pay-off
The same ML pipeline – bivariate, full OLS, Ridge/LASSO, PDS-LASSO – runs unchanged on a 60-year-old dataset and on data ending six years ago. Conditional convergence is robust.
ML in Empirical Econ: Other Examples
Cengiz, Dube, Lindner & Zentler-Munro (2024, JLE) – minimum wages: Elastic Net predicts \(\Pr(\text{wage}_i \le \text{MW})\) from CPS demographics; event-study DiD around 159 state hikes shows +2–3% wages, no employment loss, ~75% coverage.
Angrist & Frandsen (2022, JLE) – “Machine Labor”: PDS-LASSO with ~384 application-process controls confirms there’s no elite-college wage premium.
Gilchrist & Sands (2016, JPE): LASSO selects weather-based instruments for movie opening-weekend attendance; +1% opening lifts five-week attendance by ~2%.
Common recipe: predictive ML for the nuisance part, classical inference for the causal parameter. Full slides on each in the appendix.
Unsupervised Learning
Supervised vs Unsupervised
Supervised: we have labels \(y_i\), we learn \(f: X \to Y\).
Unsupervised: no labels – find structure (groups, low-D representations).
Two Big Families
Dimension reduction – compress many features into a few
PCA, t-SNE, UMAP, autoencoders
Use: visualisation, multicollinearity, feature construction
Causal forests (Wager & Athey, 2018) estimate heterogeneous treatment effects by growing an ensemble of trees that split to maximise treatment-effect heterogeneity – with sample-splitting that yields valid confidence intervals.
library(grf)
cf <- causal_forest(X = as.matrix(covariates), Y = outcome, W = treatment)
tau_hat <- predict(cf)$predictions
More in the appendix; for today, just know the tool exists.
Wrap-Up
Key Take-Aways
ML = improving at a task \(T\) from experience \(E\), measured by \(P\). Supervised / unsupervised / reinforcement.
Bias–variance trade-off drives every modelling choice; use cross-validation to land in the sweet spot.
Shrinkage: Ridge keeps everything, LASSO selects, Elastic Net does both. The default workhorse when \(p \gg n\).
PCA and \(k\)-means are the unsupervised counterparts – compress and cluster.
For applied work: tidymodels in R is the unified pipeline.
Causal forests estimate heterogeneous effects when treatment is exogenous.
ML helps identification; it does not create it.
References
Belloni, A., Chernozhukov, V., & Hansen, C. (2011). Inference for high-dimensional sparse econometric models. Advances in Economics and Econometrics.
Cengiz, D., Dube, A., Lindner, A., & Zentler-Munro, R. (2024). Seeing Beyond the Trees: Using Machine Learning to Estimate the Impact of Minimum Wages on Labor Market Outcomes. Journal of Labor Economics.
Chernozhukov, V., et al. (2019). Inference in factorial experiments with high-dimensional covariates. arXiv:1903.10075.
Gilchrist, D. S. & Sands, E. G. (2016). Something to Talk About: Social Spillovers in Movie Consumption. JPE, 124(5): 1268–1304.
Angrist, J. D. & Frandsen, B. (2022). Machine Labor. Journal of Labor Economics, 40(S1): S97–S140.
Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical Learning with Sparsity: The LASSO and Generalizations.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.). www.statlearning.com
Wager, S. & Athey, S. (2018). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. JASA, 113(523): 1228–1242.
Chernozhukov et al., Applied Causal Inference Powered by ML and AI: causalml-book.org
Causal step: event-study DiD around 159 state minimum-wage hikes.
Detail: Angrist & Frandsen (2022)
Outcome: \(Y_i = \log(\text{weekly wage}_i)\) for male college graduates.
Treatment: college attributes (private, elite, …).
Controls: ~384 application-process variables (schools applied to, SATs, interactions).
Method: post-double-selection LASSO (Belloni et al., 2014).
Findings: PDS estimates match full-model OLS; private-college premium \(\approx 0.02\)–\(0.04\) (PDS) vs \(0.017\) (OLS). Tuning is robust: 18, 100, 112 selected variables all give similar \(\hat\alpha\). ML helps check identification, not generate it.
Detail: Gilchrist & Sands (2016)
Outcome: total movie attendance over 5 post-release weekends.
Treatment: opening-weekend attendance (endogenous – driven by quality).
Instruments: local weather + interactions (rain \(\times\) weekend \(\times\) region) – many candidates.
Method: LASSO selects relevant instruments; 2SLS uses the selected set.
Findings: +1% opening attendance \(\Rightarrow\) +2% cumulative 5-week attendance. Local to the city; independent of reviews \(\Rightarrow\) pure social experience motive.
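The LASSO-then-2SLS recipe can be sketched on simulated data. This is an illustration under stated assumptions, not the authors' code: the data-generating process, the true effect of 1, the instrument names `z1`–`z40`, and the `lambda.1se` penalty are all choices made for the example. The second stage is run by hand, so its point estimate is the 2SLS estimate but the standard errors reported by `lm()` would not be valid.

```r
library(glmnet)

set.seed(2016)

# Simulated data (assumption): 40 candidate instruments, only two are relevant
n <- 500
p <- 40
Z <- matrix(rnorm(n * p), n, p)
colnames(Z) <- paste0("z", seq_len(p))
u <- rnorm(n)                                     # unobserved quality (confounder)
d <- 0.7 * Z[, 1] - 0.5 * Z[, 2] + u + rnorm(n)   # endogenous treatment
y <- 1 * d + 2 * u + rnorm(n)                     # outcome; true effect is 1

# Step 1: LASSO first stage selects the relevant instruments
cv_fit <- cv.glmnet(Z, d, alpha = 1)
b      <- as.matrix(coef(cv_fit, s = "lambda.1se"))[-1, 1]  # drop the intercept
z_sel  <- names(b)[b != 0]

# Step 2: 2SLS using only the selected instruments
first <- lm(d ~ Z[, z_sel, drop = FALSE])
d_hat <- fitted(first)
iv    <- lm(y ~ d_hat)

# Naive OLS for comparison; it absorbs the confounder u
ols <- lm(y ~ d)
print(c(ols = coef(ols)[["d"]], iv = coef(iv)[["d_hat"]]))
```

The division of labour matches the common recipe: LASSO handles the prediction problem of choosing instruments, and 2SLS delivers the causal parameter.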