09. Machine Learning

Data Science for Economists

Irene Iodice

2026-06-10

Road-map

  1. What is machine learning?
  2. Supervised learning: regression vs classification
  3. A first classifier: \(k\)-NN and the bias–variance trade-off
  4. Decision trees
  5. Evaluating model performance
  6. High-dimensional data and the limits of OLS
  7. Shrinkage: Ridge, LASSO, Elastic Net
  8. An economics application: growth and convergence
  9. Unsupervised learning: PCA and \(k\)-means
  10. The tidymodels pipeline
  11. A teaser: causal forests

Appendix: bias–variance math, OLS breakdown, LASSO geometry, post-double-selection LASSO, Double ML, more applications.

What is Machine Learning?

A Short Definition

“Machine Learning is the science of getting computers to learn without being explicitly programmed.”

– Arthur Samuel, 1959

Samuel built a checkers program in the 1950s that played better the more games it had seen. No new code, just more data.

A Working Definition

Mitchell (1997)

A computer program learns from experience \(E\) with respect to a class of tasks \(T\) and a performance measure \(P\), if its performance on \(T\), measured by \(P\), improves with \(E\).

Handwriting recognition

  • \(T\): classify handwritten words in images
  • \(P\): share of correctly classified words
  • \(E\): labelled dataset of handwritten words

Autonomous driving

  • \(T\): drive a highway from camera input
  • \(P\): average distance before human takes over
  • \(E\): sequence of (image, steering command) pairs from a human driver

Three Flavours of ML

  1. Supervised learning – learn \(f: X \to Y\) from labelled data
    • Regression (continuous \(Y\)): predict rent from features
    • Classification (discrete \(Y\)): spam vs ham, sepsis vs not
  2. Unsupervised learning – find structure in unlabelled data
    • Clustering, dimension reduction
  3. Reinforcement learning – an agent learns by trial-and-error from rewards
    • AlphaGo, robot locomotion

Today: mostly (1), a quick tour of (2), and (3) only by name.

Why economists care

  • Healthcare: predict sepsis risk from hundreds of sensor streams (Kleinberg et al., 2015).
  • Finance: predict loan default from mobile-phone metadata when credit history is missing (Bjorkegren & Grissen, 2017).
  • Urban policy: map poverty from satellite imagery with millions of pixel features (Naik et al., 2017).
  • Labour: identify minimum-wage workers from rich demographics (Cengiz et al., 2024).

Common pattern

Hundreds to millions of predictors, often \(p \gg n\). Classical tools such as OLS become unstable or are not identified at all.

Supervised Learning

Regression vs Classification

Regression – continuous outcome

  • House prices from size, location, age
  • Wages from education, experience
  • GDP growth from macro indicators

Classification – discrete label

  • Spam vs ham email
  • Sepsis vs healthy patient
  • Iris setosa vs versicolor vs virginica

Same workflow

  1. Split data into training and test sets.
  2. Fit the model on training data.
  3. Score performance on test data.

Scoring on held-out test data exposes overfitting: a model that memorises the training set but generalises poorly.
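
The three steps can be sketched in a few lines. This is only an illustration: the built-in `mtcars` data, the linear model, and the 70/30 split are stand-ins chosen here, not part of the lecture material.

```r
# Three-step workflow on the built-in mtcars data (an illustrative stand-in).
set.seed(1)
n         <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))     # 1. split
train     <- mtcars[train_idx, ]
test      <- mtcars[-train_idx, ]

fit <- lm(mpg ~ wt + hp, data = train)                      # 2. fit on training data

rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))
train_rmse <- rmse(train$mpg, predict(fit, newdata = train))
test_rmse  <- rmse(test$mpg,  predict(fit, newdata = test)) # 3. score on test data

c(train = train_rmse, test = test_rmse)
```

Only `test_rmse` is an honest estimate of out-of-sample error; the model never saw those rows.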

Types of Classification

Binary

Spam / not spam

Multi-class

Iris species

Multi-label

Movie genres (one film can be both thriller and romance)

Running Example: Iris

Iris setosa

Iris versicolor

Iris virginica

  • 150 flowers, 50 of each species
  • 4 features: sepal length/width, petal length/width
  • Goal: predict the species

library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) +
  geom_point() + theme_minimal()

A First Classifier: \(k\)-NN

\(k\)-Nearest Neighbours

Idea: to classify a new point \(x_0\), look at the \(k\) closest training points and take a majority vote.

Euclidean distance

\[ d(x, x') = \sqrt{(x_1 - x'_1)^2 + \cdots + (x_p - x'_p)^2} \]

  • No parameters, no training – just store the data.
  • Choice of \(k\) controls how “smooth” the decision boundary is.
  • Sensitive to feature scale – standardise first.
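
To make the idea concrete, here is a hand-written sketch of the distance-plus-vote rule. The helper `knn_predict()` is a hypothetical name introduced for illustration; in practice `class::knn()` does the same job.

```r
# k-NN by hand: classify one flower from the iris data.
knn_predict <- function(x0, X, labels, k = 5) {
  # Euclidean distance from x0 to every training row
  d <- sqrt(rowSums(sweep(X, 2, x0)^2))
  nearest <- order(d)[seq_len(k)]           # indices of the k closest points
  votes   <- table(labels[nearest])         # count labels among the neighbours
  names(votes)[which.max(votes)]            # majority class (first one on ties)
}

X  <- scale(iris[, 1:4])                    # standardise so no feature dominates
x0 <- X[1, ]                                # treat row 1 as the "new" observation
knn_predict(x0, X[-1, ], iris$Species[-1], k = 5)
```

There is no fitting step: all the work happens at prediction time, which is why k-NN gets slow on large training sets.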

The Bias–Variance Trade-Off, Visualised

\(k = 1\)

Very wiggly boundary. Fits training data perfectly. Low bias, high variance.

\(k = 20\)

Smooth boundary. Misses local structure. High bias, low variance.

Important

The “right” \(k\) balances the two. We’ll pick it by cross-validation.

Bias–Variance Trade-Off

  • More flexibility (smaller \(k\), more predictors, deeper trees) \(\Rightarrow\) lower bias, higher variance.
  • Test error is U-shaped: it falls as bias drops, then rises when variance dominates.
  • The sweet spot is somewhere in the middle and depends on the data.
  • Formal decomposition into bias\(^2\) + variance + irreducible noise is in the appendix.

\(k\)-NN in R

library(class)

set.seed(1234)
train_idx  <- sample(seq_len(nrow(iris)), size = 0.7 * nrow(iris))
train_data <- iris[train_idx, ]
test_data  <- iris[-train_idx, ]

pred <- knn(train = train_data[, 1:4],
            test  = test_data[, 1:4],
            cl    = train_data$Species,
            k     = 5)

table(test_data$Species, pred)

Decision Trees

Decision Trees

  • Recursively split the feature space along one variable at a time.
  • Each split picks the variable + threshold that best separates the classes.
  • Final prediction = majority class in the leaf.
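
What "best separates the classes" means depends on the splitting criterion. The sketch below assumes Gini impurity (the default in `rpart` for classification; entropy is a common alternative) and searches thresholds on a single variable. The helper names `gini()` and `split_impurity()` are introduced here for illustration.

```r
# Gini impurity of a vector of class labels: 1 - sum of squared class shares.
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Size-weighted impurity of the two children after splitting on x < threshold.
split_impurity <- function(x, y, threshold) {
  left <- x < threshold
  (sum(left) * gini(y[left]) + sum(!left) * gini(y[!left])) / length(y)
}

# Candidate thresholds: midpoints between consecutive distinct values.
v          <- sort(unique(iris$Petal.Length))
thresholds <- (head(v, -1) + tail(v, -1)) / 2
impurity   <- sapply(thresholds, split_impurity,
                     x = iris$Petal.Length, y = iris$Species)

gini(iris$Species)                    # impurity before any split
thresholds[which.min(impurity)]       # best threshold on Petal.Length
min(impurity)                         # impurity after that split
```

A full tree repeats this search over every variable, takes the best split, and then recurses into each child node.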

Pros: very interpretable, handles numeric and categorical features, no scaling needed.

Cons: unstable, prone to overfitting.

Many of the most successful ML methods (random forests, gradient boosting, causal forests) are ensembles of trees.

A Tree on Iris

library(rpart)
library(rpart.plot)

fit <- rpart(Species ~ ., data = iris, method = "class")
rpart.plot(fit)

Evaluating Model Performance

Confusion Matrix

For a binary classifier with classes positive / negative:

                  Predicted Positive    Predicted Negative
Actual Positive   True Positive (TP)    False Negative (FN)
Actual Negative   False Positive (FP)   True Negative (TN)

Derived metrics

  • Accuracy \(= \frac{TP + TN}{\text{all}}\)
  • Precision \(= \frac{TP}{TP + FP}\) – of those we flagged, how many were right?
  • Recall \(= \frac{TP}{TP + FN}\) – of all true positives, how many did we catch?
  • F1 = harmonic mean of precision and recall.
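
A quick worked example of the four metrics. The counts are made up for illustration and do not come from any dataset in these slides.

```r
# Hypothetical confusion-matrix counts for a binary classifier.
TP <- 40; FN <- 10
FP <- 20; TN <- 130

accuracy  <- (TP + TN) / (TP + TN + FP + FN)                 # 170 / 200
precision <- TP / (TP + FP)                                  # 40 / 60
recall    <- TP / (TP + FN)                                  # 40 / 50
f1        <- 2 * precision * recall / (precision + recall)   # harmonic mean

round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 3)
```

Accuracy looks comfortable at 0.85, yet one in three flagged cases is a false alarm. With unbalanced classes, precision and recall tell you more than accuracy.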

Train Error Lies

  • Training error is non-increasing in model complexity – it always looks better with more parameters.
  • We need test-set performance, but data is scarce.

Solution: resampling

Cross-validation simulates the test scenario from training data.

\(K\)-Fold Cross-Validation

Validation-set split

  • One random train/test split.
  • Fast, but high-variance estimate of test error.

\(K\)-fold CV

  • Split data into \(K\) chunks.
  • Each chunk takes a turn as the test fold.
  • Average the \(K\) test errors.
  • Typical: \(K = 5\) or \(10\).

\[ \mathrm{CV}_K = \frac{1}{K} \sum_{k=1}^{K} \mathrm{MSE}_k \]
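
A minimal sketch of \(K\)-fold CV used to choose the number of neighbours in \(k\)-NN on iris. Since this is classification, the fold-level error is the misclassification rate rather than the MSE. The grid of candidate values and the helper `cv_error()` are assumptions made for this example; for simplicity the features are standardised once on the full data, whereas a stricter version would standardise inside each fold.

```r
library(class)   # knn(); ships with standard R installations

set.seed(1)
X <- scale(iris[, 1:4])
y <- iris$Species
K <- 5
fold <- sample(rep(seq_len(K), length.out = nrow(X)))   # random fold label per row

cv_error <- function(k) {
  errs <- sapply(seq_len(K), function(j) {
    held_out <- fold == j
    pred <- knn(train = X[!held_out, ], test = X[held_out, ],
                cl = y[!held_out], k = k)
    mean(pred != y[held_out])            # misclassification rate on fold j
  })
  mean(errs)                             # average over the K folds
}

k_grid <- c(1, 3, 5, 7, 9, 15, 25)
cv_err <- sapply(k_grid, cv_error)
data.frame(k = k_grid, cv_error = cv_err)
k_grid[which.min(cv_err)]                # candidate with the lowest CV error
```

Every observation is used for testing exactly once and for training \(K - 1\) times, which is what makes the estimate less noisy than a single split.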

Wage–Age: Why CV Matters

A single validation split gives a noisy answer.

\(K\)-Fold Smooths Things Out

10-fold CV across 9 repeats: the polynomial-degree choice is now clear.

High-Dimensional Data

What Do We Mean by \(n\) and \(p\)?

\(n\) – observations, \(p\) – predictors

Rows vs columns in your data matrix.

Low-dimensional

  • \(n = 2{,}000\) patients
  • \(p = 3\) covariates (age, sex, BMI)

High-dimensional

  • \(n = 200\) individuals
  • \(p = 500{,}000\) SNPs (genetics)

Important

Same statistical questions – prediction, inference – become much harder as \(p\) grows.

OLS Breaks Down When \(p\) Is Large

  1. Rank deficiency (\(p \ge n\)): infinitely many least-squares solutions.
  2. Multicollinearity: highly correlated predictors \(\Rightarrow\) huge coefficient variance.
  3. Noise variables: irrelevant \(X_j\) inflate variance without adding signal.
  4. Rarely sparse: OLS almost never returns exact zeros, so it’s hard to interpret.
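
Point 1 is easy to see on simulated data. In the sketch below (dimensions and data-generating process chosen only for illustration), `lm()` cannot identify all coefficients once \(p \ge n\), and it fits the training data perfectly.

```r
set.seed(1)
n <- 20; p <- 30                        # more predictors than observations
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + rnorm(n)                  # only the first predictor matters

qr(cbind(1, X))$rank                    # rank is at most n = 20, not p + 1 = 31

fit <- lm(y ~ X)
sum(is.na(coef(fit)))                   # coefficients lm() could not identify
max(abs(residuals(fit)))                # training residuals are numerically zero
```

A perfect in-sample fit on mostly noise predictors is the clearest possible warning that training error says nothing about test error here.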

MC simulation, \(n = 50\), correlated predictors, \(\rho = 0.9\): OLS test MSE grows with \(p\).

Shrinkage: Ridge, LASSO, Elastic Net

The General Idea

\[ \min_{\boldsymbol\beta}\;\; \underbrace{\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \alpha - x_i'\boldsymbol\beta\bigr)^2}_{\text{loss}} \;+\; \underbrace{\lambda \sum_{j=1}^{p} k(|\beta_j|)}_{\text{penalty}} \]

  • The penalty shrinks coefficients toward zero.
  • \(\lambda \ge 0\) controls the strength of shrinkage; choose by cross-validation.
  • Different choices of \(k(\cdot)\) give Ridge, LASSO, or Elastic Net.

Key intuition

A little bias, in exchange for a big drop in variance – often a great deal in high dimensions.

Ridge Regression (\(\ell_2\))

\[ \hat{\boldsymbol\beta}^{\text{ridge}} = \arg\min_{\boldsymbol\beta}\; \mathrm{RSS}(\boldsymbol\beta) + \lambda \sum_{j=1}^{p} \beta_j^{2} \]

  • Quadratic penalty \(\Rightarrow\) all \(\beta_j\) get shrunk, but none are exactly zero.
  • Great when predictors are many and strongly correlated – shrinks them toward each other.
  • Standardise predictors before fitting.
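
With standardised predictors and a centred outcome, the ridge problem has the closed form \(\hat{\boldsymbol\beta}^{\text{ridge}} = (X'X + \lambda I)^{-1} X'y\). A small sketch, using `mtcars` as a stand-in dataset and a hypothetical helper `ridge()`:

```r
# Ridge by its closed form on standardised predictors (mtcars as a stand-in).
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp", "cyl")]))
y <- mtcars$mpg - mean(mtcars$mpg)      # centre y so no intercept is needed

ridge <- function(X, y, lambda) {
  # Solves (X'X + lambda I) beta = X'y
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

lambdas <- c(0, 1, 10, 100)
betas   <- sapply(lambdas, ridge, X = X, y = y)
dimnames(betas) <- list(colnames(X), paste0("lambda=", lambdas))
round(betas, 3)

colSums(betas^2)                        # squared norm falls as lambda grows
```

The column for \(\lambda = 0\) is plain OLS. As \(\lambda\) grows the coefficient vector shrinks, but no entry hits exactly zero. Note that `glmnet` scales its objective differently, so its \(\lambda\) values are not directly comparable to these.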

LASSO Regression (\(\ell_1\))

\[ \hat{\boldsymbol\beta}^{\text{lasso}} = \arg\min_{\boldsymbol\beta}\; \mathrm{RSS}(\boldsymbol\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \]

  • \(\ell_1\) geometry has corners – many coefficients shrink to exactly 0.
  • Built-in variable selection: sparse, interpretable models.
  • Workhorse when \(p \gg n\).

Important

LASSO approximates best-subset selection without the \(2^p\) cost.
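
Why exact zeros? In the special case of an orthonormal design (\(X'X = I\), an assumption made here only to get closed forms), both estimators act on each OLS coefficient \(b_j\) separately: ridge gives \(b_j / (1 + \lambda)\), LASSO gives \(\operatorname{sign}(b_j)\,(|b_j| - \lambda/2)_+\). The OLS coefficients below are made-up numbers.

```r
# Soft-thresholding: shift towards zero by t, then truncate at zero.
soft_threshold <- function(b, t) sign(b) * pmax(abs(b) - t, 0)

b_ols  <- c(3, -1.5, 0.4, -0.2)         # hypothetical OLS coefficients
lambda <- 1

b_ridge <- b_ols / (1 + lambda)                 # proportional shrinkage, never zero
b_lasso <- soft_threshold(b_ols, lambda / 2)    # small coefficients become exactly 0

rbind(ols = b_ols, ridge = b_ridge, lasso = b_lasso)
```

Ridge halves every coefficient; LASSO subtracts a fixed amount and sets the two small ones to zero. That truncation is the variable selection.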

Elastic Net (\(\ell_1 + \ell_2\))

\[ \hat{\boldsymbol\beta}^{\text{EN}} = \arg\min_{\boldsymbol\beta}\; \mathrm{RSS}(\boldsymbol\beta) + \lambda \Bigl[ \alpha \sum_{j} |\beta_j| + (1 - \alpha)\sum_{j} \beta_j^{2} \Bigr] \]

  • \(\alpha = 1\): LASSO. \(\alpha = 0\): Ridge. In between: a mix.
  • Keeps sparsity and a grouping effect – correlated predictors enter together.
  • Two hyperparameters \((\lambda, \alpha)\), tuned with a 2-D CV grid.

Cheat-Sheet

Method        Penalty                Sparsity?   Best for
Ridge         \(\sum \beta_j^2\)     No          Multicollinearity, \(p < n\)
LASSO         \(\sum |\beta_j|\)     Yes         Interpretation, \(p \gg n\)
Elastic Net   \(\ell_1 + \ell_2\)    Yes         Correlated groups, \(p \gg n\)

LASSO with CV in R

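library(glmnet)
library(ISLR2)   # assumed source of the Wage data

# Drop logwage: it is a transformation of the outcome and would leak it.
wage_df <- subset(Wage, select = -logwage)
X <- model.matrix(wage ~ ., data = wage_df)[, -1]   # dummy-code factors, drop intercept
y <- wage_df$wage

set.seed(1234)   # CV folds are random
fit <- cv.glmnet(X, y, alpha = 1, standardize = TRUE)
# alpha = 1 : LASSO
# alpha = 0 : Ridge
# 0 < alpha < 1 : Elastic Net

plot(fit)                      # CV error against log(lambda)
coef(fit, s = "lambda.min")    # coefficients at the CV-minimising lambda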

Afternoon Application: Growth and Convergence

Solow and the Convergence Hypothesis

  • The neoclassical (Solow) growth model has diminishing returns to capital.
  • A country far below its steady state has high marginal product of capital \(\Rightarrow\) invests, accumulates, grows fast.
  • A country near its steady state earns lower returns \(\Rightarrow\) grows slowly.

\(\beta\)-convergence

Poor countries should grow faster than rich countries. In a cross-section:

\[ \underbrace{\frac{1}{T}\bigl(\ln y_{i,T} - \ln y_{i,0}\bigr)}_{\text{growth rate}} = \alpha + \beta \ln y_{i,0} + \varepsilon_i, \qquad \text{Solow predicts } \beta < 0. \]

The Empirical Puzzle

Barro-Lee data, 90 countries, 1960–85. The bivariate slope is essentially zero.

Estimated \(\hat\beta = +0.001\), \(p = 0.83\). No unconditional convergence. Did Solow fail?

Conditional Convergence (Barro, 1991)

Maybe countries converge – but each to its own steady state, which depends on:

  • Human capital: schooling, life expectancy.
  • Demographics: fertility, population growth.
  • Fiscal policy: government consumption, taxes.
  • Openness: trade share, FDI.
  • Institutions / stability: rule of law, inflation, political instability.

Add the right controls and the conditional coefficient turns negative and significant – the famous “roughly 2% per year” iron law.

Problem: the Barro-Lee dataset has ~60 candidate controls and only 90 countries. Which controls?

Why This Is a Machine-Learning Problem

  • With 60 candidate controls and \(n = 90\), classical OLS is nearly saturated and standard errors explode.
  • \(t\)-tests on individual controls are unreliable: multiple testing, multicollinearity.
  • “Pick the right controls and you get convergence” is uncomfortably close to specification searching.

PDS-LASSO (Belloni-Chernozhukov-Hansen, 2014)

  1. LASSO of growth on controls \(\rightarrow\) keep what predicts \(Y\).
  2. LASSO of \(\ln y_0\) on controls \(\rightarrow\) keep what predicts \(D\).
  3. OLS of \(Y\) on \(D\) and the union of selected controls.

This gives valid inference on the convergence rate \(\beta\), even when \(p\) is comparable to \(n\).

PDS-LASSO Recovers Convergence

Barro-Lee 1960–85. Bivariate OLS sees nothing; full OLS is noisy; PDS-LASSO recovers \(\hat\beta \approx -0.05\), \(p \approx 0.002\).

Interpretation: controlling for human capital, demographics, fiscal policy and openness, poor countries did catch up at ~5% per year.

Does It Still Hold Today? PWT 1995–2019

We rebuild the analysis with Penn World Table 10.01 + World Bank WDI:

  • 172 countries, 25 years (1995–2019).
  • 11 base controls \(\times\) squares and interactions \(\Rightarrow\) \(p = 77\).

This time even the bivariate slope is negative and significant: \(\hat\beta = -0.006\), \(p < 0.001\). Convergence is now visible in the raw data – rise of China, India, and other emerging markets.

Does It Still Hold Today? Results

Bivariate OLS, full OLS, and PDS-LASSO across both periods.

Pedagogical pay-off

The same ML pipeline – bivariate, full OLS, Ridge/LASSO, PDS-LASSO – runs unchanged on a 60-year-old dataset and on data ending six years ago. Conditional convergence is robust.

ML in Empirical Econ: Other Examples

  • Cengiz, Dube, Lindner & Zentler-Munro (2024, JLE) – minimum wages: Elastic Net predicts \(\Pr(\text{wage}_i \le \text{MW})\) from CPS demographics; event-study DiD around 159 state hikes shows +2–3% wages, no employment loss, ~75% coverage.
  • Angrist & Frandsen (2022, JLE) – “Machine Labor”: PDS-LASSO with ~384 application-process controls confirms there’s no elite-college wage premium.
  • Gilchrist & Sands (2016, JPE): LASSO selects weather-based instruments for movie opening-weekend attendance; +1% opening lifts five-week attendance by ~2%.

Common recipe: predictive ML for the nuisance part, classical inference for the causal parameter. Full slides on each in the appendix.

Unsupervised Learning

Supervised vs Unsupervised

  • Supervised: we have labels \(y_i\), we learn \(f: X \to Y\).
  • Unsupervised: no labels – find structure (groups, low-D representations).

Two Big Families

Dimension reduction – compress many features into a few

  • PCA, t-SNE, UMAP, autoencoders
  • Use: visualisation, multicollinearity, feature construction

Clustering – group similar observations

  • \(k\)-means, hierarchical, DBSCAN, Gaussian mixtures
  • Use: customer segmentation, disease subtypes, anomaly detection

Principal Component Analysis

  • Find directions (lines / planes / hyper-planes) that capture maximum variance.
  • Components are orthogonal and ordered by how much variance they explain.
  • Often a handful of components capture >90% of the variation.

PCA: The Geometric Idea

  • Each principal component is a linear combination of the original features.
  • The first PC minimises the sum of squared perpendicular distances from the points to the line.
  • Equivalently: it’s the leading eigenvector of the covariance matrix.
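
The eigenvector statement can be checked directly. A short sketch on the standardised iris features, comparing `eigen()` on the covariance matrix with `prcomp()`:

```r
Z   <- scale(iris[, 1:4])               # standardised features
eig <- eigen(cov(Z))                    # eigen-decomposition of the covariance matrix
pca <- prcomp(Z)

# Eigenvalues are the variances of the principal components
all.equal(eig$values, pca$sdev^2)

# Leading eigenvector matches the first loading vector up to sign
v1 <- eig$vectors[, 1]
w1 <- unname(pca$rotation[, 1])
min(max(abs(v1 - w1)), max(abs(v1 + w1)))   # numerically zero

# Share of variance explained by each component
round(eig$values / sum(eig$values), 3)
```

The sign of a principal component is arbitrary, which is why the comparison allows for a flip.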

Important

Always standardise first. Otherwise PCA is dominated by whichever variable has the largest numerical variance, which depends on its units (kg vs g, EUR vs cents).

PCA in R

iris_scaled <- scale(iris[, 1:4])

pca <- prcomp(iris_scaled)
summary(pca)

pca_data <- as.data.frame(pca$x)
pca_data$Species <- iris$Species

ggplot(pca_data, aes(PC1, PC2, color = Species)) +
  geom_point() +
  theme_minimal() +
  labs(title = "PCA of iris")

\(k\)-Means Clustering

Algorithm

  1. Pick \(K\) (the number of clusters).
  2. Initialise \(K\) cluster centres at random.
  3. Assign each point to the nearest centre.
  4. Recompute each centre as the mean of its assigned points.
  5. Repeat 3–4 until assignments stop changing.
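
The five steps translate almost line by line into code. The function `simple_kmeans()` below is a teaching sketch written for this purpose, not a replacement for `stats::kmeans()`, which uses a more refined algorithm and supports multiple restarts.

```r
simple_kmeans <- function(X, K, max_iter = 100) {
  centres <- X[sample(nrow(X), K), , drop = FALSE]        # step 2: random start
  cluster <- rep(0L, nrow(X))
  for (iter in seq_len(max_iter)) {
    # step 3: squared distance from every point to every centre (n x K matrix)
    d2 <- vapply(seq_len(K),
                 function(j) rowSums(sweep(X, 2, centres[j, ])^2),
                 numeric(nrow(X)))
    new_cluster <- max.col(-d2, ties.method = "first")    # index of nearest centre
    if (all(new_cluster == cluster)) break                # step 5: nothing changed
    cluster <- new_cluster
    # step 4: move each centre to the mean of its assigned points
    for (j in seq_len(K)) {
      if (any(cluster == j)) {
        centres[j, ] <- colMeans(X[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centres = centres, iterations = iter)
}

set.seed(123)
X   <- scale(iris[, 1:4])
fit <- simple_kmeans(X, K = 3)                            # step 1: K = 3
table(fit$cluster, iris$Species)
```

The result depends on the random start, so different seeds can give different clusterings. That is the reason `kmeans()` offers the `nstart` argument.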

Choosing \(K\): The Elbow Method

  • Run \(k\)-means for \(K = 1, 2, \dots, K_{\max}\).
  • For each \(K\) compute the within-cluster sum of squares (WCSS).
  • WCSS always falls with \(K\). Look for the elbow – where extra clusters stop helping much.
  • Heuristic; combine with silhouette / gap-statistic in practice.
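
A sketch of the elbow computation on the standardised iris features. The choices \(K_{\max} = 8\) and 20 random restarts are arbitrary and made only for this example.

```r
set.seed(123)
X <- scale(iris[, 1:4])

K_max <- 8
wcss <- sapply(seq_len(K_max), function(K) {
  kmeans(X, centers = K, nstart = 20)$tot.withinss   # best of 20 random starts
})

plot(seq_len(K_max), wcss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Within-cluster sum of squares")
```

With \(K = 1\) the WCSS is just the total sum of squares. Look for the point where the curve flattens rather than for its minimum, which is always at the largest \(K\).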

\(k\)-Means in R

set.seed(123)
km <- kmeans(iris_scaled, centers = 3)

cluster_data <- as.data.frame(pca$x)
cluster_data$Cluster <- factor(km$cluster)

ggplot(cluster_data, aes(PC1, PC2, color = Cluster)) +
  geom_point() +
  theme_minimal() +
  labs(title = "k-means on iris (shown in PCA space)")

The tidymodels Pipeline

tidymodels in R

recipe() \(\rightarrow\) workflow() \(\rightarrow\) tune_grid()

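library(tidymodels)
library(ISLR2)   # assumed source of the Wage data

# Training split of the Wage data; logwage is dropped because it is
# a transformation of the outcome.
set.seed(1234)
wage_split <- initial_split(subset(Wage, select = -logwage), prop = 0.7)
wage_train <- training(wage_split)

rec <- recipe(wage ~ ., data = wage_train) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

lasso_spec <- linear_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

wf <- workflow() |> add_recipe(rec) |> add_model(lasso_spec)

folds <- vfold_cv(wage_train, v = 10)
grid  <- grid_regular(penalty(), levels = 50)

results    <- tune_grid(wf, resamples = folds, grid = grid)
best       <- select_best(results, metric = "rmse")
final_fit  <- finalize_workflow(wf, best) |> fit(data = wage_train)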

Same API works for Ridge, decision trees, random forests, boosted trees, \(k\)-NN, …

Teaser: Causal Forests

Heterogeneous Treatment Effects

LASSO and friends estimate average effects. What if a policy helps some people and hurts others?

Causal forests (Wager & Athey, 2018) estimate the conditional average treatment effect

\[ \tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x] \]

by growing an ensemble of trees that split to maximise treatment-effect heterogeneity – with sample-splitting that yields valid confidence intervals.

library(grf)
cf <- causal_forest(X = as.matrix(covariates), Y = outcome, W = treatment)
tau_hat <- predict(cf)$predictions

More in the appendix; for today, just know the tool exists.

Wrap-Up

Key Take-Aways

  • ML = improving at a task \(T\) from experience \(E\), measured by \(P\). Supervised / unsupervised / reinforcement.
  • Bias–variance trade-off drives every modelling choice; use cross-validation to land in the sweet spot.
  • Shrinkage: Ridge keeps everything, LASSO selects, Elastic Net does both. The default workhorse when \(p \gg n\).
  • PCA and \(k\)-means are the unsupervised counterparts – compress and cluster.
  • For applied work: tidymodels in R is the unified pipeline.
  • Causal forests estimate heterogeneous effects when treatment is exogenous.
  • ML helps identification; it does not create it.

References

  • Belloni, A., Chernozhukov, V., & Hansen, C. (2011). Inference for high-dimensional sparse econometric models. Advances in Economics and Econometrics.
  • Cengiz, D., Dube, A., Lindner, A., & Zentler-Munro, R. (2024). Seeing Beyond the Trees: Using Machine Learning to Estimate the Impact of Minimum Wages on Labor Market Outcomes. Journal of Labor Economics.
  • Chernozhukov, V., et al. (2019). Inference in factorial experiments with high-dimensional covariates. arXiv:1903.10075.
  • Gilchrist, D. S. & Sands, E. G. (2016). Something to Talk About: Social Spillovers in Movie Consumption. JPE, 124(5): 1268–1304.
  • Angrist, J. D. & Frandsen, B. (2022). Machine Labor. Journal of Labor Economics, 40(S1): S97–S140.
  • Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical Learning with Sparsity: The LASSO and Generalizations.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.). www.statlearning.com
  • Wager, S. & Athey, S. (2018). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. JASA, 113(523): 1228–1242.
  • Chernozhukov et al., Applied Causal Inference Powered by ML and AI: causalml-book.org

Appendix

Material below is for reference and deeper dives.

Measuring Quality of Fit: MSE

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2 \]

  • \(y_i\) is the true value for observation \(i\).
  • \(\hat{f}(x_i)\) is the predicted value from our model.
  • Lower MSE means smaller average prediction error.

Bias–Variance Decomposition (Formal)

For any fixed test point \(x_0\), the expected test MSE decomposes into three non-negative parts:

\[ \mathbb{E}\!\bigl[(y_0 - \hat{f}(x_0))^2\bigr] = \underbrace{\operatorname{Var}\!\bigl[\hat{f}(x_0)\bigr]}_{\text{variance}} + \underbrace{\bigl(\operatorname{Bias}\!\bigl[\hat{f}(x_0)\bigr]\bigr)^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}(\varepsilon)}_{\text{irreducible error}} \]

  • Variance: how much \(\hat{f}\) changes if we refit on a new training sample.
  • Bias: error from approximating an unknown \(f\) with a simpler model.
  • Irreducible error: noise we cannot model away.
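
The decomposition can be checked by simulation: refit the same model on many fresh training samples and look at how its prediction at \(x_0\) behaves. Everything in the sketch below (the true function, noise level, sample size, polynomial degrees) is an assumption chosen for illustration.

```r
set.seed(42)
f     <- function(x) sin(2 * pi * x)    # hypothetical true regression function
sigma <- 0.3                            # sd of the irreducible noise
x0    <- 0.25                           # fixed test point, f(x0) = 1
n     <- 50
reps  <- 500

# Refit a polynomial of the given degree on fresh training samples
# and collect its prediction at x0.
predict_at_x0 <- function(degree) {
  replicate(reps, {
    x   <- runif(n)
    y   <- f(x) + rnorm(n, sd = sigma)
    fit <- lm(y ~ poly(x, degree))
    unname(predict(fit, newdata = data.frame(x = x0)))
  })
}

decompose <- function(degree) {
  f_hat <- predict_at_x0(degree)
  c(degree   = degree,
    bias2    = (mean(f_hat) - f(x0))^2,   # squared bias at x0
    variance = var(f_hat),                # variance across training samples
    noise    = sigma^2)                   # irreducible error
}

res <- t(sapply(c(1, 3, 9), decompose))
round(res, 4)
```

The straight line is badly biased at \(x_0\), the cubic removes almost all of that bias, and the degree-9 polynomial pays for its extra flexibility with a higher variance.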

Bias–Variance Trade-Off in R

flexibility <- seq(1, 10, by = 0.1)
bias2    <- (10 - flexibility)^2 / 10
variance <- flexibility
test_mse <- bias2 + variance + 2  # include irreducible error

df <- data.frame(flexibility, bias2, variance, test_mse)

ggplot(df, aes(x = flexibility)) +
  geom_line(aes(y = bias2),    color = "blue", linetype = "dashed") +
  geom_line(aes(y = variance), color = "red",  linetype = "dashed") +
  geom_line(aes(y = test_mse), color = "black", linewidth = 1.2) +
  labs(y = "Error", x = "Model flexibility") +
  theme_minimal()

OLS Variance Explosion: Monte Carlo

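library(tidyverse)
set.seed(1234)
test_mse <- sapply(seq(5, 50, by = 5), function(p) {
  beta  <- rep(c(1, 0), length.out = p)          # alternate signal and noise predictors
  Sigma <- matrix(0.9, p, p) + diag(p) * 0.1     # unit variances, pairwise correlation 0.9

  mse <- replicate(200, {
    X       <- MASS::mvrnorm(50, rep(0, p), Sigma)
    y       <- drop(X %*% beta) + rnorm(50)
    test_X  <- MASS::mvrnorm(100, rep(0, p), Sigma)
    test_y  <- drop(test_X %*% beta) + rnorm(100)
    # I() keeps test_X as a single matrix column named X, matching the formula y ~ X;
    # without it predict() silently falls back to the training X.
    y_hat   <- predict(lm(y ~ X), newdata = data.frame(X = I(test_X)))
    mean((test_y - y_hat)^2)
  })
  mean(mse)
})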

Classical Diagnostics Fail in High Dimensions

  • Adding predictors always drives training RSS down and \(R^2\) up – even pure noise.
  • \(R^2\) and adjusted \(R^2\) are not reliable for picking a high-dimensional model.
  • AIC / BIC depend on asymptotics where \(p\) is fixed.
  • \(t\)-tests on individual \(\beta_j\) are distorted by multiple testing and multicollinearity.

Important

In high dimensions OLS gives neither stable predictions nor valid inference.

Information Criteria

  • AIC: \(\tfrac{1}{n}(\text{RSS} + 2 p \hat\sigma^2)\) – lower is better.
  • BIC: \(\tfrac{1}{n}(\text{RSS} + \log(n)\, p \hat\sigma^2)\) – heavier penalty than AIC.
  • Adjusted \(R^2\): \(1 - \tfrac{\text{RSS}/(n - p - 1)}{\text{TSS}/(n - 1)}\) – higher is better.

All push back against overfitting, but rely on \(p\) being small relative to \(n\).

Best Subset & Stepwise Selection

Best subset

  1. Start with the null model \(\mathcal{M}_0\).
  2. For each \(k = 1, \dots, p\) fit all \(\binom{p}{k}\) models and keep the best.
  3. Pick the size with the lowest CV / IC.

Method        Pros                                      Cons
Best subset   Conceptually clean, gives sparse models   Computationally infeasible if \(p \gtrsim 40\)
Stepwise      Cheap (~\(p(p+1)/2\) fits)                Greedy, sensitive to entry/removal order
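
The gap in computational cost is simple arithmetic. A sketch for \(p = 40\), counting the null model in both cases and taking forward selection as the stepwise variant:

```r
p <- 40
n_best_subset <- sum(choose(p, 0:p))      # every subset of p predictors: 2^p
n_forward     <- 1 + p * (p + 1) / 2      # null model plus p + (p - 1) + ... + 1 fits

c(best_subset = n_best_subset, forward_stepwise = n_forward)
```

Roughly a trillion model fits against fewer than a thousand.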

LASSO vs Ridge: Geometry

The constrained forms:

\[ \min_{\boldsymbol\beta}\;\mathrm{RSS}(\boldsymbol\beta) \quad \text{s.t.}\;\sum_j |\beta_j| \le s \quad \text{(LASSO)} \]

\[ \min_{\boldsymbol\beta}\;\mathrm{RSS}(\boldsymbol\beta) \quad \text{s.t.}\;\sum_j \beta_j^2 \le s \quad \text{(Ridge)} \]

  • Ridge ball is round – RSS contours generically touch it at a point where no coordinate is zero \(\Rightarrow\) no zeros.
  • LASSO ball is a diamond – contours touch a corner \(\Rightarrow\) sparsity.

Log-Spaced \(\lambda\) Grid

  • Shrinkage is highly nonlinear near zero – small \(\lambda\) moves cause big coefficient shifts.
  • A log grid samples densely where the action is.
  • Linear grids waste effort at large \(\lambda\) where every \(\beta_j = 0\).

Typical grid:

\[ \lambda \in \{10^{-4}, \dots, 10^{2}\}\quad\text{(100 values, equally spaced in } \log_{10}\lambda\text{)} \]
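
In R such a grid is one line. The variable name `lambda_grid` is only illustrative; the vector can be passed to `glmnet()` or `cv.glmnet()` through their `lambda` argument.

```r
# 100 values from 1e-4 to 1e2, equally spaced on the log10 scale
lambda_grid <- 10^seq(-4, 2, length.out = 100)

range(lambda_grid)
lambda_grid[2] / lambda_grid[1]     # constant ratio between neighbouring values
```

Each value is about 15% larger than the previous one, so the grid is fine where \(\lambda\) is small and coarse where it is large.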

Picking \(\lambda\) and Doing Inference After Selection

  • \(K\)-fold CV is the default in glmnet / scikit-learn.
  • Information criteria: AICc, BIC, EBIC.
  • Theory-driven: \(\lambda \propto \sqrt{\log p / n}\) (Belloni et al., 2011).

After selection:

  1. Post-LASSO OLS: re-estimate on the selected support to undo shrinkage bias.
  2. Debiased / desparsified LASSO: delivers asymptotically normal estimates for individual \(\beta_j\).
  3. Double Machine Learning: orthogonal scores + sample-splitting for causal parameters.

Post-Double-Selection LASSO

Estimate the treatment effect \(\alpha\) in

\[ Y = \alpha D + X'\beta + \varepsilon, \qquad p \gg n \]

  1. LASSO of \(Y\) on \(X\) \(\rightarrow\) selected set \(X_Y\).
  2. LASSO of \(D\) on \(X\) \(\rightarrow\) selected set \(X_D\).
  3. \(X_S = X_Y \cup X_D\).
  4. OLS of \(Y\) on \(D\) and \(X_S\).

Captures variables predictive of \(Y\) or \(D\) – protects against omitted-variable bias and keeps inference valid.
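
The four steps can be sketched on simulated data. Everything here is an assumption made for illustration: the data-generating process, the hand-written coordinate-descent LASSO `lasso_cd()` (which minimises \(\tfrac{1}{2n}\mathrm{RSS} + \lambda \sum_j |\beta_j|\)), and the simple plug-in penalty `lambda_rule()`. In applied work one would rather use a tested implementation such as `glmnet`, or `hdm::rlassoEffect()` with its data-driven penalty.

```r
soft <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

# Coordinate-descent LASSO for (1 / (2n)) * RSS + lambda * sum(|beta|).
# Assumes centred columns in X and a centred y (no intercept).
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  n <- nrow(X); beta <- rep(0, ncol(X)); r <- y
  for (it in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      r <- r + X[, j] * beta[j]                       # partial residual without x_j
      beta[j] <- soft(sum(X[, j] * r) / n, lambda) / (sum(X[, j]^2) / n)
      r <- r - X[, j] * beta[j]
    }
  }
  beta
}

set.seed(1)
n <- 100; p <- 60
X <- scale(matrix(rnorm(n * p), n, p))
D <- X[, 1] + 0.5 * X[, 2] + rnorm(n)            # treatment depends on X1 and X2
Y <- -0.5 * D + X[, 1] + X[, 3] + rnorm(n)       # true effect of D is -0.5

# Simple plug-in penalty of order sqrt(log(p) / n); an assumption, not the BCH rule.
lambda_rule <- function(v) 1.1 * sd(v) * sqrt(2 * log(p) / n)

sel_Y <- which(lasso_cd(X, Y - mean(Y), lambda_rule(Y)) != 0)   # step 1
sel_D <- which(lasso_cd(X, D - mean(D), lambda_rule(D)) != 0)   # step 2
sel   <- union(sel_Y, sel_D)                                    # step 3

pds   <- lm(Y ~ D + X[, sel, drop = FALSE])                     # step 4
naive <- lm(Y ~ D)
c(naive = coef(naive)[["D"]], pds = coef(pds)[["D"]])
```

The confounder \(X_1\) drives both \(D\) and \(Y\). The bivariate regression misses it and is biased towards zero. The treatment-equation LASSO picks it up, and the final OLS lands close to the true \(-0.5\).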

Detail: Cengiz et al. (2024)

  1. Outcome: \(y_i = \mathbb{1}\{\text{hourly wage}_i \le \text{MW}_{st}\}\) on CPS 2013 micro data.
  2. Features: age splines, gender, race, education, marital status, industry, occupation, hours, state FE, interactions (\(p \approx 150\)).
  3. Model: Logistic Elastic Net, predictors \(z\)-scored.
  4. Tuning: 10-fold CV on \((\lambda, \alpha)\); optimum near \(\alpha \approx 0.5\).
  5. Output: predicted score \(\hat p_i = \Pr(y_i = 1 \mid X_i)\) defines high-probability and high-recall treatment groups.
  6. Causal step: event-study DiD around 159 state minimum-wage hikes.

Detail: Angrist & Frandsen (2022)

  • Outcome: \(Y_i = \log(\text{weekly wage}_i)\) for male college graduates.
  • Treatment: college attributes (private, elite, …).
  • Controls: ~384 application-process variables (schools applied to, SATs, interactions).
  • Method: post-double-selection LASSO (Belloni et al., 2014).

Findings: PDS estimates match full-model OLS; private-college premium \(\approx 0.02\)–\(0.04\) (PDS) vs \(0.017\) (OLS). Tuning is robust: 18, 100, 112 selected variables all give similar \(\hat\alpha\). ML helps check identification, not generate it.

Detail: Gilchrist & Sands (2016)

  • Outcome: total movie attendance over 5 post-release weekends.
  • Treatment: opening-weekend attendance (endogenous – driven by quality).
  • Instruments: local weather + interactions (rain \(\times\) weekend \(\times\) region) – many candidates.
  • Method: LASSO selects relevant instruments; 2SLS uses the selected set.

Findings: +1% opening attendance \(\Rightarrow\) +2% cumulative 5-week attendance. Local to the city; independent of reviews \(\Rightarrow\) pure social experience motive.

Causal Forests in Detail

library(grf)

cf <- causal_forest(
  X = as.matrix(covariates),
  Y = outcome,
  W = treatment,
  num.trees = 2000
)

tau_hat <- predict(cf)$predictions
average_treatment_effect(cf)
variable_importance(cf)

  • “Honest” trees: one subsample grows the tree, another estimates leaf effects.
  • Yields valid asymptotic confidence intervals for \(\tau(x)\).
  • Use when treatment \(W\) is (conditionally) exogenous and you want who benefits how much.

Practical Tips

  • Always visualise coefficient paths against \(\lambda\).
  • Compare out-of-sample \(R^2\) to a vanilla OLS baseline.
  • Use stability selection (Meinshausen & Bühlmann) to gauge which variables matter.
  • glmnet(..., relax = TRUE) re-fits the model without penalty on each selected support to reduce shrinkage bias.
  • For causal work: combine ML with orthogonal scores (Double ML).