Road-map
- Why high-dimensional data?
- Limitations of OLS
- Bias–variance tradeoff
- Evaluating model performance
- Beyond OLS: subset selection and shrinkage
- LASSO, Ridge, and Elastic Net
- Post-double-selection LASSO
- Applications in economics
- The tidymodels pipeline
- Causal forests
Why Should We Care About High-Dimensional Data?
- Modern data sets often contain far more features than observations (\(p \gg n\)).
- Opportunities: richer signals, personalised predictions, automated text/image analysis.
- Challenges: overfitting, interpretability, computational burden.
- Central question of this lecture: How can we learn reliably when \(p\) is large?
What Do We Mean by \(n\) and \(p\)?
\(n\) – observations, \(p\) – predictors
Rows vs. columns in your data matrix.
Low-dimensional example
- \(n = 2{,}000\) patients
- \(p = 3\) covariates (age, sex, BMI)
High-dimensional example
- \(n = 200\) individuals
- \(p = 500{,}000\) SNPs (genetics)
Same statistical questions – prediction, inference – become harder when \(p\) grows.
Applications in Economics
Selected Use-Cases
- Healthcare: predicting sepsis risk from hundreds of sensor streams (Kleinberg et al., 2015).
- Finance: predicting loan default from mobile-phone metadata in Latin American markets where credit histories are absent (Bjorkegren & Grissen, 2017).
- Urban policy: mapping poverty from satellite imagery with millions of pixel features (Naik et al., 2017).
- Policing: estimating weapon possession likelihood from rich incident reports (Goel et al., 2016).
All tasks involve hundreds to millions of predictors.
Measuring the Quality of Fit
Evaluate how well our model’s predictions align with actual outcomes.
Key metric in regression: Mean Squared Error (MSE)
\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2
\]
- \(y_i\) is the true value for observation \(i\).
- \(\hat{f}(x_i)\) is the predicted value from our model.
- MSE quantifies the average squared difference between predictions and actual values.
A lower MSE means better fit – smaller prediction errors on average.
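In code, the formula above is a one-liner. A minimal Python sketch (the lecture's own examples use R; `mse` is a hypothetical helper name):

```python
def mse(y_true, y_pred):
    """Mean squared error: average squared gap between truth and prediction."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum((y - f) ** 2 for y, f in zip(y_true, y_pred)) / len(y_true)

print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0 (perfect fit)
print(mse([1.0, 2.0, 3.0], [2.0, 1.0, 4.0]))  # 1.0 (errors of +/-1)
```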
Bias–Variance Decomposition
- For any fixed test point \(x_0\), the expected test MSE splits into three non-negative parts:
\[
\mathbb{E}\!\bigl[(y_0 - \hat{f}(x_0))^2\bigr]
= \underbrace{\operatorname{Var}\!\bigl[\hat{f}(x_0)\bigr]}_{\text{variance}}
+ \underbrace{\bigl(\operatorname{Bias}\!\bigl[\hat{f}(x_0)\bigr]\bigr)^2}_{\text{bias}^2}
+ \underbrace{\operatorname{Var}(\varepsilon)}_{\text{irreducible error}}
\]
- Variance: how much \(\hat{f}\) would change if we refit on a new training set.
- Bias: error introduced by approximating the unknown, possibly complex \(f\) with a simpler model.
- The irreducible error \(\operatorname{Var}(\varepsilon)\) comes from intrinsic noise.
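The decomposition can be checked numerically. The Python sketch below (illustrative; the shrinkage estimator and all constants are invented for the demo) refits a deliberately biased estimator on many fresh training sets and verifies that the average squared estimation error equals variance plus squared bias, which is an exact algebraic identity:

```python
import random

random.seed(1)

f_x0 = 2.0     # true regression value f(x0) at the test point
sigma = 0.5    # noise sd; irreducible error is sigma^2 = 0.25

# Refit a deliberately biased estimator on 10_000 fresh training sets:
# it averages 20 noisy draws of f(x0), then shrinks the average by 0.8.
fits = []
for _ in range(10_000):
    ybar = sum(f_x0 + random.gauss(0, sigma) for _ in range(20)) / 20
    fits.append(0.8 * ybar)   # shrinkage induces a bias of about -0.4

mean_fit = sum(fits) / len(fits)
variance = sum((f - mean_fit) ** 2 for f in fits) / len(fits)
bias_sq = (mean_fit - f_x0) ** 2
est_err = sum((f - f_x0) ** 2 for f in fits) / len(fits)

# Exact identity: mean squared estimation error = variance + bias^2 ...
assert abs(est_err - (variance + bias_sq)) < 1e-9
# ... and the expected *test* MSE adds the irreducible sigma^2 on top.
print(round(variance, 4), round(bias_sq, 4),
      round(variance + bias_sq + sigma ** 2, 4))
```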
The Bias–Variance Trade-Off
- Increasing model flexibility (number of predictors, polynomial degree, splits in a tree) lowers bias but raises variance.
- Test MSE is U-shaped: initially falls as bias drops, then rises when variance dominates.
- Optimal flexibility balances the two curves.
- Practical rule: very simple methods risk high bias; highly flexible ones risk high variance. We need to balance.
Bias–Variance Trade-Off in R
Why Ordinary Least Squares Breaks Down
\[
\hat{\boldsymbol{\beta}}_{\text{OLS}}
= \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n}\bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\bigr)^2
\]
- Rank deficiency (\(p \ge n\))
- Design matrix \(X\) loses full rank.
- \(X^\top X \boldsymbol\beta = X^\top y\) has infinitely many solutions.
- Variance explosion (multicollinearity)
- Highly correlated predictors lead to small eigenvalues of \(X^\top X\).
- Coefficient estimates fluctuate wildly across samples.
- Noise variables galore
- Including many irrelevant \(X_j\) adds variance but no signal.
- Test MSE rises even when \(p < n\).
- Zero coefficients are rare
- OLS seldom yields exact zeros, hurting interpretability and parsimony.
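Rank deficiency is easy to see in miniature. In the hypothetical two-observation example below, the second column is exactly twice the first, so several distinct coefficient vectors all achieve zero RSS and OLS has no unique solution:

```python
def rss(beta, X, y):
    """Residual sum of squares for a no-intercept linear model."""
    preds = [sum(b * x for b, x in zip(beta, row)) for row in X]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds))

X = [[1.0, 2.0],
     [2.0, 4.0]]   # second column = 2 * first column: rank 1
y = [3.0, 6.0]

# Infinitely many exact least-squares solutions -- here are three:
for beta in [(3.0, 0.0), (0.0, 1.5), (1.0, 1.0)]:
    assert rss(beta, X, y) == 0.0
print("all three coefficient vectors fit perfectly")
```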
Visual: Variance Explosion
MC simulation: For each \(p \in \{5, 10, \dots, 50\}\), we simulate \(n=50\) observations with correlated predictors (\(\rho = 0.9\)). Half of the coefficients are set to zero. We compute the test MSE of OLS across 200 repetitions.
Result: MSE increases with dimensionality (overfitting).
Classical Model Diagnostics Are Ineffective
Training fit \(\neq\) Generalisation
- Least squares chooses \(\hat{\boldsymbol\beta}\) to minimise in-sample RSS, so the training MSE is optimistically biased.
- Adding predictors always drives \(\text{RSS}_{\text{train}}\) down and \(R^2_{\text{train}}\) up – even if the new variables contain only noise.
- The test error, by contrast, follows the U-shape from the bias–variance trade-off and may rise once variance dominates.
In high dimensions, metrics computed on the training data (\(R^2\), RSS, adjusted \(R^2\)) are not reliable for model selection.
Alternative Model Diagnostics
- AIC: \(\frac{1}{n}\bigl(\text{RSS} + 2p\hat\sigma^2\bigr)\)
- BIC: \(\frac{1}{n}\bigl(\text{RSS} + \log(n)\,p\,\hat\sigma^2\bigr)\) – heavier penalty than AIC.
- Adjusted \(R^2\): \(1 - \frac{\text{RSS}/(n-p-1)}{\text{TSS}/(n-1)}\), penalises added predictors.
AIC and BIC: lower is better. Adjusted \(R^2\): higher is better. All help against overfitting.
When Diagnostics Lose Their Bite
- \(R^2\) always increases with additional predictors, and the penalty in adjusted \(R^2\) is often too weak to prevent overfitting.
- Information criteria (AIC/BIC) rely on asymptotics where \(p\) is fixed.
- Hypothesis tests for individual \(\beta_j\) become unreliable due to multiple-testing and multicollinearity.
OLS provides neither stable predictions nor valid inference in high dimensions.
Evaluating Model Performance
Why Simple Training Error Is Misleading
- Training error is always non-increasing in model complexity.
- Need test-set performance – but data is scarce.
Resampling methods emulate the test-set scenario.
Cross-Validation in a Nutshell
Validation-set split
- One random split into train/test.
- Fast but high variance.
\(K\)-fold CV
- Partition data into \(K\) chunks.
- Cycle each chunk as test fold.
- Typical: \(K = 5\) or 10.
Leave-One-Out CV (LOOCV) – extreme case \(K = n\): minimal bias, maximal computation.
\[
\mathrm{CV}(n) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}^{(-i)}(x_i))^2
\]
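The \(K\)-fold procedure can be written in a few lines. An illustrative Python sketch (the lecture's examples use R; `fit_line` and `kfold_mse` are made-up helper names) with a hand-rolled 1-D OLS fit:

```python
import random

random.seed(0)

# Toy data: y = 1 + 2x + Gaussian noise (sd 0.3).
xs = [i / 10 for i in range(50)]
ys = [1 + 2 * x + random.gauss(0, 0.3) for x in xs]

def fit_line(xs, ys):
    """1-D OLS via the closed-form slope and intercept."""
    xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    return ybar - slope * xbar, slope

def kfold_mse(xs, ys, K=5):
    """Estimate test MSE by cycling each fold as the held-out set."""
    idx = list(range(len(xs)))
    random.shuffle(idx)
    folds = [idx[k::K] for k in range(K)]
    errs = []
    for fold in folds:
        held = set(fold)
        a, b = fit_line([xs[i] for i in idx if i not in held],
                        [ys[i] for i in idx if i not in held])
        errs.append(sum((ys[i] - (a + b * xs[i])) ** 2 for i in fold) / len(fold))
    return sum(errs) / K

cv = kfold_mse(xs, ys)
print(round(cv, 3))   # should sit near the noise variance, 0.09
```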
Validation of Mincerian Equation
Best approximation of the wage–age relationship is quadratic, but there is high variability in estimated test error across different validation splits.
K-Fold Validation of Mincerian Equation
Estimated test error across different validation splits is similar.
Three Solutions for Many Predictors
When \(p\) is large, ordinary least squares (OLS) becomes unreliable:
- Overfitting
- High variance
- Poor generalization
Three major classes of solutions:
- Subset Selection
- Shrinkage (Regularization)
- Dimension Reduction
Each tackles high-dimensionality differently.
Best Subset Selection
Algorithm:
- Let \(\mathcal{M}_0\) denote the null model, containing no predictors. It predicts the sample mean for all observations.
- For \(k = 1, 2, \dots, p\):
- Fit all \(\binom{p}{k}\) models with exactly \(k\) predictors.
- Select the model \(\mathcal{M}_k\) with the lowest RSS (or highest \(R^2\)).
- Choose the best model among \(\{\mathcal{M}_0, \dots, \mathcal{M}_p\}\) using:
- \(C_p\), AIC, BIC, or Adjusted \(R^2\)
- \(K\)-fold Cross-Validation.
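The algorithm above maps directly to code. A toy Python sketch (illustrative; the lecture's examples use R, and `solve`/`ols_rss` are made-up helpers) that enumerates every size-\(k\) subset on simulated data with four predictors and keeps each size's lowest-RSS model:

```python
import random
from itertools import combinations

random.seed(42)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    k = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for cc in range(c, k + 1):
                M[r][cc] -= f * M[c][cc]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        x[r] = (M[r][k] - sum(M[r][c] * x[c] for c in range(r + 1, k))) / M[r][r]
    return x

def ols_rss(cols, X, y):
    """Fit OLS on the given columns (no intercept); return the RSS."""
    Z = [[row[j] for j in cols] for row in X]
    k = len(cols)
    XtX = [[sum(z[a] * z[b] for z in Z) for b in range(k)] for a in range(k)]
    Xty = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(k)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(bj * zj for bj, zj in zip(beta, z))) ** 2
               for yi, z in zip(y, Z))

# n = 60 observations, p = 4 mean-zero predictors; only the first two matter.
n, p = 60, 4
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2 * row[0] - 1.5 * row[1] + random.gauss(0, 0.5) for row in X]

# For each size k, keep the lowest-RSS model among the C(p, k) candidates.
best = {k: min(combinations(range(p), k), key=lambda c: ols_rss(c, X, y))
        for k in range(1, p + 1)}
print(best)   # the size-2 winner should be the true support (0, 1)
```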
Best-Subset vs Stepwise: Pros and Cons
| Method | Pros | Cons |
|---|---|---|
| Best Subset | Conceptually simple; can yield sparse, interpretable models. | Computationally infeasible if \(p \gtrsim 40\): requires evaluating all \(2^p\) models. |
| Forward/Backward Stepwise | Much cheaper: only \(\sim \frac{p(p+1)}{2}\) models; fast even for large \(p\). | Greedy: may miss the globally best model; sensitive to order of entry/removal. |
Model selection criteria: \(C_p\), BIC, Adjusted \(R^2\) can help pick the best subset size.
Penalized Linear Models
\[
\min_{\alpha,\,\boldsymbol{\beta} \in \mathbb{R}^p}\left\{\underbrace{l(\alpha,\boldsymbol{\beta})}_{\text{loss function}} + n\lambda \sum_{j=1}^p \underbrace{k_j(|\beta_{j}|)}_{\text{penalty (shrinkage)}} \right\}
\]
where:
- \(l(\alpha,\boldsymbol{\beta}) = \frac{1}{n} \sum_{i=1}^n \bigl(y_i - (\alpha + \mathbf{x}'_i \boldsymbol{\beta})\bigr)^2\) in Gaussian linear regression (the average RSS)
- \(k_j(\cdot)\) is an increasing cost function that penalises deviation of \(\beta_j\) from zero
- \(\lambda \ge 0\) controls the complexity of the solution (typically chosen via a held-out sample or \(K\)-fold CV)
- The factor \(n\) puts the penalty on the same scale as the unaveraged loss; conventions differ across packages, so check how your software defines \(\lambda\)
Ridge Regression (\(\ell_2\) shrinkage)
- Adds a quadratic penalty to keep coefficients small:
\[
\hat{\boldsymbol\beta}^{\text{ridge}}
= \arg\min_{\beta}
\Bigl\{
\text{RSS}(\beta)
+ \lambda \sum_{j=1}^{p} \beta_j^{2}
\Bigr\}
\]
- No variable selection: all \(\beta_j \neq 0\) (unless \(\lambda \to \infty\)).
- Great when predictors are many and strongly correlated; shrinks them toward each other to reduce variance.
- Choose tuning parameter \(\lambda\) via \(K\)-fold CV.
Shrinkage trades a little bias for a large drop in variance, yielding lower test error.
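For two predictors the ridge solution can be written out by hand, which makes the shrinkage visible. A toy Python sketch (illustrative; `ridge2` is a made-up helper applying the closed form \((X'X + \lambda I)^{-1}X'y\) with no intercept):

```python
import random

random.seed(7)

# Two highly correlated, mean-zero predictors (0.95^2 + 0.31^2 ~ 1).
n = 100
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.95 * a + 0.31 * random.gauss(0, 1) for a in x1]
y = [a + b + random.gauss(0, 0.5) for a, b in zip(x1, x2)]

def ridge2(x1, x2, y, lam):
    """Two-predictor ridge via the 2x2 closed form (X'X + lam I)^{-1} X'y."""
    s11 = sum(a * a for a in x1) + lam
    s22 = sum(b * b for b in x2) + lam
    s12 = sum(a * b for a, b in zip(x1, x2))
    g1 = sum(a * yi for a, yi in zip(x1, y))
    g2 = sum(b * yi for b, yi in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * g1 - s12 * g2) / det, (s11 * g2 - s12 * g1) / det

for lam in [0.0, 10.0, 100.0, 1000.0]:
    b1, b2 = ridge2(x1, x2, y, lam)
    print(lam, round(b1, 3), round(b2, 3))
# The coefficient norm shrinks toward zero as lambda grows.
```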
LASSO Regression (\(\ell_1\) shrinkage & selection)
- Penalises absolute values:
\[
\hat{\boldsymbol\beta}^{\text{lasso}}
= \arg\min_{\beta}
\Bigl\{
\text{RSS}(\beta)
+ \lambda \sum_{j=1}^{p} |\beta_j|
\Bigr\}
\]
- Automatic variable selection: \(\ell_1\) geometry creates corners, so many coefficients shrink to exactly 0.
- Produces sparse, interpretable models – convenient when \(p \gg n\).
- Same tuning workflow: search \(\lambda\) on a log-grid with CV.
LASSO can mimic best-subset selection without the \(2^p\) cost.
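The exact zeros come from soft-thresholding: with orthonormal predictors, each LASSO coefficient is simply the OLS coefficient pushed toward zero by \(\lambda\) and truncated at zero. A toy Python sketch (illustrative; the lecture's examples use R):

```python
def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# With orthonormal predictors, each LASSO coefficient is just the OLS
# coefficient passed through S(., lam): small ones become exactly zero.
ols_coefs = [2.5, -0.8, 0.3, -0.1]
lam = 0.5
lasso_coefs = [round(soft_threshold(b, lam), 10) for b in ols_coefs]
print(lasso_coefs)  # [2.0, -0.3, 0.0, 0.0]
```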
Why LASSO Selects but Ridge Does Not
The constrained forms:
\[
\min_{\boldsymbol{\beta}} \left\{ \text{RSS}(\beta) \right\}
\quad \text{s.t.} \quad \sum_{j=1}^{p} |\beta_j| \leq s
\qquad \text{(LASSO)}
\]
\[
\min_{\boldsymbol{\beta}} \left\{ \text{RSS}(\beta) \right\}
\quad \text{s.t.} \quad \sum_{j=1}^{p} \beta_j^2 \leq s
\qquad \text{(Ridge)}
\]
- Elliptical RSS contours first touch the Ridge circle along an edge – all coefficients non-zero.
- For LASSO, they hit a corner of the diamond – sparsity (some \(\beta_j = 0\)).
Elastic Net: \(\ell_1 + \ell_2\)
\[
\hat{\boldsymbol\beta}^{\text{EN}}
= \arg\min_{\beta}
\Bigl\{
\text{RSS}(\beta)
+ \lambda\,
\bigl[\, \alpha \sum_{j}|\beta_j|
+ (1-\alpha)\sum_{j}\beta_j^{2} \bigr]
\Bigr\}
\]
- \(\alpha \in [0,1]\): mixing parameter. \(\alpha=1\) = LASSO, \(\alpha=0\) = Ridge.
- Keeps sparsity and grouping effect: correlated predictors tend to enter or drop together.
- Recommended when \(p\) is large and predictors are correlated.
Tune \(\lambda\) and \(\alpha\) with nested CV or a 2-D grid search.
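In the single standardized-predictor case the elastic-net update has a closed form that makes the division of labour explicit: the \(\ell_1\) part soft-thresholds, the \(\ell_2\) part shrinks whatever survives. A toy Python sketch (`enet_update` is a made-up helper):

```python
def soft(z, t):
    """Soft-threshold: sign(z) * max(|z| - t, 0)."""
    return (z - t) if z > t else (z + t) if z < -t else 0.0

def enet_update(z, lam, alpha):
    """Elastic-net coordinate update for one standardized predictor:
    soft-threshold by the l1 part, then shrink by the l2 part."""
    return soft(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))

print(enet_update(1.0, 0.4, 1.0))  # pure LASSO: ~0.6
print(enet_update(1.0, 0.4, 0.0))  # pure Ridge: ~0.714, never exactly zero
print(enet_update(0.3, 0.4, 1.0))  # LASSO zeroes small coefficients: 0.0
```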
Shrinkage Cheat-Sheet
| Method | Penalty | Variable selection? | Typical use |
|---|---|---|---|
| Ridge | \(\sum_j \beta_j^2\) | No | Multicollinearity, \(p < n\) |
| LASSO | \(\sum_j \lvert\beta_j\rvert\) | Yes | Interpretation, \(p \gg n\) |
| Elastic Net | \(\ell_1 + \ell_2\) | Yes | Correlated groups, \(p \gg n\) |
When to Use Ridge, LASSO, or Elastic Net
- LASSO: Sparse solutions – you want variable selection and interpretability.
- Ridge: Dense solutions – all predictors matter, but may be highly correlated.
- Elastic Net: Mix of both – many predictors, some collinear, some irrelevant.
Example use cases:
- LASSO: Selecting the most predictive demographics in a wage model.
- Ridge: Forecasting GDP growth with 120 macro indicators (FRED-MD).
- Elastic Net: Modelling wage exposure with state \(\times\) industry dummies.
Pipeline for Penalized Regression
- Preprocess Data
- Standardize predictors (\(z\)-scores): essential for LASSO/Ridge penalties.
- Dummy-encode categorical variables; impute or remove missing values.
- Define Tuning Grid
- LASSO/Ridge: \(\lambda \in \{10^{-4}, \dotsc, 10^{2}\}\), log-spaced.
- Elastic Net: Cross-product grid of \(\lambda\) and \(\alpha \in \{0, 0.25, 0.5, 0.75, 1\}\).
- Cross-Validation
- Use \(K = 5\) or \(10\)-fold CV to select \((\lambda^\star, \alpha^\star)\).
- Select hyperparameters minimizing CV error (or deviance).
- Refit Final Model on full training set using best parameters.
- Evaluate Performance on a held-out test set or via nested CV.
Feature Engineering Before Regularization
- Standardize: All predictors should have zero mean and unit variance.
- Dummy-encode: Convert factors to 0/1 indicators.
- Create interactions: Especially for theoretically relevant terms (e.g., gender \(\times\) occupation).
- Handle nonlinearity: Use polynomial terms or (preferably) splines.
- Deal with missingness: Impute (mean/median/model-based) or drop rows/columns as appropriate.
LASSO and Ridge penalize raw coefficient magnitudes – this only makes sense when predictors are on the same scale.
Why Log-Spaced Grid for \(\lambda\)?
Why use a log grid?
- Nonlinear shrinkage: Small changes in \(\lambda\) near zero cause large shifts in coefficients.
- Efficient resolution: Denser sampling in the sensitive range (e.g., \(10^{-4}\) to \(1\)).
- Avoid waste: Linear spacing overrepresents large \(\lambda\), where all \(\beta_j = 0\).
- Invariance to scale: Log grid works well across data magnitudes – especially after standardization.
Typical search grid:
\[
\lambda \in \{10^{-4}, 10^{-3.9}, \dotsc, 10^{2}\} \quad \text{(100 values, log-spaced)}
\]
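Constructing the grid above is one line of code; a Python sketch of the log-spaced construction:

```python
# 100 log-spaced lambda values from 1e-4 to 1e2, as in the grid above.
lambdas = [10 ** (-4 + 6 * i / 99) for i in range(100)]

print(lambdas[0], lambdas[-1])  # 0.0001 100.0
# Equal *ratios* between neighbours (not equal gaps):
ratio = lambdas[1] / lambdas[0]
assert all(abs(b / a - ratio) < 1e-9 for a, b in zip(lambdas, lambdas[1:]))
```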
Example: LASSO with Cross-Validation in R
Choosing the Penalty Parameter \(\lambda\)
- \(K\)-fold cross-validation (default in glmnet and scikit-learn).
- Information criteria: AICc, BIC, EBIC.
- Theory-driven: \(\lambda \propto \sqrt{\log p / n}\) (Belloni et al., 2011).
Practical tip: always standardise predictors before fitting.
Inference After Selection
- Post-LASSO OLS: re-estimate on the selected support.
- Debiased/Desparsified LASSO: asymptotically normal estimates for components of \(\beta\).
- Double Machine Learning (DML): orthogonal scores + sample-splitting for causal parameters.
See Chernozhukov et al. (2019) for a review.
Strengths and Weaknesses
- Handles \(p \gg n\) via regularisation.
- Automatic variable selection exposes hidden structure.
- Facilitates plug-in steps for causal estimation (Double ML).
- Regularisation bias complicates interpretation.
- Post-selection inference requires care.
- Hyper-parameter tuning (\(\lambda\)) is essential.
Case Study: ML and the Minimum-Wage Effect
Paper: Seeing Beyond the Trees – Cengiz et al. (2022) JLE
- Goal: estimate wage and employment effects on all workers likely to earn the minimum wage, not just teens.
- Problem: true treatment status (being bound by the minimum wage) is latent.
- ML step: use gradient-boosted trees to predict, for every CPS individual, the probability \(p_i\) of earning \(\le\) current minimum wage, based on rich demographics (age, gender, race, education, industry).
- Construct two data-driven groups:
- High-probability (top 10%)
- High-recall (captures 75% of all min-wage workers)
- Apply event-study (DiD) around 159 state-level minimum-wage hikes, using ML groups as treated cohorts.
Elastic-Net Step in Dube & Lindner (2021)
Goal: Predict the latent probability an individual earns \(\le\) the binding minimum wage.
- Outcome: \(y_i = \mathbb{1}\{\text{hourly wage}_i \le \text{MW}_{st}\}\), built from CPS 2013 micro data.
- Features (\(X\)): Age (splines), gender, race, education, marital status, industry, occupation, hours, state FE, and their interactions (\(p \approx 150\) predictors).
- Model: Logistic Elastic-Net with predictors \(z\)-scored.
- Tuning: \(K=10\)-fold CV on a \((\lambda, \alpha)\) grid; optimal \(\alpha \approx 0.5\).
- Output: Predicted score \(\hat{p}_i = \Pr(y_i = 1 \mid X_i)\).
Balances sparsity (\(\ell_1\)) with group-shrinkage (\(\ell_2\)), ideal when predictors are many and correlated (e.g., industry \(\times\) state dummies).
Findings from the ML-Enhanced Design
- Wage effect: +2–3% average real wages for high-probability group over five years.
- Employment, Unemployment, Participation:
- No systematic job losses in either ML group.
- Unemployment and labour-force participation essentially unchanged.
- Why ML mattered:
- Increases coverage: \(\approx 75\%\) of all affected workers vs. traditional teen-only designs.
- Improves precision: larger treated sample leads to tighter confidence bands.
- Flexible, replicable treatment assignment.
Combine predictive ML (to learn who is treated) with causal DiD (to estimate policy impact) for a scalable framework.
Case: Machine Labor (Angrist & Frandsen 2022, JLE)
- Outcome: \(Y_i = \log(\text{weekly wage}_i)\) of male college graduates.
- Treatment: \(D\): college attributes (e.g., private/elite attendance).
- Controls: \(X\): high-dimensional college-application variables (\(\approx 384\) features: number of schools applied to/accepted, SAT scores, interactions).
- Model: \[
Y = \alpha D + X'\beta + \varepsilon
\]
- Use post-double-selection LASSO (Belloni et al., 2014): run LASSO of \(Y\) on \(X\) and of \(D\) on \(X\), take union of selected \(X_S\), then OLS on \((D, X_S)\).
- Tuning: Penalty \(\lambda\) chosen by plug-in rule and 10-fold CV (via lassopack).
Post-Double-Selection (PDS) LASSO
Goal: Estimate the treatment effect \(\alpha\) in high-dimensional settings:
\[
Y = \alpha D + X'\beta + \varepsilon
\]
Steps:
- Run LASSO of \(Y\) on \(X\) \(\rightarrow\) select controls \(X_Y\)
- Run LASSO of \(D\) on \(X\) \(\rightarrow\) select controls \(X_D\)
- Define \(X_S = X_Y \cup X_D\) (union of selected variables)
- Run OLS of \(Y\) on \(D\) and \(X_S\)
Why this works:
- Captures variables predictive of \(Y\) or \(D\) (helps control for confounding).
- Avoids overfitting by reducing dimensionality via LASSO.
- Allows valid inference on \(\alpha\) even if \(p \gg n\).
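The four steps can be sketched end to end. The toy Python implementation below is illustrative only (not the lassopack/hdm code; all helper names are made up): it uses cyclic coordinate descent for the LASSO and a small Gaussian-elimination OLS, on simulated data where \(x_0\) confounds both \(D\) and \(Y\):

```python
import random

random.seed(3)

def lasso(X, y, lam, iters=100):
    """LASSO via cyclic coordinate descent with soft-thresholding."""
    n, p = len(X), len(X[0])
    beta, resid = [0.0] * p, list(y)
    col_ms = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(iters):
        for j in range(p):
            # univariate problem for beta_j, holding the others fixed
            z = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ms[j] * beta[j]
            new = max(abs(z) - lam, 0.0) * (1.0 if z >= 0 else -1.0) / col_ms[j]
            if new != beta[j]:
                for i in range(n):
                    resid[i] += X[i][j] * (beta[j] - new)
                beta[j] = new
    return beta

def ols(Z, y):
    """OLS via the normal equations, solved by Gaussian elimination."""
    k = len(Z[0])
    M = [[sum(r[a] * r[b] for r in Z) for b in range(k)]
         + [sum(r[a] * yi for r, yi in zip(Z, y))] for a in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for cc in range(c, k + 1):
                M[r][cc] -= f * M[c][cc]
    b = [0.0] * k
    for r in range(k - 1, -1, -1):
        b[r] = (M[r][k] - sum(M[r][c] * b[c] for c in range(r + 1, k))) / M[r][r]
    return b

# Simulated data: x_0 confounds both treatment D and outcome Y; true alpha = 2.
n, p = 200, 15
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
D = [row[0] + random.gauss(0, 1) for row in X]
Y = [2.0 * d + 1.5 * row[0] + random.gauss(0, 0.5) for row, d in zip(X, D)]

# Steps 1-2: LASSO of Y on X and of D on X; collect the selected supports.
support_Y = {j for j, b in enumerate(lasso(X, Y, lam=1.0)) if b != 0.0}
support_D = {j for j, b in enumerate(lasso(X, D, lam=0.4)) if b != 0.0}
union = sorted(support_Y | support_D)   # step 3: union of selected controls

# Step 4: OLS of Y on D plus the union.
Z = [[d] + [row[j] for j in union] for d, row in zip(D, X)]
alpha_hat = ols(Z, Y)[0]
print(union, round(alpha_hat, 2))  # x_0 is selected; alpha_hat should be near 2
```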
Main Findings: Machine Labor
- OLS + LASSO (PDS): College effects from PDS match full-model OLS. E.g., private-college premium \(\approx 0.02\)–\(0.04\) (PDS) vs. \(0.017\) (full OLS). Conclusion: no elite/quality premium.
- Tuning robustness: Different \(\lambda\) values change variable count (e.g., 18 vs. 100 vs. 112), but estimates of \(\alpha\) remain stable.
- Single vs. Double selection: LASSO on \(Y\) alone yields inflated effect (\(\approx 0.08\)). Double-selection corrects bias.
- IV first-stage: LASSO IV helps reduce bias but is outperformed by LIML and split-sample IV. Risk of pretest bias with ML-selected instruments.
- Conclusion: ML controls replicate baseline. ML helps check, not generate, identification.
Application: Social Spillovers in Movie Consumption
Paper: Gilchrist & Sands (2016), Journal of Political Economy, 124(5): 1268–1304.
Goal: Identify causal effect of early movie attendance on subsequent viewership – i.e., social momentum effects.
Challenge: Early attendance may reflect unobserved movie quality, so OLS is biased.
Identification Strategy with ML-IV
- Outcome (\(Y_i\)): Total movie attendance over 5 weekends post-release.
- Treatment (\(D_i\)): Opening-weekend attendance.
- Instruments (\(Z\)): High-dimensional local weather variables (rainfall, temperature, etc.) including interactions (e.g., rain \(\times\) weekend \(\times\) region).
- ML step: Use LASSO to select relevant weather instruments from many candidates.
Two-Stage Least Squares with LASSO-selected instruments:
\[
\text{First stage:} \quad D = Z\pi + \upsilon \quad \text{(select } Z \text{ using LASSO)}
\] \[
\text{Second stage:} \quad Y = \alpha D + X'\beta + \varepsilon
\]
Main Findings: Social Spillovers
- Social Effect: A 1% increase in opening-week attendance leads to a 2% increase in cumulative 5-week attendance.
- Local Containment: Effects are local to the city – no cross-city contagion.
- No Quality Learning: Effect is independent of movie reviews or quality, suggesting a social experience motive.
- Naive OLS fails to isolate exogenous variation – confounded by appeal.
- ML-IV (LASSO-2SLS) delivers more credible identification.
Machine Learning with tidymodels in R
The tidymodels ecosystem provides a unified interface for ML workflows in R.
Core pipeline: recipe() \(\rightarrow\) workflow() \(\rightarrow\) tune_grid()
tidymodels: Key Concepts
| Stage | Function(s) | Purpose |
|---|---|---|
| Preprocessing | recipe() + step_*() | Normalize, dummy-encode, impute, create interactions |
| Model | linear_reg(), rand_forest(), … | Specify model type, engine, and hyperparameters |
| Workflow | workflow() | Bundle recipe + model for portability |
| Tuning | tune_grid() / tune_bayes() | Search hyperparameter space via CV |
| Evaluation | collect_metrics(), select_best() | Compare and pick the best configuration |
| Final fit | finalize_workflow() + fit() | Refit on full training data |
Advantage over raw glmnet: consistent API across model types, tidy output, built-in resampling.
Causal Forests: Heterogeneous Treatment Effects
Problem: LASSO/Ridge estimate average effects. What if treatment effects vary across individuals?
Causal forests (Wager & Athey, 2018) estimate conditional average treatment effects (CATE):
\[
\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]
\]
Key ideas:
- Build an ensemble of “honest” trees that split to maximize treatment effect heterogeneity.
- Sample splitting: one subsample grows the tree, another estimates leaf-level effects.
- Valid asymptotic confidence intervals for \(\tau(x)\) at each point \(x\).
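The honesty idea can be illustrated with a deliberately tiny "forest" of one tree and one split: half the sample picks the split that maximizes effect heterogeneity, the other half estimates the leaf-level effects. This Python sketch is a made-up toy, not the grf algorithm:

```python
import random

random.seed(11)

# Toy data: binary randomized treatment W; true tau(x) = 0.5 if x < 0 else 2.0.
n = 2000
data = []
for _ in range(n):
    x = random.gauss(0, 1)
    w = random.randint(0, 1)
    y = (0.5 if x < 0 else 2.0) * w + random.gauss(0, 1)
    data.append((x, w, y))

def leaf_effect(rows):
    """Difference in treated vs control means (valid here: W is randomized)."""
    treated = [y for _, w, y in rows if w == 1]
    control = [y for _, w, y in rows if w == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

# Honest one-split 'tree': half the data chooses the split point that
# maximizes effect heterogeneity, the other half estimates the leaf effects.
grow, est = data[:n // 2], data[n // 2:]

best_split, best_gap = 0.0, -1.0
for q in range(-15, 16):
    s = q / 10
    left = [r for r in grow if r[0] < s]
    right = [r for r in grow if r[0] >= s]
    if min(len(left), len(right)) < 100:
        continue  # enforce a minimum leaf size
    gap = abs(leaf_effect(left) - leaf_effect(right))
    if gap > best_gap:
        best_split, best_gap = s, gap

tau_left = leaf_effect([r for r in est if r[0] < best_split])
tau_right = leaf_effect([r for r in est if r[0] >= best_split])
print(best_split, round(tau_left, 2), round(tau_right, 2))
```

Because the estimation half never saw how the split was chosen, the leaf effects are not contaminated by the search, which is the intuition behind honest trees.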
Causal Forests with the grf Package
When to use causal forests
- You have a credible source of exogenous variation (experiment or quasi-experiment).
- You want to understand who benefits most, not just the average effect.
- Useful for policy targeting and resource allocation.
Practical Tips and Diagnostics
- Always visualise coefficient paths vs. \(\lambda\).
- Check out-of-sample \(R^2\) against baseline OLS.
- Stability selection (Meinshausen & Bühlmann) to gauge variable importance.
- The relaxed LASSO (relax = TRUE in glmnet) re-fits OLS on the selected variables to reduce shrinkage bias.
- For causal work: combine ML with orthogonal scores (DML).
Key Take-aways
- High-dimensional shrinkage estimators are now standard in applied micro, macro, and finance.
- Understand the bias–variance trade-off; choose \(\lambda\) transparently.
- LASSO selects variables; Ridge shrinks densely; Elastic Net combines both.
- Post-double-selection LASSO (Belloni et al., 2014) enables valid causal inference in high dimensions.
- tidymodels provides a clean recipe() \(\rightarrow\) workflow() \(\rightarrow\) tune_grid() pipeline for R users.
- Causal forests (grf) estimate heterogeneous treatment effects with valid inference.
- Always report predictive fit, selected support, and robustness checks.
References
- Belloni, A., Chernozhukov, V., & Hansen, C. (2011). Inference for high-dimensional sparse econometric models. Advances in Economics and Econometrics.
- Belloni, A., Chernozhukov, V., & Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies, 81(2), 608–650.
- Cengiz, D., Dube, A., Lindner, A., & Zentler-Munro, D. (2022). Seeing beyond the trees: Using machine learning to estimate the impact of minimum wages on labor market outcomes. Journal of Labor Economics, 40(S1).
- Chernozhukov, V., et al. (2019). Inference in factorial experiments with high-dimensional covariates. arXiv:1903.10075.
- Gilchrist, D. S. & Sands, E. G. (2016). Something to Talk About: Social Spillovers in Movie Consumption. Journal of Political Economy, 124(5): 1268–1304.
- Angrist, J. D. & Frandsen, B. (2022). Machine Labor. Journal of Labor Economics, 40(S1): S97–S140.
- Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical Learning with Sparsity: The LASSO and Generalizations. CRC Press.
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.). www.statlearning.com
- Wager, S. & Athey, S. (2018). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Journal of the American Statistical Association, 113(523): 1228–1242.
- Chernozhukov et al., Applied Causal Inference Powered by ML and AI: causalml-book.org