Causal Machine Learning — Dr. Melvyn Weeks, University of Cambridge

MPhil · Lent 2026

Data Science Cambridge

DS300: Causal Inference and Machine Learning. Core module of the MPhil in Economics and Data Science, Faculty of Economics, University of Cambridge.

There are two cultures in the use of statistical modelling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.

Leo Breiman, 2001 — the organising tension of this course

The course covers topics at the intersection of machine learning and econometrics, combining theory and applications. In distinguishing models used to solve a prediction problem from models used to estimate a causal effect, we demonstrate how empirical strategies such as unconfoundedness, instrumental variables, and difference-in-differences can be used alongside machine learning methods for prediction.

The tension between parametric and nonparametric approaches reflects fundamental disciplinary differences. Econometricians prioritise interpretable parameters and structural understanding of economic relationships. Machine learning practitioners prioritise nonparametric flexibility and generalisation. Modern causal machine learning confronts the challenge of reconciling these competing objectives.

Course Sessions

Introduction

Best Predictor and the Conditional Expectation Function

Estimation and Inference for Causal Effects

High Dimensional Methods for Linear Models

Applications of Regularised Regression

Double Machine Learning

Treatment Effects and Double Robust Estimators

Random Forests

Architecture of Causal Trees and Generalised Random Forests

Generalised Causal Forests

Testing for Heterogeneity

Introduction to Generative AI and Large Language Models

Applications

Labour Economics

Wages and gender (Lasso). Children and parental labour supply. Impact of job training on earnings. Fertility and labour supply with causal forests.

Finance

Credit card default classification. Forecasting financial crises with tree ensembles. Post-earnings announcement drift. Corporate cash holdings.

NLP & Policy

Central bank communication (FinBERT). Sentiment and Tesla stock price. Impact of microcredit (Crépon et al.). Time-of-use tariffs and smart meter data.

Summer 2026

Summer School 2026

Details to follow. Drawing on DS300 course material, with applied sessions tailored for practitioners and researchers across disciplines.

Format

Intensive sessions covering Double Machine Learning, causal forests, and applications in labour, finance, and policy evaluation.

Audience

Graduate students, applied economists, and data scientists seeking to extend prediction-focused ML skills toward causal estimation.

Cambridge University Press · Under Proposal

Machine Learning for Causal Inference

A textbook proposal submitted to Cambridge University Press, April 2026. Drawing on course materials developed and tested at the Faculty of Economics, University of Cambridge.

Traditional econometric methods struggle with high-dimensional data and complex heterogeneity. Machine learning approaches lack the architecture for causal estimation. This fundamental tension demands a synthesis.

Machine Learning for Causal Inference — Proposal Narrative

The book's central innovation lies in its presentation of modern causal machine learning methods within a coherent economic framework. Core themes include the reconciliation of prediction and causation, the treatment of high-dimensional nuisance parameter estimation, and the development of cross-fitting and sample-splitting procedures that enable valid statistical inference with flexible machine learning algorithms.
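The cross-fitting idea mentioned above can be illustrated with a minimal sketch. It assumes a partially linear model y = theta*d + g(X) + e and, to stay self-contained, uses ordinary least squares as a stand-in for a flexible machine learning learner. The function names, the simulated data, and the true effect of 0.5 are illustrative assumptions and are not taken from the book or the course materials.

```python
import numpy as np


def ols_fit_predict(X_train, y_train, X_test):
    """Hypothetical nuisance learner: OLS with an intercept.

    Any flexible learner (lasso, random forest, ...) could be swapped in here.
    """
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef


def cross_fit_plm(y, d, X, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of theta in y = theta*d + g(X) + e."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    y_res = np.empty(n)
    d_res = np.empty(n)
    for test_idx in folds:
        # Fit nuisance functions on the other folds, predict on the held-out fold.
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        y_res[test_idx] = y[test_idx] - ols_fit_predict(X[train], y[train], X[test_idx])
        d_res[test_idx] = d[test_idx] - ols_fit_predict(X[train], d[train], X[test_idx])
    # Residual-on-residual regression gives the effect estimate.
    theta = (d_res @ y_res) / (d_res @ d_res)
    # Plug-in standard error based on the estimated score.
    psi = d_res * (y_res - theta * d_res)
    se = np.sqrt(np.mean(psi ** 2) / n) / np.mean(d_res ** 2)
    return theta, se


# Simulated example (assumed data-generating process, true theta = 0.5).
rng = np.random.default_rng(42)
n, p = 2000, 5
X = rng.normal(size=(n, p))
d = X @ np.linspace(1.0, 0.2, p) + rng.normal(size=n)
y = 0.5 * d + X @ np.linspace(0.5, -0.5, p) + rng.normal(size=n)

theta_hat, se_hat = cross_fit_plm(y, d, X)
print(f"theta_hat = {theta_hat:.3f}, se = {se_hat:.3f}")
```

Each observation's residuals come from nuisance fits that never saw that observation. This separation is what lets flexible learners be used without the overfitting bias contaminating the final estimate.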

Book Structure

Part I — Introduction

Looking Ahead. Overview. The two statistical cultures. Prediction versus causation: the fundamental distinction.

Part II — Foundations

Best Predictor and the CEF. Estimation and Inference for Causal Effects. Frisch-Waugh-Lovell Theorem as the unifying bridge.

Part III — High-Dimensional Methods

Lasso and Ridge for linear models. Applications of regularised regression. Double Machine Learning.

Part IV — Modern Causal ML

Treatment Effects and Double Robust Estimators. Random Forests. Causal Trees and Generalised Random Forests. Heterogeneous effects.

Audience

Graduate students in economics. Data scientists moving from prediction to causal reasoning. Applied researchers in policy, finance, and labour economics.

Pedagogical Basis

Materials developed and tested through DS300 at Cambridge. Code throughout in R and Python. Applications drawn from real datasets across multiple domains.