9.520/6.7910: Statistical Learning Theory and Applications, Fall 2025

Course description

Understanding intelligence, and how to replicate it in machines, is arguably one of the greatest problems in science. Learning, through its theory and computational implementations, lies at the very core of intelligence. Over the last two decades, AI systems have been successfully developed to solve complex tasks that were once the exclusive domain of biological organisms, such as computer vision, speech recognition, and natural-language understanding and generation. These successes are driven by algorithms trained from examples rather than explicitly programmed to solve each task. This marks a once-in-a-lifetime paradigm shift in computer science: from hand-crafted programs to training from examples. This course, probably the oldest in continuous operation on (machine) learning at MIT, has been pushing for this shift since its inception in 1992.

Yet the theory of learning remains incomplete, especially where it must explain the surprising empirical puzzles raised by deep learning. A comprehensive theory, explaining why and how deep networks work and what their limitations are, could (i) enable the development of even more powerful learning approaches, (ii) guide the use (or not) of these algorithms in safety-critical and high-stakes settings, and (iii) inform our understanding of human intelligence.

In this spirit, the course covers foundations and recent advances in Statistical Learning Theory (SLT), with two goals: (a) to equip students with the theoretical tools and intuitions needed to analyze and design effective ML methods, and (b) to prepare advanced students to contribute to the field. The emphasis is on (b).

Part I develops classical SLT for linear predictors and related models:

Classical regularization and regularized least squares.
Kernel methods, SVM.
Logistic regression; squared and exponential loss.
Large margin theory and minimum norm solutions.
Stochastic gradient methods.
Overparameterization and implicit regularization for linear models.
Approximation/estimation errors.
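
As a minimal illustration of how two of the topics above connect (regularized least squares and minimum norm solutions), consider the following sketch. The notation is an assumption made here for illustration and is not taken from the course materials: X is the n-by-d data matrix, y is the vector of n labels, w is the weight vector of a linear predictor, and lambda > 0 is the regularization parameter.

```latex
% Regularized least squares (ridge regression) for a linear predictor.
% Assumed notation: X \in \mathbb{R}^{n \times d}, y \in \mathbb{R}^{n}, \lambda > 0.
\[
  w_\lambda
  = \arg\min_{w \in \mathbb{R}^{d}} \; \frac{1}{n}\,\lVert Xw - y \rVert^{2} + \lambda \lVert w \rVert^{2}
  = \left( X^{\top} X + n\lambda I \right)^{-1} X^{\top} y
  = X^{\top} \left( X X^{\top} + n\lambda I \right)^{-1} y .
\]
% As the regularization vanishes, the solution tends to the minimum-norm
% least-squares solution, given by the pseudoinverse of X:
\[
  \lim_{\lambda \to 0^{+}} w_\lambda = X^{\dagger} y .
\]
```

In the overparameterized case (d > n, with X X^T invertible), this limit interpolates the training data and has the smallest Euclidean norm among all interpolating solutions.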

Part II is about neural networks. It will examine:

Approximation: why deep networks are universal parametric approximators and can avoid the curse of dimensionality.
Optimization: how weights evolve in time and across layers during training and why.
Learning theory: whether and how generalization in deep networks can be explained in terms of (a) the complexity control implicit in regularized (or unregularized) SGD and (b) the sparse compositional architecture itself.

From a philosophical standpoint, Part II will be more speculative and will deal with:

What breaks and what survives as model, task, or data complexity increases.
What the empirical evidence shows and what we currently understand.
What new mathematics is needed, and what existing mathematics is enough.
How recent ML theory reconnects to the SLT framework.
New paradigms in learning (e.g., diffusion models or autoregressive modeling with transformers).