MIT 9.520

Statistical Learning Theory and Applications

A graduate course on the foundations of learning, from classical regularization and kernel methods to modern questions about deep networks, optimization, generalization, and the theory needed to understand today's AI systems.

Instructors

Teaching Assistants

Course Description

Learning theory as a route to understanding intelligence

Understanding intelligence, and how to replicate it in machines, is arguably one of the greatest problems in science. Learning, through its theory and computational implementations, lies at the core of intelligence.

Over the last two decades, AI systems have learned to solve complex tasks that were once the exclusive domain of biological organisms: computer vision, speech recognition, and natural language understanding and generation. These successes are driven by algorithms trained from examples rather than explicitly programmed to solve each task. This course, probably the oldest continuously running machine learning course at MIT, has been pushing toward this shift since its inception in 1992.

Yet a comprehensive theory of learning, especially one that explains the empirical puzzles raised by deep learning, remains incomplete. Such a theory could enable more powerful learning approaches, guide the use of learning algorithms in high-stakes settings, and inform our understanding of human intelligence.

Part I: Classical SLT

  • Classical regularization and regularized least squares
  • Kernel methods and support vector machines
  • Logistic regression, squared loss, and exponential loss
  • Large margin theory and minimum norm solutions
  • Stochastic gradient methods
  • Overparameterization and implicit regularization for linear models
  • Approximation and estimation errors
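Two of the Part I topics above, regularized least squares and minimum norm solutions for overparameterized linear models, can be illustrated with a short numerical sketch. The code below is only an illustration and is not taken from the course materials: the function names, the synthetic Gaussian data, and the scaling of the penalty (squared loss averaged over n examples plus lam times the squared Euclidean norm) are all assumptions made for the example.

```python
import numpy as np


def regularized_least_squares(X, y, lam):
    """Closed-form minimizer of (1/n) * ||X w - y||^2 + lam * ||w||^2.

    The objective scaling is an assumption made for this sketch.
    """
    n, d = X.shape
    # Setting the gradient to zero gives (X^T X + n * lam * I) w = X^T y.
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)


def minimum_norm_solution(X, y):
    """Least-squares solution with the smallest Euclidean norm (pseudoinverse)."""
    return np.linalg.pinv(X) @ y


# Hypothetical overparameterized problem: more features (d) than examples (n).
rng = np.random.default_rng(0)
n, d = 20, 50
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true  # noiseless labels, so exact interpolation is possible

w_min = minimum_norm_solution(X, y)
w_ridge = regularized_least_squares(X, y, lam=1e-8)

print("training residual of min-norm solution:", np.linalg.norm(X @ w_min - y))
print("distance between ridge and min-norm:", np.linalg.norm(w_ridge - w_min))
print("norms (min-norm, true):", np.linalg.norm(w_min), np.linalg.norm(w_true))
```

In this setting many weight vectors fit the data exactly. The pseudoinverse picks the one of smallest norm, and the regularized solution approaches it as lam shrinks toward zero, which is one simple way to see the link between explicit regularization and minimum norm interpolation.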

Part II: Neural Networks

  • Approximation: why deep networks can be universal parametric approximators and avoid the curse of dimensionality
  • Optimization: how weights evolve in time and across layers during training
  • Learning theory: whether and how generalization in deep networks can be explained by implicit complexity control and sparse compositional structure
  • What breaks and what survives as model, task, and data complexity increase
  • How recent machine learning theory reconnects to the Statistical Learning Theory framework
  • New paradigms in learning, including diffusion models and autoregressive transformers

Rules and Expectations

Project-centered grading and research practice

The current format removes traditional problem sets to give more time to projects and introduces an oral presentation. The goal is to understand how well students own their project, how clearly they can position it within Statistical Learning Theory, and how carefully they can connect theory, experiments, and implications.

Prerequisites

Part II is designed for students with a good background in ML. The course uses calculus, linear algebra, probability, basic optimization, and some functional or convex analysis. For Course 6 students, the expected background includes 6.041, 18.06, and an introductory ML course such as 6.036, 6.401, or 6.867.

AI Tools

Students are expected to use modern LLM-based tools when useful, but must still read the relevant papers and be able to explain, rework, and defend the work offline.

Teams

Projects may be done individually or in teams of two; teams of two are encouraged. Multiple teams may work on related problems, but authorship and submission plans should be coordinated with the staff.

Timeline

Deliverables are designed to get projects moving early

Deadlines are intended to make the project research process concrete: choose a problem, understand the literature, plan the path, show early evidence, present the work, and submit a paper.

Sep 26

Groups and proposals

Submit your group membership and three project choices: either three listed projects, or two listed projects plus one self-proposed project.

Oct 10

Literature reviews and implications

For each of the three indicated projects, submit 3-4 pages covering related work and consequences for theory and practice.

Oct 17

Project plan

Submit a concise plan explaining the chosen problem, expected result, proof or experiment strategy, and milestones.

Oct 31

Initial checkpoint

Submit early results: first plots, proof sketches, ablations, or a short account of what has been learned.

First two weeks of November

Project discussions

Meet during office hours to discuss progress, roadblocks, positioning, and next steps.

Dec 2-4

Oral presentation

Give an 8-minute presentation with up to 10 content slides covering motivation, related work, results, and implications.

Dec 10

Final paper

Submit the final paper and a link to a public code repository or runnable notebook.

Grading

Participation plus project work

The grading scheme is project-based: up to 10 points for participation and up to 90 points for project-related activities (literature reviews, initial checkpoint, presentation, and final paper), with possible bonus points for a strong project plan.

Participation

Up to 10 points for active attendance, engagement, and project discussion.

Literature reviews

Up to 15 points in total across the three literature review and implications documents.

Initial checkpoint

Up to 10 points for early evidence that the project is on track.

Presentation

Up to 25 points for motivation, results, clarity, organization, and answering questions.

Final paper

Up to 40 points for execution, positioning, clarity, novelty, limitations, and significance.

Projects

Research questions for the semester

Some projects are well-defined with a clear path toward a paper; others are intentionally exploratory. Students should reach out to Pier with questions and use the project form to indicate their preferences.

01

Non-vacuous bounds for random labels

Revisit random-label experiments with overparameterized ReLU networks and test whether modern Rademacher-style bounds can predict generalization from training data.

02

Norm-based vs rank-based bounds

Compare generalization bounds based on norms and ranks across the same networks and problems.

03

Neural collapse and loss functions

Study when regularization is needed for neural collapse under square loss and exponential loss.

04

Intermediate neural collapse

Investigate whether gradient descent can achieve intermediate neural collapse, and whether stochasticity or regularization is necessary.

05

Kolmogorov-Arnold representations

Analyze approximation properties of KA-style representations and compare them with standard MLPs.

06

Adversarial examples

Critically examine recent work that may clarify the puzzle of adversarial examples.

07

Double descent

Review double descent claims in the context of recent theory and empirical evidence.

08

SGD vs layerwise optimization

Compare standard feedforward training against staged polynomial residual regression on simple low-dimensional polynomial targets.

09

Invariant representations

Explore whether transformation-invariant preprocessing can reduce sample complexity without relying on data augmentation.

10

PDEs and PINNs

Study the approximation-theoretic foundations of deep networks for solving partial differential equations.

11

Definitions of superintelligence

Formulate definitions that are achievable through supervised learning and definitions that are not.

12

Large Embedding Models and memory

Test whether reconstructing full memories from partial fragments can model aspects of recall, dreams, and imagination.

13

Step-by-step learning with simple predictors

Generate algorithmic step datasets and compare autoregressive and diffusion-style learning with linear threshold predictors and small baselines.

14

Associative memory and hippocampus

Connect recent key-value and attention mechanisms to classic associative memory models and hippocampal theories.

15

Beneficial misalignment

Study whether increasingly capable AI systems may benefit from representations that are less human-like.

16

Unsupervised contrastive learning in vision

Investigate why augmentation-based contrastive pretraining improves downstream visual classification.

17

Prefix optimization for mathematics

Optimize fixed-length token prefixes to improve mathematical output.

18

Attention heads across task families

Test whether having many attention heads helps language tasks differently from arithmetic or algorithmic tasks.

19

Depth-width tradeoffs

Under fixed compute or parameter budgets, compare deeper, narrower transformers with shallower, wider ones across language and arithmetic tasks.

Readings and Resources

Primary references for the course

Lecture notes are provided as independent draft chapters. The references below are useful background reading, especially from the theoretical viewpoint.

Machine Learning: a Regularization Approach

L. Rosasco and T. Poggio, MIT 9.520 lecture notes, draft manuscript.

Understanding Machine Learning: From Theory to Algorithms

S. Shalev-Shwartz and S. Ben-David, Cambridge University Press, 2014.

Introduction to Statistical Learning Theory

O. Bousquet, S. Boucheron, and G. Lugosi, Advanced Lectures on Machine Learning, 2004.

On the Mathematical Foundations of Learning

F. Cucker and S. Smale, Bulletin of the AMS, 2002.

A Probabilistic Theory of Pattern Recognition

L. Devroye, L. Györfi, and G. Lugosi, Springer, 1997.

Regularization Networks and Support Vector Machines

T. Evgeniou, M. Pontil, and T. Poggio, Advances in Computational Mathematics, 2000.

The Mathematics of Learning: Dealing with Data

T. Poggio and S. Smale, Notices of the AMS, 2003.

Statistical Learning Theory

V. N. Vapnik, Wiley, 1998.

Course materials and updates

Slides, notes, readings, project updates, and any schedule changes will be posted here as the semester evolves. For administrative questions, email 9.520@mit.edu.