MIT 9.520
A graduate course on the foundations of learning, from classical regularization and kernel methods to modern questions about deep networks, optimization, generalization, and the theory needed to understand today's AI systems.
Course Description
Understanding intelligence, and how to replicate it in machines, is arguably one of the greatest problems in science. Learning, through its theory and computational implementations, lies at the core of intelligence.
Over the last two decades, AI systems have learned to solve complex tasks that were once the exclusive domain of biological organisms: computer vision, speech recognition, and natural language understanding and generation. These successes are driven by algorithms trained from examples rather than explicitly programmed to solve each task. This course, probably the oldest continuously running machine learning course at MIT, has been pushing toward this shift since its inception in 1992.
Yet a comprehensive theory of learning, especially one that explains the empirical puzzles raised by deep learning, remains incomplete. Such a theory could enable more powerful learning approaches, guide the use of learning algorithms in high-stakes settings, and inform our understanding of human intelligence.
Rules and Expectations
The current format removes traditional problem sets to give more time to projects and introduces an oral presentation. The goal is to understand how well students own their project, how clearly they can position it within Statistical Learning Theory, and how carefully they can connect theory, experiments, and implications.
Part II is designed for students with a good background in ML. The course uses calculus, linear algebra, probability, basic optimization, and some functional or convex analysis. For Course 6 students, the expected background includes 6.041, 18.06, and an introductory ML course such as 6.036, 6.401, or 6.867.
Students are expected to use modern LLM-based tools when useful, but must still read the relevant papers and be able to explain, rework, and defend the work offline.
Projects may be done individually or in teams of two; teams of two are encouraged. Multiple teams may work on related problems, but authorship and submission plans should be coordinated with the staff.
Timeline
Deadlines are intended to make the project research process concrete: choose a problem, understand the literature, plan the path, show early evidence, present the work, and submit a paper.
Sep 26
Submit your group and indicate three project choices, or two listed projects plus one self-proposed project.
Oct 10
For each of the three indicated projects, submit 3-4 pages covering related work and consequences for theory and practice.
Oct 17
Submit a concise plan explaining the chosen problem, expected result, proof or experiment strategy, and milestones.
Oct 31
Submit early results: first plots, proof sketches, ablations, or a short account of what has been learned.
First two weeks of November
Meet during office hours to discuss progress, roadblocks, positioning, and next steps.
Dec 2-4
Give an 8-minute presentation with up to 10 content slides covering motivation, related work, results, and implications.
Dec 10
Submit the final paper and a link to a public code repository or runnable notebook.
Grading
The grading scheme is project-based: 10 points for participation and up to 90 points for project-related activities, with possible bonus points for a strong project plan.
Up to 10 points for active attendance, engagement, and project discussion.
Up to 15 points across the three project reviews and implications documents.
Up to 10 points for early evidence that the project is on track.
Up to 25 points for motivation, results, clarity, organization, and answering questions.
Up to 40 points for execution, positioning, clarity, novelty, limitations, and significance.
Projects
Some projects are well-defined with a clear path toward a paper; others are intentionally exploratory. Students should reach out to Pier with questions and use the project form to indicate their preferences.
Non-vacuous bounds for random labels: revisit random-label experiments with overparameterized ReLU networks and test whether modern Rademacher-style bounds can predict generalization from training data.
Compare generalization bounds based on norms and ranks across the same networks and problems.
Study when regularization is needed for neural collapse under square loss and exponential loss.
Investigate whether gradient descent can achieve intermediate neural collapse, and whether stochasticity or regularization is necessary.
Analyze approximation properties of KA-style representations and compare them with standard MLPs.
Critically examine recent work that may clarify the puzzle of adversarial examples.
Review double descent claims in the context of recent theory and empirical evidence.
Compare standard feedforward training against staged polynomial residual regression on simple low-dimensional polynomial targets.
Explore whether transformation-invariant preprocessing can reduce sample complexity without relying on data augmentation.
Study the approximation-theoretic foundations of deep networks for solving partial differential equations.
Formulate definitions that are achievable through supervised learning and definitions that are not.
Test whether reconstructing full memories from partial fragments can model aspects of recall, dreams, and imagination.
Generate algorithmic step datasets and compare autoregressive and diffusion-style learning with linear threshold predictors and small baselines.
Connect recent key-value and attention mechanisms to classic associative memory models and hippocampal theories.
Study whether increasingly capable AI systems may benefit from representations that are less human-like.
Investigate why augmentation-based contrastive pretraining improves downstream visual classification.
Optimize fixed-length token prefixes to improve mathematical output.
Test whether many attention heads help language tasks differently than arithmetic or algorithmic tasks.
Under fixed compute or parameter budgets, compare deeper-narrow and shallower-wide transformers across language and arithmetic tasks.
Readings and Resources
Lecture notes are provided as independent draft chapters. The references below are useful background reading, especially from the theoretical viewpoint.
L. Rosasco and T. Poggio, MIT 9.520 lecture notes, draft manuscript.
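S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014.
O. Bousquet, S. Boucheron, and G. Lugosi, Introduction to Statistical Learning Theory, in Advanced Lectures on Machine Learning, 2004.
F. Cucker and S. Smale, On the Mathematical Foundations of Learning, Bulletin of the AMS, 2002.
L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, 1997.
T. Evgeniou, M. Pontil, and T. Poggio, Regularization Networks and Support Vector Machines, Advances in Computational Mathematics, 2000.
T. Poggio and S. Smale, The Mathematics of Learning: Dealing with Data, Notices of the AMS, 2003.
V. N. Vapnik, Statistical Learning Theory, Wiley, 1998.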
Slides, notes, readings, project updates, and any schedule changes will be posted here as the semester evolves. For administrative questions, email 9.520@mit.edu.