9.520/6.7910: Statistical Learning Theory and Applications Fall 2024: Projects

Project Logistics

Guidelines and key dates. Online form for project proposal (complete by Mon, Oct. 14).

Final project reports (5 pages for individuals, 8 pages for teams, NeurIPS style) are due on Tue. Dec. 10.

Suggested Project Ideas for Fall 2024

Students will be awarded a 5% bonus for choosing one of the recommended projects. The CBMM Memos referenced below can be found at this link. Email 9.520@mit.edu if you have any questions about these topics. Some of the projects are marked with *: these are ongoing projects in our group, so if your project turns into a conference paper, we ask you to get in touch with us about coordinating authorship and submission.

Approximation

  1. Extend the Poggio-Mhaskar theorem to the non-smooth ReLU: use Yarotsky's result in "Optimal approximation of continuous functions by very deep ReLU networks" on approximating polynomials (and smooth functions) with ReLUs, together with the "good propagation of error" for a hierarchical network. This was done in 2022. A minimal numerical sketch of the ReLU construction appears after this list.
  2. Lower bound on approximation of the product function: related paper - Appendix A of "Why does deep and cheap learning work so well?" by Lin, Tegmark, Rolnick. The proof that $2^q$ neurons are necessary to exactly reproduce $\prod_{j=1}^q x_j$ with a smooth activation function is correct. However, it remained unclear whether a smaller network could approximate it arbitrarily closely. It turns out that a project in the 2021 class solved this problem and more, see https://openreview.net/pdf?id=AV8FPoMTTa. Thus this project is now vacuous. You are welcome, however, to propose a related topic if you can come up with a good one.
  3. From approximation to regression: two recent papers extend the idea of compositionality to regression and provide estimates of sample size in the underparametrized case. The project is a critical review of the following two papers. Alternatively, try to extend these papers to the overparametrized case using the tools discussed in the class. In particular, you will need to express approximation rates in terms of Rademacher complexity bounds.
    1. "Nonparametric regression using deep neural networks with ReLU activation function" (J Schmidt-Hieber, 2017)
    2. "On the rate of convergence of fully connected very deep neural network regression estimates" (Michael KohlerSophie Langer, 2020)
  4. The results described in Class 12 about compositional sparsity of efficiently Turing-computable functions should be adapted to provide a foundation for tensor decompositions such as the hierarchical Tucker decomposition. The goal is to provide mathematical foundations for the basic motivations in the tensor literature, of the kind you can find in the first two pages of papers such as "Hierarchical Singular Value Decomposition of Tensors" by Lars Grasedyck. In particular, could you prove that such hierarchical decompositions are possible for every efficiently computable function?
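
For project 1 above, here is a minimal NumPy sketch of the Yarotsky construction (an illustration only; the depths and evaluation grid are arbitrary choices): the hat function is written exactly with three ReLUs, and composing it m times yields an approximation of x^2 on [0,1] whose error shrinks geometrically with the depth m.

```python
# Sketch of Yarotsky's ReLU approximation of x^2 on [0, 1] (illustration only):
# the hat function g is exactly three ReLUs, and deeper compositions give
# geometrically better approximations of the square function.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x):
    # g(x) = 2x on [0, 1/2] and 2(1 - x) on [1/2, 1], written with ReLUs
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def approx_square(x, depth):
    # f_m(x) = x - sum_{s=1}^m g^{(s)}(x) / 4^s  approximates  x^2
    out = np.array(x, dtype=float)
    g = np.array(x, dtype=float)
    for s in range(1, depth + 1):
        g = hat(g)                # s-fold composition of the hat function
        out = out - g / 4**s
    return out

x = np.linspace(0.0, 1.0, 1001)
for m in (1, 3, 6):
    err = np.max(np.abs(approx_square(x, m) - x**2))
    print(f"depth {m}: max |f_m(x) - x^2| = {err:.2e}")
```

Once x^2 is available, products follow from the polarization identity xy = ((x+y)^2 - x^2 - y^2)/2, which is also the starting point for thinking about project 2.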

Optimization

  1. Layer-wise Optimization*: layer-wise optimization can be done using SGD on the weights of one layer at a time, using the error signal at the output of the network (suggestion: always have a linear classifier W_L that pools the output of the final layer; use the Neural Collapse property to set W_L during training). First a single ReLU layer followed by W_L is optimized. Then a second ReLU layer is added (followed by W_L) and optimized while the first layer is frozen, and so on up to layer L-1. Each layer has a skip connection to the output. Once this network is built via a sequence of optimizations, there are two options: a) the first layer is optimized again keeping everything else frozen, and so on; b) alternatively, one can build a second network -- later to be added linearly to the first -- regressing on the residual error left by the first; again the first layer is optimized, then a second is added, etc. Notice the connection to residual networks. This is a computational experiment that Brian and Yulu have begun; they will lead further explorations of the technique in terms of architectures and parameters, treated as hyperparameters to be found by cross-validation (a minimal training sketch appears after this list). A relevant paper is "Greedy Layerwise Learning Can Scale to ImageNet" by Eugene Belilovsky, Michael Eickenberg, Edouard Oyallon. A theoretical comparison of this "alternating gradient descent" with standard gradient descent or standard SGD would also be great!
  2. Regularized loss landscape (problem in 2022): we know from Yaim Cooper's work that the landscape of the square loss for deep overparametrized networks has degenerate zero-loss minima. How does this landscape change when the loss function includes a small regularization term $\lambda \rho^2$ (with $\rho$ given by the product of the Frobenius norms of the weight matrices)? The conjecture was that the square loss -- whose global minima are degenerate for $\lambda = 0$ -- becomes a Morse-Bott function under this perturbation. The relevant papers are "Global Minima of Overparameterized Neural Networks" by Yaim Cooper (https://epubs.siam.org/doi/10.1137/19M1308943) and "Plenty of Morse functions by perturbing with sums of squares" by Antonio Lerario (http://gokovagt.org/proceedings/2013/ggt13-lerario.pdf). The conjecture was proven to be wrong in a recent paper by Bottman, Cooper, and Lerario: with the multiplicative regularizer the function still has degenerate critical points. The question of what happens with additive regularizers is, however, open and represents a possible project.
  3. Staircase Boolean functions: review "The staircase property: How hierarchical structure can guide deep learning" and connect it to the notion of sparse compositionality developed in the class. In particular, prove that the composition of staircase functions is itself a staircase function.
  4. Why, in deep networks, do simple components (linear, or low Fourier frequencies, ...) converge more quickly than more complex ones? Consider a toy model with several networks -- 1 layer, 2 layers, 3 layers, ... -- all trained at the same time. Depending on L, the number of layers, each has a different $\rho_k$ (see CBMM Memo 112, August 2020 version). The networks with smaller L will converge faster! Thus the less complex solutions, associated with shallower networks, will converge faster. Can you extend this argument from the toy model to residual networks?
  5. Claim: the solutions to which NTK converges have higher $\rho$ than those found in https://cbmm.mit.edu/publications/dynamics-deep-classifiers-trained-square-loss-normalization-low-rank-neural-collapse. Show this experimentally.
  6. We know that SGD with weight decay introduces a bias towards small rank. Review other biases of SGD, such as a bias towards flat minima. Or critically review https://arxiv.org/abs/2210.07082 in the context of the work you have heard in the class.
  7. Inherent SGD noise*: we know that SGD with weight decay introduces inherent noise in the output of the network. Can you characterize whether this noise is close to Gaussian? What is its variance? Theory or, more realistically, experiments.
  8. Analyze (prove or disprove) the following conjecture: a) SGD noise is higher at local minima than at global minima (overparametrized case); b) local minima are more degenerate than global minima; a) and b) together imply that, asymptotically, SGD solutions are closer to global minima than to local minima with probability one.
  9. Experiment: SGD biases a weight matrix towards small rank, GD does not (Drs Poggio, Galanti)
  10. Experiment: GD cannot achieve intermediate Neural Collapse. SGD is necessary.
  11. A natural approximation to gradient descent within a continuous gradient flow formulation is to replace, in the gradient descent equation $x(t+\eta) - x(t) = -\eta F$, the term $x(t+\eta)$ with its second-order Taylor expansion. Thus $x(t+\eta) - x(t) = -\eta F$ becomes $\frac{\eta^2}{2}\ddot{x} + \eta\dot{x} = -\eta F$, i.e., $\frac{\eta}{2}\ddot{x} + \dot{x} = -F$. Study the resulting equations in $\rho$ and $V_k$ in this approximation, both experimentally and theoretically. Do oscillations occur? (A toy numerical sketch appears after this list.)
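
For project 1 (layer-wise optimization), the sketch below is one possible reading of the procedure on synthetic data, with arbitrary choices of widths, number of stages, and optimizer settings; the skip connections to the output and the Neural-Collapse-based setting of W_L described above are left as extensions, and this is not the code from the ongoing experiments.

```python
# Sketch of greedy layer-wise training (assumed setup, synthetic data): each
# stage adds one ReLU block, freezes the previously trained blocks, and trains
# only the new block together with the shared linear classifier W_L.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, k, n, width, n_stages = 20, 5, 2000, 128, 3

# Synthetic classification data from a random linear teacher.
X = torch.randn(n, d)
y = (X @ torch.randn(d, k)).argmax(dim=1)

layers = nn.ModuleList()          # ReLU blocks built so far
readout = nn.Linear(width, k)     # shared linear classifier W_L
loss_fn = nn.CrossEntropyLoss()

def features(x):
    for block in layers:
        x = block(x)
    return x

for stage in range(n_stages):
    in_dim = d if stage == 0 else width
    new_block = nn.Sequential(nn.Linear(in_dim, width), nn.ReLU())
    layers.append(new_block)

    # Freeze all previously trained blocks; train only the new block and W_L.
    for block in layers[:-1]:
        for p in block.parameters():
            p.requires_grad_(False)
    opt = torch.optim.SGD(list(new_block.parameters()) + list(readout.parameters()),
                          lr=0.1, momentum=0.9, weight_decay=1e-4)

    for step in range(500):
        idx = torch.randint(0, n, (128,))          # SGD minibatch
        loss = loss_fn(readout(features(X[idx])), y[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        acc = (readout(features(X)).argmax(dim=1) == y).float().mean()
    print(f"stage {stage + 1}: train accuracy = {acc:.3f}")
```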
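
For project 11, here is a toy SciPy sketch (quadratic loss with assumed curvature values and an assumed step size $\eta$) that integrates the plain gradient flow next to the second-order approximation $\frac{\eta}{2}\ddot{x} + \dot{x} = -\nabla L$; in this damped-oscillator form, the characteristic equation predicts oscillations when the curvature exceeds $1/(2\eta)$.

```python
# Toy comparison on L(x) = 0.5 * lam * x^2 (assumed values of eta and lam):
# plain gradient flow dx/dt = -L'(x) versus the second-order approximation
# (eta/2) x'' + x' = -L'(x). Oscillations appear when lam > 1 / (2 * eta).
import numpy as np
from scipy.integrate import solve_ivp

eta = 0.1                      # step size in the Taylor expansion (assumed)
t_span = (0.0, 3.0)
t_eval = np.linspace(*t_span, 600)

for lam in (2.0, 50.0):        # below / above the threshold 1/(2*eta) = 5
    def grad(x):
        return lam * x

    # First-order gradient flow, starting from x(0) = 1.
    sol1 = solve_ivp(lambda t, x: -grad(x), t_span, [1.0], t_eval=t_eval)

    # Second-order approximation, written as a system in (x, v = x').
    def rhs(t, s):
        x, v = s
        return [v, (-grad(x) - v) / (eta / 2)]
    sol2 = solve_ivp(rhs, t_span, [1.0, 0.0], t_eval=t_eval)

    print(f"lambda = {lam}: min x along gradient flow = {sol1.y[0].min():+.3f}, "
          f"along second-order approximation = {sol2.y[0].min():+.3f} "
          "(a negative value means the trajectory overshoots zero, i.e. oscillates)")
```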

     

Generalization

  1. Reformulate the result about generalization of sparse networks that you have seen in class to the case in which the weight matrices do not have zero weights but just very small weights. Test the conjecture that this explains why pruning of deep nets compresses the network by reducing the number of weights but does not improve performance. Discuss the role of $L_2$ vs $L_1$ optimization.
  2. Check experimentally the role that a) $\rho$*, b) structural sparsity of the weight matrices (as in standard CNNs), and c) the rank of the weight matrices play in generalization (a diagnostic sketch appears after this list).
  3. Characterize the Rademacher complexity of rank-constrained neural networks and obtain a bound that depends on the rank (or show that rank does not matter!). Run experiments to verify your theoretical conclusions. (Contact Akshay)
  4. Invariant Autoencoders: consider an "oracle" that rectifies images of objects to images in a standard viewpoint (see the class on Invariance). Can you train a deep autoencoder to do the same? First try just scaling, translation, and rotation in the image plane; this should work (a toy sketch appears after this list). After this you can try 3D transformations such as rotations in depth.
  5. Stability: extend the stability theorems in Memo 108 to more general loss functions - the stability of interpolating kernel machines was proven for the square loss in CBMM Memo 108. Is there a way to extend the results to other loss functions (hinge loss, etc.)?
  6. Experiment: use available software to train and test a simple deep network and show non-vacuous generalization bounds (Dr Galanti)
  7. Ambitious project: develop a stability theory for deep NNs (there is a CBMM memo with some ideas about this). With or without the theory, run experiments trying to show empirically that higher stability correlates with smaller generalization error.
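
For project 2 in this list, a short PyTorch sketch of the three quantities to track -- $\rho$ as the product of the Frobenius norms, the fraction of near-zero weights, and the effective rank of each weight matrix -- computed here on an untrained placeholder MLP; substitute the trained networks you want to study, and note that the thresholds are arbitrary choices.

```python
# Diagnostics sketch (placeholder model, arbitrary thresholds): product of
# Frobenius norms (rho), near-zero weight fraction, and effective rank of the
# weight matrices. Replace `model` with a trained network of interest.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(),
                      nn.Linear(512, 512), nn.ReLU(),
                      nn.Linear(512, 10))

rho = 1.0
for name, module in model.named_modules():
    if not isinstance(module, nn.Linear):
        continue
    W = module.weight.detach()
    fro = torch.linalg.matrix_norm(W).item()          # Frobenius norm
    rho *= fro

    sparsity = (W.abs() < 1e-3).float().mean().item()  # near-zero fraction
    s = torch.linalg.svdvals(W)                         # singular values, descending
    eff_rank = int((s > 1e-3 * s[0]).sum())             # relative-threshold rank

    print(f"layer {name}: ||W||_F = {fro:.2f}, near-zero fraction = {sparsity:.2%}, "
          f"effective rank = {eff_rank}/{min(W.shape)}")

print(f"rho (product of Frobenius norms) = {rho:.2e}")
```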
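
For project 4 (invariant autoencoders), a toy sketch of the translation-only case with synthetic images (a random patch that is either centered or cyclically shifted); the actual project would use real images and add rotation and scaling.

```python
# Toy sketch of the "rectifying oracle" idea for translations only (synthetic
# data): the network sees a randomly shifted patch and is trained to output
# the same patch in the canonical, centered position.
import torch
import torch.nn as nn

torch.manual_seed(0)
H = 16

def make_pair(batch):
    # canonical image: a random 4x4 patch in the center of an H x H canvas
    canonical = torch.zeros(batch, 1, H, H)
    canonical[:, :, 6:10, 6:10] = torch.rand(batch, 1, 4, 4)
    # transformed image: the same patch, randomly translated (cyclic shift)
    dxs = torch.randint(-5, 6, (batch,))
    dys = torch.randint(-5, 6, (batch,))
    shifted = torch.stack([torch.roll(img, shifts=(int(dx), int(dy)), dims=(1, 2))
                           for img, dx, dy in zip(canonical, dxs, dys)])
    return shifted, canonical

# small convolutional autoencoder: shifted image in, canonical image out
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(16 * H * H, 256), nn.ReLU(),
    nn.Linear(256, 16 * H * H), nn.ReLU(),
    nn.Unflatten(1, (16, H, H)),
    nn.Conv2d(16, 1, 3, padding=1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):
    x, target = make_pair(64)
    loss = ((model(x) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 250 == 0:
        print(f"step {step}: reconstruction MSE = {loss.item():.4f}")
```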

Transformers and LLMs

  1. Transformers sparsity: how sparse are the attention maps in various types of transformers? What about W_K and W_Q? This is an empirical project.
  2. Confirm experimentally that self-attention is almost equivalent to a normalized HyperBF (depending on the normalization used for q and k); a small numerical check appears after this list. This project consists of completing experiments initiated by Yulu Gan, who will be leading the project.
  3. Separately, confirm that normalized kernels (or approximations to them) are responsible for in-context learning by transformers.
  4. Ablate the in-context learning ability of a transformer by changing the normalization of self-attention appropriately.
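
For projects 2-4, a small NumPy sketch of the identity behind the claim, under the assumption that queries and keys are normalized to unit norm and the Gaussian (HyperBF) bandwidth is tied to the softmax temperature: on the unit sphere, exp(-||q - k||^2 / (2 sigma^2)) is proportional to exp(q.k / sigma^2), so the row-normalized kernel smoother coincides with softmax attention.

```python
# Numerical check (assumed setup: unit-norm queries/keys, Gaussian bandwidth
# sigma^2 = sqrt(d)) that softmax attention equals a normalized RBF/HyperBF
# kernel smoother when q and k lie on the unit sphere.
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 32
q = rng.standard_normal((n, d)); q /= np.linalg.norm(q, axis=1, keepdims=True)
k = rng.standard_normal((n, d)); k /= np.linalg.norm(k, axis=1, keepdims=True)
v = rng.standard_normal((n, d))
sigma2 = np.sqrt(d)                     # plays the role of the sqrt(d) temperature

# standard scaled dot-product attention weights
scores = q @ k.T / sigma2
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# normalized Gaussian (HyperBF-style) kernel smoother over the keys
sq_dist = ((q[:, None, :] - k[None, :, :]) ** 2).sum(-1)
kernel = np.exp(-sq_dist / (2 * sigma2))
kernel /= kernel.sum(axis=1, keepdims=True)

print("max |attention - normalized kernel| =", np.abs(attn - kernel).max())
print("max difference in outputs          =", np.abs(attn @ v - kernel @ v).max())
```

Dropping the unit-norm assumption on q and k breaks this exact identity, which is precisely the kind of normalization change that projects 3 and 4 propose to probe.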

 

Wikipedia Projects

  1. In past years this class has written entries for Wikipedia about various aspects of learning theory or updated some of the existing entries. You are welcome to propose new entries or entries to be updated.