Project Logistics
Guidelines and key dates. Online form for project proposal (complete by Thu. Oct. 20).
Final project reports (5 pages for individuals, 8 pages for teams, NeurIPS style) are due on Tue. Dec. 13.
Suggested Project Ideas for Fall 2022
Choosing one of the recommended projects below is awarded 5% extra credit, with an additional bonus if the project is very good. The CBMM Memos referenced below can be found at this link. Please email 9.520@mit.edu if you have any questions or need clarifications about these topics. Some of the projects are marked with *: these are ongoing projects in our group, so if your project turns into a conference paper, we ask you to get in touch with us about coordinating authorship and submission.
Approximation
- Extend the Poggio-Mhaskar theorem to the non-smooth ReLU: use Yarotsky's result in “Optimal approximation of continuous functions by very deep ReLU networks” about using ReLUs to approximate polynomials (and smooth functions), together with the “good propagation of error” for a hierarchical network (a numerical sketch of Yarotsky's construction appears after this list).
- Lower bound on approximation of the product function: related paper – Appendix A of “Why does deep and cheap learning work so well?” by Lin, Tegmark, Rolnick. The proof that $2^q$ neurons are necessary to exactly reproduce $\prod_{j=1}^q x_j$ using a smooth activation function is correct. However, it remained unclear whether a smaller network could approximate it arbitrarily closely. It turns out that a project in the 2021 class solved this problem and more, see https://openreview.net/pdf?id=AV8FPoMTTa. Thus this project is now vacuous. You are welcome, however, to propose a related topic if you can come up with a good one.
- From approximation to regression: two recent papers extend the idea of compositionality to regression and provide estimates of sample size in the underparametrized case. The project is a critical review of these two papers and of how they connect to the results described in the classes on approximation and to the overparametrized case of deep networks. Sophie Langer will devote part of a class to these papers (a representative bound from this literature is written out after this list).
- Staircase functions and compositionality*: review “The staircase property: How hierarchical structure can guide deep learning” and connect to the notion of sparse compositionality developed in the class.
- Empirically, CNNs have had several success stories (images, text, speech, ...). Are there comparable success stories for dense multilayer networks? This could be an interesting review of the relevant literature.
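For the first project above (extending the Poggio-Mhaskar theorem to ReLU), here is a minimal numerical sketch, in plain NumPy, of the building block of Yarotsky's construction: the hat function g is written with three ReLUs and composed m times, and f_m(x) = x - \sum_{s=1}^{m} g_s(x)/4^s approximates $x^2$ on $[0,1]$ with error at most $2^{-2m-2}$; products, and hence polynomials, then follow from $xy = ((x+y)^2 - x^2 - y^2)/2$.

    import numpy as np

    relu = lambda z: np.maximum(z, 0.0)

    def g(x):
        # hat ("sawtooth") function on [0, 1], written with three ReLU units
        return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

    def f_m(x, m):
        # f_m(x) = x - sum_{s=1}^m g_s(x) / 4^s, with g_s the s-fold composition of g
        approx, gs = x.copy(), x.copy()
        for s in range(1, m + 1):
            gs = g(gs)
            approx = approx - gs / 4.0 ** s
        return approx

    x = np.linspace(0.0, 1.0, 10001)
    for m in (1, 2, 4, 8):
        err = np.max(np.abs(f_m(x, m) - x ** 2))
        print(f"m={m}: max error {err:.2e}  (Yarotsky bound {2.0 ** (-2 * m - 2):.2e})")

The point of the experiment is that the empirical error tracks the $2^{-2m-2}$ bound, i.e. depth buys exponential accuracy for $x^2$ and hence for products.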
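For the project on regression, a representative bound from this literature, stated here only as a reference point (the two papers in question may prove a different or sharper version), is: if the regression function $f^*$ is a hierarchical composition of functions, each of smoothness $p$ and each depending on at most $d^*$ of its inputs, then suitable deep network estimators $\hat f_n$ achieve, up to logarithmic factors,

    $$ \mathbb{E}\,\|\hat f_n - f^*\|_{L^2}^2 \;\lesssim\; n^{-\frac{2p}{2p+d^*}}, $$

so the sample complexity is governed by the effective dimensionality $d^*$ of the constituent functions rather than by the ambient input dimension.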
Kernels and overparameterization
- Extend the stability theorems in Memo 108 to more general loss functions – the stability of interpolating kernel machines was proven for the squared loss in CBMM Memo 108. Is there a way to extend the results to other loss functions (hinge loss, etc.)? A small leave-one-out experiment is sketched below.
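A possible empirical starting point, before any theory: measure the leave-one-out behavior of an interpolating kernel machine directly. The sketch below uses an RBF kernel, squared loss, and brute-force refits; the precise stability definitions of Memo 108 may differ in detail.

    import numpy as np

    def rbf(A, B, gamma=1.0):
        # Gaussian (RBF) kernel matrix between the rows of A and B
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    rng = np.random.default_rng(0)
    n, d = 60, 5
    X = rng.standard_normal((n, d))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

    def interpolant(Xtr, ytr):
        # minimum-norm (near-)interpolant; the tiny ridge is only for numerical stability
        K = rbf(Xtr, Xtr)
        a = np.linalg.solve(K + 1e-10 * np.eye(len(ytr)), ytr)
        return lambda Z: rbf(Z, Xtr) @ a

    # leave-one-out "stability": loss at the held-out point when it is removed
    loo_sq = []
    for i in range(n):
        mask = np.arange(n) != i
        f_minus_i = interpolant(X[mask], y[mask])
        loo_sq.append((f_minus_i(X[i:i + 1])[0] - y[i]) ** 2)
    print("mean LOO squared loss:", np.mean(loo_sq))
    # project idea: repeat on a classification variant with hinge / absolute loss
    # and compare how the corresponding leave-one-out quantities behave

The same loop can be reused with other losses once the corresponding interpolating estimator is substituted for the squared-loss one.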
Optimization
- Layer-wise Optimization*: layer-wise optimization can be done using SGD on the weights of one layer at a time, using the error signal from the output of the network (suggestion: always have a linear classifier W_L that pools the output of the final layer, and use the Neural Collapse property to set W_L during training). First a single ReLU layer followed by W_L is optimized. Then a second ReLU layer is added (followed by W_L) and optimized while the first layer is “frozen”, and so on up to layer L-1. Each layer has a skip connection to the output. Once this network is built via a sequence of optimizations (are they convex?), there are two options: a) the first layer is optimized again keeping everything else frozen, and so on; this second iteration involves non-convex optimizations; b) alternatively, one can build a second network — later to be linearly added to the first — regressing on the residual error left by the first. Again the first layer is optimized first, then a second is added, etc. This is a computational experiment (a minimal training-loop sketch appears after this list). A theoretical comparison of this “alternating gradient descent” with standard gradient descent or standard SGD would be great! A relevant paper is “Greedy Layerwise Learning Can Scale to ImageNet” by Eugene Belilovsky, Michael Eickenberg, Edouard Oyallon.
- Regularized loss landscape*: we know from Yaim Cooper's work (Class 23) that the landscape of the square loss for deep overparametrized networks has degenerate zero-loss minima. How does this landscape change when the loss function includes a small regularization term $\lambda \rho^2$ (with $\rho$ given by the product of the Frobenius norms of the weight matrices)? The conjecture is that the square loss function — which has degenerate global minima for $\lambda = 0$ — becomes a Morse-Bott function (or perhaps even just Morse); the regularized loss is written out after this list. The relevant papers are “Global Minima of Overparameterized Neural Networks” by Yaim Cooper, https://epubs.siam.org/doi/10.1137/19M1308943, and “Plenty of Morse functions by perturbing with sums of squares” by Antonio Lerario, http://gokovagt.org/proceedings/2013/ggt13-lerario.pdf.
- A theoretical analysis of the difference between Batch Normalization and Weight Normalization. This is probably difficult; the two reparametrizations are recalled after this list as a possible starting point.
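For the layer-wise optimization project, here is a minimal PyTorch sketch of the greedy scheme under simplifying assumptions: flattened inputs of a hypothetical size 784, toy random data in place of a real DataLoader, a single linear classifier W_L re-trained at every stage, and no skip connections or Neural-Collapse initialization of W_L (both of which the actual project should add).

    import torch
    import torch.nn as nn

    def train_stage(frozen, new_layer, head, loader, epochs=1, lr=0.1):
        # SGD on the new layer and on W_L only; earlier layers stay frozen
        opt = torch.optim.SGD(list(new_layer.parameters()) + list(head.parameters()), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for x, y in loader:
                with torch.no_grad():
                    h = frozen(x)
                loss = loss_fn(head(new_layer(h)), y)
                opt.zero_grad(); loss.backward(); opt.step()

    # toy stand-in data; replace with a real DataLoader over flattened images
    X, y = torch.randn(512, 784), torch.randint(0, 10, (512,))
    loader = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(X, y), batch_size=64)

    d_in, width, n_classes, depth = 784, 256, 10, 4
    head = nn.Linear(width, n_classes)           # the linear classifier W_L
    frozen, layers = nn.Identity(), []
    for k in range(depth):
        new_layer = nn.Sequential(nn.Linear(d_in if k == 0 else width, width), nn.ReLU())
        train_stage(frozen, new_layer, head, loader)
        layers.append(new_layer)
        frozen = nn.Sequential(*layers)          # freeze everything built so far
        for p in frozen.parameters():
            p.requires_grad_(False)
    # option (a) of the project would now re-optimize layer 1 with the rest frozen, etc.;
    # option (b) would build a second network regressing on the residual error of this one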
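For the regularized loss landscape project, one way to write out the object of study (up to the exact normalization used in class) is

    $$ \mathcal{L}_\lambda(W_1,\dots,W_L) \;=\; \frac{1}{n}\sum_{i=1}^{n}\bigl(f_W(x_i)-y_i\bigr)^2 \;+\; \lambda\,\rho^2, \qquad \rho=\prod_{k=1}^{L}\|W_k\|_F, $$

and the conjecture asks whether, for small $\lambda > 0$, the degenerate zero-loss valleys of the $\lambda = 0$ landscape turn into critical sets that are nondegenerate in the Morse-Bott (or even Morse) sense.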
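For the Batch vs. Weight Normalization project, a natural starting point is to recall the two reparametrizations, written here for a single unit with weight vector $v$ (resp. mini-batch pre-activations $z_i$):

    $$ \text{WN:}\quad w=\frac{g}{\|v\|}\,v, \qquad\qquad \text{BN:}\quad \mathrm{BN}(z_i)=\gamma\,\frac{z_i-\mu_B}{\sqrt{\sigma_B^2+\varepsilon}}+\beta, $$

where $\mu_B$ and $\sigma_B^2$ are the mean and variance of $z$ over the mini-batch. Both remove the dependence on the scale of the underlying weights (of $v$ for WN, of the incoming weights for BN), but BN's normalization depends on the data distribution through the batch statistics while WN's does not; quantifying what this difference does to the loss landscape and to the optimization dynamics is one concrete way to begin.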
SGD and low-rank
- We know that SGD with weight decay introduces a bias towards small rank. Review other biases of SGD, such as the bias towards flat minima, or critically review https://arxiv.org/pdf/2210.07082.pdf in the context of the work you have seen in class.
- Inherent SGD noise*: we know that SGD with weight decay introduces an inherent noise in the output of the network. Can you characterize whether this noise is close to being Gaussian? What is its variance? Theory and/or experiments; a minimal measurement setup is sketched below.
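A minimal measurement setup for the experimental part (toy regression data, a small ReLU network, SciPy's normality test; every choice here is a placeholder): run SGD with weight decay well past convergence, record the output at a fixed probe input, and test the recorded fluctuations for Gaussianity.

    import torch, torch.nn as nn
    from scipy import stats

    torch.manual_seed(0)
    X, y = torch.randn(256, 10), torch.randn(256, 1)              # toy regression data
    net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.SGD(net.parameters(), lr=0.05, weight_decay=1e-3)
    x_probe = torch.randn(1, 10)                                  # fixed probe input

    outputs = []
    for step in range(8000):
        idx = torch.randint(0, 256, (32,))                        # mini-batch SGD
        loss = ((net(X[idx]) - y[idx]) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        if step >= 4000:                                          # discard the transient
            with torch.no_grad():
                outputs.append(net(x_probe).item())

    outputs = torch.tensor(outputs)
    # note: successive iterates are correlated; subsample or run longer for a cleaner test
    print("variance of f(x_probe):", outputs.var().item())
    print("normality-test p-value:", stats.normaltest(outputs.numpy()).pvalue)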
Neural Collapse
- Show how you can avoid learning the last-layer classifier when you train a deep ReLU classifier. How does this help? Experiments; one possible scheme is sketched below.
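One possible scheme, motivated by the self-duality aspect of Neural Collapse (the learned classifier aligns with the class means of the penultimate-layer features): either fix W_L during training (e.g. to a simplex ETF) or, as in the simpler variant sketched below, discard the learned classifier at test time and classify with the class means. Here `trunk`, `X_train`, `y_train`, and `X_test` are assumed to be given.

    import torch

    @torch.no_grad()
    def class_means(trunk, X_train, y_train, n_classes):
        H = trunk(X_train)                                   # penultimate-layer features
        return torch.stack([H[y_train == c].mean(0) for c in range(n_classes)])

    @torch.no_grad()
    def predict(trunk, means, X):
        H = trunk(X)
        d = ((H[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        return d.argmin(dim=1)                               # nearest class mean

    # means = class_means(trunk, X_train, y_train, n_classes=10)
    # y_hat = predict(trunk, means, X_test)

The experiment would compare this nearest-class-mean rule with the trained W_L as training progresses and collapse sets in.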
Generalization
- Rademacher complexity of convolutional layers*: complexity as a function of the overlap of the convolutional patches (no overlap corresponds to a stride equal to the kernel size). Software is available from Mengjia Xu to perform experiments on generalization bounds.
- Rank and Rademacher*: does the Rademacher complexity of a normalized ReLU network (the Frobenius norm of all weight matrices is 1) depend on the rank of the weight matrices? Theory and especially experiments (a crude estimation scheme is sketched after this list).
- Rank and $\rho$*: is low rank or small $\rho$ the main cause of generalization? Or something else? Experiments and possibly theory (a helper for computing $\rho$ and stable ranks is sketched after this list).
- Theory*: is it possible to get a bound on Rad(F), the Rademacher complexity of normalized ReLU networks, in terms of the rank of the weight matrices?
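For the “Rank and Rademacher” experiments above, one crude way to estimate the empirical Rademacher complexity $\hat R_S(F) = \mathbb{E}_\sigma \sup_{f\in F} \frac{1}{n}\sum_i \sigma_i f(x_i)$ of a normalized ReLU network is projected gradient ascent: sample random signs, maximize the correlation, and rescale every weight matrix to unit Frobenius norm after each step. The optimization only yields a lower bound on the sup, and the data below is a toy stand-in.

    import torch, torch.nn as nn

    def normalize_(net):
        # project every weight matrix back to unit Frobenius norm
        with torch.no_grad():
            for m in net.modules():
                if isinstance(m, nn.Linear):
                    m.weight /= m.weight.norm()

    def rademacher_estimate(make_net, X, n_draws=10, steps=200, lr=0.05):
        n, vals = X.shape[0], []
        for _ in range(n_draws):
            sigma = torch.randint(0, 2, (n,)).float() * 2 - 1     # random +/-1 labels
            net = make_net(); normalize_(net)
            opt = torch.optim.SGD(net.parameters(), lr=lr)
            for _ in range(steps):
                corr = (sigma * net(X).squeeze(-1)).mean()        # (1/n) sum_i sigma_i f(x_i)
                opt.zero_grad(); (-corr).backward(); opt.step()   # gradient ascent on corr
                normalize_(net)
            with torch.no_grad():
                vals.append((sigma * net(X).squeeze(-1)).mean().item())
        return sum(vals) / len(vals)

    X = torch.randn(200, 20)                                      # toy data
    make_net = lambda: nn.Sequential(nn.Linear(20, 100, bias=False), nn.ReLU(),
                                     nn.Linear(100, 1, bias=False))
    print("estimated empirical Rademacher complexity:", rademacher_estimate(make_net, X))

To probe the dependence on rank, one can repeat the estimate after projecting the weight matrices onto a fixed low rank inside normalize_.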
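For the “Rank and $\rho$” experiments, a tiny helper that computes $\rho$ (the product of the Frobenius norms) and the stable rank $\|W\|_F^2/\sigma_{\max}(W)^2$, used here as a convenient proxy for rank, of every weight matrix of an already-trained PyTorch network `net` (assumed given):

    import torch, torch.nn as nn

    @torch.no_grad()
    def rho_and_stable_ranks(net):
        rho, stable_ranks = 1.0, []
        for m in net.modules():
            if isinstance(m, nn.Linear):
                W = m.weight
                rho *= W.norm().item()                    # Frobenius norm of this layer
                s = torch.linalg.svdvals(W)               # singular values, descending
                stable_ranks.append((s.square().sum() / s[0] ** 2).item())
        return rho, stable_ranks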
Invariance
- Invariant Autoencoders: look at slide 4 in Class 16: the “oracle” rectifies images of objects to images in a standard viewpoint. Can you train a deep autoencoder to do the same? First try just scaling, translation, and rotation in the image plane; this should work (a minimal sketch is given below). After that you can try 3D transformations such as rotations in depth.
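A minimal sketch of the image-plane case in pure PyTorch: random tensors stand in for real images, a random in-plane rotation is applied with affine_grid/grid_sample, and a small autoencoder is trained to regress the canonical (upright) image, i.e. to play the role of the “oracle”. Scaling and translation can be added to the same affine matrix.

    import math
    import torch, torch.nn as nn
    import torch.nn.functional as F

    def rotate(imgs, angles):
        # rotate a batch (B, 1, H, W) by the given angles (radians) with grid_sample
        cos, sin = torch.cos(angles), torch.sin(angles)
        zeros = torch.zeros_like(cos)
        theta = torch.stack([torch.stack([cos, -sin, zeros], dim=1),
                             torch.stack([sin,  cos, zeros], dim=1)], dim=1)
        grid = F.affine_grid(theta, imgs.shape, align_corners=False)
        return F.grid_sample(imgs, grid, align_corners=False)

    B, H = 64, 28
    canonical = torch.rand(B, 1, H, H)                    # stand-in for real images
    autoenc = nn.Sequential(                              # a deliberately tiny autoencoder
        nn.Flatten(), nn.Linear(H * H, 128), nn.ReLU(),
        nn.Linear(128, H * H), nn.Unflatten(1, (1, H, H)))
    opt = torch.optim.Adam(autoenc.parameters(), lr=1e-3)

    for step in range(1000):
        angles = (torch.rand(B) - 0.5) * math.pi          # random in-plane rotations
        x = rotate(canonical, angles)                     # transformed input
        loss = F.mse_loss(autoenc(x), canonical)          # regress the canonical pose
        opt.zero_grad(); loss.backward(); opt.step()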
Wikipedia Projects
- In past years this class has written entries for Wikipedia about various aspects of learning theory or updated some of the existing entries. You are welcome to propose new entries or existing entries to be updated.
Future Projects
- Analyze (prove or disprove) the following conjecture: a) SGD noise is higher for local minima than for global minima (overparametrized case); b) local minima are more degenerate than global minima; a) and b) together would imply that SGD solutions end up closer to global minima than to local minima, with probability one, asymptotically.