9.520/6.7910: Statistical Learning Theory and Applications Fall 2024: Projects

 

Guidelines and key dates. Online form for project proposal (complete by Mon, Nov. 4). Final project reports (5 pages for individuals, 8 pages for teams, NeurIPS style) are due on Tue. Dec. 10.

Suggested Project Ideas for Fall 2024

Every project proposal must be approved by the teaching staff. Students will be awarded a 5% bonus for choosing one of the recommended projects (this is highly recommended; multiple teams working on the same project are OK). Email 9.520@mit.edu if you have any questions about these topics. Since most of these projects are ongoing research projects in our group, please get in touch with us to coordinate authorship and submission if you plan to submit the results of your project to a conference.

Below is a preliminary list of suggested project ideas.

Bounds

1. The goal of this project is to publish a paper in a major ML conference debunking the much-cited paper "Understanding deep learning (still) requires rethinking generalization". The project consists of repeating the experiments with random labels using an overparametrized deep ReLU network and showing that classical generalization bounds predict the generalization error from training data. The plan is to use the Rademacher bounds of https://proceedings.neurips.cc/paper_files/paper/2023/hash/8493e190ff1bbe3837eca821190b61ff-Abstract-Conference.htm
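As a concrete starting point, here is a minimal sketch of the random-label training run, assuming PyTorch and torchvision are available; the MLP architecture, width, and hyperparameters are illustrative placeholders, and the Rademacher bound from the cited paper would then be evaluated on the trained network.

    import torch, torch.nn as nn
    from torchvision import datasets, transforms

    device = "cuda" if torch.cuda.is_available() else "cpu"

    train = datasets.CIFAR10("data", train=True, download=True,
                             transform=transforms.ToTensor())
    # replace the true labels with uniformly random ones
    train.targets = torch.randint(0, 10, (len(train.targets),)).tolist()
    loader = torch.utils.data.DataLoader(train, batch_size=128, shuffle=True)

    # overparametrized ReLU MLP (placeholder architecture)
    model = nn.Sequential(nn.Flatten(),
                          nn.Linear(3 * 32 * 32, 4096), nn.ReLU(),
                          nn.Linear(4096, 4096), nn.ReLU(),
                          nn.Linear(4096, 10)).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(200):          # train until the random labels are fit
        correct = 0
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            out = model(x)
            loss_fn(out, y).backward()
            opt.step()
            correct += (out.argmax(1) == y).sum().item()
        print(epoch, "train accuracy on random labels:", correct / len(train))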

2. Empirical test of whether non-vacuous bounds are possible. Use sparsity (as in https://researchwith.njit.edu/en/publications/norm-based-generalization-bounds-for-sparse-neural-networks). Test on MNIST and CIFAR. Software may be available for MNIST.

3. Empirically compare generalization bounds on the same networks and problems: bounds based on norms versus bounds based on rank. Which one is tighter? The generalization bounds depending on rank are in https://cbmm.mit.edu/publications/generalization-bounds-neural-networks-low-rank-layers.
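A minimal sketch of the per-layer quantities that both kinds of bounds typically depend on, assuming PyTorch and an already trained model; the actual bound formulas come from the cited papers and are not reproduced here.

    import torch

    def layer_statistics(model, rank_tol=1e-3):
        """Per-layer norms and (numerical) ranks of the weight matrices."""
        stats = []
        for name, p in model.named_parameters():
            if p.dim() != 2:              # keep only linear-layer weight matrices
                continue
            W = p.detach()
            s = torch.linalg.svdvals(W)
            stats.append({
                "layer": name,
                "frobenius_norm": W.norm().item(),
                "spectral_norm": s[0].item(),
                "rank": int((s > rank_tol * s[0]).sum()),
                "stable_rank": (W.norm() ** 2 / s[0] ** 2).item(),
            })
        return stats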

 

Neural Collapse

 

1. Show empirically that to get NC you need regularization for the square loss but you do not need it for the exponential loss (see https://spj.science.org/doi/full/10.34133/research.0024). Software is available to test NC. A theoretical explanation should be easy.
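A minimal sketch of an NC1-style metric to track during training, assuming penultimate-layer features have already been extracted with PyTorch; the simple within/between variance ratio below is only a proxy for the full tr(\Sigma_W \Sigma_B^+) statistic used in the NC literature.

    import torch

    def nc1_ratio(features, labels):
        """Within-class scatter over between-class scatter; -> 0 under collapse."""
        global_mean = features.mean(0)
        within, between = 0.0, 0.0
        for c in labels.unique():
            fc = features[labels == c]
            mu_c = fc.mean(0)
            within += ((fc - mu_c) ** 2).sum()                  # within-class scatter
            between += len(fc) * ((mu_c - global_mean) ** 2).sum()
        return (within / between).item()

Tracking this ratio during training, with and without weight decay and for the square versus the exponential loss, is one way to run the comparison.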

2. Experiment or theory: GD cannot achieve intermediate Neural Collapse; SGD is necessary. Is regularization necessary with the square loss?

 

SGD

 

1. Write a conference paper formalizing the arguments in https://cbmm.mit.edu/sites/default/files/publications/CBMM-Memo-072.pdf, stating the results as theorems and updating the experiments.

2. Inherent SGD noise*: we know that SGD with weight decay introduces an inherent noise in the output of the network (see https://cbmm.mit.edu/sites/default/files/publications/The_Janus_effects_of_SGD_vs_GD__high_noise_and_low_rank_1.pdf). Can you characterize whether this noise -- asymptotically during training -- is close to being Gaussian? What is its variance? Theory or experiments.
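A minimal experimental sketch, assuming a model, data loader, loss function, optimizer (SGD with weight decay), and a fixed probe input are set up elsewhere; it records the fluctuation of one output coordinate late in training and applies a standard normality test from SciPy.

    import torch
    from scipy import stats

    @torch.no_grad()
    def probe_output(model, probe_x):
        return model(probe_x).flatten()[0].item()   # one scalar output coordinate

    def collect_output_trace(model, loader, opt, loss_fn, probe_x, steps=2000):
        trace, it = [], iter(loader)
        for _ in range(steps):
            try:
                x, y = next(it)
            except StopIteration:
                it = iter(loader)
                x, y = next(it)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
            trace.append(probe_output(model, probe_x))
        return torch.tensor(trace)

    # after training has roughly equilibrated (model, loader, opt, loss_fn,
    # probe_x are assumed to exist):
    # trace = collect_output_trace(model, loader, opt, loss_fn, probe_x)
    # print("variance:", trace.var().item())
    # print("normality test (D'Agostino-Pearson):", stats.normaltest(trace.numpy()))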

3. Consider the figure in https://cbmm.mit.edu/sites/default/files/publications/The_Janus_effects_of_SGD_vs_GD__high_noise_and_low_rank_1.pdf. Theory predicts that the noise in the gradient decreases with increasing batch size: is this correct? Do experiments for a fixed $\lambda$ and different batch sizes; measure the norm of the gradient and also the rank.
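A minimal sketch of the measurement, assuming hypothetical make_model() and make_loader(batch_size) helpers defined elsewhere; it records per-step gradient norms and the numerical rank of each weight matrix at the end of training, for a fixed weight decay $\lambda$.

    import torch, torch.nn as nn

    def train_and_measure(make_model, make_loader, batch_size, lam,
                          steps=5000, rank_tol=1e-3):
        # make_model / make_loader are placeholders for your own setup
        model = make_model()
        opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=lam)
        loss_fn = nn.CrossEntropyLoss()
        it, grad_norms = iter(make_loader(batch_size)), []
        for _ in range(steps):
            try:
                x, y = next(it)
            except StopIteration:
                it = iter(make_loader(batch_size))
                x, y = next(it)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            g = torch.cat([p.grad.flatten() for p in model.parameters()
                           if p.grad is not None])
            grad_norms.append(g.norm().item())
            opt.step()
        # numerical rank of each weight matrix at the end of training
        ranks = {}
        for n, p in model.named_parameters():
            if p.dim() == 2:
                s = torch.linalg.svdvals(p.detach())
                ranks[n] = int((s > rank_tol * s[0]).sum())
        tail = grad_norms[-500:]
        return sum(tail) / len(tail), ranks      # late-training gradient norm, ranks

    # for bs in [8, 32, 128, 512]:
    #     print(bs, train_and_measure(make_model, make_loader, bs, lam=1e-4))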

 

Approximation theory

 

1. Discuss the approximation properties of the Kolmogorov-Arnold (KA) representation in light of recent papers, and compare with standard MLPs. Look at this 2021 paper on the Kolmogorov-Arnold representation theorem https://www.dropbox.com/scl/fi/bms2njpjqmma5my4nf6hw/Kolmogorov.pdf?rlkey=hj9i793vm0tj0zqthnap7yca3&dl=0 and the last part of https://pinkus.net.technion.ac.il/files/2021/02/acta.pdf, possibly with some additional references. We can certainly prove that its approximation properties are not based at all on the Kolmogorov-Arnold theorem (which is a representation theorem, not an approximation one) and may even be rather poor.
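For the empirical side of the comparison, here is a minimal sketch of a KA-style architecture in PyTorch in which the inner and outer univariate functions are parameterized by small ReLU networks; this is an illustrative stand-in, not the exact architecture of any of the recent KAN papers.

    import torch, torch.nn as nn

    def univariate_net(hidden=32):
        return nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    class KAStyleNet(nn.Module):
        # f(x) ~ sum_q Phi_q( sum_p phi_{q,p}(x_p) ),  q = 0, ..., 2n
        def __init__(self, n_inputs):
            super().__init__()
            Q = 2 * n_inputs + 1
            self.inner = nn.ModuleList([nn.ModuleList([univariate_net()
                                        for _ in range(n_inputs)])
                                        for _ in range(Q)])
            self.outer = nn.ModuleList([univariate_net() for _ in range(Q)])

        def forward(self, x):                        # x: [batch, n_inputs]
            out = 0
            for phis, Phi in zip(self.inner, self.outer):
                s = sum(phi(x[:, p:p + 1]) for p, phi in enumerate(phis))
                out = out + Phi(s)
            return out

    # train KAStyleNet and a standard MLP of comparable parameter count on the
    # same smooth target and compare approximation error vs. number of parameters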

2. Consider the theorems about the equivalence of efficient Turing computability with sparse compositionality in https://www.ams.org/journals/bull/2024-61-03/S0273-0979-2024-01820-5/. The goal is to write formal proofs in full detail, possibly leading to a publication (see also https://cbmm.mit.edu/publications/how-deep-sparse-networks-avoid-curse-dimensionality-efficiently-computable-functions).

 

Others

 

1. Consider a deep network with multiplicative, additive, or no regularization: visualize the loss landscape on the same problem, e.g. binary CIFAR. Theory says the global minima are degenerate for NoReg and MultReg; a conjecture says the minima are isolated for AddReg.
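A minimal sketch of a 2-D loss-landscape slice along two random, norm-matched directions, assuming a model and a hypothetical loss_on_dataset(model) evaluation function defined elsewhere; repeat it with the NoReg, MultReg, and AddReg versions of the training objective.

    import torch

    def random_direction(model):
        # random direction, rescaled so each block has the same norm as the weights
        dirs = []
        for p in model.parameters():
            d = torch.randn_like(p)
            dirs.append(d * (p.norm() / (d.norm() + 1e-12)))
        return dirs

    @torch.no_grad()
    def landscape_slice(model, loss_on_dataset, span=1.0, n=21):
        base = [p.detach().clone() for p in model.parameters()]
        d1, d2 = random_direction(model), random_direction(model)
        alphas = torch.linspace(-span, span, n)
        grid = torch.zeros(n, n)
        for i, a in enumerate(alphas):
            for j, b in enumerate(alphas):
                for p, p0, u, v in zip(model.parameters(), base, d1, d2):
                    p.copy_(p0 + a * u + b * v)
                grid[i, j] = loss_on_dataset(model)   # include the regularizer here
        for p, p0 in zip(model.parameters(), base):   # restore original weights
            p.copy_(p0)
        return grid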

2. Experiments with diffusion models with text or -- better -- with code.

3. Read https://arxiv.org/pdf/2407.13841. I think this paper -- if correct -- may help to better understand the puzzle of adversarial examples.

4. Critically review https://arxiv.org/pdf/1803.00094 in the context of requiring a "wing" architecture.

5. Critically review -- in the context of double descent as claimed by Misha Belkin -- the paper https://neurips.cc/virtual/2023/poster/71830

6. Review https://arxiv.org/abs/2201.09418 and try to obtain the same results using the Frobenius norm instead of the spectral norm.

7. SGD vs. layerwise optimization. Compare a standard feedforward network with a polynomial activation to a network built layerwise: first perform linear regression on the input \( x \), then regress the residuals on a quadratic polynomial of \( x \), and continue by regressing the new residuals against a homogeneous polynomial of \( x \) of degree 3. The activation function for the standard feedforward network is \( x + x^2 \). The input \( x \) is a vector, augmented with an extra dimension set to 1; this augmentation is the standard way of implementing a bias in neural networks.
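A minimal sketch of the layerwise baseline, assuming least-squares fits with torch.linalg.lstsq and explicitly constructed monomial features; whether the degree-3 stage uses the raw or the augmented input is a modeling choice.

    import itertools, torch

    def augment(x):                                  # append the constant-1 feature
        return torch.cat([x, torch.ones(len(x), 1)], dim=1)

    def monomials(x, degree):
        # all monomials of exactly `degree` in the coordinates of x
        idx = itertools.combinations_with_replacement(range(x.shape[1]), degree)
        return torch.stack([x[:, list(c)].prod(dim=1) for c in idx], dim=1)

    def layerwise_fit(x, y):
        x_aug = augment(x)
        feats = [x_aug,                # stage 1: linear regression on (augmented) x
                 monomials(x_aug, 2),  # stage 2: quadratic polynomial of x
                 monomials(x, 3)]      # stage 3: homogeneous cubic in the raw x
        residual = y.reshape(-1, 1).float().clone()
        weights = []
        for F in feats:
            w = torch.linalg.lstsq(F, residual).solution
            residual = residual - F @ w
            weights.append(w)
        return weights, residual       # final residual norm = training error

    # compare ||residual|| with the training loss of a feedforward net using the
    # activation x + x^2, trained by SGD on the same (augmented) inputs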

8. Bounds for rank. Check whether this paper https://arxiv.org/pdf/2410.05754 may help in proving better bounds for the rank as a function of the minibatch size (see https://cbmm.mit.edu/publications/sgd-noise-and-implicit-low-rank-bias-deep-neural-networks).

9. Review https://arxiv.org/pdf/2410.01444 from the perspective of the sparse compositionality theorems.

10. Learning invariant representations. How do present supervised learning algorithms compare with brains? One of the most obvious differences is the ability of people and animals to learn from very few labeled examples: a child, or a monkey, can learn a recognition task from just a few examples. The main motivation here is the conjecture that a key to reducing the sample complexity of object recognition is invariance to transformations. Images of the same object usually differ from each other because of simple transformations such as translation and scale (distance), or more complex deformations such as viewpoint (rotation in depth) or a change in pose (of a body) or expression (of a face). The project consists of addressing any one of the following (a small simulation sketch follows the list):

  • Simulate learning of invariance to rotations and scaling
  • Simulate learning of non-commutative groups such as translation and scaling or rotation
  • Describe how data on simple and complex cells fit the theory
  • Predict complex cells invariant to scaling instead of translation
  • Spell out predictions for experiments to be done in developing animals
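For the first two options, a minimal simulation sketch in the spirit of a templates-and-pooling account of invariance, assuming PyTorch and torchvision; the invariance of the signature is only approximate because of image-boundary effects and the finite sampling of the group.

    import torch
    import torchvision.transforms.functional as TF

    def orbit(img, angles):
        # img: [C, H, W]; rotation orbit as a stack of transformed copies
        return torch.stack([TF.rotate(img, float(a)) for a in angles])

    def signature(img, templates, angles):
        # max-pool the dot products of the image with each template's orbit;
        # the result should be approximately invariant to rotations of img
        return torch.stack([(orbit(t, angles) * img).flatten(1).sum(1).max()
                            for t in templates])

    angles = torch.arange(0, 360, 15)
    templates = [torch.rand(1, 32, 32) for _ in range(16)]   # stored templates
    x = torch.rand(1, 32, 32)                                # test "image"
    s0 = signature(x, templates, angles)
    s1 = signature(TF.rotate(x, 45.0), templates, angles)
    print("signature drift under a 45-degree rotation:", (s0 - s1).norm().item())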

11. PDEs and PINNs. Many recent papers use deep nets to solve partial differential equations; look at PINN techniques. It is not clear what the foundations are in terms of approximation theory: for some PDEs the solutions are $C^\infty$, so smoothness does the trick; what about the other cases? Do we need some version of the compositional sparsity theorems?
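A minimal PINN sketch for a 1-D Poisson problem with a known smooth solution, assuming plain PyTorch autograd; the architecture and the weighting of the loss terms are illustrative.

    import math
    import torch, torch.nn as nn

    net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                        nn.Linear(64, 64), nn.Tanh(),
                        nn.Linear(64, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)

    def pde_residual(x):
        # residual of u''(x) + pi^2 sin(pi x) = 0
        x = x.requires_grad_(True)
        u = net(x)
        du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
        d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
        return d2u + math.pi ** 2 * torch.sin(math.pi * x)

    for step in range(5000):
        x_int = torch.rand(256, 1)                  # interior collocation points
        x_bc = torch.tensor([[0.0], [1.0]])         # boundary points, u = 0 there
        loss = pde_residual(x_int).pow(2).mean() + net(x_bc).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():                           # exact solution is sin(pi x)
        x_test = torch.linspace(0, 1, 101).unsqueeze(1)
        err = (net(x_test) - torch.sin(math.pi * x_test)).abs().max()
        print("max abs error:", err.item())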