Deep learning is a core machine learning technology that has driven the rapid improvement and broad adoption of artificial intelligence in recent years. It is based on learning with multilayer neural networks, and models with enormous numbers of parameters, most notably large language models (LLMs), have achieved remarkable performance. However, the theoretical foundations explaining why such large-scale models can be trained successfully and generalize well remain incomplete, and many researchers are actively working on this problem. In particular, there is a gap between the insight from classical information criteria such as AIC and MDL, which suggests that preventing overfitting requires selecting a model of an appropriate size, and the empirical success of modern deep learning. Bridging this gap is an important challenge.
In this talk, we tackle these theoretical challenges in deep learning from the viewpoint of the Minimum Description Length (MDL) principle. We first focus on a simple two-layer neural network and present how one can obtain performance guarantees for an MDL estimator by leveraging a distinctive eigenvalue structure of the Fisher information matrix that we have recently identified. We then discuss prospects for extending this approach to more complex deep neural networks.
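For orientation, the MDL principle in its standard two-part form (given here only as general background; the specific estimator and guarantees presented in the talk may differ) selects the model that minimizes the total description length of the model plus the data encoded with it:

\[
  \hat{p} \;=\; \arg\min_{p \in \mathcal{M}} \Big\{ L(p) \;-\; \log p(x^n) \Big\},
\]

where L(p) is the code length needed to describe the model p and -log p(x^n) is the code length of the data x^n under p. Penalizing L(p) is what classically discourages overly large models, which is the source of the apparent tension with large-scale deep learning discussed above.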
(The previous talk on 28 January will provide useful background, but this talk will be self-contained.)