**Department of Systems Engineering and Operations Research**

**George Mason University**

**Spring 2022**

Course Material

**Instructor**: Vadim Sokolov (vsokolov(at)gmu.edu)

**Location and time**: Aquia, room 347; 7:20-10pm Mondays

**Office hours**: By appointment

Basics (Weeks 1-2)

Linear Algebra: Tai-Danae Bradley intro

Probability: OpenIntro Ch 3

Generalized Linear Models: OpenIntro Ch 8,9

PyTorch: PyTorch Basics

Feed Forward Architectures: WHAT IS TORCH.NN REALLY?; Ripley Ch 5; Bishop Ch 3,4

Convex Optimization (Weeks 3-4)

Backpropagation and matrix derivatives

Stochastic gradient descent and its variants (Adam, RMSprop, Nesterov acceleration): Bishop Ch 7, Goodfellow Ch 8

Second order methods: Bishop Ch 7

ADMM

Regularization (l1, l2 and dropout): dropout paper, Goodfellow Ch 7

Batch normalization: paper

Conv Nets and Image Processing (Week 5): Goodfellow Ch 9

Recurrent Nets and Sequential Data (Week 6): Goodfellow Ch 10, seq2seq tutorial, PyTorch seq2seq

Theory of deep learning (Week 7): see theory section for the reading list

Universal approximators

Curse of dimensionality

Kernel spaces

Topology and geometry

Probabilistic DL (Weeks 8-9)

Conjugate distributions, exponential family: Bishop Ch 2

Model choice

Hierarchical linear and generalized linear models (regression and classification): Bishop Ch 10

Models for missing data (EM-algorithm)

Bayes computations (MCMC, Variational Bayes)

Additional Topics (Weeks 10-13)

Model Visualization: TensorBoard

Generative Models (normalizing flows, GANs, recurrent nets): NF Paper; DCGAN; NF Tutorial

Attention and Transformers: attention paper

Deep Reinforcement Learning: DRL Tutorial

Bayesian Optimisation (hyperparameter selection and parameter initialization): Hyperopt

You will work in a team of up to 3 people on a Kaggle-like project, applying deep learning to solve a prediction or data generation problem. By week 8 of the class you should have formed a team and identified a data set and an analysis problem. You need to submit a 0.5-1 page description of the data and the problem you are trying to solve for my feedback and approval. The proposal must include the names and emails of the team members, a description of the data set, the problem to be solved, and the proposed architectures. You will post the results of your analysis as a post on the class blog. The final project will be graded on presentation, writing and analysis.

Both projects and homework can be done in groups of up to 3 people. You can change groups between assignments. If you do a homework in a group, each member does it individually but may consult with the others. You can also make one submission per group if you prefer. You can use the “group” section of the Piazza page to find teammates. If you need help finding a group, please email me.

Each homework is worth 10 points; the project is worth 30.

This is a graduate-level course focused on developing deep learning predictive models. We will learn both practical and theoretical aspects of deep learning and consider applications in engineering, finance and artificial intelligence. The course is targeted at students who have completed introductory courses in statistics and optimization. We will make extensive use of computational tools, such as the Python language, both for illustration in class and in homework problems. The class will consist of 9 lectures given by the instructor on several advanced topics in deep learning. In another 5 lectures, students will present on a given topic.

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. Cambridge: MIT Press, 2016.

Ripley, Brian D. Pattern recognition and neural networks. Cambridge University Press, 2007.

Bishop, Christopher M. Neural networks for pattern recognition. Oxford University Press, 1995.

Tuning CNN architecture (blog)

Sequence to Sequence Learning with Neural Networks (paper)

Skip RNN (blog and paper)

Learning the Enigma with Recurrent Neural Networks (blog)

LSTM blog

Generative Adversarial Networks (presentation)

GANs at OpenAI (blog)

Adaptive Neural Trees (paper)

Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex

An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

An intriguing failing of convolutional neural networks and the CoordConv solution

HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent (paper)

SGD (link)

Bridging the Gap between Stochastic Gradient MCMC and Stochastic Optimization

Neural Architecture Search with Reinforcement Learning (code)

Regularized Evolution for Image Classifier Architecture Search

On the importance of initialization and momentum in deep learning

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs

The Marginal Value of Adaptive Gradient Methods in Machine Learning

Polyak, Boris, and Pavel Shcherbakov. “Why does Monte Carlo fail to work properly in high-dimensional optimization problems?.” Journal of Optimization Theory and Applications 173, no. 2 (2017): 612-627. (paper)

Leni, Pierre-Emmanuel, Yohan D. Fougerolle, and Frédéric Truchetet. “Kolmogorov superposition theorem and its application to multivariate function decompositions and image representation.” In Signal Image Technology and Internet Based Systems, 2008. SITIS’08. IEEE International Conference on, pp. 344-351. IEEE, 2008. (paper)

Klartag, Bo'az. “A central limit theorem for convex sets.” Inventiones mathematicae 168, no. 1 (2007): 91-131. (paper, slides)

Sun, Chen, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. “Revisiting unreasonable effectiveness of data in deep learning era.” In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 843-852. IEEE, 2017. (blog)

Bengio, Yoshua, Aaron Courville, and Pascal Vincent. “Representation learning: A review and new perspectives.” IEEE transactions on pattern analysis and machine intelligence 35, no. 8 (2013): 1798-1828. (paper)

Braun, Jürgen. “An application of Kolmogorov's superposition theorem to function reconstruction in higher dimensions.” (2009). (dissertation)

Kolmogorov. “On the Representation of Continuous Functions of Several Variables as Superpositions of Continuous Functions of a Smaller Number of Variables” (paper)

Arnold. “On functions of three variables” (collection of papers)

Bianchini, Monica, and Franco Scarselli. “On the complexity of shallow and deep neural network classifiers.” In ESANN. 2014. (paper)

Girosi, Federico, and Tomaso Poggio. “Representation properties of networks: Kolmogorov's theorem is irrelevant.” Neural Computation 1, no. 4 (1989): 465-469. (paper)

Kůrková, Věra. “Kolmogorov's theorem and multilayer neural networks.” Neural networks 5, no. 3 (1992): 501-506. (paper)

Poggio, Tomaso, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. “Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review.” International Journal of Automation and Computing 14, no. 5 (2017): 503-519. (paper)

Telgarsky, Matus. “Representation benefits of deep feedforward networks.” arXiv preprint arXiv:1509.08101 (2015). (paper)

Montufar, Guido F., Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. “On the number of linear regions of deep neural networks.” In Advances in neural information processing systems, pp. 2924-2932. 2014. (paper)

Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. “Understanding deep learning requires rethinking generalization.” arXiv preprint arXiv:1611.03530 (2016). (paper)

Lin, Henry W., Max Tegmark, and David Rolnick. “Why does deep and cheap learning work so well?.” Journal of Statistical Physics 168, no. 6 (2017): 1223-1247. (paper)

Stéphane Mallat 1: Mathematical Mysteries of Deep Neural Networks (video)

Theory of Deep Learning II: Landscape of the Empirical Risk in Deep Learning

Model-Ensemble Trust-Region Policy Optimization

Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

VAE with a VampPrior (paper)

Bayesian DL (blog)

Recognition Networks for Approximate Inference in BN20 Networks (paper)

Non-linear regression models for Approximate Bayesian Computation (paper)

DR-ABC: Approximate Bayesian Computation with Kernel-Based Distribution Regression (paper)

Fast ε-free Inference of Simulation Models with Bayesian Conditional Density Estimation (paper)

Auto-Encoding Variational Bayes (paper)

Composing graphical models with neural networks for structured representations and fast inference (paper)

Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks (paper)

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (paper)

Twin Networks: Using the Future as a Regularizer (paper)

Don't Decay the Learning Rate, Increase the Batch Size (paper)

DL Tuning (blog)

Efficient Processing of Deep Neural Networks: A Tutorial and Survey

50 Years of Data Science by Donoho (paper)

Papers with code (link)

Security (blog)

Unsupervised learning (blog)

Cybersecurity (paper collection)

Stanford's CS231n (course page)

Stanford's STATS385 (course page)

UC Berkeley Stat241B (lectures)

UIUC CSE598 (course page)

TF Playground (Google)

SnakeViz (python profiler)

PyTorch resources (a curated list of tutorials, papers, projects)