# SYST/OR 610. Deep Learning

**Department of Systems Engineering and Operations Research**

**George Mason University**

**Spring 2022**

**Instructor**: Vadim Sokolov

**Location and time**: Aquia, room 347; 7:20-10pm Mondays

**Office hours**: By appointment

##### Datacamp

If you are rusty on Python, I suggest you refresh your skills using Datacamp. Datacamp has given students in this class free access to all of its courses. If you follow the link above, you can get your free access using your masonlive email. I have also listed some of the Python courses I suggest there.

#### List of topics and tentative schedule

Basics (Weeks 1-2)

- Linear Algebra: intro
- Probability: OpenIntro Ch 3
- Generalized Linear Models: OpenIntro Ch 8,9
- PyTorch: PyTorch Basics
- Feed Forward Architectures: WHAT IS TORCH.NN REALLY?; Ripley Ch 5; Bishop Ch 3,4

Convex Optimization (Weeks 3-4)

- Backpropagation and matrix derivatives
- Stochastic gradient descent and its variants (Adam, RMSprop, Nesterov acceleration): Bishop Ch 7, Goodfellow Ch 8
- Second order methods: Bishop Ch 7
- ADMM
- Regularization (l1, l2 and dropout): dropout paper, Goodfellow Ch 7
- Batch normalization: paper

Conv Nets and Image Processing (Week 5): Goodfellow Ch 9

Recurrent Nets and Sequential Data (Week 6): Goodfellow Ch 10, seq2seq, PyTorch seq2seq tutorial

Theory of deep learning (Week 7): see theory section for the reading list

- Universal approximators
- Curse of dimensionality
- Kernel spaces
- Topology and geometry

Probabilistic DL (Weeks 8-9): Langevin, MCMC, VB

- Conjugate distributions, exponential family: Bishop Ch 2
- Model choice
- Hierarchical linear and generalized linear models (regression and classification): Bishop Ch 10
- Models for missing data (EM-algorithm)
- Bayes computations (MCMC, Variational Bayes)

Additional Topics (Weeks 10-13)

- Model Visualization: Tensorboard
- Generative Models (normalizing flows, GANs, recurrent nets): NF Paper Tutorial; NF
- Attention and Transformers: attention paper
- Deep Reinforcement Learning: DRL Tutorial
- Bayesian Optimisation (hyperparameter selection and parameter initialization): Hyperopt

#### Data analysis projects

You will work in a team of up to 3 people on a Kaggle-like project and will apply deep learning to solve a prediction or data generation problem. By week 8 of the class you should have formed a team and identified a data set and an analysis problem. You need to submit a 0.5-1 page description of the data and the problem you are trying to solve for my feedback and approval. The proposal must include the names and emails of the team members, a description of the data set, the problem to be solved, and the proposed architectures.

You will post the results of your analysis as a post on the class blog. The final project will be graded on presentation, writing and analysis.

#### Group Work

Both projects and homework can be done in groups of up to 3 people. You can change groups between assignments. If you do a homework in a group, each member of the group still does it individually but can consult with the others. You can also make 1 submission per group if you prefer. You can use the “group” section of the Piazza page to find teammates. If you need help finding a group, please email me.

#### Grading

Each homework is worth 10 points; the project is worth 30.

This is a graduate-level course focused on developing deep learning predictive models. We will learn both practical and theoretical aspects of deep learning. We will consider applications in engineering, finance and artificial intelligence. It is targeted at students who have completed introductory courses in statistics and optimization. We will make extensive use of computational tools, such as the Python language, both for illustration in class and in homework problems. The class will consist of 9 lectures given by the instructor on several advanced topics in deep learning. In another 5 lectures, students will present on a given topic.

#### Books

- Polson, Sokolov notes
- Dive into Deep Learning (link)
- Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. Vol. 1. Cambridge: MIT press, 2016.
- Ripley, Brian D. Pattern recognition and neural networks. Cambridge university press, 2007.
- Bishop, Christopher M. Neural networks for pattern recognition. Oxford university press, 1995.

#### Per Topic Resources

###### Architectures

- Tuning CNN architecture (blog)
- Sequence to Sequence Learning with Neural Networks (paper)
- Skip RNN (paper)
- Learning the Enigma with Recurrent Neural Networks (blog)
- LSTM blog
- Generative Adversarial Networks (presentation)
- GANs at OpenAI (blog)
- Adaptive Neural Trees (paper)
- Cortex
- Recognition
- Networks
- Modeling
- solution
- Need
- Networks
- Autoencoders
- WaveNet
- PixelCNN
- https://chrisorm.github.io/NGP.html

###### Optimization

###### Theory

- Polyak, Boris, and Pavel Shcherbakov. “Why does Monte Carlo fail to work properly in high-dimensional optimization problems?.” Journal of Optimization Theory and Applications 173, no. 2 (2017): 612-627. (paper)
- Leni, Pierre-Emmanuel, Yohan D. Fougerolle, and Frédéric Truchetet. “Kolmogorov superposition theorem and its application to multivariate function decompositions and image representation.” In Signal Image Technology and Internet Based Systems, 2008. SITIS’08. IEEE International Conference on, pp. 344-351. IEEE, 2008. (paper)
- Klartag, Bo’az. “A central limit theorem for convex sets.” Inventiones mathematicae 168, no. 1 (2007): 91-131. (slides)
- Sun, Chen, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. “Revisiting unreasonable effectiveness of data in deep learning era.” In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 843-852. IEEE, 2017. (blog)
- Bengio, Yoshua, Aaron Courville, and Pascal Vincent. “Representation learning: A review and new perspectives.” IEEE transactions on pattern analysis and machine intelligence 35, no. 8 (2013): 1798-1828. (paper)
- Braun, Jürgen. “An application of Kolmogorov’s superposition theorem to function reconstruction in higher dimensions.” (2009). (dissertation)
- Kolmogorov. “On the Representation of Continuous Functions of Several Variables as Superpositions of Continuous Functions of a Smaller Number of Variables” (paper)
- Arnold. “On functions of three variables” (papers)
- Bianchini, Monica, and Franco Scarselli. “On the complexity of shallow and deep neural network classifiers.” In ESANN. 2014. (paper)
- Girosi, Federico, and Tomaso Poggio. “Representation properties of networks: Kolmogorov’s theorem is irrelevant.” Neural Computation 1, no. 4 (1989): 465-469. (paper)
- Kůrková, Věra. “Kolmogorov’s theorem and multilayer neural networks.” Neural networks 5, no. 3 (1992): 501-506. (paper)
- Poggio, Tomaso, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. “Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review.” International Journal of Automation and Computing 14, no. 5 (2017): 503-519. (paper)
- Telgarsky, Matus. “Representation benefits of deep feedforward networks.” arXiv preprint arXiv:1509.08101 (2015). (paper)
- Montufar, Guido F., Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. “On the number of linear regions of deep neural networks.” In Advances in neural information processing systems, pp. 2924-2932. 2014. (paper)
- Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. “Understanding deep learning requires rethinking generalization.” arXiv preprint arXiv:1611.03530 (2016). (paper)
- Lin, Henry W., Max Tegmark, and David Rolnick. “Why does deep and cheap learning work so well?.” Journal of Statistical Physics 168, no. 6 (2017): 1223-1247. (paper)
- Stéphane Mallat 1: Mathematical Mysteries of Deep Neural Networks (video)
- Learning
- addition
- Networks
- Networks

###### Reinforcement Learning

- Optimization Models
- Truth
- Yet

###### Bayesian DL

- VAE with a VampPrior (paper)
- Bayesian DL (blog)
- Recognition Networks for Approximate Inference in BN20 Networks (paper)
- Non-linear regression models for Approximate Bayesian Computation (paper)
- DR-ABC: Approximate Bayesian Computation with Kernel-Based Distribution Regression (paper)
- Fast ε-free Inference of Simulation Models with Bayesian Conditional Density Estimation (paper)
- Auto-Encoding Variational Bayes (paper)
- Composing graphical models with neural networks for structured representations and fast inference (paper)
- Inference

###### Practical Tricks

- Averaging
- Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks (paper)
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (paper)
- Auto-Encoding Variational Bayes (paper)
- Twin Networks: Using the Future as a Regularizer (paper)
- Don’t Decay the Learning Rate, Increase the Batch Size (paper)
- DL Tuning (blog)
- Survey

#### Other Resources

###### Additional Reading List

###### Blogs

###### Videos

###### Other courses with good web presence

###### Tools

###### Misc Links

- Pytorch resources (a curated list of tutorials, papers)
- Is artificial intelligence set to become art’s next medium?