SYST/OR 610. Deep Learning

Department of Systems Engineering and Operations Research
George Mason University
Spring 2022

Instructor: Vadim Sokolov
Location and time: Aquia, room 347; 7:20-10pm Mondays
Office hours: By appointment

Datacamp

If you are rusty on Python, I suggest you refresh your skills using Datacamp. Datacamp gave students in this class free access to all of its courses. If you follow the link above, you can get your free access using your masonlive email. I also listed some of the Python courses I suggest there.

List of topics and tentative schedule

Basics (Weeks 1-2)
- Linear Algebra: intro
- Probability: OpenIntro Ch 3
- Generalized Linear Models: OpenIntro Ch 8,9
- PyTorch: PyTorch Basics
- Feed Forward Architectures: WHAT IS TORCH.NN REALLY?; Ripley Ch 5; Bishop Ch 3,4
Convex Optimization (Weeks 3-4)
- Backpropagation and matrix derivatives
- Stochastic gradient descent and its variants (Adam, RMSprop, Nesterov acceleration): Bishop Ch 7, Goodfellow Ch 8
- Second order methods: Bishop Ch 7
- ADMM
- Regularization (l1, l2 and dropout): dropout paper, Goodfellow Ch 7
- Batch normalization: paper
Conv Nets and Image Processing (Week 5): Goodfellow Ch 9
Recurrent Nets and Sequential Data (Week 6): Goodfellow Ch 10, seq2seq, PyTorch seq2seq tutorial
Theory of deep learning (Week 7): see theory section for the reading list
- Universal approximators
- Curse of dimensionality
- Kernel spaces
- Topology and geometry
Probabilistic DL (Weeks 8-9): Langevin, MCMC, VB
- Conjugate distributions, exponential family: Bishop Ch 2
- Model choice
- Hierarchical linear and generalized linear models (regression and classification): Bishop Ch 10
- Models for missing data (EM-algorithm)
- Bayes computations (MCMC, Variational Bayes)
Additional Topics (Weeks 10-13)
- Model Visualization: Tensorboard
- Generative Models (normalizing flows, GANs, recurrent nets): NF Paper Tutorial; NF
- Attention and Transformers: attention paper
- Deep Reinforcement Learning: DRL Tutorial
- Bayesian Optimisation (hyperparameter selection and parameter initialization): Hyperopt

Data analysis projects

You will work in a team of up to 3 people on a Kaggle-like project and will apply deep learning to solve a prediction or data generation problem. By week 8 of the class you should have formed a team and identified a data set and an analysis problem. You need to submit a 0.5-1 page description of the data and the problem you are trying to solve for my feedback and approval. The proposal has to include the names and emails of the team members, a description of the data set, the problem to be solved, and the proposed architectures.

You will post the results of your analysis as a post on the class blog. The final project will be graded on presentation, writing and analysis.

Group Work

Both projects and homework can be done in groups of up to 3 people. You can change groups in between assignments. If you do a homework in a group, it means that all of the members of the group do it individually and can consult with each other. You can also make 1 submission per group if you prefer. You can use the “group” section of the Piazza page to find teammates. If you need help finding a group, please email me.

Grading

Each homework is worth 10 points; the project is worth 30.

This is a graduate-level course focused on developing deep learning predictive models. We will learn both practical and theoretical aspects of deep learning. We will consider applications in engineering, finance and artificial intelligence. It is targeted towards students who have completed introductory courses in statistics and optimization. We will make extensive use of computational tools, such as the Python language, both for illustration in class and in homework problems. The class will consist of 9 lectures given by the instructor on several advanced topics in deep learning. In another 5 lectures, students will present on a given topic.

Books

Polson, Sokolov notes
Dive into Deep Learning link
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. Cambridge: MIT Press, 2016.
Ripley, Brian D. Pattern Recognition and Neural Networks. Cambridge University Press, 2007.
Bishop, Christopher M. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

Per Topic Resources

Architectures

Tuning CNN architecture (blog)
Sequence to Sequence Learning with Neural Networks (paper)
Skip RNN (paper)
Learning the Enigma with Recurrent Neural Networks (blog)
LSTM blog
Generative Adversarial Networks (presentation)
GANs at OpenAI (blog)
Adaptive Neural Trees (paper)
Cortex
Recognition
Networks
Modeling
solution
Need
Networks
Autoencoders
WaveNet
PixelCNN
https://chrisorm.github.io/NGP.html

Optimization

Book
(1970)
Lecture
(1983)
(1964)
Learning
(2004)
HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent (paper)
SGD (link)
Sampling
Dynamics
Optimization
(code)
(code)
Search
learning
Minima
Works
Minima
Nets
DNNs
Learning
Acceleration

Theory

Polyak, Boris, and Pavel Shcherbakov. “Why does Monte Carlo fail to work properly in high-dimensional optimization problems?.” Journal of Optimization Theory and Applications 173, no. 2 (2017): 612-627. (paper)
Leni, Pierre-Emmanuel, Yohan D. Fougerolle, and Frédéric Truchetet. “Kolmogorov superposition theorem and its application to multivariate function decompositions and image representation.” In Signal Image Technology and Internet Based Systems, 2008. SITIS’08. IEEE International Conference on, pp. 344-351. IEEE, 2008. (paper)
Klartag, Bo’az. “A central limit theorem for convex sets.” Inventiones mathematicae 168, no. 1 (2007): 91-131. (slides)
Sun, Chen, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. “Revisiting unreasonable effectiveness of data in deep learning era.” In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 843-852. IEEE, 2017. (blog)
Bengio, Yoshua, Aaron Courville, and Pascal Vincent. “Representation learning: A review and new perspectives.” IEEE transactions on pattern analysis and machine intelligence 35, no. 8 (2013): 1798-1828. (paper)
Braun, Jürgen. “An application of Kolmogorov’s superposition theorem to function reconstruction in higher dimensions.” (2009). (dissertation)
Kolmogorov. “On the Representation of Continuous Functions of Several Variables as Superpositions of Continuous Functions of a Smaller Number of Variables” (paper)
Arnold. “On functions of three variables” (papers)
Bianchini, Monica, and Franco Scarselli. “On the complexity of shallow and deep neural network classifiers.” In ESANN. 2014. (paper)
Girosi, Federico, and Tomaso Poggio. “Representation properties of networks: Kolmogorov’s theorem is irrelevant.” Neural Computation 1, no. 4 (1989): 465-469. (paper)
Kůrková, Věra. “Kolmogorov’s theorem and multilayer neural networks.” Neural networks 5, no. 3 (1992): 501-506. (paper)
Poggio, Tomaso, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. “Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review.” International Journal of Automation and Computing 14, no. 5 (2017): 503-519. (paper)
Telgarsky, Matus. “Representation benefits of deep feedforward networks.” arXiv preprint arXiv:1509.08101 (2015). (paper)
Montufar, Guido F., Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. “On the number of linear regions of deep neural networks.” In Advances in neural information processing systems, pp. 2924-2932. 2014. (paper)
Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. “Understanding deep learning requires rethinking generalization.” arXiv preprint arXiv:1611.03530 (2016). (paper)
Lin, Henry W., Max Tegmark, and David Rolnick. “Why does deep and cheap learning work so well?.” Journal of Statistical Physics 168, no. 6 (2017): 1223-1247. (paper)
Stéphane Mallat 1: Mathematical Mysteries of Deep Neural Networks (video)
Learning
addition
Networks
Networks

Reinforcement Learning

Optimization Models - Truth - Yet

Bayesian DL

VAE with a VampPrior (paper)
Bayesian DL (blog)
Recognition Networks for Approximate Inference in BN20 Networks (paper)
Non-linear regression models for Approximate Bayesian Computation (paper)
DR-ABC: Approximate Bayesian Computation with Kernel-Based Distribution Regression (paper)
Fast ε-free Inference of Simulation Models with Bayesian Conditional Density Estimation (paper)
Auto-Encoding Variational Bayes (paper)
Composing graphical models with neural networks for structured representations and fast inference (paper)
Inference

Practical Tricks

Averaging
Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks (paper)
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (paper)
Auto-Encoding Variational Bayes (paper)
Twin Networks: Using the Future as a Regularizer (paper)
Don’t Decay the Learning Rate, Increase the Batch Size (paper)
DL Tuning (blog)
Survey

