**Department of Systems Engineering and Operations Research**

**George Mason University**

**Fall 2019**

Course Material

**Instructor**: Vadim Sokolov (vsokolov(at)gmu.edu)

**Office hours**: By appointment

**TA**: Wanru Li (wli15(at)masonlive.gmu.edu)

**Office Hours**: Mon 4-6 pm at ENGR 2216

Presentation/Project proposals due Oct 16

9/3/2019: Room changed to EXPL L111

3/15/2019: No Class on October 23 (INFORMS Meeting)

3/15/2019: No Class on November 27 (Thanksgiving recess)

3/15/2019: First class is on Aug 28 at 7:20pm

3/15/2019: Last class is on Dec 4

DL Overview + Python (numpy and PyTorch) (Week 1): Notes Ch 1; Polson18

Probability (Week 2): Notes Ch 2-3; Domingos; optional DLB Ch 3

Optimization: SGD, Backprop (Week 3): Notes Ch 4; Baydin17

Architectures (Week 4): Notes Ch 5; Oord16

More optimization (Week 5): Glorot10

DL Theory (Week 8)

Research paper presentations (Week 9-12)

Project Presentations (Week 13)

This is a graduate-level course focused on developing deep learning predictive models. We will cover both practical and theoretical aspects of deep learning, with applications in engineering, finance, and artificial intelligence. The course is aimed at students who have completed introductory courses in statistics and optimization. We will make extensive use of computational tools, such as the Python language, both for illustration in class and in homework problems. The class will consist of 9 lectures given by the instructor on several advanced topics in deep learning. In another 5 class meetings, students will present on a given topic.

The lectures and homework for 750 and 610 are the same. The difference is in the final project. If you are registered for 750, you will read and present research papers; if you are registered for 610, you will do a Kaggle-type project. Research paper presentations are done in groups of up to 5, and projects in groups of up to 3.

During weeks 9-12, this class will be run in seminar mode. One team of students will prepare a topic and lead the discussion; another team will write a blog post about the class and publish it on Medium. The students responsible for the blog summary will be different from those leading the topic discussion, but should work closely with the leaders on the posted write-up.

Two weeks before the scheduled class, meet briefly with me to discuss the plan for the class. You should decide on a team leader for this class, who will be responsible for coordinating the team's efforts and making sure everyone on the team knows what they are doing.

On the Monday of the week of the class, at least a few representatives from the team should come to my office to discuss the plan for the class. Come prepared to this meeting with suggested papers and ideas about how to present them.

On Tuesday before class, send me the preparation materials for the class. This can include links to papers to read, but could also include exercises to do or software to install and experiment with, etc. I will post it on the course page.

Day of class: lead an interesting, engaging, and illuminating class! This is a 2.5-hour class, so it can’t just be a series of unconnected, dull presentations. You need to think of things to do in class to make it more worthwhile and engaging.

After class: help the Blogging team by providing them with your materials, answering their questions, and reviewing their write-up.

The week before the scheduled class, develop a team plan for how to manage the blogging.

One team member should be designated the team leader for the blogging. The blogging leader is responsible for making sure the team is well coordinated and everyone knows what they are doing and follows through on this.

During class, participate actively in the class, and take detailed notes (this can be distributed among the team).

By the Wednesday following class, have the blog post ready and posted on Medium. Get comments from the rest of the class (including the leading team and coordinators).

By the next Friday (one week after the class), have a final version of the blog post ready.

Adversarial attacks

DL in reinforcement learning

Interpretable DL

Science applications of DL (physics, molecular biology,…)

Engineering applications of DL (logistics, energy, smart grids, congestion management,…)

Natural language processing

You will work in a team of up to 3 people on a Kaggle-like project and will apply deep learning to solve a prediction or data generation problem. By week 9 of the class you should have formed a team and identified a data set and an analysis problem. You need to email me a 0.5-1 page description of the data and the problem you are trying to solve for my feedback and approval. During week 13, you will have a time slot to present your findings. You are also encouraged (although not required) to post the results of your analysis on Medium, if you think they are worth sharing.

**Lectures**: Exploratory Hall L111. 7:20-10pm on Wed

**Grades**: 40% homework, 60% class presentations

Convex Optimization

Stochastic gradient descent and its variants (Adam, RMSprop, Nesterov acceleration)

Second order methods

ADMM

Regularization (l1, l2 and dropout)

Batch normalization

Theory of deep learning

Universal approximators

Curse of dimensionality

Kernel spaces

Topology and geometry

Computational aspects (accelerated linear algebra, reduced precision calculations, parallelism)

Architectures (CNN, LSTM, MLP, VAE)

Bayesian DL

Deep reinforcement learning

Hyperparameter selection and parameter initialization

Generative models (GANs)
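To give a flavor of the optimization topics listed above, here is a minimal sketch of stochastic gradient descent in plain Python, fitting a line to noiseless data. It is an illustration only and not part of the course notes; the function name `sgd_fit`, its parameters, and the toy data are assumptions made for this example.

```python
import random


def sgd_fit(data, lr=0.1, momentum=0.0, epochs=200, seed=0):
    """Fit y = w * x + b by SGD on the per-sample loss 0.5 * (w*x + b - y)**2.

    `momentum` (hypothetical parameter name) adds a heavy-ball velocity term;
    the default 0.0 gives plain SGD.
    """
    rng = random.Random(seed)
    samples = list(data)
    w, b = 0.0, 0.0      # parameters
    vw, vb = 0.0, 0.0    # velocities used by the momentum term
    for _ in range(epochs):
        rng.shuffle(samples)          # visit samples in random order each epoch
        for x, y in samples:
            err = (w * x + b) - y     # residual for this single sample
            gw, gb = err * x, err     # gradient of the per-sample loss
            vw = momentum * vw - lr * gw
            vb = momentum * vb - lr * gb
            w += vw
            b += vb
    return w, b


# Toy data generated from y = 2x + 1 without noise.
xs = [i / 10 for i in range(-10, 11)]
data = [(x, 2.0 * x + 1.0) for x in xs]
w, b = sgd_fit(data)
print(w, b)
```

Because the toy data are noiseless, every per-sample gradient vanishes at the true parameters, so plain SGD with a small constant step recovers them. With noisy data a decaying step size or one of the adaptive variants listed above would normally be used.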

Deep Learning (DLB) (book page)

Deep Learning with Python (DLPB) (book page)

Learning Deep Architectures for AI (monograph)

Tuning CNN architecture (blog)

Sequence to Sequence Learning with Neural Networks (paper)

Skip RNN (blog and paper)

Learning the Enigma with Recurrent Neural Networks (blog)

LSTM blog

Generative Adversarial Networks (presentation)

GANs at OpenAI (blog)

Adaptive Neural Trees (paper)

Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex

An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

An intriguing failing of convolutional neural networks and the CoordConv solution

HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent (paper)

SGD (link)

Bridging the Gap between Stochastic Gradient MCMC and Stochastic Optimization

Neural Architecture Search with Reinforcement Learning (code)

Regularized Evolution for Image Classifier Architecture Search

On the importance of initialization and momentum in deep learning

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs

The Marginal Value of Adaptive Gradient Methods in Machine Learning

Polyak, Boris, and Pavel Shcherbakov. “Why does Monte Carlo fail to work properly in high-dimensional optimization problems?.” Journal of Optimization Theory and Applications 173, no. 2 (2017): 612-627. (paper)

Leni, Pierre-Emmanuel, Yohan D. Fougerolle, and Frédéric Truchetet. “Kolmogorov superposition theorem and its application to multivariate function decompositions and image representation.” In Signal Image Technology and Internet Based Systems, 2008. SITIS’08. IEEE International Conference on, pp. 344-351. IEEE, 2008. (paper)

Klartag, Bo'az. “A central limit theorem for convex sets.” Inventiones mathematicae 168, no. 1 (2007): 91-131. (paper, slides)

Sun, Chen, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. “Revisiting unreasonable effectiveness of data in deep learning era.” In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 843-852. IEEE, 2017. (blog)

Bengio, Yoshua, Aaron Courville, and Pascal Vincent. “Representation learning: A review and new perspectives.” IEEE transactions on pattern analysis and machine intelligence 35, no. 8 (2013): 1798-1828. (paper)

Braun, Jürgen. “An application of Kolmogorov's superposition theorem to function reconstruction in higher dimensions.” (2009). (dissertation)

Kolmogorov. “On the Representation of Continuous Functions of Several Variables as Superpositions of Continuous Functions of a Smaller Number of Variables” (paper)

Arnold. “On functions of three variables” (collection of papers)

Bianchini, Monica, and Franco Scarselli. “On the complexity of shallow and deep neural network classifiers.” In ESANN. 2014. (paper)

Girosi, Federico, and Tomaso Poggio. “Representation properties of networks: Kolmogorov's theorem is irrelevant.” Neural Computation 1, no. 4 (1989): 465-469. (paper)

Kůrková, Věra. “Kolmogorov's theorem and multilayer neural networks.” Neural networks 5, no. 3 (1992): 501-506. (paper)

Poggio, Tomaso, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. “Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review.” International Journal of Automation and Computing 14, no. 5 (2017): 503-519. (paper)

Telgarsky, Matus. “Representation benefits of deep feedforward networks.” arXiv preprint arXiv:1509.08101 (2015). (paper)

Montufar, Guido F., Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. “On the number of linear regions of deep neural networks.” In Advances in neural information processing systems, pp. 2924-2932. 2014. (paper)

Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. “Understanding deep learning requires rethinking generalization.” arXiv preprint arXiv:1611.03530 (2016). (paper)

Lin, Henry W., Max Tegmark, and David Rolnick. “Why does deep and cheap learning work so well?.” Journal of Statistical Physics 168, no. 6 (2017): 1223-1247. (paper)

Stéphane Mallat 1: Mathematical Mysteries of Deep Neural Networks (video)

Theory of Deep Learning II: Landscape of the Empirical Risk in Deep Learning

Model-Ensemble Trust-Region Policy Optimization

Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models

VAE with a VampPrior (paper)

Bayesian DL (blog)

Recognition Networks for Approximate Inference in BN20 Networks (paper)

Non-linear regression models for Approximate Bayesian Computation (paper)

DR-ABC: Approximate Bayesian Computation with Kernel-Based Distribution Regression (paper)

Fast ε-free Inference of Simulation Models with Bayesian Conditional Density Estimation (paper)

Auto-Encoding Variational Bayes (paper)

Composing graphical models with neural networks for structured representations and fast inference (paper)

Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks (paper)

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (paper)

Twin Networks: Using the Future as a Regularizer (paper)

Don't Decay the Learning Rate, Increase the Batch Size (paper)

DL Tuning (blog)

Efficient Processing of Deep Neural Networks: A Tutorial and Survey

50 Years of Data Science by Donoho (paper)

Papers with Code (link)

Security (blog)

Unsupervised learning (blog)

Cybersecurity (paper collection)

Stanford's CS231n (course page)

Stanford's STATS385 (course page)

UC Berkeley Stat241B (lectures)

UIUC CSE598 (course page)

TF Playground (Google)

SnakeViz (python profiler)

Pytorch resources (a curated list of tutorials, papers, projects)