Deep Learning
Yoshua Bengio
Ian J. Goodfellow
Aaron Courville
January 1, 2015
Table of Contents
1 Deep Learning for AI 2
1.1 Who should read this book? . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Historical Perspective and Neural Networks . . . . . . . . . . . . . . . . 14
1.4 Recent Impact of Deep Learning Research . . . . . . . . . . . . . . . . . 15
1.5 Challenges for Future Research . . . . . . . . . . . . . . . . . . . . . . . 17
2 Linear algebra 20
2.1 Scalars, vectors, matrices and tensors . . . . . . . . . . . . . . . . . . . . 20
2.2 Multiplying matrices and vectors . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Identity and inverse matrices . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Linear dependence, span, and rank . . . . . . . . . . . . . . . . . . . . . 25
2.5 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Special kinds of matrices and vectors . . . . . . . . . . . . . . . . . . . . 28
2.7 Eigendecomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.8 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . 30
2.9 The trace operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.10 Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.11 Example: Principal components analysis . . . . . . . . . . . . . . . . . . 32
3 Probability and Information Theory 35
3.1 Why probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.1 Discrete variables and probability mass functions . . . . . . . . . 37
3.3.2 Continuous variables and probability density functions . . . . . . 38
3.4 Marginal probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Conditional probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6 The chain rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.7 Independence and conditional independence . . . . . . . . . . . . . . . . 40
3.8 Expectation, variance, and covariance . . . . . . . . . . . . . . . . . . . 41
3.9 Information theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.10 Common probability distributions . . . . . . . . . . . . . . . . . . . . . 44
3.10.1 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . . 44
3.10.2 Multinoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . 44
3.10.3 Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . 45
3.10.4 Dirac Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.10.5 Mixtures of Distributions and Gaussian Mixture . . . . . . . . . 48
3.11 Useful properties of common functions . . . . . . . . . . . . . . . . . . . 48
3.12 Bayes’ rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.13 Technical details of continuous variables . . . . . . . . . . . . . . . . . . 51
3.14 Example: Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4 Numerical Computation 56
4.1 Overflow and underflow . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Poor conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Gradient-Based Optimization . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Constrained optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5 Example: linear least squares . . . . . . . . . . . . . . . . . . . . . . . . 68
5 Machine Learning Basics 70
5.1 Learning algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.1 The task, T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.2 The performance measure, P . . . . . . . . . . . . . . . . . . . . 72
5.1.3 The experience, E . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Example: Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3 Generalization, Capacity, Overfitting and Underfitting . . . . . . . . . . 76
5.3.1 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.2 Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.3 Occam’s Razor, Underfitting and Overfitting . . . . . . . . . . . 78
5.4 Estimating and Monitoring Generalization Error . . . . . . . . . . . . . 81
5.5 Estimators, Bias, and Variance . . . . . . . . . . . . . . . . . . . . . . . 81
5.5.1 Point Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.5.2 Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5.3 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5.4 Trading off Bias and Variance and the Mean Squared Error . . . 85
5.5.5 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.6 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . 86
5.6.1 Properties of Maximum Likelihood . . . . . . . . . . . . . . . . . 87
5.6.2 Regularized Likelihood . . . . . . . . . . . . . . . . . . . . . . . . 87
5.7 Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.8 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.8.1 Estimating Conditional Expectation by Minimizing Squared Error 88
5.8.2 Estimating Probabilities or Conditional Probabilities by Maximum Likelihood . . . . . . . 89
5.9 Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.9.1 Principal Components Analysis . . . . . . . . . . . . . . . . . . . 90
5.10 Weakly supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.11 The Smoothness Prior, Local Generalization and Non-Parametric Models 93
5.12 Manifold Learning and the Curse of Dimensionality . . . . . . . . . . . . 97
5.13 Challenges of High-Dimensional Distributions . . . . . . . . . . . . . . . 100
6 Feedforward Deep Networks 102
6.1 Formalizing and Generalizing Neural Networks . . . . . . . . . . . . . . 102
6.2 Parametrizing a Learned Predictor . . . . . . . . . . . . . . . . . . . . . 105
6.2.1 Family of Functions . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2.2 Loss Function and Conditional Log-Likelihood . . . . . . . . . . 106
6.2.3 Training Criterion and Regularizer . . . . . . . . . . . . . . . . . 111
6.2.4 Optimization Procedure . . . . . . . . . . . . . . . . . . . . . . . 112
6.3 Flow Graphs and Back-Propagation . . . . . . . . . . . . . . . . . . . . 113
6.3.1 Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.3.2 Back-Propagation in a General Flow Graph . . . . . . . . . . . . 116
6.4 Universal Approximation Properties and Depth . . . . . . . . . . . . . . 120
6.5 Feature / Representation Learning . . . . . . . . . . . . . . . . . . . . . 122
6.6 Piecewise Linear Hidden Units . . . . . . . . . . . . . . . . . . . . . . . 124
6.7 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7 Regularization 126
7.1 Classical Regularization: Parameter Norm Penalty . . . . . . . . . . . . 127
7.1.1 L2 Parameter Regularization . . . . . . . . . . . . . . . . . . . . 128
7.1.2 L1 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.1.3 L∞ Regularization . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.2 Classical Regularization as Constrained Optimization . . . . . . . . . . 132
7.3 Regularization from a Bayesian Perspective . . . . . . . . . . . . . . . . 134
7.4 Early Stopping as a Form of Regularization . . . . . . . . . . . . . . . . 134
7.5 Regularization and Under-Constrained Problems . . . . . . . . . . . . . 139
7.6 Parameter Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.7 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.8 Dataset Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.9 Classical Regularization as Noise Robustness . . . . . . . . . . . . . . . 141
7.10 Semi-Supervised Training . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.11 Unsupervised Pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.11.1 Pretraining Protocol . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.12 Bagging and Other Ensemble Methods . . . . . . . . . . . . . . . . . . . 144
7.13 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.14 Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
8 Optimization for training deep models 150
8.1 Optimization for model training . . . . . . . . . . . . . . . . . . . . . . . 150
8.1.1 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
8.1.2 Plateaus, saddle points, and other flat regions . . . . . . . . . . . 150
8.1.3 Cliffs and Exploding Gradients . . . . . . . . . . . . . . . . . . . 150
8.1.4 Vanishing and Exploding Gradients - An Introduction to the Issue of Learning Long-Term Dependencies . . . . . . . 153
8.2 Optimization algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.2.1 Approximate Natural Gradient and Second-Order Methods . . . 156
8.2.2 Optimization strategies and meta-algorithms . . . . . . . . . . . 156
8.2.3 Coordinate descent . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.2.4 Greedy supervised pre-training . . . . . . . . . . . . . . . . . . . 157
8.3 Hints and Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . 157
9 Structured Probabilistic Models: A Deep Learning Perspective 158
9.1 The Challenge of Unstructured Modeling . . . . . . . . . . . . . . . . . 159
9.2 A Graphical Syntax for Describing Model Structure . . . . . . . . . . . 161
9.2.1 Directed Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
9.2.2 Undirected Models . . . . . . . . . . . . . . . . . . . . . . . . . . 163
9.2.3 The Partition Function . . . . . . . . . . . . . . . . . . . . . . . 164
9.2.4 Energy-Based Models . . . . . . . . . . . . . . . . . . . . . . . . 166
9.2.5 Separation and D-Separation . . . . . . . . . . . . . . . . . . . . 167
9.2.6 Operations on a Graph . . . . . . . . . . . . . . . . . . . . . . . 169
9.2.7 Factor Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
9.3 Advantages of Structured Modeling . . . . . . . . . . . . . . . . . . . . . 171
9.4 Learning about Dependencies . . . . . . . . . . . . . . . . . . . . . . . . 171
9.4.1 Latent Variables Versus Structure Learning . . . . . . . . . . . . 171
9.4.2 Latent Variables for Feature Learning . . . . . . . . . . . . . . . 172
9.5 Markov Chain Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . 173
9.6 Inference and Approximate Inference Over Latent Variables . . . . . . . 174
9.7 The Deep Learning Approach to Structured Probabilistic Modeling . . . 176
9.7.1 Example: The Restricted Boltzmann Machine . . . . . . . . . . . 177
10 Unsupervised and Transfer Learning 179
10.1 Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
10.1.1 Regularized Auto-Encoders . . . . . . . . . . . . . . . . . . . . . 181
10.1.2 Representational Power, Layer Size and Depth . . . . . . . . . . 184
10.1.3 Reconstruction Distribution . . . . . . . . . . . . . . . . . . . . . 185
10.2 Linear Factor Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
10.2.1 Probabilistic PCA and Factor Analysis . . . . . . . . . . . . . . . 186
10.2.2 Manifold Interpretation of PCA and Linear Auto-Encoders . . . 188
10.2.3 ICA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
10.2.4 Sparse Coding as a Generative Model . . . . . . . . . . . . . . . 191
10.3 RBMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
10.4 Greedy Layerwise Unsupervised Pre-Training . . . . . . . . . . . . . . . 192
10.5 Transfer Learning and Domain Adaptation . . . . . . . . . . . . . . . . 193
11 Convolutional Networks 199
11.1 The convolution operation . . . . . . . . . . . . . . . . . . . . . . . . . . 199
11.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
11.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
11.4 Variants of the basic convolution function . . . . . . . . . . . . . . . . . 209
11.5 Data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.6 Efficient convolution algorithms . . . . . . . . . . . . . . . . . . . . . . . 216
11.7 Deep learning history . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
12 Sequence Modeling: Recurrent and Recursive Nets 217
12.1 Unfolding Flow Graphs and Sharing Parameters . . . . . . . . . . . . . 217
12.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 219
12.2.1 Computing the gradient in a recurrent neural network . . . . . . 221
12.2.2 Recurrent Networks as Generative Directed Acyclic Models . . . 223
12.2.3 RNNs to represent conditional probability distributions . . . . . 225
12.3 Bidirectional RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
12.4 Recursive Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 229
12.5 Auto-Regressive Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 230
12.5.1 Logistic Auto-Regressive Networks . . . . . . . . . . . . . . . . . 231
12.5.2 Neural Auto-Regressive Networks . . . . . . . . . . . . . . . . . . 232
12.5.3 NADE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
12.6 Facing the Challenge of Long-Term Dependencies . . . . . . . . . . . . . 235
12.6.1 Echo State Networks: Choosing Weights to Make Dynamics Barely Contractive . . . . . . . 235
12.6.2 Combining Short and Long Paths in the Unfolded Flow Graph . 237
12.6.3 Leaky Units and a Hierarchy of Different Time Scales . . . . . . . . 238
12.6.4 The Long-Short-Term-Memory Architecture and Other Gated RNNs . . 239
12.6.5 Deep RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
12.6.6 Better Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 243
12.6.7 Clipping Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . 244
12.6.8 Regularizing to Encourage Information Flow . . . . . . . . . . . 245
12.6.9 Organizing the State at Multiple Time Scales . . . . . . . . . . . 245
12.7 Handling temporal dependencies with n-grams, HMMs, CRFs and other graphical models . . . . . . . 246
12.7.1 N-grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
12.7.2 Efficient Marginalization and Inference for Temporally Structured Outputs by Dynamic Programming . . . . . . . 247
12.7.3 HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
12.7.4 CRFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
12.8 Combining Neural Networks and Search . . . . . . . . . . . . . . . . . . 256
12.8.1 Approximate Search . . . . . . . . . . . . . . . . . . . . . . . . . 257
13 The Manifold Perspective on Auto-Encoders 261
13.1 Manifold Learning via Regularized Auto-Encoders . . . . . . . . . . . . 269
13.2 Probabilistic Interpretation of Reconstruction Error as Log-Likelihood . 272
13.3 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
13.3.1 Sparse Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . 274
13.3.2 Predictive Sparse Decomposition . . . . . . . . . . . . . . . . . . 276
13.4 Denoising Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . . 277
13.4.1 Learning a Vector Field that Estimates a Gradient Field . . . . . 279
13.4.2 Turning the Gradient Field into a Generative Model . . . . . . . 281
13.5 Contractive Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . 284
13.6 Tangent Distance, Tangent-Prop, and Manifold Tangent Classifier . . . 285
14 Distributed Representations: Disentangling the Underlying Factors 288
14.1 Causality and Semi-Supervised Learning . . . . . . . . . . . . . . . . . . 288
14.2 Assumption of Underlying Factors and Distributed Representation . . . 290
14.3 Exponential Gain in Representational Efficiency from Distributed Representations . . . . . . . 294
14.4 Exponential Gain in Representational Efficiency from Depth . . . . . . . 295
14.5 Priors Regarding The Underlying Factors . . . . . . . . . . . . . . . . . 298
15 Confronting the Partition Function 301
15.1 Estimating the partition function . . . . . . . . . . . . . . . . . . . . . . 301
15.1.1 Annealed importance sampling . . . . . . . . . . . . . . . . . . . 303
15.1.2 Bridge sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
15.1.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
15.2 Stochastic maximum likelihood and contrastive divergence . . . . . . . . 307
15.3 Pseudolikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
15.4 Score matching and ratio matching . . . . . . . . . . . . . . . . . . . . . 316
15.5 Denoising score matching . . . . . . . . . . . . . . . . . . . . . . . . . . 318
15.6 Noise-contrastive estimation . . . . . . . . . . . . . . . . . . . . . . . . . 318
16 Approximate inference 321
16.1 Inference as optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 321
16.2 Expectation maximization . . . . . . . . . . . . . . . . . . . . . . . . . . 323
16.3 MAP inference: Sparse coding as a probabilistic model . . . . . . . . . . 324
16.4 Variational inference and learning . . . . . . . . . . . . . . . . . . . . . . 325
16.4.1 Discrete latent variables . . . . . . . . . . . . . . . . . . . . . . . 327
16.4.2 Calculus of variations . . . . . . . . . . . . . . . . . . . . . . . . 327
16.4.3 Continuous latent variables . . . . . . . . . . . . . . . . . . . . . 329
16.5 Stochastic inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
16.6 Learned approximate inference . . . . . . . . . . . . . . . . . . . . . . . 329
17 Deep generative models 330
17.1 Restricted Boltzmann machines . . . . . . . . . . . . . . . . . . . . . . . 330
17.2 Deep belief networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
17.3 Deep Boltzmann machines . . . . . . . . . . . . . . . . . . . . . . . . . . 333
17.3.1 Interesting properties . . . . . . . . . . . . . . . . . . . . . . . . 333
17.3.2 Variational learning with SML . . . . . . . . . . . . . . . . . . . 334
17.3.3 Layerwise pretraining . . . . . . . . . . . . . . . . . . . . . . . . 335
17.3.4 Multi-prediction deep Boltzmann machines . . . . . . . . . . . . 337
17.3.5 Centered deep Boltzmann machines . . . . . . . . . . . . . . . . 337
17.4 Boltzmann machines for real-valued data . . . . . . . . . . . . . . . . . . 337
17.4.1 Gaussian-Bernoulli RBMs . . . . . . . . . . . . . . . . . . . . . . 337
17.4.2 mcRBMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
17.4.3 Spike and slab restricted Boltzmann machines . . . . . . . . . . . 338
17.5 Convolutional Boltzmann machines . . . . . . . . . . . . . . . . . . . . . 338
17.6 Other Boltzmann machines . . . . . . . . . . . . . . . . . . . . . . . . . 339
17.7 Directed generative nets . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
17.7.1 Variational autoencoders . . . . . . . . . . . . . . . . . . . . . . 339
17.7.2 Variational interpretation of PSD . . . . . . . . . . . . . . . . . . 339
17.7.3 Generative adversarial networks . . . . . . . . . . . . . . . . . . 339
17.8 A generative view of autoencoders . . . . . . . . . . . . . . . . . . . . . 340
17.9 Generative stochastic networks . . . . . . . . . . . . . . . . . . . . . . . 340
17.10 Methodological notes . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
18 Large scale deep learning 343
18.1 Fast CPU implementations . . . . . . . . . . . . . . . . . . . . . . . . . 343
18.2 GPU implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
18.3 Asynchronous parallel implementations . . . . . . . . . . . . . . . . . . . 343
18.4 Dynamically structured nets . . . . . . . . . . . . . . . . . . . . . . . . . 343
18.5 Model compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
19 Practical methodology 345
19.1 When to gather more data, control capacity, or change algorithms . . . 345
19.2 Machine Learning Methodology 101 . . . . . . . . . . . . . . . . . . . . 345
19.3 Manual hyperparameter tuning . . . . . . . . . . . . . . . . . . . . . . . 345
19.4 Hyper-parameter optimization algorithms . . . . . . . . . . . . . . . . . 345
19.5 Tricks of the Trade for Deep Learning . . . . . . . . . . . . . . . . . . . 347
19.5.1 Debugging Back-Prop . . . . . . . . . . . . . . . . . . . . . . . . 347
19.5.2 Automatic Differentiation and Symbolic Manipulations of Flow Graphs . . . . . . . 347
19.5.3 Momentum and Other Averaging Techniques as Cheap Second Order Methods . . . . . . . 347
20 Applications 348
20.1 Computer vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
20.1.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
20.1.2 Convolutional nets . . . . . . . . . . . . . . . . . . . . . . . . . . 354
20.2 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
20.3 Natural language processing and neural language models . . . . . . . . . 354
20.3.1 Neural language models . . . . . . . . . . . . . . . . . . . . . . . 354
20.4 Structured outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
20.5 Other applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
Bibliography 355
Index 376
Acknowledgments
We would like to thank the following people who commented on our proposal for the
book and helped plan its contents and organization: Hugo Larochelle, Guillaume Alain,
Kyunghyun Cho, Çağlar Gülçehre, Razvan Pascanu, David Krueger and Thomas Rohée.
We would like to thank the following people who offered feedback on the content of
the book itself:
In many chapters: Pawel Chilinski.
Introduction: Johannes Roith, Eric Morris, Samira Ebrahimi, Ozan Çağlayan.
Math background chapters: Ilya Sutskever, Vincent Vanhoucke, Johannes Roith.
Linear algebra: Guillaume Alain, Dustin Webb, David Warde-Farley, Pierre Luc
Carrier, Li Yao, Thomas Rohée, Colby Toland, Amjad Almahairi, Sergey Oreshkov.
Probability: Rasmus Antti, Stephan Gouws, David Warde-Farley, Vincent Dumoulin,
Artem Oboturov, Li Yao, John Philip Anderson.
Numerical: Meire Fortunato, Jurgen Van Gael, Dustin Webb.
ML: Dzmitry Bahdanau, Kelvin Xu.
MLPs: Jurgen Van Gael.
Convolutional nets: Guillaume Alain, David Warde-Farley, Mehdi Mirza, Çağlar
Gülçehre.
Unsupervised: Kelvin Xu.
Partition function: Sam Bowman.
Graphical models: Kelvin Xu.
RNNs: Kelvin Xu, Dmitriy Serdyuk.
We also want to thank Jason Yosinski and Nicolas Chapados for contributing figures
(as noted in the captions).