Deep Learning
Yoshua Bengio
Ian J. Goodfellow
Aaron Courville
December 5, 2014
Table of Contents
1 Deep Learning for AI 2
1.1 Who should read this book? . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Historical Perspective and Neural Networks . . . . . . . . . . . . . . . . 14
1.4 Recent Impact of Deep Learning Research . . . . . . . . . . . . . . . . . 15
1.5 Challenges for Future Research . . . . . . . . . . . . . . . . . . . . . . . 17
2 Linear algebra 20
2.1 Scalars, vectors, matrices and tensors . . . . . . . . . . . . . . . . . . . . 20
2.2 Multiplying matrices and vectors . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Identity and inverse matrices . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Linear dependence, span, and rank . . . . . . . . . . . . . . . . . . . . . 25
2.5 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Special kinds of matrices and vectors . . . . . . . . . . . . . . . . . . . . 27
2.7 Eigendecomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.8 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . 30
2.9 The trace operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.10 Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.11 Example: Principal components analysis . . . . . . . . . . . . . . . . . . 31
3 Probability and Information Theory 35
3.1 Why probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.1 Discrete variables and probability mass functions . . . . . . . . . 37
3.3.2 Continuous variables and probability density functions . . . . . . 38
3.4 Marginal probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Conditional probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6 The chain rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.7 Independence and conditional independence . . . . . . . . . . . . . . . . 40
3.8 Expectation, variance, and covariance . . . . . . . . . . . . . . . . . . . 41
3.9 Information theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.10 Common probability distributions . . . . . . . . . . . . . . . . . . . . . 44
3.10.1 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . . 44
3.10.2 Multinoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . 44
3.10.3 Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . 45
3.10.4 Dirac Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.10.5 Mixtures of Distributions and Gaussian Mixture . . . . . . . . . 48
3.11 Useful properties of common functions . . . . . . . . . . . . . . . . . . . 48
3.12 Bayes’ rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.13 Technical details of continuous variables . . . . . . . . . . . . . . . . . . 51
3.14 Example: Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4 Numerical Computation 56
4.1 Overflow and underflow . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Poor conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Gradient-Based Optimization . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Constrained optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5 Example: linear least squares . . . . . . . . . . . . . . . . . . . . . . . . 68
5 Machine Learning Basics 70
5.1 Learning algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.1 The task, T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.2 The performance measure, P . . . . . . . . . . . . . . . . . . . . 72
5.1.3 The experience, E . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Example: Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3 Generalization, Capacity, Overfitting and Underfitting . . . . . . . . . . 76
5.3.1 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.2 Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.3 Occam’s Razor, Underfitting and Overfitting . . . . . . . . . . . 78
5.4 Estimating and Monitoring Generalization Error . . . . . . . . . . . . . 81
5.5 Estimators, Bias, and Variance . . . . . . . . . . . . . . . . . . . . . . . 81
5.5.1 Point Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.5.2 Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5.3 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5.4 Trading off Bias and Variance and the Mean Squared Error . . . 85
5.5.5 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.6 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . 86
5.6.1 Properties of Maximum Likelihood . . . . . . . . . . . . . . . . . 87
5.6.2 Regularized Likelihood . . . . . . . . . . . . . . . . . . . . . . . . 87
5.7 Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.8 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.8.1 Estimating Conditional Expectation by Minimizing Squared Error 88
5.8.2 Estimating Probabilities or Conditional Probabilities by Maxi-
mum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.9 Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.9.1 Principal Components Analysis . . . . . . . . . . . . . . . . . . . 90
5.10 Weakly supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.11 The Smoothness Prior, Local Generalization and Non-Parametric Models 95
5.12 Manifold Learning and the Curse of Dimensionality . . . . . . . . . . . . 99
5.13 Challenges of High-Dimensional Distributions . . . . . . . . . . . . . . . 102
6 Feedforward Deep Networks 104
6.1 Formalizing and Generalizing Neural Networks . . . . . . . . . . . . . . 104
6.2 Parametrizing a Learned Predictor . . . . . . . . . . . . . . . . . . . . . 107
6.2.1 Family of Functions . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.2.2 Loss Function and Conditional Log-Likelihood . . . . . . . . . . 108
6.2.3 Training Criterion and Regularizer . . . . . . . . . . . . . . . . . 113
6.2.4 Optimization Procedure . . . . . . . . . . . . . . . . . . . . . . . 114
6.3 Flow Graphs and Back-Propagation . . . . . . . . . . . . . . . . . . . . 115
6.3.1 Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.3.2 Back-Propagation in a General Flow Graph . . . . . . . . . . . . 118
6.4 Universal Approximation Properties and Depth . . . . . . . . . . . . . . 122
6.5 Feature / Representation Learning . . . . . . . . . . . . . . . . . . . . . 124
6.6 Piecewise Linear Hidden Units . . . . . . . . . . . . . . . . . . . . . . . 125
6.7 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7 Regularization 127
7.1 Classical Regularization: Parameter Norm Penalty . . . . . . . . . . . . 128
7.1.1 L^2 parameter regularization . . . . . . . . . . . . . . . . . . . . 129
7.1.2 L^1 regularization . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.1.3 L^∞ regularization . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.2 Classical regularization as constrained optimization . . . . . . . . . . . . 132
7.3 Regularization from a Bayesian perspective . . . . . . . . . . . . . . . . 134
7.4 Early stopping as a form of regularization . . . . . . . . . . . . . . . . . 134
7.5 Regularization and under-constrained problems . . . . . . . . . . . . . . 139
7.6 Parameter Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.7 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.8 Dataset augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.9 Classical regularization as noise robustness . . . . . . . . . . . . . . . . 141
7.10 Semi-supervised Training . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.11 Unsupervised Pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.11.1 The pretraining protocol . . . . . . . . . . . . . . . . . . . . . . 142
7.12 Bagging and other ensemble methods . . . . . . . . . . . . . . . . . . . . 144
7.13 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.14 Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
8 Optimization for training deep models 150
8.1 Optimization for model training . . . . . . . . . . . . . . . . . . . . . . . 150
8.1.1 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
8.1.2 Plateaus, saddle points, and other flat regions . . . . . . . . . . . 150
8.1.3 Cliffs and Exploding Gradients . . . . . . . . . . . . . . . . . . . 150
8.1.4 Vanishing and Exploding Gradients - An Introduction to the Issue
of Learning Long-Term Dependencies . . . . . . . . . . . . . . . 153
8.2 Optimization algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.2.1 Approximate Natural Gradient and Second-Order Methods . . . 156
8.2.2 Optimization strategies and meta-algorithms . . . . . . . . . . . 156
8.2.3 Coordinate descent . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.2.4 Greedy supervised pre-training . . . . . . . . . . . . . . . . . . . 157
8.3 Hints and Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . 157
9 Structured Probabilistic Models: A Deep Learning Perspective 158
9.1 The Challenge of Unstructured Modeling . . . . . . . . . . . . . . . . . 159
9.2 A Graphical Syntax for Describing Model Structure . . . . . . . . . . . 161
9.2.1 Directed Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
9.2.2 Undirected Models . . . . . . . . . . . . . . . . . . . . . . . . . . 163
9.2.3 The Partition Function . . . . . . . . . . . . . . . . . . . . . . . 164
9.2.4 Energy-Based Models . . . . . . . . . . . . . . . . . . . . . . . . 166
9.2.5 Separation and D-Separation . . . . . . . . . . . . . . . . . . . . 167
9.2.6 Operations on a Graph . . . . . . . . . . . . . . . . . . . . . . . 169
9.2.7 Factor Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
9.3 Advantages of Structured Modeling . . . . . . . . . . . . . . . . . . . . . 171
9.4 Learning about Dependencies . . . . . . . . . . . . . . . . . . . . . . . . 173
9.4.1 Latent Variables Versus Structure Learning . . . . . . . . . . . . 173
9.4.2 Latent Variables for Feature Learning . . . . . . . . . . . . . . . 174
9.5 Markov Chain Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . 174
9.6 Inference and Approximate Inference Over Latent Variables . . . . . . . 174
9.7 The Deep Learning Approach to Structured Probabilistic Modeling . . . 176
9.7.1 Example: The Restricted Boltzmann Machine . . . . . . . . . . . 177
10 Unsupervised and Transfer Learning 179
10.1 Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
10.1.1 Regularized Auto-Encoders . . . . . . . . . . . . . . . . . . . . . 181
10.1.2 Representational Power, Layer Size and Depth . . . . . . . . . . 184
10.1.3 Reconstruction Distribution . . . . . . . . . . . . . . . . . . . . . 185
10.2 Linear Factor Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
10.2.1 Probabilistic PCA and Factor Analysis . . . . . . . . . . . . . . . 186
10.2.2 Manifold Interpretation of PCA and Linear Auto-Encoders . . . 188
10.2.3 ICA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
10.2.4 Sparse Coding as a Generative Model . . . . . . . . . . . . . . . 191
10.3 RBMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
10.4 Greedy Layerwise Unsupervised Pre-Training . . . . . . . . . . . . . . . 192
10.5 Transfer Learning and Domain Adaptation . . . . . . . . . . . . . . . . 193
11 Convolutional Networks 199
11.1 The convolution operation . . . . . . . . . . . . . . . . . . . . . . . . . . 199
11.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
11.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
11.4 Variants of the basic convolution function . . . . . . . . . . . . . . . . . 209
11.5 Data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.6 Efficient convolution algorithms . . . . . . . . . . . . . . . . . . . . . . . 216
11.7 Deep learning history . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
12 Sequence Modeling: Recurrent and Recursive Nets 217
12.1 Unfolding Flow Graphs and Sharing Parameters . . . . . . . . . . . . . 217
12.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 219
12.2.1 Computing the gradient in a recurrent neural network . . . . . . 221
12.2.2 Recurrent Networks as Generative Directed Acyclic Models . . . 223
12.2.3 RNNs to represent conditional probability distributions . . . . . 225
12.3 Bidirectional RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
12.4 Recursive Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 229
12.5 Auto-Regressive Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 230
12.5.1 Logistic Auto-Regressive Networks . . . . . . . . . . . . . . . . . 231
12.5.2 Neural Auto-Regressive Networks . . . . . . . . . . . . . . . . . . 232
12.5.3 NADE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
12.6 Facing the Challenge of Long-Term Dependencies . . . . . . . . . . . . . 235
12.6.1 Echo State Networks: Choosing Weights to Make Dynamics Barely
Contractive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
12.6.2 Combining Short and Long Paths in the Unfolded Flow Graph . 237
12.6.3 Leaky Units and a Hierarchy of Different Time Scales . . . . . . . . 238
12.6.4 The Long Short-Term Memory Architecture and Other Gated RNNs . . . . 239
12.6.5 Deep RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
12.6.6 Better Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 243
12.6.7 Clipping Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . 244
12.6.8 Regularizing to Encourage Information Flow . . . . . . . . . . . 245
12.6.9 Organizing the State at Multiple Time Scales . . . . . . . . . . . 245
12.7 Handling temporal dependencies with n-grams, HMMs, CRFs and other
graphical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
12.7.1 N-grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
12.7.2 Efficient Marginalization and Inference for Temporally Structured
Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
12.7.3 HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
12.7.4 CRFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
12.8 Combining Neural Networks and Search . . . . . . . . . . . . . . . . . . 251
12.8.1 Joint Training of Neural Networks and Sequential Probabilistic
Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
12.8.2 MAP and Structured Output Models . . . . . . . . . . . . . . . . 251
12.8.3 Back-prop through Search . . . . . . . . . . . . . . . . . . . . . . 251
12.9 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
13 The Manifold Perspective on Auto-Encoders 252
13.1 Manifold Learning via Regularized Auto-Encoders . . . . . . . . . . . . 261
13.2 Probabilistic Interpretation of Reconstruction Error as Log-Likelihood . 263
13.3 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
13.3.1 Sparse Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . 266
13.3.2 Predictive Sparse Decomposition . . . . . . . . . . . . . . . . . . 267
13.4 Denoising Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . . 267
13.4.1 Learning a Vector Field that Estimates a Gradient Field . . . . . 269
13.4.2 Turning the Gradient Field into a Generative Model . . . . . . . 271
13.5 Contractive Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . 274
14 Distributed Representations: Disentangling the Underlying Factors 275
14.1 Assumption of Underlying Factors . . . . . . . . . . . . . . . . . . . . . 275
14.2 Exponential Gain in Representational Efficiency from Distributed Repre-
sentations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
14.3 Exponential Gain in Representational Efficiency from Depth . . . . . . . 275
14.4 Additional Priors Regarding The Underlying Factors . . . . . . . . . . . 275
15 Confronting the Partition Function 276
15.1 Estimating the partition function . . . . . . . . . . . . . . . . . . . . . . 276
15.1.1 Annealed importance sampling . . . . . . . . . . . . . . . . . . . 278
15.1.2 Bridge sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
15.1.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
15.2 Stochastic maximum likelihood and contrastive divergence . . . . . . . . 282
15.3 Pseudolikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
15.4 Score matching and ratio matching . . . . . . . . . . . . . . . . . . . . . 291
15.5 Denoising score matching . . . . . . . . . . . . . . . . . . . . . . . . . . 293
15.6 Noise-contrastive estimation . . . . . . . . . . . . . . . . . . . . . . . . . 293
16 Approximate inference 296
16.1 Inference as optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 296
16.2 Expectation maximization . . . . . . . . . . . . . . . . . . . . . . . . . . 298
16.3 MAP inference: Sparse coding as a probabilistic model . . . . . . . . . . 299
16.4 Variational inference and learning . . . . . . . . . . . . . . . . . . . . . . 300
16.4.1 Discrete latent variables . . . . . . . . . . . . . . . . . . . . . . . 302
16.4.2 Calculus of variations . . . . . . . . . . . . . . . . . . . . . . . . 302
16.4.3 Continuous latent variables . . . . . . . . . . . . . . . . . . . . . 304
16.5 Stochastic inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
16.6 Learned approximate inference . . . . . . . . . . . . . . . . . . . . . . . 304
17 Deep generative models 305
17.1 Restricted Boltzmann machines . . . . . . . . . . . . . . . . . . . . . . . 305
17.2 Deep belief networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
17.3 Deep Boltzmann machines . . . . . . . . . . . . . . . . . . . . . . . . . . 308
17.3.1 Interesting properties . . . . . . . . . . . . . . . . . . . . . . . . 308
17.3.2 Variational learning with SML . . . . . . . . . . . . . . . . . . . 309
17.3.3 Layerwise pretraining . . . . . . . . . . . . . . . . . . . . . . . . 310
17.3.4 Multi-prediction deep Boltzmann machines . . . . . . . . . . . . 312
17.3.5 Centered deep Boltzmann machines . . . . . . . . . . . . . . . . 312
17.4 Boltzmann machines for real-valued data . . . . . . . . . . . . . . . . . . 312
17.4.1 Gaussian-Bernoulli RBMs . . . . . . . . . . . . . . . . . . . . . . 312
17.4.2 mcRBMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
17.4.3 Spike and slab restricted Boltzmann machines . . . . . . . . . . . 313
17.5 Convolutional Boltzmann machines . . . . . . . . . . . . . . . . . . . . . 313
17.6 Other Boltzmann machines . . . . . . . . . . . . . . . . . . . . . . . . . 314
17.7 Directed generative nets . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
17.7.1 Variational autoencoders . . . . . . . . . . . . . . . . . . . . . . 314
17.7.2 Generative adversarial networks . . . . . . . . . . . . . . . . . . 314
17.8 A generative view of autoencoders . . . . . . . . . . . . . . . . . . . . . 315
17.9 Generative stochastic networks . . . . . . . . . . . . . . . . . . . . . . . 315
17.10 Methodological notes . . . . . . . . . . . . . . . . . . . . . . . . . 315
18 Large scale deep learning 318
18.1 Fast CPU implementations . . . . . . . . . . . . . . . . . . . . . . . . . 318
18.2 GPU implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
18.3 Asynchronous parallel implementations . . . . . . . . . . . . . . . . . . . 318
18.4 Dynamically structured nets . . . . . . . . . . . . . . . . . . . . . . . . . 318
18.5 Model compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
19 Practical methodology 320
19.1 When to gather more data, control capacity, or change algorithms . . . 320
19.2 Machine Learning Methodology 101 . . . . . . . . . . . . . . . . . . . . 320
19.3 Manual hyperparameter tuning . . . . . . . . . . . . . . . . . . . . . . . 320
19.4 Hyper-parameter optimization algorithms . . . . . . . . . . . . . . . . . 320
19.5 Tricks of the Trade for Deep Learning . . . . . . . . . . . . . . . . . . . 322
19.5.1 Debugging Back-Prop . . . . . . . . . . . . . . . . . . . . . . . . 322
19.5.2 Automatic Differentiation and Symbolic Manipulations of Flow
Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
19.5.3 Momentum and Other Averaging Techniques as Cheap Second
Order Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
20 Applications 323
20.1 Computer vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
20.1.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
20.1.2 Convolutional nets . . . . . . . . . . . . . . . . . . . . . . . . . . 329
20.2 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
20.3 Natural language processing and neural language models . . . . . . . . . 329
20.3.1 Neural language models . . . . . . . . . . . . . . . . . . . . . . . 329
20.4 Structured outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
20.5 Other applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
Bibliography 330
Index 348
Acknowledgments
We would like to thank the following people who commented on our proposal for the
book and helped plan its contents and organization: Hugo Larochelle, Guillaume Alain,
Kyunghyun Cho, Çağlar Gülçehre, Razvan Pascanu, David Krueger
and Thomas Rohée.
We would like to thank the following people who offered feedback on the content of
the book itself:
In many chapters: Pawel Chilinski.
Introduction: Johannes Roith, Eric Morris, Samira Ebrahimi, Ozan Çağlayan.
Math background chapters: Ilya Sutskever, Vincent Vanhoucke, Johannes Roith.
Linear algebra: Guillaume Alain, Dustin Webb, David Warde-Farley, Pierre Luc
Carrier, Li Yao, Thomas Rohée, Colby Toland, Amjad Almahairi, Sergey Oreshkov.
Probability: Rasmus Antti, Stephan Gouws, David Warde-Farley, Vincent Dumoulin,
Artem Oboturov, Li Yao, John Philip Anderson.
Numerical computation: Meire Fortunato, Jurgen Van Gael, Dustin Webb.
Machine learning: Dzmitry Bahdanau, Kelvin Xu.
MLPs: Jurgen Van Gael.
Convolutional nets: Guillaume Alain, David Warde-Farley, Mehdi Mirza, Çağlar
Gülçehre.
Unsupervised learning: Kelvin Xu.
Partition function: Sam Bowman.
Graphical models: Kelvin Xu.
RNNs: Kelvin Xu, Dmitriy Serdyuk.
We also want to thank Jason Yosinski and Nicolas Chapados for contributing figures
(as noted in the captions).