Deep Learning
Yoshua Bengio
Ian J. Goodfellow
Aaron Courville
October 21, 2014
Table of Contents
1 Deep Learning for AI 2
1.1 Who should read this book? . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Historical Perspective and Neural Networks . . . . . . . . . . . . . . . . 14
1.4 Recent Impact of Deep Learning Research . . . . . . . . . . . . . . . . . 15
1.5 Challenges for Future Research . . . . . . . . . . . . . . . . . . . . . . . 17
2 Linear algebra 20
2.1 Scalars, vectors, matrices and tensors . . . . . . . . . . . . . . . . . . . . 20
2.2 Multiplying matrices and vectors . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Identity and inverse matrices . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Linear dependence, span, and rank . . . . . . . . . . . . . . . . . . . . . 25
2.5 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Special kinds of matrices and vectors . . . . . . . . . . . . . . . . . . . . 27
2.7 Eigendecomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.8 The trace operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.9 Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.10 Example: Principal components analysis . . . . . . . . . . . . . . . . . . 31
3 Probability and Information Theory 34
3.1 Why probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.1 Discrete variables and probability mass functions . . . . . . . . . 36
3.3.2 Continuous variables and probability density functions . . . . . . 37
3.4 Marginal probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5 Conditional probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6 The chain rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.7 Independence and conditional independence . . . . . . . . . . . . . . . . 39
3.8 Expectation, variance, and covariance . . . . . . . . . . . . . . . . . . . 40
3.9 Information theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.10 Common probability distributions . . . . . . . . . . . . . . . . . . . . . 43
3.10.1 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . . 43
3.10.2 Multinoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . 43
3.10.3 Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . 44
3.10.4 Dirac Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.10.5 Mixtures of Distributions and Gaussian Mixtures . . . . . . . . . 47
3.11 Useful properties of common functions . . . . . . . . . . . . . . . . . . . 47
3.12 Bayes’ rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.13 Technical details of continuous variables . . . . . . . . . . . . . . . . . . 50
3.14 Example: Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4 Numerical Computation 54
4.1 Overflow and underflow . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2 Poor conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3 Gradient-Based Optimization . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Constrained optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.5 Example: linear least squares . . . . . . . . . . . . . . . . . . . . . . . . 65
5 Machine Learning Basics 67
5.1 Learning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1.1 The task, T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1.2 The performance measure, P . . . . . . . . . . . . . . . . . . . . 69
5.1.3 The experience, E . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Example: Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.3 Generalization, Capacity, Overfitting and Underfitting . . . . . . . . . . 73
5.3.1 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.3.2 Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3.3 Occam’s Razor, Underfitting and Overfitting . . . . . . . . . . . 75
5.4 Estimating and Monitoring Generalization Error . . . . . . . . . . . . . 78
5.5 Estimators, Bias, and Variance . . . . . . . . . . . . . . . . . . . . . . . 78
5.5.1 Point Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.5.2 Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.5.3 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.5.4 Trading off Bias and Variance and the Mean Squared Error . . . 81
5.5.5 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.6 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . 82
5.6.1 Properties of Maximum Likelihood . . . . . . . . . . . . . . . . . 82
5.6.2 Regularized Likelihood . . . . . . . . . . . . . . . . . . . . . . . . 83
5.7 Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.8 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.8.1 Estimating Conditional Expectation by Minimizing Squared Error 84
5.8.2 Estimating Probabilities or Conditional Probabilities by Maxi-
mum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.9 Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.10 The Smoothness Prior, Local Generalization and Non-Parametric Models 85
5.11 Manifold Learning and the Curse of Dimensionality . . . . . . . . . . . . 90
5.12 Challenges of High-Dimensional Distributions . . . . . . . . . . . . . . . 93
6 Feedforward Deep Networks 95
6.1 Formalizing and Generalizing Neural Networks . . . . . . . . . . . . . . 95
6.2 Parametrizing a Learned Predictor . . . . . . . . . . . . . . . . . . . . . 97
6.2.1 Family of Functions . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.2.2 Loss Function and Conditional Log-Likelihood . . . . . . . . . . 99
6.2.3 Training Criterion and Regularizer . . . . . . . . . . . . . . . . . 104
6.2.4 Optimization Procedure . . . . . . . . . . . . . . . . . . . . . . . 105
6.3 Flow Graphs and Back-Propagation . . . . . . . . . . . . . . . . . . . . 106
6.3.1 Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.3.2 Back-Propagation in a General Flow Graph . . . . . . . . . . . . 108
6.4 Universal Approximation Properties and Depth . . . . . . . . . . . . . . 112
6.5 Feature / Representation Learning . . . . . . . . . . . . . . . . . . . . . 114
6.6 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7 Regularization 116
7.1 Classical Regularization: Parameter Norm Penalty . . . . . . . . . . . . 117
7.1.1 L^2 regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.1.2 L^1 regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.1.3 L regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.2 Regularization from a Bayesian perspective . . . . . . . . . . . . . . . . 119
7.3 Early Stopping as a Form of Regularization . . . . . . . . . . . . . . . . 119
7.4 Parameter Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.5 Sparsity of Representations . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.6 Semi-supervised Training . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.7 Early stopping as a form of regularization . . . . . . . . . . . . . . . . . 120
7.8 Unsupervised Pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.8.1 The pretraining protocol . . . . . . . . . . . . . . . . . . . . . . . 121
7.9 Bagging and other ensemble methods . . . . . . . . . . . . . . . . . . . . 123
7.10 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.11 Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
8 Optimization for training deep models 129
8.1 Optimization for model training . . . . . . . . . . . . . . . . . . . . . . . 129
8.1.1 Surrogate loss functions . . . . . . . . . . . . . . . . . . . . . . . 129
8.1.2 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
8.1.3 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
8.1.4 Data parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
8.2 Local Minima . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
8.3 Ill-Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
8.4 Plateaus, saddle points, and other flat regions . . . . . . . . . . . . . . . 130
8.5 Cliffs and Exploding Gradients . . . . . . . . . . . . . . . . . . . . . . . 130
8.6 Vanishing and Exploding Gradients - An Introduction to the Issue of
Learning Long-Term Dependencies . . . . . . . . . . . . . . . . . . . . . 130
8.7 Inexact gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
8.8 Optimization algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
8.8.1 Approximate Natural Gradient and Second-Order Methods . . . 130
8.9 Challenges in optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 131
8.9.1 Optimization strategies and meta-algorithms . . . . . . . . . . . 131
8.9.2 Initialization strategies . . . . . . . . . . . . . . . . . . . . . . . . 131
8.9.3 Greedy Supervised Pre-Training . . . . . . . . . . . . . . . . . . 131
8.9.4 Designing models to aid optimization . . . . . . . . . . . . . . . 131
8.10 Hints and Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . 131
9 Structured Probabilistic Models: A Deep Learning Perspective 132
9.1 The challenge of unstructured modeling . . . . . . . . . . . . . . . . . . 133
9.2 A graphical syntax for describing model structure . . . . . . . . . . . . . 135
9.2.1 Directed models . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
9.2.2 Undirected models . . . . . . . . . . . . . . . . . . . . . . . . . . 137
9.2.3 The Partition Function . . . . . . . . . . . . . . . . . . . . . . . 138
9.2.4 Energy-Based Models . . . . . . . . . . . . . . . . . . . . . . . . 140
9.2.5 Separation and d-separation . . . . . . . . . . . . . . . . . . . . . 141
9.2.6 Operations on a graph . . . . . . . . . . . . . . . . . . . . . . . . 143
9.2.7 Factor graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
9.3 Advantages of structured modeling . . . . . . . . . . . . . . . . . . . . . 145
9.4 Learning about dependencies . . . . . . . . . . . . . . . . . . . . . . . . 147
9.4.1 Latent variables versus structure learning . . . . . . . . . . . . . 147
9.4.2 Latent variables for feature learning . . . . . . . . . . . . . . . . 148
9.5 The deep learning approach to structured probabilistic modeling . . . . 148
9.5.1 Example: The restricted Boltzmann machine . . . . . . . . . . . 148
9.6 Markov chain Monte Carlo methods . . . . . . . . . . . . . . . . . . . . 149
10 Unsupervised and Transfer Learning 151
10.1 Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
10.1.1 Regularized Auto-Encoders . . . . . . . . . . . . . . . . . . . . . 153
10.1.2 Representational Power, Layer Size and Depth . . . . . . . . . . 156
10.1.3 Reconstruction Distribution . . . . . . . . . . . . . . . . . . . . . 157
10.2 Linear Factor Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
10.2.1 Probabilistic PCA and Factor Analysis . . . . . . . . . . . . . . . 158
10.2.2 Manifold Interpretation of PCA and Linear Auto-Encoders . . . 160
10.2.3 ICA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
10.2.4 Sparse Coding as a Generative Model . . . . . . . . . . . . . . . 163
10.3 RBMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
10.4 Greedy Layerwise Unsupervised Pre-Training . . . . . . . . . . . . . . . 164
10.5 Transfer Learning and Domain Adaptation . . . . . . . . . . . . . . . . 165
11 Convolutional Networks 169
11.1 The convolution operation . . . . . . . . . . . . . . . . . . . . . . . . . . 169
11.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
11.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
11.4 Variants of the basic convolution function . . . . . . . . . . . . . . . . . 179
11.5 Data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
11.6 Efficient convolution algorithms . . . . . . . . . . . . . . . . . . . . . . . 186
11.7 Deep learning history . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
12 Sequence Modeling: Recurrent and Recursive Nets 187
12.1 Unfolding Flow Graphs and Sharing Parameters . . . . . . . . . . . . . 187
12.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 189
12.2.1 Computing the gradient in a recurrent neural network . . . . . . 191
12.2.2 Recurrent Networks as Generative Directed Acyclic Models . . . 193
12.2.3 RNNs to represent conditional probability distributions . . . . . 195
12.3 Bidirectional RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
12.4 Recursive Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 199
12.5 Auto-Regressive Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 200
12.5.1 Logistic Auto-Regressive Networks . . . . . . . . . . . . . . . . . 201
12.5.2 Neural Auto-Regressive Networks . . . . . . . . . . . . . . . . . . 202
12.5.3 NADE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
12.6 Facing the Challenge of Long-Term Dependencies . . . . . . . . . . . . . 205
12.6.1 Echo State Networks: Choosing Weights to Make Dynamics Barely
Contractive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
12.6.2 Combining Short and Long Paths in the Unfolded Flow Graph . 205
12.6.3 The Long Short-Term Memory Architecture . . . . . . . . . . . . 205
12.6.4 Better Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 205
12.6.5 Clipping Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . 205
12.6.6 Regularizing to Encourage Information Flow . . . . . . . . . . . 205
12.6.7 Organizing the State at Multiple Time Scales . . . . . . . . . . . 205
12.7 Handling temporal dependencies with n-grams, HMMs, CRFs and other
graphical models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
12.7.1 N-grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
12.7.2 HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
12.7.3 CRFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
12.8 Combining Neural Networks and Search . . . . . . . . . . . . . . . . . . 206
12.8.1 Joint Training of Neural Networks and Sequential Probabilistic
Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
12.8.2 MAP and Structured Output Models . . . . . . . . . . . . . . . . 206
12.8.3 Back-prop through Search . . . . . . . . . . . . . . . . . . . . . . 206
12.9 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
13 The Manifold Perspective on Auto-Encoders 207
13.1 Manifold Learning via Regularized Auto-Encoders . . . . . . . . . . . . 216
13.2 Probabilistic Interpretation of Reconstruction Error as Log-Likelihood . 218
13.3 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
13.3.1 Sparse Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . 221
13.3.2 Predictive Sparse Decomposition . . . . . . . . . . . . . . . . . . 222
13.4 Denoising Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . . 222
13.4.1 Learning a Vector Field that Estimates a Gradient Field . . . . . 224
13.4.2 Turning the Gradient Field into a Generative Model . . . . . . . 226
13.5 Contractive Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . 229
14 Distributed Representations: Disentangling the Underlying Factors 230
14.1 Assumption of Underlying Factors . . . . . . . . . . . . . . . . . . . . . 230
14.2 Exponential Gain in Representational Efficiency from Distributed Repre-
sentations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
14.3 Exponential Gain in Representational Efficiency from Depth . . . . . . . 230
14.4 Additional Priors Regarding The Underlying Factors . . . . . . . . . . . 230
15 Confronting the Partition Function 231
15.1 Estimating the partition function . . . . . . . . . . . . . . . . . . . . . . 231
15.1.1 Annealed importance sampling . . . . . . . . . . . . . . . . . . . 233
15.1.2 Bridge sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
15.1.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
15.2 Stochastic maximum likelihood and contrastive divergence . . . . . . . . 237
15.3 Pseudolikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
15.4 Score matching and ratio matching . . . . . . . . . . . . . . . . . . . . . 246
15.5 Denoising score matching . . . . . . . . . . . . . . . . . . . . . . . . . . 248
15.6 Noise-contrastive estimation . . . . . . . . . . . . . . . . . . . . . . . . . 248
16 Approximate inference 251
16.1 Inference as optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 251
16.2 Expectation maximization . . . . . . . . . . . . . . . . . . . . . . . . . . 253
16.3 MAP inference: Sparse coding as a probabilistic model . . . . . . . . . . 254
16.4 Variational inference and learning . . . . . . . . . . . . . . . . . . . . . . 255
16.4.1 Discrete latent variables . . . . . . . . . . . . . . . . . . . . . . . 256
16.4.2 Calculus of variations . . . . . . . . . . . . . . . . . . . . . . . . 257
16.4.3 Continuous latent variables . . . . . . . . . . . . . . . . . . . . . 258
16.5 Stochastic inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
16.6 Learned approximate inference . . . . . . . . . . . . . . . . . . . . . . . 259
17 Deep generative models 260
17.1 Restricted Boltzmann machines . . . . . . . . . . . . . . . . . . . . . . . 260
17.2 Deep belief networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
17.3 Deep Boltzmann machines . . . . . . . . . . . . . . . . . . . . . . . . . . 263
17.3.1 Interesting properties . . . . . . . . . . . . . . . . . . . . . . . . 263
17.3.2 Variational learning with SML . . . . . . . . . . . . . . . . . . . 264
17.3.3 Layerwise pretraining . . . . . . . . . . . . . . . . . . . . . . . . 264
17.3.4 Multi-prediction deep Boltzmann machines . . . . . . . . . . . . 266
17.3.5 Centered deep Boltzmann machines . . . . . . . . . . . . . . . . 266
17.4 Boltzmann machines for real-valued data . . . . . . . . . . . . . . . . . . 266
17.4.1 Gaussian-Bernoulli RBMs . . . . . . . . . . . . . . . . . . . . . . 266
17.4.2 mcRBMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
17.4.3 Spike and slab restricted Boltzmann machines . . . . . . . . . . . 268
17.5 Convolutional Boltzmann machines . . . . . . . . . . . . . . . . . . . . . 268
17.6 Other Boltzmann machines . . . . . . . . . . . . . . . . . . . . . . . . . 269
17.7 Directed generative nets . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
17.7.1 Variational autoencoders . . . . . . . . . . . . . . . . . . . . . . 269
17.7.2 Generative adversarial networks . . . . . . . . . . . . . . . . . . 269
17.8 A generative view of autoencoders . . . . . . . . . . . . . . . . . . . . . 270
17.9 Generative stochastic networks . . . . . . . . . . . . . . . . . . . . . . . 270
18 Large scale deep learning 271
18.1 Fast CPU implementations . . . . . . . . . . . . . . . . . . . . . . . . . 271
18.2 GPU implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
18.3 Asynchronous parallel implementations . . . . . . . . . . . . . . . . . . . 271
18.4 Dynamically structured nets . . . . . . . . . . . . . . . . . . . . . . . . . 271
18.5 Model compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
19 Practical methodology 273
19.1 When to gather more data, control capacity, or change algorithms . . . 273
19.2 Machine Learning Methodology 101 . . . . . . . . . . . . . . . . . . . . 273
19.3 Manual hyperparameter tuning . . . . . . . . . . . . . . . . . . . . . . . 273
19.4 Hyper-parameter optimization algorithms . . . . . . . . . . . . . . . . . 273
19.5 Tricks of the Trade for Deep Learning . . . . . . . . . . . . . . . . . . . 275
19.5.1 Debugging Back-Prop . . . . . . . . . . . . . . . . . . . . . . . . 275
19.5.2 Automatic Differentiation and Symbolic Manipulations of Flow
Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
19.5.3 Momentum and Other Averaging Techniques as Cheap Second
Order Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
20 Applications 276
20.1 Computer vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
20.1.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
20.1.2 Convolutional nets . . . . . . . . . . . . . . . . . . . . . . . . . . 282
20.2 Speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
20.3 Natural language processing . . . . . . . . . . . . . . . . . . . . . . . . . 282
20.4 Structured outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
20.5 Other applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
Bibliography 283
Index 298
Acknowledgments
We would like to thank the following people who commented on our proposal for the
book and helped plan its contents and organization: Hugo Larochelle, Guillaume Alain,
Kyunghyun Cho, Çağlar Gülçehre, Razvan Pascanu, David Krueger and Thomas Rohée.
We would like to thank the following people who offered feedback on the content of
the book itself:
Introduction: Johannes Roith, Eric Morris, Ozan Çağlayan.
Math background chapters: Ilya Sutskever, Vincent Vanhoucke, Johannes Roith.
Linear algebra: Guillaume Alain, Dustin Webb, David Warde-Farley, Pierre Luc
Carrier, Li Yao, Thomas Rohée, Colby Toland, Amjad Almahairi, Sergey Oreshkov.
Probability: Rasmus Antti, Stephan Gouws, David Warde-Farley, Vincent Dumoulin,
Artem Oboturov, Li Yao.
Numerical: Meire Fortunato.
Convolutional nets: Guillaume Alain, David Warde-Farley, Mehdi Mirza, Çağlar
Gülçehre.
Partition function: Sam Bowman.
We also want to thank Jason Yosinski and Nicolas Chapados for contributing figures
(as noted in the captions).