Deep Learning
Yoshua Bengio
Ian Goodfellow
Aaron Courville
Contents
Acknowledgments vii
Notation ix
1 Introduction 1
1.1 Who Should Read This Book? . . . . . . . . . . . . . . . . . . . . 8
1.2 Historical Trends in Deep Learning . . . . . . . . . . . . . . . . . 11
I Applied Math and Machine Learning Basics 25
2 Linear Algebra 27
2.1 Scalars, Vectors, Matrices and Tensors . . . . . . . . . . . . . . . 27
2.2 Multiplying Matrices and Vectors . . . . . . . . . . . . . . . . . . 30
2.3 Identity and Inverse Matrices . . . . . . . . . . . . . . . . . . . . 31
2.4 Linear Dependence, Span, and Rank . . . . . . . . . . . . . . . . 32
2.5 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.6 Special Kinds of Matrices and Vectors . . . . . . . . . . . . . . . 35
2.7 Eigendecomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.8 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . 39
2.9 The Moore-Penrose Pseudoinverse . . . . . . . . . . . . . . . . . 40
2.10 The Trace Operator . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.11 Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.12 Example: Principal Components Analysis . . . . . . . . . . . . . 42
3 Probability and Information Theory 46
3.1 Why Probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . 49
3.4 Marginal Probability . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . 51
3.6 The Chain Rule of Conditional Probabilities . . . . . . . . . . . . 52
3.7 Independence and Conditional Independence . . . . . . . . . . . 52
3.8 Expectation, Variance, and Covariance . . . . . . . . . . . . . . . 53
3.9 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.10 Common Probability Distributions . . . . . . . . . . . . . . . . . 57
3.11 Useful Properties of Common Functions . . . . . . . . . . . . . . 62
3.12 Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.13 Technical Details of Continuous Variables . . . . . . . . . . . . . 64
3.14 Example: Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . 65
4 Numerical Computation 69
4.1 Overflow and Underflow . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Poor Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 Gradient-Based Optimization . . . . . . . . . . . . . . . . . . . . 71
4.4 Constrained Optimization . . . . . . . . . . . . . . . . . . . . . . 80
4.5 Example: Linear Least Squares . . . . . . . . . . . . . . . . . . . 82
5 Machine Learning Basics 84
5.1 Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2 Example: Linear Regression . . . . . . . . . . . . . . . . . . . . . 91
5.3 Generalization, Capacity, Overfitting and Underfitting . . . . . . 94
5.4 The No Free Lunch Theorem . . . . . . . . . . . . . . . . . . . . 99
5.5 Hyperparameters, Validation Sets and Cross-Validation . . . . . 101
5.6 Estimators, Bias, and Variance . . . . . . . . . . . . . . . . . . . 103
5.7 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . 111
5.8 Bayesian Statistics and Prior Probability Distributions . . . . . . 114
5.9 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.10 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . 125
5.11 Weakly Supervised Learning . . . . . . . . . . . . . . . . . . . . . 128
5.12 The Curse of Dimensionality and Statistical Limitations of Local
Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
II Modern Practical Deep Networks 141
6 Feedforward Deep Networks 143
6.1 Formalizing and Generalizing Neural Networks . . . . . . . . . . 144
6.2 Parametrizing a Learned Predictor . . . . . . . . . . . . . . . . . 148
6.3 Flow Graphs and Back-Propagation . . . . . . . . . . . . . . . . 158
6.4 Universal Approximation Properties and Depth . . . . . . . . . . 169
6.5 Feature / Representation Learning . . . . . . . . . . . . . . . . . 171
6.6 Piecewise Linear Hidden Units . . . . . . . . . . . . . . . . . . . 173
6.7 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
7 Regularization 176
7.1 Classical Regularization: Parameter Norm Penalty . . . . . . . . 177
7.2 Classical Regularization as Constrained Optimization . . . . . . . 184
7.3 Regularization from a Bayesian Perspective . . . . . . . . . . . . 185
7.4 Regularization and Under-Constrained Problems . . . . . . . . . 185
7.5 Dataset Augmentation . . . . . . . . . . . . . . . . . . . . . . . . 187
7.6 Classical Regularization as Noise Robustness . . . . . . . . . . . 188
7.7 Bagging and Other Ensemble Methods . . . . . . . . . . . . . . . 188
7.8 Early Stopping as a Form of Regularization . . . . . . . . . . . . 191
7.9 Parameter Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . 198
7.10 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . 198
7.11 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
7.12 Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . 202
7.13 Adversarial Training . . . . . . . . . . . . . . . . . . . . . . . . . 202
8 Optimization for Training Deep Models 205
8.1 Optimization for Model Training . . . . . . . . . . . . . . . . . . 205
8.2 Challenges in Optimization . . . . . . . . . . . . . . . . . . . . . 208
8.3 Optimization Algorithms . . . . . . . . . . . . . . . . . . . . . . . 215
8.4 Approximate Natural Gradient and Second-Order Methods . . . 219
8.5 Conjugate Gradients . . . . . . . . . . . . . . . . . . . . . . . . . 219
8.6 BFGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
8.7 Hints and Curriculum Learning . . . . . . . . . . . . . . . . . . . 221
9 Convolutional Networks 224
9.1 The Convolution Operation . . . . . . . . . . . . . . . . . . . . . 224
9.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
9.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
9.4 Convolution and Pooling as an Infinitely Strong Prior . . . . . . 235
9.5 Variants of the Basic Convolution Function . . . . . . . . . . . . 236
9.6 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
9.7 Efficient Convolution Algorithms . . . . . . . . . . . . . . . . . . 243
9.8 The Neuroscientific Basis for Convolutional Networks . . . . . . . 245
9.9 Convolutional Networks and the History of Deep Learning . . . . 248
10 Sequence Modeling: Recurrent and Recursive Nets 250
10.1 Unfolding Flow Graphs and Sharing Parameters . . . . . . . . . . 251
10.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . 252
10.3 Bidirectional RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . 258
10.4 Recursive Neural Networks . . . . . . . . . . . . . . . . . . . . . 260
10.5 Auto-Regressive Networks . . . . . . . . . . . . . . . . . . . . . . 261
10.6 Facing the Challenge of Long-Term Dependencies . . . . . . . . . 267
10.7 Handling Temporal Dependencies with N-Grams, HMMs, CRFs
and Other Graphical Models . . . . . . . . . . . . . . . . . . . . . 280
10.8 Combining Neural Networks and Search . . . . . . . . . . . . . . 290
11 Large Scale Deep Learning 303
11.1 Fast CPU Implementations . . . . . . . . . . . . . . . . . . . . . 303
11.2 GPU Implementations . . . . . . . . . . . . . . . . . . . . . . . . 303
11.3 Asynchronous Parallel Implementations . . . . . . . . . . . . . . 304
11.4 Model Compression . . . . . . . . . . . . . . . . . . . . . . . . . . 304
11.5 Dynamically Structured Nets . . . . . . . . . . . . . . . . . . . . 305
12 Practical Methodology 306
12.1 Basic Machine Learning Methodology . . . . . . . . . . . . . . . 306
12.2 Manual Hyperparameter Tuning . . . . . . . . . . . . . . . . . . 307
12.3 Hyperparameter Optimization Algorithms . . . . . . . . . . . . . . 307
12.4 Debugging Strategies . . . . . . . . . . . . . . . . . . . . . . . . . 309
13 Applications 311
13.1 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
13.2 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 318
13.3 Natural Language Processing and Neural Language Models . . . 319
13.4 Structured Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . 335
13.5 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 335
III Deep Learning Research 336
14 Structured Probabilistic Models for Deep Learning 338
14.1 The Challenge of Unstructured Modeling . . . . . . . . . . . . . . 339
14.2 Using Graphs to Describe Model Structure . . . . . . . . . . . . . 342
14.3 Advantages of Structured Modeling . . . . . . . . . . . . . . . . . 358
14.4 Learning About Dependencies . . . . . . . . . . . . . . . . . . . . 358
14.5 Inference and Approximate Inference Over Latent Variables . . . 359
14.6 The Deep Learning Approach to Structured Probabilistic Models 361
15 Monte Carlo Methods 365
15.1 Markov Chain Monte Carlo Methods . . . . . . . . . . . . . . . . 365
15.2 The Difficulty of Mixing Between Well-Separated Modes . . . . . 367
16 Linear Factor Models and Auto-Encoders 369
16.1 Regularized Auto-Encoders . . . . . . . . . . . . . . . . . . . . . 370
16.2 Representational Power, Layer Size and Depth . . . . . . . . . . 375
16.3 Reconstruction Distribution . . . . . . . . . . . . . . . . . . . . . 375
16.4 Linear Factor Models . . . . . . . . . . . . . . . . . . . . . . . . . 377
16.5 Probabilistic PCA and Factor Analysis . . . . . . . . . . . . . . . 378
16.6 Reconstruction Error as Log-Likelihood . . . . . . . . . . . . . . 381
16.7 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . 383
16.8 Denoising Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . 387
16.9 Contractive Auto-Encoders . . . . . . . . . . . . . . . . . . . . . 392
17 Representation Learning 395
17.1 Greedy Layerwise Unsupervised Pre-Training . . . . . . . . . . . 396
17.2 Transfer Learning and Domain Adaptation . . . . . . . . . . . . . 403
17.3 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . . 411
17.4 Semi-Supervised Learning and Disentangling Underlying Causal
Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
17.5 Assumption of Underlying Factors and Distributed Representation . . 414
17.6 Exponential Gain in Representational Efficiency from Distributed
Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
17.7 Exponential Gain in Representational Efficiency from Depth . . . 420
17.8 Priors Regarding The Underlying Factors . . . . . . . . . . . . . 422
18 The Manifold Perspective on Representation Learning 426
18.1 Manifold Interpretation of PCA and Linear Auto-Encoders . . . 434
18.2 Manifold Interpretation of Sparse Coding . . . . . . . . . . . . . 437
18.3 The Entropy Bias from Maximum Likelihood . . . . . . . . . . . 437
18.4 Manifold Learning via Regularized Auto-Encoders . . . . . . . . 438
18.5 Tangent Distance, Tangent-Prop, and Manifold Tangent Classifier 439
19 Confronting the Partition Function 443
19.1 The Log-Likelihood Gradient of Energy-Based Models . . . . . . 444
19.2 Stochastic Maximum Likelihood and Contrastive Divergence . . . 446
19.3 Pseudolikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
19.4 Score Matching and Ratio Matching . . . . . . . . . . . . . . . . 455
19.5 Denoising Score Matching . . . . . . . . . . . . . . . . . . . . . . 457
19.6 Noise-Contrastive Estimation . . . . . . . . . . . . . . . . . . . . 457
19.7 Estimating the Partition Function . . . . . . . . . . . . . . . . . 459
20 Approximate Inference 467
20.1 Inference as Optimization . . . . . . . . . . . . . . . . . . . . . . 467
20.2 Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . 470
20.3 MAP Inference: Sparse Coding as a Probabilistic Model . . . . . 471
20.4 Variational Inference and Learning . . . . . . . . . . . . . . . . . 472
20.5 Stochastic Inference . . . . . . . . . . . . . . . . . . . . . . . . . 476
20.6 Learned Approximate Inference . . . . . . . . . . . . . . . . . . . 476
21 Deep Generative Models 478
21.1 Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . 478
21.2 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . 481
21.3 Training Restricted Boltzmann Machines . . . . . . . . . . . . . . 484
21.4 Deep Belief Networks . . . . . . . . . . . . . . . . . . . . . . . . . 487
21.5 Deep Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . 489
21.6 Boltzmann Machines for Real-Valued Data . . . . . . . . . . . . . 502
21.7 Convolutional Boltzmann Machines . . . . . . . . . . . . . . . . . 504
21.8 Other Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . 505
21.9 Directed Generative Nets . . . . . . . . . . . . . . . . . . . . . . 505
21.10 A Generative View of Autoencoders . . . . . . . . . . . . . . . . 507
21.11 Generative Stochastic Networks . . . . . . . . . . . . . . . . . . . 513
21.12 Methodological Notes . . . . . . . . . . . . . . . . . . . . . . . . . 515
Bibliography 521
Index 554
Acknowledgments
This book would not have been possible without the contributions of many people.
We would like to thank those who commented on our proposal for the book
and helped plan its contents and organization: Hugo Larochelle, Guillaume Alain,
Kyunghyun Cho, Çağlar Gülçehre, Razvan Pascanu, David Krueger and Thomas
Rohée.
We would like to thank the people who offered feedback on the content of
the book itself. Some offered feedback on many chapters: Julian Serban, Laurent
Dinh, Guillaume Alain, Ilya Sutskever, Vincent Vanhoucke, David Warde-Farley,
Jurgen Van Gael, Dustin Webb, Johannes Roith, Ion Androutsopoulos, Pawel
Chilinski, Halis Sak, Frédéric Francis, and Grigory Sapunov.
We would also like to thank those who provided us with useful feedback on
individual chapters:
Chapter 1, Introduction: Johannes Roith, Eric Morris, Samira Ebrahimi,
Ozan Çağlayan, Martín Abadi, and Sebastien Bratieres.
Chapter 2, Linear Algebra: Pierre Luc Carrier, Li Yao, Thomas Rohée,
Colby Toland, Amjad Almahairi, and Sergey Oreshkov.
Chapter 3, Probability and Information Theory: Rasmus Antti, Stephan
Gouws, Vincent Dumoulin, Artem Oboturov, Li Yao, John Philip Anderson,
and Kai Arulkumaran.
Chapter 4, Numerical Computation: Meire Fortunato.
Chapter 5, Machine Learning Basics: Dzmitry Bahdanau, Meire Fortunato,
and Kelvin Xu.
Chapter 8, Optimization for Training Deep Models: Marcel Ackermann.
Chapter 9, Convolutional Networks: Mehdi Mirza, Çağlar Gülçehre, and
Martín Arjovsky.
Chapter 10, Sequence Modeling: Recurrent and Recursive Nets: Kelvin Xu,
Dmitriy Serdyuk, and Dongyu Shi.
Chapter 14, Structured Probabilistic Models for Deep Learning: Kelvin Xu.
Chapter 17, Representation Learning: Kelvin Xu.
Chapter 19, Confronting the Partition Function: Sam Bowman and Ozan
Çağlayan.
We also want to thank those who allowed us to reproduce images, figures
or data from their publications: David Warde-Farley, Matthew D. Zeiler, Rob
Fergus, Chris Olah, Jason Yosinski, Nicolas Chapados, and James Bergstra. We
indicate their contributions in the figure captions throughout the text.
Finally, we would like to thank Google for allowing Ian Goodfellow to work
on the book as his 20% project while working at Google. In particular, we would
like to thank Ian’s former manager, Greg Corrado, and his subsequent manager,
Samy Bengio, for their support of this effort.
Notation
This section provides a concise reference describing the notation used throughout
this book. If you are unfamiliar with any of these mathematical concepts, this
notation reference may seem intimidating. However, do not despair: we describe
most of these ideas in chapters 1-3.
Numbers and Arrays
a A scalar (integer or real) value with the name “a”
a A vector with the name “a”
A A matrix with the name “A”
A A tensor with the name “A”
I_n   Identity matrix with n rows and n columns
I Identity matrix with dimensionality implied by context
diag(a) A square, diagonal matrix with entries given by a
a A scalar random variable with the name “a”
a A vector-valued random variable with the name “a”
A A matrix-valued random variable with the name “A”
Sets
A A set with the name “A”
R The set of real numbers
{0, 1} The set containing 0 and 1
{0, 1, . . . , n} The set of all integers between 0 and n
[a, b] The real interval including a and b
(a, b] The real interval excluding a but including b
A\B Set subtraction, i.e., the elements of A that are not in B
Indexing
a_i   Element i of vector a, with indexing starting at 1
A_{i,j}   Element i, j of matrix A
A_{i,:}   Row i of matrix A
A_{:,i}   Column i of matrix A
A_{i,j,k}   Element (i, j, k) of a 3-D tensor A
A_{:,:,i}   2-D slice of a 3-D tensor
a_i   Element i of the random vector a
x^{(t)}   The t-th example (input) from a dataset
y^{(t)} or y^{(t)}   The target associated with x^{(t)} for supervised learning
X   The matrix of input examples, with one row per example x^{(t)}.
Linear Algebra Operations
A^⊤   Transpose of matrix A
A^+   Moore-Penrose pseudoinverse of A
A ⊙ B   Element-wise (Hadamard) product of A and B
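As a small worked illustration of the element-wise product notation (our own example, not drawn from the main text):

\[
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \odot \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}
= \begin{bmatrix} 1 \cdot 5 & 2 \cdot 6 \\ 3 \cdot 7 & 4 \cdot 8 \end{bmatrix}
= \begin{bmatrix} 5 & 12 \\ 21 & 32 \end{bmatrix},
\]

whereas the ordinary matrix product of the same two matrices is [[19, 22], [43, 50]].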
Calculus
dy/dx   Derivative of y with respect to x
∂y/∂x   Partial derivative of y with respect to x
∇_x y   Gradient of y with respect to x
∇_X y   Matrix derivatives of y with respect to X
∂f/∂x   Jacobian matrix J ∈ R^{m×n} of a function f : R^n → R^m
H(f)(x)   The Hessian matrix of f at input point x
∫ f(x) dx   Definite integral over the entire domain of x
∫_S f(x) dx   Definite integral with respect to x over the set S
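As a brief worked example of this notation (ours, purely for illustration): let f(x) = x_1^2 + 3x_2 for x ∈ R^2. Then

\[
\nabla_{\boldsymbol{x}} f(\boldsymbol{x}) = \begin{bmatrix} 2x_1 \\ 3 \end{bmatrix},
\qquad
H(f)(\boldsymbol{x}) = \begin{bmatrix} 2 & 0 \\ 0 & 0 \end{bmatrix},
\]

and for g(x) = (x_1 x_2, x_2)^⊤, a map from R^2 to R^2, the Jacobian is

\[
\frac{\partial \boldsymbol{g}}{\partial \boldsymbol{x}} = \begin{bmatrix} x_2 & x_1 \\ 0 & 1 \end{bmatrix} \in \mathbb{R}^{2 \times 2}.
\]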
Functions
G A graph with the name “G”
f ∘ g   Composition of the functions f and g
f(x; θ) A function of x parameterized by θ
log x Natural logarithm of x
σ(x)   Logistic sigmoid, 1/(1 + exp(−x))
ζ(x) Softplus, log(1 + exp(x))
||x||_p   L^p norm of x
Probability and Information Theory
a ⊥ b   The random variables a and b are independent.
a ⊥ b | c   They are conditionally independent given c.
E_{x∼P}[f(x)] or E f(x)   Expectation of f(x) with respect to P(x)
Var(f(x))   Variance of f(x) under P(x)
Cov(f(x), g(x))   Covariance of f(x) and g(x) under P(x)
H(x) Shannon entropy of the random variable x
D_KL(P ‖ Q)   Kullback-Leibler divergence of P and Q
Sometimes we write f(x), f(X), or f(X), when f is a function of a scalar
rather than a vector, matrix, or tensor. In this case, we mean to apply f to the
array element-wise. For example, if C = σ(X), then C_{i,j,k} = σ(X_{i,j,k}) for all
valid values of i, j, and k.
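To make the element-wise convention above concrete, here is a minimal sketch in Python with NumPy (our own illustration; the book itself does not assume any particular software):

import numpy as np

def sigmoid(x):
    # Logistic sigmoid, sigma(x) = 1 / (1 + exp(-x)), applied element-wise by NumPy.
    return 1.0 / (1.0 + np.exp(-x))

X = np.random.randn(2, 3, 4)   # a 3-D tensor of inputs
C = sigmoid(X)                 # C = sigma(X): every entry is transformed independently

# The element-wise interpretation: C[i, j, k] equals sigma(X[i, j, k]) for all valid indices.
assert np.allclose(C, 1.0 / (1.0 + np.exp(-X)))

Any array-aware library behaves the same way; the choice of NumPy here is incidental.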