Deep Learning
Yoshua Bengio
Ian Goodfellow
Aaron Courville
Contents
Acknowledgments vii
Notation ix
1 Introduction 1
1.1 Who Should Read This Book? . . . . . . . . . . . . . . . . . . . . 8
1.2 Historical Trends in Deep Learning . . . . . . . . . . . . . . . . . 11
I Applied Math and Machine Learning Basics 25
2 Linear Algebra 27
2.1 Scalars, Vectors, Matrices and Tensors . . . . . . . . . . . . . . . 27
2.2 Multiplying Matrices and Vectors . . . . . . . . . . . . . . . . . . 30
2.3 Identity and Inverse Matrices . . . . . . . . . . . . . . . . . . . . 31
2.4 Linear Dependence, Span, and Rank . . . . . . . . . . . . . . . . 32
2.5 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.6 Special Kinds of Matrices and Vectors . . . . . . . . . . . . . . . 35
2.7 Eigendecomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.8 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . 39
2.9 The Moore-Penrose Pseudoinverse . . . . . . . . . . . . . . . . . 40
2.10 The Trace Operator . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.11 Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.12 Example: Principal Components Analysis . . . . . . . . . . . . . 42
3 Probability and Information Theory 46
3.1 Why Probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . 49
3.4 Marginal Probability . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . 51
3.6 The Chain Rule of Conditional Probabilities . . . . . . . . . . . . 52
3.7 Independence and Conditional Independence . . . . . . . . . . . 52
3.8 Expectation, Variance, and Covariance . . . . . . . . . . . . . . . 53
3.9 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.10 Common Probability Distributions . . . . . . . . . . . . . . . . . 57
3.11 Useful Properties of Common Functions . . . . . . . . . . . . . . 62
3.12 Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.13 Technical Details of Continuous Variables . . . . . . . . . . . . . 64
3.14 Example: Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . 65
4 Numerical Computation 69
4.1 Overflow and Underflow . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Poor Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 Gradient-Based Optimization . . . . . . . . . . . . . . . . . . . . 71
4.4 Constrained Optimization . . . . . . . . . . . . . . . . . . . . . . 80
4.5 Example: Linear Least Squares . . . . . . . . . . . . . . . . . . . 82
5 Machine Learning Basics 84
5.1 Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2 Example: Linear Regression . . . . . . . . . . . . . . . . . . . . . 91
5.3 Generalization, Capacity, Overfitting and Underfitting . . . . . . 94
5.4 The No Free Lunch Theorem . . . . . . . . . . . . . . . . . . . . 99
5.5 Hyperparameters, Validation Sets and Cross-Validation . . . . . 101
5.6 Estimators, Bias, and Variance . . . . . . . . . . . . . . . . . . . 103
5.7 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . 111
5.8 Bayesian Statistics and Prior Probability Distributions . . . . . . 114
5.9 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.10 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . 125
5.11 Weakly Supervised Learning . . . . . . . . . . . . . . . . . . . . . 128
5.12 The Curse of Dimensionality and Statistical Limitations of Local
Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
II Modern Practical Deep Networks 141
6 Feedforward Deep Networks 143
6.1 Formalizing and Generalizing Neural Networks . . . . . . . . . . 144
6.2 Parametrizing a Learned Predictor . . . . . . . . . . . . . . . . . 148
6.3 Flow Graphs and Back-Propagation . . . . . . . . . . . . . . . . 158
6.4 Universal Approximation Properties and Depth . . . . . . . . . . 169
6.5 Feature / Representation Learning . . . . . . . . . . . . . . . . . 171
6.6 Piecewise Linear Hidden Units . . . . . . . . . . . . . . . . . . . 173
6.7 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
7 Regularization 176
7.1 Classical Regularization: Parameter Norm Penalty . . . . . . . . 177
7.2 Classical Regularization as Constrained Optimization . . . . . . . 184
7.3 Regularization from a Bayesian Perspective . . . . . . . . . . . . 185
7.4 Regularization and Under-Constrained Problems . . . . . . . . . 185
7.5 Dataset Augmentation . . . . . . . . . . . . . . . . . . . . . . . . 187
7.6 Classical Regularization as Noise Robustness . . . . . . . . . . . 188
7.7 Bagging and Other Ensemble Methods . . . . . . . . . . . . . . . 188
7.8 Early Stopping as a Form of Regularization . . . . . . . . . . . . 191
7.9 Parameter Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . 198
7.10 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . 198
7.11 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
7.12 Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . 202
7.13 Adversarial Training . . . . . . . . . . . . . . . . . . . . . . . . . 202
8 Optimization for Training Deep Models 205
8.1 Optimization for Model Training . . . . . . . . . . . . . . . . . . 205
8.2 Challenges in Optimization . . . . . . . . . . . . . . . . . . . . . 208
8.3 Optimization Algorithms . . . . . . . . . . . . . . . . . . . . . . . 215
8.4 Approximate Natural Gradient and Second-Order Methods . . . 219
8.5 Conjugate Gradients . . . . . . . . . . . . . . . . . . . . . . . . . 219
8.6 BFGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
8.7 Hints and Curriculum Learning . . . . . . . . . . . . . . . . . . . 221
9 Convolutional Networks 224
9.1 The Convolution Operation . . . . . . . . . . . . . . . . . . . . . 224
9.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
9.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
9.4 Convolution and Pooling as an Infinitely Strong Prior . . . . . . 235
9.5 Variants of the Basic Convolution Function . . . . . . . . . . . . 236
9.6 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
9.7 Efficient Convolution Algorithms . . . . . . . . . . . . . . . . . . 243
9.8 The Neuroscientific Basis for Convolutional Networks . . . . . . . 245
9.9 Convolutional Networks and the History of Deep Learning . . . . 248
10 Sequence Modeling: Recurrent and Recursive Nets 250
10.1 Unfolding Flow Graphs and Sharing Parameters . . . . . . . . . . 251
10.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . 252
10.3 Bidirectional RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . 258
10.4 Recursive Neural Networks . . . . . . . . . . . . . . . . . . . . . 260
10.5 Auto-Regressive Networks . . . . . . . . . . . . . . . . . . . . . . 261
10.6 Facing the Challenge of Long-Term Dependencies . . . . . . . . . 267
10.7 Handling Temporal Dependencies with N-Grams, HMMs, CRFs
and Other Graphical Models . . . . . . . . . . . . . . . . . . . . . 280
10.8 Combining Neural Networks and Search . . . . . . . . . . . . . . 290
11 Large Scale Deep Learning 303
11.1 Fast CPU Implementations . . . . . . . . . . . . . . . . . . . . . 303
11.2 GPU Implementations . . . . . . . . . . . . . . . . . . . . . . . . 303
11.3 Asynchronous Parallel Implementations . . . . . . . . . . . . . . 304
11.4 Model Compression . . . . . . . . . . . . . . . . . . . . . . . . . . 304
11.5 Dynamically Structured Nets . . . . . . . . . . . . . . . . . . . . 305
12 Practical Methodology 306
12.1 Basic Machine Learning Methodology . . . . . . . . . . . . . . . 306
12.2 Manual Hyperparameter Tuning . . . . . . . . . . . . . . . . . . 307
12.3 Hyperparameter Optimization Algorithms . . . . . . . . . . . . . . 307
12.4 Debugging Strategies . . . . . . . . . . . . . . . . . . . . . . . . . 309
13 Applications 311
13.1 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
13.2 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 318
13.3 Natural Language Processing and Neural Language Models . . . 319
13.4 Structured Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . 335
13.5 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 335
III Deep Learning Research 336
14 Structured Probabilistic Models for Deep Learning 338
14.1 The Challenge of Unstructured Modeling . . . . . . . . . . . . . . 339
14.2 Using Graphs to Describe Model Structure . . . . . . . . . . . . . 342
14.3 Advantages of Structured Modeling . . . . . . . . . . . . . . . . . 358
14.4 Learning About Dependencies . . . . . . . . . . . . . . . . . . . . 358
14.5 Inference and Approximate Inference Over Latent Variables . . . 359
14.6 The Deep Learning Approach to Structured Probabilistic Models 361
15 Monte Carlo Methods 365
15.1 Markov Chain Monte Carlo Methods . . . . . . . . . . . . . . . . 365
15.2 The Difficulty of Mixing Between Well-Separated Modes . . . . . 367
16 Linear Factor Models and Auto-Encoders 369
16.1 Regularized Auto-Encoders . . . . . . . . . . . . . . . . . . . . . 370
16.2 Representational Power, Layer Size and Depth . . . . . . . . . . 375
16.3 Reconstruction Distribution . . . . . . . . . . . . . . . . . . . . . 375
16.4 Linear Factor Models . . . . . . . . . . . . . . . . . . . . . . . . . 377
16.5 Probabilistic PCA and Factor Analysis . . . . . . . . . . . . . . . 378
16.6 Reconstruction Error as Log-Likelihood . . . . . . . . . . . . . . 381
16.7 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . 383
16.8 Denoising Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . 387
16.9 Contractive Auto-Encoders . . . . . . . . . . . . . . . . . . . . . 392
17 Representation Learning 395
17.1 Greedy Layerwise Unsupervised Pre-Training . . . . . . . . . . . 396
17.2 Transfer Learning and Domain Adaptation . . . . . . . . . . . . . 403
17.3 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . . 411
17.4 Semi-Supervised Learning and Disentangling Underlying Causal
Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
17.5 Assumption of Underlying Factors and Distributed Representation . . 414
17.6 Exponential Gain in Representational Efficiency from Distributed
Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
17.7 Exponential Gain in Representational Efficiency from Depth . . . 420
17.8 Priors Regarding The Underlying Factors . . . . . . . . . . . . . 422
18 The Manifold Perspective on Representation Learning 426
18.1 Manifold Interpretation of PCA and Linear Auto-Encoders . . . 434
18.2 Manifold Interpretation of Sparse Coding . . . . . . . . . . . . . 437
18.3 The Entropy Bias from Maximum Likelihood . . . . . . . . . . . 437
18.4 Manifold Learning via Regularized Auto-Encoders . . . . . . . . 438
18.5 Tangent Distance, Tangent-Prop, and Manifold Tangent Classifier 439
19 Confronting the Partition Function 443
19.1 The Log-Likelihood Gradient of Energy-Based Models . . . . . . 444
19.2 Stochastic Maximum Likelihood and Contrastive Divergence . . . 446
19.3 Pseudolikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
19.4 Score Matching and Ratio Matching . . . . . . . . . . . . . . . . 455
19.5 Denoising Score Matching . . . . . . . . . . . . . . . . . . . . . . 457
19.6 Noise-Contrastive Estimation . . . . . . . . . . . . . . . . . . . . 457
19.7 Estimating the Partition Function . . . . . . . . . . . . . . . . . 459
20 Approximate Inference 467
20.1 Inference as Optimization . . . . . . . . . . . . . . . . . . . . . . 467
20.2 Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . 470
20.3 MAP Inference: Sparse Coding as a Probabilistic Model . . . . . 471
20.4 Variational Inference and Learning . . . . . . . . . . . . . . . . . 472
20.5 Stochastic Inference . . . . . . . . . . . . . . . . . . . . . . . . . 476
20.6 Learned Approximate Inference . . . . . . . . . . . . . . . . . . . 476
21 Deep Generative Models 478
21.1 Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . 478
21.2 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . 481
21.3 Training Restricted Boltzmann Machines . . . . . . . . . . . . . . 484
21.4 Deep Belief Networks . . . . . . . . . . . . . . . . . . . . . . . . . 487
21.5 Deep Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . 489
21.6 Boltzmann Machines for Real-Valued Data . . . . . . . . . . . . . 502
21.7 Convolutional Boltzmann Machines . . . . . . . . . . . . . . . . . 504
21.8 Other Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . 505
21.9 Directed Generative Nets . . . . . . . . . . . . . . . . . . . . . . 505
21.10 A Generative View of Autoencoders . . . . . . . . . . . . . . . . 507
21.11 Generative Stochastic Networks . . . . . . . . . . . . . . . . . . . 513
21.12 Methodological Notes . . . . . . . . . . . . . . . . . . . . . . . . . 515
Bibliography 521
Index 554
Acknowledgments
This book would not have been possible without the contributions of many people.
We would like to thank those who commented on our proposal for the book
and helped plan its contents and organization: Hugo Larochelle, Guillaume Alain,
Kyunghyun Cho, Çağlar Gülçehre, Razvan Pascanu, David Krueger and Thomas
Rohée.
We would like to thank the people who offered feedback on the content of
the book itself. Some offered feedback on many chapters: Julian Serban, Laurent
Dinh, Guillaume Alain, Ilya Sutskever, Vincent Vanhoucke, David Warde-Farley,
Jurgen Van Gael, Dustin Webb, Johannes Roith, Ion Androutsopoulos, Pawel
Chilinski, Halis Sak, Frédéric Francis, and Grigory Sapunov.
We would also like to thank those who provided us with useful feedback on
individual chapters:
Chapter 1, Introduction: Johannes Roith, Eric Morris, Samira Ebrahimi,
Ozan Çağlayan, Martín Abadi, and Sebastien Bratieres.
Chapter 2, Linear Algebra: Pierre Luc Carrier, Li Yao, Thomas Rohée,
Colby Toland, Amjad Almahairi, and Sergey Oreshkov.
Chapter 3, Probability and Information Theory: Rasmus Antti, Stephan
Gouws, Vincent Dumoulin, Artem Oboturov, Li Yao, John Philip Anderson,
and Kai Arulkumaran.
Chapter 4, Numerical Computation: Meire Fortunato.
Chapter 5, Machine Learning Basics: Dzmitry Bahdanau, Meire Fortunato,
and Kelvin Xu.
Chapter 8, Optimization for Training Deep Models: Marcel Ackermann.
Chapter 9, Convolutional Networks: Mehdi Mirza, Çağlar Gülçehre, and
Martín Arjovsky.
Chapter 10, Sequence Modeling: Recurrent and Recursive Nets: Kelvin Xu,
Dmitriy Serdyuk, and Dongyu Shi.
Chapter 14, Structured Probabilistic Models for Deep Learning: Kelvin Xu.
Chapter 17, Representation Learning: Kelvin Xu.
Chapter 19, Confronting the Partition Function: Sam Bowman and Ozan
Çağlayan.
We also want to thank those who allowed us to reproduce images, figures
or data from their publications: David Warde-Farley, Matthew D. Zeiler, Rob
Fergus, Chris Olah, Jason Yosinski, Nicolas Chapados, and James Bergstra. We
indicate their contributions in the figure captions throughout the text.
Finally, we would like to thank Google for allowing Ian Goodfellow to work
on the book as his 20% project while working at Google. In particular, we would
like to thank Ian’s former manager, Greg Corrado, and his subsequent manager,
Samy Bengio, for their support of this effort.
Notation
This section provides a concise reference describing the notation used throughout
this book. If you are unfamiliar with any of these mathematical concepts, this
notation reference may seem intimidating. However, do not despair: we describe
most of these ideas in chapters 1-3.
Numbers and Arrays
a A scalar (integer or real) value with the name “a”
a A vector with the name “a”
A A matrix with the name “A”
A A tensor with the name “A”
I_n   Identity matrix with n rows and n columns
I Identity matrix with dimensionality implied by context
diag(a) A square, diagonal matrix with entries given by a
a A scalar random variable with the name “a”
a A vector-valued random variable with the name “a”
A A matrix-valued random variable with the name “A”
Sets
A A set with the name “A”
R The set of real numbers
{0, 1} The set containing 0 and 1
{0, 1, . . . , n} The set of all integers between 0 and n
[a, b] The real interval including a and b
(a, b] The real interval excluding a but including b
A\B Set subtraction, i.e., the elements of A that are not in B
Indexing
a_i   Element i of vector a, with indexing starting at 1
A_{i,j}   Element i, j of matrix A
A_{i,:}   Row i of matrix A
A_{:,i}   Column i of matrix A
A_{i,j,k}   Element (i, j, k) of a 3-D tensor A
A_{:,:,i}   2-D slice of a 3-D tensor
a_i   Element i of the random vector a
x^{(t)}   The t-th example (input) from a dataset
y^{(t)} or y^{(t)}   The target associated with x^{(t)} for supervised learning
X   The matrix of input examples, with one row per example x^{(t)}.
Linear Algebra Operations
A^⊤   Transpose of matrix A
A^+   Moore-Penrose pseudoinverse of A
A ⊙ B   Element-wise (Hadamard) product of A and B
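As a small worked illustration of the element-wise product notation (our own example, not drawn from the main text):

\[
\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \odot \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}
= \begin{bmatrix} 1 \cdot 5 & 2 \cdot 6 \\ 3 \cdot 7 & 4 \cdot 8 \end{bmatrix}
= \begin{bmatrix} 5 & 12 \\ 21 & 32 \end{bmatrix},
\]

whereas the ordinary matrix product of the same two matrices is [[19, 22], [43, 50]].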
Calculus
dy/dx   Derivative of y with respect to x
∂y/∂x   Partial derivative of y with respect to x
∇_x y   Gradient of y with respect to x
∇_X y   Matrix derivatives of y with respect to X
∂f/∂x   Jacobian matrix J ∈ R^{m×n} of a function f : R^n → R^m
H(f)(x)   The Hessian matrix of f at input point x
∫ f(x) dx   Definite integral over the entire domain of x
∫_S f(x) dx   Definite integral with respect to x over the set S
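As a brief worked example of this notation (ours, purely for illustration): let f(x) = x_1^2 + 3x_2 for x ∈ R^2. Then

\[
\nabla_{\boldsymbol{x}} f(\boldsymbol{x}) = \begin{bmatrix} 2x_1 \\ 3 \end{bmatrix},
\qquad
H(f)(\boldsymbol{x}) = \begin{bmatrix} 2 & 0 \\ 0 & 0 \end{bmatrix},
\]

and for g(x) = (x_1 x_2, x_2)^⊤, a map from R^2 to R^2, the Jacobian is

\[
\frac{\partial \boldsymbol{g}}{\partial \boldsymbol{x}} = \begin{bmatrix} x_2 & x_1 \\ 0 & 1 \end{bmatrix} \in \mathbb{R}^{2 \times 2}.
\]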
Functions
G A graph with the name “G”
f ∘ g   Composition of the functions f and g
f(x; θ) A function of x parameterized by θ
log x Natural logarithm of x
σ(x)   Logistic sigmoid, 1/(1 + exp(−x))
ζ(x) Softplus, log(1 + exp(x))
||x||_p   L^p norm of x
Probability and Information Theory
a ⊥ b   The random variables a and b are independent.
a ⊥ b | c   They are conditionally independent given c.
E_{x∼P}[f(x)] or E f(x)   Expectation of f(x) with respect to P(x)
Var(f(x))   Variance of f(x) under P(x)
Cov(f(x), g(x))   Covariance of f(x) and g(x) under P(x)
H(x) Shannon entropy of the random variable x
D_KL(P ‖ Q)   Kullback-Leibler divergence of P and Q
Sometimes we write f(x), f(X), or f(X), when f is a function of a scalar
rather than a vector, matrix, or tensor. In this case, we mean to apply f to the
array element-wise. For example, if C = σ(X), then C_{i,j,k} = σ(X_{i,j,k}) for all
valid values of i, j, and k.
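To make the element-wise convention above concrete, here is a minimal sketch in Python with NumPy (our own illustration; the book itself does not assume any particular software):

import numpy as np

def sigmoid(x):
    # Logistic sigmoid, sigma(x) = 1 / (1 + exp(-x)), applied element-wise by NumPy.
    return 1.0 / (1.0 + np.exp(-x))

X = np.random.randn(2, 3, 4)   # a 3-D tensor of inputs
C = sigmoid(X)                 # C = sigma(X): every entry is transformed independently

# The element-wise interpretation: C[i, j, k] equals sigma(X[i, j, k]) for all valid indices.
assert np.allclose(C, 1.0 / (1.0 + np.exp(-X)))

Any array-aware library behaves the same way; the choice of NumPy here is incidental.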