Deep Learning
Yoshua Bengio
Ian J. Goodfellow
Aaron Courville
March 30, 2015
Table of Contents
Acknowledgments 1
Notation 2
1 Introduction 4
1.1 Who Should Read This Book? . . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Historical Trends in Deep Learning . . . . . . . . . . . . . . . . . . . . . 12
I Applied math and machine learning basics 18
2 Linear Algebra 20
2.1 Scalars, Vectors, Matrices and Tensors . . . . . . . . . . . . . . . . . . . 20
2.2 Multiplying Matrices and Vectors . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Identity and Inverse Matrices . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Linear Dependence, Span, and Rank . . . . . . . . . . . . . . . . . . . . 25
2.5 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Special Kinds of Matrices and Vectors . . . . . . . . . . . . . . . . . . . 28
2.7 Eigendecomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.8 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . 30
2.9 The Trace Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.10 Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.11 Example: Principal Components Analysis . . . . . . . . . . . . . . . . . 32
3 Probability and Information Theory 35
3.1 Why Probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.1 Discrete Variables and Probability Mass Functions . . . . . . . . 38
3.3.2 Continuous Variables and Probability Density Functions . . . . . 38
3.4 Marginal Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6 The Chain Rule of Conditional Probabilities . . . . . . . . . . . . . . . . 40
3.7 Independence and Conditional Independence . . . . . . . . . . . . . . . 40
3.8 Expectation, Variance, and Covariance . . . . . . . . . . . . . . . . . . . 41
3.9 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.10 Common Probability Distributions . . . . . . . . . . . . . . . . . . . . . 44
3.10.1 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . . 44
3.10.2 Multinoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . 44
3.10.3 Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . 45
3.10.4 Dirac Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.10.5 Mixtures of Distributions and Gaussian Mixture . . . . . . . . . 48
3.11 Useful Properties of Common Functions . . . . . . . . . . . . . . . . . . 48
3.12 Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.13 Technical Details of Continuous Variables . . . . . . . . . . . . . . . . . 51
3.14 Example: Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4 Numerical Computation 56
4.1 Overflow and Underflow . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Poor Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Gradient-Based Optimization . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Constrained Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5 Example: Linear Least Squares . . . . . . . . . . . . . . . . . . . . . . . 68
5 Machine Learning Basics 70
5.1 Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.1 The Task, T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.2 The Performance Measure, P . . . . . . . . . . . . . . . . . . . . 72
5.1.3 The Experience, E . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Example: Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3 Generalization, Capacity, Overfitting and Underfitting . . . . . . . . . . 76
5.3.1 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.2 Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.3 Occam’s Razor, Underfitting and Overfitting . . . . . . . . . . . 78
5.4 Estimating and Monitoring Generalization Error . . . . . . . . . . . . . 81
5.5 Estimators, Bias, and Variance . . . . . . . . . . . . . . . . . . . . . . . 83
5.5.1 Point Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.5.2 Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5.3 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.5.4 Trading off Bias and Variance and the Mean Squared Error . . . 85
5.5.5 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.6 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . 87
5.6.1 Properties of Maximum Likelihood . . . . . . . . . . . . . . . . . 87
5.6.2 Regularized Likelihood . . . . . . . . . . . . . . . . . . . . . . . . 87
5.7 Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.8 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.8.1 Estimating Conditional Expectation by Minimizing Squared Error 88
5.8.2 Estimating Probabilities or Conditional Probabilities by Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.9 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.9.1 Principal Components Analysis . . . . . . . . . . . . . . . . . . . 91
5.10 Weakly Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.11 The Smoothness Prior, Local Generalization and Non-Parametric Models 93
5.12 Manifold Learning and the Curse of Dimensionality . . . . . . . . . . . . 97
5.13 Challenges of High-Dimensional Distributions . . . . . . . . . . . . . . . 100
II Modern practical deep networks 102
6 Feedforward Deep Networks 104
6.1 Formalizing and Generalizing Neural Networks . . . . . . . . . . . . . . 104
6.2 Parametrizing a Learned Predictor . . . . . . . . . . . . . . . . . . . . . 108
6.2.1 Family of Functions . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.2.2 Loss Function and Conditional Log-Likelihood . . . . . . . . . . 109
6.2.3 Training Criterion and Regularizer . . . . . . . . . . . . . . . . . 115
6.2.4 Optimization Procedure . . . . . . . . . . . . . . . . . . . . . . . 116
6.3 Flow Graphs and Back-Propagation . . . . . . . . . . . . . . . . . . . . 117
6.3.1 Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.3.2 Back-Propagation in an MLP . . . . . . . . . . . . . . . . . . . . 119
6.3.3 Back-Propagation in a General Flow Graph . . . . . . . . . . . . 120
6.4 Universal Approximation Properties and Depth . . . . . . . . . . . . . . 126
6.5 Feature / Representation Learning . . . . . . . . . . . . . . . . . . . . . 128
6.6 Piecewise Linear Hidden Units . . . . . . . . . . . . . . . . . . . . . . . 129
6.7 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7 Regularization 131
7.1 Classical Regularization: Parameter Norm Penalty . . . . . . . . . . . . 132
7.1.1 L^2 Parameter Regularization . . . . . . . . . . . . . . . . . . . . 133
7.1.2 L^1 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.1.3 L^∞ Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.2 Classical Regularization as Constrained Optimization . . . . . . . . . . 138
7.3 Regularization from a Bayesian Perspective . . . . . . . . . . . . . . . . 139
7.4 Regularization and Under-Constrained Problems . . . . . . . . . . . . . 139
7.5 Dataset Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.6 Classical Regularization as Noise Robustness . . . . . . . . . . . . . . . 142
7.7 Bagging and Other Ensemble Methods . . . . . . . . . . . . . . . . . . . 142
7.8 Early Stopping as a Form of Regularization . . . . . . . . . . . . . . . . 143
7.9 Parameter Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.10 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.11 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.12 Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
8 Optimization for Training Deep Models 156
8.1 Optimization for Model Training . . . . . . . . . . . . . . . . . . . . . . 156
8.1.1 Empirical Risk Minimization . . . . . . . . . . . . . . . . . . . . 156
8.1.2 Surrogate Loss Functions . . . . . . . . . . . . . . . . . . . . . . 157
8.1.3 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
8.1.4 Batches and Minibatches . . . . . . . . . . . . . . . . . . . . . . 158
8.1.5 Data Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.2 Challenges in Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.2.1 Local Minima . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.2.2 Ill-Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.2.3 Plateaus, Saddle Points, and Other Flat Regions . . . . . . . . . 158
8.2.4 Cliffs and Exploding Gradients . . . . . . . . . . . . . . . . . . . 158
8.2.5 Vanishing and Exploding Gradients - An Introduction to the Issue
of Learning Long-Term Dependencies . . . . . . . . . . . . . . . 161
8.3 Optimization Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.3.1 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.3.2 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . 165
8.3.3 Momentum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
8.3.4 Adagrad . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.3.5 RMSprop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.3.6 Adadelta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.3.7 No Pesky Learning Rates . . . . . . . . . . . . . . . . . . . . . . 168
8.4 Approximate Natural Gradient and Second-Order Methods . . . . . . . 168
8.5 Conjugate Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.6 BFGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.6.1 New . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.6.2 Optimization Strategies and Meta-Algorithms . . . . . . . . . . . 168
8.6.3 Coordinate Descent . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.6.4 Greedy Supervised Pre-training . . . . . . . . . . . . . . . . . . . 169
8.7 Hints and Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . 169
9 Convolutional Networks 173
9.1 The Convolution Operation . . . . . . . . . . . . . . . . . . . . . . . . . 173
9.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
9.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
9.4 Variants of the Basic Convolution Function . . . . . . . . . . . . . . . . 183
9.5 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
9.6 Efficient Convolution Algorithms . . . . . . . . . . . . . . . . . . . . . . 190
9.7 Deep Learning History . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
10 Sequence Modeling: Recurrent and Recursive Nets 191
10.1 Unfolding Flow Graphs and Sharing Parameters . . . . . . . . . . . . . 191
10.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 193
10.2.1 Computing the Gradient in a Recurrent Neural Network . . . . . 195
10.2.2 Recurrent Networks as Generative Directed Acyclic Models . . . 197
10.2.3 RNNs to Represent Conditional Probability Distributions . . . . 199
10.3 Bidirectional RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
10.4 Recursive Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 204
10.5 Auto-Regressive Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 205
10.5.1 Logistic Auto-Regressive Networks . . . . . . . . . . . . . . . . . 206
10.5.2 Neural Auto-Regressive Networks . . . . . . . . . . . . . . . . . . 207
10.5.3 NADE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
10.6 Facing the Challenge of Long-Term Dependencies . . . . . . . . . . . . . 210
10.6.1 Echo State Networks: Choosing Weights to Make Dynamics Barely
Contractive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
10.6.2 Combining Short and Long Paths in the Unfolded Flow Graph . 212
10.6.3 Leaky Units and a Hierarchy of Different Time Scales . . . . . . 213
10.6.4 The Long-Short-Term-Memory Architecture and Other Gated RNNs . . 214
10.6.5 Deep RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
10.6.6 Better Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 218
10.6.7 Clipping Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . 219
10.6.8 Regularizing to Encourage Information Flow . . . . . . . . . . . 220
10.6.9 Organizing the State at Multiple Time Scales . . . . . . . . . . . 221
10.7 Handling Temporal Dependencies with N-Grams, HMMs, CRFs and Other
Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
10.7.1 N-grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
10.7.2 Efficient Marginalization and Inference for Temporally Structured
Outputs by Dynamic Programming . . . . . . . . . . . . . . . . . 223
10.7.3 HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
10.7.4 CRFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
10.8 Combining Neural Networks and Search . . . . . . . . . . . . . . . . . . 231
10.8.1 Approximate Search . . . . . . . . . . . . . . . . . . . . . . . . . 233
11 Large scale deep learning 236
11.1 Fast CPU Implementations . . . . . . . . . . . . . . . . . . . . . . . . . 236
11.2 GPU Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
11.3 Asynchronous Parallel Implementations . . . . . . . . . . . . . . . . . . 236
11.4 Model Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
11.5 Dynamically Structured Nets . . . . . . . . . . . . . . . . . . . . . . . . 237
12 Practical methodology 238
12.1 When to Gather More Data, Control Capacity, or Change Algorithms . 238
12.2 Machine Learning Methodology 101 . . . . . . . . . . . . . . . . . . . . 238
12.3 Manual Hyperparameter Tuning . . . . . . . . . . . . . . . . . . . . . . 238
12.4 Hyper-parameter Optimization Algorithms . . . . . . . . . . . . . . . . 238
12.5 Tricks of the Trade for Deep Learning . . . . . . . . . . . . . . . . . . . 240
12.5.1 Debugging Back-Prop . . . . . . . . . . . . . . . . . . . . . . . . 240
12.5.2 Automatic Differentiation and Symbolic Manipulations of Flow Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
12.5.3 Momentum and Other Averaging Techniques as Cheap Second
Order Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
13 Applications 241
13.1 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
13.1.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
13.1.2 Convolutional Nets . . . . . . . . . . . . . . . . . . . . . . . . . . 247
13.2 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
13.3 Natural Language Processing and Neural Language Models . . . . . . . 248
13.3.1 The Basics of Neural Language Models . . . . . . . . . . . . . . 248
13.3.2 The Problem With N-Grams . . . . . . . . . . . . . . . . . . . . 248
13.3.3 How Neural Language Models can Generalize Better . . . . . . . 250
13.3.4 Neural Machine Translation . . . . . . . . . . . . . . . . . . . . . 252
13.3.5 High-Dimensional Outputs . . . . . . . . . . . . . . . . . . . . . 252
13.3.6 Combining Neural Language Models with N-Grams . . . . . . . 252
13.4 Structured Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
13.5 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
III Deep learning research 253
14 Structured Probabilistic Models: A Deep Learning Perspective 255
14.1 The Challenge of Unstructured Modeling . . . . . . . . . . . . . . . . . 256
14.2 Using Graphs to Describe Model Structure . . . . . . . . . . . . . . . . 259
14.2.1 Directed Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
14.2.2 Undirected Models . . . . . . . . . . . . . . . . . . . . . . . . . . 261
14.2.3 The Partition Function . . . . . . . . . . . . . . . . . . . . . . . 263
14.2.4 Energy-Based Models . . . . . . . . . . . . . . . . . . . . . . . . 265
14.2.5 Separation and D-Separation . . . . . . . . . . . . . . . . . . . . 266
14.2.6 Converting Between Undirected and Directed Graphs . . . . . . 267
14.2.7 Marginalizing Variables out of a Graph . . . . . . . . . . . . . . 272
14.2.8 Factor Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
14.3 Advantages of Structured Modeling . . . . . . . . . . . . . . . . . . . . . 272
14.4 Learning About Dependencies . . . . . . . . . . . . . . . . . . . . . . . . 273
14.4.1 Latent Variables Versus Structure Learning . . . . . . . . . . . . 273
14.4.2 Latent Variables for Feature Learning . . . . . . . . . . . . . . . 274
14.5 Inference and Approximate Inference Over Latent Variables . . . . . . . 274
14.5.1 Reparametrization Trick . . . . . . . . . . . . . . . . . . . . . . . 274
14.6 The Deep Learning Approach to Structured Probabilistic Modeling . . . 276
14.6.1 Example: The Restricted Boltzmann Machine . . . . . . . . . . . 277
15 Monte Carlo Methods 279
15.1 Markov Chain Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . 279
15.1.1 Markov Chain Theory . . . . . . . . . . . . . . . . . . . . . . . . 280
16 Linear Factor Models and Auto-Encoders 282
16.1 Regularized Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . 283
16.2 Representational Power, Layer Size and Depth . . . . . . . . . . . . . . 287
16.3 Reconstruction Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 288
16.4 Linear Factor Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
16.5 Probabilistic PCA and Factor Analysis . . . . . . . . . . . . . . . . . . . 289
16.5.1 ICA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
16.5.2 Sparse Coding as a Generative Model . . . . . . . . . . . . . . . 292
16.6 Probabilistic Interpretation of Reconstruction Error as Log-Likelihood . 293
16.7 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
16.7.1 Sparse Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . 296
16.7.2 Predictive Sparse Decomposition . . . . . . . . . . . . . . . . . . 298
16.8 Denoising Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . . 298
16.8.1 Learning a Vector Field that Estimates a Gradient Field . . . . . 301
16.9 Contractive Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . 304
17 Representation Learning 307
17.1 Greedy Layerwise Unsupervised Pre-Training . . . . . . . . . . . . . . . 308
17.1.1 Why Does Unsupervised Pre-Training Work? . . . . . . . . . . . 311
17.2 Transfer Learning and Domain Adaptation . . . . . . . . . . . . . . . . 315
17.3 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 322
17.4 Causality, Semi-Supervised Learning and Disentangling the Underlying
Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
17.5 Assumption of Underlying Factors and Distributed Representation . . . 325
17.6 Exponential Gain in Representational Efficiency from Distributed Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
17.7 Exponential Gain in Representational Efficiency from Depth . . . . . . . 330
17.8 Priors Regarding The Underlying Factors . . . . . . . . . . . . . . . . . 333
18 The Manifold Perspective on Representation Learning 336
18.1 Manifold Interpretation of PCA and Linear Auto-Encoders . . . . . . . 344
18.2 Manifold Interpretation of Sparse Coding . . . . . . . . . . . . . . . . . 347
18.3 Manifold Learning via Regularized Auto-Encoders . . . . . . . . . . . . 347
18.4 Tangent Distance, Tangent-Prop, and Manifold Tangent Classifier . . . 348
19 Confronting the Partition Function 352
19.1 Estimating the Partition Function . . . . . . . . . . . . . . . . . . . . . 352
19.1.1 Annealed Importance Sampling . . . . . . . . . . . . . . . . . . . 354
19.1.2 Bridge Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
19.1.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
19.2 Stochastic Maximum Likelihood and Contrastive Divergence . . . . . . . 358
19.3 Pseudolikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
19.4 Score Matching and Ratio Matching . . . . . . . . . . . . . . . . . . . . 367
19.5 Denoising Score Matching . . . . . . . . . . . . . . . . . . . . . . . . . . 369
19.6 Noise-Contrastive Estimation . . . . . . . . . . . . . . . . . . . . . . . . 370
20 Approximate inference 372
20.1 Inference as Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 372
20.2 Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . . . . . 375
20.3 MAP Inference: Sparse Coding as a Probabilistic Model . . . . . . . . . 375
20.4 Variational Inference and Learning . . . . . . . . . . . . . . . . . . . . . 376
20.4.1 Discrete Latent Variables . . . . . . . . . . . . . . . . . . . . . . 378
20.4.2 Calculus of Variations . . . . . . . . . . . . . . . . . . . . . . . . 378
20.4.3 Continuous Latent Variables . . . . . . . . . . . . . . . . . . . . 380
20.5 Stochastic Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
20.6 Learned Approximate Inference . . . . . . . . . . . . . . . . . . . . . . . 380
21 Deep Generative Models 382
21.1 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . 382
21.1.1 Conditional Distributions . . . . . . . . . . . . . . . . . . . . . . 384
21.1.2 RBM Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . 385
21.2 Training Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . 385
21.2.1 Contrastive Divergence Training of the RBM . . . . . . . . . . . 388
21.2.2 Stochastic Maximum Likelihood (Persistent Contrastive Divergence) for the RBM . . . . . . . . . . . . . . . . . . . . . . . . . 388
21.2.3 Other Inductive Principles . . . . . . . . . . . . . . . . . . . . . . 388
21.3 Deep Belief Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
21.4 Deep Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . . 391
21.4.1 Interesting Properties . . . . . . . . . . . . . . . . . . . . . . . . 395
21.4.2 Mean Field Inference in the DBM . . . . . . . . . . . . . . . . . 395
21.4.3 Variational Expectation Maximization . . . . . . . . . . . . . . . 396
21.4.4 Variational Learning With SML . . . . . . . . . . . . . . . . . . 396
21.4.5 Layerwise Pretraining . . . . . . . . . . . . . . . . . . . . . . . . 397
21.4.6 Multi-Prediction Deep Boltzmann Machines . . . . . . . . . . . . 398
21.4.7 Centered Deep Boltzmann Machines . . . . . . . . . . . . . . . . 398
21.5 Boltzmann Machines for Real-Valued Data . . . . . . . . . . . . . . . . 400
21.5.1 Gaussian-Bernoulli RBMs . . . . . . . . . . . . . . . . . . . . . . 400
21.5.2 mcRBMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
21.5.3 mPoT Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
21.5.4 Spike and Slab Restricted Boltzmann Machines . . . . . . . . . . 401
21.6 Convolutional Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . 401
21.7 Other Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . 402
21.8 Directed Generative Nets . . . . . . . . . . . . . . . . . . . . . . . . . . 403
21.8.1 Sigmoid Belief Nets . . . . . . . . . . . . . . . . . . . . . . . . . 403
21.8.2 Variational Autoencoders . . . . . . . . . . . . . . . . . . . . . . 403
21.8.3 Variational Interpretation of PSD . . . . . . . . . . . . . . . . . . 403
21.8.4 Generative Adversarial Networks . . . . . . . . . . . . . . . . . . 403
21.9 A Generative View of Autoencoders . . . . . . . . . . . . . . . . . . . . 404
21.9.1 Markov Chain Associated with any Denoising Auto-Encoder . . 405
21.9.2 Clamping and Conditional Sampling . . . . . . . . . . . . . . . . 407
21.9.3 Walk-Back Training Procedure . . . . . . . . . . . . . . . . . . . 408
21.10 Generative Stochastic Networks . . . . . . . . . . . . . . . . . . . . . . 409
21.10.1 Discriminant GSNs . . . . . . . . . . . . . . . . . . . . . . . . . . 410
21.11 Methodological Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
Bibliography 413
Index 437
Acknowledgments
We would like to thank the following people who commented on our proposal for the
book and helped plan its contents and organization: Hugo Larochelle, Guillaume Alain,
Kyunghyun Cho, Çağlar Gülçehre, Razvan Pascanu, David Krueger and Thomas Rohée.
We would like to thank the following people who offered feedback on the content of
the book itself:
In many chapters: Julian Serban, Laurent Dinh, Guillaume Alain, Ilya Sutskever,
Vincent Vanhoucke, David Warde-Farley, Jurgen Van Gael, Dustin Webb, Johannes
Roith, Ion Androutsopoulos, Pawel Chilinski, Halis Sak, Grigory Sapunov.
Introduction: Johannes Roith, Eric Morris, Samira Ebrahimi, Ozan Çağlayan, Martín Abadi.
Math background chapters:
Linear algebra: Pierre Luc Carrier, Li Yao, Thomas Rohée, Colby Toland, Amjad Almahairi, Sergey Oreshkov.
Probability: Rasmus Antti, Stephan Gouws, Vincent Dumoulin, Artem Oboturov, Li Yao, John Philip Anderson.
Numerical: Meire Fortunato.
Optimization: Marcel Ackermann.
ML: Dzmitry Bahdanau, Kelvin Xu.
MLPs:
Convolutional nets: Mehdi Mirza, Çağlar Gülçehre.
Unsupervised: Kelvin Xu.
Partition function: Sam Bowman.
Graphical models: Kelvin Xu.
RNNs: Kelvin Xu, Dmitriy Serdyuk, Dongyu Shi.
We also want to thank David Warde-Farley, Matthew D. Zeiler, Rob Fergus, Chris
Olah, Jason Yosinski, Nicolas Chapados and James Bergstra for contributing images or
figures (as noted in the captions).
TODO– this section is just notes, write it up in nice presentation form.
Notation
Mathematical Objects
a A scalar (integer or real) value with the name “a”
a A vector with the name “a”
A A matrix with the name “A”
TODO: higher-order tensors
A A set with the name “A”
R The set of real numbers
{0, 1} The set containing 0 and 1
a A scalar random variable with the name “a”
a A vector-valued random variable with the name “a”
A A matrix-valued random variable with the name “A”
G A graph with the name “G”
Indexing
a_i Element i of vector a, with indexing starting at 1
A_{i,j} Element i, j of matrix A
A_{i,:} Row i of matrix A
A_{:,i} Column i of matrix A
TODO: higher-order tensors
a_i Element i of the random vector a
x^{(t)} Usually the t-th example (input) from a dataset, with y^{(t)} the associated target, for supervised learning
X The matrix of input examples, with one row per example x^{(t)}.
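As a small worked example of this indexing convention (an illustration added here, not part of the original table), consider a 2 × 3 matrix:

% Worked example of the indexing notation above (requires amsmath for bmatrix).
\[
A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix},
\qquad
A_{1,2} = 2,
\qquad
A_{2,:} = \begin{bmatrix} 4 & 5 & 6 \end{bmatrix},
\qquad
A_{:,1} = \begin{bmatrix} 1 \\ 4 \end{bmatrix}.
\]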
Linear Algebra Operations
A^⊤ Transpose of matrix A
A ⊙ B Element-wise (Hadamard) product of A and B
Calculus
dy/dx Derivative of y with respect to x
∂y/∂x Partial derivative of y with respect to x
∇_x y Gradient of y with respect to x
∇_X y Matrix derivatives of y with respect to X
∫ f(x) dx Definite integral over the entire domain of x
∫_S f(x) dx Definite integral with respect to x over the set S
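For concreteness, a sketch of the gradient conventions above for a scalar y depending on a vector x or a matrix X (an illustration added here, not part of the original table):

% Gradient of a scalar with respect to a vector, and the matrix case written entrywise
% (requires amsmath for bmatrix).
\[
\nabla_{\mathbf{x}} y =
\begin{bmatrix}
  \frac{\partial y}{\partial x_1} & \cdots & \frac{\partial y}{\partial x_n}
\end{bmatrix}^{\top},
\qquad
\left( \nabla_{\mathbf{X}} y \right)_{i,j} = \frac{\partial y}{\partial X_{i,j}}.
\]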
Miscellaneous
f ∘ g Composition of the functions f and g
log x Natural logarithm of x
Probability and Information Theory
a ⊥ b The random variables a and b are independent.
a ⊥ b | c The random variables a and b are conditionally independent given c.
E_{x∼P}[f(x)] or E f(x) Expectation of f(x) with respect to P(x)
Var(f(x)) Variance of f(x) under P(x)
Cov(f(x), g(x)) Covariance of f(x) and g(x) under P(x)
D_KL(P ‖ Q) Kullback-Leibler divergence of P and Q
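For reference, a sketch of the standard definitions behind these symbols, written for the continuous case (sums replace the integrals for discrete variables); Chapter 3 develops them formally:

% Standard definitions of expectation, variance, covariance, and KL divergence
% (requires amsmath and amssymb).
\begin{align*}
  \mathbb{E}_{x \sim P}[f(x)] &= \int P(x)\, f(x)\, dx, \\
  \mathrm{Var}(f(x)) &= \mathbb{E}\big[ ( f(x) - \mathbb{E}[f(x)] )^2 \big], \\
  \mathrm{Cov}(f(x), g(x)) &= \mathbb{E}\big[ ( f(x) - \mathbb{E}[f(x)] )\,( g(x) - \mathbb{E}[g(x)] ) \big], \\
  D_{\mathrm{KL}}(P \,\|\, Q) &= \mathbb{E}_{x \sim P}\!\left[ \log \frac{P(x)}{Q(x)} \right].
\end{align*}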
TODO: norms. TODO: entropy. TODO: Jacobian and Hessian. TODO: specify that, unless otherwise clear from context, functions applied to vectors and matrices are applied elementwise.