Deep Learning
Yoshua Bengio
Ian J. Goodfellow
Aaron Courville
March 30, 2015
Table of Contents
Acknowledgments 1
Notation 2
1 Introduction 4
1.1 Who Should Read This Book? . . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Historical Trends in Deep Learning . . . . . . . . . . . . . . . . . . . . . 12
I Applied math and machine learning basics 18
2 Linear Algebra 20
2.1 Scalars, Vectors, Matrices and Tensors . . . . . . . . . . . . . . . . . . . 20
2.2 Multiplying Matrices and Vectors . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Identity and Inverse Matrices . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Linear Dependence, Span, and Rank . . . . . . . . . . . . . . . . . . . . 25
2.5 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 Special Kinds of Matrices and Vectors . . . . . . . . . . . . . . . . . . . 28
2.7 Eigendecomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.8 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . . . 30
2.9 The Trace Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.10 Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.11 Example: Principal Components Analysis . . . . . . . . . . . . . . . . . 32
3 Probability and Information Theory 35
3.1 Why Probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.1 Discrete Variables and Probability Mass Functions . . . . . . . . 38
3.3.2 Continuous Variables and Probability Density Functions . . . . . 38
3.4 Marginal Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6 The Chain Rule of Conditional Probabilities . . . . . . . . . . . . . . . . 40
3.7 Independence and Conditional Independence . . . . . . . . . . . . . . . 40
3.8 Expectation, Variance, and Covariance . . . . . . . . . . . . . . . . . . . 41
3.9 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.10 Common Probability Distributions . . . . . . . . . . . . . . . . . . . . . 44
3.10.1 Bernoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . . 44
3.10.2 Multinoulli Distribution . . . . . . . . . . . . . . . . . . . . . . . 44
3.10.3 Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . 45
3.10.4 Dirac Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.10.5 Mixtures of Distributions and Gaussian Mixture . . . . . . . . . 48
3.11 Useful Properties of Common Functions . . . . . . . . . . . . . . . . . . 48
3.12 Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.13 Technical Details of Continuous Variables . . . . . . . . . . . . . . . . . 51
3.14 Example: Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4 Numerical Computation 56
4.1 Overflow and Underflow . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Poor Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Gradient-Based Optimization . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Constrained Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5 Example: Linear Least Squares . . . . . . . . . . . . . . . . . . . . . . . 68
5 Machine Learning Basics 70
5.1 Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.1 The Task, T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.2 The Performance Measure, P . . . . . . . . . . . . . . . . . . . . 72
5.1.3 The Experience, E . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Example: Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3 Generalization, Capacity, Overfitting and Underfitting . . . . . . . . . . 76
5.3.1 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.2 Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.3 Occam’s Razor, Underfitting and Overfitting . . . . . . . . . . . 78
5.4 Estimating and Monitoring Generalization Error . . . . . . . . . . . . . 81
5.5 Estimators, Bias, and Variance . . . . . . . . . . . . . . . . . . . . . . . 83
5.5.1 Point Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.5.2 Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5.3 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.5.4 Trading off Bias and Variance and the Mean Squared Error . . . 85
5.5.5 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.6 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . 87
5.6.1 Properties of Maximum Likelihood . . . . . . . . . . . . . . . . . 87
5.6.2 Regularized Likelihood . . . . . . . . . . . . . . . . . . . . . . . . 87
5.7 Bayesian Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.8 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.8.1 Estimating Conditional Expectation by Minimizing Squared Error 88
5.8.2 Estimating Probabilities or Conditional Probabilities by Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.9 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.9.1 Principal Components Analysis . . . . . . . . . . . . . . . . . . . 91
5.10 Weakly Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.11 The Smoothness Prior, Local Generalization and Non-Parametric Models 93
5.12 Manifold Learning and the Curse of Dimensionality . . . . . . . . . . . . 97
5.13 Challenges of High-Dimensional Distributions . . . . . . . . . . . . . . . 100
II Modern practical deep networks 102
6 Feedforward Deep Networks 104
6.1 Formalizing and Generalizing Neural Networks . . . . . . . . . . . . . . 104
6.2 Parametrizing a Learned Predictor . . . . . . . . . . . . . . . . . . . . . 108
6.2.1 Family of Functions . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.2.2 Loss Function and Conditional Log-Likelihood . . . . . . . . . . 109
6.2.3 Training Criterion and Regularizer . . . . . . . . . . . . . . . . . 115
6.2.4 Optimization Procedure . . . . . . . . . . . . . . . . . . . . . . . 116
6.3 Flow Graphs and Back-Propagation . . . . . . . . . . . . . . . . . . . . 117
6.3.1 Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.3.2 Back-Propagation in an MLP . . . . . . . . . . . . . . . . . . . . 119
6.3.3 Back-Propagation in a General Flow Graph . . . . . . . . . . . . 120
6.4 Universal Approximation Properties and Depth . . . . . . . . . . . . . . 126
6.5 Feature / Representation Learning . . . . . . . . . . . . . . . . . . . . . 128
6.6 Piecewise Linear Hidden Units . . . . . . . . . . . . . . . . . . . . . . . 129
6.7 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7 Regularization 131
7.1 Classical Regularization: Parameter Norm Penalty . . . . . . . . . . . . 132
7.1.1 L^2 Parameter Regularization . . . . . . . . . . . . . . . . . . . . 133
7.1.2 L^1 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.1.3 L^∞ Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.2 Classical Regularization as Constrained Optimization . . . . . . . . . . 138
7.3 Regularization from a Bayesian Perspective . . . . . . . . . . . . . . . . 139
7.4 Regularization and Under-Constrained Problems . . . . . . . . . . . . . 139
7.5 Dataset Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.6 Classical Regularization as Noise Robustness . . . . . . . . . . . . . . . 142
7.7 Bagging and Other Ensemble Methods . . . . . . . . . . . . . . . . . . . 142
7.8 Early Stopping as a Form of Regularization . . . . . . . . . . . . . . . . 143
7.9 Parameter Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.10 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.11 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.12 Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
8 Optimization for Training Deep Models 156
8.1 Optimization for Model Training . . . . . . . . . . . . . . . . . . . . . . 156
8.1.1 Empirical Risk Minimization . . . . . . . . . . . . . . . . . . . . 156
8.1.2 Surrogate Loss Functions . . . . . . . . . . . . . . . . . . . . . . 157
8.1.3 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
8.1.4 Batches and Minibatches . . . . . . . . . . . . . . . . . . . . . . 158
8.1.5 Data Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.2 Challenges in Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.2.1 Local Minima . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.2.2 Ill-Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.2.3 Plateaus, Saddle Points, and Other Flat Regions . . . . . . . . . 158
8.2.4 Cliffs and Exploding Gradients . . . . . . . . . . . . . . . . . . . 158
8.2.5 Vanishing and Exploding Gradients - An Introduction to the Issue
of Learning Long-Term Dependencies . . . . . . . . . . . . . . . 161
8.3 Optimization Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.3.1 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
8.3.2 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . 165
8.3.3 Momentum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
8.3.4 Adagrad . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.3.5 RMSprop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.3.6 Adadelta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.3.7 No Pesky Learning Rates . . . . . . . . . . . . . . . . . . . . . . 168
8.4 Approximate Natural Gradient and Second-Order Methods . . . . . . . 168
8.5 Conjugate Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.6 BFGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.6.1 New . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.6.2 Optimization Strategies and Meta-Algorithms . . . . . . . . . . . 168
8.6.3 Coordinate Descent . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.6.4 Greedy Supervised Pre-training . . . . . . . . . . . . . . . . . . . 169
8.7 Hints and Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . 169
9 Convolutional Networks 173
9.1 The Convolution Operation . . . . . . . . . . . . . . . . . . . . . . . . . 173
9.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
9.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
9.4 Variants of the Basic Convolution Function . . . . . . . . . . . . . . . . 183
9.5 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
9.6 Efficient Convolution Algorithms . . . . . . . . . . . . . . . . . . . . . . 190
9.7 Deep Learning History . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
10 Sequence Modeling: Recurrent and Recursive Nets 191
10.1 Unfolding Flow Graphs and Sharing Parameters . . . . . . . . . . . . . 191
10.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 193
10.2.1 Computing the Gradient in a Recurrent Neural Network . . . . . 195
10.2.2 Recurrent Networks as Generative Directed Acyclic Models . . . 197
10.2.3 RNNs to Represent Conditional Probability Distributions . . . . 199
10.3 Bidirectional RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
10.4 Recursive Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 204
10.5 Auto-Regressive Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 205
10.5.1 Logistic Auto-Regressive Networks . . . . . . . . . . . . . . . . . 206
10.5.2 Neural Auto-Regressive Networks . . . . . . . . . . . . . . . . . . 207
10.5.3 NADE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
10.6 Facing the Challenge of Long-Term Dependencies . . . . . . . . . . . . . 210
10.6.1 Echo State Networks: Choosing Weights to Make Dynamics Barely
Contractive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
10.6.2 Combining Short and Long Paths in the Unfolded Flow Graph . 212
10.6.3 Leaky Units and a Hierarchy of Different Time Scales . . . . . . 213
10.6.4 The Long-Short-Term-Memory Architecture and Other Gated RNNs . . 214
10.6.5 Deep RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
10.6.6 Better Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 218
10.6.7 Clipping Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . 219
10.6.8 Regularizing to Encourage Information Flow . . . . . . . . . . . 220
10.6.9 Organizing the State at Multiple Time Scales . . . . . . . . . . . 221
10.7 Handling Temporal Dependencies with N-Grams, HMMs, CRFs and Other
Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
10.7.1 N-grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
10.7.2 Efficient Marginalization and Inference for Temporally Structured
Outputs by Dynamic Programming . . . . . . . . . . . . . . . . . 223
10.7.3 HMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
10.7.4 CRFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
10.8 Combining Neural Networks and Search . . . . . . . . . . . . . . . . . . 231
10.8.1 Approximate Search . . . . . . . . . . . . . . . . . . . . . . . . . 233
11 Large scale deep learning 236
11.1 Fast CPU Implementations . . . . . . . . . . . . . . . . . . . . . . . . . 236
11.2 GPU Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
11.3 Asynchronous Parallel Implementations . . . . . . . . . . . . . . . . . . 236
11.4 Model Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
11.5 Dynamically Structured Nets . . . . . . . . . . . . . . . . . . . . . . . . 237
12 Practical methodology 238
12.1 When to Gather More Data, Control Capacity, or Change Algorithms . 238
12.2 Machine Learning Methodology 101 . . . . . . . . . . . . . . . . . . . . 238
12.3 Manual Hyperparameter Tuning . . . . . . . . . . . . . . . . . . . . . . 238
12.4 Hyper-parameter Optimization Algorithms . . . . . . . . . . . . . . . . 238
12.5 Tricks of the Trade for Deep Learning . . . . . . . . . . . . . . . . . . . 240
12.5.1 Debugging Back-Prop . . . . . . . . . . . . . . . . . . . . . . . . 240
12.5.2 Automatic Differentiation and Symbolic Manipulations of Flow Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
12.5.3 Momentum and Other Averaging Techniques as Cheap Second
Order Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
13 Applications 241
13.1 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
13.1.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
13.1.2 Convolutional Nets . . . . . . . . . . . . . . . . . . . . . . . . . . 247
13.2 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
13.3 Natural Language Processing and Neural Language Models . . . . . . . 248
13.3.1 The Basics of Neural Language Models . . . . . . . . . . . . . . 248
13.3.2 The Problem With N-Grams . . . . . . . . . . . . . . . . . . . . 248
13.3.3 How Neural Language Models can Generalize Better . . . . . . . 250
13.3.4 Neural Machine Translation . . . . . . . . . . . . . . . . . . . . . 252
13.3.5 High-Dimensional Outputs . . . . . . . . . . . . . . . . . . . . . 252
13.3.6 Combining Neural Language Models with N-Grams . . . . . . . 252
13.4 Structured Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
13.5 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
III Deep learning research 253
14 Structured Probabilistic Models: A Deep Learning Perspective 255
14.1 The Challenge of Unstructured Modeling . . . . . . . . . . . . . . . . . 256
14.2 Using Graphs to Describe Model Structure . . . . . . . . . . . . . . . . 259
14.2.1 Directed Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
14.2.2 Undirected Models . . . . . . . . . . . . . . . . . . . . . . . . . . 261
14.2.3 The Partition Function . . . . . . . . . . . . . . . . . . . . . . . 263
14.2.4 Energy-Based Models . . . . . . . . . . . . . . . . . . . . . . . . 265
14.2.5 Separation and D-Separation . . . . . . . . . . . . . . . . . . . . 266
14.2.6 Converting Between Undirected and Directed Graphs . . . . . . 267
14.2.7 Marginalizing Variables out of a Graph . . . . . . . . . . . . . . 272
14.2.8 Factor Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
14.3 Advantages of Structured Modeling . . . . . . . . . . . . . . . . . . . . . 272
14.4 Learning About Dependencies . . . . . . . . . . . . . . . . . . . . . . . . 273
14.4.1 Latent Variables Versus Structure Learning . . . . . . . . . . . . 273
14.4.2 Latent Variables for Feature Learning . . . . . . . . . . . . . . . 274
14.5 Inference and Approximate Inference Over Latent Variables . . . . . . . 274
14.5.1 Reparametrization Trick . . . . . . . . . . . . . . . . . . . . . . . 274
14.6 The Deep Learning Approach to Structured Probabilistic Modeling . . . 276
14.6.1 Example: The Restricted Boltzmann Machine . . . . . . . . . . . 277
15 Monte Carlo Methods 279
15.1 Markov Chain Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . 279
15.1.1 Markov Chain Theory . . . . . . . . . . . . . . . . . . . . . . . . 280
16 Linear Factor Models and Auto-Encoders 282
16.1 Regularized Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . 283
16.2 Representational Power, Layer Size and Depth . . . . . . . . . . . . . . 287
16.3 Reconstruction Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 288
16.4 Linear Factor Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
16.5 Probabilistic PCA and Factor Analysis . . . . . . . . . . . . . . . . . . . 289
16.5.1 ICA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
16.5.2 Sparse Coding as a Generative Model . . . . . . . . . . . . . . . 292
16.6 Probabilistic Interpretation of Reconstruction Error as Log-Likelihood . 293
16.7 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
16.7.1 Sparse Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . 296
16.7.2 Predictive Sparse Decomposition . . . . . . . . . . . . . . . . . . 298
16.8 Denoising Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . . 298
16.8.1 Learning a Vector Field that Estimates a Gradient Field . . . . . 301
16.9 Contractive Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . . . . 304
17 Representation Learning 307
17.1 Greedy Layerwise Unsupervised Pre-Training . . . . . . . . . . . . . . . 308
17.1.1 Why Does Unsupervised Pre-Training Work? . . . . . . . . . . . 311
17.2 Transfer Learning and Domain Adaptation . . . . . . . . . . . . . . . . 315
17.3 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 322
17.4 Causality, Semi-Supervised Learning and Disentangling the Underlying
Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
17.5 Assumption of Underlying Factors and Distributed Representation . . . 325
17.6 Exponential Gain in Representational Efficiency from Distributed Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
17.7 Exponential Gain in Representational Efficiency from Depth . . . . . . . 330
17.8 Priors Regarding The Underlying Factors . . . . . . . . . . . . . . . . . 333
18 The Manifold Perspective on Representation Learning 336
18.1 Manifold Interpretation of PCA and Linear Auto-Encoders . . . . . . . 344
18.2 Manifold Interpretation of Sparse Coding . . . . . . . . . . . . . . . . . 347
18.3 Manifold Learning via Regularized Auto-Encoders . . . . . . . . . . . . 347
18.4 Tangent Distance, Tangent-Prop, and Manifold Tangent Classifier . . . 348
19 Confronting the Partition Function 352
19.1 Estimating the Partition Function . . . . . . . . . . . . . . . . . . . . . 352
19.1.1 Annealed Importance Sampling . . . . . . . . . . . . . . . . . . . 354
19.1.2 Bridge Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
19.1.3 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
19.2 Stochastic Maximum Likelihood and Contrastive Divergence . . . . . . . 358
19.3 Pseudolikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
19.4 Score Matching and Ratio Matching . . . . . . . . . . . . . . . . . . . . 367
19.5 Denoising Score Matching . . . . . . . . . . . . . . . . . . . . . . . . . . 369
19.6 Noise-Contrastive Estimation . . . . . . . . . . . . . . . . . . . . . . . . 370
20 Approximate inference 372
20.1 Inference as Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 372
20.2 Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . . . . . 375
20.3 MAP Inference: Sparse Coding as a Probabilistic Model . . . . . . . . . 375
20.4 Variational Inference and Learning . . . . . . . . . . . . . . . . . . . . . 376
20.4.1 Discrete Latent Variables . . . . . . . . . . . . . . . . . . . . . . 378
20.4.2 Calculus of Variations . . . . . . . . . . . . . . . . . . . . . . . . 378
20.4.3 Continuous Latent Variables . . . . . . . . . . . . . . . . . . . . 380
20.5 Stochastic Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
20.6 Learned Approximate Inference . . . . . . . . . . . . . . . . . . . . . . . 380
21 Deep Generative Models 382
21.1 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . 382
21.1.1 Conditional Distributions . . . . . . . . . . . . . . . . . . . . . . 384
21.1.2 RBM Gibbs Sampling . . . . . . . . . . . . . . . . . . . . . . . . 385
21.2 Training Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . 385
21.2.1 Contrastive Divergence Training of the RBM . . . . . . . . . . . 388
21.2.2 Stochastic Maximum Likelihood (Persistent Contrastive Divergence) for the RBM . . . . . . . . . . . . . . . . . . . . . . . . . 388
21.2.3 Other Inductive Principles . . . . . . . . . . . . . . . . . . . . . . 388
21.3 Deep Belief Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
21.4 Deep Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . . 391
21.4.1 Interesting Properties . . . . . . . . . . . . . . . . . . . . . . . . 395
21.4.2 Mean Field Inference in the DBM . . . . . . . . . . . . . . . . . 395
21.4.3 Variational Expectation Maximization . . . . . . . . . . . . . . . 396
21.4.4 Variational Learning With SML . . . . . . . . . . . . . . . . . . 396
21.4.5 Layerwise Pretraining . . . . . . . . . . . . . . . . . . . . . . . . 397
21.4.6 Multi-Prediction Deep Boltzmann Machines . . . . . . . . . . . . 398
21.4.7 Centered Deep Boltzmann Machines . . . . . . . . . . . . . . . . 398
21.5 Boltzmann Machines for Real-Valued Data . . . . . . . . . . . . . . . . 400
21.5.1 Gaussian-Bernoulli RBMs . . . . . . . . . . . . . . . . . . . . . . 400
21.5.2 mcRBMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
21.5.3 mPoT Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
21.5.4 Spike and Slab Restricted Boltzmann Machines . . . . . . . . . . 401
21.6 Convolutional Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . 401
21.7 Other Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . 402
21.8 Directed Generative Nets . . . . . . . . . . . . . . . . . . . . . . . . . . 403
21.8.1 Sigmoid Belief Nets . . . . . . . . . . . . . . . . . . . . . . . . . 403
21.8.2 Variational Autoencoders . . . . . . . . . . . . . . . . . . . . . . 403
21.8.3 Variational Interpretation of PSD . . . . . . . . . . . . . . . . . . 403
21.8.4 Generative Adversarial Networks . . . . . . . . . . . . . . . . . . 403
21.9 A Generative View of Autoencoders . . . . . . . . . . . . . . . . . . . . 404
21.9.1 Markov Chain Associated with any Denoising Auto-Encoder . . 405
21.9.2 Clamping and Conditional Sampling . . . . . . . . . . . . . . . . 407
21.9.3 Walk-Back Training Procedure . . . . . . . . . . . . . . . . . . . 408
21.10 Generative Stochastic Networks . . . . . . . . . . . . . . . . . . . . . . 409
21.10.1 Discriminant GSNs . . . . . . . . . . . . . . . . . . . . . . . . . . 410
21.11 Methodological Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
Bibliography 413
Index 437
Acknowledgments
We would like to thank the following people who commented on our proposal for the
book and helped plan its contents and organization: Hugo Larochelle, Guillaume Alain,
Kyunghyun Cho, Çağlar Gülçehre, Razvan Pascanu, David Krueger and Thomas Rohée.
We would like to thank the following people who offered feedback on the content of
the book itself:
In many chapters: Julian Serban, Laurent Dinh, Guillaume Alain, Ilya Sutskever,
Vincent Vanhoucke, David Warde-Farley, Jurgen Van Gael, Dustin Webb, Johannes
Roith, Ion Androutsopoulos, Pawel Chilinski, Halis Sak, Grigory Sapunov.
Introduction: Johannes Roith, Eric Morris, Samira Ebrahimi, Ozan Çağlayan, Martín Abadi.
Math background chapters:
Linear algebra: Pierre Luc Carrier, Li Yao, Thomas Rohée, Colby Toland, Amjad Almahairi, Sergey Oreshkov.
Probability: Rasmus Antti, Stephan Gouws, Vincent Dumoulin, Artem Oboturov, Li Yao, John Philip Anderson.
Numerical: Meire Fortunato.
Optimization: Marcel Ackermann.
ML: Dzmitry Bahdanau, Kelvin Xu.
MLPs:
Convolutional nets: Mehdi Mirza, Çağlar Gülçehre.
Unsupervised: Kelvin Xu.
Partition function: Sam Bowman.
Graphical models: Kelvin Xu.
RNNs: Kelvin Xu, Dmitriy Serdyuk, Dongyu Shi.
We also want to thank David Warde-Farley, Matthew D. Zeiler, Rob Fergus, Chris
Olah, Jason Yosinski, Nicolas Chapados and James Bergstra for contributing images or
figures (as noted in the captions).
TODO– this section is just notes, write it up in nice presentation form.
Notation
Mathematical Objects
a A scalar (integer or real) value with the name “a”
a A vector with the name “a”
A A matrix with the name “A”
TODO: higher-order tensors
A A set with the name “A”
R The set of real numbers
{0, 1} The set containing 0 and 1
a A scalar random variable with the name “a”
a A vector-valued random variable with the name “a”
A A matrix-valued random variable with the name “A”
G A graph with the name “G”
Indexing
a_i Element i of vector a, with indexing starting at 1
A_{i,j} Element i, j of matrix A
A_{i,:} Row i of matrix A
A_{:,i} Column i of matrix A
TODO: higher-order tensors
a_i Element i of the random vector a
x^{(t)} Usually the t-th example (input) from a dataset, with y^{(t)} the associated target, for supervised learning
X The matrix of input examples, with one row per example x^{(t)}.
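As a small worked example of this indexing convention (an illustration added here, not part of the original table), consider a 2 × 3 matrix:

% Worked example of the indexing notation above (requires amsmath for bmatrix).
\[
A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix},
\qquad
A_{1,2} = 2,
\qquad
A_{2,:} = \begin{bmatrix} 4 & 5 & 6 \end{bmatrix},
\qquad
A_{:,1} = \begin{bmatrix} 1 \\ 4 \end{bmatrix}.
\]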
Linear Algebra Operations
A^⊤ Transpose of matrix A
A ⊙ B Element-wise (Hadamard) product of A and B
Calculus
dy/dx Derivative of y with respect to x
∂y/∂x Partial derivative of y with respect to x
∇_x y Gradient of y with respect to x
∇_X y Matrix derivatives of y with respect to X
∫ f(x) dx Definite integral over the entire domain of x
∫_S f(x) dx Definite integral with respect to x over the set S
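For concreteness, a sketch of the gradient conventions above for a scalar y depending on a vector x or a matrix X (an illustration added here, not part of the original table):

% Gradient of a scalar with respect to a vector, and the matrix case written entrywise
% (requires amsmath for bmatrix).
\[
\nabla_{\mathbf{x}} y =
\begin{bmatrix}
  \frac{\partial y}{\partial x_1} & \cdots & \frac{\partial y}{\partial x_n}
\end{bmatrix}^{\top},
\qquad
\left( \nabla_{\mathbf{X}} y \right)_{i,j} = \frac{\partial y}{\partial X_{i,j}}.
\]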
Miscellaneous
f ∘ g Composition of the functions f and g
log x Natural logarithm of x
Probability and Information Theory
a ⊥ b The random variables a and b are independent.
a ⊥ b | c The random variables a and b are conditionally independent given c.
E_{x∼P}[f(x)] or E f(x) Expectation of f(x) with respect to P(x)
Var(f(x)) Variance of f(x) under P(x)
Cov(f(x), g(x)) Covariance of f(x) and g(x) under P(x)
D_KL(P ‖ Q) Kullback-Leibler divergence of P and Q
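For reference, a sketch of the standard definitions behind these symbols, written for the continuous case (sums replace the integrals for discrete variables); Chapter 3 develops them formally:

% Standard definitions of expectation, variance, covariance, and KL divergence
% (requires amsmath and amssymb).
\begin{align*}
  \mathbb{E}_{x \sim P}[f(x)] &= \int P(x)\, f(x)\, dx, \\
  \mathrm{Var}(f(x)) &= \mathbb{E}\big[ ( f(x) - \mathbb{E}[f(x)] )^2 \big], \\
  \mathrm{Cov}(f(x), g(x)) &= \mathbb{E}\big[ ( f(x) - \mathbb{E}[f(x)] )\,( g(x) - \mathbb{E}[g(x)] ) \big], \\
  D_{\mathrm{KL}}(P \,\|\, Q) &= \mathbb{E}_{x \sim P}\!\left[ \log \frac{P(x)}{Q(x)} \right].
\end{align*}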
TODO: norms. TODO: entropy. TODO: Jacobian and Hessian. TODO: specify that, unless otherwise clear from context, functions applied to vectors and matrices are applied elementwise.