Deep Learning
Yoshua Bengio
Ian Goodfellow
Aaron Courville
Contents
Acknowledgments vii
Notation ix
1 Introduction 1
1.1 Who Should Read This Book? . . . . . . . . . . . . . . . . . . . . 8
1.2 Historical Trends in Deep Learning . . . . . . . . . . . . . . . . . 11
I Applied Math and Machine Learning Basics 25
2 Linear Algebra 27
2.1 Scalars, Vectors, Matrices and Tensors . . . . . . . . . . . . . . . 27
2.2 Multiplying Matrices and Vectors . . . . . . . . . . . . . . . . . . 30
2.3 Identity and Inverse Matrices . . . . . . . . . . . . . . . . . . . . 31
2.4 Linear Dependence, Span, and Rank . . . . . . . . . . . . . . . . 32
2.5 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.6 Special Kinds of Matrices and Vectors . . . . . . . . . . . . . . . 35
2.7 Eigendecomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.8 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . 39
2.9 The Moore-Penrose Pseudoinverse . . . . . . . . . . . . . . . . . 40
2.10 The Trace Operator . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.11 Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.12 Example: Principal Components Analysis . . . . . . . . . . . . . 42
3 Probability and Information Theory 46
3.1 Why Probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . 49
3.4 Marginal Probability . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5 Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . 51
3.6 The Chain Rule of Conditional Probabilities . . . . . . . . . . . . 52
3.7 Independence and Conditional Independence . . . . . . . . . . . 52
3.8 Expectation, Variance, and Covariance . . . . . . . . . . . . . . . 53
3.9 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.10 Common Probability Distributions . . . . . . . . . . . . . . . . . 57
3.11 Useful Properties of Common Functions . . . . . . . . . . . . . . 62
3.12 Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.13 Technical Details of Continuous Variables . . . . . . . . . . . . . 64
3.14 Structured Probabilistic Models . . . . . . . . . . . . . . . . . . . 65
3.15 Example: Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . 68
4 Numerical Computation 74
4.1 Overflow and Underflow . . . . . . . . . . . . . . . . . . . . . . . 74
4.2 Poor Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 Gradient-Based Optimization . . . . . . . . . . . . . . . . . . . . 76
4.4 Constrained Optimization . . . . . . . . . . . . . . . . . . . . . . 85
4.5 Example: Linear Least Squares . . . . . . . . . . . . . . . . . . . 87
5 Machine Learning Basics 89
5.1 Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2 Example: Linear Regression . . . . . . . . . . . . . . . . . . . . . 97
5.3 Generalization, Capacity, Overfitting and Underfitting . . . . . . 99
5.4 The No Free Lunch Theorem . . . . . . . . . . . . . . . . . . . . 104
5.5 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.6 Hyperparameters, Validation Sets and Cross-Validation . . . . . 108
5.7 Estimators, Bias, and Variance . . . . . . . . . . . . . . . . . . . 110
5.8 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . 118
5.9 Bayesian Statistics and Prior Probability Distributions . . . . . . 121
5.10 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.11 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . 131
5.12 Weakly Supervised Learning . . . . . . . . . . . . . . . . . . . . . 134
5.13 The Curse of Dimensionality and Statistical Limitations of Local Generalization . . . . . . 135
II Modern Practical Deep Networks 147
6 Feedforward Deep Networks 149
6.1 From Fixed Features to Learned Features . . . . . . . . . . . . . 149
6.2 Formalizing and Generalizing Neural Networks . . . . . . . . . . 152
6.3 Parametrizing a Learned Predictor . . . . . . . . . . . . . . . . . 154
6.4 Flow Graphs and Back-Propagation . . . . . . . . . . . . . . . . 167
6.5 Universal Approximation Properties and Depth . . . . . . . . . . 180
6.6 Feature / Representation Learning . . . . . . . . . . . . . . . . . 184
6.7 Piecewise Linear Hidden Units . . . . . . . . . . . . . . . . . . . 186
6.8 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
7 Regularization 190
7.1 Regularization from a Bayesian Perspective . . . . . . . . . . . . 191
7.2 Classical Regularization: Parameter Norm Penalty . . . . . . . . 193
7.3 Classical Regularization as Constrained Optimization . . . . . . . 200
7.4 Regularization and Under-Constrained Problems . . . . . . . . . 201
7.5 Dataset Augmentation . . . . . . . . . . . . . . . . . . . . . . . . 203
7.6 Classical Regularization as Noise Robustness . . . . . . . . . . . 204
7.7 Early Stopping as a Form of Regularization . . . . . . . . . . . . 208
7.8 Parameter Tying and Parameter Sharing . . . . . . . . . . . . . . 215
7.9 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . 215
7.10 Bagging and Other Ensemble Methods . . . . . . . . . . . . . . . 215
7.11 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
7.12 Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . 222
7.13 Adversarial Training . . . . . . . . . . . . . . . . . . . . . . . . . 223
8 Optimization for Training Deep Models 226
8.1 Optimization for Model Training . . . . . . . . . . . . . . . . . . 226
8.2 Challenges in Optimization . . . . . . . . . . . . . . . . . . . . . 229
8.3 Optimization Algorithms . . . . . . . . . . . . . . . . . . . . . . . 236
8.4 Approximate Natural Gradient and Second-Order Methods . . . 241
8.5 Conjugate Gradients . . . . . . . . . . . . . . . . . . . . . . . . . 241
8.6 BFGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
8.7 Hints, Global Optimization and Curriculum Learning . . . . . . . 243
9 Convolutional Networks 248
9.1 The Convolution Operation . . . . . . . . . . . . . . . . . . . . . 248
9.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
9.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
9.4 Convolution and Pooling as an Infinitely Strong Prior . . . . . . 261
9.5 Variants of the Basic Convolution Function . . . . . . . . . . . . 262
9.6 Structured Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . 269
9.7 Convolutional Modules . . . . . . . . . . . . . . . . . . . . . . . . 269
9.8 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
9.9 Efficient Convolution Algorithms . . . . . . . . . . . . . . . . . . 271
9.10 Random or Unsupervised Features . . . . . . . . . . . . . . . . . 271
9.11 The Neuroscientific Basis for Convolutional Networks . . . . . . . 273
9.12 Convolutional Networks and the History of Deep Learning . . . . 280
10 Sequence Modeling: Recurrent and Recursive Nets 281
10.1 Unfolding Flow Graphs and Sharing Parameters . . . . . . . . . . 282
10.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . 284
10.3 Bidirectional RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . 295
10.4 Deep Recurrent Networks . . . . . . . . . . . . . . . . . . . . . . 296
10.5 Recursive Neural Networks . . . . . . . . . . . . . . . . . . . . . 299
10.6 Auto-Regressive Networks . . . . . . . . . . . . . . . . . . . . . . 299
10.7 Facing the Challenge of Long-Term Dependencies . . . . . . . . . 305
10.8 Handling Temporal Dependencies with N-Grams, HMMs, CRFs and Other Graphical Models . . . . . . 317
10.9 Combining Neural Networks and Search . . . . . . . . . . . . . . 328
11 Practical Methodology 333
11.1 Basic Machine Learning Methodology . . . . . . . . . . . . . . . 333
11.2 Manual Hyperparameter Tuning . . . . . . . . . . . . . . . . . . 334
11.3 Hyperparameter Optimization Algorithms . . . . . . . . . . . . . 334
11.4 Debugging Strategies . . . . . . . . . . . . . . . . . . . . . . . . . 336
12 Applications 339
12.1 Large Scale Deep Learning . . . . . . . . . . . . . . . . . . . . . . 339
12.2 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
12.3 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 352
12.4 Natural Language Processing and Neural Language Models . . . 353
12.5 Structured Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . 369
12.6 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 369
III Deep Learning Research 370
13 Structured Probabilistic Models for Deep Learning 372
13.1 The Challenge of Unstructured Modeling . . . . . . . . . . . . . . 373
13.2 Using Graphs to Describe Model Structure . . . . . . . . . . . . . 377
13.3 Advantages of Structured Modeling . . . . . . . . . . . . . . . . . 391
13.4 Learning About Dependencies . . . . . . . . . . . . . . . . . . . . 392
13.5 Inference and Approximate Inference Over Latent Variables . . . 394
13.6 The Deep Learning Approach to Structured Probabilistic Models 395
14 Monte Carlo Methods 400
14.1 Markov Chain Monte Carlo Methods . . . . . . . . . . . . . . . . 400
14.2 The Difficulty of Mixing Between Well-Separated Modes . . . . . 402
15 Linear Factor Models and Auto-Encoders 404
15.1 Regularized Auto-Encoders . . . . . . . . . . . . . . . . . . . . . 405
15.2 Denoising Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . 408
15.3 Representational Power, Layer Size and Depth . . . . . . . . . . 410
15.4 Reconstruction Distribution . . . . . . . . . . . . . . . . . . . . . 411
15.5 Linear Factor Models . . . . . . . . . . . . . . . . . . . . . . . . . 412
15.6 Probabilistic PCA and Factor Analysis . . . . . . . . . . . . . . . 413
15.7 Reconstruction Error as Log-Likelihood . . . . . . . . . . . . . . 417
15.8 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . 418
15.9 Denoising Auto-Encoders . . . . . . . . . . . . . . . . . . . . . . 423
15.10 Contractive Auto-Encoders . . . . . . . . . . . . . . . . . . . . . 428
16 Representation Learning 431
16.1 Greedy Layerwise Unsupervised Pre-Training . . . . . . . . . . . 432
16.2 Transfer Learning and Domain Adaptation . . . . . . . . . . . . . 439
16.3 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . . 446
16.4 Semi-Supervised Learning and Disentangling Underlying Causal Factors . . . . . . 447
16.5 Assumption of Underlying Factors and Distributed Representation . . . . . . 449
16.6 Exponential Gain in Representational Efficiency from Distributed Representations . . . . . . 453
16.7 Exponential Gain in Representational Efficiency from Depth . . . 455
16.8 Priors Regarding The Underlying Factors . . . . . . . . . . . . . 457
17 The Manifold Perspective on Representation Learning 461
17.1 Manifold Interpretation of PCA and Linear Auto-Encoders . . . 469
17.2 Manifold Interpretation of Sparse Coding . . . . . . . . . . . . . 472
17.3 The Entropy Bias from Maximum Likelihood . . . . . . . . . . . 472
17.4 Manifold Learning via Regularized Auto-Encoders . . . . . . . . 473
17.5 Tangent Distance, Tangent-Prop, and Manifold Tangent Classifier 474
18 Confronting the Partition Function 478
18.1 The Log-Likelihood Gradient of Energy-Based Models . . . . . . 479
18.2 Stochastic Maximum Likelihood and Contrastive Divergence . . . 481
18.3 Pseudolikelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
18.4 Score Matching and Ratio Matching . . . . . . . . . . . . . . . . 490
18.5 Denoising Score Matching . . . . . . . . . . . . . . . . . . . . . . 492
18.6 Noise-Contrastive Estimation . . . . . . . . . . . . . . . . . . . . 492
18.7 Estimating the Partition Function . . . . . . . . . . . . . . . . . 494
19 Approximate Inference 502
19.1 Inference as Optimization . . . . . . . . . . . . . . . . . . . . . . 504
19.2 Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . 505
19.3 MAP Inference: Sparse Coding as a Probabilistic Model . . . . . 506
19.4 Variational Inference and Learning . . . . . . . . . . . . . . . . . 507
19.5 Stochastic Inference . . . . . . . . . . . . . . . . . . . . . . . . . 511
19.6 Learned Approximate Inference . . . . . . . . . . . . . . . . . . . 511
20 Deep Generative Models 513
20.1 Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . 513
20.2 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . 516
20.3 Training Restricted Boltzmann Machines . . . . . . . . . . . . . . 519
20.4 Deep Belief Networks . . . . . . . . . . . . . . . . . . . . . . . . . 523
20.5 Deep Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . 526
20.6 Boltzmann Machines for Real-Valued Data . . . . . . . . . . . . . 537
20.7 Convolutional Boltzmann Machines . . . . . . . . . . . . . . . . . 540
20.8 Other Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . 541
20.9 Directed Generative Nets . . . . . . . . . . . . . . . . . . . . . . 541
20.10 A Generative View of Autoencoders . . . . . . . . . . . . . . . . 543
20.11 Generative Stochastic Networks . . . . . . . . . . . . . . . . . . . 549
20.12 Methodological Notes . . . . . . . . . . . . . . . . . . . . . . . . . 551
Bibliography 555
Index 593