
6.5 Universal Approximation Properties and Depth . . . . . . . . . . 188
6.6 Feature / Representation Learning . . . . . . . . . . . . . . . . . 191
6.7 Piecewise Linear Hidden Units . . . . . . . . . . . . . . . . . . . 192
6.8 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
7 Regularization of Deep or Distributed Models 196
7.1 Regularization from a Bayesian Perspective . . . . . . . . . . . . 198
7.2 Classical Regularization: Parameter Norm Penalty . . . . . . . . 199
7.3 Classical Regularization as Constrained Optimization . . . . . . . 207
7.4 Regularization and Under-Constrained Problems . . . . . . . . . 208
7.5 Dataset Augmentation . . . . . . . . . . . . . . . . . . . . . . . . 210
7.6 Classical Regularization as Noise Robustness . . . . . . . . . . . 211
7.7 Early Stopping as a Form of Regularization . . . . . . . . . . . . 217
7.8 Parameter Tying and Parameter Sharing . . . . . . . . . . . . . . 223
7.9 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . 224
7.10 Bagging and Other Ensemble Methods . . . . . . . . . . . . . . . 226
7.11 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
7.12 Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . 232
7.13 Adversarial Training . . . . . . . . . . . . . . . . . . . . . . . . . 234
8 Optimization for Training Deep Models 236
8.1 Optimization for Model Training . . . . . . . . . . . . . . . . . . 236
8.2 Challenges in Optimization . . . . . . . . . . . . . . . . . . . . . 241
8.3 Optimization Algorithms I: Basic Algorithms . . . . . . . . . . . 250
8.4 Optimization Algorithms II: Adaptive Learning Rates . . . . . . 256
8.5 Optimization Algorithms III: Approximate Second-Order Methods . . 261
8.6 Optimization Algorithms IV: Natural Gradient Methods . . . . . 262
8.7 Optimization Strategies and Meta-Algorithms . . . . . . . . . . . 262
8.8 Hints, Global Optimization and Curriculum Learning . . . . . . . 270
9 Convolutional Networks 274
9.1 The Convolution Operation . . . . . . . . . . . . . . . . . . . . . 275
9.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
9.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
9.4 Convolution and Pooling as an Infinitely Strong Prior . . . . . . 287
9.5 Variants of the Basic Convolution Function . . . . . . . . . . . . 288
9.6 Structured Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . 295
9.7 Convolutional Modules . . . . . . . . . . . . . . . . . . . . . . . . 295
9.8 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
9.9 Efficient Convolution Algorithms . . . . . . . . . . . . . . . . . . 297
9.10 Random or Unsupervised Features . . . . . . . . . . . . . . . . . 298