
CONTENTS
6.3 Hidden Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
6.4 Architecture Design . . . . . . . . . . . . . . . . . . . . . . . . . . 197
6.5 Back-Propagation and Other Differentiation Algorithms . . . . . 204
6.6 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
7 Regularization for Deep Learning 229
7.1 Parameter Norm Penalties . . . . . . . . . . . . . . . . . . . . . . 231
7.2 Norm Penalties as Constrained Optimization . . . . . . . . . . . . 238
7.3 Regularization and Under-Constrained Problems . . . . . . . . . 240
7.4 Dataset Augmentation . . . . . . . . . . . . . . . . . . . . . . . . 241
7.5 Noise Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
7.6 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . . 245
7.7 Multi-Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . 246
7.8 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
7.9 Parameter Tying and Parameter Sharing . . . . . . . . . . . . . . 252
7.10 Sparse Representations . . . . . . . . . . . . . . . . . . . . . . . . 254
7.11 Bagging and Other Ensemble Methods . . . . . . . . . . . . . . . 256
7.12 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
7.13 Adversarial Training . . . . . . . . . . . . . . . . . . . . . . . . . 268
7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier 270
8 Optimization for Training Deep Models 276
8.1 How Learning Differs from Pure Optimization . . . . . . . . . . . 277
8.2 Challenges in Neural Network Optimization . . . . . . . . . . . . 284
8.3 Basic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
8.4 Parameter Initialization Strategies . . . . . . . . . . . . . . . . . 302
8.5 Algorithms with Adaptive Learning Rates . . . . . . . . . . . . . 308
8.6 Approximate Second-Order Methods . . . . . . . . . . . . . . . . 312
8.7 Optimization Strategies and Meta-Algorithms . . . . . . . . . . . 320
9 Convolutional Networks 333
9.1 The Convolution Operation . . . . . . . . . . . . . . . . . . . . . 334
9.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
9.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
9.4 Convolution and Pooling as an Infinitely Strong Prior . . . . . . . 348
9.5 Variants of the Basic Convolution Function . . . . . . . . . . . . 350
9.6 Structured Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . 361
9.7 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
9.8 Efficient Convolution Algorithms . . . . . . . . . . . . . . . . . . 365
9.9 Random or Unsupervised Features . . . . . . . . . . . . . . . . . 366