
CONTENTS
3.6 The Chain Rule of Conditional Probabilities . . . . . . . . . . . . 52
3.7 Independence and Conditional Independence . . . . . . . . . . . 52
3.8 Expectation, Variance, and Covariance . . . . . . . . . . . . . . . 53
3.9 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.10 Common Probability Distributions . . . . . . . . . . . . . . . . . 57
3.11 Useful Properties of Common Functions . . . . . . . . . . . . . . 62
3.12 Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.13 Technical Details of Continuous Variables . . . . . . . . . . . . . 64
3.14 Example: Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . 65
4 Numerical Computation 69
4.1 Overflow and Underflow . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Poor Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 Gradient-Based Optimization . . . . . . . . . . . . . . . . . . . . 71
4.4 Constrained Optimization . . . . . . . . . . . . . . . . . . . . . . 80
4.5 Example: Linear Least Squares . . . . . . . . . . . . . . . . . . . 82
5 Machine Learning Basics 84
5.1 Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2 Example: Linear Regression . . . . . . . . . . . . . . . . . . . . . 91
5.3 Generalization, Capacity, Overfitting and Underfitting . . . . . . 94
5.4 The No Free Lunch Theorem . . . . . . . . . . . . . . . . . . . . 99
5.5 Hyperparameters, Validation Sets and Cross-Validation . . . . . 101
5.6 Estimators, Bias, and Variance . . . . . . . . . . . . . . . . . . . 103
5.7 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . 111
5.8 Bayesian Statistics and Prior Probability Distributions . . . . . . 114
5.9 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.10 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . 125
5.11 Weakly Supervised Learning . . . . . . . . . . . . . . . . . . . . . 128
5.12 The Curse of Dimensionality and Statistical Limitations of Local Generalization . . . 129
II Modern Practical Deep Networks 141
6 Feedforward Deep Networks 143
6.1 Formalizing and Generalizing Neural Networks . . . . . . . . . . 144
6.2 Parametrizing a Learned Predictor . . . . . . . . . . . . . . . . . 148
6.3 Flow Graphs and Back-Propagation . . . . . . . . . . . . . . . . 158
6.4 Universal Approximation Properties and Depth . . . . . . . . . . 169
6.5 Feature / Representation Learning . . . . . . . . . . . . . . . . . 171