
CONTENTS
3.6 The Chain Rule of Conditional Probabilities . . . . . . . . . . . . 52
3.7 Independence and Conditional Independence . . . . . . . . . . . 52
3.8 Expectation, Variance, and Covariance . . . . . . . . . . . . . . . 53
3.9 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.10 Common Probability Distributions . . . . . . . . . . . . . . . . . 57
3.11 Useful Properties of Common Functions . . . . . . . . . . . . . . 62
3.12 Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.13 Technical Details of Continuous Variables . . . . . . . . . . . . . 64
3.14 Structured Probabilistic Models . . . . . . . . . . . . . . . . . . . 65
3.15 Example: Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . 68
4 Numerical Computation 74
4.1 Overflow and Underflow . . . . . . . . . . . . . . . . . . . . . . . 74
4.2 Poor Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 Gradient-Based Optimization . . . . . . . . . . . . . . . . . . . . 76
4.4 Constrained Optimization . . . . . . . . . . . . . . . . . . . . . . 85
4.5 Example: Linear Least Squares . . . . . . . . . . . . . . . . . . . 87
5 Machine Learning Basics 89
5.1 Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2 Example: Linear Regression . . . . . . . . . . . . . . . . . . . . . 97
5.3 Generalization, Capacity, Overfitting and Underfitting . . . . . . 99
5.4 The No Free Lunch Theorem . . . . . . . . . . . . . . . . . . . . 104
5.5 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.6 Hyperparameters, Validation Sets and Cross-Validation . . . . . 108
5.7 Estimators, Bias, and Variance . . . . . . . . . . . . . . . . . . . 110
5.8 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . 118
5.9 Bayesian Statistics and Prior Probability Distributions . . . . . . 121
5.10 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.11 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . 131
5.12 Weakly Supervised Learning . . . . . . . . . . . . . . . . . . . . . 134
5.13 The Curse of Dimensionality and Statistical Limitations of Local Generalization . . . 135
II Modern Practical Deep Networks 147
6 Feedforward Deep Networks 149
6.1 From Fixed Features to Learned Features . . . . . . . . . . . . . 149
6.2 Formalizing and Generalizing Neural Networks . . . . . . . . . . 152
6.3 Parametrizing a Learned Predictor . . . . . . . . . . . . . . . . . 154