Index
0-1 loss, 102, 275
Absolute value rectification, 192
Accuracy, 422
Activation function, 170
Active constraint, 94
AdaGrad, 306
ADALINE, see adaptive linear element
Adam, 308, 424
Adaptive linear element, 15, 23, 26
Adversarial example, 266
Adversarial training, 267, 269
Affine, 109
AIS, see annealed importance sampling
Almost everywhere, 70
Almost sure convergence, 129
Ancestral sampling, 578, 594
ANN, see Artificial neural network
Annealed importance sampling, 625, 669, 719
Approximate Bayesian computation, 718
Approximate inference, 581
Artificial intelligence, 1
Artificial neural network, see Neural network
ASR, see automatic speech recognition
Asymptotically unbiased, 123
Audio, 359
Autoencoder, 4, 355, 500
Automatic speech recognition, 456
Back-propagation, 202
Back-propagation through time, 383
Backprop, see back-propagation
Bag of words, 469
Bagging, 254
Batch normalization, 265, 424
Bayes error, 116
Bayes’ rule, 69
Bayesian hyperparameter optimization, 435
Bayesian network, see directed graphical model
Bayesian probability, 54
Bayesian statistics, 134
Belief network, see directed graphical model
Bernoulli distribution, 61
BFGS, 315
Bias, 123, 228
Bias parameter, 109
Biased importance sampling, 592
Bigram, 460
Binary relation, 480
Block Gibbs sampling, 598
Boltzmann distribution, 568
Boltzmann machine, 568, 654
BPTT, see back-propagation through time
Broadcasting, 33
Burn-in, 597
CAE, see contractive autoencoder
Calculus of variations, 179
Categorical distribution, see multinoulli distribution
CD, see contrastive divergence
Centering trick (DBM), 671, 675
Central limit theorem, 62
Chain rule (calculus), 204
Chain rule of probability, 58
Chess, 2
Chord, 575
Chordal graph, 575
Class-based language models, 461
Classical dynamical system, 374
Classification, 99
Clique potential, see factor (graphical model)
CNN, see convolutional neural network
Collaborative Filtering, 476
Collider, see explaining away
Color images, 359
Complex cell, 364
Computational graph, 203
Computer vision, 451
Concept drift, 535
Condition number, 278
Conditional computation, see dynamic structure
Conditional independence, viii, 59
Conditional probability, 58
Conditional RBM, 687
Connectionism, 17, 442
Connectionist temporal classification, 459
Consistency, 129, 509
Constrained optimization, 92, 236
Content-based addressing, 418
Content-Based Recommender Systems, 478
Context-specific independence, 571
Contextual bandits, 478
Continuation methods, 326
Contractive autoencoder, 519
Contrast, 452
Contrastive divergence, 290, 608, 669, 674
Convex optimization, 140
Convolution, 329, 685
Convolutional network, 16
Convolutional neural network, 251, 329, 424, 458
Coordinate descent, 321, 673
Correlation, 60
Cost function, see objective function
Covariance, viii, 60
Covariance matrix, 61
Coverage, 423
Critical temperature, 602
Cross-correlation, 331
Cross-entropy, 74, 131
Cross-validation, 121
CTC, see connectionist temporal classification
Curriculum learning, 328
Curse of dimensionality, 154
Cyc, 2
D-separation, 571
DAE, see denoising autoencoder
Data generating distribution, 110, 130
Data generating process, 110
Data parallelism, 446
Dataset, 103
Dataset augmentation, 269, 456
DBM, see deep Boltzmann machine
Decision tree, 145, 546
Decoder, 4
Deep belief network, 26, 527, 631, 657, 660, 686
Deep Blue, 2
Deep Boltzmann machine, 23, 26, 527, 631, 652, 657, 663, 674, 686
Deep feedforward network, 167, 424
Deep learning, 2, 5
Denoising autoencoder, 508, 691
Denoising score matching, 619
Density estimation, 102
Derivative, viii, 82
Design matrix, 105
Detector layer, 338
Determinant, vii
Diagonal matrix, 40
Differential entropy, 73, 646
Dirac delta function, 64
Directed graphical model, 76, 505, 561
Directional derivative, 84
Discriminative fine-tuning, see supervised fine-tuning
Discriminative RBM, 688
Distributed representation, 17, 150, 544
Domain adaptation, 534
Dot product, 33, 140
Double backprop, 269
Doubly block circulant matrix, 332
Dream sleep, 608, 652
DropConnect, 264
Dropout, 256, 424, 429, 430, 674, 691
Dynamic structure, 447
E-step, 634
Early stopping, 246, 248, 271, 272, 424
EBM, see energy-based model
Echo state network, 23, 26, 403
Effective capacity, 113
Eigendecomposition, 41
Eigenvalue, 41
Eigenvector, 41
ELBO, see evidence lower bound
Element-wise product, see Hadamard product
EM, see expectation maximization
Embedding, 514
Empirical distribution, 65
Empirical risk, 275
Empirical risk minimization, 275
Encoder, 4
Energy function, 568
Energy-based model, 568, 594, 654, 663
Ensemble methods, 254
Epoch, 245
Equality constraint, 93
Equivariance, 337
Error function, see objective function
ESN, see echo state network
Euclidean norm, 38
Euler-Lagrange equation, 646
Evidence lower bound, 633, 662
Example, 98
Expectation, 59
Expectation maximization, 634
Expected value, see expectation
Explaining away, 572, 631, 645
Exploitation, 479
Exploration, 479
Exponential distribution, 64
F-score, 422
Factor (graphical model), 565
Factor analysis, 488
Factor graph, 577
Factors of variation, 4
Feature, 98
Feature selection, 235
Feedforward neural network, 167
Fine-tuning, 323
Finite differences, 438
Forget gate, 304
Forward propagation, 202
Fourier transform, 359, 361
Fovea, 365
FPCD, 613
Free energy, 569, 682
Freebase, 481
Frequentist probability, 54
Frequentist statistics, 134
Frobenius norm, 45
Fully-visible Bayes network, 707
Functional derivatives, 646
FVBN, see fully-visible Bayes network
Gabor function, 367
GANs, see generative adversarial networks
Gated recurrent unit, 424
Gaussian distribution, see normal distribution
Gaussian kernel, 141
Gaussian mixture, 66, 188
GCN, see Global contrast normalization
GeneOntology, 481
Generalization, 109
Generalized Lagrange function, see Generalized Lagrangian
Generalized Lagrangian, 93
Generative adversarial networks, 691, 701
Generative models, 589
Generative moment matching networks, 705
Generator network, 695
Gibbs distribution, 566
Gibbs sampling, 579, 598
Global contrast normalization, 453
GPU, see Graphics processing unit
Gradient, 83
Gradient clipping, 288, 413
Gradient descent, 82, 84
Graph, vii
Graphical model, see structured probabilistic model
Graphics processing unit, 443
Greedy algorithm, 323
Greedy layer-wise unsupervised pretraining, 526
Greedy supervised pretraining, 323
Grid search, 431
Hadamard product, vii, 33
Hard tanh, 196
Harmonium, see restricted Boltzmann machine
Harmony theory, 569
Helmholtz free energy, see evidence lower bound
Hessian, 223
Hessian matrix, viii, 86
Heteroscedastic, 187
Hidden layer, 6, 167
Hill climbing, 85
Hyperparameter optimization, 431
Hyperparameters, 119, 429
Hypothesis space, 111, 117
i.i.d. assumptions, 110, 121, 266
Identity matrix, 35
ILSVRC, see ImageNet Large-Scale Visual Recognition Challenge
ImageNet Large-Scale Visual Recognition Challenge, 22
Immorality, 575
Importance sampling, 591, 624, 700
Importance weighted autoencoder, 700
Independence, viii, 59
Independent and identically distributed, see i.i.d. assumptions
Independent component analysis, 489
Independent subspace analysis, 491
Inequality constraint, 93
Inference, 560, 581, 631, 633, 635, 638, 648, 651
Information retrieval, 523
Initialization, 299
Integral, viii
Invariance, 341
Isotropic, 64
Jacobian matrix, viii, 71, 85
Joint probability, 56
k-means, 362, 544
k-nearest neighbors, 142, 546
Karush-Kuhn-Tucker conditions, 94, 236
Karush–Kuhn–Tucker, 93
Kernel (convolution), 330, 331
Kernel machine, 546
Kernel trick, 140
KKT, see Karush–Kuhn–Tucker
KKT conditions, see Karush-Kuhn-Tucker conditions
KL divergence, see Kullback-Leibler divergence
Knowledge base, 2, 481
Krylov methods, 223
Kullback-Leibler divergence, viii, 73
Label smoothing, 242
Lagrange multipliers, 93, 647
Lagrangian, see Generalized Lagrangian
LAPGAN, 704
Laplace distribution, 64, 494
Latent variable, 66
Layer (neural network), 167
LCN, see local contrast normalization
Leaky ReLU, 192
Leaky units, 406
Learning rate, 84
Line search, 84, 85, 92
Linear combination, 36
Linear dependence, 37
Linear factor models, 487
Linear regression, 106, 109, 139
Link Prediction, 482
Lipschitz constant, 91
Lipschitz continuous, 91
Liquid state machine, 403
Local conditional probability distribution,
562
Local contrast normalization, 455
Logistic regression, 3, 139, 140
Logistic sigmoid, 7, 66
Long short-term memory, 304, 407, 409, 424
Loop, 575
Loss function, see objective function
L^p norm, 38
LSTM, 18, 24, 407, see long short-term memory
M-step, 634
Machine learning, 2
Machine translation, 100
Main diagonal, 32
Manifold, 160
Manifold hypothesis, 161
Manifold learning, 161
Manifold tangent classifier, 269
MAP approximation, 137, 503
Marginal probability, 57
Markov chain, 594
Markov chain Monte Carlo, 594
Markov network, see undirected model
Markov random field, see undirected model
Matrix, vi, vii, 31
Matrix inverse, 35
Matrix product, 33
Max norm, 39
Max pooling, 338
Maximum likelihood, 130
Maxout, 192, 424
MCMC, see Markov chain Monte Carlo
Mean field, 638, 639, 669, 674
Mean squared error, 107
Measure theory, 70
Measure zero, 70
Memory network, 415, 417
Method of steepest descent, see gradient descent
Minibatch, 278
Missing inputs, 99
Mixing (Markov chain), 600
Mixture density networks, 188
Mixture distribution, 65
Mixture model, 188, 508
Mixture of experts, 448, 546
MLP, see multilayer perceptron
MNIST, 20, 21, 674
Model averaging, 254
Model compression, 446
Model identifiability, 283
Model parallelism, 446
Moment matching, 705
Moore-Penrose pseudoinverse, 44, 239
Moralized graph, 575
MP-DBM, see multi-prediction DBM
MRF (Markov Random Field), see undirected model
MSE, see mean squared error
Multi-modal learning, 537
Multi-prediction DBM, 671, 676
Multi-task learning, 243, 535
Multilayer perceptron, 5, 26
Multinomial distribution, 61
Multinoulli distribution, 61
n-gram, 459
NADE, 710
Naive Bayes, 3
Nat, 72
Natural image, 557
Natural language processing, 459
Nearest neighbor regression, 114
Negative definite, 88
Negative phase, 468, 605, 608
Neocognitron, 16, 23, 26, 366
Nesterov momentum, 299
Netflix Grand Prize, 256
Netflix prize, 477
Neural language model, 462, 474
Neural network, 13
Neural Turing machine, 417
Neuroscience, 15
Newton’s method, 88, 310
NLM, see neural language model
NLP, see natural language processing
No free lunch theorem, 115
Noise-contrastive estimation, 620
Non-parametric model, 113
Norm, ix, 38
Normal distribution, 62, 63, 124
Normal equations, 108, 111, 233
Normalized initialization, 302
Numerical differentiation, see finite differences
Object detection, 451
Object recognition, 451
Objective function, 81
OMP-k, see orthogonal matching pursuit
One-shot learning, 536
Operation, 203
Optimization, 79, 81
Orthodox statistics, see frequentist statistics
Orthogonal matching pursuit, 26, 253
Orthogonal matrix, 41
Orthogonality, 40
Output layer, 167
Parallel distributed processing, 17
Parameter initialization, 299, 405
Parameter sharing, 250, 334, 372, 374, 388
Parameter tying, see Parameter sharing
Parametric model, 113
Parametric ReLU, 192
Partial derivative, 83
Partition function, 566, 604, 669
PCA, see principal components analysis
PCD, see stochastic maximum likelihood
Perceptron, 15, 26
Persistent contrastive divergence, see stochastic maximum likelihood
Perturbation analysis, see reparametrization trick
Point estimator, 121
Policy, 478
Pooling, 329, 685
Positive definite, 88
Positive phase, 468, 605, 608, 656, 668
Precision, 422
Precision (of a normal distribution), 62, 64
Predictive sparse decomposition, 521
Preprocessing, 451
Pretraining, 322, 526
Primary visual cortex, 363
Principal components analysis, 47, 146, 147, 488, 631
Prior probability distribution, 134
Probabilistic max pooling, 685
Probabilistic PCA, 488, 489, 632
Probability density function, 57
Probability distribution, 55
Probability mass function, 55
Probability mass function estimation, 102
Product of experts, 568
Product rule of probability, see chain rule of probability
PSD, see predictive sparse decomposition
Pseudolikelihood, 615
Quadrature pair, 369
Quasi-Newton condition, 315
Quasi-Newton methods, 315
Radial basis function, 195
Random search, 433
Random variable, 55
Ratio matching, 618
RBF, 195
RBM, see restricted Boltzmann machine
Recall, 422
Receptive field, 336
Recommender Systems, 475
Rectified linear unit, 171, 192, 424, 505
Recurrent network, 26
Recurrent neural network, 377
Regression, 99
Regularization, 119, 177, 227, 429
Regularizer, 118
REINFORCE, 691
Reinforcement learning, 24, 105, 478, 691
Relational database, 481
Reparametrization trick, 690
Representation learning, 3
Representational capacity, 113
Restricted Boltzmann machine, 355, 457, 477, 585, 631, 656, 657, 674, 677, 680, 683, 685
Ridge regression, see weight decay
Risk, 274
RNN-RBM, 688
Saddle points, 284
Sample mean, 124
Scalar, vi, vii, 30
Score matching, 509, 617
Secant condition, 315
Second derivative, 85
Second derivative test, 88
Self-information, 72
Semantic hashing, 523
Semi-supervised learning, 242
Separable convolution, 361
Separation (probabilistic modeling), 570
Set, vii
SGD, see stochastic gradient descent
Shannon entropy, viii, 72
Shortlist, 464
Sigmoid, ix, see logistic sigmoid
Sigmoid belief network, 26
Simple cell, 364
Singular value, see singular value decomposition
Singular value decomposition, 43, 147, 477
Singular vector, see singular value decomposition
Slow feature analysis, 491
SML, see stochastic maximum likelihood
Softmax, 183, 417, 448
Softplus, ix, 67, 196
Spam detection, 3
Sparse coding, 321, 355, 494, 631
Sparse initialization, 303, 405
Sparse representation, 146, 226, 252, 503, 554
Spearmint, 435
Spectral radius, 403
Speech recognition, see automatic speech recognition
Sphering, see whitening
Spike and slab restricted Boltzmann machine, 683
SPN, see sum-product network
Square matrix, 37
ssRBM, see spike and slab restricted Boltzmann machine
Standard deviation, 60
Standard error, 126
Standard error of the mean, 126, 277
Statistic, 121
Statistical learning theory, 109
Steepest descent, see gradient descent
Stochastic back-propagation, see reparametrization trick
Stochastic gradient descent, 15, 150, 278, 293, 674
Stochastic maximum likelihood, 612, 669, 674
Stochastic pooling, 264
Structure learning, 581
Structured output, 100, 687
Structured probabilistic model, 76, 556
Sum rule of probability, 57
Sum-product network, 551
Supervised fine-tuning, 527, 662
Supervised learning, 104
Support vector machine, 140
Surrogate loss function, 275
SVD, see singular value decomposition
Symmetric matrix, 40, 42
Tangent distance, 268
Tangent plane, 514
Tangent prop, 268
TDNN, see time-delay neural network
Teacher forcing, 381, 382
Tempering, 602
Template matching, 140
Tensor, vi, vii, 32
Test set, 109
Tikhonov regularization, see weight decay
Tiled convolution, 351
Time-delay neural network, 366, 373
Toeplitz matrix, 332
Topographic ICA, 491
Trace operator, 45
Training error, 109
Transcription, 100
Transfer learning, 534
Transpose, vii, 32
Triangle inequality, 38
Triangulated graph, see chordal graph
Trigram, 460
Unbiased, 123
Undirected graphical model, 76, 505
Undirected model, 564
Uniform distribution, 56
Unigram, 460
Unit norm, 40
Unit vector, 40
Universal approximation theorem, 197
Universal approximator, 551
Unnormalized probability distribution, 565
Unsupervised learning, 104, 145
Unsupervised pretraining, 457, 526
V-structure, see explaining away
V1, 363
VAE, see variational autoencoder
Vapnik-Chervonenkis dimension, 113
Variance, viii, 60, 228
Variational autoencoder, 691, 698
Variational derivatives, see functional derivatives
Variational free energy, see evidence lower bound
VC dimension, see Vapnik-Chervonenkis dimension
Vector, vi, vii, 31
Virtual adversarial examples, 267
Visible layer, 6
Volumetric data, 359
Wake-sleep, 651, 662
Weight decay, 117, 177, 230, 430
Weight space symmetry, 283
Weights, 15, 106
Whitening, 454
Wikibase, 481
Word embedding, 462
Word-Sense Disambiguation, 482
WordNet, 481
Zero-data learning, see zero-shot learning
Zero-shot learning, 536