Index
L^p norm, 34
k-means, 271, 449
k-nearest neighbors, 449
Absolute value rectification, 155
Active constraint, 87
ADALINE, see Adaptive Linear Element
Adaptive Linear Element, 13, 20, 23
Adversarial example, 222
Affine, 99
AIS, see annealed importance sampling
Almost everywhere, 65
Ancestral sampling, 399
ANN, see Artificial neural network
Annealed importance sampling, 495, 531
Approximate inference, 392
Artificial intelligence, 1
Artificial neural network, see Neural network
Asymptotically unbiased, 111
Audio, 268
Autoencoder, 4
Automatic differentiation, 176
Back-propagation, 167
Back-Propagation Through Time, 285
Bagging, 214
Bayes’ rule, 63, 64
Bayesian hyperparameter optimization, 335
Bayesian network, see directed graphical model
Bayesian probability, 48
Bayesian statistics, 122
Beam search, 318, 330
Belief network, see directed graphical model
Bernoulli distribution, 57
Bias, 111
Boltzmann distribution, 382
Boltzmann machine, 382, 511
BPTT, see Back-Propagation Through Time
Broadcasting, 29
CAE, see contractive autoencoder
Calculus of variations, 507
Categorical distribution, see multinoulli distribution, 57
CD, see contrastive divergence
Centering trick (DBM), 535
Central limit theorem, 58
Chain rule of probability, 52
Chess, 2
Chord, 388
Chordal graph, 388
Classical dynamical system, 281
Classical regularization, 192
Classification, 90
Cliffs, 229
Clipping the gradient, 314
Clique potential, see factor (graphical model)
CNN, see convolutional neural network
Collider, see explaining away
Color images, 268
Computer vision, 344
Concept drift, 440
Conditional computation, see dynamically structured networks, 343
Conditional computation in neural nets, 365
Conditional independence, vi, 52
Conditional probability, 51
Connectionism, 15, 338
Connectionist temporal classification, 318
Consistency, 118
Constrained optimization, 85
Context-specific independence, 385
Continuation methods, 245
Contractive autoencoder, 407, 427, 476
Contrast, 345
Contrastive divergence, 482, 531, 534
Convolution, 247, 538
Convolutional network, 14
Convolutional neural network, 247
Coordinate descent, 240, 534
Correlation, 53
Cost function, see objective function
Covariance, vi, 53
Covariance matrix, 54
Cross entropy, 119, 156
Cross-correlation, 249
Cross-validation, 109
CTC, see connectionist temporal classification
Curriculum learning, 245
Curse of dimensionality, 135
Cyc, 2
D-separation, 384
DAE, see denoising autoencoder
Data generating distribution, 100
Data generating process, 100
Data parallelism, 341
Dataset, 94
Dataset augmentation, 345, 349
DBM, see deep Boltzmann machine
Decision trees, 449
Decoder, 4
Deep belief network, 23, 500, 512, 521, 539
Deep Blue, 2
Deep Boltzmann machine, 20, 23, 500, 512, 524, 534, 539
Deep learning, 1, 5
Denoising autoencoder, 180, 421
Denoising score matching, 490
Density estimation, 93
Derivative, vi, 76
Design matrix, 96
Detector layer, 255
Diagonal matrix, 36
Dirac delta function, 60
Directed graphical model, 66, 376
Directional derivative, 80
Distributed representation, 15, 448
Domain adaptation, 438
Dot product, 30
Doubly block circulant matrix, 249
Dream sleep, 481, 510
DropConnect, 220
Dropout, 180, 217, 333, 334, 534
Dynamic structure, 343
Dynamically structured networks, 343
E-step, 503
Early stopping, 166, 207, 209–212
EBM, see energy-based model
Echo state network, 20, 23, 305
Effective number of parameters, 195
Efficiency, 121
Eigendecomposition, 37
Eigenvalue, 37
Eigenvector, 37
ELBO, see evidence lower bound
Element-wise product, see Hadamard product
EM, see expectation maximization
Embedding, 464
Empirical distribution, 60
Empirical risk, 226
Empirical risk minimization, 226
Encoder, 4
Energy function, 382
Energy-based model, 382, 524
Ensemble methods, 215
Epoch, 227, 236
Equality constraint, 86
Equivariance, 254
Error function, see objective function
ESN, see echo state network
Euclidean norm, 34
Euler-Lagrange equation, 508
Evidence lower bound, 502–505, 523
Example, 94
Expectation, 53
Expectation maximization, 503
Expected value, see expectation
Explaining away, 386
Factor (graphical model), 379
Factor analysis, 412
Factor graph, 390
Factors of variation, 4
Feature, 94
Finite differences, 337
Forward-Backward algorithm, 319
Fourier transform, 268, 270
Fovea, 274
Frequentist probability, 48
Frequentist statistics, 122
Functional derivatives, 507
Gabor function, 275
Gaussian distribution, see Normal distribution, 58
Gaussian kernel, 130
Gaussian mixture, 61
GCN, see Global contrast normalization
Generalization, 99
Generalized Lagrange function, see Generalized Lagrangian
Generalized Lagrangian, 86
Generative adversarial networks, 180
Gibbs distribution, 380
Gibbs sampling, 400
Global contrast normalization, 346
GPU, see Graphics processing unit
Gradient, 80
Gradient clipping, 314
Gradient descent, 80
Graph, v
Graph transformer, 325, 328
Graphical model, see structured probabilistic model
Graphics processing unit, 339
Greedy layer-wise unsupervised pre-training, 431
Grid search, 335
Hadamard product, v, 30
Hard tanh, 155
Harmonium, see Restricted Boltzmann machine, 395
Harmony theory, 383
Helmholtz free energy, see evidence lower bound
Hessian matrix, vi, 81
Hidden layer, 6
Hidden Markov model, 280
HMM, see hidden Markov model
Hyperbolic tangent, 155
Hyperparameters, 108, 333
Hypothesis space, 101, 106
i.i.d., 110
i.i.d. assumptions, 100, 221
Identity matrix, 31
Immorality, 388
Independence, vi, 52
Independent and identically distributed, 110
Independent component analysis, 413
Inequality constraint, 86
Inference, 375, 392, 500, 502–506, 509
Integral, vi
Invariance, 258
Isomap, 435
Jacobian matrix, vi, 65, 80
Joint probability, 49
Karush-Kuhn-Tucker, 86
Karush-Kuhn-Tucker conditions, 87
Kernel (convolution), 248, 249
Kernel machine, 449
Kernel trick, 129
KKT, see Karush-Kuhn-Tucker
KKT conditions, see Karush-Kuhn-Tucker conditions
KL divergence, see Kullback-Leibler divergence, 55
Knowledge base, 2
Kullback-Leibler divergence, vi, 55
Lagrange multipliers, 86, 87, 508
Lagrangian, see Generalized Lagrangian, 86
Latent variable, 408
LCN, see local contrast normalization
Leaky units, 308
Line search, 80
Linear combination, 33
Linear dependence, 33
Linear factor models, 411
Linear regression, 97, 99, 128
Liquid state machine, 305
Local conditional probability distribution, 376
Local contrast normalization, 347
Logistic regression, 2, 129
Logistic sigmoid, 7, 62
Long short-term memory, 309
Loop, 388
Loss function, see objective function
LSTM, 21, see long short-term memory, 309
M-step, 503
Machine learning, 2
Main diagonal, 29
Manifold, 143
Manifold hypothesis, 145, 460
Manifold learning, 144, 460
Manifold tangent classifier, 475
MAP inference, 505
Marginal probability, 51
Markov chain, 320, 399
Markov network, see undirected model, 378
Markov property, 320
Markov random field, see undirected model, 378
Matrix, iv, v, 28
Matrix inverse, 32
Matrix product, 30
Max pooling, 255
Maximum likelihood, 118
Maxout, 155
Mean field, 531, 534
Mean squared error, 98
Measure theory, 64
Measure zero, 64
Method of steepest descent, see gradient descent
Missing inputs, 90
Mixing (Markov chain), 401
Mixture distribution, 61
Mixture of experts, 449
MLP, see multilayer perceptron
MNIST, 534
Model averaging, 215
Model capacity, 333
Model compression, 342
Model parallelism, 341
Moore-Penrose pseudoinverse, 40, 201
Moralized graph, 388
MP-DBM, see multi-prediction DBM
MRF (Markov random field), see undirected model, 378
MSE, see mean squared error, 98
Multi-modal learning, 444
Multi-prediction DBM, 533, 535
Multi-task learning, 221, 440
Multilayer perceptron, 5, 23
Multinomial distribution, 57
Multinoulli distribution, 57
Naive Bayes, 2, 68
Nat, 55
Natural image, 372
Negative definite, 82
Negative phase, 479, 481
Neocognitron, 14, 20, 23
Nesterov momentum, 237
Netflix Grand Prize, 217
Neural network, 12
Neuroscience, 13
Noise-contrastive estimation, 491
Non-parametric, 103
Norm, vi, 34
Normal distribution, 58, 60
Normal equations, 195
Numerical differentiation, 176, see finite differences
Object detection, 344
Object recognition, 344
Objective function, 76
Offset, 153
One-shot learning, 442
Orthodox statistics, see frequentist statistics
Orthogonal matrix, 37
Orthogonality, 36
Overfitting, 333
Parallel distributed processing, 15
Parameter sharing, 251
Parameter tying, see parameter sharing, 214
Parametric, 103
Partial derivative, 80
Partition function, 381, 477, 531
PCA, see principal components analysis
PCD, see stochastic maximum likelihood
Perceptron, 13, 23
Perplexity, 121
Persistent contrastive divergence, see stochastic maximum likelihood
Point estimator, 110
Pooling, 247, 538
Positive definite, 81
Positive phase, 479, 481
Pre-training, 431
Precision (of a normal distribution), 58, 60
Predictive sparse decomposition, 271, 406, 418, 420
Preprocessing, 344
Primary visual cortex, 272
Principal components analysis, 42, 132–134, 146, 349, 412, 500
Prior probability distribution, 122
Probabilistic max pooling, 538
Probability density function, 50
Probability distribution, 49
Probability function estimation, 93
Probability mass function, 49
Product rule of probability, see chain rule of probability
PSD, see predictive sparse decomposition
Pseudolikelihood, 486
Quadrature pair, 276
Radial basis function, 155
Random search, 335
Random variable, 48
Ratio matching, 490
RBF, 155
RBM, see restricted Boltzmann machine
Receptive field, 252
Rectified linear unit, 155
Rectifier, 155
Recurrent network, 23
Recurrent neural network, 283
Regression, 91
Regularization, 107, 189, 333
Reinforcement learning, 180
ReLU, 155
Representation learning, 3
Restricted Boltzmann machine, 395, 500, 512, 514, 534, 535, 537, 538
Ridge regression, 193
Risk, 226
Sample mean, 112
Scalar, iv, v, 27
Score matching, 489
Second derivative, 81
Second derivative test, 81
Self-information, 55
Semi-supervised learning, 131, 445
Separable convolution, 270
Separation (probabilistic modeling), 384
Set, v
SGD, see stochastic gradient descent
Shannon entropy, vi, 55, 508
Sigmoid, vi, see logistic sigmoid, 155
Sigmoid belief network, 23
Simple cell, 273
Simulated annealing, 245
Singular value, see singular value decomposition
Singular value decomposition, 39, 133
Singular vector, see singular value decomposition
SML, see stochastic maximum likelihood
Softmax, 155, 158
Softplus, vi, 62, 155
Spam detection, 2
Sparse coding, 406, 415, 500
Sparse representations, 417
Spearmint, 335
Spectral radius, 306
Sphering, see Whitening, 347
Spike and slab restricted Boltzmann machine, 537
Square matrix, 33
ssRBM, see spike and slab restricted Boltzmann machine
Standard deviation, 53
Statistic, 110
Statistical learning theory, 100
Steepest descent, see gradient descent
Stochastic gradient descent, 13, 227, 236, 534
Stochastic maximum likelihood, 483, 531, 534
Stochastic pooling, 221
Structure learning, 392
Structured output, 91
Structured probabilistic model, 66, 371
Student-t, 407
Sum rule of probability, 51
Sum-product network, 455
Supervised learning, 95
Support vector machine, 129
Surrogate loss function, 226
SVD, see singular value decomposition
Symbolic differentiation, 177
Symmetric matrix, 36, 39
t-SNE, 435
Tangent distance, 473
Tangent plane, 464
Tangent-Prop, 474
Tanh, 155
Teacher forcing, 284
Tensor, iv, v, 29
Test set, 100
Tiled convolution, 265
Toeplitz matrix, 249
Trace operator, 41
Training error, 99
Transcription, 91
Transfer learning, 438
Transpose, v, 29
Triangle inequality, 34
Triangulated graph, see chordal graph
Unbiased, 111
Undirected graphical model, 66
Undirected model, 378
Uniform distribution, 50
Unit norm, 36
Unit vector, 36
Universal approximation theorem, 180
Universal approximator, 454
Unnormalized probability distribution, 379
Unsupervised learning, 95, 131
Unsupervised pre-training, 431
V-structure, see explaining away
V1, 272
Variance, vi, 53
Variational autoencoder, 180
Variational derivatives, see functional derivatives
Variational free energy, see evidence lower bound
Vector, iv, v, 28
Visible layer, 6
Viterbi algorithm, 319
Viterbi decoding, 322
Volumetric data, 268
Weight decay, 106, 193, 334
Weights, 13, 97
Whitening, 347, 349
ZCA, see zero-phase components analysis
Zero-data learning, 442
Zero-phase components analysis, 349
Zero-shot learning, 442