Index
0-1 loss, 102, 275
Absolute value rectification, 192
Accuracy, 422
Activation function, 170
Active constraint, 94
AdaGrad, 306
ADALINE, see adaptive linear element
Adam, 308, 424
Adaptive linear element, 15, 23, 26
Adversarial example, 266
Adversarial training, 267, 269
Affine, 109
AIS, see annealed importance sampling
Almost everywhere, 70
Almost sure convergence, 129
Ancestral sampling, 578, 594
ANN, see Artificial neural network
Annealed importance sampling, 625, 669, 719
Approximate Bayesian computation, 718
Approximate inference, 581
Artificial intelligence, 1
Artificial neural network, see Neural network
ASR, see automatic speech recognition
Asymptotically unbiased, 123
Audio, 359
Autoencoder, 4, 355, 500
Automatic speech recognition, 456
Back-propagation, 202
Back-propagation through time, 383
Backprop, see back-propagation
Bag of words, 469
Bagging, 254
Batch normalization, 265, 424
Bayes error, 116
Bayes’ rule, 69
Bayesian hyperparameter optimization, 435
Bayesian network, see directed graphical model
Bayesian probability, 54
Bayesian statistics, 134
Belief network, see directed graphical model
Bernoulli distribution, 61
BFGS, 315
Bias, 123, 228
Bias parameter, 109
Biased importance sampling, 592
Bigram, 460
Binary relation, 480
Block Gibbs sampling, 598
Boltzmann distribution, 568
Boltzmann machine, 568, 654
BPTT, see back-propagation through time
Broadcasting, 33
Burn-in, 597
CAE, see contractive autoencoder
Calculus of variations, 179
Categorical distribution, see multinoulli distribution
CD, see contrastive divergence
Centering trick (DBM), 671, 675
Central limit theorem, 62
Chain rule (calculus), 204
Chain rule of probability, 58
Chess, 2
Chord, 575
Chordal graph, 575
Class-based language models, 461
Classical dynamical system, 374
Classification, 99
Clique potential, see factor (graphical model)
CNN, see convolutional neural network
Collaborative Filtering, 476
Collider, see explaining away
Color images, 359
Complex cell, 364
Computational graph, 203
Computer vision, 451
Concept drift, 535
Condition number, 278
Conditional computation, see dynamic structure
Conditional independence, viii, 59
Conditional probability, 58
Conditional RBM, 687
Connectionism, 17, 442
Connectionist temporal classification, 459
Consistency, 129, 509
Constrained optimization, 92, 236
Content-based addressing, 418
Content-Based Recommender Systems, 478
Context-specific independence, 571
Contextual bandits, 478
Continuation methods, 326
Contractive autoencoder, 519
Contrast, 452
Contrastive divergence, 290, 608, 669, 674
Convex optimization, 140
Convolution, 329, 685
Convolutional network, 16
Convolutional neural network, 251, 329, 424, 458
Coordinate descent, 321, 673
Correlation, 60
Cost function, see objective function
Covariance, viii, 60
Covariance matrix, 61
Coverage, 423
Critical temperature, 602
Cross-correlation, 331
Cross-entropy, 74, 131
Cross-validation, 121
CTC, see connectionist temporal classification
Curriculum learning, 328
Curse of dimensionality, 154
Cyc, 2
D-separation, 571
DAE, see denoising autoencoder
Data generating distribution, 110, 130
Data generating process, 110
Data parallelism, 446
Dataset, 103
Dataset augmentation, 269, 456
DBM, see deep Boltzmann machine
Decision tree, 145, 546
Decoder, 4
Deep belief network, 26, 527, 631, 657, 660, 686
Deep Blue, 2
Deep Boltzmann machine, 23, 26, 527, 631, 652, 657, 663, 674, 686
Deep feedforward network, 167, 424
Deep learning, 2, 5
Denoising autoencoder, 508, 691
Denoising score matching, 619
Density estimation, 102
Derivative, viii, 82
Design matrix, 105
Detector layer, 338
Determinant, vii
Diagonal matrix, 40
Differential entropy, 73, 646
Dirac delta function, 64
Directed graphical model, 76, 505, 561
Directional derivative, 84
Discriminative fine-tuning, see supervised fine-tuning
Discriminative RBM, 688
Distributed representation, 17, 150, 544
Domain adaptation, 534
Dot product, 33, 140
Double backprop, 269
Doubly block circulant matrix, 332
Dream sleep, 608, 652
DropConnect, 264
Dropout, 256, 424, 429, 430, 674, 691
Dynamic structure, 447
E-step, 634
Early stopping, 246, 248, 271, 272, 424
EBM, see energy-based model
Echo state network, 23, 26, 403
Effective capacity, 113
Eigendecomposition, 41
Eigenvalue, 41
Eigenvector, 41
ELBO, see evidence lower bound
Element-wise product, see Hadamard product
EM, see expectation maximization
Embedding, 514
Empirical distribution, 65
Empirical risk, 275
Empirical risk minimization, 275
Encoder, 4
Energy function, 568
Energy-based model, 568, 594, 654, 663
Ensemble methods, 254
Epoch, 245
Equality constraint, 93
Equivariance, 337
Error function, see objective function
ESN, see echo state network
Euclidean norm, 38
Euler-Lagrange equation, 646
Evidence lower bound, 633, 662
Example, 98
Expectation, 59
Expectation maximization, 634
Expected value, see expectation
Explaining away, 572, 631, 645
Exploitation, 479
Exploration, 479
Exponential distribution, 64
F-score, 422
Factor (graphical model), 565
Factor analysis, 488
Factor graph, 577
Factors of variation, 4
Feature, 98
Feature selection, 235
Feedforward neural network, 167
Fine-tuning, 323
Finite differences, 438
Forget gate, 304
Forward propagation, 202
Fourier transform, 359, 361
Fovea, 365
FPCD, 613
Free energy, 569, 682
Freebase, 481
Frequentist probability, 54
Frequentist statistics, 134
Frobenius norm, 45
Fully-visible Bayes network, 707
Functional derivatives, 646
FVBN, see fully-visible Bayes network
Gabor function, 367
GANs, see generative adversarial networks
Gated recurrent unit, 424
Gaussian distribution, see normal distribution
Gaussian kernel, 141
Gaussian mixture, 66, 188
GCN, see Global contrast normalization
GeneOntology, 481
Generalization, 109
Generalized Lagrange function, see Generalized Lagrangian
Generalized Lagrangian, 93
Generative adversarial networks, 691, 701
Generative models, 589
Generative moment matching networks, 705
Generator network, 695
Gibbs distribution, 566
Gibbs sampling, 579, 598
Global contrast normalization, 453
GPU, see Graphics processing unit
Gradient, 83
Gradient clipping, 288, 413
Gradient descent, 82, 84
Graph, vii
Graphical model, see structured probabilistic model
Graphics processing unit, 443
Greedy algorithm, 323
Greedy layer-wise unsupervised pretraining, 526
Greedy supervised pretraining, 323
Grid search, 431
Hadamard product, vii, 33
Hard tanh, 196
Harmonium, see restricted Boltzmann machine
Harmony theory, 569
Helmholtz free energy, see evidence lower bound
Hessian, 223
Hessian matrix, viii, 86
Heteroscedastic, 187
Hidden layer, 6, 167
Hill climbing, 85
Hyperparameter optimization, 431
Hyperparameters, 119, 429
Hypothesis space, 111, 117
i.i.d. assumptions, 110, 121, 266
Identity matrix, 35
ILSVRC, see ImageNet Large-Scale Visual Recognition Challenge
ImageNet Large-Scale Visual Recognition Challenge, 22
Immorality, 575
Importance sampling, 591, 624, 700
Importance weighted autoencoder, 700
Independence, viii, 59
Independent and identically distributed, see i.i.d. assumptions
Independent component analysis, 489
Independent subspace analysis, 491
Inequality constraint, 93
Inference, 560, 581, 631, 633, 635, 638, 648, 651
Information retrieval, 523
Initialization, 299
Integral, viii
Invariance, 341
Isotropic, 64
Jacobian matrix, viii, 71, 85
Joint probability, 56
k-means, 362, 544
k-nearest neighbors, 142, 546
Karush-Kuhn-Tucker conditions, 94, 236
Karush–Kuhn–Tucker, 93
Kernel (convolution), 330, 331
Kernel machine, 546
Kernel trick, 140
KKT, see Karush–Kuhn–Tucker
KKT conditions, see Karush-Kuhn-Tucker conditions
KL divergence, see Kullback-Leibler divergence
Knowledge base, 2, 481
Krylov methods, 223
Kullback-Leibler divergence, viii, 73
Label smoothing, 242
Lagrange multipliers, 93, 647
Lagrangian, see Generalized Lagrangian
LAPGAN, 704
Laplace distribution, 64, 494
Latent variable, 66
Layer (neural network), 167
LCN, see local contrast normalization
Leaky ReLU, 192
Leaky units, 406
Learning rate, 84
Line search, 84, 85, 92
Linear combination, 36
Linear dependence, 37
Linear factor models, 487
Linear regression, 106, 109, 139
Link Prediction, 482
Lipschitz constant, 91
Lipschitz continuous, 91
Liquid state machine, 403
Local conditional probability distribution,
562
Local contrast normalization, 455
Logistic regression, 3, 139, 140
Logistic sigmoid, 7, 66
Long short-term memory, 304, 407, 409, 424
Loop, 575
Loss function, see objective function
L^p norm, 38
LSTM, 18, 24, 407, see long short-term memory
M-step, 634
Machine learning, 2
Machine translation, 100
Main diagonal, 32
Manifold, 160
Manifold hypothesis, 161
Manifold learning, 161
Manifold tangent classifier, 269
MAP approximation, 137, 503
Marginal probability, 57
Markov chain, 594
Markov chain Monte Carlo, 594
Markov network, see undirected model
Markov random field, see undirected model
Matrix, vi, vii, 31
Matrix inverse, 35
Matrix product, 33
Max norm, 39
Max pooling, 338
Maximum likelihood, 130
Maxout, 192, 424
MCMC, see Markov chain Monte Carlo
Mean field, 638, 639, 669, 674
Mean squared error, 107
Measure theory, 70
Measure zero, 70
Memory network, 415, 417
Method of steepest descent, see gradient descent
Minibatch, 278
Missing inputs, 99
Mixing (Markov chain), 600
Mixture density networks, 188
Mixture distribution, 65
Mixture model, 188, 508
Mixture of experts, 448, 546
MLP, see multilayer perceptron
MNIST, 20, 21, 674
Model averaging, 254
Model compression, 446
Model identifiability, 283
Model parallelism, 446
Moment matching, 705
Moore-Penrose pseudoinverse, 44, 239
Moralized graph, 575
MP-DBM, see multi-prediction DBM
MRF (Markov Random Field), see undirected model
MSE, see mean squared error
Multi-modal learning, 537
Multi-prediction DBM, 671, 676
Multi-task learning, 243, 535
Multilayer perceptron, 5, 26
Multinomial distribution, 61
Multinoulli distribution, 61
n-gram, 459
NADE, 710
Naive Bayes, 3
Nat, 72
Natural image, 557
Natural language processing, 459
Nearest neighbor regression, 114
Negative definite, 88
Negative phase, 468, 605, 608
Neocognitron, 16, 23, 26, 366
Nesterov momentum, 299
Netflix Grand Prize, 256
Netflix prize, 477
Neural language model, 462, 474
Neural network, 13
Neural Turing machine, 417
Neuroscience, 15
Newton’s method, 88, 310
NLM, see neural language model
NLP, see natural language processing
No free lunch theorem, 115
Noise-contrastive estimation, 620
Non-parametric model, 113
Norm, ix, 38
Normal distribution, 62, 63, 124
Normal equations, 108, 111, 233
Normalized initialization, 302
Numerical differentiation, see finite differences
Object detection, 451
Object recognition, 451
Objective function, 81
OMP-k, see orthogonal matching pursuit
One-shot learning, 536
Operation, 203
Optimization, 79, 81
Orthodox statistics, see frequentist statistics
Orthogonal matching pursuit, 26, 253
Orthogonal matrix, 41
Orthogonality, 40
Output layer, 167
Parallel distributed processing, 17
Parameter initialization, 299, 405
Parameter sharing, 250, 334, 372, 374, 388
Parameter tying, see Parameter sharing
Parametric model, 113
Parametric ReLU, 192
Partial derivative, 83
Partition function, 566, 604, 669
PCA, see principal components analysis
PCD, see stochastic maximum likelihood
Perceptron, 15, 26
Persistent contrastive divergence, see stochastic maximum likelihood
Perturbation analysis, see reparametrization trick
Point estimator, 121
Policy, 478
Pooling, 329, 685
Positive definite, 88
Positive phase, 468, 605, 608, 656, 668
Precision, 422
Precision (of a normal distribution), 62, 64
Predictive sparse decomposition, 521
Preprocessing, 451
Pretraining, 322, 526
Primary visual cortex, 363
Principal components analysis, 47, 146, 147, 488, 631
Prior probability distribution, 134
Probabilistic max pooling, 685
Probabilistic PCA, 488, 489, 632
Probability density function, 57
Probability distribution, 55
Probability mass function, 55
Probability mass function estimation, 102
Product of experts, 568
Product rule of probability, see chain rule of probability
PSD, see predictive sparse decomposition
Pseudolikelihood, 615
Quadrature pair, 369
Quasi-Newton condition, 315
Quasi-Newton methods, 315
Radial basis function, 195
Random search, 433
Random variable, 55
Ratio matching, 618
RBF, 195
RBM, see restricted Boltzmann machine
Recall, 422
Receptive field, 336
Recommender Systems, 475
Rectified linear unit, 171, 192, 424, 505
Recurrent network, 26
Recurrent neural network, 377
Regression, 99
Regularization, 119, 177, 227, 429
Regularizer, 118
REINFORCE, 691
Reinforcement learning, 24, 105, 478, 691
Relational database, 481
Reparametrization trick, 690
Representation learning, 3
Representational capacity, 113
Restricted Boltzmann machine, 355, 457, 477, 585, 631, 656, 657, 674, 677, 680, 683, 685
Ridge regression, see weight decay
Risk, 274
RNN-RBM, 688
Saddle points, 284
Sample mean, 124
Scalar, vi, vii, 30
Score matching, 509, 617
Secant condition, 315
Second derivative, 85
Second derivative test, 88
Self-information, 72
Semantic hashing, 523
Semi-supervised learning, 242
Separable convolution, 361
Separation (probabilistic modeling), 570
Set, vii
SGD, see stochastic gradient descent
Shannon entropy, viii, 72
Shortlist, 464
Sigmoid, ix, see logistic sigmoid
Sigmoid belief network, 26
Simple cell, 364
Singular value, see singular value decomposition
Singular value decomposition, 43, 147, 477
Singular vector, see singular value decomposition
Slow feature analysis, 491
SML, see stochastic maximum likelihood
Softmax, 183, 417, 448
Softplus, ix, 67, 196
Spam detection, 3
Sparse coding, 321, 355, 494, 631
Sparse initialization, 303, 405
Sparse representation, 146, 226, 252, 503, 554
Spearmint, 435
Spectral radius, 403
Speech recognition, see automatic speech recognition
Sphering, see whitening
Spike and slab restricted Boltzmann machine, 683
SPN, see sum-product network
Square matrix, 37
ssRBM, see spike and slab restricted Boltzmann machine
Standard deviation, 60
Standard error, 126
Standard error of the mean, 126, 277
Statistic, 121
Statistical learning theory, 109
Steepest descent, see gradient descent
Stochastic back-propagation, see reparametrization trick
Stochastic gradient descent, 15, 150, 278, 293, 674
Stochastic maximum likelihood, 612, 669, 674
Stochastic pooling, 264
Structure learning, 581
Structured output, 100, 687
Structured probabilistic model, 76, 556
Sum rule of probability, 57
Sum-product network, 551
Supervised fine-tuning, 527, 662
Supervised learning, 104
Support vector machine, 140
Surrogate loss function, 275
SVD, see singular value decomposition
Symmetric matrix, 40, 42
Tangent distance, 268
Tangent plane, 514
Tangent prop, 268
TDNN, see time-delay neural network
Teacher forcing, 381, 382
Tempering, 602
Template matching, 140
Tensor, vi, vii, 32
Test set, 109
Tikhonov regularization, see weight decay
Tiled convolution, 351
Time-delay neural network, 366, 373
Toeplitz matrix, 332
Topographic ICA, 491
Trace operator, 45
Training error, 109
Transcription, 100
Transfer learning, 534
Transpose, vii, 32
Triangle inequality, 38
Triangulated graph, see chordal graph
Trigram, 460
Unbiased, 123
Undirected graphical model, 76, 505
Undirected model, 564
Uniform distribution, 56
Unigram, 460
Unit norm, 40
Unit vector, 40
Universal approximation theorem, 197
Universal approximator, 551
Unnormalized probability distribution, 565
Unsupervised learning, 104, 145
Unsupervised pretraining, 457, 526
V-structure, see explaining away
V1, 363
VAE, see variational autoencoder
Vapnik-Chervonenkis dimension, 113
Variance, viii, 60, 228
Variational autoencoder, 691, 698
Variational derivatives, see functional derivatives
Variational free energy, see evidence lower bound
VC dimension, see Vapnik-Chervonenkis dimension
Vector, vi, vii, 31
Virtual adversarial examples, 267
Visible layer, 6
Volumetric data, 359
Wake-sleep, 651, 662
Weight decay, 117, 177, 230, 430
Weight space symmetry, 283
Weights, 15, 106
Whitening, 454
Wikibase, 481
Word embedding, 462
Word-Sense Disambiguation, 482
WordNet, 481
Zero-data learning, see zero-shot learning
Zero-shot learning, 536