Index
L^p norm, 34
k-means, 271, 449
k-nearest neighbors, 449
Absolute value rectification, 155
Active constraint, 87
ADALINE, see Adaptive Linear Element
Adaptive Linear Element, 13, 20, 23
Adversarial example, 222
Affine, 99
AIS, see annealed importance sampling
Almost everywhere, 65
Ancestral sampling, 399
ANN, see Artificial neural network
Annealed importance sampling, 495, 531
Approximate inference, 392
Artificial intelligence, 1
Artificial neural network, see Neural network
Asymptotically unbiased, 111
Audio, 268
Autoencoder, 4
Automatic differentiation, 176
Back-propagation, 167
Back-Propagation Through Time, 285
Bagging, 214
Bayes’ rule, 63, 64
Bayesian hyperparameter optimization, 335
Bayesian network, see directed graphical model
Bayesian probability, 48
Bayesian statistics, 122
Beam search, 318, 330
Belief network, see directed graphical model
Bernoulli distribution, 57
Bias, 111
Boltzmann distribution, 382
Boltzmann machine, 382, 511
BPTT, see Back-Propagation Through Time
Broadcasting, 29
CAE, see contractive autoencoder
Calculus of variations, 507
Categorical distribution, see multinoulli distribution, 57
CD, see contrastive divergence
Centering trick (DBM), 535
Central limit theorem, 58
Chain rule of probability, 52
Chess, 2
Chord, 388
Chordal graph, 388
Classical dynamical system, 281
Classical regularization, 192
Classification, 90
Cliffs, 229
Clipping the gradient, 314
Clique potential, see factor (graphical model)
CNN, see convolutional neural network
Collider, see explaining away
Color images, 268
Computer vision, 344
Concept drift, 440
Conditional computation, see dynamically structured networks, 343
Conditional computation in neural nets, 365
Conditional independence, vi, 52
Conditional probability, 51
Connectionism, 15, 338
Connectionist temporal classification, 318
Consistency, 118
Constrained optimization, 85
Context-specific independence, 385
Continuation methods, 245
Contractive autoencoder, 407, 427, 476
Contrast, 345
Contrastive divergence, 482, 531, 534
Convolution, 247, 538
Convolutional network, 14
Convolutional neural network, 247
Coordinate descent, 240, 534
Correlation, 53
Cost function, see objective function
Covariance, vi, 53
Covariance matrix, 54
Cross entropy, 119, 156
Cross-correlation, 249
Cross-validation, 109
CTC, see connectionist temporal classification
Curriculum learning, 245
Curse of dimensionality, 135
Cyc, 2
D-separation, 384
DAE, see denoising autoencoder
Data generating distribution, 100
Data generating process, 100
Data parallelism, 341
Dataset, 94
Dataset augmentation, 345, 349
DBM, see deep Boltzmann machine
Decision trees, 449
Decoder, 4
Deep belief network, 23, 500, 512, 521, 539
Deep Blue, 2
Deep Boltzmann machine, 20, 23, 500, 512, 524, 534, 539
Deep learning, 1, 5
Denoising autoencoder, 180, 421
Denoising score matching, 490
Density estimation, 93
Derivative, vi, 76
Design matrix, 96
Detector layer, 255
Diagonal matrix, 36
Dirac delta function, 60
Directed graphical model, 66, 376
Directional derivative, 80
Distributed representation, 15, 448
Domain adaptation, 438
Dot product, 30
Doubly block circulant matrix, 249
Dream sleep, 481, 510
DropConnect, 220
Dropout, 180, 217, 333, 334, 534
Dynamic structure, 343
Dynamically structured networks, 343
E-step, 503
Early stopping, 166, 207, 209–212
EBM, see energy-based model
Echo state network, 20, 23, 305
Effective number of parameters, 195
Efficiency, 121
Eigendecomposition, 37
Eigenvalue, 37
Eigenvector, 37
ELBO, see evidence lower bound
Element-wise product, see Hadamard product
EM, see expectation maximization
Embedding, 464
Empirical distribution, 60
Empirical risk, 226
Empirical risk minimization, 226
Encoder, 4
Energy function, 382
Energy-based model, 382, 524
Ensemble methods, 215
Epoch, 227, 236
Equality constraint, 86
Equivariance, 254
Error function, see objective function
ESN, see echo state network
Euclidean norm, 34
Euler-Lagrange equation, 508
Evidence lower bound, 502–505, 523
Example, 94
Expectation, 53
Expectation maximization, 503
Expected value, see expectation
Explaining away, 386
Factor (graphical model), 379
Factor analysis, 412
Factor graph, 390
Factors of variation, 4
Feature, 94
Finite differences, 337
Forward-Backward algorithm, 319
Fourier transform, 268, 270
Fovea, 274
Frequentist probability, 48
Frequentist statistics, 122
Functional derivatives, 507
Gabor function, 275
Gaussian distribution, see Normal distribution, 58
Gaussian kernel, 130
Gaussian mixture, 61
GCN, see Global contrast normalization
Generalization, 99
Generalized Lagrange function, see Generalized Lagrangian
Generalized Lagrangian, 86
Generative adversarial networks, 180
Gibbs distribution, 380
Gibbs sampling, 400
Global contrast normalization, 346
GPU, see Graphics processing unit
Gradient, 80
Gradient clipping, 314
Gradient descent, 80
Graph, v
Graph transformer, 325, 328
Graphical model, see structured probabilistic model
Graphics processing unit, 339
Greedy layer-wise unsupervised pre-training, 431
Grid search, 335
Hadamard product, v, 30
Hard tanh, 155
Harmonium, see Restricted Boltzmann machine, 395
Harmony theory, 383
Helmholtz free energy, see evidence lower bound
Hessian matrix, vi, 81
Hidden layer, 6
Hidden Markov model, 280
HMM, see hidden Markov model
Hyperbolic tangent, 155
Hyperparameters, 108, 333
Hypothesis space, 101, 106
i.i.d., 110
i.i.d. assumptions, 100, 221
Identity matrix, 31
Immorality, 388
Independence, vi, 52
Independent and identically distributed, 110
Independent component analysis, 413
Inequality constraint, 86
Inference, 375, 392, 500, 502–506, 509
Integral, vi
Invariance, 258
Isomap, 435
Jacobian matrix, vi, 65, 80
Joint probability, 49
Karush-Kuhn-Tucker, 86
Karush-Kuhn-Tucker conditions, 87
Kernel (convolution), 248, 249
Kernel machine, 449
Kernel trick, 129
KKT, see Karush-Kuhn-Tucker
KKT conditions, see Karush-Kuhn-Tucker conditions
KL divergence, see Kullback-Leibler divergence, 55
Knowledge base, 2
Kullback-Leibler divergence, vi, 55
Lagrange multipliers, 86, 87, 508
Lagrangian, see Generalized Lagrangian, 86
Latent variable, 408
LCN, see local contrast normalization
Leaky units, 308
Line search, 80
Linear combination, 33
Linear dependence, 33
Linear factor models, 411
Linear regression, 97, 99, 128
Liquid state machine, 305
Local conditional probability distribution, 376
Local contrast normalization, 347
Logistic regression, 2, 129
Logistic sigmoid, 7, 62
Long short-term memory, 309
Loop, 388
Loss function, see objective function
LSTM, 21, see long short-term memory, 309
M-step, 503
Machine learning, 2
Main diagonal, 29
Manifold, 143
Manifold hypothesis, 145, 460
Manifold learning, 144, 460
Manifold tangent classifier, 475
MAP inference, 505
Marginal probability, 51
Markov chain, 320, 399
Markov network, see undirected model, 378
Markov property, 320
Markov random field, see undirected model, 378
Matrix, iv, v, 28
Matrix inverse, 32
Matrix product, 30
Max pooling, 255
Maximum likelihood, 118
Maxout, 155
Mean field, 531, 534
Mean squared error, 98
Measure theory, 64
Measure zero, 64
Method of steepest descent, see gradient descent
Missing inputs, 90
Mixing (Markov chain), 401
Mixture distribution, 61
Mixture of experts, 449
MLP, see multilayer perceptron
MNIST, 534
Model averaging, 215
Model capacity, 333
Model compression, 342
Model parallelism, 341
Moore-Penrose pseudoinverse, 40, 201
Moralized graph, 388
MP-DBM, see multi-prediction DBM
MRF (Markov random field), see undirected model, 378
MSE, see mean squared error, 98
Multi-modal learning, 444
Multi-prediction DBM, 533, 535
Multi-task learning, 221, 440
Multilayer perceptron, 5, 23
Multinomial distribution, 57
Multinoulli distribution, 57
Naive Bayes, 2, 68
Nat, 55
Natural image, 372
Negative definite, 82
Negative phase, 479, 481
Neocognitron, 14, 20, 23
Nesterov momentum, 237
Netflix Grand Prize, 217
Neural network, 12
Neuroscience, 13
Noise-contrastive estimation, 491
Non-parametric, 103
Norm, vi, 34
Normal distribution, 58, 60
Normal equations, 195
Numerical differentiation, 176, see finite differences
Object detection, 344
Object recognition, 344
Objective function, 76
Offset, 153
One-shot learning, 442
Orthodox statistics, see frequentist statistics
Orthogonal matrix, 37
Orthogonality, 36
Overfitting, 333
Parallel distributed processing, 15
Parameter sharing, 251
Parameter tying, see parameter sharing, 214
Parametric, 103
Partial derivative, 80
Partition function, 381, 477, 531
PCA, see principal components analysis
PCD, see stochastic maximum likelihood
Perceptron, 13, 23
Perplexity, 121
Persistent contrastive divergence, see stochastic maximum likelihood
Point estimator, 110
Pooling, 247, 538
Positive definite, 81
Positive phase, 479, 481
Pre-training, 431
Precision (of a normal distribution), 58, 60
Predictive sparse decomposition, 271, 406, 418, 420
Preprocessing, 344
Primary visual cortex, 272
Principal components analysis, 42, 132–134, 146, 349, 412, 500
Prior probability distribution, 122
Probabilistic max pooling, 538
Probability density function, 50
Probability distribution, 49
Probability function estimation, 93
Probability mass function, 49
Product rule of probability, see chain rule of probability
PSD, see predictive sparse decomposition
Pseudolikelihood, 486
Quadrature pair, 276
Radial basis function, 155
Random search, 335
Random variable, 48
Ratio matching, 490
RBF, 155
RBM, see restricted Boltzmann machine
Receptive field, 252
Rectified linear unit, 155
Rectifier, 155
Recurrent network, 23
Recurrent neural network, 283
Regression, 91
Regularization, 107, 189, 333
Reinforcement learning, 180
ReLU, 155
Representation learning, 3
Restricted Boltzmann machine, 395, 500, 512, 514, 534, 535, 537, 538
Ridge regression, 193
Risk, 226
Sample mean, 112
Scalar, iv, v, 27
Score matching, 489
Second derivative, 81
Second derivative test, 81
Self-information, 55
Semi-supervised learning, 131, 445
Separable convolution, 270
Separation (probabilistic modeling), 384
Set, v
SGD, see stochastic gradient descent
Shannon entropy, vi, 55, 508
Sigmoid, vi, see logistic sigmoid, 155
Sigmoid belief network, 23
Simple cell, 273
Simulated annealing, 245
Singular value, see singular value decomposition
Singular value decomposition, 39, 133
Singular vector, see singular value decomposition
SML, see stochastic maximum likelihood
Softmax, 155, 158
Softplus, vi, 62, 155
Spam detection, 2
Sparse coding, 406, 415, 500
Sparse representations, 417
Spearmint, 335
Spectral radius, 306
Sphering, see Whitening, 347
Spike and slab restricted Boltzmann machine, 537
Square matrix, 33
ssRBM, see spike and slab restricted Boltzmann machine
Standard deviation, 53
Statistic, 110
Statistical learning theory, 100
Steepest descent, see gradient descent
Stochastic gradient descent, 13, 227, 236, 534
Stochastic maximum likelihood, 483, 531, 534
Stochastic pooling, 221
Structure learning, 392
Structured output, 91
Structured probabilistic model, 66, 371
Student-t, 407
Sum rule of probability, 51
Sum-product network, 455
Supervised learning, 95
Support vector machine, 129
Surrogate loss function, 226
SVD, see singular value decomposition
Symbolic differentiation, 177
Symmetric matrix, 36, 39
t-SNE, 435
Tangent distance, 473
Tangent plane, 464
Tangent-Prop, 474
Tanh, 155
Teacher forcing, 284
Tensor, iv, v, 29
Test set, 100
Tiled convolution, 265
Toeplitz matrix, 249
Trace operator, 41
Training error, 99
Transcription, 91
Transfer learning, 438
Transpose, v, 29
Triangle inequality, 34
Triangulated graph, see chordal graph
Unbiased, 111
Undirected graphical model, 66
Undirected model, 378
Uniform distribution, 50
Unit norm, 36
Unit vector, 36
Universal approximation theorem, 180
Universal approximator, 454
Unnormalized probability distribution, 379
Unsupervised learning, 95, 131
Unsupervised pre-training, 431
V-structure, see explaining away
V1, 272
Variance, vi, 53
Variational autoencoder, 180
Variational derivatives, see functional derivatives
Variational free energy, see evidence lower bound
Vector, iv, v, 28
Visible layer, 6
Viterbi algorithm, 319
Viterbi decoding, 322
Volumetric data, 268
Weight decay, 106, 193, 334
Weights, 13, 97
Whitening, 347, 349
ZCA, see zero-phase components analysis
Zero-data learning, 442
Zero-phase components analysis, 349
Zero-shot learning, 442