Bibliography
Alain, G. and Bengio, Y. (2012). What regularized auto-encoders learn from the data generating
distribution. Technical report, arXiv:1211.4246, Université de Montréal. 224
Alain, G. and Bengio, Y. (2013). What regularized auto-encoders learn from the data generating
distribution. In ICLR’2013. Also published as arXiv:1211.4246. 224, 226
Amari, S. (1997). Neural learning in structured parameter spaces - natural Riemannian gradient.
In Advances in Neural Information Processing Systems, pages 127–133. MIT Press. 105
Bahl, L. R., Brown, P., de Souza, P. V., and Mercer, R. L. (1987). Speech recognition with
continuous-parameter hidden Markov models. Computer Speech and Language, 2, 219–234.
47
Baldi, P. and Sadowski, P. J. (2013). Understanding dropout. In Advances in Neural Information
Processing Systems 26 , pages 2814–2822. 128
Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function.
IEEE Trans. on Information Theory, 39, 930–945. 112
Bartholomew, D. J. (1987). Latent variable models and factor analysis. Oxford University Press.
159
Basilevsky, A. (1994). Statistical Factor Analysis and Related Methods: Theory and Applications.
Wiley. 159
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard,
N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning
and Unsupervised Feature Learning NIPS 2012 Workshop. 55
Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Computation, 15(6), 1373–1396. 91, 211
Bengio, S. and Bengio, Y. (2000a). Taking on the curse of dimensionality in joint distributions
using neural networks. IEEE Transactions on Neural Networks, special issue on Data Mining
and Knowledge Discovery, 11(3), 550–557. 202
Bengio, Y. (2009). Learning deep architectures for AI . Now Publishers. 19, 87, 114
Bengio, Y. (2011). Deep learning of representations for unsupervised and transfer learning. In
JMLR W&CP: Proc. Unsupervised and Transfer Learning. 18
Bengio, Y. and Bengio, S. (2000b). Modeling high-dimensional discrete data with multi-layer
neural networks. In NIPS’99, pages 400–406. MIT Press. 202, 204, 205
Bengio, Y. and Delalleau, O. (2009). Justifying and generalizing contrastive divergence. Neural
Computation, 21(6), 1601–1621. 240
Bengio, Y. and LeCun, Y. (2007a). Scaling learning algorithms towards AI. In Large Scale
Kernel Machines. 87
Bengio, Y. and LeCun, Y. (2007b). Scaling learning algorithms towards AI. In L. Bottou,
O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press.
115
Bengio, Y. and Monperrus, M. (2005). Non-local manifold tangent learning. In NIPS’04, pages
129–136. MIT Press. 90, 212, 213
Bengio, Y., Ducharme, R., and Vincent, P. (2001a). A neural probabilistic language model. In
NIPS’00 , pages 932–938. MIT Press. 15
Bengio, Y., Ducharme, R., and Vincent, P. (2001b). A neural probabilistic language model. In
T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in NIPS 13, pages 932–938.
MIT Press. 215, 216
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language
model. Journal of Machine Learning Research, 3, 1137–1155. 215, 216
Bengio, Y., Delalleau, O., and Le Roux, N. (2006a). The curse of highly variable functions for
local kernel machines. In NIPS’2005. 87
Bengio, Y., Larochelle, H., and Vincent, P. (2006b). Non-local manifold Parzen windows. In
NIPS’2005 . MIT Press. 90, 212
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of
deep networks. In NIPS’2006. 15, 121, 164
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In
ICML’09 . 106
Bengio, Y., Yao, L., Alain, G., and Vincent, P. (2013). Generalized denoising auto-encoders as
generative models. In Advances in Neural Information Processing Systems 26 (NIPS’13). 227
Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014). Deep generative stochastic
networks trainable by backprop. In Proceedings of the 30th International Conference on
Machine Learning (ICML’14). 227, 228
Bennett, C. (1976). Efficient estimation of free energy differences from Monte Carlo data. Journal
of Computational Physics, 22(2), 245–268. 236
Berglund, M. and Raiko, T. (2013). Stochastic gradient estimate variance in contrastive diver-
gence and persistent contrastive divergence. CoRR, abs/1312.6002. 243
Bergstra, J. (2011). Incorporating Complex Cells into Neural Networks for Pattern Classification.
Ph.D. thesis, Université de Montréal. 155
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J.,
Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler.
In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.
55
Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24(3), 179–195. 245
Bishop, C. M. (1994). Mixture density networks. 102
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1989). Learnability and the
Vapnik–Chervonenkis dimension. Journal of the ACM, 36(4), 929–965. 75, 76
Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2012). Joint learning of words and meaning
representations for open-text semantic parsing. AISTATS’2012 . 200
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal margin
classifiers. In COLT ’92: Proceedings of the fifth annual workshop on Computational learning
theory, pages 144–152, New York, NY, USA. ACM. 15, 87, 98
Bottou, L. (2011). From machine learning to machine reasoning. Technical report,
arXiv:1102.1808. 199, 200
Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer perceptrons and singular
value decomposition. Biological Cybernetics, 59, 291–294. 152
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press, New
York, NY, USA. 61
Brand, M. (2003). Charting a manifold. In NIPS’2002 , pages 961–968. MIT Press. 91, 211
Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140. 123
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regres-
sion Trees. Wadsworth International Group, Belmont, CA. 87
Carreira-Perpiñán, M. A. and Hinton, G. E. (2005). On contrastive divergence learning. In
R. G. Cowell and Z. Ghahramani, editors, AISTATS’2005 , pages 33–40. Society for Artificial
Intelligence and Statistics. 240
Cayton, L. (2005). Algorithms for manifold learning. Technical Report CS2008-0923, UCSD.
14, 91, 207
Ciresan, D., Meier, U., Masci, J., and Schmidhuber, J. (2012). Multi-column deep neural network
for traffic sign classification. Neural Networks, 32, 333–338. 16, 114
Coates, A., Lee, H., and Ng, A. Y. (2011). An analysis of single-layer networks in unsuper-
vised feature learning. In Proceedings of the Thirteenth International Conference on Artificial
Intelligence and Statistics (AISTATS 2011). 278
Collobert, R. (2004). Large Scale Machine Learning. Ph.D. thesis, Université de Paris VI, LIP6.
99
Comon, P. (1994). Independent component analysis - a new concept? Signal Processing, 36,
287–314. 162
Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273–297.
15, 87
Couprie, C., Farabet, C., Najman, L., and LeCun, Y. (2013). Indoor semantic segmentation using
depth information. In International Conference on Learning Representations (ICLR2013). 16,
114
Courville, A., Bergstra, J., and Bengio, Y. (2011). Unsupervised models of images by spike-and-
slab RBMs. In ICML’11. 134
Cover, T. (2006). Elements of Information Theory. Wiley-Interscience. 41
Crick, F. H. C. and Mitchison, G. (1983). The function of dream sleep. Nature, 304, 111–114.
239
Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition with the
mean-covariance restricted Boltzmann machine. In NIPS’2010 . 15
Dauphin, Y. and Bengio, Y. (2013a). Big neural networks waste capacity. In ICLR’2013 work-
shops track (oral presentation), arXiv:1301.3583. 18
Dauphin, Y. and Bengio, Y. (2013b). Stochastic ratio matching of RBMs for sparse high-
dimensional inputs. In NIPS26. NIPS Foundation. 248
Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying
and attacking the saddle point problem in high-dimensional non-convex optimization. In
NIPS’2014 . 59
Davis, A., Rubinstein, M., Wadhwa, N., Mysore, G., Durand, F., and Freeman, W. T. (2014).
The visual microphone: Passive recovery of sound from video. ACM Transactions on Graphics
(Proc. SIGGRAPH), 33(4), 79:1–79:10. 276
Delalleau, O. and Bengio, Y. (2011). Shallow vs. deep sum-product networks. In NIPS. 114
Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton, G. (2010). Binary coding
of speech spectrograms using a deep auto-encoder. In Interspeech 2010 , Makuhari, Chiba,
Japan. 15
Desjardins, G. and Bengio, Y. (2008). Empirical evaluation of convolutional RBMs for vision.
Technical Report 1327, Département d'Informatique et de Recherche Opérationnelle, Univer-
sité de Montréal. 268
Desjardins, G., Courville, A., and Bengio, Y. (2011). On tracking the partition function. In
NIPS’2011 . 237
Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., and Makhoul, J. (2014). Fast and
robust neural network joint models for statistical machine translation. In Proc. ACL’2014.
10
Donoho, D. L. and Grimes, C. (2003). Hessian eigenmaps: new locally linear embedding tech-
niques for high-dimensional data. Technical Report 2003-08, Dept. Statistics, Stanford Uni-
versity. 91, 211
Dugas, C., Bengio, Y., Bélisle, F., and Nadeau, C. (2001). Incorporating second-order functional
knowledge for better option pricing. In NIPS’00, pages 472–478. MIT Press. 99
Ebrahimi, S., Pal, C., Bouthillier, X., Froumenty, P., Jean, S., Konda, K. R., Vincent, P.,
Courville, A., and Bengio, Y. (2013). Combining modality specific deep neural network mod-
els for emotion recognition in video. In Emotion Recognition In The Wild Challenge and
Workshop (Emotiw2013). 7, 114
Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S. (2010). Why
does unsupervised pre-training help deep learning? JMLR, 11, 625–660. 18
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013). Learning hierarchical features for
scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence. in press.
16, 114
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of
Eugenics, 7, 179–188. 70
Frasconi, P., Gori, M., and Sperduti, A. (1997). On the efficient classification of data structures
by neural networks. In Proc. Int. Joint Conf. on Artificial Intelligence. 199, 200
Frasconi, P., Gori, M., and Sperduti, A. (1998). A general framework for adaptive processing of
data structures. IEEE Transactions on Neural Networks, 9(5), 768–786. 200
Frey, B. J. (1998). Graphical models for machine learning and digital communication. MIT
Press. 201
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism
of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202. 15
Girosi, F. (1994). Regularization theory, radial basis functions and networks. In V. Cherkassky,
J. Friedman, and H. Wechsler, editors, From Statistics to Neural Networks, volume 136 of
NATO ASI Series, pages 166–187. Springer Berlin Heidelberg. 112
Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward
neural networks. In AISTATS’2010. 15
Glorot, X., Bordes, A., and Bengio, Y. (2011a). Deep sparse rectifier neural networks. In
AISTATS’2011 . 15, 99, 221, 222
Glorot, X., Bordes, A., and Bengio, Y. (2011b). Domain adaptation for large-scale sentiment
classification: A deep learning approach. In ICML’2011. 221
Gong, S., McKenna, S., and Psarrou, A. (2000). Dynamic Vision: From Images to Face Recog-
nition. Imperial College Press. 211, 214
Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep networks. In
NIPS’2009 , pages 646–654. 155, 221
Goodfellow, I., Koenig, N., Muja, M., Pantofaru, C., Sorokin, A., and Takayama, L. (2010).
Help me help you: Interfaces for personal robots. In Proc. of Human Robot Interaction
(HRI), Osaka, Japan. ACM Press, ACM Press. 68
Goodfellow, I., Courville, A., and Bengio, Y. (2012). Large-scale feature learning with spike-
and-slab sparse coding. In ICML’2012 . 164
Goodfellow, I. J. (2010). Technical report: Multidimensional, downsampled convolution for
autoencoders. Technical report, Université de Montréal. 183
Goodfellow, I. J., Courville, A., and Bengio, Y. (2011). Spike-and-slab sparse coding for unsu-
pervised feature discovery. In NIPS Workshop on Challenges in Learning Hierarchical Models.
10, 18, 114, 166
Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a). Maxout
networks. In ICML’2013 . 15
Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013b). Maxout
networks. In S. Dasgupta and D. McAllester, editors, ICML’13 , pages 1319–1327. 126, 278
Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013c). Multi-prediction deep
Boltzmann machines. In NIPS26. NIPS Foundation. 246, 265, 267
Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks. Studies in
Computational Intelligence. Springer. 197
Graves, A. (2013). Generating sequences with recurrent neural networks. Technical report, arXiv
preprint arXiv:1308.0850. 103
Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional
LSTM and other neural network architectures. Neural Networks, 18(5), 602–610. 197
Graves, A. and Schmidhuber, J. (2009). Offline handwriting recognition with multidimensional
recurrent neural networks. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors,
NIPS’2008 , pages 545–552. 197
Graves, A., Liwicki, M., Bunke, H., Schmidhuber, J., and Fernández, S. (2008). Unconstrained
on-line handwriting recognition with recurrent neural networks. In J. Platt, D. Koller,
Y. Singer, and S. Roweis, editors, NIPS’2007 , pages 577–584. 197
Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep recurrent
neural networks. In ICASSP’2013, pages 6645–6649. IEEE. 198
Gulcehre, C. and Bengio, Y. (2013). Knowledge matters: Importance of prior information for
optimization. In International Conference on Learning Representations (ICLR’2013). 19
Gutmann, M. and Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation prin-
ciple for unnormalized statistical models. In Proceedings of The Thirteenth International
Conference on Artificial Intelligence and Statistics (AISTATS’10). 248
Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of
the 18th annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley, California.
ACM Press. 114
Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Compu-
tational Complexity, 1, 113–129. 114
Herault, J. and Ans, B. (1984). Circuits neuronaux à synapses modifiables: Décodage de mes-
sages composites par apprentissage non supervisé. Comptes Rendus de l'Académie des Sci-
ences, 299(III-13), 525–528. 162
Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen,
P., Sainath, T., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling in
speech recognition. IEEE Signal Processing Magazine, 29(6), 82–97. 10, 16
Hinton, G. E. (2000). Training products of experts by minimizing contrastive divergence. Tech-
nical Report GCNU TR 2000-004, Gatsby Unit, University College London. 239
Hinton, G. E. and Roweis, S. (2003). Stochastic neighbor embedding. In NIPS’2002 . 211
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural
networks. Science, 313(5786), 504–507. 121, 157, 164, 165
Hinton, G. E. and Zemel, R. S. (1994). Autoencoders, minimum description length, and
Helmholtz free energy. In NIPS’1993 . 152
Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets.
Neural Computation, 18, 1527–1554. 15, 121, 164, 165, 262
Hsu, F.-H. (2002). Behind Deep Blue: Building the Computer That Defeated the World Chess
Champion. Princeton University Press, Princeton, NJ, USA. 2
Huang, F. and Ogata, Y. (2002). Generalized pseudo-likelihood estimates for Markov random
fields on lattice. Annals of the Institute of Statistical Mathematics, 54(1), 1–18. 245
Hyotyniemi, H. (1996). Turing machines are recurrent neural networks. In STeP’96, pages
13–24. 189
Hyvärinen, A. (1999). Survey on independent component analysis. Neural Computing Surveys,
2, 94–128. 162
Hyvärinen, A. (2005). Estimation of non-normalized statistical models using score matching.
Journal of Machine Learning Research, 6, 695–709. 246
Hyvärinen, A. (2007a). Connections between score matching, contrastive divergence, and pseu-
dolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks, 18,
1529–1531. 247
Hyvärinen, A. (2007b). Some extensions of score matching. Computational Statistics and Data
Analysis, 51, 2499–2512. 247
Hyvärinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis. Wiley-
Interscience. 162
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixture of local
experts. Neural Computation, 3, 79–87. 102
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage
architecture for object recognition? In ICCV’09 . 99
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. 34
Jordan, M. I. (1998). Learning in Graphical Models. Kluwer, Dordrecht, Netherlands. 15
Jutten, C. and Herault, J. (1991). Blind separation of sources, part I: an adaptive algorithm
based on neuromimetic architecture. Signal Processing, 24, 1–10. 162
Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008). Fast inference in sparse coding algorithms
with applications to object recognition. CBLL-TR-2008-12-01, NYU. 155
Kindermann, R. (1980). Markov Random Fields and Their Applications (Contemporary Mathe-
matics ; V. 1). American Mathematical Society. 138
Kingma, D. and LeCun, Y. (2010). Regularized estimation of image statistics by score matching.
In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances
in Neural Information Processing Systems 23 , pages 1126–1134. 248
Kingma, D. P. and Welling, M. (2014). Auto-encoding variational bayes. In Proceedings of the
International Conference on Learning Representations (ICLR). 214, 215
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques.
MIT Press. 145, 147
Koren, Y. (2009). The BellKor solution to the Netflix Grand Prize. 125
Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images.
Technical report, University of Toronto. 134
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012a). ImageNet classification with deep convo-
lutional neural networks. In NIPS’2012. 7, 10, 15, 114
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012b). ImageNet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems 25
(NIPS’2012). 68
Lake, B., Salakhutdinov, R., and Tenenbaum, J. (2013). One-shot learning by inverting a
compositional causal process. In NIPS’2013 . 18
Lang, K. J. and Hinton, G. E. (1988). The development of the time-delay neural network archi-
tecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-Mellon University.
187
Larochelle, H. and Bengio, Y. (2008). Classification using discriminative restricted Boltzmann
machines. In ICML’2008 . 155
Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator. In
AISTATS’2011 . 200, 204
Larochelle, H., Erhan, D., and Bengio, Y. (2008). Zero-data learning of new tasks. In AAAI
Conference on Artificial Intelligence. 18
Le Roux, N., Manzagol, P.-A., and Bengio, Y. (2008). Topmoumoute online natural gradient
algorithm. In NIPS’07. 105
LeCun, Y. (1987). Modèles connexionistes de l'apprentissage. Ph.D. thesis, Université de Paris
VI. 14, 152
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel,
L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Compu-
tation, 1(4), 541–551. 15
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998a). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11), 2278–2324. 15
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998b). Gradient based learning applied to
document recognition. Proc. IEEE . 15
Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In
NIPS’07 . 155
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). Convolutional deep belief networks for
scalable unsupervised learning of hierarchical representations. In L. Bottou and M. Littman,
editors, ICML 2009 . ACM, Montreal, Canada. 268, 269
Linde, N. (1992). The machine that changed the world, episode 3. Documentary miniseries. 3
Long, P. M. and Servedio, R. A. (2010). Restricted Boltzmann machines are hard to approxi-
mately evaluate or simulate. In Proceedings of the 27th International Conference on Machine
Learning (ICML’10). 260
Lovelace, A. (1842). Notes upon L. F. Menabrea’s “Sketch of the Analytical Engine invented by
Charles Babbage”. 2
Luo, H., Carrier, P.-L., Courville, A., and Bengio, Y. (2013). Texture modeling with convolu-
tional spike-and-slab RBMs and deep extensions. In AISTATS’2013. 69
Lyu, S. (2009). Interpretation and generalization of score matching. In UAI’09. 247
Marlin, B., Swersky, K., Chen, B., and de Freitas, N. (2010). Inductive principles for restricted
Boltzmann machine learning. In Proceedings of The Thirteenth International Conference on
Artificial Intelligence and Statistics (AISTATS’10), volume 9, pages 509–516. 243, 247
Mase, S. (1995). Consistency of the maximum pseudo-likelihood estimator of continuous state
space Gibbsian processes. The Annals of Applied Probability, 5(3), pp. 603–612. 245
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall, London.
100
Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E., Muller,
X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra, J. (2011).
Unsupervised and transfer learning challenge: a deep learning approach. In JMLR W&CP:
Proc. Unsupervised and Transfer Learning, volume 7. 10, 18, 114, 166
Mesnil, G., Rifai, S., Dauphin, Y., Bengio, Y., and Vincent, P. (2012). Surfing on the manifold.
Learning Workshop, Snowbird. 226
Mikolov, T. (2012). Statistical Language Models based on Neural Networks. Ph.D. thesis, Brno
University of Technology. 103
Minka, T. (2005). Divergence measures and message passing. Technical Report MSR-TR-2005-173,
Microsoft Research, Cambridge, UK. 233
Minsky, M. L. and Papert, S. A. (1969). Perceptrons. MIT Press, Cambridge. 14
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York. 67
Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-
contrastive estimation. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Wein-
berger, editors, Advances in Neural Information Processing Systems 26 , pages 2265–2273.
Curran Associates, Inc. 250
Mor-Yosef, S., Samueloff, A., Modan, B., Navot, D., and Schenker, J. G. (1990). Ranking the
risk factors for cesarean: logistic regression analysis of a nationwide study. Obstet Gynecol,
75(6), 944–7. 3
Murphy, K. P. (2012). Machine Learning: a Probabilistic Perspective. MIT Press, Cambridge,
MA, USA. 101
Uria, B., Murray, I., and Larochelle, H. (2014). A deep and tractable density estimator. In
ICML’2014. 103, 204, 205
Nair, V. and Hinton, G. (2010a). Rectified linear units improve restricted Boltzmann machines.
In ICML’2010. 99
Nair, V. and Hinton, G. E. (2010b). Rectified linear units improve restricted Boltzmann ma-
chines. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh International
Conference on Machine Learning (ICML-10), pages 807–814. ACM. 15
Narayanan, H. and Mitter, S. (2010). Sample complexity of testing the manifold hypothesis. In
NIPS’2010 . 14, 91, 207
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics.
Springer. 127
Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2), 125–139.
235, 236
Neal, R. M. (2005). Estimating ratios of normalizing constants using linked importance sampling.
236, 237
Niranjan, M. and Fallside, F. (1990). Neural networks and radial basis functions in classifying
static speech patterns. Computer Speech and Language, 4, 275–289. 98
Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer. 65
Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field properties by
learning a sparse code for natural images. Nature, 381, 607–609. 154, 155
Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set: a
strategy employed by V1? Vision Research, 37, 3311–3325. 221
Park, H., Amari, S.-I., and Fukumizu, K. (2000). Adaptive natural gradient learning algorithms
for various stochastic models. Neural Networks, 13(7), 755–764. 105
Pascanu, R. and Bengio, Y. (2012). On the difficulty of training recurrent neural networks.
Technical Report arXiv:1211.5063, Université de Montréal. 103
Pascanu, R. and Bengio, Y. (2013). Revisiting natural gradient for deep networks. Technical
report, arXiv:1301.3584. 105
Pascanu, R., Mikolov, T., and Bengio, Y. (2013a). On the difficulty of training recurrent neural
networks. In ICML’2013 . 103
Pascanu, R., Montufar, G., and Bengio, Y. (2013b). On the number of inference regions of
deep feed forward networks with piece-wise linear activations. Technical report, U. Montreal,
arXiv:1312.6098. 114
Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. (2014). How to construct deep recurrent
neural networks. In ICLR’2014. 127
Pearl, J. (1985). Bayesian networks: A model of self-activated memory for evidential reasoning.
In Proceedings of the 7th Conference of the Cognitive Science Society, University of California,
Irvine, pages 329–334. 136
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufmann. 35
Petersen, K. B. and Pedersen, M. S. (2006). The matrix cookbook. Version 20051003. 20
Pinto, N., Cox, D. D., and DiCarlo, J. J. (2008). Why is real-world visual object recognition
hard? PLoS Comput Biol, 4. 269
Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1), 77–
105. 199
Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In
UAI’2011 , Barcelona, Spain. 114
Powell, M. (1987). Radial basis functions for multivariable interpolation: A review. 98
Rabiner, L. R. and Juang, B. H. (1986). An introduction to hidden Markov models. IEEE ASSP
Magazine, pages 257–285. 187
Raiko, T., Yao, L., Cho, K., and Bengio, Y. (2014). Iterative neural autoregressive distribution
estimator (NADE-k). Technical report, arXiv preprint arXiv:1406.1485. 204
Ramsey, F. P. (1926). Truth and probability. In R. B. Braithwaite, editor, The Foundations
of Mathematics and other Logical Essays, chapter 7, pages 156–198. McMaster University
Archive for the History of Economic Thought. 36
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse
representations with an energy-based model. In NIPS’2006 . 15, 121, 122, 164
Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. Y. (2013). Zero-shot learning through
cross-modal transfer. In 27th Annual Conference on Neural Information Processing Systems
(NIPS 2013). 18
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X. (2011).
Higher order contractive auto-encoder. In European Conference on Machine Learning and
Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). 155
Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling
contractive auto-encoders. In ICML’2012 . 226
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and
organization in the brain. Psychological Review, 65, 386–408. 14
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York. 14
Roweis, S. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embed-
ding. Science, 290(5500). 91, 211
Rumelhart, D., Hinton, G., and Williams, R. (1986a). Learning representations by back-
propagating errors. Nature, 323, 533–536. 14
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986b). Learning representations by
back-propagating errors. Nature, 323, 533–536. 95, 187
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986c). Parallel Dis-
tributed Processing: Explorations in the Microstructure of Cognition, volume 1. MIT Press,
Cambridge. 95
Salakhutdinov, R. and Hinton, G. (2009a). Deep Boltzmann machines. In Proceedings of the
International Conference on Artificial Intelligence and Statistics, volume 5, pages 448–455.
165, 263, 265
Salakhutdinov, R. and Hinton, G. (2009b). Deep Boltzmann machines. In Proceedings of the
Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS 2009),
volume 8. 266
Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks.
In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, volume 25, pages
872–879. ACM. 236
Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of history
compression. Neural Computation, 4(2), 234–242. 15
Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation, 10, 1299–1319. 91, 211
Schölkopf, B., Burges, C. J. C., and Smola, A. J. (1999). Advances in Kernel Methods: Support
Vector Learning. MIT Press, Cambridge, MA. 15, 98, 114
Schuster, M. and Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE Transactions
on Signal Processing, 45(11), 2673–2681. 197
Schölkopf, B. and Smola, A. (2002). Learning with kernels. MIT Press. 87
Seide, F., Li, G., and Yu, D. (2011). Conversational speech transcription using context-dependent
deep neural networks. In Interspeech 2011 , pages 437–440. 15
Sermanet, P., Kavukcuoglu, K., Chintala, S., and LeCun, Y. (2013). Pedestrian detection with
unsupervised multi-stage feature learning. In Proc. International Conference on Computer
Vision and Pattern Recognition (CVPR’13). IEEE. 16, 114
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. (2014). Overfeat:
Integrated recognition, localization and detection using convolutional networks. International
Conference on Learning Representations. 68
Shilov, G. (1977). Linear Algebra. Dover Books on Mathematics Series. Dover Publications. 20
Siegelmann, H. (1995). Computation beyond the Turing limit. Science, 268(5210), 545–548.
189
Siegelmann, H. and Sontag, E. (1991). Turing computability with neural nets. Applied Mathe-
matics Letters, 4(6), 77–80. 189
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony
theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing,
volume 1, chapter 6, pages 194–281. MIT Press, Cambridge. 141, 148
Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011a). Dynamic
pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS’2011. 200
Socher, R., Manning, C., and Ng, A. Y. (2011b). Parsing natural scenes and natural language
with recursive neural networks. In Proceedings of the Twenty-Eighth International Conference
on Machine Learning (ICML’2011). 200
Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D. (2011c). Semi-
supervised recursive autoencoders for predicting sentiment distributions. In EMNLP’2011.
200
Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C.
(2013). Recursive deep models for semantic compositionality over a sentiment treebank. In
EMNLP’2013 . 200
Solla, S. A., Levin, E., and Fleisher, M. (1988). Accelerated learning in layered neural networks.
Complex Systems, 2, 625–639. 101
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout:
A simple way to prevent neural networks from overfitting. Journal of Machine Learning
Research, 15, 1929–1958. 125, 127, 265
Sutskever, I. and Tieleman, T. (2010). On the Convergence Properties of Contrastive Divergence.
In Y. W. Teh and M. Titterington, editors, Proc. of the International Conference on Artificial
Intelligence and Statistics (AISTATS), volume 9, pages 789–795. 242
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural
networks. Technical report, arXiv preprint arXiv:1409.3215. 10
Swersky, K., Ranzato, M., Buchman, D., Marlin, B., and de Freitas, N. (2011). On autoencoders
and score matching for energy based models. In ICML’2011. ACM. 248
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V.,
and Rabinovich, A. (2014). Going deeper with convolutions. Technical report, arXiv preprint
arXiv:1409.4842. 10
Tenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global geometric framework for
nonlinear dimensionality reduction. Science, 290(5500), 2319–2323. 91, 211
Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the
likelihood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008 ,
pages 1064–1071. ACM. 243
Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal components analysis. Journal
of the Royal Statistical Society B, 61(3), 611–622. 159, 160
Uria, B., Murray, I., and Larochelle, H. (2013). RNADE: The real-valued neural autoregressive
density-estimator. In NIPS’2013 . 203, 204
Utgoff, P. E. and Stracuzzi, D. J. (2002). Many-layered learning. Neural Computation, 14,
2497–2539. 15
van der Maaten, L. and Hinton, G. E. (2008). Visualizing data using t-SNE. J. Machine Learning
Res., 9. 211, 215
Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer-Verlag,
Berlin. 75, 76
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York. 75, 76
Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies
of events to their probabilities. Theory of Probability and Its Applications, 16, 264–280. 75,
76
Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural
Computation, 23(7), 1661–1674. 227, 248
Vincent, P. and Bengio, Y. (2003). Manifold Parzen windows. In NIPS’2002 . MIT Press. 212
Wager, S., Wang, S., and Liang, P. (2013). Dropout training as adaptive regularization. In
Advances in Neural Information Processing Systems 26 , pages 351–359. 127
Waibel, A., Hanazawa, T., Hinton, G. E., Shikano, K., and Lang, K. (1989). Phoneme recogni-
tion using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal
Processing, 37, 328–339. 187
Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of neural
networks using dropconnect. In ICML’2013 . 128
Wang, S. and Manning, C. (2013). Fast dropout training. In ICML’2013. 128
Warde-Farley, D., Goodfellow, I. J., Courville, A., and Bengio, Y. (2014). An empirical analysis
of dropout in piecewise linear networks. In ICLR’2014. 128
Weinberger, K. Q. and Saul, L. K. (2004). Unsupervised learning of image manifolds by semidef-
inite programming. In CVPR’2004, pages 988–995. 91, 211
Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding.
In ICML 2008 . 19
Weston, J., Bengio, S., and Usunier, N. (2010). Large scale image annotation: learning to rank
with joint word-image embeddings. Machine Learning, 81(1), 21–35. 200
White, H. (1990). Connectionist nonparametric regression: Multilayer feedforward networks can
learn arbitrary mappings. Neural Networks, 3(5), 535–549. 112
Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. In 1960 IRE WESCON
Convention Record, volume 4, pages 96–104. IRE, New York. 14
Williams, C. K. I. and Rasmussen, C. E. (1996). Gaussian processes for regression. In NIPS’95,
pages 514–520. MIT Press, Cambridge, MA. 114
Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms. Neural
Computation, 8(7), 1341–1390. 113
Xiong, H. Y., Barash, Y., and Frey, B. J. (2011). Bayesian prediction of tissue-regulated splicing
using RNA sequence and cellular context. Bioinformatics, 27(18), 2554–2562. 127
Younes, L. (1998). On the convergence of Markovian stochastic algorithms with rapidly decreas-
ing ergodicity rates. In Stochastics and Stochastics Models, pages 177–228. 243
Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In
ECCV’14 . 6, 68
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal
of the Royal Statistical Society, Series B, 67(2), 301–320. 105
Index
L^p norm, 26
Active constraint, 65
AIS, see annealed importance sampling
Almost everywhere, 51
Ancestral sampling, 145
Annealed importance sampling, 233, 264
Artificial intelligence, 2
Asymptotically unbiased, 79
Bagging, 123
Bayes’ rule, 50
Bayesian network, see directed graphical model
Bayesian probability, 36
Belief network, see directed graphical model
Bernoulli distribution, 43
Boltzmann distribution, 140
Boltzmann machine, 140
Calculus of variations, 257
CD, see contrastive divergence
Centering trick (DBM), 266
Central limit theorem, 44
Chain rule of probability, 39
Chess, 2
Classification, 68
Clique potential, see factor (graphical model)
CNN, see convolutional neural network
Collider, see explaining away
Computer vision, 276
Conditional computation, see dynamically structured nets, 271
Conditional independence, 39
Conditional probability, 38
Constrained optimization, 64
Context-specific independence, 143
Contrast, 277
Contrastive divergence, 239, 264, 265
Convolution, 169, 268
Convolutional neural network, 169
Correlation, 40
Cost function, see objective function
Covariance, 40
Covariance matrix, 40
Curse of dimensionality, 91
Cyc, 2
D-separation, 143
Dataset augmentation, 277, 282
DBM, see deep Boltzmann machine
Deep belief network, 251, 261, 262, 269
Deep Blue, 2
Deep Boltzmann machine, 251, 261, 263, 265, 269
Deep learning, 2, 5
Denoising score matching, 248
Density estimation, 68
Derivative, 56
Detector layer, 174
Dirac delta function, 46
Directed graphical model, 136
Directional derivative, 58
Domain adaptation, 165
Dot product, 22
Doubly block circulant matrix, 171
Dream sleep, 239, 259
DropConnect, 128
Dropout, 125, 265
Dynamically structured networks, 271
E-step, 254
Early stopping, 105, 120, 129
EBM, see energy-based model
Eigendecomposition, 28
Eigenvalue, 28
Eigenvector, 28
ELBO, see evidence lower bound
Element-wise product, see Hadamard product
EM, see expectation maximization
Empirical distribution, 46
Energy function, 140
Energy-based model, 140, 263
Ensemble methods, 123
Equality constraint, 64
Equivariance, 172
Error function, see objective function
Euclidean norm, 26
Euler-Lagrange equation, 257
Evidence lower bound, 251, 253–255, 263
Expectation, 40
Expectation maximization, 253
Expected value, see expectation
Explaining away, 144
Factor (graphical model), 138
Factor graph, 143
Factors of variation, 5
Frequentist probability, 36
Functional derivatives, 257
Gaussian distribution, see Normal distribution, 44
Gaussian mixture, 47
GCN, see Global contrast normalization
Gibbs distribution, 139
Gibbs sampling, 147
Global contrast normalization, 278
Global minimum, 13
Gradient, 58
Gradient descent, 58
Graphical model, see structured probabilistic model
Hadamard product, 22
Harmonium, see Restricted Boltzmann machine, 148
Harmony theory, 141
Helmholtz free energy, see evidence lower bound
Hessian matrix, 60
Identity matrix, 24
Independence, 39
Inequality constraint, 64
Inference, 133, 251, 253–256, 258, 259
Invariance, 177
Jacobian matrix, 60
Joint probability, 37
Karush-Kuhn-Tucker conditions, 65
Kernel (convolution), 170
KKT conditions, see Karush-Kuhn-Tucker conditions
KL divergence, see Kullback-Leibler divergence, 41
Kullback-Leibler divergence, 41
Lagrange function, see Lagrangian
Lagrange multipliers, 64, 258
Lagrangian, 64
Learner, 3
Line search, 58
Linear combination, 25
Linear dependence, 26
Local conditional probability distribution, 136
Local minimum, 13
Logistic regression, 3
Logistic sigmoid, 47
Loss function, see objective function
M-step, 254
Machine learning, 3
Manifold hypothesis, 91, 207
Manifold learning, 90, 207
MAP inference, 255
Marginal probability, 38
Markov chain, 145
Markov network, see undirected model, 138
Markov random field, see undirected model, 138
Matrix, 21
Matrix inverse, 24
Matrix product, 22
Max pooling, 177
Mean field, 264, 265
Measure theory, 50
Measure zero, 50
Method of steepest descent, see gradient descent
Missing inputs, 68
Mixing (Markov chain), 150
Mixture distribution, 47
MNIST, 265
Model averaging, 123
MP-DBM, see multi-prediction DBM
Multi-prediction DBM, 264, 266
Multinomial distribution, 43
Multinoulli distribution, 43
Naive Bayes, 51
Nat, 41
Negative definite, 60
Negative phase, 238
Netflix Grand Prize, 125
Noise-contrastive estimation, 248
Norm, 26
Normal distribution, 44, 46
Normalized probability distribution, 139
Object detection, 276
Object recognition, 276
Objective function, 13, 55
Orthogonality, 28
Overfitting, 75
Parameter sharing, 172
Partial derivative, 58
Partition function, 94, 139, 231, 264
PCA, see principal components analysis
PCD, see stochastic maximum likelihood
Persistent contrastive divergence, see stochastic maximum likelihood
Pooling, 169, 268
Positive definite, 60
Positive phase, 238
Precision (of a normal distribution), 44, 46
Predictive sparse decomposition, 155, 220
Preprocessing, 277
Principal components analysis, 31, 251, 279
Probabilistic max pooling, 268
Probability density function, 37
Probability distribution, 36
Probability mass function, 36
Product rule of probability, see chain rule of probability
PSD, see predictive sparse decomposition
Pseudolikelihood, 244
Random variable, 36
Ratio matching, 247
RBM, see restricted Boltzmann machine
Receptive field, 173
Regression, 68
Representation learning, 3
Restricted Boltzmann machine, 148, 164, 251, 260, 261, 265, 266, 268
Scalar, 20
Score matching, 246
Second derivative, 60
Second derivative test, 60
Self-information, 41
Separable convolution, 186
Separation (probabilistic modeling), 141
Shannon entropy, 41, 257
Sigmoid, see logistic sigmoid
SML, see stochastic maximum likelihood
Softmax, 101
Softplus, 47
Spam detection, 3
Sparse coding, 163, 251
Sphering, see Whitening, 279
Square matrix, 26
Standard deviation, 40
Statistic, 79
Steepest descent, see gradient descent
Stochastic gradient descent, 265
Stochastic maximum likelihood, 243, 264, 265
Stochastic pooling, 128
Structure learning, 147
Structured output, 68
Structured probabilistic model, 132
Sum rule of probability, 38
Surrogate loss function, 129
Symmetric matrix, 28
Tangent plane, 210
Tensor, 21
Test example, 13
Tiled convolution, 182
Toeplitz matrix, 171
Trace operator, 30
Training criterion, 13
Transcription, 68
Transfer learning, 165
Transpose, 21
Unbiased, 79
Underfitting, 75
Undirected model, 138
Uniform distribution, 37
Unit norm, 28
Unnormalized probability distribution, 138
V-structure, see explaining away
Variance, 40
Variational derivatives, see functional derivatives
Variational free energy, see evidence lower bound
Vector, 20
Whitening, 279
ZCA, see zero-phase components analysis
Zero-phase components analysis, 279