Bibliography
Alain, G. and Bengio, Y. (2012). What regularized auto-encoders learn from the data generating
distribution. Technical Report arXiv:1211.4246, Université de Montréal. 302
Alain, G. and Bengio, Y. (2013). What regularized auto-encoders learn from the data generating
distribution. In ICLR’2013. Also arXiv:1211.4246. 286, 302, 304
Alain, G., Bengio, Y., Yao, L., Thibodeau-Laufer, É., Yosinski, J., and Vincent, P. (2015).
GSNs: Generative stochastic networks. arXiv:1503.05571. 288
Amari, S. (1997). Neural learning in structured parameter spaces - natural Riemannian gradient.
In Advances in Neural Information Processing Systems, pages 127–133. MIT Press. 116
Anderson, E. (1935). The Irises of the Gaspé Peninsula. Bulletin of the American Iris Society,
59, 2–5. 14
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning
to align and translate. Technical report, arXiv:1409.0473. 251
Bahl, L. R., Brown, P., de Souza, P. V., and Mercer, R. L. (1987). Speech recognition with
continuous-parameter hidden Markov models. Computer, Speech and Language, 2, 219–234.
48, 229
Baldi, P. and Brunak, S. (1998). Bioinformatics, the Machine Learning Approach. MIT Press.
231
Baldi, P. and Sadowski, P. J. (2013). Understanding dropout. In Advances in Neural Information
Processing Systems 26 , pages 2814–2822. 153
Baldi, P., Brunak, S., Frasconi, P., Soda, G., and Pollastri, G. (1999). Exploiting the past and
the future in protein secondary structure prediction. Bioinformatics, 15(11), 937–946. 202
Barron, A. E. (1993). Universal approximation bounds for superpositions of a sigmoidal function.
IEEE Trans. on Information Theory, 39, 930–945. 126
Bartholomew, D. J. (1987). Latent variable models and factor analysis. Oxford University Press.
290
Basilevsky, A. (1994). Statistical Factor Analysis and Related Methods: Theory and Applications.
Wiley. 290
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard,
N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning
and Unsupervised Feature Learning NIPS 2012 Workshop. 57
Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state
Markov chains. Ann. Math. Stat., 37, 1559–1563. 227
Baxter, J. (1995). Learning internal representations. In Proceedings of the 8th International
Conference on Computational Learning Theory (COLT’95), pages 311–320, Santa Cruz, Cal-
ifornia. ACM Press. 154
Becker, S. and Hinton, G. (1992). A self-organizing neural network that discovers surfaces in
random-dot stereograms. Nature, 355, 161–163. 335
Belkin, M. and Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for embedding
and clustering. In NIPS’01, Cambridge, MA. MIT Press. 322
Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Computation, 15(6), 1373–1396. 98, 340
Bengio, S. and Bengio, Y. (2000a). Taking on the curse of dimensionality in joint distributions
using neural networks. IEEE Transactions on Neural Networks, special issue on Data Mining
and Knowledge Discovery, 11(3), 550–557. 207
Bengio, Y. (1991). Artificial Neural Networks and their Application to Sequence Recognition.
Ph.D. thesis, McGill University, (Computer Science), Montreal, Canada. 212, 231
Bengio, Y. (1993). A connectionist approach to speech recognition. International Journal on
Pattern Recognition and Artificial Intelligence, 7(4), 647–668. 229
Bengio, Y. (1999a). Markovian models for sequential data. Neural Computing Surveys, 2,
129–162. 229
Bengio, Y. (1999b). Markovian models for sequential data. Neural Computing Surveys, 2,
129–162. 231
Bengio, Y. (2009). Learning deep architectures for AI . Now Publishers. 95, 128
Bengio, Y. (2013). Estimating or propagating gradients through stochastic neurons. Technical
Report arXiv:1305.2982, Université de Montréal. 275
Bengio, Y. and Bengio, S. (2000b). Modeling high-dimensional discrete data with multi-layer
neural networks. In NIPS’99, pages 400–406. MIT Press. 207, 209, 210
Bengio, Y. and Delalleau, O. (2009). Justifying and generalizing contrastive divergence. Neural
Computation, 21(6), 1601–1621. 302, 363, 388
Bengio, Y. and Frasconi, P. (1996). Input/Output HMMs for sequence processing. IEEE Trans-
actions on Neural Networks, 7(5), 1231–1249. 231
Bengio, Y. and LeCun, Y. (2007a). Scaling learning algorithms towards AI. In Large Scale
Kernel Machines. 95
Bengio, Y. and LeCun, Y. (2007b). Scaling learning algorithms towards AI. In L. Bottou,
O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press.
129
Bengio, Y. and Monperrus, M. (2005). Non-local manifold tangent learning. In NIPS’04 , pages
129–136. MIT Press. 97, 341
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1991). Phonetically motivated acoustic
parameters for continuous speech recognition using artificial neural networks. In Proceedings
of EuroSpeech’91. 17
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992). Global optimization of a neural
network-hidden Markov model hybrid. IEEE Transactions on Neural Networks, 3(2), 252–259.
229, 231
Bengio, Y., Frasconi, P., and Simard, P. (1993). The problem of learning long-term dependencies
in recurrent networks. In IEEE International Conference on Neural Networks, pages 1183–
1195, San Francisco. IEEE Press. (invited paper). 163, 218
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient
descent is difficult. IEEE Transactions on Neural Networks. 163, 164, 210, 216, 218, 219
Bengio, Y., LeCun, Y., Nohl, C., and Burges, C. (1995). Lerec: A NN/HMM hybrid for on-line
handwriting recognition. Neural Computation, 7(6), 1289–1303. 231
Bengio, Y., Ducharme, R., and Vincent, P. (2001a). A neural probabilistic language model. In
NIPS’00 , pages 932–938. MIT Press. 16
Bengio, Y., Ducharme, R., and Vincent, P. (2001b). A neural probabilistic language model. In
NIPS’2000 , pages 932–938. 248, 249
Bengio, Y., Ducharme, R., and Vincent, P. (2001c). A neural probabilistic language model. In
T. K. Leen, T. G. Dietterich, and V. Tresp, editors, NIPS’2000 , pages 932–938. MIT Press.
343, 344
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003a). A neural probabilistic language
model. JMLR, 3, 1137–1155. 248
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003b). A neural probabilistic language
model. Journal of Machine Learning Research, 3, 1137–1155. 343, 344
Bengio, Y., Delalleau, O., and Le Roux, N. (2006a). The curse of highly variable functions for
local kernel machines. In NIPS’2005. 94
Bengio, Y., Larochelle, H., and Vincent, P. (2006b). Non-local manifold Parzen windows. In
NIPS’2005 . MIT Press. 97, 340
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of
deep networks. In NIPS’2006. 16, 308, 311
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In
ICML’09 . 117
Bengio, Y., Léonard, N., and Courville, A. (2013a). Estimating or propagating gradients through
stochastic neurons for conditional computation. arXiv:1308.3432. 275
Bengio, Y., Yao, L., Alain, G., and Vincent, P. (2013b). Generalized denoising auto-encoders as
generative models. In NIPS’2013 . 304, 405, 408
Bengio, Y., Courville, A., and Vincent, P. (2013c). Representation learning: A review and
new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 35(8),
1798–1828. 333, 403
Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014a). Deep generative stochastic
networks trainable by backprop. Technical Report arXiv:1306.1091. 275
Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014b). Deep generative stochastic
networks trainable by backprop. In Proceedings of the 30th International Conference on
Machine Learning (ICML’14). 275, 405, 407, 409, 410
Bennett, C. (1976). Efficient estimation of free energy differences from Monte Carlo data. Journal
of Computational Physics, 22(2), 245–268. 357
Berglund, M. and Raiko, T. (2013). Stochastic gradient estimate variance in contrastive diver-
gence and persistent contrastive divergence. CoRR, abs/1312.6002. 364
Bergstra, J. (2011). Incorporating Complex Cells into Neural Networks for Pattern Classification.
Ph.D. thesis, Université de Montréal. 285
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J.,
Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler.
In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.
57
Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24(3), 179–195. 366
Bishop, C. M. (1994). Mixture density networks. 113
Bishop, C. M. (1995). Regularization and complexity control in feed-forward networks. In
Proceedings International Conference on Artificial Neural Networks ICANN’95 , volume 1,
pages 141–148. 149
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1989). Learnability and the
Vapnik–Chervonenkis dimension. Journal of the ACM, 36(4), 929–965. 78, 79
Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2012). Joint learning of words and meaning
representations for open-text semantic parsing. AISTATS’2012 . 205
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal margin
classifiers. In COLT ’92: Proceedings of the fifth annual workshop on Computational learning
theory, pages 144–152, New York, NY, USA. ACM. 13, 95, 109
Bottou, L. (1991). Une approche théorique de l’apprentissage connexionniste; applications à la
reconnaissance de la parole. Ph.D. thesis, Université de Paris XI. 231
Bottou, L. (2011). From machine learning to machine reasoning. Technical report,
arXiv.1102.1808. 204, 205
Bottou, L., Fogelman-Soulié, F., Blanchet, P., and Lienard, J. S. (1990). Speaker independent
isolated digit recognition: multilayer perceptrons vs dynamic time warping. Neural Networks,
3, 453–465. 231
Bottou, L., Bengio, Y., and LeCun, Y. (1997). Global training of document processing systems
using graph transformer networks. In Proceedings of the Computer Vision and Pattern Recog-
nition Conference (CVPR’97), pages 490–494, Puerto Rico. IEEE. 223, 230, 231, 232, 233,
235
Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer perceptrons and singular
value decomposition. Biological Cybernetics, 59, 291–294. 282
Bourlard, H. and Morgan, N. (1993). Connectionist Speech Recognition. A Hybrid Approach,
volume 247 of The Kluwer international series in engineering and computer science. Kluwer
Academic Publishers, Boston. 231
Bourlard, H. and Wellekens, C. (1990). Links between hidden Markov models and multilayer
perceptrons. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 1167–
1178. 231
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press, New
York, NY, USA. 65
Brady, M. L., Raghavan, R., and Slawny, J. (1989). Back-propagation fails to separate where
perceptrons succeed. IEEE Transactions on Circuits and Systems, 36, 665–674. 158
Brand, M. (2003). Charting a manifold. In NIPS’2002 , pages 961–968. MIT Press. 98, 340
Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140. 142
Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by
the author). Statistical Science, 16(3), 199–231. 5
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regres-
sion Trees. Wadsworth International Group, Belmont, CA. 95
Brown, P. (1987). The Acoustic-Modeling problem in Automatic Speech Recognition. Ph.D.
thesis, Dept. of Computer Science, Carnegie-Mellon University. 229
Brown, P. F., Pietra, V. J. D., DeSouza, P. V., Lai, J. C., and Mercer, R. L. (1992). Class-based
n-gram models of natural language. Computational Linguistics, 18, 467–479. 250
Buciluǎ, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model compression. In Proceedings of
the 12th ACM SIGKDD international conference on Knowledge discovery and data mining,
pages 535–541. ACM. 236
Carreira-Perpiñán, M. A. and Hinton, G. E. (2005). On contrastive divergence learning. In
R. G. Cowell and Z. Ghahramani, editors, AISTATS’2005 , pages 33–40. Society for Artificial
Intelligence and Statistics. 361, 388
Caruana, R. (1993). Multitask connectionist learning. In Proc. 1993 Connectionist Models
Summer School, pages 372–379. 154
Cauchy, A. (1847). Méthode générale pour la résolution de systèmes d’équations simultanées.
In Compte rendu des séances de l’Académie des Sciences, pages 536–538. 58
Cayton, L. (2005). Algorithms for manifold learning. Technical Report CS2008-0923, UCSD.
98, 336
Chapelle, O., Weston, J., and Schölkopf, B. (2003). Cluster kernels for semi-supervised learning.
In NIPS’02, pages 585–592, Cambridge, MA. MIT Press. 322
Chapelle, O., Schölkopf, B., and Zien, A., editors (2006). Semi-Supervised Learning. MIT Press,
Cambridge, MA. 322
Chellapilla, K., Puri, S., and Simard, P. (2006). High Performance Convolutional Neural Net-
works for Document Processing. In Guy Lorette, editor, Tenth International Workshop on
Frontiers in Handwriting Recognition, La Baule (France). Université de Rennes 1, Suvisoft.
http://www.suvisoft.com. 15, 17
Chen, S. F. and Goodman, J. T. (1999). An empirical study of smoothing techniques for language
modeling. Computer, Speech and Language, 13(4), 359–393. 222, 223
Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and
Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical
machine translation. In Proceedings of Empirical Methods in Natural Language Processing
(EMNLP 2014). 216
Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2014). The loss
surface of multilayer networks. arXiv:1412.0233. 311
Ciresan, D., Meier, U., Masci, J., and Schmidhuber, J. (2012). Multi-column deep neural network
for traffic sign classification. Neural Networks, 32, 333–338. 128
Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2010). Deep big simple
neural nets for handwritten digit recognition. Neural Computation, 22, 1–14. 15, 17
Coates, A. and Ng, A. Y. (2011). The importance of encoding versus training with sparse coding
and vector quantization. In ICML’2011. 17
Coates, A., Lee, H., and Ng, A. Y. (2011). An analysis of single-layer networks in unsuper-
vised feature learning. In Proceedings of the Thirteenth International Conference on Artificial
Intelligence and Statistics (AISTATS 2011). 243
Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., and Ng, A. (2013). Deep learning
with COTS HPC systems. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th In-
ternational Conference on Machine Learning (ICML-13), volume 28, pages 1337–1345. JMLR
Workshop and Conference Proceedings. 15, 17
Collobert, R. (2004). Large Scale Machine Learning. Ph.D. thesis, Université de Paris VI, LIP6.
109
Comon, P. (1994). Independent component analysis - a new concept? Signal Processing, 36,
287–314. 291, 292
Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273–297.
13, 95
Couprie, C., Farabet, C., Najman, L., and LeCun, Y. (2013). Indoor semantic segmentation using
depth information. In International Conference on Learning Representations (ICLR2013). 128
Courville, A., Bergstra, J., and Bengio, Y. (2011). Unsupervised models of images by spike-and-
slab RBMs. In ICML’11. 258, 401
Courville, A., Desjardins, G., Bergstra, J., and Bengio, Y. (2014). The spike-and-slab RBM and
extensions to discrete and sparse data distributions. Pattern Analysis and Machine Intelli-
gence, IEEE Transactions on, 36(9), 1874–1887. 401
Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd Edition. Wiley-
Interscience. 42
Cox, R. T. (1946). Probability, frequency and reasonable expectation. American Journal of
Physics, 14, 1–10. 36
Crick, F. H. C. and Mitchison, G. (1983). The function of dream sleep. Nature, 304, 111–114.
360
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of
Control, Signals, and Systems, 2, 303–314. 331
Dauphin, Y. and Bengio, Y. (2013). Stochastic ratio matching of RBMs for sparse high-
dimensional inputs. In NIPS26. NIPS Foundation. 369
Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying
and attacking the saddle point problem in high-dimensional non-convex optimization. In
NIPS’2014 . 61, 311
Davis, A., Rubinstein, M., Wadhwa, N., Mysore, G., Durand, F., and Freeman, W. T. (2014).
The visual microphone: Passive recovery of sound from video. ACM Transactions on Graphics
(Proc. SIGGRAPH), 33(4), 79:1–79:10. 241
de Finetti, B. (1937). La prévision: ses lois logiques, ses sources subjectives. Annales de l’Institut
Henri Poincaré, 7, 1–68. 36
Delalleau, O. and Bengio, Y. (2011). Shallow vs. deep sum-product networks. In NIPS . 127,
331, 332
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A Large-Scale
Hierarchical Image Database. In CVPR09. 14
Deng, J., Berg, A. C., Li, K., and Fei-Fei, L. (2010). What does classifying more than 10,000
image categories tell us? In Proceedings of the 11th European Conference on Computer Vision:
Part V , ECCV’10, pages 71–84, Berlin, Heidelberg. Springer-Verlag. 14
Deng, J., Ding, N., Jia, Y., Frome, A., Murphy, K., Bengio, S., Li, Y., Neven, H., and Adam,
H. (2014). Large-scale object classification using label relation graphs. In ECCV’2014 , pages
48–64. 223
Desjardins, G. and Bengio, Y. (2008). Empirical evaluation of convolutional RBMs for vision.
Technical Report 1327, Département d’Informatique et de Recherche Opérationnelle, Univer-
sité de Montréal. 402
Desjardins, G., Courville, A., and Bengio, Y. (2011). On tracking the partition function. In
NIPS’2011 . 358
Do, T.-M.-T. and Artières, T. (2010). Neural conditional random fields. In International Con-
ference on Artificial Intelligence and Statistics, pages 177–184. 223
Donoho, D. L. and Grimes, C. (2003). Hessian eigenmaps: new locally linear embedding tech-
niques for high-dimensional data. Technical Report 2003-08, Dept. Statistics, Stanford Uni-
versity. 98, 340
Doob, J. (1953). Stochastic processes. Wiley: New York. 36
Doya, K. (1993). Bifurcations of recurrent neural networks in gradient descent learning. IEEE
Transactions on Neural Networks, 1, 75–80. 164, 210
Dugas, C., Bengio, Y., Bélisle, F., and Nadeau, C. (2001). Incorporating second-order functional
knowledge for better option pricing. In NIPS’00, pages 472–478. MIT Press. 109
Ebrahimi, S., Pal, C., Bouthillier, X., Froumenty, P., Jean, S., Konda, K. R., Vincent, P.,
Courville, A., and Bengio, Y. (2013). Combining modality specific deep neural network mod-
els for emotion recognition in video. In Emotion Recognition In The Wild Challenge and
Workshop (Emotiw2013). 128
El Hihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term depen-
dencies. In NIPS 8 . MIT Press. 217, 221, 222
El Hihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term depen-
dencies. In NIPS’1995. 213
Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., and Bengio, S. (2010). Why
does unsupervised pre-training help deep learning? J. Machine Learning Res. 309, 311, 312,
313
Farabet, C., LeCun, Y., Kavukcuoglu, K., Culurciello, E., Martini, B., Akselrod, P., and Talay,
S. (2011). Large-scale FPGA-based convolutional networks. In R. Bekkerman, M. Bilenko,
and J. Langford, editors, Scaling up Machine Learning: Parallel and Distributed Approaches.
Cambridge University Press. 298
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013a). Learning hierarchical features for
scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence. 128
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013b). Learning hierarchical features
for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8),
1915–1929. 223
Fei-Fei, L., Fergus, R., and Perona, P. (2006). One-shot learning of object categories. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611. 319
Fischer, A. and Igel, C. (2011). Bounding the bias of contrastive divergence learning. Neural
Computation, 23(3), 664–73. 388
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of
Eugenics, 7, 179–188. 14, 74
Frasconi, P., Gori, M., and Sperduti, A. (1997). On the efficient classification of data structures
by neural networks. In Proc. Int. Joint Conf. on Artificial Intelligence. 204, 205
Frasconi, P., Gori, M., and Sperduti, A. (1998). A general framework for adaptive processing of
data structures. IEEE Transactions on Neural Networks, 9(5), 768–786. 205
Frey, B. J. (1998). Graphical models for machine learning and digital communication. MIT
Press. 206
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism
of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202. 15,
16, 17
Garson, J. (1900). The metric system of identification of criminals, as used in Great Britain
and Ireland. The Journal of the Anthropological Institute of Great Britain and Ireland, (2),
177–227. 14
Girosi, F. (1994). Regularization theory, radial basis functions and networks. In V. Cherkassky,
J. Friedman, and H. Wechsler, editors, From Statistics to Neural Networks, volume 136 of
NATO ASI Series, pages 166–187. Springer Berlin Heidelberg. 126
Glorot, X., Bordes, A., and Bengio, Y. (2011a). Deep sparse rectifier neural networks. In
AISTATS’2011 . 109, 297
Glorot, X., Bordes, A., and Bengio, Y. (2011b). Deep sparse rectifier neural networks. In JMLR
W&CP: Proceedings of the Fourteenth International Conference on Artificial Intelligence and
Statistics (AISTATS 2011). 130, 297
Glorot, X., Bordes, A., and Bengio, Y. (2011c). Domain adaptation for large-scale sentiment
classification: A deep learning approach. In ICML’2011. 297, 316
Gong, S., McKenna, S., and Psarrou, A. (2000). Dynamic Vision: From Images to Face Recog-
nition. Imperial College Press. 339, 342
Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep networks. In
NIPS’2009 , pages 646–654. 285, 297
Goodfellow, I., Koenig, N., Muja, M., Pantofaru, C., Sorokin, A., and Takayama, L. (2010).
Help me help you: Interfaces for personal robots. In Proc. of Human Robot Interaction
(HRI), Osaka, Japan. ACM Press, ACM Press. 71
Goodfellow, I., Courville, A., and Bengio, Y. (2012). Large-scale feature learning with spike-
and-slab sparse coding. In ICML’2012 . 293
Goodfellow, I. J. (2010). Technical report: Multidimensional, downsampled convolution for
autoencoders. Technical report, Université de Montréal. 187
Goodfellow, I. J., Courville, A., and Bengio, Y. (2011). Spike-and-slab sparse coding for unsu-
pervised feature discovery. In NIPS Workshop on Challenges in Learning Hierarchical Models.
128, 317
Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a). Maxout
networks. In S. Dasgupta and D. McAllester, editors, ICML’13 , pages 1319–1327. 130, 152,
243
Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction deep
Boltzmann machines. In NIPS26. NIPS Foundation. 367, 398, 399
Goodfellow, I. J., Courville, A., and Bengio, Y. (2013c). Scaling up spike-and-slab models for
unsupervised feature learning. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 35(8), 1902–1914. 401
Gori, M. and Tesi, A. (1992). On the problem of local minima in backpropagation. IEEE
Transactions on Pattern Analysis and Machine Intelligence, PAMI-14(1), 76–86. 158
Gosset, W. S. (1908). The probable error of a mean. Biometrika, 6(1), 1–25. Originally published
under the pseudonym “Student”. 14
Gouws, S., Bengio, Y., and Corrado, G. (2014). BilBOWA: Fast bilingual distributed representa-
tions without word alignments. Technical report, arXiv:1410.2455. 320
Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks. Studies in
Computational Intelligence. Springer. 202, 215, 216, 223
Graves, A. (2013). Generating sequences with recurrent neural networks. Technical report,
arXiv:1308.0850. 114, 215, 217
Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional
LSTM and other neural network architectures. Neural Networks, 18(5), 602–610. 202
Graves, A. and Schmidhuber, J. (2009). Offline handwriting recognition with multidimensional
recurrent neural networks. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors,
NIPS’2008 , pages 545–552. 202
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist tempo-
ral classification: Labelling unsegmented sequence data with recurrent neural networks. In
ICML’2006 , pages 369–376, Pittsburgh, USA. 223
Graves, A., Liwicki, M., Bunke, H., Schmidhuber, J., and Fernández, S. (2008). Unconstrained
on-line handwriting recognition with recurrent neural networks. In J. Platt, D. Koller,
Y. Singer, and S. Roweis, editors, NIPS’2007 , pages 577–584. 202
Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep recurrent
neural networks. In ICASSP’2013, pages 6645–6649. 202, 215, 216
Gutmann, M. and Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation prin-
ciple for unnormalized statistical models. In Proceedings of The Thirteenth International
Conference on Artificial Intelligence and Statistics (AISTATS’10). 370
Haffner, P., Franzini, M., and Waibel, A. (1991). Integrating time alignment and neural networks
for high performance continuous speech recognition. In International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 105–108, Toronto. 231
Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of
the 18th annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley, California.
ACM Press. 127, 332
Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Compu-
tational Complexity, 1, 113–129. 127, 332
Henaff, M., Jarrett, K., Kavukcuoglu, K., and LeCun, Y. (2011). Unsupervised learning of sparse
features for scalable audio classification. In ISMIR’11 . 298
Herault, J. and Ans, B. (1984). Circuits neuronaux à synapses modifiables: décodage de mes-
sages composites par apprentissage non supervisé. Comptes Rendus de l’Académie des Sci-
ences, 299(III-13), 525–528. 291
Hinton, G. E. (2000). Training products of experts by minimizing contrastive divergence. Tech-
nical Report GCNU TR 2000-004, Gatsby Unit, University College London. 361
Hinton, G. E. and Roweis, S. (2003). Stochastic neighbor embedding. In NIPS’2002 . 340
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural
networks. Science, 313(5786), 504–507. 287, 308, 309
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the Dimensionality of Data with Neural
Networks. Science, 313, 504–507. 311
Hinton, G. E. and Zemel, R. S. (1994). Autoencoders, minimum description length, and
Helmholtz free energy. In NIPS’1993 . 282
Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets.
Neural Computation, 18, 1527–1554. 16, 17, 308, 309, 311, 389
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012).
Improving neural networks by preventing co-adaptation of feature detectors. Technical report,
arXiv:1207.0580. 139
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis,
T.U. Munich. 163, 210, 218
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation,
9(8), 1735–1780. 215, 216
Hochreiter, S., Informatik, F. F., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2000). Gradient
flow in recurrent nets: the difficulty of learning long-term dependencies. In J. Kolen and
S. Kremer, editors, Field Guide to Dynamical Recurrent Networks. IEEE Press. 216
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are uni-
versal approximators. Neural Networks, 2, 359–366. 331
Hsu, F.-H. (2002). Behind Deep Blue: Building the Computer That Defeated the World Chess
Champion. Princeton University Press, Princeton, NJ, USA. 4
Huang, F. and Ogata, Y. (2002). Generalized pseudo-likelihood estimates for Markov random
fields on lattice. Annals of the Institute of Statistical Mathematics, 54(1), 1–18. 366
Hyötyniemi, H. (1996). Turing machines are recurrent neural networks. In STeP’96 , pages
13–24. 193
Hyvärinen, A. (1999). Survey on independent component analysis. Neural Computing Surveys,
2, 94–128. 291
Hyvärinen, A. (2005a). Estimation of non-normalized statistical models using score matching.
J. Machine Learning Res., 6. 301
Hyvärinen, A. (2005b). Estimation of non-normalized statistical models using score matching.
Journal of Machine Learning Research, 6, 695–709. 367
Hyvärinen, A. (2007a). Connections between score matching, contrastive divergence, and pseu-
dolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks, 18,
1529–1531. 368
Hyvärinen, A. (2007b). Some extensions of score matching. Computational Statistics and Data
Analysis, 51, 2499–2512. 368
Hyvärinen, A. and Pajunen, P. (1999). Nonlinear independent component analysis: Existence
and uniqueness results. Neural Networks, 12(3), 429–439. 292
Hyvärinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis. Wiley-
Interscience. 291
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixture of local
experts. Neural Computation, 3, 79–87. 113
Jaeger, H. (2003). Adaptive nonlinear system identification with echo state networks. In Advances
in Neural Information Processing Systems 15 . 211
Jaeger, H. (2007a). Discovering multiscale dynamical features with hierarchical echo state net-
works. Technical report, Jacobs University. 217
Jaeger, H. (2007b). Echo state network. Scholarpedia, 2(9), 2330. 210
Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving
energy in wireless communication. Science, 304(5667), 78–80. 17, 210
Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., Mooij, J. M., and Schölkopf, B. (2012). On
causal and anticausal learning. In ICML’2012 , pages 1255–1262. 324
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009a). What is the best multi-stage
architecture for object recognition? In Proc. International Conference on Computer Vision
(ICCV’09), pages 2146–2153. IEEE. 15, 17, 129, 130
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009b). What is the best multi-stage
architecture for object recognition? In ICCV’09 . 109, 298
Jarzynski, C. (1997). Nonequilibrium equality for free energy differences. Phys. Rev. Lett., 78,
2690–2693. 357
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. 35
Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters from
sparse data. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in Practice.
North-Holland, Amsterdam. 222
Jordan, M. I. (1998). Learning in Graphical Models. Kluwer, Dordrecht, Netherlands. 13
Juang, B. H. and Katagiri, S. (1992). Discriminative learning for minimum error classification.
IEEE Transactions on Signal Processing, 40(12), 3043–3054. 229
Jutten, C. and Herault, J. (1991). Blind separation of sources, part I: an adaptive algorithm
based on neuromimetic architecture. Signal Processing, 24, 1–10. 291
Kamyshanska, H. and Memisevic, R. (2015). The potential energy of an autoencoder. IEEE
Transactions on Pattern Analysis and Machine Intelligence. 304
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014). Large-
scale video classification with convolutional neural networks. In CVPR. 14
Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model compo-
nent of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing,
ASSP-35(3), 400–401. 222
Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008a). Fast inference in sparse coding algo-
rithms with applications to object recognition. CBLL-TR-2008-12-01, NYU. 285
Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008b). Fast inference in sparse coding algo-
rithms with applications to object recognition. Technical report, Computational and Biological
Learning Lab, Courant Institute, NYU. Tech Report CBLL-TR-2008-12-01. 298
Kavukcuoglu, K., Ranzato, M.-A., Fergus, R., and LeCun, Y. (2009). Learning invariant features
through topographic filter maps. In CVPR’2009. 298
Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y. (2010).
Learning convolutional feature hierarchies for visual recognition. In NIPS’2010. 298
Kindermann, R. (1980). Markov Random Fields and Their Applications (Contemporary Mathe-
matics ; V. 1). American Mathematical Society. 261
Kingma, D. and LeCun, Y. (2010a). Regularized estimation of image statistics by score matching.
In NIPS’2010. 301
Kingma, D. and LeCun, Y. (2010b). Regularized estimation of image statistics by score matching.
In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances
in Neural Information Processing Systems 23 , pages 1126–1134. 369
Kingma, D., Rezende, D., Mohamed, S., and Welling, M. (2014). Semi-supervised learning with
deep generative models. In NIPS’2014 . 275
Kingma, D. P. (2013). Fast gradient-based inference with continuous latent variable models in
auxiliary form. Technical report, arXiv:1306.0733. 275
Kingma, D. P. and Welling, M. (2014a). Auto-encoding variational Bayes. In Proceedings of the
International Conference on Learning Representations (ICLR). 275, 342, 343
Kingma, D. P. and Welling, M. (2014b). Efficient gradient-based inference through transforma-
tions between Bayes nets and neural nets. Technical report, arXiv:1402.0480. 275
Klementiev, A., Titov, I., and Bhattarai, B. (2012). Inducing crosslingual distributed represen-
tations of words. In Proceedings of COLING 2012 . 320
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques.
MIT Press. 227, 273, 279
Koren, Y. (2009). The BellKor solution to the Netflix Grand Prize. 143
Koutnik, J., Greff, K., Gomez, F., and Schmidhuber, J. (2014). A clockwork RNN. In
ICML’2014 . 217, 222
Krause, O., Fischer, A., Glasmachers, T., and Igel, C. (2013). Approximation properties of
DBNs with binary hidden units and real-valued visible units. In ICML’2013 . 331
Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images.
Technical report, University of Toronto. 14, 258
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012a). ImageNet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems 25
(NIPS’2012). 15, 17, 71
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012b). ImageNet classification with deep convo-
lutional neural networks. In NIPS’2012. 128, 297
Lafferty, J., McCallum, A., and Pereira, F. C. N. (2001). Conditional random fields: Probabilistic
models for segmenting and labeling sequence data. In C. E. Brodley and A. P. Danyluk, editors,
ICML 2001 . Morgan Kaufmann. 223, 229
Lang, K. J. and Hinton, G. E. (1988). The development of the time-delay neural network archi-
tecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-Mellon University.
191, 212
Lappalainen, H., Giannakopoulos, X., Honkela, A., and Karhunen, J. (2000). Nonlinear inde-
pendent component analysis using ensemble learning: Experiments and discussion. In Proc.
ICA. Citeseer. 292
Larochelle, H. and Bengio, Y. (2008a). Classification using discriminative restricted Boltzmann
machines. In ICML’2008 . 285, 411
Larochelle, H. and Bengio, Y. (2008b). Classification using discriminative restricted Boltzmann
machines. In ICML’08 , pages 536–543. ACM. 322
Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator. In
AISTATS’2011 . 205, 209
Larochelle, H., Erhan, D., and Bengio, Y. (2008). Zero-data learning of new tasks. In AAAI
Conference on Artificial Intelligence. 319
Lasserre, J. A., Bishop, C. M., and Minka, T. P. (2006). Principled hybrids of generative
and discriminative models. In Proceedings of the Computer Vision and Pattern Recognition
Conference (CVPR’06), pages 87–94, Washington, DC, USA. IEEE Computer Society. 322
Le, Q., Ranzato, M., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J., and Ng, A. (2012).
Building high-level features using large scale unsupervised learning. In ICML’2012 . 15, 17
Le Roux, N. and Bengio, Y. (2010). Deep belief networks are compact universal approximators.
Neural Computation, 22(8), 2192–2207. 331
Le Roux, N., Manzagol, P.-A., and Bengio, Y. (2008). Topmoumoute online natural gradient
algorithm. In NIPS’07. 116
LeCun, Y. (1987). Modèles connexionnistes de l’apprentissage. Ph.D. thesis, Université de Paris
VI. 13, 282
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel,
L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Compu-
tation, 1(4), 541–551. 16
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998a). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11), 2278–2324. 13, 14, 223, 230, 232
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998b). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11), 2278–2324. 17
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998c). Gradient based learning applied to
document recognition. Proc. IEEE. 16
Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In
NIPS’07 . 285
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). Convolutional deep belief networks for
scalable unsupervised learning of hierarchical representations. In L. Bottou and M. Littman,
editors, ICML 2009 . ACM, Montreal, Canada. 402
Lenat, D. B. and Guha, R. V. (1989). Building large knowledge-based systems; representation
and inference in the Cyc project. Addison-Wesley Longman Publishing Co., Inc. 5
Leprieur, H. and Haffner, P. (1995). Discriminant learning with minimum memory loss for
improved non-vocabulary rejection. In EUROSPEECH’95 , Madrid, Spain. 229
Lin, T., Horne, B. G., Tino, P., and Giles, C. L. (1996). Learning long-term dependencies is not
as difficult with NARX recurrent neural networks. IEEE Transactions on Neural Networks,
7(6), 1329–1338. 213
Linde, N. (1992). The machine that changed the world, episode 3. Documentary miniseries. 5
Long, P. M. and Servedio, R. A. (2010). Restricted Boltzmann machines are hard to approxi-
mately evaluate or simulate. In Proceedings of the 27th International Conference on Machine
Learning (ICML’10). 384
Lovelace, A. (1842). Notes upon L. F. Menabrea’s “Sketch of the Analytical Engine invented by
Charles Babbage”. 4
Lowerre, B. (1976). The Harpy Speech Recognition System. Ph.D. thesis. 224, 229, 233
Lukoševičius, M. and Jaeger, H. (2009). Reservoir computing approaches to recurrent neural
network training. Computer Science Review, 3(3), 127–149. 210
Luo, H., Carrier, P.-L., Courville, A., and Bengio, Y. (2013). Texture modeling with convolu-
tional spike-and-slab RBMs and deep extensions. In AISTATS’2013. 72
Lyu, S. (2009). Interpretation and generalization of score matching. In UAI’09. 368
Maass, W., Natschlaeger, T., and Markram, H. (2002). Real-time computing without stable
states: A new framework for neural computation based on perturbations. Neural Computation,
14(11), 2531–2560. 210
MacKay, D. (2003). Information Theory, Inference and Learning Algorithms. Cambridge Uni-
versity Press. 42
Marlin, B., Swersky, K., Chen, B., and de Freitas, N. (2010). Inductive principles for restricted
Boltzmann machine learning. In Proceedings of The Thirteenth International Conference on
Artificial Intelligence and Statistics (AISTATS’10), volume 9, pages 509–516. 364, 368, 369,
385
Martens, J. and Medabalimi, V. (2014). On the expressive efficiency of sum product networks.
arXiv:1411.7717 . 332
Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-free
optimization. In Proc. ICML’2011 . ACM. 219
Mase, S. (1995). Consistency of the maximum pseudo-likelihood estimator of continuous state
space Gibbsian processes. The Annals of Applied Probability, 5(3), 603–612. 366
Matan, O., Burges, C. J. C., LeCun, Y., and Denker, J. S. (1992). Multi-digit recognition using
a space displacement neural network. In NIPS’91 , pages 488–495, San Mateo CA. Morgan
Kaufmann. 231
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall, London.
110
Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E., Muller,
X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra, J. (2011).
Unsupervised and transfer learning challenge: a deep learning approach. In JMLR W&CP:
Proc. Unsupervised and Transfer Learning, volume 7. 128, 317
Mesnil, G., Rifai, S., Dauphin, Y., Bengio, Y., and Vincent, P. (2012). Surfing on the manifold.
Learning Workshop, Snowbird. 404
Mikolov, T. (2012). Statistical Language Models based on Neural Networks. Ph.D. thesis, Brno
University of Technology. 114, 220
Mikolov, T., Le, Q. V., and Sutskever, I. (2013). Exploiting similarities among languages for
machine translation. Technical report, arXiv:1309.4168. 320
Minka, T. (2005). Divergence measures and message passing. Technical Report MSR-TR-2005-173,
Microsoft Research, Cambridge, UK. 354
Minsky, M. L. and Papert, S. A. (1969). Perceptrons. MIT Press, Cambridge. 13
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York. 70
Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-
contrastive estimation. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Wein-
berger, editors, Advances in Neural Information Processing Systems 26 , pages 2265–2273.
Curran Associates, Inc. 371
Montúfar, G. (2014). Universal approximation depth and errors of narrow belief networks with
discrete units. Neural Computation, 26. 331
Montúfar, G. and Ay, N. (2011). Refinements of universal approximation results for deep belief
networks and restricted Boltzmann machines. Neural Computation, 23(5), 1306–1319. 331
Montúfar, G. and Morton, J. (2014). When does a mixture of products contain a product of
mixtures? SIAM Journal on Discrete Mathematics (SIDMA). 330
Montúfar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear regions
of deep neural networks. In NIPS’2014. 329, 332, 333
Mor-Yosef, S., Samueloff, A., Modan, B., Navot, D., and Schenker, J. G. (1990). Ranking the
risk factors for cesarean: logistic regression analysis of a nationwide study. Obstet Gynecol,
75(6), 944–947. 5
Mozer, M. C. (1992). The induction of multiscale temporal structure. In NIPS’91 , pages 275–
282, San Mateo, CA. Morgan Kaufmann. 213, 222
Murphy, K. P. (2012). Machine Learning: a Probabilistic Perspective. MIT Press, Cambridge,
MA, USA. 111
Murray, B. U. I. and Larochelle, H. (2014). A deep and tractable density estimator. In
ICML’2014 . 114, 209, 210
Nadas, A., Nahamoo, D., and Picheny, M. A. (1988). On a model-robust training method for
speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-
36(9), 1432–1436. 229
Nair, V. and Hinton, G. (2010). Rectified linear units improve restricted Boltzmann machines.
In ICML’2010. 109, 297
Narayanan, H. and Mitter, S. (2010). Sample complexity of testing the manifold hypothesis. In
NIPS’2010 . 98, 336
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics.
Springer. 153
Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2), 125–139.
356, 357
Neal, R. M. (2005). Estimating ratios of normalizing constants using linked importance sampling.
357, 358
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Reading digits in
natural images with unsupervised feature learning. Deep Learning and Unsupervised Feature
Learning Workshop, NIPS. 14
Ney, H. and Kneser, R. (1993). Improved clustering techniques for class-based statistical language
modelling. In European Conference on Speech Communication and Technology (Eurospeech),
pages 973–976, Berlin. 250
Niesler, T. R., Whittaker, E. W. D., and Woodland, P. C. (1998). Comparison of part-of-
speech and automatically derived category-based language models for speech recognition. In
International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 177–
180. 250
Niranjan, M. and Fallside, F. (1990). Neural networks and radial basis functions in classifying
static speech patterns. Computer Speech and Language, 4, 275–289. 109
Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer. 65, 68
Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field properties by
learning a sparse code for natural images. Nature, 381, 607–609. 285, 335
Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set: a
strategy employed by V1? Vision Research, 37, 3311–3325. 296
Park, H., Amari, S.-I., and Fukumizu, K. (2000). Adaptive natural gradient learning algorithms
for various stochastic models. Neural Networks, 13(7), 755–764. 116
Pascanu, R. (2014). On recurrent and deep networks. Ph.D. thesis, Université de Montréal. 160,
161
Pascanu, R. and Bengio, Y. (2012). On the difficulty of training recurrent neural networks.
Technical Report arXiv:1211.5063, Université de Montréal. 114
Pascanu, R. and Bengio, Y. (2013). Revisiting natural gradient for deep networks. Technical
report, arXiv:1301.3584. 116
Pascanu, R., Mikolov, T., and Bengio, Y. (2013a). On the difficulty of training recurrent neural
networks. In ICML’2013 . 114, 164, 210, 213, 220, 221, 222
Pascanu, R., Montufar, G., and Bengio, Y. (2013b). On the number of inference regions of
deep feed forward networks with piece-wise linear activations. Technical report, U. Montreal,
arXiv:1312.6098. 127
Pascanu, R., Gülçehre, Ç., Cho, K., and Bengio, Y. (2014a). How to construct deep recurrent
neural networks. In ICLR’2014. 153
Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. (2014b). How to construct deep recurrent
neural networks. In ICLR’2014. 215, 217, 332
Pascanu, R., Montufar, G., and Bengio, Y. (2014c). On the number of inference regions of deep
feed forward networks with piece-wise linear activations. In ICLR’2014 . 329
Pearl, J. (1985). Bayesian networks: A model of self-activated memory for evidential reasoning.
In Proceedings of the 7th Conference of the Cognitive Science Society, University of California,
Irvine, pages 329–334. 259
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufmann. 36
Petersen, K. B. and Pedersen, M. S. (2006). The matrix cookbook. Version 20051003. 20
Pinto, N., Cox, D. D., and DiCarlo, J. J. (2008). Why is real-world visual object recognition
hard? PLoS Comput Biol, 4. 402
Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1), 77–
105. 204
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR
Computational Mathematics and Mathematical Physics, 4(5), 1–17. 166
Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In
UAI’2011 , Barcelona, Spain. 127, 331, 332
Poundstone, W. (2005). Fortune’s Formula: The untold story of the scientific betting system
that beat the casinos and Wall Street. Macmillan. 42
Powell, M. (1987). Radial basis functions for multivariable interpolation: A review. 109
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech
recognition. Proceedings of the IEEE , 77(2), 257–286. 227
Rabiner, L. R. and Juang, B. H. (1986). An introduction to hidden Markov models. IEEE ASSP
Magazine, pages 257–285. 191, 227
Raiko, T., Yao, L., Cho, K., and Bengio, Y. (2014). Iterative neural autoregressive distribution
estimator (NADE-k). Technical report, arXiv:1406.1485. 209
Raina, R., Madhavan, A., and Ng, A. Y. (2009). Large-scale deep unsupervised learning using
graphics processors. In L. Bottou and M. Littman, editors, ICML 2009 , pages 873–880, New
York, NY, USA. ACM. 17
Ramsey, F. P. (1926). Truth and probability. In R. B. Braithwaite, editor, The Foundations
of Mathematics and other Logical Essays, chapter 7, pages 156–198. McMaster University
Archive for the History of Economic Thought. 37
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse
representations with an energy-based model. In NIPS’2006. 16, 296, 308, 309, 311
Ranzato, M., Boureau, Y., and LeCun, Y. (2008). Sparse feature learning for deep belief net-
works. In NIPS’2007 . 296
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approx-
imate inference in deep generative models. In ICML’2014 . 275
Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. Y. (2013). Zero-shot learning through
cross-modal transfer. In 27th Annual Conference on Neural Information Processing Systems
(NIPS 2013). 319, 320
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011a). Contractive auto-encoders:
Explicit invariance during feature extraction. In ICML’2011 . 304, 305, 306, 338
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X. (2011b).
Higher order contractive auto-encoder. In European Conference on Machine Learning and
Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). 285
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X. (2011c).
Higher order contractive auto-encoder. In ECML PKDD. 304
Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011d). The manifold tangent
classifier. In NIPS’2011. 350, 351
Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling
contractive auto-encoders. In ICML’2012 . 404
Roberts, S. and Everson, R. (2001). Independent component analysis: principles and practice.
Cambridge University Press. 292
Robinson, A. J. and Fallside, F. (1991). A recurrent error propagation network speech recognition
system. Computer Speech and Language, 5(3), 259–274. 17
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and
organization in the brain. Psychological Review, 65, 386–408. 13, 17
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York. 13, 17
Roweis, S. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embed-
ding. Science, 290(5500). 98, 340
Rumelhart, D., Hinton, G., and Williams, R. (1986a). Learning representations by back-
propagating errors. Nature, 323, 533–536. 13, 248
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986b). Learning internal representations
by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed
Processing, volume 1, chapter 8, pages 318–362. MIT Press, Cambridge. 14, 17
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986c). Learning representations by
back-propagating errors. Nature, 323, 533–536. 104, 191
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986d). Parallel Dis-
tributed Processing: Explorations in the Microstructure of Cognition, volume 1. MIT Press,
Cambridge. 104
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A.,
Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2014). ImageNet Large Scale Visual
Recognition Challenge. 14
Salakhutdinov, R. and Hinton, G. (2009a). Deep Boltzmann machines. In Proceedings of the
International Conference on Artificial Intelligence and Statistics, volume 5, pages 448–455.
15, 17, 309, 391, 395, 398
Salakhutdinov, R. and Hinton, G. (2009b). Deep Boltzmann machines. In Proceedings of the
Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS 2009),
volume 8. 400, 409
Salakhutdinov, R. and Hinton, G. E. (2008). Using deep belief nets to learn covariance kernels
for Gaussian processes. In NIPS’07 , pages 1249–1256, Cambridge, MA. MIT Press. 322
Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks.
In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, volume 25, pages
872–879. ACM. 357
Saul, L. K., Jaakkola, T., and Jordan, M. I. (1996). Mean field theory for sigmoid belief networks.
Journal of Artificial Intelligence Research, 4, 61–76. 17
Schaul, T., Zhang, S., and LeCun, Y. (2012). No More Pesky Learning Rates. Technical report,
New York University, arxiv 1206.1106. 171
Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of history
compression. Neural Computation, 4(2), 234–242. 16, 217
Schölkopf, B. and Smola, A. (2002). Learning with Kernels. MIT Press. 95
Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation, 10, 1299–1319. 98, 340
Schölkopf, B., Burges, C. J. C., and Smola, A. J. (1999). Advances in Kernel Methods: Support
Vector Learning. MIT Press, Cambridge, MA. 13, 109, 128
Schulz, H. and Behnke, S. (2012). Learning two-layer contractive encodings. In ICANN’2012 ,
pages 620–628. 305
Schuster, M. and Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE Transactions
on Signal Processing, 45(11), 2673–2681. 202
Sermanet, P., Kavukcuoglu, K., Chintala, S., and LeCun, Y. (2013). Pedestrian detection with
unsupervised multi-stage feature learning. In Proc. International Conference on Computer
Vision and Pattern Recognition (CVPR’13). IEEE. 128
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. (2014). Overfeat:
Integrated recognition, localization and detection using convolutional networks. International
Conference on Learning Representations. 71
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Jour-
nal, 27(3), 379–423. 42
Shannon, C. E. (1949). Communication in the presence of noise. Proceedings of the Institute of
Radio Engineers, 37(1), 10–21. 42
Shilov, G. (1977). Linear Algebra. Dover Books on Mathematics Series. Dover Publications. 20
Siegelmann, H. (1995). Computation beyond the Turing limit. Science, 268(5210), 545–548.
193
Siegelmann, H. and Sontag, E. (1991). Turing computability with neural nets. Applied Mathe-
matics Letters, 4(6), 77–80. 193
Siegelmann, H. T. and Sontag, E. D. (1995). On the computational power of neural nets. Journal
of Computer and Systems Sciences, 50(1), 132–150. 164
Simard, P., Victorri, B., LeCun, Y., and Denker, J. (1992). Tangent prop - A formalism for
specifying selected invariances in an adaptive network. In NIPS’1991. 349, 350, 351
Simard, P. Y., LeCun, Y., and Denker, J. (1993). Efficient pattern recognition using a new
transformation distance. In NIPS’92. 348
Simard, P. Y., LeCun, Y. A., Denker, J. S., and Victorri, B. (1998). Transformation invariance in
pattern recognition — tangent distance and tangent propagation. Lecture Notes in Computer
Science, 1524. 348
Sjöberg, J. and Ljung, L. (1995). Overtraining, regularization and searching for a minimum,
with application to neural networks. International Journal of Control, 62(6), 1391–1407. 149
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony
theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing,
volume 1, chapter 6, pages 194–281. MIT Press, Cambridge. 266, 277
Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011a). Dynamic
pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS’2011 . 205
Socher, R., Manning, C., and Ng, A. Y. (2011b). Parsing natural scenes and natural language
with recursive neural networks. In Proceedings of the Twenty-Eighth International Conference
on Machine Learning (ICML’2011). 205
Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D. (2011c). Semi-
supervised recursive autoencoders for predicting sentiment distributions. In EMNLP’2011.
205
Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C.
(2013). Recursive deep models for semantic compositionality over a sentiment treebank. In
EMNLP’2013 . 205
Solla, S. A., Levin, E., and Fleisher, M. (1988). Accelerated learning in layered neural networks.
Complex Systems, 2, 625–639. 112
Sontag, E. D. and Sussman, H. J. (1989). Backpropagation can give rise to spurious local minima
even for networks without hidden layers. Complex Systems, 3, 91–106. 158
Srivastava, N. and Salakhutdinov, R. (2012). Multimodal learning with deep Boltzmann ma-
chines. In NIPS’2012 . 321
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout:
A simple way to prevent neural networks from overfitting. Journal of Machine Learning
Research, 15, 1929–1958. 151, 152, 153, 398
Stewart, L., He, X., and Zemel, R. S. (2007). Learning flexible features for conditional random
fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8), 1415–1426.
223
Sutskever, I. (2012). Training Recurrent Neural Networks. Ph.D. thesis, Department of Computer Science, University of Toronto. 211, 219
Sutskever, I. and Tieleman, T. (2010). On the Convergence Properties of Contrastive Divergence.
In Y. W. Teh and M. Titterington, editors, Proc. of the International Conference on Artificial
Intelligence and Statistics (AISTATS), volume 9, pages 789–795. 364
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization
and momentum in deep learning. In ICML. 167, 211, 219
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural
networks. Technical report, arXiv:1409.3215. 215, 216
Swersky, K. (2010). Inductive Principles for Learning Restricted Boltzmann Machines. Master’s
thesis, University of British Columbia. 302
Swersky, K., Ranzato, M., Buchman, D., Marlin, B., and de Freitas, N. (2011). On autoencoders
and score matching for energy based models. In ICML’2011. ACM. 369
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Van-
houcke, V., and Rabinovich, A. (2014). Going deeper with convolutions. Technical report,
arXiv:1409.4842. 15, 17
Tenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global geometric framework for
nonlinear dimensionality reduction. Science, 290(5500), 2319–2323. 98, 312, 313, 340
Thrun, S. (1995). Learning to play the game of chess. In NIPS’1994. 350
Tibshirani, R. J. (1995). Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society B, 58, 267–288. 137
Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the
likelihood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008 ,
pages 1064–1071. ACM. 364, 388
Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal components analysis. Journal
of the Royal Statistical Society B, 61(3), 611–622. 291
Uria, B., Murray, I., and Larochelle, H. (2013). RNADE: The real-valued neural autoregressive
density-estimator. In NIPS’2013 . 208, 209
Utgoff, P. E. and Stracuzzi, D. J. (2002). Many-layered learning. Neural Computation, 14,
2497–2539. 16
van der Maaten, L. and Hinton, G. E. (2008a). Visualizing data using t-SNE. J. Machine
Learning Res., 9. 312, 340, 343
van der Maaten, L. and Hinton, G. E. (2008b). Visualizing data using t-SNE. Journal of Machine
Learning Research, 9, 2579–2605. 313
Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer-Verlag,
Berlin. 78, 79
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York. 78, 79,
81
Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies
of events to their probabilities. Theory of Probability and Its Applications, 16, 264–280. 78,
79
Vincent, P. (2011a). A connection between score matching and denoising autoencoders. Neural
Computation, 23(7). 301, 302, 304, 404
Vincent, P. (2011b). A connection between score matching and denoising autoencoders. Neural
Computation, 23(7), 1661–1674. 369, 405
Vincent, P. and Bengio, Y. (2003). Manifold Parzen windows. In NIPS’2002 . MIT Press. 340
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing
robust features with denoising autoencoders. In ICML 2008 . 298
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denois-
ing autoencoders: Learning useful representations in a deep network with a local denoising
criterion. J. Machine Learning Res., 11. 298
Wager, S., Wang, S., and Liang, P. (2013). Dropout training as adaptive regularization. In
Advances in Neural Information Processing Systems 26 , pages 351–359. 153
Waibel, A., Hanazawa, T., Hinton, G. E., Shikano, K., and Lang, K. (1989). Phoneme recogni-
tion using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal
Processing, 37, 328–339. 191
Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of neural
networks using dropconnect. In ICML’2013. 154
Wang, S. and Manning, C. (2013). Fast dropout training. In ICML’2013. 153
Warde-Farley, D., Goodfellow, I. J., Courville, A., and Bengio, Y. (2014). An empirical analysis
of dropout in piecewise linear networks. In ICLR’2014. 153
Weinberger, K. Q. and Saul, L. K. (2004). Unsupervised learning of image manifolds by semidef-
inite programming. In CVPR’2004, pages 988–995. 98, 340
Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding.
In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, pages 1168–1175,
New York, NY, USA. ACM. 322
Weston, J., Bengio, S., and Usunier, N. (2010). Large scale image annotation: learning to rank
with joint word-image embeddings. Machine Learning, 81(1), 21–35. 205
White, H. (1990). Connectionist nonparametric regression: Multilayer feedforward networks can
learn arbitrary mappings. Neural Networks, 3(5), 535–549. 126
Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. In 1960 IRE WESCON
Convention Record, volume 4, pages 96–104. IRE, New York. 13, 14, 15, 17
Wikipedia (2015). List of animals by number of neurons. Wikipedia, the free encyclopedia.
[Online; accessed 4-March-2015]. 15, 17
Williams, C. K. I. and Rasmussen, C. E. (1996). Gaussian processes for regression. In NIPS’95,
pages 514–520. MIT Press, Cambridge, MA. 128
Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms. Neural
Computation, 8(7), 1341–1390. 127
Xiong, H. Y., Barash, Y., and Frey, B. J. (2011). Bayesian prediction of tissue-regulated splicing
using RNA sequence and cellular context. Bioinformatics, 27(18), 2554–2562. 153
Xu, L. and Jordan, M. I. (1996). On convergence properties of the EM algorithm for Gaussian
mixtures. Neural Computation, 8, 129–151. 228
Younes, L. (1998). On the convergence of Markovian stochastic algorithms with rapidly decreas-
ing ergodicity rates. In Stochastics and Stochastics Models, pages 177–228. 364, 388
Zaslavsky, T. (1975). Facing Up to Arrangements: Face-Count Formulas for Partitions of Space
by Hyperplanes. Number no. 154 in Memoirs of the American Mathematical Society. American
Mathematical Society. 330
Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In
ECCV’14 . 9, 71
Zhou, J. and Troyanskaya, O. G. (2014). Deep supervised and convolutional generative stochastic
network for protein secondary structure prediction. In ICML’2014 . 410
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal
of the Royal Statistical Society, Series B, 67(2), 301–320. 116
Zöhrer, M. and Pernkopf, F. (2014). General stochastic networks for classification. In NIPS’2014.
411
Index
L^p norm, 26
Active constraint, 68
ADALINE, see Adaptive Linear Element
Adaptive Linear Element, 13, 15, 17
AIS, see annealed importance sampling
Almost everywhere, 52
Ancestral sampling, 279
Annealed importance sampling, 354, 396
Approximate inference, 274
Artificial intelligence, 4
Asymptotically unbiased, 84
Autoencoder, 7
Bagging, 142
Bayes’ rule, 51
Bayesian network, see directed graphical model
Bayesian probability, 37
Beam search, 233
Belief network, see directed graphical model
Bernoulli distribution, 44
Boltzmann distribution, 265
Boltzmann machine, 265
Broadcasting, 22
Calculus of variations, 378
CD, see contrastive divergence
Centering trick (DBM), 398
Central limit theorem, 45
Chain rule of probability, 40
Chess, 4
Chord, 272
Chordal graph, 272
Classical regularization, 132
Classification, 71
Cliffs, 159
Clipping the gradient, 220
Clique potential, see factor (graphical model)
CNN, see convolutional neural network
Collider, see explaining away
Computer vision, 241
Conditional computation, see dynamically struc-
tured nets, 237
Conditional independence, 40
Conditional probability, 39
Constrained optimization, 67
Context-specific independence, 267
Contrast, 242
Contrastive divergence, 361, 397, 398
Convolution, 173, 401
Convolutional neural network, 173
Coordinate descent, 168, 169, 398
Correlation, 41
Cost function, see objective function
Covariance, 41
Covariance matrix, 42
Curse of dimensionality, 98
Cyc, 5
D-separation, 266
Dataset augmentation, 242, 247
DBM, see deep Boltzmann machine
Decoder, 7
Deep belief network, 17, 372, 383, 389, 402
Deep Blue, 4
Deep Boltzmann machine, 15, 17, 372, 383,
391, 398, 402
Deep learning, 4, 7
Denoising score matching, 369
Density estimation, 71
Derivative, 58
Detector layer, 178
Dirac delta function, 47
Directed graphical model, 259
Directional derivative, 62
Distributed representation, 325
Domain adaptation, 315
Dot product, 23
Doubly block circulant matrix, 175
Dream sleep, 360, 381
DropConnect, 154
Dropout, 151, 398
Dynamically structured networks, 237
E-step, 375
Early stopping, 116, 143, 146, 148
EBM, see energy-based model
Echo state network, 15, 17
Effective number of parameters, 134
Eigendecomposition, 29
Eigenvalue, 29
Eigenvector, 29
ELBO, see evidence lower bound
Element-wise product, see Hadamard product
EM, see expectation maximization
Embedding, 339
Empirical distribution, 47
Empirical risk, 156
Empirical risk minimization, 157
Encoder, 7
Energy function, 265
Energy-based model, 265, 392
Ensemble methods, 142
Epoch, 158, 166
Equality constraint, 67
Equivariance, 176
Error function, see objective function
Euclidean norm, 27
Euler-Lagrange equation, 379
Evidence lower bound, 372, 374–376, 391
Expectation, 41
Expectation maximization, 375
Expected value, see expectation
Explaining away, 268
Factor (graphical model), 262
Factor graph, 272
Factors of variation, 7
Frequentist probability, 37
Functional derivatives, 378
Gaussian distribution, see Normal distribution, 45
Gaussian mixture, 48
GCN, see Global contrast normalization
Generalized Lagrange function, see Generalized
Lagrangian
Generalized Lagrangian, 67
Gibbs distribution, 263
Gibbs sampling, 280
Global contrast normalization, 243
Gradient, 62
Gradient clipping, 220
Gradient descent, 62
Graph Transformer, 232
Graphical model, see structured probabilistic
model
Greedy layer-wise unsupervised pre-training, 308
Hadamard product, 23
Harmonium, see Restricted Boltzmann machine, 277
Harmony theory, 266
Helmholtz free energy, see evidence lower bound
Hessian matrix, 63
Hidden layer, 9
Identity matrix, 24
Immorality, 270
Independence, 40
Inequality constraint, 67
Inference, 257, 274, 372, 374–376, 378, 380
Invariance, 181
Jacobian matrix, 52, 62
Joint probability, 38
Karush-Kuhn-Tucker conditions, 68
Karush–Kuhn–Tucker, 67
Kernel (convolution), 174
KKT, see Karush–Kuhn–Tucker
KKT conditions, see Karush-Kuhn-Tucker con-
ditions
KL divergence, see Kullback-Leibler divergence, 43
Knowledge base, 5
Kullback-Leibler divergence, 43
Lagrange multipliers, 67, 68, 379
Lagrangian, see Generalized Lagrangian, 67
Latent variable, 286
Line search, 62
Linear combination, 25
Linear dependence, 26
Local conditional probability distribution, 260
Logistic regression, 5
Logistic sigmoid, 48
Loop, 272
Loss function, see objective function
M-step, 375
Machine learning, 5
Manifold hypothesis, 98, 336
Manifold learning, 97, 336
MAP inference, 376
Marginal probability, 39
Markov chain, 279
Markov network, see undirected model, 261
Markov random field, see undirected model, 261
Matrix, 21
Matrix inverse, 24
Matrix product, 22
Max pooling, 181
Mean field, 397, 398
Measure theory, 51
Measure zero, 52
Method of steepest descent, see gradient de-
scent
Missing inputs, 71
Mixing (Markov chain), 281
Mixture distribution, 48
MLP, see multilayer perceptron
MNIST, 398
Model averaging, 142
Model compression, 236
Moore-Penrose pseudoinverse, 140
Moralized graph, 270
MP-DBM, see multi-prediction DBM
MRF (Markov Random Field), see undirected model, 261
Multi-modal learning, 321
Multi-prediction DBM, 397, 398
Multi-task learning, 154
Multilayer perceptron, 8, 17
Multinomial distribution, 44
Multinoulli distribution, 44
Naive Bayes, 5, 53
Nat, 42
Natural image, 256
Negative definite, 63
Negative phase, 360
Neocognitron, 15, 17
Nesterov momentum, 167
Netflix Grand Prize, 143
Noise-contrastive estimation, 370
Norm, 26
Normal distribution, 45
Normal equations, 135
Object detection, 241
Object recognition, 241
Objective function, 58
One-shot learning, 319
Orthogonality, 28
Overfitting, 79
Parameter sharing, 176
Partial derivative, 58
Partition function, 101, 263, 352, 397
PCA, see principal components analysis
PCD, see stochastic maximum likelihood
Perceptron, 13, 17
Persistent contrastive divergence, see stochas-
tic maximum likelihood
Pooling, 173, 402
Positive definite, 63
Positive phase, 360
Pre-training, 308
Precision (of a normal distribution), 45, 47
Predictive sparse decomposition, 285, 296
Preprocessing, 242
Principal components analysis, 32, 91, 244, 372
Probabilistic max pooling, 402
Probability density function, 38
Probability distribution, 37
Probability mass function, 38
Product rule of probability, see chain rule of
probability
PSD, see predictive sparse decomposition
Pseudolikelihood, 365
Random variable, 37
Ratio matching, 368
RBM, see restricted Boltzmann machine
Receptive field, 177
Recurrent network, 17
Regression, 71
Regularization, 131
Representation learning, 5
Restricted Boltzmann machine, 277, 372, 382, 383, 398, 400–402
Ridge regression, 133
Risk, 156
Scalar, 20
Score matching, 367
Second derivative, 62
Second derivative test, 63
Self-information, 42
Separable convolution, 190
Separation (probabilistic modeling), 266
SGD, see stochastic gradient descent
Shannon entropy, 42, 379
Sigmoid, see logistic sigmoid
Sigmoid belief network, 17
Singular value decomposition, 30, 140
SML, see stochastic maximum likelihood
Softmax, 111
Softplus, 48
Spam detection, 5
Sparse coding, 292, 372
Spectral radius, 211
Sphering, see Whitening, 244
Spike and slab restricted Boltzmann machine,
401
Square matrix, 26
ssRBM, see spike and slab restricted Boltzmann
machine
Standard deviation, 41
Statistic, 83
Steepest descent, see gradient descent
Stochastic gradient descent, 158, 165, 398
Stochastic maximum likelihood, 364, 397, 398
Stochastic pooling, 154
Structure learning, 273
Structured output, 71
Structured probabilistic model, 255
Sum rule of probability, 39
Surrogate loss function, 157
SVD, see singular value decomposition
Symmetric matrix, 28
Tangent plane, 340
Tensor, 21
Tiled convolution, 186
Toeplitz matrix, 175
Trace operator, 31
Transcription, 71
Transfer learning, 315
Transpose, 22
Triangle inequality, 27
Triangulated graph, see chordal graph
Unbiased, 84
Underfitting, 79
Undirected model, 261
Uniform distribution, 38
Unit norm, 28
Unnormalized probability distribution, 262
Unsupervised pre-training, 308
V-structure, see explaining away
Vapnik-Chervonenkis dimension, 78
Variance, 41
Variational derivatives, see functional deriva-
tives
Variational free energy, see evidence lower bound
VC dimension, see Vapnik-Chervonenkis dimen-
sion
Vector, 20
Visible layer, 9
Viterbi decoding, 226
Weight decay, 133
Whitening, 244
ZCA, see zero-phase components analysis
Zero-data learning, 319
Zero-phase components analysis, 244
Zero-shot learning, 319