Bibliography
Alain, G. and Bengio, Y. (2012). What regularized auto-encoders learn from the data generating
distribution. Technical Report Arxiv report 1211.4246, Université de Montréal. 279
Alain, G. and Bengio, Y. (2013). What regularized auto-encoders learn from the data generating
distribution. In ICLR’2013. also arXiv report 1211.4246. 279, 281
Amari, S. (1997). Neural learning in structured parameter spaces - natural Riemannian gradient.
In Advances in Neural Information Processing Systems, pages 127–133. MIT Press. 113
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning
to align and translate. Technical report, arXiv preprint arXiv:1409.0473. 10
Bahl, L. R., Brown, P., de Souza, P. V., and Mercer, R. L. (1987). Speech recognition with
continuous-parameter hidden Markov models. Computer, Speech and Language, 2, 219–234.
48, 254
Baldi, P. and Brunak, S. (1998). Bioinformatics, the Machine Learning Approach. MIT Press.
256
Baldi, P. and Sadowski, P. J. (2013). Understanding dropout. In Advances in Neural Information
Processing Systems 26 , pages 2814–2822. 149
Baldi, P., Brunak, S., Frasconi, P., Soda, G., and Pollastri, G. (1999). Exploiting the past and
the future in protein secondary structure prediction. Bioinformatics, 15(11), 937–946. 228
Barron, A. E. (1993). Universal approximation bounds for superpositions of a sigmoidal function.
IEEE Trans. on Information Theory, 39, 930–945. 121
Bartholomew, D. J. (1987). Latent variable models and factor analysis. Oxford University Press.
187
Basilevsky, A. (1994). Statistical Factor Analysis and Related Methods: Theory and Applications.
Wiley. 187
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard,
N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning
and Unsupervised Feature Learning NIPS 2012 Workshop. 57
Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state
Markov chains. Ann. Math. Stat., 37, 1559–1563. 252
Becker, S. and Hinton, G. (1992). A self-organizing neural network that discovers surfaces in
random-dot stereograms. Nature, 355, 161–163. 300
Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Computation, 15(6), 1373–1396. 98, 265
Bengio, S. and Bengio, Y. (2000a). Taking on the curse of dimensionality in joint distributions
using neural networks. IEEE Transactions on Neural Networks, special issue on Data Mining
and Knowledge Discovery, 11(3), 550–557. 232
Bengio, Y. (1991). Artificial Neural Networks and their Application to Sequence Recognition.
Ph.D. thesis, McGill University, (Computer Science), Montreal, Canada. 237, 256
Bengio, Y. (1993). A connectionist approach to speech recognition. International Journal on
Pattern Recognition and Artificial Intelligence, 7(4), 647–668. 254
Bengio, Y. (1999a). Markovian models for sequential data. Neural Computing Surveys, 2,
129–162. 254
Bengio, Y. (1999b). Markovian models for sequential data. Neural Computing Surveys, 2,
129–162. 256
Bengio, Y. (2009). Learning deep architectures for AI . Now Publishers. 18, 95, 122
Bengio, Y. (2011). Deep learning of representations for unsupervised and transfer learning. In
JMLR W&CP: Proc. Unsupervised and Transfer Learning. 18
Bengio, Y. and Bengio, S. (2000b). Modeling high-dimensional discrete data with multi-layer
neural networks. In NIPS’99, pages 400–406. MIT Press. 232, 234, 235
Bengio, Y. and Delalleau, O. (2009). Justifying and generalizing contrastive divergence. Neural
Computation, 21(6), 1601–1621. 310
Bengio, Y. and Frasconi, P. (1996). Input/Output HMMs for sequence processing. IEEE Trans-
actions on Neural Networks, 7(5), 1231–1249. 256
Bengio, Y. and LeCun, Y. (2007a). Scaling learning algorithms towards AI. In Large Scale
Kernel Machines. 95
Bengio, Y. and LeCun, Y. (2007b). Scaling learning algorithms towards AI. In L. Bottou,
O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press.
123
Bengio, Y. and Monperrus, M. (2005). Non-local manifold tangent learning. In NIPS’04, pages
129–136. MIT Press. 97, 266
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992). Global optimization of a neural
network-hidden Markov model hybrid. IEEE Transactions on Neural Networks, 3(2), 252–259.
254, 256
Bengio, Y., Frasconi, P., and Simard, P. (1993). The problem of learning long-term dependencies
in recurrent networks. In IEEE International Conference on Neural Networks, pages 1183–
1195, San Francisco. IEEE Press. (invited paper). 155, 243
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient
descent is difficult. IEEE Tr. Neural Nets. 155, 156, 235, 241, 243
Bengio, Y., LeCun, Y., Nohl, C., and Burges, C. (1995). Lerec: A NN/HMM hybrid for on-line
handwriting recognition. Neural Computation, 7(6), 1289–1303. 256
Bengio, Y., Ducharme, R., and Vincent, P. (2001a). A neural probabilistic language model. In
NIPS’00 , pages 932–938. MIT Press. 14
Bengio, Y., Ducharme, R., and Vincent, P. (2001b). A neural probabilistic language model. In
T. K. Leen, T. G. Dietterich, and V. Tresp, editors, NIPS’2000 , pages 932–938. MIT Press.
267, 269
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language
model. Journal of Machine Learning Research, 3, 1137–1155. 267, 269
Bengio, Y., Delalleau, O., and Le Roux, N. (2006a). The curse of highly variable functions for
local kernel machines. In NIPS’2005. 94
Bengio, Y., Larochelle, H., and Vincent, P. (2006b). Non-local manifold Parzen windows. In
NIPS’2005 . MIT Press. 97, 265
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of
deep networks. In NIPS’2006. 15, 142, 192
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In
ICML’09 . 113
Bengio, Y., Yao, L., Alain, G., and Vincent, P. (2013a). Generalized denoising auto-encoders as
generative models. In Advances in Neural Information Processing Systems 26 (NIPS’13). 282
Bengio, Y., Courville, A., and Vincent, P. (2013b). Representation learning: A review and
new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 35(8),
1798–1828. 298, 299, 339
Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014). Deep generative stochastic
networks trainable by backprop. In Proceedings of the 30th International Conference on
Machine Learning (ICML’14). 282, 283
Bennett, C. (1976). Efficient estimation of free energy differences from Monte Carlo data. Journal
of Computational Physics, 22(2), 245–268. 306
Berglund, M. and Raiko, T. (2013). Stochastic gradient estimate variance in contrastive diver-
gence and persistent contrastive divergence. CoRR, abs/1312.6002. 313
Bergstra, J. (2011). Incorporating Complex Cells into Neural Networks for Pattern Classification.
Ph.D. thesis, Université de Montréal. 183
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J.,
Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler.
In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.
57
Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24(3), 179–195. 315
Bishop, C. M. (1994). Mixture density networks. 109
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1989). Learnability and the
Vapnik–Chervonenkis dimension. Journal of the ACM, 36(4), 929–965. 78, 79
Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2012). Joint learning of words and meaning
representations for open-text semantic parsing. AISTATS’2012 . 230
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal margin
classifiers. In COLT ’92: Proceedings of the fifth annual workshop on Computational learning
theory, pages 144–152, New York, NY, USA. ACM. 14, 95, 106
Bottou, L. (1991). Une approche théorique de l’apprentissage connexioniste; applications à la
reconnaissance de la parole. Ph.D. thesis, Université de Paris XI. 256
Bottou, L. (2011). From machine learning to machine reasoning. Technical report,
arXiv.1102.1808. 229, 230
Bottou, L., Fogelman-Soulié, F., Blanchet, P., and Lienard, J. S. (1990). Speaker independent
isolated digit recognition: multilayer perceptrons vs dynamic time warping. Neural Networks,
3, 453–465. 256
Bottou, L., Bengio, Y., and LeCun, Y. (1997). Global training of document processing systems
using graph transformer networks. In Proceedings of the Computer Vision and Pattern Recog-
nition Conference (CVPR’97), pages 490–494, Puerto Rico. IEEE. 247, 254, 255, 256, 257,
258, 260
Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer perceptrons and singular
value decomposition. Biological Cybernetics, 59, 291–294. 180
Bourlard, H. and Morgan, N. (1993). Connectionist Speech Recognition. A Hybrid Approach,
volume 247 of The Kluwer international series in engineering and computer science. Kluwer
Academic Publishers, Boston. 256
Bourlard, H. and Wellekens, C. (1990). Links between hidden Markov models and multilayer
perceptrons. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 1167–
1178. 256
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press, New
York, NY, USA. 65
Brand, M. (2003). Charting a manifold. In NIPS’2002 , pages 961–968. MIT Press. 98, 265
Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140. 144
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regres-
sion Trees. Wadsworth International Group, Belmont, CA. 95
Brown, P. (1987). The Acoustic-Modeling problem in Automatic Speech Recognition. Ph.D.
thesis, Dept. of Computer Science, Carnegie-Mellon University. 254
Carreira-Perpiñán, M. A. and Hinton, G. E. (2005). On contrastive divergence learning. In
R. G. Cowell and Z. Ghahramani, editors, AISTATS’2005 , pages 33–40. Society for Artificial
Intelligence and Statistics. 310
Cauchy, A. (1847). Méthode générale pour la résolution de systèmes d’équations simultanées.
In Compte rendu des séances de l’Académie des sciences, pages 536–538. 58
Cayton, L. (2005). Algorithms for manifold learning. Technical Report CS2008-0923, UCSD.
13, 98, 261
Chen, S. F. and Goodman, J. T. (1999). An empirical study of smoothing techniques for language
modeling. Computer, Speech and Language, 13(4), 359–393. 246, 247
Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014).
Learning phrase representations using RNN encoder-decoder for statistical machine translation.
In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014). 241
Ciresan, D., Meier, U., Masci, J., and Schmidhuber, J. (2012). Multi-column deep neural network
for traffic sign classification. Neural Networks, 32, 333–338. 15, 122
Coates, A., Lee, H., and Ng, A. Y. (2011). An analysis of single-layer networks in unsuper-
vised feature learning. In Proceedings of the Thirteenth International Conference on Artificial
Intelligence and Statistics (AISTATS 2011). 350
Collobert, R. (2004). Large Scale Machine Learning. Ph.D. thesis, Université de Paris VI, LIP6.
106
Comon, P. (1994). Independent component analysis - a new concept? Signal Processing, 36,
287–314. 190
Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273–297.
14, 95
Couprie, C., Farabet, C., Najman, L., and LeCun, Y. (2013). Indoor semantic segmentation using
depth information. In International Conference on Learning Representations (ICLR2013). 15,
122
Courville, A., Bergstra, J., and Bengio, Y. (2011). Unsupervised models of images by spike-and-
slab RBMs. In ICML’11. 160
Cover, T. (2006). Elements of Information Theory. Wiley-Interscience. 42
Crick, F. H. C. and Mitchison, G. (1983). The function of dream sleep. Nature, 304, 111–114.
309
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of
Control, Signals, and Systems, 2, 303–314. 297
Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition with the
mean-covariance restricted Boltzmann machine. In NIPS’2010 . 15
Dauphin, Y. and Bengio, Y. (2013a). Big neural networks waste capacity. In ICLR’2013 work-
shops track (oral presentation), arXiv: 1301.3583 . 17
Dauphin, Y. and Bengio, Y. (2013b). Stochastic ratio matching of RBMs for sparse high-
dimensional inputs. In NIPS26. NIPS Foundation. 318
Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying
and attacking the saddle point problem in high-dimensional non-convex optimization. In
NIPS’2014 . 61
Davis, A., Rubinstein, M., Wadhwa, N., Mysore, G., Durand, F., and Freeman, W. T. (2014).
The visual microphone: Passive recovery of sound from video. ACM Transactions on Graphics
(Proc. SIGGRAPH), 33(4), 79:1–79:10. 348
Delalleau, O. and Bengio, Y. (2011). Shallow vs. deep sum-product networks. In NIPS . 122,
296, 297
Deng, J., Ding, N., Jia, Y., Frome, A., Murphy, K., Bengio, S., Li, Y., Neven, H., and Adam,
H. (2014). Large-scale object classification using label relation graphs. In ECCV’2014 , pages
48–64. 248
Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton, G. (2010). Binary coding
of speech spectrograms using a deep auto-encoder. In Interspeech 2010 , Makuhari, Chiba,
Japan. 15
Desjardins, G. and Bengio, Y. (2008). Empirical evaluation of convolutional RBMs for vision.
Technical Report 1327, Département d’Informatique et de Recherche Opérationnelle, Université de Montréal. 338
Desjardins, G., Courville, A., and Bengio, Y. (2011). On tracking the partition function. In
NIPS’2011 . 307
Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., and Makhoul, J. (2014). Fast and
robust neural network joint models for statistical machine translation. In Proc. ACL’2014.
10
Do, T.-M.-T. and Artières, T. (2010). Neural conditional random fields. In International Conference on Artificial Intelligence and Statistics, pages 177–184. 248
Donoho, D. L. and Grimes, C. (2003). Hessian eigenmaps: new locally linear embedding tech-
niques for high-dimensional data. Technical Report 2003-08, Dept. Statistics, Stanford Uni-
versity. 98, 265
Doya, K. (1993). Bifurcations of recurrent neural networks in gradient descent learning. IEEE
Transactions on Neural Networks, 1, 75–80. 156, 235
Dugas, C., Bengio, Y., Bélisle, F., and Nadeau, C. (2001). Incorporating second-order functional
knowledge for better option pricing. In NIPS’00, pages 472–478. MIT Press. 106
Ebrahimi, S., Pal, C., Bouthillier, X., Froumenty, P., Jean, S., Konda, K. R., Vincent, P.,
Courville, A., and Bengio, Y. (2013). Combining modality specific deep neural network mod-
els for emotion recognition in video. In Emotion Recognition In The Wild Challenge and
Workshop (Emotiw2013). 9, 122
El Hihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term depen-
dencies. In NIPS 8 . MIT Press. 242, 245, 246
El Hihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term dependencies. In NIPS’1995. 238
Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S. (2010). Why
does unsupervised pre-training help deep learning? JMLR, 11, 625–660. 18
Farabet, C., LeCun, Y., Kavukcuoglu, K., Culurciello, E., Martini, B., Akselrod, P., and Talay,
S. (2011). Large-scale FPGA-based convolutional networks. In R. Bekkerman, M. Bilenko,
and J. Langford, editors, Scaling up Machine Learning: Parallel and Distributed Approaches.
Cambridge University Press. 276
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013a). Learning hierarchical features for
scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence. 15, 122
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013b). Learning hierarchical features
for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8),
1915–1929. 248
Fei-Fei, L., Fergus, R., and Perona, P. (2006). One-shot learning of object categories. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611. 196
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of
Eugenics, 7, 179–188. 74
Frasconi, P., Gori, M., and Sperduti, A. (1997). On the efficient classification of data structures
by neural networks. In Proc. Int. Joint Conf. on Artificial Intelligence. 229, 230
Frasconi, P., Gori, M., and Sperduti, A. (1998). A general framework for adaptive processing of
data structures. IEEE Transactions on Neural Networks, 9(5), 768–786. 230
Frey, B. J. (1998). Graphical models for machine learning and digital communication. MIT
Press. 231
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism
of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202. 15
Girosi, F. (1994). Regularization theory, radial basis functions and networks. In V. Cherkassky,
J. Friedman, and H. Wechsler, editors, From Statistics to Neural Networks, volume 136 of
NATO ASI Series, pages 166–187. Springer Berlin Heidelberg. 121
Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward
neural networks. In AISTATS’2010. 15
Glorot, X., Bordes, A., and Bengio, Y. (2011a). Deep sparse rectifier neural networks. In
AISTATS’2011 . 15, 106, 275
Glorot, X., Bordes, A., and Bengio, Y. (2011b). Deep sparse rectifier neural networks. In JMLR
W&CP: Proceedings of the Fourteenth International Conference on Artificial Intelligence and
Statistics (AISTATS 2011). 124, 275
Glorot, X., Bordes, A., and Bengio, Y. (2011c). Domain adaptation for large-scale sentiment
classification: A deep learning approach. In ICML’2011. 193, 275
Gong, S., McKenna, S., and Psarrou, A. (2000). Dynamic Vision: From Images to Face Recog-
nition. Imperial College Press. 264, 267
Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep networks. In
NIPS’2009 , pages 646–654. 183, 275
Goodfellow, I., Koenig, N., Muja, M., Pantofaru, C., Sorokin, A., and Takayama, L. (2010).
Help me help you: Interfaces for personal robots. In Proc. of Human Robot Interaction
(HRI), Osaka, Japan. ACM Press, ACM Press. 71
Goodfellow, I., Courville, A., and Bengio, Y. (2012). Large-scale feature learning with spike-
and-slab sparse coding. In ICML’2012 . 192
Goodfellow, I. J. (2010). Technical report: Multidimensional, downsampled convolution for
autoencoders. Technical report, Université de Montréal. 213
Goodfellow, I. J., Courville, A., and Bengio, Y. (2011). Spike-and-slab sparse coding for unsu-
pervised feature discovery. In NIPS Workshop on Challenges in Learning Hierarchical Models.
9, 18, 122, 194
Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a). Maxout
networks. In ICML’2013 . 15
Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013b). Maxout
networks. In S. Dasgupta and D. McAllester, editors, ICML’13 , pages 1319–1327. 124, 148,
350
Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013c). Multi-prediction deep
Boltzmann machines. In NIPS26. NIPS Foundation. 316, 335, 336
Gouws, S., Bengio, Y., and Corrado, G. (2014). Bilbowa: Fast bilingual distributed representa-
tions without word alignments. Technical report, arXiv:1410.2455. 196
Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks. Studies in
Computational Intelligence. Springer. 227, 240, 241, 247
Graves, A. (2013). Generating sequences with recurrent neural networks. Technical report, arXiv
preprint arXiv:1308.0850. 110, 240, 242
Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional
LSTM and other neural network architectures. Neural Networks, 18(5), 602–610. 227
Graves, A. and Schmidhuber, J. (2009). Offline handwriting recognition with multidimensional
recurrent neural networks. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors,
NIPS’2008 , pages 545–552. 227
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist temporal
classification: Labelling unsegmented sequence data with recurrent neural networks. In
ICML’2006 , pages 369–376, Pittsburgh, USA. 247
Graves, A., Liwicki, M., Bunke, H., Schmidhuber, J., and Fernández, S. (2008). Unconstrained
on-line handwriting recognition with recurrent neural networks. In J. Platt, D. Koller,
Y. Singer, and S. Roweis, editors, NIPS’2007 , pages 577–584. 227
Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep recurrent
neural networks. In ICASSP’2013, pages 6645–6649. IEEE. 228, 240, 241
Gulcehre, C. and Bengio, Y. (2013). Knowledge matters: Importance of prior information for
optimization. In International Conference on Learning Representations (ICLR’2013). 18
Gutmann, M. and Hyvarinen, A. (2010). Noise-contrastive estimation: A new estimation prin-
ciple for unnormalized statistical models. In Proceedings of The Thirteenth International
Conference on Artificial Intelligence and Statistics (AISTATS’10). 318
Haffner, P., Franzini, M., and Waibel, A. (1991). Integrating time alignment and neural networks
for high performance continuous speech recognition. In International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 105–108, Toronto. 256
Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of
the 18th annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley, California.
ACM Press. 122, 297
Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Computational Complexity, 1, 113–129. 122, 297
Henaff, M., Jarrett, K., Kavukcuoglu, K., and LeCun, Y. (2011). Unsupervised learning of sparse
features for scalable audio classification. In ISMIR’11 . 276
Herault, J. and Ans, B. (1984). Circuits neuronaux à synapses modifiables: Décodage de messages
composites par apprentissage non supervisé. Comptes Rendus de l’Académie des Sciences,
299(III-13), 525–528. 190
Hermann, K. M. and Blunsom, P. (2014). Multilingual Distributed Representations without
Word Alignment. In Proceedings of ICLR. 10
Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen,
P., Sainath, T., and Kingsbury, B. (2012a). Deep neural networks for acoustic modeling in
speech recognition. IEEE Signal Processing Magazine, 29(6), 82–97. 10, 15
Hinton, G. E. (2000). Training products of experts by minimizing contrastive divergence. Tech-
nical Report GCNU TR 2000-004, Gatsby Unit, University College London. 309
Hinton, G. E. and Roweis, S. (2003). Stochastic neighbor embedding. In NIPS’2002 . 265
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the Dimensionality of Data with Neural
Networks. Science, 313, 504–507. 142
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural
networks. Science, 313(5786), 504–507. 185, 192, 193
Hinton, G. E. and Zemel, R. S. (1994). Autoencoders, minimum description length, and
Helmholtz free energy. In NIPS’1993 . 180
Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets.
Neural Computation, 18, 1527–1554. 15, 142, 192, 193, 332
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012b).
Improving neural networks by preventing co-adaptation of feature detectors. Technical report,
arXiv:1207.0580. 133
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis,
T.U. München. 155, 235, 243
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation,
9(8), 1735–1780. 240, 241
Hochreiter, S., Informatik, F. F., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2000). Gradient
flow in recurrent nets: the difficulty of learning long-term dependencies. In J. Kolen and
S. Kremer, editors, Field Guide to Dynamical Recurrent Networks. IEEE Press. 241
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are uni-
versal approximators. Neural Networks, 2, 359–366. 297
Hsu, F.-H. (2002). Behind Deep Blue: Building the Computer That Defeated the World Chess
Champion. Princeton University Press, Princeton, NJ, USA. 2
Huang, F. and Ogata, Y. (2002). Generalized pseudo-likelihood estimates for Markov random
fields on lattice. Annals of the Institute of Statistical Mathematics, 54(1), 1–18. 315
Hyötyniemi, H. (1996). Turing machines are recurrent neural networks. In STeP’96 , pages
13–24. 219
Hyvärinen, A. (1999). Survey on independent component analysis. Neural Computing Surveys,
2, 94–128. 190
Hyvärinen, A. (2005). Estimation of non-normalized statistical models using score matching.
Journal of Machine Learning Research, 6, 695–709. 316
Hyvärinen, A. (2007a). Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks, 18,
1529–1531. 317
Hyvärinen, A. (2007b). Some extensions of score matching. Computational Statistics and Data
Analysis, 51, 2499–2512. 317
Hyvärinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis. Wiley-
Interscience. 190
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixture of local
experts. Neural Computation, 3, 79–87. 109
Jaeger, H. (2003). Adaptive nonlinear system identification with echo state networks. In Advances
in Neural Information Processing Systems 15 . 236
Jaeger, H. (2007a). Discovering multiscale dynamical features with hierarchical echo state net-
works. Technical report, Jacobs University. 242
Jaeger, H. (2007b). Echo state network. Scholarpedia, 2(9), 2330. 235
Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving
energy in wireless communication. Science, 304(5667), 78–80. 235
Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., Mooij, J. M., and Schölkopf, B. (2012). On
causal and anticausal learning. In ICML’2012 , pages 1255–1262. 289
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009a). What is the best multi-stage
architecture for object recognition? In ICCV’09 . 106, 276
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009b). What is the best multi-stage
architecture for object recognition? In Proc. International Conference on Computer Vision
(ICCV’09), pages 2146–2153. IEEE. 124
Jarzynski, C. (1997). Nonequilibrium equality for free energy differences. Phys. Rev. Lett., 78,
2690–2693. 306
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. 35
Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters from
sparse data. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in Practice.
North-Holland, Amsterdam. 246
Jordan, M. I. (1998). Learning in Graphical Models. Kluwer, Dordrecht, Netherlands. 14
Juang, B. H. and Katagiri, S. (1992). Discriminative learning for minimum error classification.
IEEE Transactions on Signal Processing, 40(12), 3043–3054. 254
Jutten, C. and Herault, J. (1991). Blind separation of sources, part I: an adaptive algorithm
based on neuromimetic architecture. Signal Processing, 24, 1–10. 190
Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model compo-
nent of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing,
ASSP-35(3), 400–401. 246
Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008a). Fast inference in sparse coding algo-
rithms with applications to object recognition. CBLL-TR-2008-12-01, NYU. 183
Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008b). Fast inference in sparse coding algo-
rithms with applications to object recognition. Technical report, Computational and Biological
Learning Lab, Courant Institute, NYU. Tech Report CBLL-TR-2008-12-01. 276
Kavukcuoglu, K., Ranzato, M.-A., Fergus, R., and LeCun, Y. (2009). Learning invariant features
through topographic filter maps. In CVPR’2009. 276
Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y. (2010).
Learning convolutional feature hierarchies for visual recognition. In NIPS’2010. 276
Kindermann, R. (1980). Markov Random Fields and Their Applications (Contemporary Mathe-
matics ; V. 1). American Mathematical Society. 164
Kingma, D. and LeCun, Y. (2010). Regularized estimation of image statistics by score matching.
In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances
in Neural Information Processing Systems 23 , pages 1126–1134. 318
Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Proceedings of the
International Conference on Learning Representations (ICLR). 267, 268
Klementiev, A., Titov, I., and Bhattarai, B. (2012). Inducing crosslingual distributed represen-
tations of words. In Proceedings of COLING 2012 . 196
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques.
MIT Press. 172, 173, 252
Koren, Y. (2009). The BellKor solution to the Netflix Grand Prize. 146
Koutnik, J., Greff, K., Gomez, F., and Schmidhuber, J. (2014). A clockwork RNN. In
ICML’2014 . 242, 246
Krause, O., Fischer, A., Glasmachers, T., and Igel, C. (2013). Approximation properties of
DBNs with binary hidden units and real-valued visible units. In ICML’2013 . 297
Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images.
Technical report, University of Toronto. 160
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012a). ImageNet classification with deep convo-
lutional neural networks. In NIPS’2012. 9, 15, 122, 275
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012b). ImageNet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems 25
(NIPS’2012). 71
Lafferty, J., McCallum, A., and Pereira, F. C. N. (2001). Conditional random fields: Probabilistic
models for segmenting and labeling sequence data. In C. E. Brodley and A. P. Danyluk, editors,
ICML 2001 . Morgan Kaufmann. 248, 254
Lake, B., Salakhutdinov, R., and Tenenbaum, J. (2013). One-shot learning by inverting a
compositional causal process. In NIPS’2013 . 18
Lang, K. J. and Hinton, G. E. (1988). The development of the time-delay neural network archi-
tecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-Mellon University.
217, 237
Larochelle, H. and Bengio, Y. (2008). Classification using discriminative restricted Boltzmann
machines. In ICML’2008 . 183
Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator. In
AISTATS’2011 . 230, 234
Larochelle, H., Erhan, D., and Bengio, Y. (2008). Zero-data learning of new tasks. In AAAI
Conference on Artificial Intelligence. 18, 196
Le Roux, N. and Bengio, Y. (2010). Deep belief networks are compact universal approximators.
Neural Computation, 22(8), 2192–2207. 297
Le Roux, N., Manzagol, P.-A., and Bengio, Y. (2008). Topmoumoute online natural gradient
algorithm. In NIPS’07. 113
LeCun, Y. (1987). Modèles connexionistes de l’apprentissage. Ph.D. thesis, Université de Paris
VI. 14, 180
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel,
L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Compu-
tation, 1(4), 541–551. 15
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998a). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11), 2278–2324. 14, 247, 254, 255, 256
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998b). Gradient based learning applied to
document recognition. Proc. IEEE . 15
Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In
NIPS’07 . 183
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). Convolutional deep belief networks for
scalable unsupervised learning of hierarchical representations. In L. Bottou and M. Littman,
editors, ICML 2009 . ACM, Montreal, Canada. 338, 339
Leprieur, H. and Haffner, P. (1995). Discriminant learning with minimum memory loss for
improved non-vocabulary rejection. In EUROSPEECH’95 , Madrid, Spain. 254
Lin, T., Horne, B. G., Tino, P., and Giles, C. L. (1996). Learning long-term dependencies is not
as difficult with NARX recurrent neural networks. IEEE Transactions on Neural Networks,
7(6), 1329–1338. 237
Linde, N. (1992). The machine that changed the world, episode 3. Documentary miniseries. 3
Long, P. M. and Servedio, R. A. (2010). Restricted Boltzmann machines are hard to approxi-
mately evaluate or simulate. In Proceedings of the 27th International Conference on Machine
Learning (ICML’10). 330
Lovelace, A. (1842). Notes upon L. F. Menabrea’s “Sketch of the Analytical Engine invented by
Charles Babbage”. 2
Lowerre, B. (1976). The Harpy Speech Recognition System. Ph.D. thesis. 248, 253, 258
Lukoševičius, M. and Jaeger, H. (2009). Reservoir computing approaches to recurrent neural
network training. Computer Science Review, 3(3), 127–149. 235
Luo, H., Carrier, P.-L., Courville, A., and Bengio, Y. (2013). Texture modeling with convolu-
tional spike-and-slab RBMs and deep extensions. In AISTATS’2013. 72
Lyu, S. (2009). Interpretation and generalization of score matching. In UAI’09. 317
Maass, W., Natschlaeger, T., and Markram, H. (2002). Real-time computing without stable
states: A new framework for neural computation based on perturbations. Neural Computation,
14(11), 2531–2560. 235
Marlin, B., Swersky, K., Chen, B., and de Freitas, N. (2010). Inductive principles for restricted
Boltzmann machine learning. In Proceedings of The Thirteenth International Conference on
Artificial Intelligence and Statistics (AISTATS’10), volume 9, pages 509–516. 313, 317
Martens, J. and Medabalimi, V. (2014). On the expressive efficiency of sum product networks.
arXiv preprint arXiv:1411.7717 . 297
Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-free
optimization. In Proc. ICML’2011 . ACM. 243
Mase, S. (1995). Consistency of the maximum pseudo-likelihood estimator of continuous state
space Gibbsian processes. The Annals of Applied Probability, 5(3), pp. 603–612. 315
Matan, O., Burges, C. J. C., LeCun, Y., and Denker, J. S. (1992). Multi-digit recognition using
a space displacement neural network. In NIPS’91 , pages 488–495, San Mateo CA. Morgan
Kaufmann. 256
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall, London.
107
Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E., Muller,
X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra, J. (2011).
Unsupervised and transfer learning challenge: a deep learning approach. In JMLR W&CP:
Proc. Unsupervised and Transfer Learning, volume 7. 9, 18, 122, 194
Mesnil, G., Rifai, S., Dauphin, Y., Bengio, Y., and Vincent, P. (2012). Surfing on the manifold.
Learning Workshop, Snowbird. 281
Mikolov, T. (2012). Statistical Language Models based on Neural Networks. Ph.D. thesis, Brno
University of Technology. 110, 244
Mikolov, T., Le, Q. V., and Sutskever, I. (2013). Exploiting similarities among languages for
machine translation. Technical report, arXiv:1309.4168. 196
Minka, T. (2005). Divergence measures and message passing. Technical Report MSR-TR-2005-173, Microsoft Research, Cambridge, UK. 303
Minsky, M. L. and Papert, S. A. (1969). Perceptrons. MIT Press, Cambridge. 14
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York. 70
Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-
contrastive estimation. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Wein-
berger, editors, Advances in Neural Information Processing Systems 26, pages 2265–2273.
Curran Associates, Inc. 320
Montúfar, G. (2014). Universal approximation depth and errors of narrow belief networks with
discrete units. Neural Computation, 26. 297
Montúfar, G. and Ay, N. (2011). Refinements of universal approximation results for deep belief
networks and restricted Boltzmann machines. Neural Computation, 23(5), 1306–1319. 297
Montufar, G. and Morton, J. (2014). When does a mixture of products contain a product of
mixtures? SIAM Journal on Discrete Mathematics (SIDMA). 295
Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear regions
of deep neural networks. In NIPS’2014. 294, 297, 298
Mor-Yosef, S., Samueloff, A., Modan, B., Navot, D., and Schenker, J. G. (1990). Ranking the
risk factors for cesarean: logistic regression analysis of a nationwide study. Obstet Gynecol,
75(6), 944–7. 3
Mozer, M. C. (1992). The induction of multiscale temporal structure. In NIPS’91 , pages 275–
282, San Mateo, CA. Morgan Kaufmann. 238, 246
Murphy, K. P. (2012). Machine Learning: a Probabilistic Perspective. MIT Press, Cambridge,
MA, USA. 108
Uria, B., Murray, I., and Larochelle, H. (2014). A deep and tractable density estimator. In
ICML’2014. 110, 234, 235
Nadas, A., Nahamoo, D., and Picheny, M. A. (1988). On a model-robust training method for
speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-
36(9), 1432–1436. 254
Nair, V. and Hinton, G. (2010a). Rectified linear units improve restricted Boltzmann machines.
In ICML’2010. 106, 275
Nair, V. and Hinton, G. E. (2010b). Rectified linear units improve restricted Boltzmann ma-
chines. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh International
Conference on Machine Learning (ICML-10), pages 807–814. ACM. 15
Narayanan, H. and Mitter, S. (2010). Sample complexity of testing the manifold hypothesis. In
NIPS’2010 . 13, 98, 261
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics.
Springer. 149
Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2), 125–139.
305, 306
Neal, R. M. (2005). Estimating ratios of normalizing constants using linked importance sampling.
306, 307
Niranjan, M. and Fallside, F. (1990). Neural networks and radial basis functions in classifying
static speech patterns. Computer Speech and Language, 4, 275–289. 106
Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer. 65, 68
Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field properties by
learning a sparse code for natural images. Nature, 381, 607–609. 182, 183, 300
Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set: a
strategy employed by V1? Vision Research, 37, 3311–3325. 274
Park, H., Amari, S.-I., and Fukumizu, K. (2000). Adaptive natural gradient learning algorithms
for various stochastic models. Neural Networks, 13(7), 755–764. 113
Pascanu, R. (2014). On recurrent and deep networks. Ph.D. thesis, Université de Montréal. 152,
153
Pascanu, R. and Bengio, Y. (2012). On the difficulty of training recurrent neural networks.
Technical Report arXiv:1211.5063, Universite de Montreal. 110
Pascanu, R. and Bengio, Y. (2013). Revisiting natural gradient for deep networks. Technical
report, arXiv:1301.3584. 113
Pascanu, R., Mikolov, T., and Bengio, Y. (2013a). On the difficulty of training recurrent neural
networks. In ICML’2013 . 110, 156, 235, 238, 244, 245, 246
Pascanu, R., Montufar, G., and Bengio, Y. (2013b). On the number of inference regions of
deep feed forward networks with piece-wise linear activations. Technical report, U. Montreal,
arXiv:1312.6098. 122
Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. (2014a). How to construct deep recurrent
neural networks. In ICLR’2014. 148
Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. (2014b). How to construct deep recurrent
neural networks. In ICLR’2014. 240, 242, 297
Pascanu, R., Montufar, G., and Bengio, Y. (2014c). On the number of inference regions of deep
feed forward networks with piece-wise linear activations. In ICLR’2014 . 294
Pearl, J. (1985). Bayesian networks: A model of self-activated memory for evidential reasoning.
In Proceedings of the 7th Conference of the Cognitive Science Society, University of California,
Irvine, pages 329–334. 162
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufmann. 36
Petersen, K. B. and Pedersen, M. S. (2006). The matrix cookbook. Version 20051003. 20
Pinto, N., Cox, D. D., and DiCarlo, J. J. (2008). Why is real-world visual object recognition
hard? PLoS Comput Biol, 4. 339
Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1), 77–
105. 229
Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In
UAI’2011 , Barcelona, Spain. 122, 296, 297
Powell, M. (1987). Radial basis functions for multivariable interpolation: A review. 106
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech
recognition. Proceedings of the IEEE , 77(2), 257–286. 252
Rabiner, L. R. and Juang, B. H. (1986). An introduction to hidden Markov models. IEEE ASSP
Magazine, pages 257–285. 217, 252
Raiko, T., Yao, L., Cho, K., and Bengio, Y. (2014). Iterative neural autoregressive distribution
estimator (NADE-k). Technical report, arXiv preprint arXiv:1406.1485. 234
Ramsey, F. P. (1926). Truth and probability. In R. B. Braithwaite, editor, The Foundations
of Mathematics and other Logical Essays, chapter 7, pages 156–198. McMaster University
Archive for the History of Economic Thought. 37
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse
representations with an energy-based model. In NIPS’2006 . 15, 142, 192, 275
Ranzato, M., Boureau, Y., and LeCun, Y. (2008). Sparse feature learning for deep belief net-
works. In NIPS’2007 . 275
Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. Y. (2013). Zero-shot learning through
cross-modal transfer. In 27th Annual Conference on Neural Information Processing Systems
(NIPS 2013). 18, 196
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011a). Contractive auto-encoders:
Explicit invariance during feature extraction. In ICML’2011 . 284
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X. (2011b).
Higher order contractive auto-encoder. In European Conference on Machine Learning and
Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). 183
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X. (2011c).
Higher order contractive auto-encoder. In ECML PKDD. 284
Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011d). The manifold tangent
classifier. In NIPS’2011. 286, 287
Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling
contractive auto-encoders. In ICML’2012 . 281
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and
organization in the brain. Psychological Review, 65, 386–408. 14
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York. 14
Roweis, S. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embed-
ding. Science, 290(5500). 98, 265
Rumelhart, D., Hinton, G., and Williams, R. (1986a). Learning representations by back-
propagating errors. Nature, 323, 533–536. 14
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986b). Learning representations by
back-propagating errors. Nature, 323, 533–536. 102, 217
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986c). Parallel Dis-
tributed Processing: Explorations in the Microstructure of Cognition, volume 1. MIT Press,
Cambridge. 102
Salakhutdinov, R. and Hinton, G. (2009a). Deep Boltzmann machines. In Proceedings of the
International Conference on Artificial Intelligence and Statistics, volume 5, pages 448–455.
193, 333, 335
Salakhutdinov, R. and Hinton, G. (2009b). Deep Boltzmann machines. In Proceedings of the
Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS 2009),
volume 8. 337
Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks.
In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, volume 25, pages
872–879. ACM. 306
Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of history
compression. Neural Computation, 4(2), 234–242. 15, 242
Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation, 10, 1299–1319. 98, 265
Schölkopf, B., Burges, C. J. C., and Smola, A. J. (1999). Advances in Kernel Methods - Support
Vector Learning. MIT Press, Cambridge, MA. 14, 106, 122
Schuster, M. and Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE Transactions
on Signal Processing, 45(11), 2673–2681. 227
Schölkopf, B. and Smola, A. (2002). Learning with kernels. MIT Press. 95
Seide, F., Li, G., and Yu, D. (2011). Conversational speech transcription using context-dependent
deep neural networks. In Interspeech 2011 , pages 437–440. 15
Sermanet, P., Kavukcuoglu, K., Chintala, S., and LeCun, Y. (2013). Pedestrian detection with
unsupervised multi-stage feature learning. In Proc. International Conference on Computer
Vision and Pattern Recognition (CVPR’13). IEEE. 15, 122
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. (2014). Overfeat:
Integrated recognition, localization and detection using convolutional networks. International
Conference on Learning Representations. 71
Shilov, G. (1977). Linear Algebra. Dover Books on Mathematics Series. Dover Publications. 20
Siegelmann, H. (1995). Computation beyond the Turing limit. Science, 268(5210), 545–548.
219
Siegelmann, H. and Sontag, E. (1991). Turing computability with neural nets. Applied Mathe-
matics Letters, 4(6), 77–80. 219
Siegelmann, H. T. and Sontag, E. D. (1995). On the computational power of neural nets. Journal
of Computer and Systems Sciences, 50(1), 132–150. 156
Simard, P., Victorri, B., LeCun, Y., and Denker, J. (1992). Tangent prop - A formalism for
specifying selected invariances in an adaptive network. In NIPS’1991. 286, 287
Simard, P. Y., LeCun, Y., and Denker, J. (1993). Efficient pattern recognition using a new
transformation distance. In NIPS’92. 285
Simard, P. Y., LeCun, Y. A., Denker, J. S., and Victorri, B. (1998). Transformation invariance in
pattern recognition — tangent distance and tangent propagation. Lecture Notes in Computer
Science, 1524. 285
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony
theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing,
volume 1, chapter 6, pages 194–281. MIT Press, Cambridge. 167, 177
Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011a). Dynamic
pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS’2011 . 230
Socher, R., Manning, C., and Ng, A. Y. (2011b). Parsing natural scenes and natural language
with recursive neural networks. In Proceedings of the Twenty-Eighth International Conference
on Machine Learning (ICML’2011). 230
Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D. (2011c). Semi-
supervised recursive autoencoders for predicting sentiment distributions. In EMNLP’2011.
230
Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C.
(2013). Recursive deep models for semantic compositionality over a sentiment treebank. In
EMNLP’2013 . 230
Solla, S. A., Levin, E., and Fleisher, M. (1988). Accelerated learning in layered neural networks.
Complex Systems, 2, 625–639. 108
Srivastava, N. and Salakhutdinov, R. (2012). Multimodal learning with deep Boltzmann ma-
chines. In NIPS’2012 . 197
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout:
A simple way to prevent neural networks from overfitting. Journal of Machine Learning
Research, 15, 1929–1958. 146, 148, 149, 335
Stewart, L., He, X., and Zemel, R. S. (2007). Learning flexible features for conditional random
fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8), 1415–1426.
248
Sutskever, I. (2012). Training Recurrent Neural Networks. Ph.D. thesis, Departement of com-
puter science, University of Toronto. 236, 243
Sutskever, I. and Tieleman, T. (2010). On the Convergence Properties of Contrastive Divergence.
In Y. W. Teh and M. Titterington, editors, Proc. of the International Conference on Artificial
Intelligence and Statistics (AISTATS), volume 9, pages 789–795. 312
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization
and momentum in deep learning. In ICML. 236, 243
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural
networks. Technical report, arXiv preprint arXiv:1409.3215. 10, 240, 241
Swersky, K., Ranzato, M., Buchman, D., Marlin, B., and de Freitas, N. (2011). On autoencoders
and score matching for energy based models. In ICML’2011. ACM. 318
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V.,
and Rabinovich, A. (2014). Going deeper with convolutions. Technical report, arXiv preprint
arXiv:1409.4842. 9
Tenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global geometric framework for
nonlinear dimensionality reduction. Science, 290(5500), 2319–2323. 98, 265
Tibshirani, R. J. (1995). Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society B, 58, 267–288. 132
Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the
likelihood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008 ,
pages 1064–1071. ACM. 313
Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal components analysis. Journal
of the Royal Statistical Society B, 61(3), 611–622. 187, 188
Uria, B., Murray, I., and Larochelle, H. (2013). RNADE: The real-valued neural autoregressive
density-estimator. In NIPS’2013 . 233, 234
Utgoff, P. E. and Stracuzzi, D. J. (2002). Many-layered learning. Neural Computation, 14,
2497–2539. 15
van der Maaten, L. and Hinton, G. E. (2008). Visualizing data using t-SNE. J. Machine Learning
Res., 9. 265, 268
Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer-Verlag,
Berlin. 78, 79
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York. 78, 79,
81
Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies
of events to their probabilities. Theory of Probability and Its Applications, 16, 264–280. 78,
79
Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural
Computation, 23(7), 1661–1674. 282, 318
Vincent, P. and Bengio, Y. (2003). Manifold Parzen windows. In NIPS’2002 . MIT Press. 265
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing
robust features with denoising autoencoders. In ICML 2008 . 277
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denois-
ing autoencoders: Learning useful representations in a deep network with a local denoising
criterion. J. Machine Learning Res., 11. 277
Wager, S., Wang, S., and Liang, P. (2013). Dropout training as adaptive regularization. In
Advances in Neural Information Processing Systems 26 , pages 351–359. 149
Waibel, A., Hanazawa, T., Hinton, G. E., Shikano, K., and Lang, K. (1989). Phoneme recogni-
tion using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal
Processing, 37, 328–339. 217
Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of neural
networks using dropconnect. In ICML’2013 . 149
Wang, S. and Manning, C. (2013). Fast dropout training. In ICML’2013. 149
Warde-Farley, D., Goodfellow, I. J., Courville, A., and Bengio, Y. (2014). An empirical analysis
of dropout in piecewise linear networks. In ICLR’2014. 149
Weinberger, K. Q. and Saul, L. K. (2004). Unsupervised learning of image manifolds by semidef-
inite programming. In CVPR’2004, pages 988–995. 98, 265
Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding.
In ICML 2008 . 18
Weston, J., Bengio, S., and Usunier, N. (2010). Large scale image annotation: learning to rank
with joint word-image embeddings. Machine Learning, 81(1), 21–35. 230
White, H. (1990). Connectionist nonparametric regression: Multilayer feedforward networks can
learn arbitrary mappings. Neural Networks, 3(5), 535–549. 121
Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. In 1960 IRE WESCON
Convention Record, volume 4, pages 96–104. IRE, New York. 14
Williams, C. K. I. and Rasmussen, C. E. (1996). Gaussian processes for regression. In NIPS’95,
pages 514–520. MIT Press, Cambridge, MA. 122
Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms. Neural
Computation, 8(7), 1341–1390. 121
Xiong, H. Y., Barash, Y., and Frey, B. J. (2011). Bayesian prediction of tissue-regulated splicing
using RNA sequence and cellular context. Bioinformatics, 27(18), 2554–2562. 149
Xu, L. and Jordan, M. I. (1996). On convergence properties of the EM algorithm for Gaussian
mixtures. Neural Computation, 8, 129–151. 253
Younes, L. (1998). On the convergence of Markovian stochastic algorithms with rapidly decreas-
ing ergodicity rates. In Stochastics and Stochastics Models, pages 177–228. 313
Zaslavsky, T. (1975). Facing Up to Arrangements: Face-Count Formulas for Partitions of Space
by Hyperplanes. Number no. 154 in Memoirs of the American Mathematical Society. American
Mathematical Society. 295
Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In
ECCV’14 . 6, 71
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal
of the Royal Statistical Society, Series B, 67(2), 301–320. 112
Index
L^p norm, 26
Active constraint, 68
AIS, see annealed importance sampling
Almost everywhere, 52
Ancestral sampling, 171
Annealed importance sampling, 278, 309
Approximate inference, 174
Artificial intelligence, 2
Asymptotically unbiased, 84
Bagging, 144
Bayes’ rule, 51
Bayesian network, see directed graphical model
Bayesian probability, 37
Belief network, see directed graphical model
Bernoulli distribution, 44
Boltzmann distribution, 166
Boltzmann machine, 166
Calculus of variations, 302
CD, see contrastive divergence
Centering trick (DBM), 312
Central limit theorem, 45
Chain rule of probability, 40
Chess, 2
Classical regularization, 128
Classification, 71
Cliffs, 151
Clipping the gradient, 244
Clique potential, see factor (graphical model)
CNN, see convolutional neural network
Collider, see explaining away
Computer vision, 323
Conditional computation, see dynamically structured nets, 318
Conditional independence, 40
Conditional probability, 39
Constrained optimization, 65
Context-specific independence, 169
Contrast, 324
Contrastive divergence, 284, 309, 310
Convolution, 199, 313
Convolutional neural network, 199
Coordinate descent, 156, 310
Correlation, 41
Cost function, see objective function
Covariance, 41
Covariance matrix, 41
curse of dimensionality, 100
Cyc, 2
D-separation, 169
Dataset augmentation, 324, 329
DBM, see deep Boltzmann machine
Deep belief network, 296, 306, 307, 314
Deep Blue, 2
Deep Boltzmann machine, 296, 306, 308, 310,
314
Deep learning, 2, 5
Denoising score matching, 293
Density estimation, 71, 92
Derivative, 58
Detector layer, 204
Dirac delta function, 47
Directed graphical model, 162
Directional derivative, 62
domain adaptation, 193
Dot product, 23
Doubly block circulant matrix, 201
Dream sleep, 284, 304
DropConnect, 149
Dropout, 146, 310
Dynamically structured networks, 318
E-step, 299
Early stopping, 114, 134, 136–138, 150
EBM, see energy-based model
Effective number of parameters, 131
Eigendecomposition, 28
Eigenvalue, 29
Eigenvector, 28
ELBO, see evidence lower bound
Element-wise product, see Hadamard product
EM, see expectation maximization
Empirical distribution, 47
Energy function, 166
Energy-based model, 166, 308
Ensemble methods, 144
Equality constraint, 67
Equivariance, 202
Error function, see objective function
Euclidean norm, 26
Euler-Lagrange equation, 302
Evidence lower bound, 296, 298–300, 308
Expectation, 41
Expectation maximization, 298
Expected value, see expectation
Explaining away, 170
Factor (graphical model), 164
Factor graph, 169
Factors of variation, 5
Frequentist probability, 37
Functional derivatives, 302
Gaussian distribution, see Normal distribution, 45
Gaussian mixture, 48
GCN, see Global contrast normalization
Gibbs distribution, 165
Gibbs sampling, 173
Global contrast normalization, 325
Global minimum, 13
Gradient, 62
Gradient clipping, 244
Gradient descent, 62
Graphical model, see structured probabilistic
model
Hadamard product, 22
Harmonium, see Restricted Boltzmann machine, 177
Harmony theory, 167
Helmholtz free energy, see evidence lower bound
Hessian matrix, 63
Identity matrix, 24
Independence, 40
Inequality constraint, 67
Inference, 159, 174, 296, 298–300, 302, 304
Invariance, 207
Jacobian matrix, 52, 62
Joint probability, 38
Karush-Kuhn-Tucker conditions, 68
Kernel (convolution), 200
KKT conditions, see Karush-Kuhn-Tucker conditions
KL divergence, see Kullback-Leibler divergence, 42
Kullback-Leibler divergence, 42
Lagrange function, see Lagrangian
Lagrange multipliers, 67, 303
Lagrangian, 67
Learner, 3
Line search, 62
Linear combination, 25
Linear dependence, 26
Local conditional probability distribution, 162
Local minimum, 13
Logistic regression, 3
Logistic sigmoid, 48
Loss function, see objective function
M-step, 299
Machine learning, 3
Manifold hypothesis, 100, 252
Manifold learning, 99, 252
MAP inference, 300
Marginal probability, 39
Markov chain, 171
Markov network, see undirected model, 164
Markov random field, see undirected model, 164
Matrix, 21
Matrix inverse, 24
Matrix product, 22
Max pooling, 207
Mean field, 309, 310
Measure theory, 51
Measure zero, 52
Method of steepest descent, see gradient descent
Missing inputs, 71
Mixing (Markov chain), 175
Mixture distribution, 48
MNIST, 310
Model averaging, 144
Moore-Penrose pseudoinverse, 139
MP-DBM, see multi-prediction DBM
Multi-modal learning, 197
Multi-prediction DBM, 309, 312
Multinomial distribution, 44
Multinoulli distribution, 44
Naive Bayes, 53
Nat, 42
Negative definite, 63
Negative phase, 283
Netflix Grand Prize, 146
Noise-contrastive estimation, 293
Norm, 26
Normal distribution, 45, 47
Normal equations, 131
Normalized probability distribution, 165
Object detection, 323
Object recognition, 323
Objective function, 12, 58
one-shot learning, 196
Orthogonality, 28
Overfitting, 79
Parameter sharing, 202
Partial derivative, 58
Partition function, 103, 165, 276, 309
PCA, see principal components analysis
PCD, see stochastic maximum likelihood
Persistent contrastive divergence, see stochastic maximum likelihood
Pooling, 199, 313
Positive definite, 63
Positive phase, 283
Precision (of a normal distribution), 45, 47
Predictive sparse decomposition, 183, 265
Preprocessing, 324
Principal components analysis, 31, 90, 296, 326
Probabilistic max pooling, 313
Probability density function, 38
Probability distribution, 37
Probability mass function, 37
Product rule of probability, see chain rule of
probability
PSD, see predictive sparse decomposition
Pseudolikelihood, 289
Random variable, 37
Ratio matching, 292
RBM, see restricted Boltzmann machine
Receptive field, 203
Regression, 71
Regularization, 127
Representation learning, 3
Restricted Boltzmann machine, 177, 192, 296,
305, 306, 310, 312, 313
Ridge regression, 129
Scalar, 20
Score matching, 291
Second derivative, 62
Second derivative test, 63
Self-information, 42
Separable convolution, 216
Separation (probabilistic modeling), 167
Shannon entropy, 42, 303
Sigmoid, see logistic sigmoid
Singular value decomposition, 30, 140
SML, see stochastic maximum likelihood
Softmax, 110
Softplus, 48
Spam detection, 3
Sparse coding, 191, 296
spectral radius, 236
Sphering, see Whitening, 326
Square matrix, 26
Standard deviation, 41
Statistic, 83
Steepest descent, see gradient descent
Stochastic gradient descent, 310
Stochastic maximum likelihood, 288, 309, 310
Stochastic pooling, 149
Structure learning, 173
Structured output, 71
Structured probabilistic model, 158
Sum rule of probability, 39
SVD, see singular value decomposition
Symmetric matrix, 28
Tangent plane, 255
Tensor, 21
Test example, 12
Tiled convolution, 212
Toeplitz matrix, 201
Trace operator, 31
Training criterion, 12
Transcription, 71
Transfer learning, 193
Transpose, 21
Triangle inequality, 26
Unbiased, 84
Underfitting, 78
Undirected model, 164
Uniform distribution, 38
Unit norm, 28
Unnormalized probability distribution, 164
V-structure, see explaining away
Variance, 41
Variational derivatives, see functional derivatives
Variational free energy, see evidence lower bound
Vector, 20
Weight decay, 129
Whitening, 326
ZCA, see zero-phase components analysis
zero-data learning, 196
Zero-phase components analysis, 326
zero-shot learning, 196