Bibliography
Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985). A learning algorithm for
Boltzmann machines. Cognitive Science, 9, 147–169. 478
Alain, G. and Bengio, Y. (2012). What regularized auto-encoders learn from the data gen-
erating distribution. Technical report, arXiv:1211.4246, Université de Montréal.
390
Alain, G. and Bengio, Y. (2013). What regularized auto-encoders learn from the data
generating distribution. In ICLR’2013. Also arXiv:1211.4246. 373, 390, 392
Alain, G., Bengio, Y., Yao, L., Thibodeau-Laufer, É., Yosinski, J., and Vincent, P.
(2015). GSNs: Generative stochastic networks. arXiv:1503.05571. 377
Amari, S. (1997). Neural learning in structured parameter spaces - natural Riemannian
gradient. In Advances in Neural Information Processing Systems, pages 127–133. MIT
Press. 158
Anderson, E. (1935). The Irises of the Gaspe Peninsula. Bulletin of the American Iris
Society, 59, 2–5. 19
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly
learning to align and translate. Technical report, arXiv:1409.0473. 22, 86, 325, 334,
335
Bahl, L. R., Brown, P., de Souza, P. V., and Mercer, R. L. (1987). Speech recognition
with continuous-parameter hidden Markov models. Computer, Speech and Language,
2, 219–234. 62, 288
Baldi, P. and Brunak, S. (1998). Bioinformatics, the Machine Learning Approach. MIT
Press. 290
Baldi, P. and Sadowski, P. J. (2013). Understanding dropout. In Advances in Neural
Information Processing Systems 26 , pages 2814–2822. 201
Baldi, P., Brunak, S., Frasconi, P., Soda, G., and Pollastri, G. (1999). Exploiting the
past and the future in protein secondary structure prediction. Bioinformatics, 15(11),
937–946. 258
Baldi, P., Sadowski, P., and Whiteson, D. (2014). Searching for exotic particles in high-
energy physics with deep learning. Nature communications, 5. 22
Barron, A. E. (1993). Universal approximation bounds for superpositions of a sigmoidal
function. IEEE Trans. on Information Theory, 39, 930–945. 170
Bartholomew, D. J. (1987). Latent variable models and factor analysis. Oxford University
Press. 378
Basilevsky, A. (1994). Statistical Factor Analysis and Related Methods: Theory and
Applications. Wiley. 378
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A.,
Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements.
Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop. 70
Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of
finite state Markov chains. Ann. Math. Stat., 37, 1559–1563. 286
Baxter, J. (1995). Learning internal representations. In Proceedings of the 8th Inter-
national Conference on Computational Learning Theory (COLT’95), pages 311–320,
Santa Cruz, California. ACM Press. 202
Becker, S. and Hinton, G. (1992). A self-organizing neural network that discovers surfaces
in random-dot stereograms. Nature, 355, 161–163. 425
Belkin, M. and Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for em-
bedding and clustering. In NIPS’01, Cambridge, MA. MIT Press. 411
Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and
data representation. Neural Computation, 15(6), 1373–1396. 139, 429
Bengio, S. and Bengio, Y. (2000a). Taking on the curse of dimensionality in joint distri-
butions using neural networks. IEEE Transactions on Neural Networks, special issue
on Data Mining and Knowledge Discovery, 11(3), 550–557. 263
Bengio, Y. (1991). Artificial Neural Networks and their Application to Sequence Recog-
nition. Ph.D. thesis, McGill University, (Computer Science), Montreal, Canada. 269,
290
Bengio, Y. (1993). A connectionist approach to speech recognition. International Journal
on Pattern Recognition and Artificial Intelligence, 7(4), 647–668. 288
Bengio, Y. (1999a). Markovian models for sequential data. Neural Computing Surveys,
2, 129–162. 288
Bengio, Y. (1999b). Markovian models for sequential data. Neural Computing Surveys,
2, 129–162. 290
Bengio, Y. (2002). New distributed probabilistic language models. Technical Report
1215, Dept. IRO, Université de Montréal. 326
Bengio, Y. (2009). Learning deep architectures for AI . Now Publishers. 133, 171
Bengio, Y. (2013). Estimating or propagating gradients through stochastic neurons.
Technical Report arXiv:1305.2982, Universite de Montreal. 360
Bengio, Y. and Bengio, S. (2000b). Modeling high-dimensional discrete data with multi-
layer neural networks. In NIPS’99 , pages 400–406. MIT Press. 263, 265, 266
Bengio, Y. and Delalleau, O. (2009). Justifying and generalizing contrastive divergence.
Neural Computation, 21(6), 1601–1621. 390, 449, 486
Bengio, Y. and Frasconi, P. (1996). Input/Output HMMs for sequence processing. IEEE
Transactions on Neural Networks, 7(5), 1231–1249. 290
Bengio, Y. and Grandvalet, Y. (2004). No unbiased estimator of the variance of k-fold
cross-validation. In NIPS’03, Cambridge, MA. MIT Press, Cambridge. 102
Bengio, Y. and LeCun, Y. (2007a). Scaling learning algorithms towards AI. In L. Bottou,
O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT
Press. 17, 172
Bengio, Y. and LeCun, Y. (2007b). Scaling learning algorithms towards AI. In Large
Scale Kernel Machines. 133
Bengio, Y. and Monperrus, M. (2005). Non-local manifold tangent learning. In NIPS’04 ,
pages 129–136. MIT Press. 137, 431, 432
Bengio, Y. and Senécal, J.-S. (2003). Quick training of probabilistic neural nets by
importance sampling. In Proceedings of AISTATS 2003 . 330
Bengio, Y. and Senécal, J.-S. (2008). Adaptive importance sampling to accelerate training
of a neural probabilistic language model. IEEE Trans. Neural Networks, 19(4), 713–
722. 330
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1991). Phonetically motivated
acoustic parameters for continuous speech recognition using artificial neural networks.
In Proceedings of EuroSpeech’91 . 23, 318
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992a). Global optimization of a
neural network-hidden Markov model hybrid. IEEE Transactions on Neural Networks,
3(2), 252–259. 288, 290
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992b). Neural network - gaussian
mixture hybrid for speech recognition or density estimation. In NIPS 4, pages 175–182.
Morgan Kaufmann. 318
Bengio, Y., Frasconi, P., and Simard, P. (1993). The problem of learning long-term depen-
dencies in recurrent networks. In IEEE International Conference on Neural Networks,
pages 1183–1195, San Francisco. IEEE Press. (invited paper). 213, 276
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with
gradient descent is difficult. IEEE Tr. Neural Nets. 213, 214, 267, 274, 276, 277
Bengio, Y., LeCun, Y., Nohl, C., and Burges, C. (1995). LeRec: A NN/HMM hybrid for
on-line handwriting recognition. Neural Computation, 7(6), 1289–1303. 290
Bengio, Y., Ducharme, R., and Vincent, P. (2001a). A neural probabilistic language
model. In NIPS’00, pages 932–938. MIT Press. 16
Bengio, Y., Ducharme, R., and Vincent, P. (2001b). A neural probabilistic language
model. In NIPS’2000, pages 932–938. 319, 321, 322, 332
Bengio, Y., Ducharme, R., and Vincent, P. (2001c). A neural probabilistic language
model. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, NIPS’2000 , pages
932–938. MIT Press. 433, 434
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003a). A neural probabilistic
language model. JMLR, 3, 1137–1155. 321, 325, 332
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003b). A neural probabilistic
language model. Journal of Machine Learning Research, 3, 1137–1155. 433, 434
Bengio, Y., Delalleau, O., and Le Roux, N. (2006a). The curse of highly variable functions
for local kernel machines. In NIPS’2005 . 133
Bengio, Y., Larochelle, H., and Vincent, P. (2006b). Non-local manifold Parzen windows.
In NIPS’2005 . MIT Press. 137, 431
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise
training of deep networks. In NIPS’2006 . 12, 16, 396, 397
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In
ICML’09 . 158
Bengio, Y., Léonard, N., and Courville, A. (2013a). Estimating or propagating gradients
through stochastic neurons for conditional computation. arXiv:1308.3432. 332, 360
Bengio, Y., Yao, L., Alain, G., and Vincent, P. (2013b). Generalized denoising auto-
encoders as generative models. In NIPS’2013. 392, 508, 512
Bengio, Y., Courville, A., and Vincent, P. (2013c). Representation learning: A review and
new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI),
35(8), 1798–1828. 423, 506
Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014a). Deep generative
stochastic networks trainable by backprop. Technical Report arXiv:1306.1091. 360
Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014b). Deep generative
stochastic networks trainable by backprop. In ICML’2014 . 360, 509, 510, 511, 513,
514
Bennett, C. (1976). Efficient estimation of free energy differences from Monte Carlo data.
Journal of Computational Physics, 22(2), 245–268. 465
Berger, A. L., Della Pietra, V. J., and Della Pietra, S. A. (1996). A maximum entropy
approach to natural language processing. Computational Linguistics, 22, 39–71. 333
Berglund, M. and Raiko, T. (2013). Stochastic gradient estimate variance in contrastive
divergence and persistent contrastive divergence. CoRR, abs/1312.6002. 451
Bergstra, J. (2011). Incorporating Complex Cells into Neural Networks for Pattern Clas-
sification. Ph.D. thesis, Université de Montréal. 373
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian,
J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression
compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy).
Oral Presentation. 70
Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24(3), 179–195.
453
Bishop, C. M. (1994). Mixture density networks. 154
Bishop, C. M. (1995). Regularization and complexity control in feed-forward networks.
In Proceedings International Conference on Artificial Neural Networks ICANN’95 , vol-
ume 1, page 141–148. 196
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. 84, 132, 134
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1989). Learnability
and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4), 929–965. 97
Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2012). Joint learning of words and
meaning representations for open-text semantic parsing. AISTATS’2012 . 261
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal
margin classifiers. In COLT ’92: Proceedings of the fifth annual workshop on Com-
putational learning theory, pages 144–152, New York, NY, USA. ACM. 16, 123, 133,
149
Bottou, L. (1991). Une approche théorique de l'apprentissage connexioniste; applications
à la reconnaissance de la parole. Ph.D. thesis, Université de Paris XI. 290
Bottou, L. (2011). From machine learning to machine reasoning. Technical report,
arXiv.1102.1808. 260, 261
Bottou, L., Fogelman-Soulié, F., Blanchet, P., and Lienard, J. S. (1990). Speaker inde-
pendent isolated digit recognition: multilayer perceptrons vs dynamic time warping.
Neural Networks, 3, 453–465. 290
Bottou, L., Bengio, Y., and LeCun, Y. (1997). Global training of document processing
systems using graph transformer networks. In Proceedings of the Computer Vision and
Pattern Recognition Conference (CVPR’97), pages 490–494, Puerto Rico. IEEE. 282,
289, 291, 300, 301, 302
Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer perceptrons and
singular value decomposition. Biological Cybernetics, 59, 291–294. 369
Bourlard, H. and Morgan, N. (1993). Connectionist Speech Recognition. A Hybrid Ap-
proach, volume 247 of The Kluwer international series in engineering and computer
science. Kluwer Academic Publishers, Boston. 290
Bourlard, H. and Wellekens, C. (1989). Speech pattern discrimination and multi-layered
perceptrons. Computer Speech and Language, 3, 1–19. 318
Bourlard, H. and Wellekens, C. (1990). Links between hidden Markov models and multi-
layer perceptrons. IEEE Transactions on Pattern Analysis and Machine Intelligence,
12, 1167–1178. 290
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University
Press, New York, NY, USA. 80
Brady, M. L., Raghavan, R., and Slawny, J. (1989). Back-propagation fails to separate
where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36, 665–674.
208
Brand, M. (2003). Charting a manifold. In NIPS’2002 , pages 961–968. MIT Press. 139,
429
Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140. 188
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and
Regression Trees. Wadsworth International Group, Belmont, CA. 134
Brown, P. (1987). The Acoustic-Modeling problem in Automatic Speech Recognition.
Ph.D. thesis, Dept. of Computer Science, Carnegie-Mellon University. 288
Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D.,
Mercer, R. L., and Roossin, P. S. (1990). A statistical approach to machine translation.
Computational linguistics, 16(2), 79–85. 19
Brown, P. F., Pietra, V. J. D., DeSouza, P. V., Lai, J. C., and Mercer, R. L. (1992).
Class-based n-gram models of natural language. Computational Linguistics, 18, 467–
479. 323
Buciluǎ, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model compression. In Pro-
ceedings of the 12th ACM SIGKDD international conference on Knowledge discovery
and data mining, pages 535–541. ACM. 305
Carreira-Perpiñán, M. A. and Hinton, G. E. (2005). On contrastive divergence learning.
In R. G. Cowell and Z. Ghahramani, editors, AISTATS’2005, pages 33–40. Society for
Artificial Intelligence and Statistics. 449, 486
Caruana, R. (1993). Multitask connectionist learning. In Proc. 1993 Connectionist Models
Summer School, pages 372–379. 202
Cauchy, A. (1847). Méthode générale pour la résolution de systèmes d'équations simul-
tanées. In Compte rendu des séances de l'académie des sciences, pages 536–538. 72
Cayton, L. (2005). Algorithms for manifold learning. Technical Report CS2008-0923,
UCSD. 139, 426
Chapelle, O., Weston, J., and Schölkopf, B. (2003). Cluster kernels for semi-supervised
learning. In NIPS’02 , pages 585–592, Cambridge, MA. MIT Press. 411
Chapelle, O., Schölkopf, B., and Zien, A., editors (2006). Semi-Supervised Learning. MIT
Press, Cambridge, MA. 411
Chellapilla, K., Puri, S., and Simard, P. (2006). High Performance Convolutional Neu-
ral Networks for Document Processing. In Guy Lorette, editor, Tenth International
Workshop on Frontiers in Handwriting Recognition, La Baule (France). Université de
Rennes 1, Suvisoft. http://www.suvisoft.com. 20, 23, 304
Chen, S. F. and Goodman, J. T. (1999). An empirical study of smoothing techniques for
language modeling. Computer, Speech and Language, 13(4), 359–393. 280, 281, 333
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H.,
and Bengio, Y. (2014). Learning phrase representations using rnn encoder-decoder for
statistical machine translation. In Proceedings of the Empiricial Methods in Natural
Language Processing (EMNLP 2014). 274, 334
Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2014). The
loss surface of multilayer networks. 208, 399
Chorowski, J., Bahdanau, D., Cho, K., and Bengio, Y. (2014). End-to-end continuous
speech recognition using attention-based recurrent nn: First results. arXiv:1412.1602.
319
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated
recurrent neural networks on sequence modeling. NIPS’2014 Deep Learning workshop,
arXiv 1412.3555. 319
Ciresan, D., Meier, U., Masci, J., and Schmidhuber, J. (2012). Multi-column deep neural
network for traffic sign classification. Neural Networks, 32, 333–338. 21, 171
Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2010). Deep big
simple neural nets for handwritten digit recognition. Neural Computation, 22, 1–14.
20, 23
Coates, A. and Ng, A. Y. (2011). The importance of encoding versus training with sparse
coding and vector quantization. In ICML’2011 . 23
Coates, A., Lee, H., and Ng, A. Y. (2011). An analysis of single-layer networks in un-
supervised feature learning. In Proceedings of the Thirteenth International Conference
on Artificial Intelligence and Statistics (AISTATS 2011). 314
Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., and Andrew, N. (2013). Deep
learning with COTS HPC systems. In S. Dasgupta and D. McAllester, editors, Proceedings
of the 30th International Conference on Machine Learning (ICML-13), volume 28 (3),
pages 1337–1345. JMLR Workshop and Conference Proceedings. 20, 23
Collobert, R. (2004). Large Scale Machine Learning. Ph.D. thesis, Université de Paris
VI, LIP6. 149
Collobert, R. and Weston, J. (2008). A unified architecture for natural language process-
ing: Deep neural networks with multitask learning. In ICML’2008 . 331
Comon, P. (1994). Independent component analysis - a new concept? Signal Processing,
36, 287–314. 379, 380
Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning, 20,
273–297. 16, 123, 133
Couprie, C., Farabet, C., Najman, L., and LeCun, Y. (2013). Indoor semantic segmenta-
tion using depth information. In International Conference on Learning Representations
(ICLR2013). 21, 171
Courville, A., Bergstra, J., and Bengio, Y. (2011). Unsupervised models of images by
spike-and-slab RBMs. In ICML’11 . 341, 503
Courville, A., Desjardins, G., Bergstra, J., and Bengio, Y. (2014). The spike-and-slab
RBM and extensions to discrete and sparse data distributions. Pattern Analysis and
Machine Intelligence, IEEE Transactions on, 36(9), 1874–1887. 504
Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd Edition.
Wiley-Interscience. 54
Cox, R. T. (1946). Probability, frequency and reasonable expectation. American Journal
of Physics, 14, 1–10. 47
Cramér, H. (1946). Mathematical methods of statistics. Princeton University Press. 114
Crick, F. H. C. and Mitchison, G. (1983). The function of dream sleep. Nature, 304,
111–114. 447
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathe-
matics of Control, Signals, and Systems, 2, 303–314. 420
Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition
with the mean-covariance restricted Boltzmann machine. In NIPS’2010 . 21
Dahl, G. E., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained
deep neural networks for large vocabulary speech recognition. IEEE Transactions on
Audio, Speech, and Language Processing, 20(1), 33–42. 318
Dahl, G. E., Jaitly, N., and Salakhutdinov, R. (2014). Multi-task neural networks for
QSAR predictions. arXiv:1406.1231. 22
Dauphin, Y. and Bengio, Y. (2013). Stochastic ratio matching of RBMs for sparse high-
dimensional inputs. In NIPS26 . NIPS Foundation. 457
Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Large-scale learning of embeddings with
reconstruction sampling. In ICML’2011 . 330
Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014).
Identifying and attacking the saddle point problem in high-dimensional non-convex
optimization. In NIPS’2014 . 74, 208, 399
Davis, A., Rubinstein, M., Wadhwa, N., Mysore, G., Durand, F., and Freeman, W. T.
(2014). The visual microphone: Passive recovery of sound from video. ACM Transac-
tions on Graphics (Proc. SIGGRAPH), 33(4), 79:1–79:10. 311
de Finetti, B. (1937). La prévision: ses lois logiques, ses sources subjectives. Annales de
l'institut Henri Poincaré, 7, 1–68. 47
Delalleau, O. and Bengio, Y. (2011). Shallow vs. deep sum-product networks. In NIPS.
17, 171, 420, 421
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A
Large-Scale Hierarchical Image Database. In CVPR09 . 19, 128
Deng, J., Berg, A. C., Li, K., and Fei-Fei, L. (2010a). What does classifying more than
10,000 image categories tell us? In Proceedings of the 11th European Conference on
Computer Vision: Part V , ECCV’10, pages 71–84, Berlin, Heidelberg. Springer-Verlag.
19
Deng, J., Ding, N., Jia, Y., Frome, A., Murphy, K., Bengio, S., Li, Y., Neven, H., and
Adam, H. (2014). Large-scale object classification using label relation graphs. In
ECCV’2014 , pages 48–64. 282
Deng, L. and Yu, D. (2014). Deep learning – methods and applications. Foundations and
Trends in Signal Processing. 318
Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton, G. (2010b). Bi-
nary coding of speech spectrograms using a deep auto-encoder. In Interspeech 2010 ,
Makuhari, Chiba, Japan. 21
Desjardins, G. and Bengio, Y. (2008). Empirical evaluation of convolutional RBMs
for vision. Technical Report 1327, Département d'Informatique et de Recherche
Opérationnelle, Université de Montréal. 504
Desjardins, G., Courville, A., and Bengio, Y. (2011). On tracking the partition function.
In NIPS’2011 . 465
Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., and Makhoul, J. (2014). Fast
and robust neural network joint models for statistical machine translation. In Proc.
ACL’2014 . 334
DiCarlo, J. J. (2013). Mechanisms underlying visual object recognition: Humans vs.
neurons vs. machines. NIPS Tutorial. 22, 247
Do, T.-M.-T. and Artières, T. (2010). Neural conditional random fields. In International
Conference on Artificial Intelligence and Statistics, pages 177–184. 282
Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko,
K., and Darrell, T. (2014). Long-term recurrent convolutional networks for visual
recognition and description. arXiv:1411.4389. 86
Donoho, D. L. and Grimes, C. (2003). Hessian eigenmaps: new locally linear embed-
ding techniques for high-dimensional data. Technical Report 2003-08, Dept. Statistics,
Stanford University. 139, 429
Doob, J. (1953). Stochastic processes. Wiley: New York. 47
Doya, K. (1993). Bifurcations of recurrent neural networks in gradient descent learning.
IEEE Transactions on Neural Networks, 1, 75–80. 214, 267
Dugas, C., Bengio, Y., Bélisle, F., and Nadeau, C. (2001). Incorporating second-order
functional knowledge for better option pricing. In NIPS’00 , pages 472–478. MIT Press.
62, 149
Ebrahimi, S., Pal, C., Bouthillier, X., Froumenty, P., Jean, S., Konda, K. R., Vincent,
P., Courville, A., and Bengio, Y. (2013). Combining modality specific deep neural
network models for emotion recognition in video. In Emotion Recognition In The Wild
Challenge and Workshop (Emotiw2013). 171
El Hihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term
dependencies. In NIPS 8 . MIT Press. 275, 279, 280
ElHihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term
dependencies. In NIPS’1995 . 270
Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., and Bengio, S. (2010).
Why does unsupervised pre-training help deep learning? J. Machine Learning Res.
397, 399, 400, 401
Fang, H., Gupta, S., Iandola, F., Srivastava, R., Deng, L., Dollár, P., Gao, J., He, X.,
Mitchell, M., Platt, J. C., Zitnick, C. L., and Zweig, G. (2015). From captions to visual
concepts and back. arXiv:1411.4952. 86
Farabet, C., LeCun, Y., Kavukcuoglu, K., Culurciello, E., Martini, B., Akselrod, P.,
and Talay, S. (2011). Large-scale FPGA-based convolutional networks. In R. Bekker-
man, M. Bilenko, and J. Langford, editors, Scaling up Machine Learning: Parallel and
Distributed Approaches. Cambridge University Press. 386
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013a). Learning hierarchical
features for scene labeling. IEEE Transactions on Pattern Analysis and Machine In-
telligence. 21, 171
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013b). Learning hierarchical
features for scene labeling. IEEE Transactions on Pattern Analysis and Machine In-
telligence, 35(8), 1915–1929. 282
Fei-Fei, L., Fergus, R., and Perona, P. (2006). One-shot learning of object categories.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611.
408
Fischer, A. and Igel, C. (2011). Bounding the bias of contrastive divergence learning.
Neural Computation, 23(3), 664–73. 486
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals
of Eugenics, 7, 179–188. 19, 89
Frasconi, P., Gori, M., and Sperduti, A. (1997). On the efficient classification of data
structures by neural networks. In Proc. Int. Joint Conf. on Artificial Intelligence. 261
Frasconi, P., Gori, M., and Sperduti, A. (1998). A general framework for adaptive pro-
cessing of data structures. IEEE Transactions on Neural Networks, 9(5), 768–786.
261
Frey, B. J. (1998). Graphical models for machine learning and digital communication.
MIT Press. 262
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mech-
anism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36,
193–202. 14, 20, 23, 248
Garson, J. (1900). The metric system of identification of criminals, as used in Great
Britain and Ireland. The Journal of the Anthropological Institute of Great Britain and
Ireland, (2), 177–227. 19
Girosi, F. (1994). Regularization theory, radial basis functions and networks. In
V. Cherkassky, J. Friedman, and H. Wechsler, editors, From Statistics to Neural Net-
works, volume 136 of NATO ASI Series, pages 166–187. Springer Berlin Heidelberg.
170
Glorot, X., Bordes, A., and Bengio, Y. (2011a). Deep sparse rectifier neural networks. In
AISTATS’2011 . 14, 149, 385
Glorot, X., Bordes, A., and Bengio, Y. (2011b). Deep sparse rectifier neural networks.
In JMLR W&CP: Proceedings of the Fourteenth International Conference on Artificial
Intelligence and Statistics (AISTATS 2011). 174, 385
Glorot, X., Bordes, A., and Bengio, Y. (2011c). Domain adaptation for large-scale senti-
ment classification: A deep learning approach. In ICML’2011 . 385, 405
Gong, S., McKenna, S., and Psarrou, A. (2000). Dynamic Vision: From Images to Face
Recognition. Imperial College Press. 430, 432
Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep
networks. In NIPS’2009 , pages 646–654. 373, 385
Goodfellow, I., Koenig, N., Muja, M., Pantofaru, C., Sorokin, A., and Takayama, L.
(2010). Help me help you: Interfaces for personal robots. In Proc. of Human Robot
Interaction (HRI), Osaka, Japan. ACM Press, ACM Press. 85
Goodfellow, I., Courville, A., and Bengio, Y. (2012). Large-scale feature learning with
spike-and-slab sparse coding. In ICML’2012 . 381
Goodfellow, I. J. (2010). Technical report: Multidimensional, downsampled convolution
for autoencoders. Technical report, Université de Montréal. 241
Goodfellow, I. J., Courville, A., and Bengio, Y. (2011). Spike-and-slab sparse coding
for unsupervised feature discovery. In NIPS Workshop on Challenges in Learning
Hierarchical Models. 171, 405
Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a).
Maxout networks. In S. Dasgupta and D. McAllester, editors, ICML’13 , pages 1319–
1327. 174, 200, 246, 314
Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction
deep Boltzmann machines. In NIPS26 . NIPS Foundation. 86, 455, 500, 501
Goodfellow, I. J., Courville, A., and Bengio, Y. (2013c). Scaling up spike-and-slab models
for unsupervised feature learning. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 35(8), 1902–1914. 504
Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., and Shet, V. (2014). Multi-digit
number recognition from Street View imagery using deep convolutional neural net-
works. In International Conference on Learning Representations. 21, 307
Goodman, J. (2001). Classes for fast maximum entropy training. In International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP), Utah. 326
Gori, M. and Tesi, A. (1992). On the problem of local minima in backpropagation. IEEE
Transactions on Pattern Analysis and Machine Intelligence, PAMI-14(1), 76–86. 208
Gosset, W. S. (1908). The probable error of a mean. Biometrika, 6(1), 1–25. Originally
published under the pseudonym “Student”. 19
Gouws, S., Bengio, Y., and Corrado, G. (2014). Bilbowa: Fast bilingual distributed
representations without word alignments. Technical report, arXiv:1410.2455. 409
Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks. Studies
in Computational Intelligence. Springer. 258, 273, 274, 282, 319
Graves, A. (2013). Generating sequences with recurrent neural networks. Technical
report, arXiv:1308.0850. 155, 273, 275
Graves, A. and Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent
neural networks. In ICML’2014 . 273
Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirec-
tional LSTM and other neural network architectures. Neural Networks, 18(5), 602–610.
258
Graves, A. and Schmidhuber, J. (2009). Offline handwriting recognition with multidi-
mensional recurrent neural networks. In D. Koller, D. Schuurmans, Y. Bengio, and
L. Bottou, editors, NIPS’2008 , pages 545–552. 258
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist tempo-
ral classification: Labelling unsegmented sequence data with recurrent neural networks.
In ICML’2006 , pages 369–376, Pittsburgh, USA. 282, 319
Graves, A., Liwicki, M., Bunke, H., Schmidhuber, J., and Fernández, S. (2008). Uncon-
strained on-line handwriting recognition with recurrent neural networks. In J. Platt,
D. Koller, Y. Singer, and S. Roweis, editors, NIPS’2007 , pages 577–584. 258
Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., and Schmidhuber,
J. (2009). A novel connectionist system for unconstrained handwriting recognition.
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5), 855–868.
273
Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep recur-
rent neural networks. In ICASSP’2013 , pages 6645–6649. 258, 273, 274, 319
Graves, A., Wayne, G., and Danihelka, I. (2014). Neural Turing machines.
arXiv:1410.5401. 22
Gülçehre, Ç. and Bengio, Y. (2013). Knowledge matters: Importance of prior infor-
mation for optimization. In International Conference on Learning Representations
(ICLR’2013). 21
Gutmann, M. and Hyvärinen, A. (2010). Noise-contrastive estimation: A new estima-
tion principle for unnormalized statistical models. In Proceedings of The Thirteenth
International Conference on Artificial Intelligence and Statistics (AISTATS’10). 457
Hadsell, R., Sermanet, P., Ben, J., Erkan, A., Han, J., Muller, U., and LeCun, Y.
(2007). Online learning for offroad robots: Spatial label propagation to learn long-
range traversability. In Proceedings of Robotics: Science and Systems, Atlanta, GA,
USA. 312
Haffner, P., Franzini, M., and Waibel, A. (1991). Integrating time alignment and neural
networks for high performance continuous speech recognition. In International Confer-
ence on Acoustics, Speech and Signal Processing (ICASSP), pages 105–108, Toronto.
290
Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings
of the 18th annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley,
California. ACM Press. 171, 420
Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits.
Computational Complexity, 1, 113–129. 171, 420
Hebb, D. O. (1949). The Organization of Behavior. Wiley, New York. 15
Henaff, M., Jarrett, K., Kavukcuoglu, K., and LeCun, Y. (2011). Unsupervised learning
of sparse features for scalable audio classification. In ISMIR’11 . 386
Herault, J. and Ans, B. (1984). Circuits neuronaux à synapses modifiables: Décodage de
messages composites par apprentissage non supervisé. Comptes Rendus de l'Académie
des Sciences, 299(III-13), 525–528. 379
Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V.,
Nguyen, P., Sainath, T., and Kingsbury, B. (2012a). Deep neural networks for acoustic
modeling in speech recognition. IEEE Signal Processing Magazine, 29(6), 82–97. 21,
318
Hinton, G. E. (2000). Training products of experts by minimizing contrastive divergence.
Technical Report GCNU TR 2000-004, Gatsby Unit, University College London. 448
Hinton, G. E. and Roweis, S. (2003). Stochastic neighbor embedding. In NIPS’2002 . 429
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with
neural networks. Science, 313(5786), 504–507. 375, 396, 397
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the Dimensionality of Data with
Neural Networks. Science, 313, 504–507. 399
Hinton, G. E. and Zemel, R. S. (1994). Autoencoders, minimum description length, and
Helmholtz free energy. In NIPS’1993 . 369
Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief
nets. Neural Computation, 18, 1527–1554. 12, 16, 23, 124, 396, 397, 487
Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A.,
Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012b). Deep neural
networks for acoustic modeling in speech recognition: The shared views of four research
groups. IEEE Signal Process. Mag., 29(6), 82–97. 86
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.
(2012c). Improving neural networks by preventing co-adaptation of feature detectors.
Technical report, arXiv:1207.0580. 185
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma
thesis, T.U. München. 213, 267, 276
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computa-
tion, 9(8), 1735–1780. 22, 273, 274
Hochreiter, S., Informatik, F. F., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2000).
Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In
J. Kolen and S. Kremer, editors, Field Guide to Dynamical Recurrent Networks. IEEE
Press. 274
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are
universal approximators. Neural Networks, 2, 359–366. 420
Hsu, F.-H. (2002). Behind Deep Blue: Building the Computer That Defeated the World
Chess Champion. Princeton University Press, Princeton, NJ, USA. 2
Huang, F. and Ogata, Y. (2002). Generalized pseudo-likelihood estimates for Markov
random fields on lattice. Annals of the Institute of Statistical Mathematics, 54(1),
1–18. 454
Hubel, D. and Wiesel, T. (1968). Receptive fields and functional architecture of monkey
striate cortex. Journal of Physiology (London), 195, 215–243. 245
Hubel, D. H. and Wiesel, T. N. (1959). Receptive fields of single neurons in the cat’s
striate cortex. Journal of Physiology, 148, 574–591. 245
Hubel, D. H. and Wiesel, T. N. (1962). Receptive fields, binocular interaction, and
functional architecture in the cat’s visual cortex. Journal of Physiology (London),
160, 106–154. 245
Hyotyniemi, H. (1996). Turing machines are recurrent neural networks. In STeP’96,
pages 13–24. 253
Hyvärinen, A. (1999). Survey on independent component analysis. Neural Computing
Surveys, 2, 94–128. 379
Hyvärinen, A. (2005a). Estimation of non-normalized statistical models using score
matching. J. Machine Learning Res., 6. 390
Hyvärinen, A. (2005b). Estimation of non-normalized statistical models using score
matching. Journal of Machine Learning Research, 6, 695–709. 455
Hyvärinen, A. (2007a). Connections between score matching, contrastive divergence,
and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural
Networks, 18, 1529–1531. 456
Hyvärinen, A. (2007b). Some extensions of score matching. Computational Statistics and
Data Analysis, 51, 2499–2512. 456
Hyvärinen, A. and Pajunen, P. (1999). Nonlinear independent component analysis: Ex-
istence and uniqueness results. Neural Networks, 12(3), 429–439. 380
Hyvärinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis.
Wiley-Interscience. 379
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training
by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 . 85
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixture
of local experts. Neural Computation, 3, 79–87. 154
Jaeger, H. (2003). Adaptive nonlinear system identification with echo state networks. In
Advances in Neural Information Processing Systems 15 . 268
Jaeger, H. (2007a). Discovering multiscale dynamical features with hierarchical echo state
networks. Technical report, Jacobs University. 275
Jaeger, H. (2007b). Echo state network. Scholarpedia, 2(9), 2330. 267
Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and
saving energy in wireless communication. Science, 304(5667), 78–80. 23, 267
Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., Mooij, J. M., and Schölkopf, B. (2012).
On causal and anticausal learning. In ICML’2012 , pages 1255–1262. 412, 414
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009a). What is the best
multi-stage architecture for object recognition? In ICCV’09. 14, 149, 386
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009b). What is the best
multi-stage architecture for object recognition? In Proc. International Conference on
Computer Vision (ICCV’09), pages 2146–2153. IEEE. 20, 23, 173
Jarzynski, C. (1997). Nonequilibrium equality for free energy differences. Phys. Rev.
Lett., 78, 2690–2693. 464
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University
Press. 46
Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2014). On using very large target
vocabulary for neural machine translation. arXiv:1412.2007. 334
Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters
from sparse data. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in
Practice. North-Holland, Amsterdam. 280, 333
Jordan, M. I. (1998). Learning in Graphical Models. Kluwer, Dordrecht, Netherlands. 16
Juang, B. H. and Katagiri, S. (1992). Discriminative learning for minimum error classi-
fication. IEEE Transactions on Signal Processing, 40(12), 3043–3054. 288
Jutten, C. and Herault, J. (1991). Blind separation of sources, part I: an adaptive algo-
rithm based on neuromimetic architecture. Signal Processing, 24, 1–10. 379
Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In
EMNLP’2013 . 334
Kamyshanska, H. and Memisevic, R. (2015). The potential energy of an autoencoder.
IEEE Transactions on Pattern Analysis and Machine Intelligence. 392
Karpathy, A. and Li, F.-F. (2015). Deep visual-semantic alignments for generating image
descriptions. In CVPR’2015 . arXiv:1412.2306. 86
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014).
Large-scale video classification with convolutional neural networks. In CVPR. 19
Karush, W. (1939). Minima of Functions of Several Variables with Inequalities as Side
Constraints. Master’s thesis, Dept. of Mathematics, Univ. of Chicago. 82
Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model
component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal
Processing, ASSP-35(3), 400–401. 280, 333
Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008a). Fast inference in sparse coding
algorithms with applications to object recognition. CBLL-TR-2008-12-01, NYU. 372
Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008b). Fast inference in sparse coding
algorithms with applications to object recognition. Technical report, Computational
and Biological Learning Lab, Courant Institute, NYU. Tech Report CBLL-TR-2008-
12-01. 386
Kavukcuoglu, K., Ranzato, M.-A., Fergus, R., and LeCun, Y. (2009). Learning invariant
features through topographic filter maps. In CVPR’2009. 386
Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y.
(2010). Learning convolutional feature hierarchies for visual recognition. In NIPS’2010 .
386
Kindermann, R. (1980). Markov Random Fields and Their Applications (Contemporary
Mathematics ; V. 1). American Mathematical Society. 345
Kingma, D. and LeCun, Y. (2010a). Regularized estimation of image statistics by score
matching. In NIPS’2010 . 390
Kingma, D. and LeCun, Y. (2010b). Regularized estimation of image statistics by score
matching. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta,
editors, Advances in Neural Information Processing Systems 23, pages 1126–1134. 457
Kingma, D., Rezende, D., Mohamed, S., and Welling, M. (2014). Semi-supervised learning
with deep generative models. In NIPS’2014. 360
Kingma, D. P. (2013). Fast gradient-based inference with continuous latent variable
models in auxiliary form. Technical report, arxiv:1306.0733. 360
Kingma, D. P. and Welling, M. (2014a). Auto-encoding variational bayes. In Proceedings
of the International Conference on Learning Representations (ICLR). 360, 432, 433
Kingma, D. P. and Welling, M. (2014b). Efficient gradient-based inference through trans-
formations between bayes nets and neural nets. Technical report, arxiv:1402.0480. 360
Kiros, R., Salakhutdinov, R., and Zemel, R. (2014a). Multimodal neural language models.
In ICML’2014 . 86
Kiros, R., Salakhutdinov, R., and Zemel, R. (2014b). Unifying visual-semantic embed-
dings with multimodal neural language models. arXiv:1411.2539 [cs.LG]. 86, 273
Klementiev, A., Titov, I., and Bhattarai, B. (2012). Inducing crosslingual distributed
representations of words. In Proceedings of COLING 2012 . 409
Knowles-Barley, S., Jones, T. R., Morgan, J., Lee, D., Kasthuri, N., Lichtman, J. W., and
Pfister, H. (2014). Deep learning for the connectome. GPU Technology Conference. 22
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and
Techniques. MIT Press. 286, 358, 365
Konig, Y., Bourlard, H., and Morgan, N. (1996). REMAP: Recursive estimation and max-
imization of A posteriori probabilities application to transition-based connectionist
speech recognition. In NIPS’95 . MIT Press, Cambridge, MA. 318
Koren, Y. (2009). The BellKor solution to the Netflix Grand Prize. 191
Koutnik, J., Greff, K., Gomez, F., and Schmidhuber, J. (2014). A clockwork RNN. In
ICML’2014 . 275, 280
Kočiský, T., Hermann, K. M., and Blunsom, P. (2014). Learning Bilingual Word Repre-
sentations by Marginalizing Alignments. In Proceedings of ACL. 335
Krause, O., Fischer, A., Glasmachers, T., and Igel, C. (2013). Approximation properties
of DBNs with binary hidden units and real-valued visible units. In ICML’2013. 420
Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny
images. Technical report, University of Toronto. 19, 341
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012a). ImageNet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems
25 (NIPS’2012). 20, 23, 85, 312
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012b). ImageNet classification with deep
convolutional neural networks. In NIPS’2012 . 21, 171, 385
Kuhn, H. W. and Tucker, A. W. (1951). Nonlinear programming. In Proceedings of the
Second Berkeley Symposium on Mathematical Statistics and Probability, pages 481–
492, Berkeley, Calif. University of California Press. 82
Lafferty, J., McCallum, A., and Pereira, F. C. N. (2001). Conditional random fields:
Probabilistic models for segmenting and labeling sequence data. In C. E. Brodley and
A. P. Danyluk, editors, ICML 2001 . Morgan Kaufmann. 282, 289
Lang, K. J. and Hinton, G. E. (1988). The development of the time-delay neural net-
work architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-
Mellon University. 250, 269
Lappalainen, H., Giannakopoulos, X., Honkela, A., and Karhunen, J. (2000). Nonlinear
independent component analysis using ensemble learning: Experiments and discussion.
In Proc. ICA. Citeseer. 380
Larochelle, H. and Bengio, Y. (2008a). Classification using discriminative restricted Boltz-
mann machines. In ICML’2008 . 373, 515
Larochelle, H. and Bengio, Y. (2008b). Classification using discriminative restricted
Boltzmann machines. In ICML’08 , pages 536–543. ACM. 411
Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator.
In AISTATS’2011 . 262, 265
Larochelle, H., Erhan, D., and Bengio, Y. (2008). Zero-data learning of new tasks. In
AAAI Conference on Artificial Intelligence. 409
Lasserre, J. A., Bishop, C. M., and Minka, T. P. (2006). Principled hybrids of generative
and discriminative models. In Proceedings of the Computer Vision and Pattern Recog-
nition Conference (CVPR’06), pages 87–94, Washington, DC, USA. IEEE Computer
Society. 411
Le, Q., Ranzato, M., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J., and Ng,
A. (2012). Building high-level features using large scale unsupervised learning. In
ICML’2012 . 20, 23
Le Roux, N. and Bengio, Y. (2010). Deep belief networks are compact universal approx-
imators. Neural Computation, 22(8), 2192–2207. 420
Le Roux, N., Manzagol, P.-A., and Bengio, Y. (2008). Topmoumoute online natural
gradient algorithm. In NIPS’07 . 158
LeCun, Y. (1987). Modèles connexionistes de l'apprentissage. Ph.D. thesis, Université de
Paris VI. 16, 369
LeCun, Y., Jackel, L. D., Boser, B., Denker, J. S., Graf, H. P., Guyon, I., Henderson, D.,
Howard, R. E., and Hubbard, W. (1989). Handwritten digit recognition: Applications
of neural network chips and automatic learning. IEEE Communications Magazine,
27(11), 41–46. 248
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998a). Gradient-based learning
applied to document recognition. Proceedings of the IEEE , 86(11), 2278–2324. 14, 23
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998b). Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. 16, 19,
282, 289, 291, 319
Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area
V2. In NIPS’07 . 373
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). Convolutional deep belief
networks for scalable unsupervised learning of hierarchical representations. In L. Bottou
and M. Littman, editors, ICML 2009. ACM, Montreal, Canada. 504, 505
Lenat, D. B. and Guha, R. V. (1989). Building large knowledge-based systems; represen-
tation and inference in the Cyc project. Addison-Wesley Longman Publishing Co., Inc.
2
Leprieur, H. and Haffner, P. (1995). Discriminant learning with minimum memory loss
for improved non-vocabulary rejection. In EUROSPEECH’95, Madrid, Spain. 288
Lin, T., Horne, B. G., Tino, P., and Giles, C. L. (1996). Learning long-term dependencies
is not as difficult with NARX recurrent neural networks. IEEE Transactions on Neural
Networks, 7(6), 1329–1338. 270
Linde, N. (1992). The machine that changed the world, episode 3. Documentary minis-
eries. 2
Long, P. M. and Servedio, R. A. (2010). Restricted Boltzmann machines are hard to ap-
proximately evaluate or simulate. In Proceedings of the 27th International Conference
on Machine Learning (ICML’10). 482
Lovelace, A. (1842). Notes upon L. F. Menabrea’s “Sketch of the Analytical Engine
invented by Charles Babbage”. 1
Lowerre, B. (1976). The Harpy Speech Recognition System. Ph.D. thesis. 282, 288, 292
Lukoševičius, M. and Jaeger, H. (2009). Reservoir computing approaches to recurrent
neural network training. Computer Science Review, 3(3), 127–149. 267
Luo, H., Carrier, P.-L., Courville, A., and Bengio, Y. (2013). Texture modeling with
convolutional spike-and-slab RBMs and deep extensions. In AISTATS’2013 . 87
Lyu, S. (2009). Interpretation and generalization of score matching. In UAI’09 . 456
Maass, W., Natschlaeger, T., and Markram, H. (2002). Real-time computing without
stable states: A new framework for neural computation based on perturbations. Neural
Computation, 14(11), 2531–2560. 267
MacKay, D. (2003). Information Theory, Inference and Learning Algorithms. Cambridge
University Press. 54
Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., and Yuille, A. L. (2015). Deep captioning
with multimodal recurrent neural networks. In ICLR’2015 . arXiv:1410.1090. 86
Marlin, B., Swersky, K., Chen, B., and de Freitas, N. (2010). Inductive principles for
restricted Boltzmann machine learning. In Proceedings of The Thirteenth International
Conference on Artificial Intelligence and Statistics (AISTATS’10), volume 9, pages
509–516. 451, 456, 484
Martens, J. and Medabalimi, V. (2014). On the expressive efficiency of sum product
networks. arXiv:1411.7717 . 421
Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-
free optimization. In Proc. ICML’2011 . ACM. 277
Mase, S. (1995). Consistency of the maximum pseudo-likelihood estimator of continuous
state space Gibbsian processes. The Annals of Applied Probability, 5(3), pp. 603–612.
454
Matan, O., Burges, C. J. C., LeCun, Y., and Denker, J. S. (1992). Multi-digit recognition
using a space displacement neural network. In NIPS’91 , pages 488–495, San Mateo
CA. Morgan Kaufmann. 290
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall,
London. 150
McCulloch, W. S. and Pitts, W. (1943). A logical calculus of ideas immanent in nervous
activity. Bulletin of Mathematical Biophysics, 5, 115–133. 13
Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E.,
Muller, X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra,
J. (2011). Unsupervised and transfer learning challenge: a deep learning approach. In
JMLR W&CP: Proc. Unsupervised and Transfer Learning, volume 7. 171, 405
Mesnil, G., Rifai, S., Dauphin, Y., Bengio, Y., and Vincent, P. (2012). Surfing on the
manifold. Learning Workshop, Snowbird. 508
Miikkulainen, R. and Dyer, M. G. (1991). Natural language processing with modular
PDP networks and distributed lexicon. Cognitive Science, 15, 343–399. 321
Mikolov, T. (2012). Statistical Language Models based on Neural Networks. Ph.D. thesis,
Brno University of Technology. 155, 278
Mikolov, T., Deoras, A., Kombrink, S., Burget, L., and Cernocky, J. (2011a). Empiri-
cal evaluation and combination of advanced language modeling techniques. In Proc.
12th annual conference of the international speech communication association (INTER-
SPEECH 2011). 332
Mikolov, T., Deoras, A., Povey, D., Burget, L., and Cernocky, J. (2011b). Strategies for
training large scale neural network language models. In Proc. ASRU’2011. 332
Mikolov, T., Le, Q. V., and Sutskever, I. (2013). Exploiting similarities among languages
for machine translation. Technical report, arXiv:1309.4168. 409
Minka, T. (2005). Divergence measures and message passing. Technical Report
MSR-TR-2005-173, Microsoft Research, Cambridge, UK. 461
Minsky, M. L. and Papert, S. A. (1969). Perceptrons. MIT Press, Cambridge. 13
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York. 84
Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-
contrastive estimation. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and
K. Weinberger, editors, Advances in Neural Information Processing Systems 26 , pages
2265–2273. Curran Associates, Inc. 331, 459
Mnih, A. and Teh, Y. W. (2012). A fast and simple algorithm for training neural proba-
bilistic language models. In ICML’2012 , pages 1751–1758. 331
Mohamed, A., Dahl, G., and Hinton, G. (2012). Acoustic modeling using deep belief
networks. IEEE Trans. on Audio, Speech and Language Processing, 20(1), 14–22. 318
Montúfar, G. (2014). Universal approximation depth and errors of narrow belief networks
with discrete units. Neural Computation, 26. 420
Montúfar, G. and Ay, N. (2011). Refinements of universal approximation results for
deep belief networks and restricted Boltzmann machines. Neural Computation, 23(5),
1306–1319. 420
Montufar, G. and Morton, J. (2014). When does a mixture of products contain a product
of mixtures? SIAM Journal on Discrete Mathematics, 29(1), 321–347. 419
Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear
regions of deep neural networks. In NIPS’2014 . 17, 418, 421, 422
Mor-Yosef, S., Samueloff, A., Modan, B., Navot, D., and Schenker, J. G. (1990). Ranking
the risk factors for cesarean: logistic regression analysis of a nationwide study. Obstet
Gynecol, 75(6), 944–7. 2
Morin, F. and Bengio, Y. (2005). Hierarchical probabilistic neural network language
model. In AISTATS’2005. 326, 329
Mozer, M. C. (1992). The induction of multiscale temporal structure. In NIPS’91 , pages
275–282, San Mateo, CA. Morgan Kaufmann. 270, 271, 280
Murphy, K. P. (2012). Machine Learning: a Probabilistic Perspective. MIT Press, Cam-
bridge, MA, USA. 84, 132, 134
Murray, B. U. I. and Larochelle, H. (2014). A deep and tractable density estimator. In
ICML’2014 . 155, 266, 267
Nadas, A., Nahamoo, D., and Picheny, M. A. (1988). On a model-robust training method
for speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing,
ASSP-36(9), 1432–1436. 288
Nair, V. and Hinton, G. (2010a). Rectified linear units improve restricted Boltzmann
machines. In ICML’2010 . 149, 385
Nair, V. and Hinton, G. E. (2010b). Rectified linear units improve restricted Boltzmann
machines. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh
International Conference on Machine Learning (ICML-10), pages 807–814. ACM. 14
Narayanan, H. and Mitter, S. (2010). Sample complexity of testing the manifold hypoth-
esis. In NIPS’2010 . 139, 426
Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56,
71–113. 506
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics.
Springer. 201
Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2),
125–139. 463, 464
Neal, R. M. (2005). Estimating ratios of normalizing constants using linked importance
sampling. 465
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Read-
ing digits in natural images with unsupervised feature learning. Deep Learning and
Unsupervised Feature Learning Workshop, NIPS. 19
Ney, H. and Kneser, R. (1993). Improved clustering techniques for class-based statistical
language modelling. In European Conference on Speech Communication and Technol-
ogy (Eurospeech), pages 973–976, Berlin. 323
Niesler, T. R., Whittaker, E. W. D., and Woodland, P. C. (1998). Comparison of
part-of-speech and automatically derived category-based language models for speech
recognition. In International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 177–180. 323
Niranjan, M. and Fallside, F. (1990). Neural networks and radial basis functions in
classifying static speech patterns. Computer Speech and Language, 4, 275–289. 149
Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer. 80, 82
Olshausen, B. and Field, D. J. (2005). How close are we to understanding V1? Neural
Computation, 17, 1665–1699. 14
Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field prop-
erties by learning a sparse code for natural images. Nature, 381, 607–609. 372, 425
Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set:
a strategy employed by V1? Vision Research, 37, 3311–3325. 317, 385
Park, H., Amari, S.-I., and Fukumizu, K. (2000). Adaptive natural gradient learning
algorithms for various stochastic models. Neural Networks, 13(7), 755–764. 158
Pascanu, R. (2014). On recurrent and deep networks. Ph.D. thesis, Université de
Montréal. 210, 211
Pascanu, R. and Bengio, Y. (2012). On the difficulty of training recurrent neural networks.
Technical Report arXiv:1211.5063, Universite de Montreal. 155
Pascanu, R. and Bengio, Y. (2013). Revisiting natural gradient for deep networks. Tech-
nical report, arXiv:1301.3584. 158
Pascanu, R., Mikolov, T., and Bengio, Y. (2013a). On the difficulty of training recurrent
neural networks. In ICML’2013 . 155, 214, 267, 271, 278, 279, 280
Pascanu, R., Montufar, G., and Bengio, Y. (2013b). On the number of inference regions
of deep feed forward networks with piece-wise linear activations. Technical report, U.
Montreal, arXiv:1312.6098. 171
Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. (2014a). How to construct deep
recurrent neural networks. In ICLR’2014 . 17, 273, 275, 319, 421, 422
Pascanu, R., Gülçehre, Ç., Cho, K., and Bengio, Y. (2014b). How to construct deep
recurrent neural networks. In ICLR’2014 . 200
Pascanu, R., Montufar, G., and Bengio, Y. (2014c). On the number of inference regions
of deep feed forward networks with piece-wise linear activations. In ICLR’2014 . 419
Pearl, J. (1985). Bayesian networks: A model of self-activated memory for evidential
reasoning. In Proceedings of the 7th Conference of the Cognitive Science Society, Uni-
versity of California, Irvine, pages 329–334. 343
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann. 47
Petersen, K. B. and Pedersen, M. S. (2006). The matrix cookbook. Version 20051003. 27
Pinto, N., Cox, D. D., and DiCarlo, J. J. (2008). Why is real-world visual object recog-
nition hard? PLoS Comput Biol, 4. 315, 505
Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1),
77–105. 260
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods.
USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17. 217
Poole, B., Sohl-Dickstein, J., and Ganguli, S. (2014). Analyzing noise in autoencoders
and deep networks. CoRR, abs/1406.1831. 188
Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In
UAI’2011 , Barcelona, Spain. 171, 421
Poundstone, W. (2005). Fortune’s Formula: The untold story of the scientific betting
system that beat the casinos and Wall Street. Macmillan. 55
Powell, M. (1987). Radial basis functions for multivariable interpolation: A review. 149
Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., and Fried, I. (2005). Invariant visual
representation by single neurons in the human brain. Nature, 435(7045), 1102–1107.
246
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE, 77(2), 257–286. 286, 318
Rabiner, L. R. and Juang, B. H. (1986). An introduction to hidden Markov models. IEEE
ASSP Magazine, pages 257–285. 250, 286
Raiko, T., Yao, L., Cho, K., and Bengio, Y. (2014). Iterative neural autoregressive
distribution estimator (NADE-k). Technical report, arXiv:1406.1485. 266
Raina, R., Madhavan, A., and Ng, A. Y. (2009). Large-scale deep unsupervised learning
using graphics processors. In L. Bottou and M. Littman, editors, ICML 2009 , pages
873–880, New York, NY, USA. ACM. 23
Ramsey, F. P. (1926). Truth and probability. In R. B. Braithwaite, editor, The Founda-
tions of Mathematics and other Logical Essays, chapter 7, pages 156–198. McMaster
University Archive for the History of Economic Thought. 48
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse
representations with an energy-based model. In NIPS’2006. 12, 16, 385, 396, 397
Ranzato, M., Boureau, Y., and LeCun, Y. (2008). Sparse feature learning for deep belief
networks. In NIPS’2007 . 385
Rao, C. (1945). Information and the accuracy attainable in the estimation of statistical
parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–89. 114
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and
approximate inference in deep generative models. In ICML’2014. 360
Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. Y. (2013). Zero-shot learning
through cross-modal transfer. In 27th Annual Conference on Neural Information Pro-
cessing Systems (NIPS 2013). 409
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011a). Contractive auto-
encoders: Explicit invariance during feature extraction. In ICML’2011. 392, 394,
428
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X.
(2011b). Higher order contractive auto-encoder. In European Conference on Machine
Learning and Principles and Practice of Knowledge Discovery in Databases (ECML
PKDD). 373
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X.
(2011c). Higher order contractive auto-encoder. In ECML PKDD. 392
Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011d). The manifold
tangent classifier. In NIPS’2011 . 441
Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for
sampling contractive auto-encoders. In ICML’2012. 507, 508
Roberts, S. and Everson, R. (2001). Independent component analysis: principles and
practice. Cambridge University Press. 380
Robinson, A. J. and Fallside, F. (1991). A recurrent error propagation network speech
recognition system. Computer Speech and Language, 5(3), 259–274. 23, 318
Rockafellar, R. T. (1997). Convex Analysis. Princeton Landmarks in Mathematics. 80
Rosen, J. B. (1960). The gradient projection method for nonlinear programming. Part I:
Linear constraints. Journal of the Society for Industrial and Applied Mathematics,
8(1), 181–217. 80
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage
and organization in the brain. Psychological Review, 65, 386–408. 12, 13, 23
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York. 13, 23
Roweis, S. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear
embedding. Science, 290(5500). 139, 429
Rumelhart, D., Hinton, G., and Williams, R. (1986a). Learning representations by back-
propagating errors. Nature, 323, 533–536. 12, 16, 21, 319
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986b). Learning internal repre-
sentations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors,
Parallel Distributed Processing, volume 1, chapter 8, pages 318–362. MIT Press, Cam-
bridge. 19, 23
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986c). Learning representations
by back-propagating errors. Nature, 323, 533–536. 144, 250
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986d). Parallel
Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press,
Cambridge. 15
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986e). Parallel
Distributed Processing: Explorations in the Microstructure of Cognition, volume 1.
MIT Press, Cambridge. 144
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,
A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2014a). ImageNet Large
Scale Visual Recognition Challenge. 19
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,
A., Khosla, A., Bernstein, M., et al. (2014b). Imagenet large scale visual recognition
challenge. arXiv preprint arXiv:1409.0575 . 24
Rust, N., Schwartz, O., Movshon, J. A., and Simoncelli, E. (2005). Spatiotemporal
elements of macaque V1 receptive fields. Neuron, 46(6), 945–956. 248
Sainath, T., Mohamed, A.-r., Kingsbury, B., and Ramabhadran, B. (2013). Deep
convolutional neural networks for LVCSR. In ICASSP 2013 . 319
Salakhutdinov, R. and Hinton, G. (2009a). Deep Boltzmann machines. In Proceedings of
the International Conference on Artificial Intelligence and Statistics, volume 5, pages
448–455. 20, 23, 397, 489, 493, 498, 500
Salakhutdinov, R. and Hinton, G. (2009b). Deep Boltzmann machines. In Proceedings
of the Twelfth International Conference on Artificial Intelligence and Statistics (AIS-
TATS 2009), volume 8. 496, 502, 513
Salakhutdinov, R. and Hinton, G. E. (2008). Using deep belief nets to learn covariance
kernels for Gaussian processes. In NIPS’07 , pages 1249–1256, Cambridge, MA. MIT
Press. 412
Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief
networks. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008 ,
volume 25, pages 872–879. ACM. 464
Saul, L. K., Jaakkola, T., and Jordan, M. I. (1996). Mean field theory for sigmoid belief
networks. Journal of Artificial Intelligence Research, 4, 61–76. 23
Schaul, T., Zhang, S., and LeCun, Y. (2012). No More Pesky Learning Rates. Technical
report, New York University, arxiv 1206.1106. 222
Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of
history compression. Neural Computation, 4(2), 234–242. 275
Schmidhuber, J. (1996). Sequential neural text compression. IEEE Transactions on
Neural Networks, 7(1), 142–146. 321
Schölkopf, B. and Smola, A. (2002). Learning with Kernels. MIT Press. 133
Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a
kernel eigenvalue problem. Neural Computation, 10, 1299–1319. 139, 429
Schölkopf, B., Burges, C. J. C., and Smola, A. J. (1999). Advances in Kernel Methods:
Support Vector Learning. MIT Press, Cambridge, MA. 16, 149, 172
Schulz, H. and Behnke, S. (2012). Learning two-layer contractive encodings. In
ICANN’2012 , pages 620–628. 394
Schuster, M. and Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE
Transactions on Signal Processing, 45(11), 2673–2681. 258
Schwenk, H. (2007). Continuous space language models. Computer speech and language,
21, 492–518. 321, 325
Schwenk, H. (2010). Continuous space language models for statistical machine translation.
The Prague Bulletin of Mathematical Linguistics, 93, 137–146. 321, 333
Schwenk, H. (2014). Cleaned subset of WMT ’14 dataset. 19
Schwenk, H. and Gauvain, J.-L. (2002). Connectionist language modeling for large vocab-
ulary continuous speech recognition. In International Conference on Acoustics, Speech
and Signal Processing (ICASSP), volume 1, pages 765–768. 321
Schwenk, H. and Gauvain, J.-L. (2005). Building continuous space language models for
transcribing european languages. In Interspeech, pages 737–740. 321
Schwenk, H., Costa-jussà, M. R., and Fonollosa, J. A. R. (2006). Continuous space lan-
guage models for the IWSLT 2006 task. In International Workshop on Spoken Language
Translation, pages 166–173. 321, 333
Seide, F., Li, G., and Yu, D. (2011). Conversational speech transcription using context-
dependent deep neural networks. In Interspeech 2011 , pages 437–440. 21
Sermanet, P., Chintala, S., and LeCun, Y. (2012). Convolutional neural networks applied
to house numbers digit classification. CoRR, abs/1204.3968. 316
Sermanet, P., Kavukcuoglu, K., Chintala, S., and LeCun, Y. (2013). Pedestrian detection
with unsupervised multi-stage feature learning. In Proc. International Conference on
Computer Vision and Pattern Recognition (CVPR’13). IEEE. 21, 171
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical
Journal, 27(3), 379–423. 55
Shannon, C. E. (1949). Communication in the presence of noise. Proceedings of the
Institute of Radio Engineers, 37(1), 10–21. 55
Shilov, G. (1977). Linear Algebra. Dover Books on Mathematics Series. Dover Publica-
tions. 27
Siegelmann, H. (1995). Computation beyond the Turing limit. Science, 268(5210), 545–
548. 253
Siegelmann, H. and Sontag, E. (1991). Turing computability with neural nets. Applied
Mathematics Letters, 4(6), 77–80. 253
Siegelmann, H. T. and Sontag, E. D. (1995). On the computational power of neural nets.
Journal of Computer and Systems Sciences, 50(1), 132–150. 214
Simard, P., Victorri, B., LeCun, Y., and Denker, J. (1992). Tangent prop - A formalism
for specifying selected invariances in an adaptive network. In NIPS’1991 . 440, 441
Simard, P. Y., LeCun, Y., and Denker, J. (1993). Efficient pattern recognition using a
new transformation distance. In NIPS’92 . 439
Simard, P. Y., LeCun, Y. A., Denker, J. S., and Victorri, B. (1998). Transformation
invariance in pattern recognition — tangent distance and tangent propagation. Lecture
Notes in Computer Science, 1524. 439
Sjöberg, J. and Ljung, L. (1995). Overtraining, regularization and searching for a min-
imum, with application to neural networks. International Journal of Control, 62(6),
1391–1407. 196
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of
harmony theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed
Processing, volume 1, chapter 6, pages 194–281. MIT Press, Cambridge. 350, 362
Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011a).
Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In
NIPS’2011 . 261
Socher, R., Manning, C., and Ng, A. Y. (2011b). Parsing natural scenes and natural
language with recursive neural networks. In Proceedings of the Twenty-Eighth Inter-
national Conference on Machine Learning (ICML’2011). 261
Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D. (2011c).
Semi-supervised recursive autoencoders for predicting sentiment distributions. In
EMNLP’2011 . 261
Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C.
(2013). Recursive deep models for semantic compositionality over a sentiment treebank.
In EMNLP’2013 . 261
Solla, S. A., Levin, E., and Fleisher, M. (1988). Accelerated learning in layered neural
networks. Complex Systems, 2, 625–639. 152
Sontag, E. D. and Sussman, H. J. (1989). Backpropagation can give rise to spurious local
minima even for networks without hidden layers. Complex Systems, 3, 91–106. 208
Srivastava, N. and Salakhutdinov, R. (2012). Multimodal learning with deep Boltzmann
machines. In NIPS’2012 . 410
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).
Dropout: A simple way to prevent neural networks from overfitting. Journal of Ma-
chine Learning Research, 15, 1929–1958. 198, 200, 201, 500
Stewart, L., He, X., and Zemel, R. S. (2007). Learning flexible features for conditional
random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence,
30(8), 1415–1426. 282
Sutskever, I. (2012). Training Recurrent Neural Networks. Ph.D. thesis, Department of
computer science, University of Toronto. 268, 269, 277
Sutskever, I. and Tieleman, T. (2010). On the Convergence Properties of Contrastive
Divergence. In Y. W. Teh and M. Titterington, editors, Proc. of the International
Conference on Artificial Intelligence and Statistics (AISTATS), volume 9, pages 789–
795. 449
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of
initialization and momentum in deep learning. In ICML. 217, 268, 269, 277
Sutskever, I., Vinyals, O., and Le, Q. V. (2014a). Sequence to sequence learning with
neural networks. Technical report, arXiv:1409.3215. 22, 86, 273, 274
Sutskever, I., Vinyals, O., and Le, Q. V. (2014b). Sequence to sequence learning with
neural networks. In NIPS’2014 . 334, 335
Swersky, K. (2010). Inductive Principles for Learning Restricted Boltzmann Machines.
Master’s thesis, University of British Columbia. 390
Swersky, K., Ranzato, M., Buchman, D., Marlin, B., and de Freitas, N. (2011). On
autoencoders and score matching for energy based models. In ICML’2011 . ACM. 457
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Van-
houcke, V., and Rabinovich, A. (2014). Going deeper with convolutions. Technical
report, arXiv:1409.4842. 20, 21, 23, 204, 236
Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014). Deepface: Closing the gap to
human-level performance in face verification. In CVPR’2014 . 85
Tang, Y. and Eliasmith, C. (2010). Deep networks for robust visual recognition. In
Proceedings of the 27th International Conference on Machine Learning, June 21-24,
2010, Haifa, Israel. 187
Tenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global geometric framework
for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323. 139, 400, 401,
429
Thrun, S. (1995). Learning to play the game of chess. In NIPS’1994 . 441
Tibshirani, R. J. (1995). Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society B, 58, 267–288. 183
Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to
the likelihood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors,
ICML 2008 , pages 1064–1071. ACM. 451, 487
Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal components analysis.
Journal of the Royal Statistical Society B, 61(3), 611–622. 378, 379
Torabi, A., Pal, C. J., Larochelle, H., and Courville, A. C. (2015). Using descriptive
video services to create a large data source for video annotation research. CoRR,
abs/1503.01070. 128
Uria, B., Murray, I., and Larochelle, H. (2013). RNADE: The real-valued neural autore-
gressive density-estimator. In NIPS’2013 . 264, 266
van der Maaten, L. and Hinton, G. E. (2008a). Visualizing data using t-SNE. J. Machine
Learning Res., 9. 321, 400, 429, 433
van der Maaten, L. and Hinton, G. E. (2008b). Visualizing data using t-SNE. Journal of
Machine Learning Research, 9, 2579–2605. 401
Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer-
Verlag, Berlin. 97
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York.
97
Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative
frequencies of events to their probabilities. Theory of Probability and Its Applications,
16, 264–280. 97
Vincent, P. (2011a). A connection between score matching and denoising autoencoders.
Neural Computation, 23(7). 390, 392, 507
Vincent, P. (2011b). A connection between score matching and denoising autoencoders.
Neural Computation, 23(7), 1661–1674. 457, 509
Vincent, P. and Bengio, Y. (2003). Manifold Parzen windows. In NIPS’2002 . MIT Press.
431
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and
composing robust features with denoising autoencoders. In ICML 2008 . 387
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked
denoising autoencoders: Learning useful representations in a deep network with a local
denoising criterion. J. Machine Learning Res., 11. 387
Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., and Hinton, G. (2014a). Gram-
mar as a foreign language. Technical report, arXiv:1412.7449. 273
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2014b). Show and tell: a neural
image caption generator. arXiv 1411.4555. 273
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and tell: a neural image
caption generator. In CVPR’2015 . arXiv:1411.4555. 86
Von Melchner, L., Pallas, S. L., and Sur, M. (2000). Visual behaviour mediated by retinal
projections directed to the auditory pathway. Nature, 404(6780), 871–876. 14
Wager, S., Wang, S., and Liang, P. (2013). Dropout training as adaptive regularization.
In Advances in Neural Information Processing Systems 26 , pages 351–359. 201
Waibel, A., Hanazawa, T., Hinton, G. E., Shikano, K., and Lang, K. (1989). Phoneme
recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech,
and Signal Processing, 37, 328–339. 250, 312, 318
Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of
neural networks using dropconnect. In ICML’2013. 202
Wang, S. and Manning, C. (2013). Fast dropout training. In ICML’2013. 201
Warde-Farley, D., Goodfellow, I. J., Courville, A., and Bengio, Y. (2014). An empirical
analysis of dropout in piecewise linear networks. In ICLR’2014 . 201
Weinberger, K. Q. and Saul, L. K. (2004). Unsupervised learning of image manifolds by
semidefinite programming. In CVPR’2004 , pages 988–995. 139, 429
Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised em-
bedding. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008 , pages
1168–1175, New York, NY, USA. ACM. 411
Weston, J., Bengio, S., and Usunier, N. (2010). Large scale image annotation: learning
to rank with joint word-image embeddings. Machine Learning, 81(1), 21–35. 261
White, H. (1990). Connectionist nonparametric regression: Multilayer feedforward net-
works can learn arbitrary mappings. Neural Networks, 3(5), 535–549. 170
Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. In 1960 IRE WESCON
Convention Record, volume 4, pages 96–104. IRE, New York. 13, 19, 20, 23
Wikipedia (2015). List of animals by number of neurons. Wikipedia, the free encyclopedia.
[Online; accessed 4-March-2015]. 20, 23
Williams, C. K. I. and Rasmussen, C. E. (1996). Gaussian processes for regression. In
NIPS’95 , pages 514–520. MIT Press, Cambridge, MA. 172
Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms.
Neural Computation, 8(7), 1341–1390. 99, 170
Wu, R., Yan, S., Shan, Y., Dang, Q., and Sun, G. (2015). Deep image: Scaling up image
recognition. arXiv:1501.02876. 21
Xiong, H. Y., Barash, Y., and Frey, B. J. (2011). Bayesian prediction of tissue-regulated
splicing using RNA sequence and cellular context. Bioinformatics, 27(18), 2554–2562.
201
Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. S.,
and Bengio, Y. (2015a). Show, attend and tell: Neural image caption generation with
visual attention. 86
Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. S.,
and Bengio, Y. (2015b). Show, attend and tell: Neural image caption generation with
visual attention. arXiv:1502.03044. 273
Xu, L. and Jordan, M. I. (1996). On convergence properties of the EM algorithm for
Gaussian mixtures. Neural Computation, 8, 129–151. 287
Younes, L. (1998). On the convergence of Markovian stochastic algorithms with rapidly
decreasing ergodicity rates. In Stochastics and Stochastics Models, pages 177–228. 449,
487
Zaslavsky, T. (1975). Facing Up to Arrangements: Face-Count Formulas for Partitions
of Space by Hyperplanes. Number no. 154 in Memoirs of the American Mathematical
Society. American Mathematical Society. 419
Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional net-
works. In ECCV’14 . 6
Zhou, J. and Troyanskaya, O. G. (2014). Deep supervised and convolutional generative
stochastic network for protein secondary structure prediction. In ICML’2014 . 514, 515
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society, Series B, 67(2), 301–320. 157
Zöhrer, M. and Pernkopf, F. (2014). General stochastic networks for classification. In
NIPS’2014 . 514
Index
L^p norm, 47
Active constraint, 116
ADALINE, see Adaptive Linear Element
Adaptive Linear Element, 18, 28, 32
Affine, 135
AIS, see annealed importance sampling
Almost everywhere, 92
Ancestral sampling, 505
ANN, see Artificial neural network
Annealed importance sampling, 638, 685
Approximate inference, 496
Artificial intelligence, 1
Artificial neural network, see Neural net-
work
Asymptotically unbiased, 150
Autoencoder, 6
Bagging, 269
Bayes’ rule, 90, 91
Bayesian hyperparameter optimization, 426
Bayesian network, see directed graphical model
Bayesian probability, 68
Bayesian statistics, 166
Beam Search, 416
Belief network, see directed graphical model
Bernoulli distribution, 82
Bias, 150
Boltzmann distribution, 482
Boltzmann machine, 483
Boltzmann Machines, 660
Broadcasting, 39
Calculus of variations, 654
Categorical distribution, see multinoulli distribution, 82
CD, see contrastive divergence
Centering trick (DBM), 690
Central limit theorem, 85
Chain rule of probability, 74
Chess, 2
Chord, 490
Chordal graph, 490
Classical regularization, 253
Classification, 121
Cliffs, 297
Clipping the gradient, 395
Clique potential, see factor (graphical model)
CNN, see convolutional neural network
Collider, see explaining away
Computer vision, 431
Conditional computation, see dynamically
structured nets, 423
Conditional Computation in Neural Nets,
460
Conditional independence, vi, 75
Conditional probability, 73
Connectionism, 20
consistency, 159
Constrained optimization, 114
Context-specific independence, 486
Contrast, 434
Contrastive divergence, 619, 685, 691
Convolution, 316, 696
Convolutional network, 20
Convolutional neural network, 316
Coordinate descent, 311, 690
Correlation, 77
Cost function, see objective function
Covariance, vi, 76
Covariance matrix, 77
Cross-entropy, 215
cross-entropy, 162
Cross-validation, 147
curse of dimensionality, 187
Cyc, 3
D-separation, 485
Data generating distribution, 136
Data generating process, 136
Dataset, 128
Dataset augmentation, 433, 440
DBM, see deep Boltzmann machine
Decoder, 6
Deep belief network, 32, 645, 661, 674, 697
Deep Blue, 2
Deep Boltzmann machine, 28, 32, 645, 661,
677, 691, 697
Deep learning, 2, 7
Denoising score matching, 631
Density estimation, 125
Derivative, vi, 102
Detector layer, 328
Diagonal matrix, 49
Dirac delta function, 86
Directed graphical model, 473
Directional derivative, 105
Distributed representation, 21, 574
domain adaptation, 559
Dot product, 41
Doubly block circulant matrix, 319
Dream sleep, 618, 659
DropConnect, 287
Dropout, 282, 426, 427, 691
Dynamically structured networks, 423
E-step, 649
Early stopping, 226, 272, 274, 276–278
EBM, see energy-based model
Echo state network, 28, 32
Effective number of parameters, 257
Efficiency, 164
Eigendecomposition, 51
Eigenvalue, 52
Eigenvector, 52
ELBO, see evidence lower bound
Element-wise product, see Hadamard product
EM, see expectation maximization
Embedding, 594
Empirical distribution, 86
Empirical risk, 292
Empirical risk minimization, 292
Encoder, 6
Energy function, 482
Energy-based model, 482, 678
Ensemble methods, 269
Epoch, 295, 307
Equality constraint, 115
Equivariance, 324
Error function, see objective function
Euclidean norm, 47
Euler-Lagrange equation, 655
Evidence lower bound, 647, 648, 650, 652,
676
Example, 128
Expectation, 76
Expectation maximization, 649
Expected value, see expectation
Explaining away, 487
Factor (graphical model), 478
Factor graph, 492
Factors of variation, 6
Feature, 128
Fourier transform, 344
Fovea, 348
Frequentist probability, 68
Frequentist statistics, 166
Functional derivatives, 654
Gaussian distribution, see Normal distribution, 83
Gaussian kernel, 179
Gaussian mixture, 88
GCN, see Global contrast normalization
Generalization, 135
Generalized Lagrange function, see Gener-
alized Lagrangian
Generalized Lagrangian, 115
Gibbs distribution, 479
Gibbs sampling, 507
Global contrast normalization, 434
GPU, see Graphics processing unit
Gradient, 105
Gradient clipping, 395
Gradient descent, 107
Graph, iv, v
Graph Transformer, 414
Graphical model, see structured probabilis-
tic model
Graphics processing unit, 419
Greedy layer-wise unsupervised pre-training,
550
Grid search, 426
Hadamard product, v, 41
Harmonium, see Restricted Boltzmann machine, 500
Harmony theory, 484
Helmholtz free energy, see evidence lower
bound
Hessian matrix, vi, 108
Hidden layer, 9
Hyperparameters, 144, 426
i.i.d., 148
i.i.d. assumptions, 136
Identity matrix, 43
Immorality, 489
Independence, vi, 75
Independent and identically distributed, 148
Inequality constraint, 115
Inference, 471, 496, 645, 646, 648–650, 652,
654, 658
Integral, vi
Invariance, 328
Jacobian matrix, vi, 93, 108
Joint probability, 70
Karush-Kuhn-Tucker conditions, 117
Karush–Kuhn–Tucker, 114
Kernel (convolution), 318, 319
Kernel trick, 178
KKT, see Karush–Kuhn–Tucker
KKT conditions, see Karush-Kuhn-Tucker
conditions
KL divergence, see Kullback-Leibler divergence, 81
Knowledge base, 3
Kullback-Leibler divergence, vi, 81
Lagrange multipliers, 114, 117, 656
Lagrangian, see Generalized Lagrangian, 115
Latent variable, 518
LCN, see local contrast normalization
Line search, 107
Linear combination, 45
Linear dependence, 46
Linear regression, 132, 135, 177
Local conditional probability distribution,
474
Local contrast normalization, 437
Logistic regression, 3, 178
Logistic sigmoid, 10, 88
Loop, 490
Loss function, see objective function
LSTM, 29
M-step, 649
Machine learning, 3
Main diagonal, 39
Manifold, 198
Manifold hypothesis, 200, 590
Manifold learning, 198, 590
MAP inference, 652
Marginal probability, 72
Markov chain, 505
Markov network, see undirected model, 476
Markov random field, see undirected model, 476
Matrix, iv, v, 38
Matrix inverse, 43
Matrix product, 40
Max pooling, 328
Maximum likelihood, 160
Mean field, 685, 691
Mean squared error, 133
Measure theory, 91
Measure zero, 92
Method of steepest descent, see gradient de-
scent
Missing inputs, 122
Mixing (Markov chain), 508
Mixture distribution, 87
MLP, see multilayer perceptron
MNIST, 691
Model averaging, 269
Model capacity, 426
Model compression, 422
Moore-Penrose pseudoinverse, 57, 265
Moralized graph, 490
MP-DBM, see multi-prediction DBM
MRF (Markov Random Field), see undirected model, 476
MSE, see mean squared error, 133
Multi-modal learning, 568
Multi-prediction DBM, 689, 690
Multi-task learning, 287
Multilayer perceptron, 7, 32
Multinomial distribution, 82
Multinoulli distribution, 82
Naive Bayes, 4, 93
Nat, 79
natural image, 468
Negative definite, 109
Negative phase, 615, 617
Neocognitron, 20, 28, 32
Nesterov momentum, 308
Netflix Grand Prize, 272
Neural network, 16
Neuroscience, 18
Noise-contrastive estimation, 632
Non-parametric, 141
Norm, vi, 47
Normal distribution, 83, 85
Normal equations, 257
Object detection, 432
Object recognition, 432
Objective function, 101
Offset, 211
one-shot learning, 565
Orthodox statistics, see frequentist statis-
tics
Orthogonal matrix, 51
Orthogonality, 50
Overfitting, 426
Parallel distributed processing, 20
Parameter sharing, 321
Parametric, 141
Partial derivative, 105
Partition function, 480, 613, 685
PCA, see principal components analysis
PCD, see stochastic maximum likelihood
Perceptron, 17, 32
Perplexity, 164
Persistent contrastive divergence, see stochas-
tic maximum likelihood
Point Estimator, 148
Pooling, 316, 696
Positive definite, 109
Positive phase, 615, 617
Pre-training, 550
Precision (of a normal distribution), 83, 86
Predictive sparse decomposition, 516, 532
Preprocessing, 432
Primary visual cortex, 346
Principal components analysis, 59, 183–186, 201, 438, 645
Prior, 166
Prior probability distribution, 166
Probabilistic max pooling, 696
Probability density function, 71
Probability distribution, 70
Probability function estimation, 125
Probability mass function, 70
Product rule of probability, see chain rule
of probability
PSD, see predictive sparse decomposition
Pseudolikelihood, 625
Random search, 426
Random variable, 69
Ratio matching, 630
RBM, see restricted Boltzmann machine
Receptive field, 322
Recurrent network, 32
Regression, 123
Regularization, 250, 426
Representation learning, 4
Restricted Boltzmann machine, 500, 645,
661, 664, 691, 693, 694, 696
Ridge regression, 254
Risk, 292
Sample mean, 151
Scalar, iv, v, 37
Score matching, 628
Second derivative, 108
Second derivative test, 108
Self-information, 79
Semi-supervised learning, 181
Separable convolution, 344
Separation (probabilistic modeling), 484
Set, iv, v
SGD, see stochastic gradient descent
Shannon entropy, vi, 79, 655
Sigmoid, vi, see logistic sigmoid
Sigmoid belief network, 32
Simple cell, 347
Singular value, see singular value decompo-
sition
Singular value decomposition, 55, 184, 185
Singular vector, see singular value decom-
position
SML, see stochastic maximum likelihood
Softmax, 217
Softplus, vi, 88
Spam detection, 4
Sparse coding, 527, 646
Spearmint, 426
spectral radius, 381
Sphering, see Whitening, 436
Spike and slab restricted Boltzmann ma-
chine, 694
Square matrix, 46
ssRBM, see spike and slab restricted Boltz-
mann machine
Standard deviation, 76
Statistic, 148
Statistical learning theory, 136
Steepest descent, see gradient descent
Stochastic gradient descent, 18, 295, 307,
691
Stochastic maximum likelihood, 622, 686,
691
Stochastic pooling, 287
Structure learning, 495
Structured output, 123, 124
Structured probabilistic model, 466
Sum rule of probability, 73
Supervised learning, 129
Support vector machine, 178
Surrogate loss function, 293
SVD, see singular value decomposition
Symmetric matrix, 50, 54
Tangent plane, 594
Tensor, iv, v, 39
Test set, 136
Tiled convolution, 338
Toeplitz matrix, 319
Trace operator, 58
Training error, 136
Transcription, 123
Transfer learning, 559
Transpose, v, 39
Triangle inequality, 47
Triangulated graph, see chordal graph
Unbiased, 150
Undirected model, 476
Uniform distribution, 71
Unit norm, 50
Unit vector, 50
Unnormalized probability distribution, 478
Unsupervised learning, 128, 181
Unsupervised pre-training, 550
V-structure, see explaining away
V1, 346
Variance, vi, 76
Variational derivatives, see functional deriva-
tives
Variational free energy, see evidence lower
bound
Vector, iv, v, 37
Visible layer, 9
Viterbi decoding, 406
Weight decay, 254, 427
Weights, 17, 132
Whitening, 436, 438
ZCA, see zero-phase components analysis
zero-data learning, 565
Zero-phase components analysis, 438
zero-shot learning, 565