Bibliography
Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985). A learning algorithm for
Boltzmann machines. Cognitive Science, 9, 147–169. 553
Alain, G. and Bengio, Y. (2012). What regularized auto-encoders learn from the data
generating distribution. Technical Report arXiv:1211.4246, Université de Montréal.
466
Alain, G. and Bengio, Y. (2013). What regularized auto-encoders learn from the data
generating distribution. In ICLR’2013 . also arXiv report 1211.4246. 448, 466, 468
Alain, G., Bengio, Y., Yao, L., Éric Thibodeau-Laufer, Yosinski, J., and Vincent, P.
(2015). GSNs: Generative stochastic networks. arXiv:1503.05571. 451
Anderson, E. (1935). The Irises of the Gaspé Peninsula. Bulletin of the American Iris
Society, 59, 2–5. 19
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly
learning to align and translate. Technical report, arXiv:1409.0473. 23, 95, 325, 398,
407, 408
Bahl, L. R., Brown, P., de Souza, P. V., and Mercer, R. L. (1987). Speech recognition
with continuous-parameter hidden Markov models. Computer, Speech and Language,
2, 219–234. 64, 356
Baldi, P. and Brunak, S. (1998). Bioinformatics, the Machine Learning Approach. MIT
Press. 358
Baldi, P. and Sadowski, P. J. (2013). Understanding dropout. In Advances in Neural
Information Processing Systems 26 , pages 2814–2822. 232
Baldi, P., Brunak, S., Frasconi, P., Soda, G., and Pollastri, G. (1999). Exploiting the
past and the future in protein secondary structure prediction. Bioinformatics, 15(11),
937–946. 323
Baldi, P., Sadowski, P., and Whiteson, D. (2014). Searching for exotic particles in high-
energy physics with deep learning. Nature communications, 5. 23
Barron, A. E. (1993). Universal approximation bounds for superpositions of a sigmoidal
function. IEEE Trans. on Information Theory, 39, 930–945. 189
Bartholomew, D. J. (1987). Latent variable models and factor analysis. Oxford University
Press. 453
Basilevsky, A. (1994). Statistical Factor Analysis and Related Methods: Theory and
Applications. Wiley. 453
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A.,
Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements.
Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop. 78, 186, 379
Basu, S. and Christensen, J. (2013). Teaching classification boundaries to humans. In
AAAI’2013 . 273
Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of
finite state Markov chains. Ann. Math. Stat., 37, 1559–1563. 354
Baxter, J. (1995). Learning internal representations. In Proceedings of the 8th Inter-
national Conference on Computational Learning Theory (COLT’95), pages 311–320,
Santa Cruz, California. ACM Press. 234
Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2015). Automatic
differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767 . 184
Bayer, J. and Osendorfer, C. (2014). Learning stochastic recurrent networks. arXiv
preprint arXiv:1411.7610 . 231
Becker, S. and Hinton, G. (1992). A self-organizing neural network that discovers surfaces
in random-dot stereograms. Nature, 355, 161–163. 500
Beiu, V., Quintana, J. M., and Avedillo, M. J. (2003). VLSI implementations of threshold
logic - a comprehensive survey. Neural Networks, IEEE Transactions on, 14(5), 1217–
1243. 383
Belkin, M. and Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for em-
bedding and clustering. In NIPS’01, Cambridge, MA. MIT Press. 486
Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and
data representation. Neural Computation, 15(6), 1373–1396. 152, 504
Bengio, S. and Bengio, Y. (2000a). Taking on the curse of dimensionality in joint distri-
butions using neural networks. IEEE Transactions on Neural Networks, special issue
on Data Mining and Knowledge Discovery, 11(3), 550–557. 330
Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015). Scheduled sampling for se-
quence prediction with recurrent neural networks. Technical report, arXiv:1506.03099.
314
Bengio, Y. (1991). Artificial Neural Networks and their Application to Sequence Recog-
nition. Ph.D. thesis, McGill University, (Computer Science), Montreal, Canada. 336,
358
Bengio, Y. (1993). A connectionist approach to speech recognition. International Journal
on Pattern Recognition and Artificial Intelligence, 7(4), 647–668. 356
Bengio, Y. (1999a). Markovian models for sequential data. Neural Computing Surveys,
2, 129–162. 356
Bengio, Y. (1999b). Markovian models for sequential data. Neural Computing Surveys,
2, 129–162. 359
Bengio, Y. (2000). Gradient-based optimization of hyperparameters. Neural Computa-
tion, 12(8), 1889–1900. 372
Bengio, Y. (2002). New distributed probabilistic language models. Technical Report
1215, Dept. IRO, Université de Montréal. 400
Bengio, Y. (2009). Learning deep architectures for AI . Now Publishers. 147, 190
Bengio, Y. (2013a). Deep learning of representations: looking forward. In Statistical
Language and Speech Processing, volume 7978 of Lecture Notes in Computer Science,
pages 1–37. Springer, also in arXiv at http://arxiv.org/abs/1305.0445. 381
Bengio, Y. (2013b). Estimating or propagating gradients through stochastic neurons.
Technical Report arXiv:1305.2982, Universite de Montreal. 435
Bengio, Y. and Bengio, S. (2000b). Modeling high-dimensional discrete data with multi-
layer neural networks. In NIPS’99 , pages 400–406. MIT Press. 329, 330, 332, 333
Bengio, Y. and Delalleau, O. (2009). Justifying and generalizing contrastive divergence.
Neural Computation, 21(6), 1601–1621. 466, 524, 562
Bengio, Y. and Frasconi, P. (1996). Input/Output HMMs for sequence processing. IEEE
Transactions on Neural Networks, 7(5), 1231–1249. 358
Bengio, Y. and Grandvalet, Y. (2004). No unbiased estimator of the variance of k-fold
cross-validation. In NIPS’03, Cambridge, MA. MIT Press, Cambridge. 114
Bengio, Y. and LeCun, Y. (2007a). Scaling learning algorithms towards AI. In L. Bottou,
O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT
Press. 17, 192
Bengio, Y. and LeCun, Y. (2007b). Scaling learning algorithms towards AI. In Large
Scale Kernel Machines. 147
Bengio, Y. and Monperrus, M. (2005). Non-local manifold tangent learning. In NIPS’04 ,
pages 129–136. MIT Press. 150, 506, 507
Bengio, Y. and Senécal, J.-S. (2003). Quick training of probabilistic neural nets by
importance sampling. In Proceedings of AISTATS 2003. 403
Bengio, Y. and Senécal, J.-S. (2008). Adaptive importance sampling to accelerate training
of a neural probabilistic language model. IEEE Trans. Neural Networks, 19(4), 713–
722. 403
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1991). Phonetically motivated
acoustic parameters for continuous speech recognition using artificial neural networks.
In Proceedings of EuroSpeech’91 . 24, 391
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992a). Global optimization of a
neural network-hidden Markov model hybrid. IEEE Transactions on Neural Networks,
3(2), 252–259. 356, 358
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992b). Neural network - Gaussian
mixture hybrid for speech recognition or density estimation. In NIPS 4, pages 175–182.
Morgan Kaufmann. 391
Bengio, Y., Frasconi, P., and Simard, P. (1993). The problem of learning long-term depen-
dencies in recurrent networks. In IEEE International Conference on Neural Networks,
pages 1183–1195, San Francisco. IEEE Press. (invited paper). 247, 343
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with
gradient descent is difficult. IEEE Transactions on Neural Networks. 247, 248, 249, 334,
341, 343, 344
Bengio, Y., LeCun, Y., Nohl, C., and Burges, C. (1995). Lerec: A NN/HMM hybrid for
on-line handwriting recognition. Neural Computation, 7(6), 1289–1303. 358
Bengio, Y., Latendresse, S., and Dugas, C. (1999). Gradient-based learning of hyper-
parameters. Learning Conference, Snowbird. 372
Bengio, Y., Ducharme, R., and Vincent, P. (2001a). A neural probabilistic language
model. In NIPS’00, pages 932–938. MIT Press. 16, 380
Bengio, Y., Ducharme, R., and Vincent, P. (2001b). A neural probabilistic language
model. In NIPS’2000, pages 932–938. 394, 395, 396, 405
Bengio, Y., Ducharme, R., and Vincent, P. (2001c). A neural probabilistic language
model. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, NIPS’2000 , pages
932–938. MIT Press. 508, 509
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003a). A neural probabilistic
language model. JMLR, 3, 1137–1155. 395, 399, 405
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003b). A neural probabilistic
language model. Journal of Machine Learning Research, 3, 1137–1155. 508, 509
Bengio, Y., Le Roux, N., Vincent, P., Delalleau, O., and Marcotte, P. (2006a). Convex
neural networks. In NIPS’2005 , pages 123–130. 227
Bengio, Y., Delalleau, O., and Le Roux, N. (2006b). The curse of highly variable functions
for local kernel machines. In NIPS’2005 . 147
Bengio, Y., Larochelle, H., and Vincent, P. (2006c). Non-local manifold Parzen windows.
In NIPS’2005 . MIT Press. 150, 506
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007a). Greedy layer-wise
training of deep networks. In NIPS’2006 . 12, 16, 472, 473
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007b). Greedy layer-wise
training of deep networks. In NIPS 19 , pages 153–160. MIT Press. 190
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In
ICML’09 . 272, 273
Bengio, Y., Léonard, N., and Courville, A. (2013a). Estimating or propagating gradients
through stochastic neurons for conditional computation. arXiv:1308.3432. 188, 381,
435
Bengio, Y., Yao, L., Alain, G., and Vincent, P. (2013b). Generalized denoising auto-
encoders as generative models. In NIPS’2013. 468, 584, 588
Bengio, Y., Courville, A., and Vincent, P. (2013c). Representation learning: A review and
new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI),
35(8), 1798–1828. 498, 582
Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014a). Deep generative
stochastic networks trainable by backprop. Technical Report arXiv:1306.1091. 435
Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014b). Deep generative
stochastic networks trainable by backprop. In ICML’2014 . 435, 585, 586, 587, 589,
590
Bennett, C. (1976). Efficient estimation of free energy differences from Monte Carlo data.
Journal of Computational Physics, 22(2), 245–268. 540
Berger, A. L., Della Pietra, V. J., and Della Pietra, S. A. (1996). A maximum entropy
approach to natural language processing. Computational Linguistics, 22, 39–71. 406
Berglund, M. and Raiko, T. (2013). Stochastic gradient estimate variance in contrastive
divergence and persistent contrastive divergence. CoRR, abs/1312.6002. 526
Bergstra, J. (2011). Incorporating Complex Cells into Neural Networks for Pattern Clas-
sification. Ph.D. thesis, Université de Montréal. 226, 447
Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. J.
Machine Learning Res., 13, 281–305. 369, 370, 371
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian,
J., Warde-Farley, D., and Bengio, Y. (2010a). Theano: a CPU and GPU math ex-
pression compiler. In Proceedings of the Python for Scientific Computing Conference
(SciPy). Oral Presentation. 78, 379
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian,
J., Warde-Farley, D., and Bengio, Y. (2010b). Theano: a CPU and GPU math expres-
sion compiler. In Proc. SciPy. 186
Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-
parameter optimization. In NIPS'2011. 372
Bertsekas, D. P. (2004). Nonlinear programming. Athena Scientific, 2 edition. 250
Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24(3), 179–195.
528
Bishop, C. M. (1994). Mixture density networks. 171
Bishop, C. M. (1995a). Regularization and complexity control in feed-forward networks.
In Proceedings International Conference on Artificial Neural Networks ICANN’95 , vol-
ume 1, pages 141–148. 212, 221
Bishop, C. M. (1995b). Training with noise is equivalent to Tikhonov regularization.
Neural Computation, 7(1), 108–116. 212
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. 92, 138
Blum, A. L. and Rivest, R. L. (1992). Training a 3-node neural network is NP-complete.
249
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1989). Learnability
and the Vapnik–Chervonenkis dimension. Journal of the ACM, 36(4), 929–965. 106
Bonnet, G. (1964). Transformations des signaux aléatoires à travers les systèmes non
linéaires sans mémoire. Annales des Télécommunications, 19(9–10), 203–220. 187
Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2012). Joint learning of words and
meaning representations for open-text semantic parsing. AISTATS’2012 . 328
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal
margin classifiers. In COLT ’92: Proceedings of the fifth annual workshop on Com-
putational learning theory, pages 144–152, New York, NY, USA. ACM. 16, 134, 148,
164
Bottou, L. (1991). Une approche théorique de l'apprentissage connexioniste; applications
à la reconnaissance de la parole. Ph.D. thesis, Université de Paris XI. 358
Bottou, L. (1998). Online algorithms and stochastic approximations. In D. Saad, editor,
Online Learning in Neural Networks. Cambridge University Press, Cambridge, UK.
252
Bottou, L. (2011). From machine learning to machine reasoning. Technical report,
arXiv.1102.1808. 326, 328
Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In NIPS’2008.
251, 253
Bottou, L., Fogelman-Soulié, F., Blanchet, P., and Lienard, J. S. (1990). Speaker inde-
pendent isolated digit recognition: multilayer perceptrons vs dynamic time warping.
Neural Networks, 3, 453–465. 358
Bottou, L., Bengio, Y., and LeCun, Y. (1997). Global training of document processing
systems using graph transformer networks. In Proceedings of the Computer Vision and
Pattern Recognition Conference (CVPR’97), pages 490–494, Puerto Rico. IEEE. 349,
357, 358, 359, 360, 361
Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal de-
pendencies in high-dimensional sequences: Application to polyphonic music generation
and transcription. In ICML’12 . 408
Boureau, Y., Ponce, J., and LeCun, Y. (2010). A theoretical analysis of feature pooling in
vision algorithms. In Proc. International Conference on Machine learning (ICML’10).
287
Boureau, Y., Le Roux, N., Bach, F., Ponce, J., and LeCun, Y. (2011). Ask the locals:
multi-way local pooling for image recognition. In Proc. International Conference on
Computer Vision (ICCV’11). IEEE. 287
Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer perceptrons and
singular value decomposition. Biological Cybernetics, 59, 291–294. 444
Bourlard, H. and Morgan, N. (1993). Connectionist Speech Recognition. A Hybrid Ap-
proach, volume 247 of The Kluwer international series in engineering and computer
science. Kluwer Academic Publishers, Boston. 358
Bourlard, H. and Wellekens, C. (1989). Speech pattern discrimination and multi-layered
perceptrons. Computer Speech and Language, 3, 1–19. 391
Bourlard, H. and Wellekens, C. (1990). Links between hidden Markov models and multi-
layer perceptrons. IEEE Transactions on Pattern Analysis and Machine Intelligence,
12, 1167–1178. 358
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University
Press, New York, NY, USA. 88
Brady, M. L., Raghavan, R., and Slawny, J. (1989). Back-propagation fails to separate
where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36, 665–674.
242
Brand, M. (2003). Charting a manifold. In NIPS’2002 , pages 961–968. MIT Press. 152,
504
Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140. 226
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and
Regression Trees. Wadsworth International Group, Belmont, CA. 137
Bridle, J. S. (1990). Alphanets: a recurrent ‘neural’ network architecture with a hidden
Markov model interpretation. Speech Communication, 9(1), 83–92. 167
Brown, P. (1987). The Acoustic-Modeling problem in Automatic Speech Recognition.
Ph.D. thesis, Dept. of Computer Science, Carnegie-Mellon University. 356
Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D.,
Mercer, R. L., and Roossin, P. S. (1990). A statistical approach to machine translation.
Computational linguistics, 16(2), 79–85. 19
Brown, P. F., Pietra, V. J. D., DeSouza, P. V., Lai, J. C., and Mercer, R. L. (1992).
Class-based n-gram models of natural language. Computational Linguistics, 18, 467–
479. 395
Bryson, A. and Ho, Y. (1969). Applied optimal control: optimization, estimation, and
control. Blaisdell Pub. Co. 195
Bryson, Jr., A. E. and Denham, W. F. (1961). A steepest-ascent method for solving
optimum programming problems. Technical Report BR-1303, Raytheon Company,
Missle and Space Division. 195
Buchberger, B., Collins, G. E., Loos, R., and Albrecht, R. (1983). Computer Algebra.
Springer-Verlag. 186
Buciluǎ, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model compression. In Pro-
ceedings of the 12th ACM SIGKDD international conference on Knowledge discovery
and data mining, pages 535–541. ACM. 380
Cai, M., Shi, Y., and Liu, J. (2013). Deep maxout neural networks for speech recognition.
In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop
on, pages 291–296. IEEE. 194
Carreira-Perpiñán, M. A. and Hinton, G. E. (2005). On contrastive divergence learning.
In R. G. Cowell and Z. Ghahramani, editors, AISTATS'2005, pages 33–40. Society for
Artificial Intelligence and Statistics. 524, 562
Caruana, R. (1993). Multitask connectionist learning. In Proc. 1993 Connectionist Models
Summer School, pages 372–379. 232
Cauchy, A. (1847a). Méthode générale pour la résolution de systèmes d'équations si-
multanées. In Compte rendu des séances de l'académie des sciences, pages 536–538.
80
Cauchy, L. A. (1847b). Méthode générale pour la résolution des systèmes d'équations
simultanées. Compte Rendu à l'Académie des Sciences. 194
Cayton, L. (2005). Algorithms for manifold learning. Technical Report CS2008-0923,
UCSD. 152, 501
Chapelle, O., Weston, J., and Schölkopf, B. (2003). Cluster kernels for semi-supervised
learning. In NIPS'02, pages 585–592, Cambridge, MA. MIT Press. 486
Chapelle, O., Schölkopf, B., and Zien, A., editors (2006). Semi-Supervised Learning. MIT
Press, Cambridge, MA. 486
Chellapilla, K., Puri, S., and Simard, P. (2006). High Performance Convolutional Neu-
ral Networks for Document Processing. In Guy Lorette, editor, Tenth International
Workshop on Frontiers in Handwriting Recognition, La Baule (France). Université de
Rennes 1, Suvisoft. http://www.suvisoft.com. 21, 24, 378
Chen, S. F. and Goodman, J. T. (1999). An empirical study of smoothing techniques for
language modeling. Computer, Speech and Language, 13(4), 359–393. 348, 406
Chen, T., Du, Z., Sun, N., Wang, J., Wu, C., Chen, Y., and Temam, O. (2014a). DianNao:
A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Pro-
ceedings of the 19th international conference on Architectural support for programming
languages and operating systems, pages 269–284. ACM. 383
Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N.,
et al. (2014b). DaDianNao: A machine-learning supercomputer. In Microarchitecture
(MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pages 609–622.
IEEE. 383
Chilimbi, T., Suzue, Y., Apacible, J., and Kalyanaraman, K. (2014). Project Adam:
Building an efficient and scalable deep learning training system. In 11th USENIX
Symposium on Operating Systems Design and Implementation (OSDI'14). 380
Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio,
Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical
machine translation. In Proceedings of the Empiricial Methods in Natural Language
Processing (EMNLP 2014). 323, 341, 407
Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2014). The
loss surface of multilayer networks. 242, 475
Chorowski, J., Bahdanau, D., Cho, K., and Bengio, Y. (2014). End-to-end continuous
speech recognition using attention-based recurrent nn: First results. arXiv:1412.1602.
393
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated
recurrent neural networks on sequence modeling. NIPS’2014 Deep Learning workshop,
arXiv 1412.3555. 341, 393
Ciresan, D., Meier, U., Masci, J., and Schmidhuber, J. (2012). Multi-column deep neural
network for traffic sign classification. Neural Networks, 32, 333–338. 22, 190
Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2010). Deep big
simple neural nets for handwritten digit recognition. Neural Computation, 22, 1–14.
21, 24, 378
Coates, A. and Ng, A. Y. (2011). The importance of encoding versus training with sparse
coding and vector quantization. In ICML’2011 . 24
Coates, A., Lee, H., and Ng, A. Y. (2011). An analysis of single-layer networks in un-
supervised feature learning. In Proceedings of the Thirteenth International Conference
on Artificial Intelligence and Statistics (AISTATS 2011). 386
Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., and Andrew, N. (2013). Deep
learning with COTS HPC systems. In S. Dasgupta and D. McAllester, editors, Proceedings
of the 30th International Conference on Machine Learning (ICML-13), volume 28 (3),
pages 1337–1345. JMLR Workshop and Conference Proceedings. 21, 24, 299, 380
Collobert, R. (2004). Large Scale Machine Learning. Ph.D. thesis, Université de Paris
VI, LIP6. 164
Collobert, R. (2011). Deep learning for efficient discriminative parsing. In AISTATS’2011.
95
Collobert, R. and Weston, J. (2008). A unified architecture for natural language process-
ing: Deep neural networks with multitask learning. In ICML’2008 . 404
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P.
(2011a). Natural language processing (almost) from scratch. Journal of Machine
Learning Research, 12, 2493–2537. 273
Collobert, R., Kavukcuoglu, K., and Farabet, C. (2011b). Torch7: A MATLAB-like envi-
ronment for machine learning. In BigLearn, NIPS Workshop. 379
Comon, P. (1994). Independent component analysis - a new concept? Signal Processing,
36, 287–314. 454, 455
Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning, 20,
273–297. 16, 134, 148
Couprie, C., Farabet, C., Najman, L., and LeCun, Y. (2013). Indoor semantic segmenta-
tion using depth information. In International Conference on Learning Representations
(ICLR2013). 22, 190
Courbariaux, M., Bengio, Y., and David, J.-P. (2015). Low precision arithmetic for deep
learning. In Arxiv:1412.7024, ICLR’2015 Workshop. 384
Courville, A., Bergstra, J., and Bengio, Y. (2011). Unsupervised models of images by
spike-and-slab RBMs. In ICML’11 . 415, 579
Courville, A., Desjardins, G., Bergstra, J., and Bengio, Y. (2014). The spike-and-slab
RBM and extensions to discrete and sparse data distributions. Pattern Analysis and
Machine Intelligence, IEEE Transactions on, 36(9), 1874–1887. 580
Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd Edition.
Wiley-Interscience. 56
Cox, D. and Pinto, N. (2011). Beyond simple features: A large-scale feature search
approach to unconstrained face recognition. In Automatic Face & Gesture Recognition
and Workshops (FG 2011), 2011 IEEE International Conference on, pages 8–15. IEEE.
298
Cramér, H. (1946). Mathematical methods of statistics. Princeton University Press. 126
Crick, F. H. C. and Mitchison, G. (1983). The function of dream sleep. Nature, 304,
111–114. 522
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathe-
matics of Control, Signals, and Systems, 2, 303–314. 188, 495
Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition
with the mean-covariance restricted Boltzmann machine. In NIPS’2010 . 22
Dahl, G. E., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained
deep neural networks for large vocabulary speech recognition. IEEE Transactions on
Audio, Speech, and Language Processing, 20(1), 33–42. 392
Dahl, G. E., Jaitly, N., and Salakhutdinov, R. (2014). Multi-task neural networks for
QSAR predictions. arXiv:1406.1231. 23
Dauphin, Y. and Bengio, Y. (2013). Stochastic ratio matching of RBMs for sparse high-
dimensional inputs. In NIPS26 . NIPS Foundation. 532
Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Large-scale learning of embeddings with
reconstruction sampling. In ICML’2011 . 403
Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014).
Identifying and attacking the saddle point problem in high-dimensional non-convex
optimization. In NIPS’2014 . 242, 475
Davis, A., Rubinstein, M., Wadhwa, N., Mysore, G., Durand, F., and Freeman, W. T.
(2014). The visual microphone: Passive recovery of sound from video. ACM Transac-
tions on Graphics (Proc. SIGGRAPH), 33(4), 79:1–79:10. 384
Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q., Mao, M., Ranzato, M.,
Senior, A., Tucker, P., Yang, K., and Ng, A. Y. (2012). Large scale distributed deep
networks. In NIPS’2012 . 380
Dean, T. and Kanazawa, K. (1989). A model for reasoning about persistence and causa-
tion. Computational Intelligence, 5(3), 142–150. 347
Delalleau, O. and Bengio, Y. (2011). Shallow vs. deep sum-product networks. In NIPS.
17, 189, 495, 496
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A
Large-Scale Hierarchical Image Database. In CVPR09 . 19, 142
Deng, J., Berg, A. C., Li, K., and Fei-Fei, L. (2010a). What does classifying more than
10,000 image categories tell us? In Proceedings of the 11th European Conference on
Computer Vision: Part V , ECCV’10, pages 71–84, Berlin, Heidelberg. Springer-Verlag.
19
Deng, J., Ding, N., Jia, Y., Frome, A., Murphy, K., Bengio, S., Li, Y., Neven, H., and
Adam, H. (2014). Large-scale object classification using label relation graphs. In
ECCV’2014 , pages 48–64. 349
Deng, L. and Yu, D. (2014). Deep learning – methods and applications. Foundations and
Trends in Signal Processing. 392
Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton, G. (2010b). Bi-
nary coding of speech spectrograms using a deep auto-encoder. In Interspeech 2010 ,
Makuhari, Chiba, Japan. 22
Desjardins, G. and Bengio, Y. (2008). Empirical evaluation of convolutional RBMs
for vision. Technical Report 1327, Département d'Informatique et de Recherche
Opérationnelle, Université de Montréal. 580
Desjardins, G., Courville, A., and Bengio, Y. (2011). On tracking the partition function.
In NIPS’2011 . 540
Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., and Makhoul, J. (2014). Fast
and robust neural network joint models for statistical machine translation. In Proc.
ACL’2014 . 406
DiCarlo, J. J. (2013). Mechanisms underlying visual object recognition: Humans vs.
neurons vs. machines. NIPS Tutorial. 23, 301
Do, T.-M.-T. and Artières, T. (2010). Neural conditional random fields. In International
Conference on Artificial Intelligence and Statistics, pages 177–184. 349
Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko,
K., and Darrell, T. (2014). Long-term recurrent convolutional networks for visual
recognition and description. arXiv:1411.4389. 95
Donoho, D. L. and Grimes, C. (2003). Hessian eigenmaps: new locally linear embed-
ding techniques for high-dimensional data. Technical Report 2003-08, Dept. Statistics,
Stanford University. 152, 504
Doya, K. (1993). Bifurcations of recurrent neural networks in gradient descent learning.
IEEE Transactions on Neural Networks, 1, 75–80. 249, 334
Dreyfus, S. E. (1962). The numerical solution of variational problems. Journal of Math-
ematical Analysis and Applications, 5(1), 30–45. 195
Dreyfus, S. E. (1973). The computational solution of optimal control problems with time
lag. IEEE Transactions on Automatic Control, 18(4), 383–385. 195
Dugas, C., Bengio, Y., Bélisle, F., and Nadeau, C. (2001). Incorporating second-order
functional knowledge for better option pricing. In NIPS'00, pages 472–478. MIT Press.
65, 164
Eggensperger, K., Feurer, M., Hutter, F., Bergstra, J., Snoek, J., Hoos, H., and Leyton-
Brown, K. (2013). Towards an empirical foundation for assessing Bayesian optimization
of hyperparameters. NIPS workshop on Bayesian Optimization in Theory and Practice.
372
El Hihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term
dependencies. In NIPS 8 . MIT Press. 326, 347
ElHihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term
dependencies. In NIPS’1995 . 337
Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., and Vincent, P. (2009). The difficulty
of training deep architectures and the effect of unsupervised pre-training. In Proceedings
of AISTATS’2009. 190
Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., and Bengio, S. (2010).
Why does unsupervised pre-training help deep learning? J. Machine Learning Res.
473, 475, 476, 477
Fang, H., Gupta, S., Iandola, F., Srivastava, R., Deng, L., Dollár, P., Gao, J., He, X.,
Mitchell, M., Platt, J. C., Zitnick, C. L., and Zweig, G. (2015). From captions to visual
concepts and back. arXiv:1411.4952. 95
Farabet, C., LeCun, Y., Kavukcuoglu, K., Culurciello, E., Martini, B., Akselrod, P.,
and Talay, S. (2011). Large-scale FPGA-based convolutional networks. In R. Bekker-
man, M. Bilenko, and J. Langford, editors, Scaling up Machine Learning: Parallel and
Distributed Approaches. Cambridge University Press. 462
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013a). Learning hierarchical
features for scene labeling. IEEE Transactions on Pattern Analysis and Machine In-
telligence. 22, 190
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013b). Learning hierarchical
features for scene labeling. IEEE Transactions on Pattern Analysis and Machine In-
telligence, 35(8), 1915–1929. 349
Fei-Fei, L., Fergus, R., and Perona, P. (2006). One-shot learning of object categories.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611.
482
Fischer, A. and Igel, C. (2011). Bounding the bias of contrastive divergence learning.
Neural Computation, 23(3), 664–73. 562
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals
of Eugenics, 7, 179–188. 19, 98
Frasconi, P., Gori, M., and Sperduti, A. (1997). On the efficient classification of data
structures by neural networks. In Proc. Int. Joint Conf. on Artificial Intelligence. 326,
328
Frasconi, P., Gori, M., and Sperduti, A. (1998). A general framework for adaptive pro-
cessing of data structures. IEEE Transactions on Neural Networks, 9(5), 768–786.
326, 328
Freund, Y. and Schapire, R. E. (1996a). Experiments with a new boosting algorithm. In
Machine Learning: Proceedings of Thirteenth International Conference, pages 148–156,
USA. ACM. 227
Freund, Y. and Schapire, R. E. (1996b). Game theory, on-line prediction and boosting.
In Proceedings of the Ninth Annual Conference on Computational Learning Theory,
pages 325–332. 227
Frey, B. J. (1998). Graphical models for machine learning and digital communication.
MIT Press. 329
Frey, B. J., Hinton, G. E., and Dayan, P. (1996). Does the wake-sleep algorithm learn
good density estimators? In NIPS’95 , pages 661–670. MIT Press, Cambridge, MA.
329
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mech-
anism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36,
193–202. 14, 21, 24, 302
Garson, J. (1900). The metric system of identification of criminals, as used in Great
Britain and Ireland. The Journal of the Anthropological Institute of Great Britain and
Ireland, (2), 177–227. 19
Gers, F. A., Schmidhuber, J., and Cummins, F. (2000). Learning to forget: Continual
prediction with LSTM. Neural computation, 12(10), 2451–2471. 342
Glorot, X. and Bengio, Y. (2010a). Understanding the difficulty of training deep feedfor-
ward neural networks. In AISTATS’2010 . 163
Glorot, X. and Bengio, Y. (2010b). Understanding the difficulty of training deep feedfor-
ward neural networks. In JMLR W&CP: Proceedings of the Thirteenth International
Conference on Artificial Intelligence and Statistics (AISTATS 2010), volume 9, pages
249–256. 266
Glorot, X., Bordes, A., and Bengio, Y. (2011a). Deep sparse rectifier neural networks. In
AISTATS’2011 . 14, 164, 461
Glorot, X., Bordes, A., and Bengio, Y. (2011b). Deep sparse rectifier neural networks.
In JMLR W&CP: Proceedings of the Fourteenth International Conference on Artificial
Intelligence and Statistics (AISTATS 2011). 193, 461
Glorot, X., Bordes, A., and Bengio, Y. (2011c). Domain adaptation for large-scale senti-
ment classification: A deep learning approach. In ICML’2011 . 461, 481
Gong, S., McKenna, S., and Psarrou, A. (2000). Dynamic Vision: From Images to Face
Recognition. Imperial College Press. 505, 507
Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep
networks. In NIPS’2009 , pages 646–654. 226, 448, 460
Goodfellow, I., Koenig, N., Muja, M., Pantofaru, C., Sorokin, A., and Takayama, L.
(2010). Help me help you: Interfaces for personal robots. In Proc. of Human Robot
Interaction (HRI), Osaka, Japan. ACM Press, ACM Press. 93
Goodfellow, I., Courville, A., and Bengio, Y. (2012). Large-scale feature learning with
spike-and-slab sparse coding. In ICML’2012 . 457
Goodfellow, I. J. (2010). Technical report: Multidimensional, downsampled convolution
for autoencoders. Technical report, Universit´e de Montr´eal. 293
Goodfellow, I. J., Courville, A., and Bengio, Y. (2011). Spike-and-slab sparse coding
for unsupervised feature discovery. In NIPS Workshop on Challenges in Learning
Hierarchical Models. 190, 481
Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a).
Maxout networks. In S. Dasgupta and D. McAllester, editors, ICML’13 , pages 1319–
1327. 193, 230, 300, 386
Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction
deep Boltzmann machines. In NIPS26 . NIPS Foundation. 94, 530, 576, 578
Goodfellow, I. J., Courville, A., and Bengio, Y. (2013c). Scaling up spike-and-slab models
for unsupervised feature learning. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 35(8), 1902–1914. 580
Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2014a). An em-
pirical investigation of catastrophic forgeting in gradient-based neural networks. In
ICLR’2014 . 194
Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014b). Explaining and harnessing adver-
sarial examples. CoRR, abs/1412.6572. 235
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,
Courville, A., and Bengio, Y. (2014c). Generative adversarial networks. In NIPS’2014.
188
Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., and Shet, V. (2014d). Multi-
digit number recognition from Street View imagery using deep convolutional neural
networks. In International Conference on Learning Representations. 22, 94, 190, 365,
382
Goodman, J. (2001). Classes for fast maximum entropy training. In International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP), Utah. 400
Gori, M. and Tesi, A. (1992). On the problem of local minima in backpropagation. IEEE
Transactions on Pattern Analysis and Machine Intelligence, PAMI-14(1), 76–86. 242
Gosset, W. S. (1908). The probable error of a mean. Biometrika, 6(1), 1–25. Originally
published under the pseudonym “Student”. 19
Gouws, S., Bengio, Y., and Corrado, G. (2014). Bilbowa: Fast bilingual distributed
representations without word alignments. Technical report, arXiv:1410.2455. 408, 484
Graf, H. P. and Jackel, L. D. (1989). Analog electronic neural network circuits. Circuits
and Devices Magazine, IEEE , 5(4), 44–49. 383
Graves, A. (2011a). Practical variational inference for neural networks. In J. Shawe-
Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in
Neural Information Processing Systems 24 , pages 2348–2356. Curran Associates, Inc.
211, 212
Graves, A. (2011b). Practical variational inference for neural networks. In NIPS’2011 .
214
Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks. Studies
in Computational Intelligence. Springer. 309, 323, 340, 341, 349, 393
Graves, A. (2013). Generating sequences with recurrent neural networks. Technical
report, arXiv:1308.0850. 172, 340
Graves, A. and Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent
neural networks. In ICML’2014 . 340
Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirec-
tional LSTM and other neural network architectures. Neural Networks, 18(5), 602–610.
323
Graves, A. and Schmidhuber, J. (2009). Offline handwriting recognition with multidi-
mensional recurrent neural networks. In D. Koller, D. Schuurmans, Y. Bengio, and
L. Bottou, editors, NIPS’2008 , pages 545–552. 323
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist tempo-
ral classification: Labelling unsegmented sequence data with recurrent neural networks.
In ICML'2006, pages 369–376, Pittsburgh, USA. 349, 393
Graves, A., Liwicki, M., Bunke, H., Schmidhuber, J., and Fernández, S. (2008). Uncon-
strained on-line handwriting recognition with recurrent neural networks. In J. Platt,
D. Koller, Y. Singer, and S. Roweis, editors, NIPS'2007, pages 577–584. 323
Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., and Schmidhuber,
J. (2009). A novel connectionist system for unconstrained handwriting recognition.
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5), 855–868.
340
Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep re-
current neural networks. In ICASSP’2013 , pages 6645–6649. 323, 325, 326, 340, 341,
392, 393
Graves, A., Wayne, G., and Danihelka, I. (2014a). Neural Turing machines.
arXiv:1410.5401. 23
Graves, A., Wayne, G., and Danihelka, I. (2014b). Neural Turing machines. arXiv preprint
arXiv:1410.5401. 343
Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., and Schmidhuber, J. (2015).
LSTM: a search space odyssey. arXiv preprint arXiv:1503.04069. 342
Gregor, K. and LeCun, Y. (2010). Emergence of complex-like cells in a temporal product
network with local receptive fields. Technical report, arXiv:1006.0448. 292
Gülçehre, Ç. and Bengio, Y. (2013). Knowledge matters: Importance of prior infor-
mation for optimization. In International Conference on Learning Representations
(ICLR'2013). 22, 270
Guo, H. and Gelfand, S. B. (1992). Classification trees with neural network feature
extraction. Neural Networks, IEEE Transactions on, 3(6), 923–933. 382
Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. (2015). Deep learning
with limited numerical precision. CoRR, abs/1502.02551. 384
Gutmann, M. and Hyvarinen, A. (2010). Noise-contrastive estimation: A new estima-
tion principle for unnormalized statistical models. In Proceedings of The Thirteenth
International Conference on Artificial Intelligence and Statistics (AISTATS’10). 532
Hadsell, R., Sermanet, P., Ben, J., Erkan, A., Han, J., Muller, U., and LeCun, Y.
(2007). Online learning for offroad robots: Spatial label propagation to learn long-
range traversability. In Proceedings of Robotics: Science and Systems, Atlanta, GA,
USA. 385
Haffner, P., Franzini, M., and Waibel, A. (1991). Integrating time alignment and neural
networks for high performance continuous speech recognition. In International Confer-
ence on Acoustics, Speech and Signal Processing (ICASSP), pages 105–108, Toronto.
358
Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings
of the 18th annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley,
California. ACM Press. 189, 495
Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits.
Computational Complexity, 1, 113–129. 189, 495
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The elements of statistical learning:
data mining, inference and prediction. Springer Series in Statistics. Springer Verlag.
138
Hebb, D. O. (1949). The Organization of Behavior. Wiley, New York. 15
Henaff, M., Jarrett, K., Kavukcuoglu, K., and LeCun, Y. (2011). Unsupervised learning
of sparse features for scalable audio classification. In ISMIR’11 . 462
Herault, J. and Ans, B. (1984). Circuits neuronaux à synapses modifiables: Décodage de
messages composites par apprentissage non supervisé. Comptes Rendus de l'Académie
des Sciences, 299(III-13), 525–528. 454
Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures. 257
Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V.,
Nguyen, P., Sainath, T., and Kingsbury, B. (2012a). Deep neural networks for acoustic
modeling in speech recognition. IEEE Signal Processing Magazine, 29(6), 82–97. 22,
392
Hinton, G. E. (2000). Training products of experts by minimizing contrastive divergence.
Technical Report GCNU TR 2000-004, Gatsby Unit, University College London. 523
Hinton, G. E. and Roweis, S. (2003). Stochastic neighbor embedding. In NIPS’2002 . 504
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with
neural networks. Science, 313(5786), 504–507. 450, 472, 473
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the Dimensionality of Data with
Neural Networks. Science, 313, 504–507. 475
Hinton, G. E. and Zemel, R. S. (1994). Autoencoders, minimum description length, and
Helmholtz free energy. In NIPS’1993 . 444
Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief
nets. Neural Computation, 18, 1527–1554. 12, 16, 24, 135, 472, 473, 563
Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A.,
Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012b). Deep neural
networks for acoustic modeling in speech recognition: The shared views of four research
groups. IEEE Signal Process. Mag., 29(6), 82–97. 94
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.
(2012c). Improving neural networks by preventing co-adaptation of feature detectors.
Technical report, arXiv:1207.0580. 208
Hinton, G. E., Vinyals, O., and Dean, J. (2014). Dark knowledge. Invited talk at the
BayLearn Bay Area Machine Learning Symposium. 381
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma
thesis, T.U. München. 247, 334, 343
Hochreiter, S. and Schmidhuber, J. (1995). Simplifying neural nets by discovering flat
minima. In Advances in Neural Information Processing Systems 7 , pages 529–536.
MIT Press. 215
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computa-
tion, 9(8), 1735–1780. 23, 340, 341
Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2000).
Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In
J. Kolen and S. Kremer, editors, Field Guide to Dynamical Recurrent Networks. IEEE
Press. 341
Holi, J. L. and Hwang, J.-N. (1993). Finite precision error analysis of neural network
hardware implementations. Computers, IEEE Transactions on, 42(3), 281–290. 383
Holt, J. L. and Baker, T. E. (1991). Back propagation simulations using limited pre-
cision calculations. In Neural Networks, 1991., IJCNN-91-Seattle International Joint
Conference on, volume 2, pages 121–126. IEEE. 383
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are
universal approximators. Neural Networks, 2, 359–366. 188, 495
Hornik, K., Stinchcombe, M., and White, H. (1990). Universal approximation of an
unknown mapping and its derivatives using multilayer feedforward networks. Neural
networks, 3(5), 551–560. 188
Horst, R., Pardalos, P., and Thoai, N. (2000). Introduction to Global Optimization.
Kluwer Academic Publishers. Second Edition. 271
Hsu, F.-H. (2002). Behind Deep Blue: Building the Computer That Defeated the World
Chess Champion. Princeton University Press, Princeton, NJ, USA. 2
Huang, F. and Ogata, Y. (2002). Generalized pseudo-likelihood estimates for Markov
random fields on lattice. Annals of the Institute of Statistical Mathematics, 54(1),
1–18. 529
Hubel, D. and Wiesel, T. (1968). Receptive fields and functional architecture of monkey
striate cortex. Journal of Physiology (London), 195, 215–243. 299
Hubel, D. H. and Wiesel, T. N. (1959). Receptive fields of single neurons in the cat’s
striate cortex. Journal of Physiology, 148, 574–591. 299
Hubel, D. H. and Wiesel, T. N. (1962). Receptive fields, binocular interaction, and
functional architecture in the cat’s visual cortex. Journal of Physiology (London),
160, 106–154. 299
Hutter, F., Hoos, H., and Leyton-Brown, K. (2011). Sequential model-based optimization
for general algorithm configuration. In LION-5 . Extended version as UBC Tech report
TR-2010-10. 372
Hyotyniemi, H. (1996). Turing machines are recurrent neural networks. In STeP’96,
pages 13–24. 311
Hyvärinen, A. (1999). Survey on independent component analysis. Neural Computing
Surveys, 2, 94–128. 454
Hyvärinen, A. (2005a). Estimation of non-normalized statistical models using score
matching. J. Machine Learning Res., 6. 465
Hyvärinen, A. (2005b). Estimation of non-normalized statistical models using score
matching. Journal of Machine Learning Research, 6, 695–709. 530
Hyvärinen, A. (2007a). Connections between score matching, contrastive divergence,
and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural
Networks, 18, 1529–1531. 531
Hyvärinen, A. (2007b). Some extensions of score matching. Computational Statistics and
Data Analysis, 51, 2499–2512. 531
Hyvärinen, A. and Pajunen, P. (1999). Nonlinear independent component analysis: Ex-
istence and uniqueness results. Neural Networks, 12(3), 429–439. 455
Hyvärinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis.
Wiley-Interscience. 454
Hyvärinen, A., Hurri, J., and Hoyer, P. O. (2009). Natural Image Statistics: A proba-
bilistic approach to early computational vision. Springer-Verlag. 305
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training
by reducing internal covariate shift. 22, 93, 263
Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation.
Neural networks, 1(4), 295–307. 256
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixture
of local experts. Neural Computation, 3, 79–87. 171
Jaeger, H. (2003). Adaptive nonlinear system identification with echo state networks. In
Advances in Neural Information Processing Systems 15 . 335
Jaeger, H. (2007a). Discovering multiscale dynamical features with hierarchical echo state
networks. Technical report, Jacobs University. 326
Jaeger, H. (2007b). Echo state network. Scholarpedia, 2(9), 2330. 334
Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and
saving energy in wireless communication. Science, 304(5667), 78–80. 24, 334
Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., Mooij, J. M., and Schölkopf, B. (2012).
On causal and anticausal learning. In ICML'2012, pages 1255–1262. 487, 489
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009a). What is the best
multi-stage architecture for object recognition? In ICCV’09. 14, 164, 462
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009b). What is the best
multi-stage architecture for object recognition? In Proc. International Conference on
Computer Vision (ICCV’09), pages 2146–2153. IEEE. 21, 24, 192, 193, 298, 299
Jarzynski, C. (1997). Nonequilibrium equality for free energy differences. Phys. Rev.
Lett., 78, 2690–2693. 539
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University
Press. 48
Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2014). On using very large target
vocabulary for neural machine translation. arXiv:1412.2007. 407
Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters
from sparse data. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in
Practice. North-Holland, Amsterdam. 348, 406
Jia, Y., Huang, C., and Darrell, T. (2012). Beyond spatial pyramids: Receptive field learn-
ing for pooled image features. In Computer Vision and Pattern Recognition (CVPR),
2012 IEEE Conference on, pages 3370–3377. IEEE. 287
Jim, K.-C., Giles, C. L., and Horne, B. G. (1996). An analysis of noise in recurrent neural
networks: convergence and generalization. IEEE Transactions on Neural Networks,
7(6), 1424–1438. 211, 214
Jordan, M. I. (1998). Learning in Graphical Models. Kluwer, Dordrecht, Netherlands. 16
Jozefowicz, R., Zaremba, W., and Sutskever, I. (2015a). An empirical evaluation of
recurrent network architectures. In ICML’2015 . 342
Jozefowicz, R., Zaremba, W., and Sutskever, I. (2015b). An empirical exploration of
recurrent network architectures. In Proceedings of The 32nd International Conference
on Machine Learning, pages 2342–2350. 268
Juang, B. H. and Katagiri, S. (1992). Discriminative learning for minimum error classi-
fication. IEEE Transactions on Signal Processing, 40(12), 3043–3054. 356
Judd, J. S. (1989). Neural Network Design and the Complexity of Learning. MIT press.
249
Jutten, C. and Herault, J. (1991). Blind separation of sources, part I: an adaptive algo-
rithm based on neuromimetic architecture. Signal Processing, 24, 1–10. 454
Kahou, S. E., Pal, C., Bouthillier, X., Froumenty, P., Gülçehre, Ç., Memisevic, R., Vin-
cent, P., Courville, A., Bengio, Y., Ferrari, R. C., Mirza, M., Jean, S., Carrier, P.-L.,
Dauphin, Y., Boulanger-Lewandowski, N., Aggarwal, A., Zumer, J., Lamblin, P., Ray-
mond, J.-P., Desjardins, G., Pascanu, R., Warde-Farley, D., Torabi, A., Sharma, A.,
Bengio, E., Côté, M., Konda, K. R., and Wu, Z. (2013). Combining modality specific
deep neural networks for emotion recognition in video. In Proceedings of the 15th ACM
on International Conference on Multimodal Interaction. 190
Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In
EMNLP’2013 . 407
Kamyshanska, H. and Memisevic, R. (2015). The potential energy of an autoencoder.
IEEE Transactions on Pattern Analysis and Machine Intelligence. 468
Kanazawa, K., Koller, D., and Russell, S. (1995). Stochastic simulation algorithms for
dynamic probabilistic networks. In Proc. UAI’1995 , pages 346–351. 347
Karpathy, A. and Li, F.-F. (2015). Deep visual-semantic alignments for generating image
descriptions. In CVPR’2015 . arXiv:1412.2306. 95
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014).
Large-scale video classification with convolutional neural networks. In CVPR. 19
Karush, W. (1939). Minima of Functions of Several Variables with Inequalities as Side
Constraints. Master's thesis, Dept. of Mathematics, Univ. of Chicago. 90
Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model
component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal
Processing, ASSP-35(3), 400–401. 348, 406
Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008a). Fast inference in sparse coding
algorithms with applications to object recognition. CBLL-TR-2008-12-01, NYU. 447
Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008b). Fast inference in sparse coding
algorithms with applications to object recognition. Technical report, Computational
and Biological Learning Lab, Courant Institute, NYU. Tech Report CBLL-TR-2008-
12-01. 462
Kavukcuoglu, K., Ranzato, M.-A., Fergus, R., and LeCun, Y. (2009). Learning invariant
features through topographic filter maps. In CVPR’2009. 462
Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y.
(2010a). Learning convolutional feature hierarchies for visual recognition. In Advances
in Neural Information Processing Systems 23 (NIPS’10), pages 1090–1098. 299
Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and Le-
Cun, Y. (2010b). Learning convolutional feature hierarchies for visual recognition.
In NIPS’2010 . 462
Kelley, H. J. (1960). Gradient theory of optimal flight paths. ARS Journal, 30(10),
947–954. 195
Khan, F., Zhu, X., and Mutlu, B. (2011). How do humans teach: On curriculum learning
and teaching dimension. In Advances in Neural Information Processing Systems 24
(NIPS’11), pages 1449–1457. 273
Kim, S. K., McAfee, L. C., McMahon, P. L., and Olukotun, K. (2009). A highly scalable
restricted Boltzmann machine FPGA implementation. In Field Programmable Logic
and Applications, 2009. FPL 2009. International Conference on, pages 367–372. IEEE.
383
Kindermann, R. (1980). Markov Random Fields and Their Applications (Contemporary
Mathematics; V. 1). American Mathematical Society. 419
Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980 . 258
Kingma, D. and LeCun, Y. (2010a). Regularized estimation of image statistics by score
matching. In NIPS’2010 . 465
Kingma, D. and LeCun, Y. (2010b). Regularized estimation of image statistics by score
matching. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta,
editors, Advances in Neural Information Processing Systems 23, pages 1126–1134. 532
Kingma, D., Rezende, D., Mohamed, S., and Welling, M. (2014). Semi-supervised learning
with deep generative models. In NIPS’2014. 435
Kingma, D. P. (2013). Fast gradient-based inference with continuous latent variable
models in auxiliary form. Technical report, arxiv:1306.0733. 188, 435
Kingma, D. P. and Welling, M. (2014a). Auto-encoding variational Bayes. In Proceedings
of the International Conference on Learning Representations (ICLR). 188, 435, 507,
508
Kingma, D. P. and Welling, M. (2014b). Efficient gradient-based inference through trans-
formations between Bayes nets and neural nets. Technical report, arXiv:1402.0480. 188,
434, 435
Kirkpatrick, S., Gelatt Jr., C. D., and Vecchi, M. P. (1983). Optimization by simulated
annealing. Science, 220, 671–680. 271
Kiros, R., Salakhutdinov, R., and Zemel, R. (2014a). Multimodal neural language models.
In ICML’2014 . 95
Kiros, R., Salakhutdinov, R., and Zemel, R. (2014b). Unifying visual-semantic embed-
dings with multimodal neural language models. arXiv:1411.2539 [cs.LG]. 95, 340
Klementiev, A., Titov, I., and Bhattarai, B. (2012). Inducing crosslingual distributed
representations of words. In Proceedings of COLING 2012 . 408, 484
Knowles-Barley, S., Jones, T. R., Morgan, J., Lee, D., Kasthuri, N., Lichtman, J. W., and
Pfister, H. (2014). Deep learning for the connectome. GPU Technology Conference. 23
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and
Techniques. MIT Press. 354, 433, 440
Konig, Y., Bourlard, H., and Morgan, N. (1996). REMAP: Recursive estimation and max-
imization of a posteriori probabilities - application to transition-based connectionist
speech recognition. In NIPS'95. MIT Press, Cambridge, MA. 391
Koren, Y. (2009). The BellKor solution to the Netflix Grand Prize. 227
Koutnik, J., Greff, K., Gomez, F., and Schmidhuber, J. (2014). A clockwork RNN. In
ICML’2014 . 326, 347
Kočiský, T., Hermann, K. M., and Blunsom, P. (2014). Learning Bilingual Word Repre-
sentations by Marginalizing Alignments. In Proceedings of ACL. 408
Krause, O., Fischer, A., Glasmachers, T., and Igel, C. (2013). Approximation properties
of DBNs with binary hidden units and real-valued visible units. In ICML’2013. 495
Krizhevsky, A. (2010). Convolutional deep belief networks on CIFAR-
10. Technical report, University of Toronto. Unpublished manuscript:
http://www.cs.utoronto.ca/~kriz/conv-cifar10-aug2010.pdf. 379
Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny
images. Technical report, University of Toronto. 19, 415
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012a). ImageNet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems
25 (NIPS’2012). 21, 24, 93, 306, 385
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012b). ImageNet classification with deep
convolutional neural networks. In NIPS’2012 . 22, 190, 461
Kuhn, H. W. and Tucker, A. W. (1951). Nonlinear programming. In Proceedings of the
Second Berkeley Symposium on Mathematical Statistics and Probability, pages 481–
492, Berkeley, Calif. University of California Press. 90
Kumar, M. P., Packer, B., and Koller, D. (2010). Self-paced learning for latent variable
models. In NIPS’2010 . 273
Lafferty, J., McCallum, A., and Pereira, F. C. N. (2001). Conditional random fields:
Probabilistic models for segmenting and labeling sequence data. In C. E. Brodley and
A. P. Danyluk, editors, ICML 2001 . Morgan Kaufmann. 349, 357
Lang, K. J. and Hinton, G. E. (1988). The development of the time-delay neural net-
work architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-
Mellon University. 302, 309, 336
Lappalainen, H., Giannakopoulos, X., Honkela, A., and Karhunen, J. (2000). Nonlinear
independent component analysis using ensemble learning: Experiments and discussion.
In Proc. ICA. Citeseer. 455
Larochelle, H. and Bengio, Y. (2008a). Classification using discriminative restricted Boltz-
mann machines. In ICML’2008 . 226, 448, 591
Larochelle, H. and Bengio, Y. (2008b). Classification using discriminative restricted
Boltzmann machines. In ICML’08 , pages 536–543. ACM. 486
Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator.
In AISTATS’2011 . 329, 332
Larochelle, H., Erhan, D., and Bengio, Y. (2008). Zero-data learning of new tasks. In
AAAI Conference on Artificial Intelligence. 482
Lasserre, J. A., Bishop, C. M., and Minka, T. P. (2006). Principled hybrids of generative
and discriminative models. In Proceedings of the Computer Vision and Pattern Recog-
nition Conference (CVPR’06), pages 87–94, Washington, DC, USA. IEEE Computer
Society. 223, 486
Le, Q., Ngiam, J., Chen, Z., hao Chia, D. J., Koh, P. W., and Ng, A. (2010). Tiled
convolutional neural networks. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor,
R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems
23 (NIPS’10), pages 1279–1287. 292
Le, Q., Ranzato, M., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J., and Ng,
A. (2012). Building high-level features using large scale unsupervised learning. In
ICML’2012 . 21, 24
Le Roux, N. and Bengio, Y. (2010). Deep belief networks are compact universal approx-
imators. Neural Computation, 22(8), 2192–2207. 495
Le Roux, N., Manzagol, P.-A., and Bengio, Y. (2008). Topmoumoute online natural
gradient algorithm. In NIPS 20 . MIT Press. 241
LeCun, Y. (1985). Une procédure d’apprentissage pour réseau à seuil asymétrique. In
Cognitiva 85: À la Frontière de l’Intelligence Artificielle, des Sciences de la Connais-
sance et des Neurosciences, pages 599–604, Paris 1985. CESTA, Paris. 195
LeCun, Y. (1987). Modèles connexionistes de l’apprentissage. Ph.D. thesis, Université de
Paris VI. 16, 444
LeCun, Y., Jackel, L. D., Boser, B., Denker, J. S., Graf, H. P., Guyon, I., Henderson, D.,
Howard, R. E., and Hubbard, W. (1989). Handwritten digit recognition: Applications
of neural network chips and automatic learning. IEEE Communications Magazine,
27(11), 41–46. 302
LeCun, Y., Bottou, L., Orr, G. B., and Müller, K.-R. (1998a). Efficient backprop. In
Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science LNCS 1524.
Springer Verlag. 365
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998b). Gradient-based learning
applied to document recognition. Proceedings of the IEEE , 86(11), 2278–2324. 14, 24
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998c). Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. 16, 19,
349, 357, 358, 359, 392
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998d). Gradient based learning
applied to document recognition. Proc. IEEE. 305
LeCun, Y., Kavukcuoglu, K., and Farabet, C. (2010). Convolutional networks and ap-
plications in vision. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE
International Symposium on, pages 253–256. IEEE. 306
Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area
V2. In NIPS’07 . 226, 448
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). Convolutional deep belief
networks for scalable unsupervised learning of hierarchical representations. In L. Bottou
and M. Littman, editors, ICML 2009. ACM, Montreal, Canada. 298, 580, 581
Lee, Y. J. and Grauman, K. (2011). Learning the easy things first: self-paced visual
category discovery. In CVPR’2011 . 273
Leibniz, G. W. (1676). Memoir using the chain rule. (Cited in TMME 7:2&3 p 321-332,
2010). 194
Lenat, D. B. and Guha, R. V. (1989). Building large knowledge-based systems; represen-
tation and inference in the Cyc project. Addison-Wesley Longman Publishing Co., Inc.
2
Leprieur, H. and Haffner, P. (1995). Discriminant learning with minimum memory loss
for improved non-vocabulary rejection. In EUROSPEECH’95, Madrid, Spain. 356
L’Hôpital, G. F. A. (1696). Analyse des infiniment petits, pour l’intelligence des lignes
courbes. Paris: L’Imprimerie Royale. 194
Lin, T., Horne, B. G., Tino, P., and Giles, C. L. (1996). Learning long-term dependencies
is not as difficult with NARX recurrent neural networks. IEEE Transactions on Neural
Networks, 7(6), 1329–1338. 336, 337
Linde, N. (1992). The machine that changed the world, episode 3. Documentary minis-
eries. 2
Lindsey, C. and Lindblad, T. (1994). Review of hardware neural networks: a user’s
perspective. In Proc. Third Workshop on Neural Networks: From Biology to High
Energy Physics, pages 195–202, Isola d’Elba, Italy. 383
Linnainmaa, S. (1976). Taylor expansion of the accumulated rounding error. BIT Nu-
merical Mathematics, 16(2), 146–160. 195
Long, P. M. and Servedio, R. A. (2010). Restricted Boltzmann machines are hard to ap-
proximately evaluate or simulate. In Proceedings of the 27th International Conference
on Machine Learning (ICML’10). 557
Lovelace, A. (1842). Notes upon L. F. Menabrea’s “Sketch of the Analytical Engine
invented by Charles Babbage”. 1
Lowerre, B. (1976). The Harpy Speech Recognition System. Ph.D. thesis. 350, 356, 362
Lukoševičius, M. and Jaeger, H. (2009). Reservoir computing approaches to recurrent
neural network training. Computer Science Review, 3(3), 127–149. 334
Luo, H., Carrier, P.-L., Courville, A., and Bengio, Y. (2013). Texture modeling with
convolutional spike-and-slab RBMs and deep extensions. In AISTATS’2013 . 95
Lyness, J. N. and Moler, C. B. (1967). Numerical differentiation of analytic functions.
SIAM J. Numer. Anal., 4, 202–210. 184
Lyu, S. (2009). Interpretation and generalization of score matching. In UAI’09 . 531
Maass, W., Natschlaeger, T., and Markram, H. (2002). Real-time computing without
stable states: A new framework for neural computation based on perturbations. Neural
Computation, 14(11), 2531–2560. 334
MacKay, D. (2003). Information Theory, Inference and Learning Algorithms. Cambridge
University Press. 56
Maclaurin, D., Duvenaud, D., and Adams, R. P. (2015). Gradient-based hyperparameter
optimization through reversible learning. arXiv preprint arXiv:1502.03492 . 372
Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., and Yuille, A. L. (2015). Deep captioning
with multimodal recurrent neural networks. In ICLR’2015 . arXiv:1410.1090. 95
Marlin, B., Swersky, K., Chen, B., and de Freitas, N. (2010). Inductive principles for
restricted Boltzmann machine learning. In Proceedings of The Thirteenth International
Conference on Artificial Intelligence and Statistics (AISTATS’10), volume 9, pages
509–516. 526, 531, 559
Marr, D. and Poggio, T. (1976). Cooperative computation of stereo disparity. Science,
194. 302
Martens, J. (2010). Deep learning via Hessian-free optimization. In L. Bottou and
M. Littman, editors, Proceedings of the Twenty-seventh International Conference on
Machine Learning (ICML-10), pages 735–742. ACM. 267
Martens, J. and Medabalimi, V. (2014). On the expressive efficiency of sum product
networks. arXiv:1411.7717 . 496
Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-
free optimization. In Proc. ICML’2011 . ACM. 344
Mase, S. (1995). Consistency of the maximum pseudo-likelihood estimator of continuous
state space Gibbsian processes. The Annals of Applied Probability, 5(3), 603–612.
529
Matan, O., Burges, C. J. C., LeCun, Y., and Denker, J. S. (1992). Multi-digit recognition
using a space displacement neural network. In NIPS’91 , pages 488–495, San Mateo
CA. Morgan Kaufmann. 358
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall,
London. 165
McCulloch, W. S. and Pitts, W. (1943). A logical calculus of ideas immanent in nervous
activity. Bulletin of Mathematical Biophysics, 5, 115–133. 13
Mead, C. and Ismail, M. (2012). Analog VLSI implementation of neural systems, vol-
ume 80. Springer Science & Business Media. 383
Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E.,
Muller, X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra,
J. (2011). Unsupervised and transfer learning challenge: a deep learning approach. In
JMLR W&CP: Proc. Unsupervised and Transfer Learning, volume 7. 190, 481
Mesnil, G., Rifai, S., Dauphin, Y., Bengio, Y., and Vincent, P. (2012). Surfing on the
manifold. Learning Workshop, Snowbird. 584
Miikkulainen, R. and Dyer, M. G. (1991). Natural language processing with modular
PDP networks and distributed lexicon. Cognitive Science, 15, 343–399. 394
Mikolov, T. (2012). Statistical Language Models based on Neural Networks. Ph.D. thesis,
Brno University of Technology. 172, 345
Mikolov, T., Deoras, A., Kombrink, S., Burget, L., and Cernocky, J. (2011a). Empiri-
cal evaluation and combination of advanced language modeling techniques. In Proc.
12th annual conference of the international speech communication association (INTER-
SPEECH 2011). 405
Mikolov, T., Deoras, A., Povey, D., Burget, L., and Cernocky, J. (2011b). Strategies for
training large scale neural network language models. In Proc. ASRU’2011. 273, 405
Mikolov, T., Le, Q. V., and Sutskever, I. (2013). Exploiting similarities among languages
for machine translation. Technical report, arXiv:1309.4168. 484
Minka, T. (2005). Divergence measures and message passing. Technical Report MSR-
TR-2005-173, Microsoft Research, Cambridge, UK. 536
Minsky, M. L. and Papert, S. A. (1969). Perceptrons. MIT Press, Cambridge. 13
Misra, J. and Saha, I. (2010). Artificial neural networks in hardware: A survey of two
decades of progress. Neurocomputing, 74(1), 239–255. 383
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York. 92
Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-
contrastive estimation. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and
K. Weinberger, editors, Advances in Neural Information Processing Systems 26 , pages
2265–2273. Curran Associates, Inc. 404, 534
Mnih, A. and Teh, Y. W. (2012). A fast and simple algorithm for training neural proba-
bilistic language models. In ICML’2012 , pages 1751–1758. 404
Mnih, V. and Hinton, G. (2010). Learning to detect roads in high-resolution aerial images.
In Proceedings of the 11th European Conference on Computer Vision (ECCV). 95
Mobahi, H. and Fisher III, J. W. (2015). A theoretical analysis of optimization by Gaussian
continuation. In AAAI’2015. 272
Mohamed, A., Dahl, G., and Hinton, G. (2012). Acoustic modeling using deep belief
networks. IEEE Trans. on Audio, Speech and Language Processing, 20(1), 14–22. 392
Montúfar, G. (2014). Universal approximation depth and errors of narrow belief networks
with discrete units. Neural Computation, 26. 495
Montúfar, G. and Ay, N. (2011). Refinements of universal approximation results for
deep belief networks and restricted Boltzmann machines. Neural Computation, 23(5),
1306–1319. 495
Montúfar, G. and Morton, J. (2014). When does a mixture of products contain a product
of mixtures? SIAM Journal on Discrete Mathematics, 29(1), 321–347. 494
Montúfar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear
regions of deep neural networks. In NIPS’2014 . 17, 493, 496, 497
Mor-Yosef, S., Samueloff, A., Modan, B., Navot, D., and Schenker, J. G. (1990). Ranking
the risk factors for cesarean: logistic regression analysis of a nationwide study. Obstet
Gynecol, 75(6), 944–7. 2
Morin, F. and Bengio, Y. (2005). Hierarchical probabilistic neural network language
model. In AISTATS’2005. 400, 402
Mozer, M. C. (1992). The induction of multiscale temporal structure. In NIPS’91 , pages
275–282, San Mateo, CA. Morgan Kaufmann. 337, 338, 347
Murphy, K. P. (2012). Machine Learning: a Probabilistic Perspective. MIT Press, Cam-
bridge, MA, USA. 92, 138
Murray, B. U. I. and Larochelle, H. (2014). A deep and tractable density estimator. In
ICML’2014 . 172, 333, 334
Nadas, A., Nahamoo, D., and Picheny, M. A. (1988). On a model-robust training method
for speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing,
ASSP-36(9), 1432–1436. 356
Nair, V. and Hinton, G. (2010a). Rectified linear units improve restricted Boltzmann
machines. In ICML’2010 . 164, 461
Nair, V. and Hinton, G. E. (2010b). Rectified linear units improve restricted Boltzmann
machines. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh
International Conference on Machine Learning (ICML-10), pages 807–814. ACM. 14
Narayanan, H. and Mitter, S. (2010). Sample complexity of testing the manifold hypoth-
esis. In NIPS’2010 . 152, 501
Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56,
71–113. 582
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics.
Springer. 231
Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2),
125–139. 538, 539
Neal, R. M. (2005). Estimating ratios of normalizing constants using linked importance
sampling. 540
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Read-
ing digits in natural images with unsupervised feature learning. Deep Learning and
Unsupervised Feature Learning Workshop, NIPS. 19
Ney, H. and Kneser, R. (1993). Improved clustering techniques for class-based statistical
language modelling. In European Conference on Speech Communication and Technol-
ogy (Eurospeech), pages 973–976, Berlin. 395
Niesler, T. R., Whittaker, E. W. D., and Woodland, P. C. (1998). Comparison of
part-of-speech and automatically derived category-based language models for speech
recognition. In International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 177–180. 397
Niranjan, M. and Fallside, F. (1990). Neural networks and radial basis functions in
classifying static speech patterns. Computer Speech and Language, 4, 275–289. 164
Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer. 85, 90
Olshausen, B. and Field, D. J. (2005). How close are we to understanding V1? Neural
Computation, 17, 1665–1699. 14
Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field prop-
erties by learning a sparse code for natural images. Nature, 381, 607–609. 226, 305,
447, 500
Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set:
a strategy employed by V1? Vision Research, 37, 3311–3325. 389, 391, 460
Opper, M. and Archambeau, C. (2009). The variational Gaussian approximation revisited.
Neural Computation, 21(3), 786–792. 188
Parker, D. B. (1985). Learning-logic. Technical Report TR-47, Center for Comp. Research
in Economics and Management Sci., MIT. 195
Pascanu, R. (2014). On recurrent and deep networks. Ph.D. thesis, Université de
Montréal. 244, 245
Pascanu, R. and Bengio, Y. (2012). On the difficulty of training recurrent neural networks.
Technical Report arXiv:1211.5063, Universite de Montreal. 172
Pascanu, R., Mikolov, T., and Bengio, Y. (2013a). On the difficulty of training recurrent
neural networks. In ICML’2013 . 172, 249, 334, 338, 345, 346, 347
Pascanu, R., Montufar, G., and Bengio, Y. (2013b). On the number of inference regions
of deep feed forward networks with piece-wise linear activations. Technical report, U.
Montreal, arXiv:1312.6098. 189
Pascanu, R., Gülçehre, Ç., Cho, K., and Bengio, Y. (2014a). How to construct deep
recurrent neural networks. In ICLR’2014. 17, 231, 325, 326, 340, 393, 496, 497
Pascanu, R., Montufar, G., and Bengio, Y. (2014b). On the number of inference regions
of deep feed forward networks with piece-wise linear activations. In ICLR’2014 . 494
Pearl, J. (1985). Bayesian networks: A model of self-activated memory for evidential
reasoning. In Proceedings of the 7th Conference of the Cognitive Science Society, Uni-
versity of California, Irvine, pages 329–334. 417
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann. 49
Petersen, K. B. and Pedersen, M. S. (2006). The matrix cookbook. Version 20051003. 28
Pinto, N., Cox, D. D., and DiCarlo, J. J. (2008). Why is real-world visual object recog-
nition hard? PLoS Comput Biol, 4. 389, 581
Pinto, N., Stone, Z., Zickler, T., and Cox, D. (2011). Scaling up biologically-inspired
computer vision: A case study in unconstrained face recognition on facebook. In Com-
puter Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer
Society Conference on, pages 35–42. IEEE. 298
Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1),
77–105. 326
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods.
USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17. 253
Poole, B., Sohl-Dickstein, J., and Ganguli, S. (2014). Analyzing noise in autoencoders
and deep networks. CoRR, abs/1406.1831. 211
Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In
UAI’2011 , Barcelona, Spain. 189, 496
Poundstone, W. (2005). Fortune’s Formula: The untold story of the scientific betting
system that beat the casinos and Wall Street. Macmillan. 57
Powell, M. (1987). Radial basis functions for multivariable interpolation: A review. 164
Presley, R. K. and Haggard, R. L. (1994). A fixed point implementation of the backprop-
agation learning algorithm. In Southeastcon ’94: Creative Technology Transfer – A
Global Affair, Proceedings of the 1994 IEEE, pages 136–138. IEEE. 383
Price, R. (1958). A useful theorem for nonlinear devices having Gaussian inputs. IEEE
Transactions on Information Theory, 4(2), 69–72. 187
Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., and Fried, I. (2005). Invariant visual
representation by single neurons in the human brain. Nature, 435(7045), 1102–1107.
300
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE, 77(2), 257–286. 354, 391
Rabiner, L. R. and Juang, B. H. (1986). An introduction to hidden Markov models. IEEE
ASSP Magazine, pages 257–285. 308, 354
Raiko, T., Yao, L., Cho, K., and Bengio, Y. (2014). Iterative neural autoregressive
distribution estimator (NADE-k). Technical report, arXiv:1406.1485. 333
Raina, R., Madhavan, A., and Ng, A. Y. (2009). Large-scale deep unsupervised learning
using graphics processors. In L. Bottou and M. Littman, editors, ICML 2009 , pages
873–880, New York, NY, USA. ACM. 24, 378
Rall, L. B. (1981). Automatic Differentiation: Techniques and Applications. Lecture
Notes in Computer Science 120, Springer. 184
Ramsey, F. P. (1926). Truth and probability. In R. B. Braithwaite, editor, The Founda-
tions of Mathematics and other Logical Essays, chapter 7, pages 156–198. McMaster
University Archive for the History of Economic Thought. 50
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007a). Efficient learning of
sparse representations with an energy-based model. In NIPS’2006 . 12, 16, 460, 472,
473
Ranzato, M., Huang, F., Boureau, Y., and LeCun, Y. (2007b). Unsupervised learning
of invariant feature hierarchies with applications to object recognition. In Proceedings
of the Computer Vision and Pattern Recognition Conference (CVPR’07). IEEE Press.
299
Ranzato, M., Boureau, Y., and LeCun, Y. (2008). Sparse feature learning for deep belief
networks. In NIPS’2007 . 460
Rao, C. (1945). Information and the accuracy attainable in the estimation of statistical
parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–89. 126
Recht, B., Re, C., Wright, S., and Niu, F. (2011). Hogwild: A lock-free approach to
parallelizing stochastic gradient descent. In NIPS’2011 . 380
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and
approximate inference in deep generative models. In ICML’2014. 188, 434, 435
Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. Y. (2013). Zero-shot learning
through cross-modal transfer. In 27th Annual Conference on Neural Information Pro-
cessing Systems (NIPS 2013). 482
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011a). Contractive auto-
encoders: Explicit invariance during feature extraction. In ICML’2011 . 468, 470,
503
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X.
(2011b). Higher order contractive auto-encoder. In European Conference on Machine
Learning and Principles and Practice of Knowledge Discovery in Databases (ECML
PKDD). 448
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X.
(2011c). Higher order contractive auto-encoder. In ECML PKDD. 468
Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011d). The manifold
tangent classifier. In NIPS’2011 . 516
Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for
sampling contractive auto-encoders. In ICML’2012. 584
Ringach, D. and Shapley, R. (2004). Reverse correlation in neurophysiology. Cognitive
Science, 28(2), 147–166. 302
Roberts, S. and Everson, R. (2001). Independent component analysis: principles and
practice. Cambridge University Press. 455
Robinson, A. J. and Fallside, F. (1991). A recurrent error propagation network speech
recognition system. Computer Speech and Language, 5(3), 259–274. 24, 391
Rockafellar, R. T. (1997). Convex Analysis. Princeton Landmarks in Mathematics. 88
Romero, A., Ballas, N., Ebrahimi Kahou, S., Chassang, A., Gatta, C., and Bengio, Y.
(2015). Fitnets: Hints for thin deep nets. In ICLR’2015, arXiv:1412.6550. 271
Rosen, J. B. (1960). The gradient projection method for nonlinear programming. Part
I. Linear constraints. Journal of the Society for Industrial and Applied Mathematics,
8(1), 181–217. 88
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage
and organization in the brain. Psychological Review, 65, 386–408. 12, 13, 24
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York. 13, 24
Roweis, S. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear
embedding. Science, 290(5500). 152, 153, 504
Rumelhart, D., Hinton, G., and Williams, R. (1986a). Learning representations by back-
propagating errors. Nature, 323, 533–536. 12, 16, 22, 195, 394
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986b). Learning internal repre-
sentations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors,
Parallel Distributed Processing, volume 1, chapter 8, pages 318–362. MIT Press, Cam-
bridge. 19, 24, 195
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986c). Learning representations
by back-propagating errors. Nature, 323, 533–536. 158, 308
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986d). Parallel
Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press,
Cambridge. 15, 195
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986e). Parallel
Distributed Processing: Explorations in the Microstructure of Cognition, volume 1.
MIT Press, Cambridge. 158
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,
A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2014a). ImageNet Large
Scale Visual Recognition Challenge. 19
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,
A., Khosla, A., Bernstein, M., et al. (2014b). ImageNet large scale visual recognition
challenge. arXiv preprint arXiv:1409.0575 . 25
Rust, N., Schwartz, O., Movshon, J. A., and Simoncelli, E. (2005). Spatiotemporal
elements of macaque V1 receptive fields. Neuron, 46(6), 945–956. 301
Sainath, T., Mohamed, A.-r., Kingsbury, B., and Ramabhadran, B. (2013). Deep
convolutional neural networks for LVCSR. In ICASSP 2013 . 392
Salakhutdinov, R. and Hinton, G. (2009a). Deep Boltzmann machines. In Proceedings of
the International Conference on Artificial Intelligence and Statistics, volume 5, pages
448–455. 21, 24, 473, 566, 569, 574, 576
Salakhutdinov, R. and Hinton, G. (2009b). Deep Boltzmann machines. In Proceedings
of the Twelfth International Conference on Artificial Intelligence and Statistics (AIS-
TATS 2009), volume 8. 573, 577, 589
Salakhutdinov, R. and Hinton, G. E. (2008). Using deep belief nets to learn covariance
kernels for Gaussian processes. In NIPS’07 , pages 1249–1256, Cambridge, MA. MIT
Press. 487
Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief
networks. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008 ,
volume 25, pages 872–879. ACM. 539
Saul, L. K., Jaakkola, T., and Jordan, M. I. (1996). Mean field theory for sigmoid belief
networks. Journal of Artificial Intelligence Research, 4, 61–76. 24
Savich, A. W., Moussa, M., and Areibi, S. (2007). The impact of arithmetic representation
on implementing MLP-BP on FPGAs: a study. Neural Networks, IEEE Transactions on,
18(1), 240–252. 383
Saxe, A. M., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., and Ng, A. (2011). On
random weights and unsupervised feature learning. In Proc. ICML’2011 . ACM. 298
Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear
dynamics of learning in deep linear neural networks. In ICLR. 266
Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of
history compression. Neural Computation, 4(2), 234–242. 326
Schmidhuber, J. (1996). Sequential neural text compression. IEEE Transactions on
Neural Networks, 7(1), 142–146. 394
Schölkopf, B. and Smola, A. (2002). Learning with kernels. MIT Press. 148
Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a
kernel eigenvalue problem. Neural Computation, 10, 1299–1319. 152, 504
Schölkopf, B., Burges, C. J. C., and Smola, A. J. (1999). Advances in Kernel Methods –
Support Vector Learning. MIT Press, Cambridge, MA. 16, 164, 191
Schulz, H. and Behnke, S. (2012). Learning two-layer contractive encodings. In
ICANN’2012 , pages 620–628. 470
Schuster, M. and Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE
Transactions on Signal Processing, 45(11), 2673–2681. 323
Schwenk, H. (2007). Continuous space language models. Computer Speech and Language,
21, 492–518. 395, 399
Schwenk, H. (2010). Continuous space language models for statistical machine translation.
The Prague Bulletin of Mathematical Linguistics, 93, 137–146. 395, 405
Schwenk, H. (2014). Cleaned subset of WMT ’14 dataset. 19
Schwenk, H. and Bengio, Y. (1998). Training methods for adaptive boosting of neural
networks. In NIPS’97 , pages 647–653. MIT Press. 227
Schwenk, H. and Gauvain, J.-L. (2002). Connectionist language modeling for large vocab-
ulary continuous speech recognition. In International Conference on Acoustics, Speech
and Signal Processing (ICASSP), volume 1, pages 765–768. 395
Schwenk, H. and Gauvain, J.-L. (2005). Building continuous space language models for
transcribing european languages. In Interspeech, pages 737–740. 395
Schwenk, H., Costa-jussà, M. R., and Fonollosa, J. A. R. (2006). Continuous space lan-
guage models for the IWSLT 2006 task. In International Workshop on Spoken Language
Translation, pages 166–173. 395, 405
Seide, F., Li, G., and Yu, D. (2011). Conversational speech transcription using context-
dependent deep neural networks. In Interspeech 2011 , pages 437–440. 22
Sermanet, P., Chintala, S., and LeCun, Y. (2012). Convolutional neural networks applied
to house numbers digit classification. CoRR, abs/1204.3968. 389
Sermanet, P., Kavukcuoglu, K., Chintala, S., and LeCun, Y. (2013). Pedestrian detection
with unsupervised multi-stage feature learning. In Proc. International Conference on
Computer Vision and Pattern Recognition (CVPR’13). IEEE. 22, 190
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical
Journal, 27(3), 379–423. 57
Shannon, C. E. (1949). Communication in the presence of noise. Proceedings of the
Institute of Radio Engineers, 37(1), 10–21. 57
Shilov, G. (1977). Linear Algebra. Dover Books on Mathematics Series. Dover Publica-
tions. 28
Siegelmann, H. (1995). Computation beyond the Turing limit. Science, 268(5210), 545–
548. 311
Siegelmann, H. and Sontag, E. (1991). Turing computability with neural nets. Applied
Mathematics Letters, 4(6), 77–80. 311
Siegelmann, H. T. and Sontag, E. D. (1995). On the computational power of neural nets.
Journal of Computer and Systems Sciences, 50(1), 132–150. 249
Simard, P. Y., Steinkraus, D., and Platt, J. C. (2003). Best practices for convolutional
neural networks. In ICDAR’2003 . 306
Simard, P. and Graf, H. P. (1994). Backpropagation without multiplication. In Advances
in Neural Information Processing Systems, pages 232–239. 383
Simard, P., Victorri, B., LeCun, Y., and Denker, J. (1992). Tangent prop - A formalism
for specifying selected invariances in an adaptive network. In NIPS’1991 . 210, 515,
516
Simard, P. Y., LeCun, Y., and Denker, J. (1993). Efficient pattern recognition using a
new transformation distance. In NIPS’92 . 514
Simard, P. Y., LeCun, Y. A., Denker, J. S., and Victorri, B. (1998). Transformation
invariance in pattern recognition — tangent distance and tangent propagation. Lecture
Notes in Computer Science, 1524. 514
Sjöberg, J. and Ljung, L. (1995). Overtraining, regularization and searching for a min-
imum, with application to neural networks. International Journal of Control, 62(6),
1391–1407. 221
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of
harmony theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed
Processing, volume 1, chapter 6, pages 194–281. MIT Press, Cambridge. 424, 437
Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of
machine learning algorithms. In NIPS’2012 . 372
Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011a).
Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In
NIPS’2011 . 326, 328
Socher, R., Manning, C., and Ng, A. Y. (2011b). Parsing natural scenes and natural
language with recursive neural networks. In Proceedings of the Twenty-Eighth Inter-
national Conference on Machine Learning (ICML’2011). 326
Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D. (2011c).
Semi-supervised recursive autoencoders for predicting sentiment distributions. In
EMNLP’2011 . 326
Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C.
(2013). Recursive deep models for semantic compositionality over a sentiment treebank.
In EMNLP’2013 . 326, 328
Solla, S. A., Levin, E., and Fleisher, M. (1988). Accelerated learning in layered neural
networks. Complex Systems, 2, 625–639. 168
Sontag, E. D. and Sussman, H. J. (1989). Backpropagation can give rise to spurious local
minima even for networks without hidden layers. Complex Systems, 3, 91–106. 242
Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturba-
tion gradient approximation. IEEE Transactions on Automatic Control, 37, 332–341.
184
Spitkovsky, V. I., Alshawi, H., and Jurafsky, D. (2010). From baby steps to leapfrog:
how ”less is more” in unsupervised dependency parsing. In HLT’10 . 273
Srivastava, N. and Salakhutdinov, R. (2012). Multimodal learning with deep Boltzmann
machines. In NIPS’2012 . 485
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).
Dropout: A simple way to prevent neural networks from overfitting. Journal of Ma-
chine Learning Research, 15, 1929–1958. 229, 231, 576
Steinkraus, D., Simard, P. Y., and Buck, I. (2005). Using GPUs for machine learning algo-
rithms. In International Conference on Document Analysis and Recognition (ICDAR
2005), pages 1115–1119. 378
Stewart, L., He, X., and Zemel, R. S. (2007). Learning flexible features for conditional
random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence,
30(8), 1415–1426. 349
Supancic, J. and Ramanan, D. (2013). Self-paced learning for long-term tracking. In
CVPR’2013 . 273
Sussillo, D. (2014). Random walks: Training very deep nonlinear feed-forward networks
with smart initialization. CoRR, abs/1412.6558. 266, 268
Sutskever, I. (2012). Training Recurrent Neural Networks. Ph.D. thesis, Department of
computer science, University of Toronto. 335, 336, 344
Sutskever, I. and Tieleman, T. (2010). On the Convergence Properties of Contrastive
Divergence. In Y. W. Teh and M. Titterington, editors, Proc. of the International
Conference on Artificial Intelligence and Statistics (AISTATS), volume 9, pages 789–
795. 524
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of
initialization and momentum in deep learning. In ICML. 255, 335, 336, 344
Sutskever, I., Vinyals, O., and Le, Q. V. (2014a). Sequence to sequence learning with
neural networks. Technical report, arXiv:1409.3215. 23, 95, 340, 341
Sutskever, I., Vinyals, O., and Le, Q. V. (2014b). Sequence to sequence learning with
neural networks. In NIPS’2014 . 324, 407
Swersky, K. (2010). Inductive Principles for Learning Restricted Boltzmann Machines.
Master’s thesis, University of British Columbia. 466
Swersky, K., Ranzato, M., Buchman, D., Marlin, B., and de Freitas, N. (2011). On
autoencoders and score matching for energy based models. In ICML’2011 . ACM. 532
Swersky, K., Snoek, J., and Adams, R. P. (2014). Freeze-thaw Bayesian optimization.
arXiv preprint arXiv:1406.3896. 373
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Van-
houcke, V., and Rabinovich, A. (2014a). Going deeper with convolutions. Technical
report, arXiv:1409.4842. 21, 22, 24, 190, 235, 270, 288
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and
Fergus, R. (2014b). Intriguing properties of neural networks. ICLR, abs/1312.6199.
234
Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014). Deepface: Closing the gap to
human-level performance in face verification. In CVPR’2014 . 93
Tang, Y. and Eliasmith, C. (2010). Deep networks for robust visual recognition. In
Proceedings of the 27th International Conference on Machine Learning, June 21-24,
2010, Haifa, Israel. 211
Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines
for modeling motion style. In L. Bottou and M. Littman, editors, ICML 2009 , pages
1025–1032. ACM. 408
Taylor, G., Hinton, G. E., and Roweis, S. (2007). Modeling human motion using binary
latent variables. In NIPS’06 , pages 1345–1352. MIT Press, Cambridge, MA. 408, 409
Tenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global geometric framework
for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323. 152, 153, 476,
477, 504
Thrun, S. (1995). Learning to play the game of chess. In NIPS’1994 . 516
Tibshirani, R. J. (1995). Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society B, 58, 267–288. 205
Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to
the likelihood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors,
ICML 2008 , pages 1064–1071. ACM. 526, 563
Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal components analysis.
Journal of the Royal Statistical Society B, 61(3), 611–622. 454
Schaul, T., Antonoglou, I., and Silver, D. (2014). Unit tests for stochastic optimization. In
International Conference on Learning Representations. 260
Torabi, A., Pal, C., Larochelle, H., and Courville, A. (2015). Using descriptive video
services to create a large data source for video annotation research. arXiv preprint
arXiv: 1503.01070. 142
Tu, K. and Honavar, V. (2011). On the utility of curricula in unsupervised learning of
probabilistic grammars. In IJCAI’2011 . 273
Uria, B., Murray, I., and Larochelle, H. (2013). RNADE: The real-valued neural autore-
gressive density-estimator. In NIPS’2013. 331, 333
van der Maaten, L. and Hinton, G. E. (2008a). Visualizing data using t-SNE. J. Machine
Learning Res., 9. 395, 476, 504, 508
van der Maaten, L. and Hinton, G. E. (2008b). Visualizing data using t-SNE. Journal of
Machine Learning Research, 9, 2579–2605. 477
Vanhoucke, V., Senior, A., and Mao, M. Z. (2011). Improving the speed of neural networks
on cpus. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop.
377, 384
Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer-
Verlag, Berlin. 106
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York.
106
Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative
frequencies of events to their probabilities. Theory of Probability and Its Applications,
16, 264–280. 106
Vincent, P. (2011a). A connection between score matching and denoising autoencoders.
Neural Computation, 23(7). 465, 466, 468, 584
Vincent, P. (2011b). A connection between score matching and denoising autoencoders.
Neural Computation, 23(7), 1661–1674. 532, 585
Vincent, P. and Bengio, Y. (2003). Manifold Parzen windows. In NIPS’2002 . MIT Press.
506
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and
composing robust features with denoising autoencoders. In ICML 2008 . 211, 463
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked
denoising autoencoders: Learning useful representations in a deep network with a local
denoising criterion. J. Machine Learning Res., 11. 463
Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., and Hinton, G. (2014a). Gram-
mar as a foreign language. Technical report, arXiv:1412.7449. 340
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2014b). Show and tell: a neural
image caption generator. arXiv 1411.4555. 340
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and tell: a neural image
caption generator. In CVPR’2015 . arXiv:1411.4555. 95
Viola, P. and Jones, M. (2001). Robust real-time object detection. In International
Journal of Computer Vision. 382
Von Melchner, L., Pallas, S. L., and Sur, M. (2000). Visual behaviour mediated by retinal
projections directed to the auditory pathway. Nature, 404(6780), 871–876. 14
Wager, S., Wang, S., and Liang, P. (2013). Dropout training as adaptive regularization.
In Advances in Neural Information Processing Systems 26 , pages 351–359. 232
Waibel, A., Hanazawa, T., Hinton, G. E., Shikano, K., and Lang, K. (1989). Phoneme
recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech,
and Signal Processing, 37, 328–339. 309, 385, 391
Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of
neural networks using dropconnect. In ICML’2013. 232
Wang, S. and Manning, C. (2013). Fast dropout training. In ICML’2013. 232
Warde-Farley, D., Goodfellow, I. J., Lamblin, P., Desjardins, G., Bastien, F., and Bengio,
Y. (2011). pylearn2. http://deeplearning.net/software/pylearn2. 379
Warde-Farley, D., Goodfellow, I. J., Courville, A., and Bengio, Y. (2014). An empirical
analysis of dropout in piecewise linear networks. In ICLR’2014 . 232
Wawrzynek, J., Asanovic, K., Kingsbury, B., Johnson, D., Beck, J., and Morgan, N.
(1996). Spert-II: A vector microprocessor system. Computer, 29(3), 79–86. 383
Weinberger, K. Q. and Saul, L. K. (2004). Unsupervised learning of image manifolds by
semidefinite programming. In CVPR’2004 , pages 988–995. 152, 504
Werbos, P. J. (1981). Applications of advances in nonlinear sensitivity analysis. In
Proceedings of the 10th IFIP Conference, 31.8 - 4.9, NYC , pages 762–770. 195
Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised em-
bedding. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008 , pages
1168–1175, New York, NY, USA. ACM. 486
Weston, J., Bengio, S., and Usunier, N. (2010). Large scale image annotation: learning
to rank with joint word-image embeddings. Machine Learning, 81(1), 21–35. 328
Weston, J., Chopra, S., and Bordes, A. (2014). Memory networks. arXiv preprint
arXiv:1410.3916 . 343
Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. In 1960 IRE WESCON
Convention Record, volume 4, pages 96–104. IRE, New York. 13, 19, 21, 24
Wikipedia (2015). List of animals by number of neurons. Wikipedia, the free encyclo-
pedia. [Online; accessed 4-March-2015]. 21, 24
Williams, C. K. I. and Rasmussen, C. E. (1996). Gaussian processes for regression. In
NIPS’95 , pages 514–520. MIT Press, Cambridge, MA. 191
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Machine Learning, 8, 229–256. 188, 343
Wolpert, D. and MacReady, W. (1997). No free lunch theorems for optimization. IEEE
Transactions on Evolutionary Computation, 1, 67–82. 249
Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms.
Neural Computation, 8(7), 1341–1390. 110
Wu, R., Yan, S., Shan, Y., Dang, Q., and Sun, G. (2015). Deep image: Scaling up image
recognition. arXiv:1501.02876. 22, 380
Wu, Z. (1997). Global continuation for distance geometry problems. SIAM Journal of
Optimization, 7, 814–836. 271
Xiong, H. Y., Barash, Y., and Frey, B. J. (2011). Bayesian prediction of tissue-regulated
splicing using RNA sequence and cellular context. Bioinformatics, 27(18), 2554–2562.
231
Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. S.,
and Bengio, Y. (2015a). Show, attend and tell: Neural image caption generation with
visual attention. In ICML’2015 . 95
Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. S.,
and Bengio, Y. (2015b). Show, attend and tell: Neural image caption generation with
visual attention. arXiv:1502.03044. 340
Xu, L. and Jordan, M. I. (1996). On convergence properties of the EM algorithm for
Gaussian mixtures. Neural Computation, 8, 129–151. 355
Younes, L. (1998). On the convergence of Markovian stochastic algorithms with rapidly
decreasing ergodicity rates. In Stochastics and Stochastics Models, pages 177–228. 524,
563
Zaremba, W. and Sutskever, I. (2014). Learning to execute. arXiv 1410.4615. 273
Zaremba, W. and Sutskever, I. (2015). Reinforcement learning neural turing machines.
arXiv preprint arXiv:1505.00521. 343
Zaslavsky, T. (1975). Facing Up to Arrangements: Face-Count Formulas for Partitions
of Space by Hyperplanes. Number no. 154 in Memoirs of the American Mathematical
Society. American Mathematical Society. 494
Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional net-
works. In ECCV’14 . 6
Zhou, J. and Troyanskaya, O. G. (2014). Deep supervised and convolutional generative
stochastic network for protein secondary structure prediction. In ICML’2014 . 590, 591
Zöhrer, M. and Pernkopf, F. (2014). General stochastic networks for classification. In
NIPS’2014. 590