Bibliography
Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985). A learning algorithm for
Boltzmann machines. Cognitive Science, 9, 147–169. 513
Alain, G. and Bengio, Y. (2012). What regularized auto-encoders learn from the data gen-
erating distribution. Technical Report arXiv:1211.4246, Université de Montréal.
426
Alain, G. and Bengio, Y. (2013). What regularized auto-encoders learn from the data
generating distribution. In ICLR’2013. Also arXiv report 1211.4246. 408, 426, 428
Alain, G., Bengio, Y., Yao, L., Thibodeau-Laufer, É., Yosinski, J., and Vincent, P.
(2015). GSNs: Generative stochastic networks. arXiv:1503.05571. 411
Amari, S. (1997). Neural learning in structured parameter spaces - natural Riemannian
gradient. In Advances in Neural Information Processing Systems, pages 127–133. MIT
Press. 166
Anderson, E. (1935). The Irises of the Gaspé Peninsula. Bulletin of the American Iris
Society, 59, 2–5. 19
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly
learning to align and translate. Technical report, arXiv:1409.0473. 22, 91, 359, 368,
369
Bahl, L. R., Brown, P., de Souza, P. V., and Mercer, R. L. (1987). Speech recognition
with continuous-parameter hidden Markov models. Computer Speech and Language,
2, 219–234. 62, 325
Baldi, P. and Brunak, S. (1998). Bioinformatics: The Machine Learning Approach. MIT
Press. 328
Baldi, P. and Sadowski, P. J. (2013). Understanding dropout. In Advances in Neural
Information Processing Systems 26 , pages 2814–2822. 221
Baldi, P., Brunak, S., Frasconi, P., Soda, G., and Pollastri, G. (1999). Exploiting the
past and the future in protein secondary structure prediction. Bioinformatics, 15(11),
937–946. 296
Baldi, P., Sadowski, P., and Whiteson, D. (2014). Searching for exotic particles in high-
energy physics with deep learning. Nature communications, 5. 22
Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal
function. IEEE Trans. on Information Theory, 39, 930–945. 181
Bartholomew, D. J. (1987). Latent variable models and factor analysis. Oxford University
Press. 413
Basilevsky, A. (1994). Statistical Factor Analysis and Related Methods: Theory and
Applications. Wiley. 413
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A.,
Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements.
Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop. 75, 178, 342
Basu, S. and Christensen, J. (2013). Teaching classification boundaries to humans. In
AAAI’2013 . 247
Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of
finite state Markov chains. Ann. Math. Stat., 37, 1559–1563. 323
Baxter, J. (1995). Learning internal representations. In Proceedings of the 8th Inter-
national Conference on Computational Learning Theory (COLT’95), pages 311–320,
Santa Cruz, California. ACM Press. 222
Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2015). Automatic
differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767 . 176
Becker, S. and Hinton, G. (1992). A self-organizing neural network that discovers surfaces
in random-dot stereograms. Nature, 355, 161–163. 460
Belkin, M. and Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for em-
bedding and clustering. In NIPS’01, Cambridge, MA. MIT Press. 446
Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and
data representation. Neural Computation, 15(6), 1373–1396. 145, 464
Bengio, S. and Bengio, Y. (2000a). Taking on the curse of dimensionality in joint distri-
butions using neural networks. IEEE Transactions on Neural Networks, special issue
on Data Mining and Knowledge Discovery, 11(3), 550–557. 302
Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015). Scheduled sampling for se-
quence prediction with recurrent neural networks. Technical report, arXiv:1506.03099.
287
Bengio, Y. (1991). Artificial Neural Networks and their Application to Sequence Recog-
nition. Ph.D. thesis, McGill University, (Computer Science), Montreal, Canada. 307,
328
Bengio, Y. (1993). A connectionist approach to speech recognition. International Journal
on Pattern Recognition and Artificial Intelligence, 7(4), 647–668. 325
Bengio, Y. (1999a). Markovian models for sequential data. Neural Computing Surveys,
2, 129–162. 325
Bengio, Y. (1999b). Markovian models for sequential data. Neural Computing Surveys,
2, 129–162. 328
Bengio, Y. (2002). New distributed probabilistic language models. Technical Report
1215, Dept. IRO, Université de Montréal. 361
Bengio, Y. (2009). Learning deep architectures for AI . Now Publishers. 139, 182, 184
Bengio, Y. (2013). Estimating or propagating gradients through stochastic neurons.
Technical Report arXiv:1305.2982, Universite de Montreal. 395
Bengio, Y. and Bengio, S. (2000b). Modeling high-dimensional discrete data with multi-
layer neural networks. In NIPS’99 , pages 400–406. MIT Press. 300, 302, 303, 304
Bengio, Y. and Delalleau, O. (2009). Justifying and generalizing contrastive divergence.
Neural Computation, 21(6), 1601–1621. 426, 484, 522
Bengio, Y. and Frasconi, P. (1996). Input/Output HMMs for sequence processing. IEEE
Transactions on Neural Networks, 7(5), 1231–1249. 328
Bengio, Y. and Grandvalet, Y. (2004). No unbiased estimator of the variance of k-fold
cross-validation. In NIPS’03, Cambridge, MA. MIT Press, Cambridge. 109
Bengio, Y. and LeCun, Y. (2007a). Scaling learning algorithms towards AI. In L. Bottou,
O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT
Press. 17, 185
Bengio, Y. and LeCun, Y. (2007b). Scaling learning algorithms towards AI. In Large
Scale Kernel Machines. 139
Bengio, Y. and Monperrus, M. (2005). Non-local manifold tangent learning. In NIPS’04 ,
pages 129–136. MIT Press. 143, 466, 467
Bengio, Y. and Senécal, J.-S. (2003). Quick training of probabilistic neural nets by
importance sampling. In Proceedings of AISTATS 2003 . 364
Bengio, Y. and Senécal, J.-S. (2008). Adaptive importance sampling to accelerate training
of a neural probabilistic language model. IEEE Trans. Neural Networks, 19(4), 713–
722. 364
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1991). Phonetically motivated
acoustic parameters for continuous speech recognition using artificial neural networks.
In Proceedings of EuroSpeech’91 . 23, 352
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992a). Global optimization of a
neural network-hidden Markov model hybrid. IEEE Transactions on Neural Networks,
3(2), 252–259. 325, 328
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992b). Neural network - Gaussian
mixture hybrid for speech recognition or density estimation. In NIPS 4, pages 175–182.
Morgan Kaufmann. 352
Bengio, Y., Frasconi, P., and Simard, P. (1993). The problem of learning long-term depen-
dencies in recurrent networks. In IEEE International Conference on Neural Networks,
pages 1183–1195, San Francisco. IEEE Press. (invited paper). 234, 313
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with
gradient descent is difficult. IEEE Tr. Neural Nets. 234, 235, 236, 305, 312, 313
Bengio, Y., LeCun, Y., Nohl, C., and Burges, C. (1995). Lerec: A NN/HMM hybrid for
on-line handwriting recognition. Neural Computation, 7(6), 1289–1303. 328
Bengio, Y., Ducharme, R., and Vincent, P. (2001a). A neural probabilistic language
model. In NIPS’00, pages 932–938. MIT Press. 16, 343
Bengio, Y., Ducharme, R., and Vincent, P. (2001b). A neural probabilistic language
model. In NIPS’2000, pages 932–938. 355, 356, 357, 366
Bengio, Y., Ducharme, R., and Vincent, P. (2001c). A neural probabilistic language
model. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, NIPS’2000 , pages
932–938. MIT Press. 468, 469
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003a). A neural probabilistic
language model. JMLR, 3, 1137–1155. 356, 360, 366
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003b). A neural probabilistic
language model. Journal of Machine Learning Research, 3, 1137–1155. 468, 469
Bengio, Y., Delalleau, O., and Le Roux, N. (2006a). The curse of highly variable functions
for local kernel machines. In NIPS’2005 . 139
Bengio, Y., Larochelle, H., and Vincent, P. (2006b). Non-local manifold Parzen windows.
In NIPS’2005 . MIT Press. 143, 466
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007a). Greedy layer-wise
training of deep networks. In NIPS’2006 . 12, 16, 184, 432, 433
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007b). Greedy layer-wise
training of deep networks. In NIPS 19 , pages 153–160. MIT Press. 182
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In
ICML’09 . 167, 246
Bengio, Y., Léonard, N., and Courville, A. (2013a). Estimating or propagating gradients
through stochastic neurons for conditional computation. arXiv:1308.3432. 180, 366,
395
Bengio, Y., Yao, L., Alain, G., and Vincent, P. (2013b). Generalized denoising auto-
encoders as generative models. In NIPS’2013. 428, 544, 548
Bengio, Y., Courville, A., and Vincent, P. (2013c). Representation learning: A review and
new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI),
35(8), 1798–1828. 458, 542
Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014a). Deep generative
stochastic networks trainable by backprop. Technical Report arXiv:1306.1091. 395
Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014b). Deep generative
stochastic networks trainable by backprop. In ICML’2014 . 395, 545, 546, 547, 549,
550
Bennett, C. (1976). Efficient estimation of free energy differences from Monte Carlo data.
Journal of Computational Physics, 22(2), 245–268. 500
Berger, A. L., Della Pietra, V. J., and Della Pietra, S. A. (1996). A maximum entropy
approach to natural language processing. Computational Linguistics, 22, 39–71. 367
Berglund, M. and Raiko, T. (2013). Stochastic gradient estimate variance in contrastive
divergence and persistent contrastive divergence. CoRR, abs/1312.6002. 486
Bergstra, J. (2011). Incorporating Complex Cells into Neural Networks for Pattern Clas-
sification. Ph.D. thesis, Université de Montréal. 407
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian,
J., Warde-Farley, D., and Bengio, Y. (2010a). Theano: a CPU and GPU math ex-
pression compiler. In Proceedings of the Python for Scientific Computing Conference
(SciPy). Oral Presentation. 75, 342
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian,
J., Warde-Farley, D., and Bengio, Y. (2010b). Theano: a CPU and GPU math expres-
sion compiler. In Proc. SciPy. 178
Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24(3), 179–195.
488
Bishop, C. M. (1994). Mixture density networks. Technical report, Aston University. 162
Bishop, C. M. (1995a). Regularization and complexity control in feed-forward networks.
In Proceedings International Conference on Artificial Neural Networks ICANN’95 , vol-
ume 1, page 141–148. 205, 213
Bishop, C. M. (1995b). Training with noise is equivalent to Tikhonov regularization.
Neural Computation, 7(1), 108–116. 205
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. 89, 138, 140
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1989). Learnability
and the Vapnik–Chervonenkis dimension. Journal of the ACM, 36(4), 929–965. 102,
103
Bonnet, G. (1964). Transformations des signaux aléatoires à travers les systèmes non
linéaires sans mémoire. Annales des Télécommunications, 19(9–10), 203–220. 180
Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2012). Joint learning of words and
meaning representations for open-text semantic parsing. AISTATS’2012 . 299
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal
margin classifiers. In COLT ’92: Proceedings of the fifth annual workshop on Com-
putational learning theory, pages 144–152, New York, NY, USA. ACM. 16, 129, 139,
155
Bottou, L. (1991). Une approche théorique de l’apprentissage connexioniste; applications
à la reconnaissance de la parole. Ph.D. thesis, Université de Paris XI. 328
Bottou, L. (2011). From machine learning to machine reasoning. Technical report,
arXiv:1102.1808. 299
Bottou, L., Fogelman-Soulié, F., Blanchet, P., and Liénard, J. S. (1990). Speaker inde-
pendent isolated digit recognition: multilayer perceptrons vs dynamic time warping.
Neural Networks, 3, 453–465. 328
Bottou, L., Bengio, Y., and LeCun, Y. (1997). Global training of document processing
systems using graph transformer networks. In Proceedings of the Computer Vision and
Pattern Recognition Conference (CVPR’97), pages 490–494, Puerto Rico. IEEE. 318,
326, 327, 328, 329, 330, 331
Boureau, Y., Ponce, J., and LeCun, Y. (2010). A theoretical analysis of feature pooling in
vision algorithms. In Proc. International Conference on Machine learning (ICML’10).
261
Boureau, Y., Le Roux, N., Bach, F., Ponce, J., and LeCun, Y. (2011). Ask the locals:
multi-way local pooling for image recognition. In Proc. International Conference on
Computer Vision (ICCV’11). IEEE. 261
Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer perceptrons and
singular value decomposition. Biological Cybernetics, 59, 291–294. 404
Bourlard, H. and Morgan, N. (1993). Connectionist Speech Recognition. A Hybrid Ap-
proach, volume 247 of The Kluwer international series in engineering and computer
science. Kluwer Academic Publishers, Boston. 328
Bourlard, H. and Wellekens, C. (1989). Speech pattern discrimination and multi-layered
perceptrons. Computer Speech and Language, 3, 1–19. 352
Bourlard, H. and Wellekens, C. (1990). Links between hidden Markov models and multi-
layer perceptrons. IEEE Transactions on Pattern Analysis and Machine Intelligence,
12, 1167–1178. 328
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University
Press, New York, NY, USA. 85
Brady, M. L., Raghavan, R., and Slawny, J. (1989). Back-propagation fails to separate
where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36, 665–674.
229
Brand, M. (2003). Charting a manifold. In NIPS’2002 , pages 961–968. MIT Press. 145,
464
Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140. 215
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and
Regression Trees. Wadsworth International Group, Belmont, CA. 140
Bridle, J. S. (1990). Alphanets: a recurrent ‘neural’ network architecture with a hidden
Markov model interpretation. Speech Communication, 9(1), 83–92. 158
Brown, P. (1987). The Acoustic-Modeling problem in Automatic Speech Recognition.
Ph.D. thesis, Dept. of Computer Science, Carnegie-Mellon University. 325
Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D.,
Mercer, R. L., and Roossin, P. S. (1990). A statistical approach to machine translation.
Computational linguistics, 16(2), 79–85. 19
Brown, P. F., Pietra, V. J. D., DeSouza, P. V., Lai, J. C., and Mercer, R. L. (1992).
Class-based n-gram models of natural language. Computational Linguistics, 18, 467–
479. 356
Bryson, A. and Ho, Y. (1969). Applied optimal control: optimization, estimation, and
control. Blaisdell Pub. Co. 188
Bryson, Jr., A. E. and Denham, W. F. (1961). A steepest-ascent method for solving
optimum programming problems. Technical Report BR-1303, Raytheon Company,
Missile and Space Division. 188
Buchberger, B., Collins, G. E., Loos, R., and Albrecht, R. (1983). Computer Algebra.
Springer-Verlag. 178
Buciluǎ, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model compression. In Pro-
ceedings of the 12th ACM SIGKDD international conference on Knowledge discovery
and data mining, pages 535–541. ACM. 343
Cai, M., Shi, Y., and Liu, J. (2013). Deep maxout neural networks for speech recognition.
In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop
on, pages 291–296. IEEE. 187
Carreira-Perpiñán, M. A. and Hinton, G. E. (2005). On contrastive divergence learning.
In R. G. Cowell and Z. Ghahramani, editors, AISTATS’2005, pages 33–40. Society for
Artificial Intelligence and Statistics. 484, 522
Caruana, R. (1993). Multitask connectionist learning. In Proc. 1993 Connectionist Models
Summer School, pages 372–379. 222
Cauchy, A. (1847a). Méthode générale pour la résolution de systèmes d’équations si-
multanées. In Compte rendu des séances de l’académie des sciences, pages 536–538.
77
Cauchy, L. A. (1847b). Méthode générale pour la résolution des systèmes d’équations
simultanées. Compte Rendu à l’Académie des Sciences. 188
Cayton, L. (2005). Algorithms for manifold learning. Technical Report CS2008-0923,
UCSD. 145, 461
Chapelle, O., Weston, J., and Schölkopf, B. (2003). Cluster kernels for semi-supervised
learning. In NIPS’02 , pages 585–592, Cambridge, MA. MIT Press. 446
Chapelle, O., Sch¨olkopf, B., and Zien, A., editors (2006). Semi-Supervised Learning. MIT
Press, Cambridge, MA. 446
Chellapilla, K., Puri, S., and Simard, P. (2006). High Performance Convolutional Neu-
ral Networks for Document Processing. In Guy Lorette, editor, Tenth International
Workshop on Frontiers in Handwriting Recognition, La Baule (France). Université de
Rennes 1, Suvisoft. http://www.suvisoft.com. 20, 23, 341
Chen, S. F. and Goodman, J. T. (1999). An empirical study of smoothing techniques for
language modeling. Computer Speech and Language, 13(4), 359–393. 317, 318, 367
Chilimbi, T., Suzue, Y., Apacible, J., and Kalyanaraman, K. (2014). Project Adam:
Building an efficient and scalable deep learning training system. In OSDI’14. 343
Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio,
Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical
machine translation. In Proceedings of the Empiricial Methods in Natural Language
Processing (EMNLP 2014). 312, 368
Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2014). The
loss surface of multilayer networks. 229, 435
Chorowski, J., Bahdanau, D., Cho, K., and Bengio, Y. (2014). End-to-end continuous
speech recognition using attention-based recurrent nn: First results. arXiv:1412.1602.
353
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated
recurrent neural networks on sequence modeling. NIPS’2014 Deep Learning workshop,
arXiv 1412.3555. 353
Ciresan, D., Meier, U., Masci, J., and Schmidhuber, J. (2012). Multi-column deep neural
network for traffic sign classification. Neural Networks, 32, 333–338. 21, 182, 184
Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2010). Deep big
simple neural nets for handwritten digit recognition. Neural Computation, 22, 1–14.
20, 23, 341
Coates, A. and Ng, A. Y. (2011). The importance of encoding versus training with sparse
coding and vector quantization. In ICML’2011 . 23
Coates, A., Lee, H., and Ng, A. Y. (2011). An analysis of single-layer networks in un-
supervised feature learning. In Proceedings of the Thirteenth International Conference
on Artificial Intelligence and Statistics (AISTATS 2011). 347
Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., and Andrew, N. (2013). Deep
learning with COTS HPC systems. In S. Dasgupta and D. McAllester, editors, Proceedings
of the 30th International Conference on Machine Learning (ICML-13), volume 28 (3),
pages 1337–1345. JMLR Workshop and Conference Proceedings. 20, 23, 272, 343
Collobert, R. (2004). Large Scale Machine Learning. Ph.D. thesis, Université de Paris
VI, LIP6. 155
Collobert, R. (2011). Deep learning for efficient discriminative parsing. In AISTATS’2011.
92
Collobert, R. and Weston, J. (2008). A unified architecture for natural language process-
ing: Deep neural networks with multitask learning. In ICML’2008 . 365
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P.
(2011a). Natural language processing (almost) from scratch. Journal of Machine
Learning Research, 12, 2493–2537. 247
Collobert, R., Kavukcuoglu, K., and Farabet, C. (2011b). Torch7: A Matlab-like envi-
ronment for machine learning. In BigLearn, NIPS Workshop. 342
Comon, P. (1994). Independent component analysis - a new concept? Signal Processing,
36, 287–314. 414, 415
Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning, 20,
273–297. 16, 129, 139
Couprie, C., Farabet, C., Najman, L., and LeCun, Y. (2013). Indoor semantic segmenta-
tion using depth information. In International Conference on Learning Representations
(ICLR2013). 21, 182, 184
Courville, A., Bergstra, J., and Bengio, Y. (2011). Unsupervised models of images by
spike-and-slab RBMs. In ICML’11 . 375, 539
Courville, A., Desjardins, G., Bergstra, J., and Bengio, Y. (2014). The spike-and-slab
RBM and extensions to discrete and sparse data distributions. Pattern Analysis and
Machine Intelligence, IEEE Transactions on, 36(9), 1874–1887. 540
Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd Edition.
Wiley-Interscience. 54
Cox, D. and Pinto, N. (2011). Beyond simple features: A large-scale feature search
approach to unconstrained face recognition. In Automatic Face & Gesture Recognition
and Workshops (FG 2011), 2011 IEEE International Conference on, pages 8–15. IEEE.
272
Cox, R. T. (1946). Probability, frequency and reasonable expectation. American Journal
of Physics, 14, 1–10. 47
Cramér, H. (1946). Mathematical methods of statistics. Princeton University Press. 121
Crick, F. H. C. and Mitchison, G. (1983). The function of dream sleep. Nature, 304,
111–114. 482
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathe-
matics of Control, Signals, and Systems, 2, 303–314. 180, 455
Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition
with the mean-covariance restricted Boltzmann machine. In NIPS’2010 . 21
Dahl, G. E., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained
deep neural networks for large vocabulary speech recognition. IEEE Transactions on
Audio, Speech, and Language Processing, 20(1), 33–42. 352
Dahl, G. E., Jaitly, N., and Salakhutdinov, R. (2014). Multi-task neural networks for
QSAR predictions. arXiv:1406.1231. 22
Dauphin, Y. and Bengio, Y. (2013). Stochastic ratio matching of RBMs for sparse high-
dimensional inputs. In NIPS26 . NIPS Foundation. 492
Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Large-scale learning of embeddings with
reconstruction sampling. In ICML’2011 . 365
Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014).
Identifying and attacking the saddle point problem in high-dimensional non-convex
optimization. In NIPS’2014 . 79, 229, 435
Davis, A., Rubinstein, M., Wadhwa, N., Mysore, G., Durand, F., and Freeman, W. T.
(2014). The visual microphone: Passive recovery of sound from video. ACM Transac-
tions on Graphics (Proc. SIGGRAPH), 33(4), 79:1–79:10. 345
de Finetti, B. (1937). La prévision: ses lois logiques, ses sources subjectives. Annales de
l’institut Henri Poincaré, 7, 1–68. 47
Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q., Mao, M., Ranzato, M.,
Senior, A., Tucker, P., Yang, K., and Ng, A. Y. (2012). Large scale distributed deep
networks. In NIPS’2012 . 343
Delalleau, O. and Bengio, Y. (2011). Shallow vs. deep sum-product networks. In NIPS.
17, 182, 455, 456
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A
Large-Scale Hierarchical Image Database. In CVPR09 . 19, 134
Deng, J., Berg, A. C., Li, K., and Fei-Fei, L. (2010a). What does classifying more than
10,000 image categories tell us? In Proceedings of the 11th European Conference on
Computer Vision: Part V , ECCV’10, pages 71–84, Berlin, Heidelberg. Springer-Verlag.
19
Deng, J., Ding, N., Jia, Y., Frome, A., Murphy, K., Bengio, S., Li, Y., Neven, H., and
Adam, H. (2014). Large-scale object classification using label relation graphs. In
ECCV’2014 , pages 48–64. 318
Deng, L. and Yu, D. (2014). Deep learning – methods and applications. Foundations and
Trends in Signal Processing. 353
Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton, G. (2010b). Bi-
nary coding of speech spectrograms using a deep auto-encoder. In Interspeech 2010 ,
Makuhari, Chiba, Japan. 21
Desjardins, G. and Bengio, Y. (2008). Empirical evaluation of convolutional RBMs
for vision. Technical Report 1327, Département d’Informatique et de Recherche
Opérationnelle, Université de Montréal. 540
Desjardins, G., Courville, A., and Bengio, Y. (2011). On tracking the partition function.
In NIPS’2011 . 500
Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., and Makhoul, J. (2014). Fast
and robust neural network joint models for statistical machine translation. In Proc.
ACL’2014 . 368
DiCarlo, J. J. (2013). Mechanisms underlying visual object recognition: Humans vs.
neurons vs. machines. NIPS Tutorial. 22, 275
Do, T.-M.-T. and Arti`eres, T. (2010). Neural conditional random fields. In International
Conference on Artificial Intelligence and Statistics, pages 177–184. 318
Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko,
K., and Darrell, T. (2014). Long-term recurrent convolutional networks for visual
recognition and description. arXiv:1411.4389. 92
Donoho, D. L. and Grimes, C. (2003). Hessian eigenmaps: new locally linear embed-
ding techniques for high-dimensional data. Technical Report 2003-08, Dept. Statistics,
Stanford University. 145, 464
Doob, J. (1953). Stochastic processes. Wiley: New York. 47
Doya, K. (1993). Bifurcations of recurrent neural networks in gradient descent learning.
IEEE Transactions on Neural Networks, 1, 75–80. 236, 305
Dreyfus, S. E. (1962). The numerical solution of variational problems. Journal of Math-
ematical Analysis and Applications, 5(1), 30–45. 188
Dreyfus, S. E. (1973). The computational solution of optimal control problems with time
lag. IEEE Transactions on Automatic Control, 18(4), 383–385. 188
Dugas, C., Bengio, Y., Bélisle, F., and Nadeau, C. (2001). Incorporating second-order
functional knowledge for better option pricing. In NIPS’00 , pages 472–478. MIT Press.
62, 155
El Hihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term
dependencies. In NIPS 8 . MIT Press. 297, 316, 317
El Hihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term
dependencies. In NIPS’1995 . 308
Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., and Vincent, P. (2009). The difficulty
of training deep architectures and the effect of unsupervised pre-training. In Proceedings
of AISTATS’2009. 182, 184
Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., and Bengio, S. (2010).
Why does unsupervised pre-training help deep learning? J. Machine Learning Res.
433, 435, 436, 437
Fang, H., Gupta, S., Iandola, F., Srivastava, R., Deng, L., Dollár, P., Gao, J., He, X.,
Mitchell, M., Platt, J. C., Zitnick, C. L., and Zweig, G. (2015). From captions to visual
concepts and back. arXiv:1411.4952. 92
Farabet, C., LeCun, Y., Kavukcuoglu, K., Culurciello, E., Martini, B., Akselrod, P.,
and Talay, S. (2011). Large-scale FPGA-based convolutional networks. In R. Bekker-
man, M. Bilenko, and J. Langford, editors, Scaling up Machine Learning: Parallel and
Distributed Approaches. Cambridge University Press. 422
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013a). Learning hierarchical
features for scene labeling. IEEE Transactions on Pattern Analysis and Machine In-
telligence. 21, 182, 184
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013b). Learning hierarchical
features for scene labeling. IEEE Transactions on Pattern Analysis and Machine In-
telligence, 35(8), 1915–1929. 318
Fei-Fei, L., Fergus, R., and Perona, P. (2006). One-shot learning of object categories.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611.
442
Fischer, A. and Igel, C. (2011). Bounding the bias of contrastive divergence learning.
Neural Computation, 23(3), 664–73. 522
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals
of Eugenics, 7, 179–188. 19, 94
Frasconi, P., Gori, M., and Sperduti, A. (1997). On the efficient classification of data
structures by neural networks. In Proc. Int. Joint Conf. on Artificial Intelligence. 299
Frasconi, P., Gori, M., and Sperduti, A. (1998). A general framework for adaptive pro-
cessing of data structures. IEEE Transactions on Neural Networks, 9(5), 768–786.
299
Frey, B. J. (1998). Graphical models for machine learning and digital communication.
MIT Press. 300, 301
Frey, B. J., Hinton, G. E., and Dayan, P. (1996). Does the wake-sleep algorithm learn
good density estimators? In NIPS’95 , pages 661–670. MIT Press, Cambridge, MA.
300
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mech-
anism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36,
193–202. 14, 20, 23, 276
Garson, J. (1900). The metric system of identification of criminals, as used in Great
Britain and Ireland. The Journal of the Anthropological Institute of Great Britain and
Ireland, (2), 177–227. 19
Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedfor-
ward neural networks. In AISTATS’2010 . 154
Glorot, X., Bordes, A., and Bengio, Y. (2011a). Deep sparse rectifier neural networks. In
AISTATS’2011 . 14, 155, 421
Glorot, X., Bordes, A., and Bengio, Y. (2011b). Deep sparse rectifier neural networks.
In JMLR W&CP: Proceedings of the Fourteenth International Conference on Artificial
Intelligence and Statistics (AISTATS 2011). 186, 421
Glorot, X., Bordes, A., and Bengio, Y. (2011c). Domain adaptation for large-scale senti-
ment classification: A deep learning approach. In ICML’2011 . 421, 441
Gong, S., McKenna, S., and Psarrou, A. (2000). Dynamic Vision: From Images to Face
Recognition. Imperial College Press. 465, 467
Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep
networks. In NIPS’2009 , pages 646–654. 408, 420
Goodfellow, I., Koenig, N., Muja, M., Pantofaru, C., Sorokin, A., and Takayama, L.
(2010). Help me help you: Interfaces for personal robots. In Proc. of Human Robot
Interaction (HRI), Osaka, Japan. ACM Press, ACM Press. 90
Goodfellow, I., Courville, A., and Bengio, Y. (2012). Large-scale feature learning with
spike-and-slab sparse coding. In ICML’2012 . 417
Goodfellow, I. J. (2010). Technical report: Multidimensional, downsampled convolution
for autoencoders. Technical report, Université de Montréal. 267
Goodfellow, I. J., Courville, A., and Bengio, Y. (2011). Spike-and-slab sparse coding
for unsupervised feature discovery. In NIPS Workshop on Challenges in Learning
Hierarchical Models. 182, 184, 441
Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a).
Maxout networks. In S. Dasgupta and D. McAllester, editors, ICML’13 , pages 1319–
1327. 187, 220, 274, 347
Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction
deep Boltzmann machines. In NIPS26 . NIPS Foundation. 91, 490, 536, 538
Goodfellow, I. J., Courville, A., and Bengio, Y. (2013c). Scaling up spike-and-slab models
for unsupervised feature learning. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 35(8), 1902–1914. 540
Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2014a). An em-
pirical investigation of catastrophic forgetting in gradient-based neural networks. In
ICLR’2014 . 187
Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014b). Explaining and harnessing adver-
sarial examples. CoRR, abs/1412.6572. 223, 225
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,
Courville, A., and Bengio, Y. (2014c). Generative adversarial networks. In NIPS’2014.
180
Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., and Shet, V. (2014d). Multi-
digit number recognition from Street View imagery using deep convolutional neural
networks. In International Conference on Learning Representations. 21, 91, 182, 183,
184, 334
Goodman, J. (2001). Classes for fast maximum entropy training. In International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP), Utah. 361
Gori, M. and Tesi, A. (1992). On the problem of local minima in backpropagation. IEEE
Transactions on Pattern Analysis and Machine Intelligence, PAMI-14(1), 76–86. 229
Gosset, W. S. (1908). The probable error of a mean. Biometrika, 6(1), 1–25. Originally
published under the pseudonym “Student”. 19
Gouws, S., Bengio, Y., and Corrado, G. (2014). BilBOWA: Fast bilingual distributed
representations without word alignments. Technical report, arXiv:1410.2455. 444
Graves, A. (2011a). Practical variational inference for neural networks. In J. Shawe-
Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in
Neural Information Processing Systems 24 , pages 2348–2356. Curran Associates, Inc.
204
Graves, A. (2011b). Practical variational inference for neural networks. In NIPS’2011 .
206
Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks. Studies
in Computational Intelligence. Springer. 282, 296, 311, 312, 318, 353
Graves, A. (2013). Generating sequences with recurrent neural networks. Technical
report, arXiv:1308.0850. 163, 311
Graves, A. and Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent
neural networks. In ICML’2014 . 311
Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirec-
tional LSTM and other neural network architectures. Neural Networks, 18(5), 602–610.
296
Graves, A. and Schmidhuber, J. (2009). Offline handwriting recognition with multidi-
mensional recurrent neural networks. In D. Koller, D. Schuurmans, Y. Bengio, and
L. Bottou, editors, NIPS’2008 , pages 545–552. 296
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist tempo-
ral classification: Labelling unsegmented sequence data with recurrent neural networks.
In ICML’2006 , pages 369–376, Pittsburgh, USA. 318, 353
Graves, A., Liwicki, M., Bunke, H., Schmidhuber, J., and Fernández, S. (2008). Uncon-
strained on-line handwriting recognition with recurrent neural networks. In J. Platt,
D. Koller, Y. Singer, and S. Roweis, editors, NIPS’2007 , pages 577–584. 296
Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., and Schmidhuber,
J. (2009). A novel connectionist system for unconstrained handwriting recognition.
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5), 855–868.
311
Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep recur-
rent neural networks. In ICASSP’2013 , pages 6645–6649. 296, 297, 311, 312, 353
Graves, A., Wayne, G., and Danihelka, I. (2014). Neural Turing machines.
arXiv:1410.5401. 22
Gregor, K. and LeCun, Y. (2010). Emergence of complex-like cells in a temporal product
network with local receptive fields. Technical report, arXiv:1006.0448. 266
Gülçehre, Ç. and Bengio, Y. (2013). Knowledge matters: Importance of prior infor-
mation for optimization. In International Conference on Learning Representations
(ICLR’2013). 21, 243
Gutmann, M. and Hyvärinen, A. (2010). Noise-contrastive estimation: A new estima-
tion principle for unnormalized statistical models. In Proceedings of The Thirteenth
International Conference on Artificial Intelligence and Statistics (AISTATS’10). 492
Hadsell, R., Sermanet, P., Ben, J., Erkan, A., Han, J., Muller, U., and LeCun, Y.
(2007). Online learning for offroad robots: Spatial label propagation to learn long-
range traversability. In Proceedings of Robotics: Science and Systems, Atlanta, GA,
USA. 346
Haffner, P., Franzini, M., and Waibel, A. (1991). Integrating time alignment and neural
networks for high performance continuous speech recognition. In International Confer-
ence on Acoustics, Speech and Signal Processing (ICASSP), pages 105–108, Toronto.
328
Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings
of the 18th annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley,
California. ACM Press. 182, 455
Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits.
Computational Complexity, 1, 113–129. 182, 455
Hebb, D. O. (1949). The Organization of Behavior. Wiley, New York. 15
Henaff, M., Jarrett, K., Kavukcuoglu, K., and LeCun, Y. (2011). Unsupervised learning
of sparse features for scalable audio classification. In ISMIR’11 . 422
Herault, J. and Ans, B. (1984). Circuits neuronaux à synapses modifiables: Décodage de
messages composites par apprentissage non supervisé. Comptes Rendus de l’Académie
des Sciences, 299(III-13), 525–528. 414
Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V.,
Nguyen, P., Sainath, T., and Kingsbury, B. (2012a). Deep neural networks for acoustic
modeling in speech recognition. IEEE Signal Processing Magazine, 29(6), 82–97. 21,
353
Hinton, G. E. (2000). Training products of experts by minimizing contrastive divergence.
Technical Report GCNU TR 2000-004, Gatsby Unit, University College London. 483
Hinton, G. E. and Roweis, S. (2003). Stochastic neighbor embedding. In NIPS’2002 . 464
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with
neural networks. Science, 313(5786), 504–507. 410, 432, 433
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the Dimensionality of Data with
Neural Networks. Science, 313, 504–507. 435
Hinton, G. E. and Zemel, R. S. (1994). Autoencoders, minimum description length, and
Helmholtz free energy. In NIPS’1993 . 404
Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief
nets. Neural Computation, 18, 1527–1554. 12, 16, 23, 130, 432, 433, 523
Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A.,
Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012b). Deep neural
networks for acoustic modeling in speech recognition: The shared views of four research
groups. IEEE Signal Process. Mag., 29(6), 82–97. 91
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.
(2012c). Improving neural networks by preventing co-adaptation of feature detectors.
Technical report, arXiv:1207.0580. 201
Hinton, G. E., Vinyals, O., and Dean, J. (2014). Dark knowledge. Invited talk at the
BayLearn Bay Area Machine Learning Symposium. 344
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma
thesis, T.U. München. 234, 305, 313
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computa-
tion, 9(8), 1735–1780. 22, 311, 312
Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2000).
Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In
J. Kolen and S. Kremer, editors, Field Guide to Dynamical Recurrent Networks. IEEE
Press. 312
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are
universal approximators. Neural Networks, 2, 359–366. 180, 455
Hornik, K., Stinchcombe, M., and White, H. (1990). Universal approximation of an
unknown mapping and its derivatives using multilayer feedforward networks. Neural
Networks, 3(5), 551–560. 180
Horst, R., Pardalos, P., and Thoai, N. (2000). Introduction to Global Optimization.
Kluwer Academic Publishers. Second Edition. 245
Hsu, F.-H. (2002). Behind Deep Blue: Building the Computer That Defeated the World
Chess Champion. Princeton University Press, Princeton, NJ, USA. 2
Huang, F. and Ogata, Y. (2002). Generalized pseudo-likelihood estimates for Markov
random fields on lattice. Annals of the Institute of Statistical Mathematics, 54(1),
1–18. 489
Hubel, D. and Wiesel, T. (1968). Receptive fields and functional architecture of monkey
striate cortex. Journal of Physiology (London), 195, 215–243. 273
Hubel, D. H. and Wiesel, T. N. (1959). Receptive fields of single neurons in the cat’s
striate cortex. Journal of Physiology, 148, 574–591. 273
Hubel, D. H. and Wiesel, T. N. (1962). Receptive fields, binocular interaction, and
functional architecture in the cat’s visual cortex. Journal of Physiology (London),
160, 106–154. 273
Hyötyniemi, H. (1996). Turing machines are recurrent neural networks. In STeP’96,
pages 13–24. 284
Hyvärinen, A. (1999). Survey on independent component analysis. Neural Computing
Surveys, 2, 94–128. 414
Hyvärinen, A. (2005a). Estimation of non-normalized statistical models using score
matching. J. Machine Learning Res., 6. 425
Hyvärinen, A. (2005b). Estimation of non-normalized statistical models using score
matching. Journal of Machine Learning Research, 6, 695–709. 490
Hyvärinen, A. (2007a). Connections between score matching, contrastive divergence,
and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural
Networks, 18, 1529–1531. 491
Hyvärinen, A. (2007b). Some extensions of score matching. Computational Statistics and
Data Analysis, 51, 2499–2512. 491
Hyvärinen, A. and Pajunen, P. (1999). Nonlinear independent component analysis: Ex-
istence and uniqueness results. Neural Networks, 12(3), 429–439. 415
Hyvärinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis.
Wiley-Interscience. 414
Hyvärinen, A., Hurri, J., and Hoyer, P. O. (2009). Natural Image Statistics: A proba-
bilistic approach to early computational vision. Springer-Verlag. 279
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training
by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 . 21, 90
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixture
of local experts. Neural Computation, 3, 79–87. 162
Jaeger, H. (2003). Adaptive nonlinear system identification with echo state networks. In
Advances in Neural Information Processing Systems 15 . 306
Jaeger, H. (2007a). Discovering multiscale dynamical features with hierarchical echo state
networks. Technical report, Jacobs University. 297
Jaeger, H. (2007b). Echo state network. Scholarpedia, 2(9), 2330. 305
Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and
saving energy in wireless communication. Science, 304(5667), 78–80. 23, 305
Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., Mooij, J. M., and Schölkopf, B. (2012).
On causal and anticausal learning. In ICML’2012 , pages 1255–1262. 447, 449
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009a). What is the best
multi-stage architecture for object recognition? In ICCV’09. 14, 155, 422
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009b). What is the best
multi-stage architecture for object recognition? In Proc. International Conference on
Computer Vision (ICCV’09), pages 2146–2153. IEEE. 20, 23, 186, 272
Jarzynski, C. (1997). Nonequilibrium equality for free energy differences. Phys. Rev.
Lett., 78, 2690–2693. 499
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University
Press. 46
Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2014). On using very large target
vocabulary for neural machine translation. arXiv:1412.2007. 368
Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters
from sparse data. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in
Practice. North-Holland, Amsterdam. 317, 367
Jia, Y., Huang, C., and Darrell, T. (2012). Beyond spatial pyramids: Receptive field learn-
ing for pooled image features. In Computer Vision and Pattern Recognition (CVPR),
2012 IEEE Conference on, pages 3370–3377. IEEE. 261
Jim, K.-C., Giles, C. L., and Horne, B. G. (1996). An analysis of noise in recurrent neural
networks: convergence and generalization. IEEE Transactions on Neural Networks,
7(6), 1424–1438. 204, 206
Jordan, M. I. (1998). Learning in Graphical Models. Kluwer, Dordrecht, Netherlands. 16
Juang, B. H. and Katagiri, S. (1992). Discriminative learning for minimum error classi-
fication. IEEE Transactions on Signal Processing, 40(12), 3043–3054. 325
Jutten, C. and Herault, J. (1991). Blind separation of sources, part I: an adaptive algo-
rithm based on neuromimetic architecture. Signal Processing, 24, 1–10. 414
Kahou, S. E., Pal, C., Bouthillier, X., Froumenty, P., Gülçehre, Ç., Memisevic, R., Vin-
cent, P., Courville, A., Bengio, Y., Ferrari, R. C., Mirza, M., Jean, S., Carrier, P.-L.,
Dauphin, Y., Boulanger-Lewandowski, N., Aggarwal, A., Zumer, J., Lamblin, P., Ray-
mond, J.-P., Desjardins, G., Pascanu, R., Warde-Farley, D., Torabi, A., Sharma, A.,
Bengio, E., Côté, M., Konda, K. R., and Wu, Z. (2013). Combining modality specific
deep neural networks for emotion recognition in video. In Proceedings of the 15th ACM
on International Conference on Multimodal Interaction. 182, 184
Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In
EMNLP’2013 . 368
Kamyshanska, H. and Memisevic, R. (2015). The potential energy of an autoencoder.
IEEE Transactions on Pattern Analysis and Machine Intelligence. 428
Karpathy, A. and Li, F.-F. (2015). Deep visual-semantic alignments for generating image
descriptions. In CVPR’2015 . arXiv:1412.2306. 92
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014).
Large-scale video classification with convolutional neural networks. In CVPR. 19
Karush, W. (1939). Minima of Functions of Several Variables with Inequalities as Side
Constraints. Master’s thesis, Dept. of Mathematics, Univ. of Chicago. 87
Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model
component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal
Processing, ASSP-35(3), 400–401. 317, 367
Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008a). Fast inference in sparse coding
algorithms with applications to object recognition. CBLL-TR-2008-12-01, NYU. 407
Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008b). Fast inference in sparse coding
algorithms with applications to object recognition. Technical report, Computational
and Biological Learning Lab, Courant Institute, NYU. Tech Report CBLL-TR-2008-
12-01. 422
Kavukcuoglu, K., Ranzato, M.-A., Fergus, R., and LeCun, Y. (2009). Learning invariant
features through topographic filter maps. In CVPR’2009. 422
Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y.
(2010a). Learning convolutional feature hierarchies for visual recognition. In Advances
in Neural Information Processing Systems 23 (NIPS’10), pages 1090–1098. 272
Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and Le-
Cun, Y. (2010b). Learning convolutional feature hierarchies for visual recognition.
In NIPS’2010 . 422
Kelley, H. J. (1960). Gradient theory of optimal flight paths. ARS Journal, 30(10),
947–954. 188
Khan, F., Zhu, X., and Mutlu, B. (2011). How do humans teach: On curriculum learning
and teaching dimension. In Advances in Neural Information Processing Systems 24
(NIPS’11), pages 1449–1457. 247
Kindermann, R. (1980). Markov Random Fields and Their Applications (Contemporary
Mathematics ; V. 1). American Mathematical Society. 379
Kingma, D. and LeCun, Y. (2010a). Regularized estimation of image statistics by score
matching. In NIPS’2010 . 425
Kingma, D. and LeCun, Y. (2010b). Regularized estimation of image statistics by score
matching. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta,
editors, Advances in Neural Information Processing Systems 23, pages 1126–1134. 492
Kingma, D., Rezende, D., Mohamed, S., and Welling, M. (2014). Semi-supervised learning
with deep generative models. In NIPS’2014. 395
Kingma, D. P. (2013). Fast gradient-based inference with continuous latent variable
models in auxiliary form. Technical report, arXiv:1306.0733. 180, 395
Kingma, D. P. and Welling, M. (2014a). Auto-encoding variational Bayes. In Proceedings
of the International Conference on Learning Representations (ICLR). 180, 395, 467,
468
Kingma, D. P. and Welling, M. (2014b). Efficient gradient-based inference through trans-
formations between Bayes nets and neural nets. Technical report, arXiv:1402.0480. 180,
394, 395
Kirkpatrick, S., Gelatt Jr., C. D., and Vecchi, M. P. (1983). Optimization by simulated
annealing. Science, 220, 671–680. 245
Kiros, R., Salakhutdinov, R., and Zemel, R. (2014a). Multimodal neural language models.
In ICML’2014 . 92
Kiros, R., Salakhutdinov, R., and Zemel, R. (2014b). Unifying visual-semantic embed-
dings with multimodal neural language models. arXiv:1411.2539 [cs.LG]. 92, 311
Klementiev, A., Titov, I., and Bhattarai, B. (2012). Inducing crosslingual distributed
representations of words. In Proceedings of COLING 2012 . 444
Knowles-Barley, S., Jones, T. R., Morgan, J., Lee, D., Kasthuri, N., Lichtman, J. W., and
Pfister, H. (2014). Deep learning for the connectome. GPU Technology Conference. 22
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and
Techniques. MIT Press. 323, 393, 400
Konig, Y., Bourlard, H., and Morgan, N. (1996). REMAP: Recursive estimation and max-
imization of a posteriori probabilities - application to transition-based connectionist
speech recognition. In NIPS’95 . MIT Press, Cambridge, MA. 352
Koren, Y. (2009). The BellKor solution to the Netflix Grand Prize. 218
Koutnik, J., Greff, K., Gomez, F., and Schmidhuber, J. (2014). A clockwork RNN. In
ICML’2014 . 297, 317
Kočiský, T., Hermann, K. M., and Blunsom, P. (2014). Learning Bilingual Word Repre-
sentations by Marginalizing Alignments. In Proceedings of ACL. 369
Krause, O., Fischer, A., Glasmachers, T., and Igel, C. (2013). Approximation properties
of DBNs with binary hidden units and real-valued visible units. In ICML’2013. 455
Krizhevsky, A. (2010). Convolutional deep belief networks on CIFAR-
10. Technical report, University of Toronto. Unpublished Manuscript:
http://www.cs.utoronto.ca/~kriz/conv-cifar10-aug2010.pdf. 342
Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny
images. Technical report, University of Toronto. 19, 375
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012a). ImageNet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems
25 (NIPS’2012). 20, 23, 90, 346
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012b). ImageNet classification with deep
convolutional neural networks. In NIPS’2012 . 21, 182, 184, 421
Kuhn, H. W. and Tucker, A. W. (1951). Nonlinear programming. In Proceedings of the
Second Berkeley Symposium on Mathematical Statistics and Probability, pages 481–
492, Berkeley, Calif. University of California Press. 87
Kumar, M. P., Packer, B., and Koller, D. (2010). Self-paced learning for latent variable
models. In NIPS’2010 . 247
Lafferty, J., McCallum, A., and Pereira, F. C. N. (2001). Conditional random fields:
Probabilistic models for segmenting and labeling sequence data. In C. E. Brodley and
A. P. Danyluk, editors, ICML 2001 . Morgan Kaufmann. 318, 326
Lang, K. J. and Hinton, G. E. (1988). The development of the time-delay neural net-
work architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-
Mellon University. 282, 307
Lappalainen, H., Giannakopoulos, X., Honkela, A., and Karhunen, J. (2000). Nonlinear
independent component analysis using ensemble learning: Experiments and discussion.
In Proc. ICA. Citeseer. 415
Larochelle, H. and Bengio, Y. (2008a). Classification using discriminative restricted Boltz-
mann machines. In ICML’2008 . 408, 551
Larochelle, H. and Bengio, Y. (2008b). Classification using discriminative restricted
Boltzmann machines. In ICML’08 , pages 536–543. ACM. 446
Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator.
In AISTATS’2011 . 300, 303
Larochelle, H., Erhan, D., and Bengio, Y. (2008). Zero-data learning of new tasks. In
AAAI Conference on Artificial Intelligence. 442
Lasserre, J. A., Bishop, C. M., and Minka, T. P. (2006). Principled hybrids of generative
and discriminative models. In Proceedings of the Computer Vision and Pattern Recog-
nition Conference (CVPR’06), pages 87–94, Washington, DC, USA. IEEE Computer
Society. 446
Le, Q., Ngiam, J., Chen, Z., hao Chia, D. J., Koh, P. W., and Ng, A. (2010). Tiled
convolutional neural networks. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor,
R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems
23 (NIPS’10), pages 1279–1287. 266
Le, Q., Ranzato, M., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J., and Ng,
A. (2012). Building high-level features using large scale unsupervised learning. In
ICML’2012 . 20, 23
Le Roux, N. and Bengio, Y. (2010). Deep belief networks are compact universal approx-
imators. Neural Computation, 22(8), 2192–2207. 455
Le Roux, N., Manzagol, P.-A., and Bengio, Y. (2008). Topmoumoute online natural
gradient algorithm. In NIPS’07 . 166
LeCun, Y. (1985). Une procédure d’apprentissage pour Réseau à seuil assymétrique. In
Cognitiva 85: À la Frontière de l’Intelligence Artificielle, des Sciences de la Connais-
sance et des Neurosciences, pages 599–604, Paris 1985. CESTA, Paris. 188
LeCun, Y. (1987). Modèles connexionistes de l’apprentissage. Ph.D. thesis, Université de
Paris VI. 16, 404
LeCun, Y., Jackel, L. D., Boser, B., Denker, J. S., Graf, H. P., Guyon, I., Henderson, D.,
Howard, R. E., and Hubbard, W. (1989). Handwritten digit recognition: Applications
of neural network chips and automatic learning. IEEE Communications Magazine,
27(11), 41–46. 215, 276
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998a). Gradient-based learning
applied to document recognition. Proceedings of the IEEE , 86(11), 2278–2324. 14, 23
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998b). Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. 16, 19,
318, 326, 327, 328, 353
Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area
V2. In NIPS’07 . 408
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). Convolutional deep belief
networks for scalable unsupervised learning of hierarchical representations. In L. Bottou
and M. Littman, editors, ICML 2009. ACM, Montreal, Canada. 272, 540, 541
Lee, Y. J. and Grauman, K. (2011). Learning the easy things first: self-paced visual
category discovery. In CVPR’2011 . 247
Leibniz, G. W. (1676). Memoir using the chain rule. (Cited in TMME 7:2&3 p 321-332,
2010). 188
Lenat, D. B. and Guha, R. V. (1989). Building large knowledge-based systems; represen-
tation and inference in the Cyc project. Addison-Wesley Longman Publishing Co., Inc.
2
Leprieur, H. and Haffner, P. (1995). Discriminant learning with minimum memory loss
for improved non-vocabulary rejection. In EUROSPEECH’95, Madrid, Spain. 325
L’Hôpital, G. F. A. (1696). Analyse des infiniment petits, pour l’intelligence des lignes
courbes. Paris: L’Imprimerie Royale. 188
Lin, T., Horne, B. G., Tino, P., and Giles, C. L. (1996). Learning long-term dependencies
is not as difficult with NARX recurrent neural networks. IEEE Transactions on Neural
Networks, 7(6), 1329–1338. 308
Linde, N. (1992). The machine that changed the world, episode 3. Documentary minis-
eries. 2
Linnainmaa, S. (1976). Taylor expansion of the accumulated rounding error. BIT Nu-
merical Mathematics, 16(2), 146–160. 188
Long, P. M. and Servedio, R. A. (2010). Restricted Boltzmann machines are hard to ap-
proximately evaluate or simulate. In Proceedings of the 27th International Conference
on Machine Learning (ICML’10). 517
Lovelace, A. (1842). Notes upon L. F. Menabrea’s “Sketch of the Analytical Engine
invented by Charles Babbage”. 1
Lowerre, B. (1976). The Harpy Speech Recognition System. Ph.D. thesis, Carnegie Mellon
University. 319, 325, 332
Lukoševičius, M. and Jaeger, H. (2009). Reservoir computing approaches to recurrent
neural network training. Computer Science Review, 3(3), 127–149. 305
Luo, H., Carrier, P.-L., Courville, A., and Bengio, Y. (2013). Texture modeling with
convolutional spike-and-slab RBMs and deep extensions. In AISTATS’2013 . 92
Lyness, J. N. and Moler, C. B. (1967). Numerical differentiation of analytic functions.
SIAM J. Numer. Anal., 4, 202–210. 176
Lyu, S. (2009). Interpretation and generalization of score matching. In UAI’09 . 491
Maass, W., Natschlaeger, T., and Markram, H. (2002). Real-time computing without
stable states: A new framework for neural computation based on perturbations. Neural
Computation, 14(11), 2531–2560. 305
MacKay, D. (2003). Information Theory, Inference and Learning Algorithms. Cambridge
University Press. 54
Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., and Yuille, A. L. (2015). Deep captioning
with multimodal recurrent neural networks. In ICLR’2015 . arXiv:1410.1090. 92
Marlin, B., Swersky, K., Chen, B., and de Freitas, N. (2010). Inductive principles for
restricted Boltzmann machine learning. In Proceedings of The Thirteenth International
Conference on Artificial Intelligence and Statistics (AISTATS’10), volume 9, pages
509–516. 486, 491, 519
Martens, J. and Medabalimi, V. (2014). On the expressive efficiency of sum product
networks. arXiv:1411.7717 . 456
Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-
free optimization. In Proc. ICML’2011 . ACM. 313, 314
Mase, S. (1995). Consistency of the maximum pseudo-likelihood estimator of continuous
state space Gibbsian processes. The Annals of Applied Probability, 5(3), 603–612.
489
Matan, O., Burges, C. J. C., LeCun, Y., and Denker, J. S. (1992). Multi-digit recognition
using a space displacement neural network. In NIPS'91, pages 488–495, San Mateo,
CA. Morgan Kaufmann. 328
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall,
London. 157
McCulloch, W. S. and Pitts, W. (1943). A logical calculus of ideas immanent in nervous
activity. Bulletin of Mathematical Biophysics, 5, 115–133. 13
Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E.,
Muller, X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra,
J. (2011). Unsupervised and transfer learning challenge: a deep learning approach. In
JMLR W&CP: Proc. Unsupervised and Transfer Learning, volume 7. 182, 184, 441
Mesnil, G., Rifai, S., Dauphin, Y., Bengio, Y., and Vincent, P. (2012). Surfing on the
manifold. Learning Workshop, Snowbird. 544
Miikkulainen, R. and Dyer, M. G. (1991). Natural language processing with modular
PDP networks and distributed lexicon. Cognitive Science, 15, 343–399. 355
Mikolov, T. (2012). Statistical Language Models based on Neural Networks. Ph.D. thesis,
Brno University of Technology. 163, 315
Mikolov, T., Deoras, A., Kombrink, S., Burget, L., and Cernocky, J. (2011a). Empiri-
cal evaluation and combination of advanced language modeling techniques. In Proc.
12th annual conference of the international speech communication association (INTER-
SPEECH 2011). 366
Mikolov, T., Deoras, A., Povey, D., Burget, L., and Cernocky, J. (2011b). Strategies for
training large scale neural network language models. In Proc. ASRU’2011. 247, 366
Mikolov, T., Le, Q. V., and Sutskever, I. (2013). Exploiting similarities among languages
for machine translation. Technical report, arXiv:1309.4168. 444
Minka, T. (2005). Divergence measures and message passing. Technical Report MSR-
TR-2005-173, Microsoft Research, Cambridge, UK. 496
Minsky, M. L. and Papert, S. A. (1969). Perceptrons. MIT Press, Cambridge. 13
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York. 89
Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-
contrastive estimation. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and
K. Weinberger, editors, Advances in Neural Information Processing Systems 26 , pages
2265–2273. Curran Associates, Inc. 366, 494
Mnih, A. and Teh, Y. W. (2012). A fast and simple algorithm for training neural proba-
bilistic language models. In ICML’2012 , pages 1751–1758. 366
Mnih, V. and Hinton, G. (2010). Learning to detect roads in high-resolution aerial images.
In Proceedings of the 11th European Conference on Computer Vision (ECCV). 92
Mobahi, H. and Fisher III, J. W. (2015). A theoretical analysis of optimization by Gaussian
continuation. In AAAI’2015 . 246
Mohamed, A., Dahl, G., and Hinton, G. (2012). Acoustic modeling using deep belief
networks. IEEE Trans. on Audio, Speech and Language Processing, 20(1), 14–22. 352
Montúfar, G. (2014). Universal approximation depth and errors of narrow belief networks
with discrete units. Neural Computation, 26. 455
Montúfar, G. and Ay, N. (2011). Refinements of universal approximation results for
deep belief networks and restricted Boltzmann machines. Neural Computation, 23(5),
1306–1319. 455
Montúfar, G. and Morton, J. (2014). When does a mixture of products contain a product
of mixtures? SIAM Journal on Discrete Mathematics, 29(1), 321–347. 454
Montúfar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear
regions of deep neural networks. In NIPS’2014 . 17, 453, 456, 457
Mor-Yosef, S., Samueloff, A., Modan, B., Navot, D., and Schenker, J. G. (1990). Ranking
the risk factors for cesarean: logistic regression analysis of a nationwide study. Obstet
Gynecol, 75(6), 944–947. 2
Morin, F. and Bengio, Y. (2005). Hierarchical probabilistic neural network language
model. In AISTATS’2005. 361, 363
Mozer, M. C. (1992). The induction of multiscale temporal structure. In NIPS’91 , pages
275–282, San Mateo, CA. Morgan Kaufmann. 308, 309, 317
Murphy, K. P. (2012). Machine Learning: a Probabilistic Perspective. MIT Press, Cam-
bridge, MA, USA. 89, 138, 140
Murray, B. U. I. and Larochelle, H. (2014). A deep and tractable density estimator. In
ICML’2014 . 163, 304, 305
Nadas, A., Nahamoo, D., and Picheny, M. A. (1988). On a model-robust training method
for speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing,
ASSP-36(9), 1432–1436. 325
Nair, V. and Hinton, G. (2010a). Rectified linear units improve restricted Boltzmann
machines. In ICML’2010 . 155, 421
Nair, V. and Hinton, G. E. (2010b). Rectified linear units improve restricted Boltzmann
machines. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh
International Conference on Machine Learning (ICML-10), pages 807–814. ACM. 14
Narayanan, H. and Mitter, S. (2010). Sample complexity of testing the manifold hypoth-
esis. In NIPS’2010 . 145, 461
Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56,
71–113. 542
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics.
Springer. 221
Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2),
125–139. 498, 499
Neal, R. M. (2005). Estimating ratios of normalizing constants using linked importance
sampling. 500
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Read-
ing digits in natural images with unsupervised feature learning. Deep Learning and
Unsupervised Feature Learning Workshop, NIPS. 19
Ney, H. and Kneser, R. (1993). Improved clustering techniques for class-based statistical
language modelling. In European Conference on Speech Communication and Technol-
ogy (Eurospeech), pages 973–976, Berlin. 356
Niesler, T. R., Whittaker, E. W. D., and Woodland, P. C. (1998). Comparison of
part-of-speech and automatically derived category-based language models for speech
recognition. In International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 177–180. 356
Niranjan, M. and Fallside, F. (1990). Neural networks and radial basis functions in
classifying static speech patterns. Computer Speech and Language, 4, 275–289. 155
Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer. 85, 87
Olshausen, B. and Field, D. J. (2005). How close are we to understanding V1? Neural
Computation, 17, 1665–1699. 14
Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field prop-
erties by learning a sparse code for natural images. Nature, 381, 607–609. 277, 407,
460
Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set:
a strategy employed by V1? Vision Research, 37, 3311–3325. 350, 420
Opper, M. and Archambeau, C. (2009). The variational Gaussian approximation revisited.
Neural Computation, 21(3), 786–792. 180
Park, H., Amari, S.-I., and Fukumizu, K. (2000). Adaptive natural gradient learning
algorithms for various stochastic models. Neural Networks, 13(7), 755–764. 166
Parker, D. B. (1985). Learning-logic. Technical Report TR-47, Center for Comp. Research
in Economics and Management Sci., MIT. 188
Pascanu, R. (2014). On recurrent and deep networks. Ph.D. thesis, Université de
Montréal. 231, 232
Pascanu, R. and Bengio, Y. (2012). On the difficulty of training recurrent neural networks.
Technical Report arXiv:1211.5063, Universite de Montreal. 163
Pascanu, R. and Bengio, Y. (2013). Revisiting natural gradient for deep networks. Tech-
nical report, arXiv:1301.3584. 166
Pascanu, R., Mikolov, T., and Bengio, Y. (2013a). On the difficulty of training recurrent
neural networks. In ICML’2013 . 163, 236, 305, 309, 315, 316, 317
Pascanu, R., Montufar, G., and Bengio, Y. (2013b). On the number of inference regions
of deep feed forward networks with piece-wise linear activations. Technical report, U.
Montreal, arXiv:1312.6098. 182
Pascanu, R., Gülçehre, Ç., Cho, K., and Bengio, Y. (2014a). How to construct deep
recurrent neural networks. In ICLR’2014 . 17, 221, 297, 298, 311, 353, 456, 457
Pascanu, R., Montufar, G., and Bengio, Y. (2014b). On the number of inference regions
of deep feed forward networks with piece-wise linear activations. In ICLR’2014 . 454
Pearl, J. (1985). Bayesian networks: A model of self-activated memory for evidential
reasoning. In Proceedings of the 7th Conference of the Cognitive Science Society, Uni-
versity of California, Irvine, pages 329–334. 377
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann. 47
Petersen, K. B. and Pedersen, M. S. (2006). The matrix cookbook. Version 20051003. 27
Pinto, N., Cox, D. D., and DiCarlo, J. J. (2008). Why is real-world visual object recog-
nition hard? PLoS Comput Biol, 4. 350, 541
Pinto, N., Stone, Z., Zickler, T., and Cox, D. (2011). Scaling up biologically-inspired
computer vision: A case study in unconstrained face recognition on facebook. In Com-
puter Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer
Society Conference on, pages 35–42. IEEE. 272
Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1),
77–105. 299
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods.
USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17. 238
Poole, B., Sohl-Dickstein, J., and Ganguli, S. (2014). Analyzing noise in autoencoders
and deep networks. CoRR, abs/1406.1831. 203
Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In
UAI’2011 , Barcelona, Spain. 182, 456
Poundstone, W. (2005). Fortune’s Formula: The untold story of the scientific betting
system that beat the casinos and Wall Street. Macmillan. 55
Powell, M. (1987). Radial basis functions for multivariable interpolation: A review. 155
Price, R. (1958). A useful theorem for nonlinear devices having Gaussian inputs. IEEE
Transactions on Information Theory, 4(2), 69–72. 180
Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., and Fried, I. (2005). Invariant visual
representation by single neurons in the human brain. Nature, 435(7045), 1102–1107.
274
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE, 77(2), 257–286. 323, 352
Rabiner, L. R. and Juang, B. H. (1986). An introduction to hidden Markov models. IEEE
ASSP Magazine, pages 257–285. 281, 323
Raiko, T., Yao, L., Cho, K., and Bengio, Y. (2014). Iterative neural autoregressive
distribution estimator (NADE-k). Technical report, arXiv:1406.1485. 304
Raina, R., Madhavan, A., and Ng, A. Y. (2009). Large-scale deep unsupervised learning
using graphics processors. In L. Bottou and M. Littman, editors, ICML 2009 , pages
873–880, New York, NY, USA. ACM. 23, 341
Rall, L. B. (1981). Automatic Differentiation: Techniques and Applications. Lecture
Notes in Computer Science 120, Springer. 176
Ramsey, F. P. (1926). Truth and probability. In R. B. Braithwaite, editor, The Founda-
tions of Mathematics and other Logical Essays, chapter 7, pages 156–198. McMaster
University Archive for the History of Economic Thought. 48
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007a). Efficient learning of
sparse representations with an energy-based model. In NIPS’2006 . 12, 16, 420, 432,
433
Ranzato, M., Huang, F., Boureau, Y., and LeCun, Y. (2007b). Unsupervised learning
of invariant feature hierarchies with applications to object recognition. In Proceedings
of the Computer Vision and Pattern Recognition Conference (CVPR’07). IEEE Press.
272
Ranzato, M., Boureau, Y., and LeCun, Y. (2008). Sparse feature learning for deep belief
networks. In NIPS’2007 . 420
Rao, C. (1945). Information and the accuracy attainable in the estimation of statistical
parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–89. 121
Recht, B., Re, C., Wright, S., and Niu, F. (2011). Hogwild: A lock-free approach to
parallelizing stochastic gradient descent. In NIPS’2011 . 343
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and
approximate inference in deep generative models. In ICML’2014. 180, 394, 395
Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. Y. (2013). Zero-shot learning
through cross-modal transfer. In 27th Annual Conference on Neural Information Pro-
cessing Systems (NIPS 2013). 442
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011a). Contractive auto-
encoders: Explicit invariance during feature extraction. In ICML’2011. 428, 430,
463
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X.
(2011b). Higher order contractive auto-encoder. In European Conference on Machine
Learning and Principles and Practice of Knowledge Discovery in Databases (ECML
PKDD). 408
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X.
(2011c). Higher order contractive auto-encoder. In ECML PKDD. 428
Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011d). The manifold
tangent classifier. In NIPS’2011 . 476
Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for
sampling contractive auto-encoders. In ICML’2012. 544
Ringach, D. and Shapley, R. (2004). Reverse correlation in neurophysiology. Cognitive
Science, 28(2), 147–166. 276
Roberts, S. and Everson, R. (2001). Independent component analysis: principles and
practice. Cambridge University Press. 415
Robinson, A. J. and Fallside, F. (1991). A recurrent error propagation network speech
recognition system. Computer Speech and Language, 5(3), 259–274. 23, 352
Rockafellar, R. T. (1997). Convex Analysis. Princeton Landmarks in Mathematics. Princeton University Press. 85
Romero, A., Ballas, N., Ebrahimi Kahou, S., Chassang, A., Gatta, C., and Bengio, Y.
(2015). Fitnets: Hints for thin deep nets. In ICLR’2015, arXiv:1412.6550. 244, 245
Rosen, J. B. (1960). The gradient projection method for nonlinear programming. Part
I. Linear constraints. Journal of the Society for Industrial and Applied Mathematics,
8(1), 181–217. 85
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage
and organization in the brain. Psychological Review, 65, 386–408. 12, 13, 23
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York. 13, 23
Roweis, S. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear
embedding. Science, 290(5500). 145, 464
Rumelhart, D., Hinton, G., and Williams, R. (1986a). Learning representations by back-
propagating errors. Nature, 323, 533–536. 12, 16, 21, 188, 355
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986b). Learning internal repre-
sentations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors,
Parallel Distributed Processing, volume 1, chapter 8, pages 318–362. MIT Press, Cam-
bridge. 19, 23, 188
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986c). Learning representations
by back-propagating errors. Nature, 323, 533–536. 149, 281
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986d). Parallel
Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press,
Cambridge. 15, 188
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986e). Parallel
Distributed Processing: Explorations in the Microstructure of Cognition, volume 1.
MIT Press, Cambridge. 149
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,
A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2014a). ImageNet Large
Scale Visual Recognition Challenge. 19
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,
A., Khosla, A., Bernstein, M., et al. (2014b). Imagenet large scale visual recognition
challenge. arXiv preprint arXiv:1409.0575 . 24
Rust, N., Schwartz, O., Movshon, J. A., and Simoncelli, E. (2005). Spatiotemporal
elements of macaque V1 receptive fields. Neuron, 46(6), 945–956. 275
Sainath, T., Mohamed, A.-r., Kingsbury, B., and Ramabhadran, B. (2013). Deep
convolutional neural networks for LVCSR. In ICASSP 2013 . 353
Salakhutdinov, R. and Hinton, G. (2009a). Deep Boltzmann machines. In Proceedings of
the International Conference on Artificial Intelligence and Statistics, volume 5, pages
448–455. 20, 23, 433, 526, 529, 534, 536
Salakhutdinov, R. and Hinton, G. (2009b). Deep Boltzmann machines. In Proceedings
of the Twelfth International Conference on Artificial Intelligence and Statistics (AIS-
TATS 2009), volume 8. 533, 537, 549
Salakhutdinov, R. and Hinton, G. E. (2008). Using deep belief nets to learn covariance
kernels for Gaussian processes. In NIPS’07 , pages 1249–1256, Cambridge, MA. MIT
Press. 447
Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief
networks. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008 ,
volume 25, pages 872–879. ACM. 499
Saul, L. K., Jaakkola, T., and Jordan, M. I. (1996). Mean field theory for sigmoid belief
networks. Journal of Artificial Intelligence Research, 4, 61–76. 23
Saxe, A. M., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., and Ng, A. (2011). On
random weights and unsupervised feature learning. In Proc. ICML’2011 . ACM. 272
Schaul, T., Zhang, S., and LeCun, Y. (2012). No More Pesky Learning Rates. Technical
report, New York University, arXiv:1206.1106. 243
Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of
history compression. Neural Computation, 4(2), 234–242. 297
Schmidhuber, J. (1996). Sequential neural text compression. IEEE Transactions on
Neural Networks, 7(1), 142–146. 355
Schölkopf, B. and Smola, A. (2002). Learning with Kernels. MIT Press. 139
Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a
kernel eigenvalue problem. Neural Computation, 10, 1299–1319. 145, 464
Schölkopf, B., Burges, C. J. C., and Smola, A. J. (1999). Advances in Kernel Methods
Support Vector Learning. MIT Press, Cambridge, MA. 16, 155, 184
Schulz, H. and Behnke, S. (2012). Learning two-layer contractive encodings. In
ICANN’2012 , pages 620–628. 430
Schuster, M. and Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE
Transactions on Signal Processing, 45(11), 2673–2681. 296
Schwenk, H. (2007). Continuous space language models. Computer Speech and Language,
21, 492–518. 356, 360
Schwenk, H. (2010). Continuous space language models for statistical machine translation.
The Prague Bulletin of Mathematical Linguistics, 93, 137–146. 356, 367
Schwenk, H. (2014). Cleaned subset of WMT '14 dataset. 19
Schwenk, H. and Gauvain, J.-L. (2002). Connectionist language modeling for large vocab-
ulary continuous speech recognition. In International Conference on Acoustics, Speech
and Signal Processing (ICASSP), volume 1, pages 765–768. 356
Schwenk, H. and Gauvain, J.-L. (2005). Building continuous space language models for
transcribing european languages. In Interspeech, pages 737–740. 356
Schwenk, H., Costa-juss`a, M. R., and Fonollosa, J. A. R. (2006). Continuous space lan-
guage models for the IWSLT 2006 task. In International Workshop on Spoken Language
Translation, pages 166–173. 356, 367
Seide, F., Li, G., and Yu, D. (2011). Conversational speech transcription using context-
dependent deep neural networks. In Interspeech 2011 , pages 437–440. 21
Sermanet, P., Chintala, S., and LeCun, Y. (2012). Convolutional neural networks applied
to house numbers digit classification. CoRR, abs/1204.3968. 350
Sermanet, P., Kavukcuoglu, K., Chintala, S., and LeCun, Y. (2013). Pedestrian detection
with unsupervised multi-stage feature learning. In Proc. International Conference on
Computer Vision and Pattern Recognition (CVPR’13). IEEE. 21, 182, 184
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical
Journal, 27(3), 379–423. 55
Shannon, C. E. (1949). Communication in the presence of noise. Proceedings of the
Institute of Radio Engineers, 37(1), 10–21. 55
Shilov, G. (1977). Linear Algebra. Dover Books on Mathematics Series. Dover Publica-
tions. 27
Siegelmann, H. (1995). Computation beyond the Turing limit. Science, 268(5210), 545–
548. 284
Siegelmann, H. and Sontag, E. (1991). Turing computability with neural nets. Applied
Mathematics Letters, 4(6), 77–80. 284
Siegelmann, H. T. and Sontag, E. D. (1995). On the computational power of neural nets.
Journal of Computer and Systems Sciences, 50(1), 132–150. 236
Simard, P., Victorri, B., LeCun, Y., and Denker, J. (1992). Tangent prop - A formalism
for specifying selected invariances in an adaptive network. In NIPS’1991 . 475, 476
Simard, P. Y., LeCun, Y., and Denker, J. (1993). Efficient pattern recognition using a
new transformation distance. In NIPS’92 . 474
Simard, P. Y., LeCun, Y. A., Denker, J. S., and Victorri, B. (1998). Transformation
invariance in pattern recognition — tangent distance and tangent propagation. Lecture
Notes in Computer Science, 1524. 474
Sjöberg, J. and Ljung, L. (1995). Overtraining, regularization and searching for a min-
imum, with application to neural networks. International Journal of Control, 62(6),
1391–1407. 213
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of
harmony theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed
Processing, volume 1, chapter 6, pages 194–281. MIT Press, Cambridge. 384, 397
Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011a).
Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In
NIPS’2011 . 299
Socher, R., Manning, C., and Ng, A. Y. (2011b). Parsing natural scenes and natural
language with recursive neural networks. In Proceedings of the Twenty-Eighth Inter-
national Conference on Machine Learning (ICML’2011). 299
Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D. (2011c).
Semi-supervised recursive autoencoders for predicting sentiment distributions. In
EMNLP’2011 . 299
Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C.
(2013). Recursive deep models for semantic compositionality over a sentiment treebank.
In EMNLP’2013 . 299
Solla, S. A., Levin, E., and Fleisher, M. (1988). Accelerated learning in layered neural
networks. Complex Systems, 2, 625–639. 159
Sontag, E. D. and Sussman, H. J. (1989). Backpropagation can give rise to spurious local
minima even for networks without hidden layers. Complex Systems, 3, 91–106. 229
Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturba-
tion gradient approximation. IEEE Transactions on Automatic Control, 37, 332–341.
176
Spitkovsky, V. I., Alshawi, H., and Jurafsky, D. (2010). From baby steps to leapfrog:
how ”less is more” in unsupervised dependency parsing. In HLT’10 . 247
Srivastava, N. and Salakhutdinov, R. (2012). Multimodal learning with deep Boltzmann
machines. In NIPS’2012 . 445
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).
Dropout: A simple way to prevent neural networks from overfitting. Journal of Ma-
chine Learning Research, 15, 1929–1958. 218, 220, 221, 536
Stewart, L., He, X., and Zemel, R. S. (2007). Learning flexible features for conditional
random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence,
30(8), 1415–1426. 319
Supancic, J. and Ramanan, D. (2013). Self-paced learning for long-term tracking. In
CVPR’2013 . 247
Sutskever, I. (2012). Training Recurrent Neural Networks. Ph.D. thesis, Department of
computer science, University of Toronto. 306, 307, 314
Sutskever, I. and Tieleman, T. (2010). On the Convergence Properties of Contrastive
Divergence. In Y. W. Teh and M. Titterington, editors, Proc. of the International
Conference on Artificial Intelligence and Statistics (AISTATS), volume 9, pages 789–
795. 484
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of
initialization and momentum in deep learning. In ICML. 238, 306, 307, 314
Sutskever, I., Vinyals, O., and Le, Q. V. (2014a). Sequence to sequence learning with
neural networks. Technical report, arXiv:1409.3215. 22, 91, 311, 312
Sutskever, I., Vinyals, O., and Le, Q. V. (2014b). Sequence to sequence learning with
neural networks. In NIPS’2014 . 368, 369
Swersky, K. (2010). Inductive Principles for Learning Restricted Boltzmann Machines.
Master’s thesis, University of British Columbia. 426
Swersky, K., Ranzato, M., Buchman, D., Marlin, B., and de Freitas, N. (2011). On
autoencoders and score matching for energy based models. In ICML’2011 . ACM. 492
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Van-
houcke, V., and Rabinovich, A. (2014a). Going deeper with convolutions. Technical
report, arXiv:1409.4842. 20, 21, 23, 182, 184, 225, 262
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and
Fergus, R. (2014b). Intriguing properties of neural networks. ICLR, abs/1312.6199.
223
Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014). Deepface: Closing the gap to
human-level performance in face verification. In CVPR’2014 . 90
Tang, Y. and Eliasmith, C. (2010). Deep networks for robust visual recognition. In
Proceedings of the 27th International Conference on Machine Learning, June 21-24,
2010, Haifa, Israel. 203
Tenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global geometric framework
for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323. 145, 436, 437,
464
Thrun, S. (1995). Learning to play the game of chess. In NIPS’1994 . 476
Tibshirani, R. J. (1995). Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society B, 58, 267–288. 198
Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to
the likelihood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors,
ICML 2008 , pages 1064–1071. ACM. 486, 523
Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal components analysis.
Journal of the Royal Statistical Society B, 61(3), 611–622. 414
Torabi, A., Pal, C., Larochelle, H., and Courville, A. (2015). Using descriptive video
services to create a large data source for video annotation research. arXiv preprint
arXiv:1503.01070. 134
Tu, K. and Honavar, V. (2011). On the utility of curricula in unsupervised learning of
probabilistic grammars. In IJCAI’2011 . 247
Uria, B., Murray, I., and Larochelle, H. (2013). RNADE: The real-valued neural autore-
gressive density-estimator. In NIPS’2013 . 302, 304
van der Maaten, L. and Hinton, G. E. (2008a). Visualizing data using t-SNE. J. Machine
Learning Res., 9. 356, 436, 464, 468
van der Maaten, L. and Hinton, G. E. (2008b). Visualizing data using t-SNE. Journal of
Machine Learning Research, 9, 2579–2605. 437
Vanhoucke, V., Senior, A., and Mao, M. Z. (2011). Improving the speed of neural networks
on cpus. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop.
340
Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer-
Verlag, Berlin. 102, 103
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York.
102, 103
Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative
frequencies of events to their probabilities. Theory of Probability and Its Applications,
16, 264–280. 102, 103
Vincent, P. (2011a). A connection between score matching and denoising autoencoders.
Neural Computation, 23(7). 425, 426, 428, 544
Vincent, P. (2011b). A connection between score matching and denoising autoencoders.
Neural Computation, 23(7), 1661–1674. 492, 545
Vincent, P. and Bengio, Y. (2003). Manifold Parzen windows. In NIPS’2002 . MIT Press.
466
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and
composing robust features with denoising autoencoders. In ICML 2008 . 423
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked
denoising autoencoders: Learning useful representations in a deep network with a local
denoising criterion. J. Machine Learning Res., 11. 423
Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., and Hinton, G. (2014a). Gram-
mar as a foreign language. Technical report, arXiv:1412.7449. 311
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2014b). Show and tell: a neural
image caption generator. arXiv 1411.4555. 311
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and tell: a neural image
caption generator. In CVPR’2015 . arXiv:1411.4555. 92
Von Melchner, L., Pallas, S. L., and Sur, M. (2000). Visual behaviour mediated by retinal
projections directed to the auditory pathway. Nature, 404(6780), 871–876. 14
Wager, S., Wang, S., and Liang, P. (2013). Dropout training as adaptive regularization.
In Advances in Neural Information Processing Systems 26 , pages 351–359. 221
Waibel, A., Hanazawa, T., Hinton, G. E., Shikano, K., and Lang, K. (1989). Phoneme
recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech,
and Signal Processing, 37, 328–339. 282, 346, 352
Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of
neural networks using dropconnect. In ICML’2013. 222
Wang, S. and Manning, C. (2013). Fast dropout training. In ICML’2013. 221
Warde-Farley, D., Goodfellow, I. J., Lamblin, P., Desjardins, G., Bastien, F., and Bengio,
Y. (2011). pylearn2. http://deeplearning.net/software/pylearn2. 342
Warde-Farley, D., Goodfellow, I. J., Courville, A., and Bengio, Y. (2014). An empirical
analysis of dropout in piecewise linear networks. In ICLR’2014 . 221
Weinberger, K. Q. and Saul, L. K. (2004). Unsupervised learning of image manifolds by
semidefinite programming. In CVPR’2004 , pages 988–995. 145, 464
Werbos, P. J. (1981). Applications of advances in nonlinear sensitivity analysis. In
Proceedings of the 10th IFIP Conference, 31.8 - 4.9, NYC , pages 762–770. 188
Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised em-
bedding. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008 , pages
1168–1175, New York, NY, USA. ACM. 446
Weston, J., Bengio, S., and Usunier, N. (2010). Large scale image annotation: learning
to rank with joint word-image embeddings. Machine Learning, 81(1), 21–35. 299
Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. In 1960 IRE WESCON
Convention Record, volume 4, pages 96–104. IRE, New York. 13, 19, 20, 23
Wikipedia (2015). List of animals by number of neurons. Wikipedia, the free encyclo-
pedia. [Online; accessed 4-March-2015]. 20, 23
Williams, C. K. I. and Rasmussen, C. E. (1996). Gaussian processes for regression. In
NIPS’95 , pages 514–520. MIT Press, Cambridge, MA. 184
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Machine Learning, 8, 229–256. 180
Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms.
Neural Computation, 8(7), 1341–1390. 104
Wu, R., Yan, S., Shan, Y., Dang, Q., and Sun, G. (2015). Deep image: Scaling up image
recognition. arXiv:1501.02876. 21, 343
Wu, Z. (1997). Global continuation for distance geometry problems. SIAM Journal of
Optimization, 7, 814–836. 245
Xiong, H. Y., Barash, Y., and Frey, B. J. (2011). Bayesian prediction of tissue-regulated
splicing using RNA sequence and cellular context. Bioinformatics, 27(18), 2554–2562.
221
Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. S.,
and Bengio, Y. (2015a). Show, attend and tell: Neural image caption generation with
visual attention. In ICML’2015 . 92
Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. S.,
and Bengio, Y. (2015b). Show, attend and tell: Neural image caption generation with
visual attention. arXiv:1502.03044. 311
Xu, L. and Jordan, M. I. (1996). On convergence properties of the EM algorithm for
Gaussian mixtures. Neural Computation, 8, 129–151. 325
Younes, L. (1998). On the convergence of Markovian stochastic algorithms with rapidly
decreasing ergodicity rates. In Stochastics and Stochastics Models, pages 177–228. 484,
523
Zaremba, W. and Sutskever, I. (2014). Learning to execute. arXiv 1410.4615. 247
Zaslavsky, T. (1975). Facing Up to Arrangements: Face-Count Formulas for Partitions
of Space by Hyperplanes. Number no. 154 in Memoirs of the American Mathematical
Society. American Mathematical Society. 454
Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional net-
works. In ECCV’14 . 6
Zhou, J. and Troyanskaya, O. G. (2014). Deep supervised and convolutional generative
stochastic network for protein secondary structure prediction. In ICML’2014 . 550, 551
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society, Series B, 67(2), 301–320. 165
Zöhrer, M. and Pernkopf, F. (2014). General stochastic networks for classification. In
NIPS’2014 . 550