Bibliography
Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985). A learning algorithm for
Boltzmann machines. Cognitive Science, 9, 147–169. 553
Alain, G. and Bengio, Y. (2012). What regularized auto-encoders learn from the data
generating distribution. Technical Report arXiv:1211.4246, Université de Montréal.
466
Alain, G. and Bengio, Y. (2013). What regularized auto-encoders learn from the data
generating distribution. In ICLR’2013 . also arXiv report 1211.4246. 448, 466, 468
Alain, G., Bengio, Y., Yao, L., Éric Thibodeau-Laufer, Yosinski, J., and Vincent, P.
(2015). GSNs: Generative stochastic networks. arXiv:1503.05571. 451
Anderson, E. (1935). The Irises of the Gaspé Peninsula. Bulletin of the American Iris
Society, 59, 2–5. 19
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly
learning to align and translate. Technical report, arXiv:1409.0473. 23, 95, 325, 398,
407, 408
Bahl, L. R., Brown, P., de Souza, P. V., and Mercer, R. L. (1987). Speech recognition
with continuous-parameter hidden Markov models. Computer, Speech and Language,
2, 219–234. 64, 356
Baldi, P. and Brunak, S. (1998). Bioinformatics, the Machine Learning Approach. MIT
Press. 358
Baldi, P. and Sadowski, P. J. (2013). Understanding dropout. In Advances in Neural
Information Processing Systems 26 , pages 2814–2822. 232
Baldi, P., Brunak, S., Frasconi, P., Soda, G., and Pollastri, G. (1999). Exploiting the
past and the future in protein secondary structure prediction. Bioinformatics, 15(11),
937–946. 323
Baldi, P., Sadowski, P., and Whiteson, D. (2014). Searching for exotic particles in high-
energy physics with deep learning. Nature communications, 5. 23
Barron, A. E. (1993). Universal approximation bounds for superpositions of a sigmoidal
function. IEEE Trans. on Information Theory, 39, 930–945. 189
Bartholomew, D. J. (1987). Latent variable models and factor analysis. Oxford University
Press. 453
Basilevsky, A. (1994). Statistical Factor Analysis and Related Methods: Theory and
Applications. Wiley. 453
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A.,
Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements.
Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop. 78, 186, 379
Basu, S. and Christensen, J. (2013). Teaching classification boundaries to humans. In
AAAI’2013 . 273
Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of
finite state Markov chains. Ann. Math. Stat., 37, 1559–1563. 354
Baxter, J. (1995). Learning internal representations. In Proceedings of the 8th Inter-
national Conference on Computational Learning Theory (COLT’95), pages 311–320,
Santa Cruz, California. ACM Press. 234
Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2015). Automatic
differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767 . 184
Bayer, J. and Osendorfer, C. (2014). Learning stochastic recurrent networks. arXiv
preprint arXiv:1411.7610 . 231
Becker, S. and Hinton, G. (1992). A self-organizing neural network that discovers surfaces
in random-dot stereograms. Nature, 355, 161–163. 500
Beiu, V., Quintana, J. M., and Avedillo, M. J. (2003). VLSI implementations of threshold
logic - a comprehensive survey. Neural Networks, IEEE Transactions on, 14(5), 1217–
1243. 383
Belkin, M. and Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for em-
bedding and clustering. In NIPS’01, Cambridge, MA. MIT Press. 486
Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and
data representation. Neural Computation, 15(6), 1373–1396. 152, 504
Bengio, S. and Bengio, Y. (2000a). Taking on the curse of dimensionality in joint distri-
butions using neural networks. IEEE Transactions on Neural Networks, special issue
on Data Mining and Knowledge Discovery, 11(3), 550–557. 330
Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015). Scheduled sampling for se-
quence prediction with recurrent neural networks. Technical report, arXiv:1506.03099.
314
Bengio, Y. (1991). Artificial Neural Networks and their Application to Sequence Recog-
nition. Ph.D. thesis, McGill University, (Computer Science), Montreal, Canada. 336,
358
Bengio, Y. (1993). A connectionist approach to speech recognition. International Journal
on Pattern Recognition and Artificial Intelligence, 7(4), 647–668. 356
Bengio, Y. (1999a). Markovian models for sequential data. Neural Computing Surveys,
2, 129–162. 356
Bengio, Y. (1999b). Markovian models for sequential data. Neural Computing Surveys,
2, 129–162. 359
Bengio, Y. (2000). Gradient-based optimization of hyperparameters. Neural Computa-
tion, 12(8), 1889–1900. 372
Bengio, Y. (2002). New distributed probabilistic language models. Technical Report
1215, Dept. IRO, Université de Montréal. 400
Bengio, Y. (2009). Learning deep architectures for AI . Now Publishers. 147, 190
Bengio, Y. (2013a). Deep learning of representations: looking forward. In Statistical
Language and Speech Processing, volume 7978 of Lecture Notes in Computer Science,
pages 1–37. Springer, also in arXiv at http://arxiv.org/abs/1305.0445. 381
Bengio, Y. (2013b). Estimating or propagating gradients through stochastic neurons.
Technical Report arXiv:1305.2982, Universite de Montreal. 435
Bengio, Y. and Bengio, S. (2000b). Modeling high-dimensional discrete data with multi-
layer neural networks. In NIPS’99 , pages 400–406. MIT Press. 329, 330, 332, 333
Bengio, Y. and Delalleau, O. (2009). Justifying and generalizing contrastive divergence.
Neural Computation, 21(6), 1601–1621. 466, 524, 562
Bengio, Y. and Frasconi, P. (1996). Input/Output HMMs for sequence processing. IEEE
Transactions on Neural Networks, 7(5), 1231–1249. 358
Bengio, Y. and Grandvalet, Y. (2004). No unbiased estimator of the variance of k-fold
cross-validation. In NIPS’03, Cambridge, MA. MIT Press, Cambridge. 114
Bengio, Y. and LeCun, Y. (2007a). Scaling learning algorithms towards AI. In L. Bottou,
O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT
Press. 17, 192
Bengio, Y. and LeCun, Y. (2007b). Scaling learning algorithms towards AI. In Large
Scale Kernel Machines. 147
Bengio, Y. and Monperrus, M. (2005). Non-local manifold tangent learning. In NIPS’04 ,
pages 129–136. MIT Press. 150, 506, 507
Bengio, Y. and Senécal, J.-S. (2003). Quick training of probabilistic neural nets by
importance sampling. In Proceedings of AISTATS 2003. 403
Bengio, Y. and Senécal, J.-S. (2008). Adaptive importance sampling to accelerate training
of a neural probabilistic language model. IEEE Trans. Neural Networks, 19(4), 713–
722. 403
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1991). Phonetically motivated
acoustic parameters for continuous speech recognition using artificial neural networks.
In Proceedings of EuroSpeech’91 . 24, 391
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992a). Global optimization of a
neural network-hidden Markov model hybrid. IEEE Transactions on Neural Networks,
3(2), 252–259. 356, 358
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992b). Neural network - Gaussian
mixture hybrid for speech recognition or density estimation. In NIPS 4, pages 175–182.
Morgan Kaufmann. 391
Bengio, Y., Frasconi, P., and Simard, P. (1993). The problem of learning long-term depen-
dencies in recurrent networks. In IEEE International Conference on Neural Networks,
pages 1183–1195, San Francisco. IEEE Press. (invited paper). 247, 343
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with
gradient descent is difficult. IEEE Transactions on Neural Networks. 247, 248, 249, 334,
341, 343, 344
Bengio, Y., LeCun, Y., Nohl, C., and Burges, C. (1995). Lerec: A NN/HMM hybrid for
on-line handwriting recognition. Neural Computation, 7(6), 1289–1303. 358
Bengio, Y., Latendresse, S., and Dugas, C. (1999). Gradient-based learning of hyper-
parameters. Learning Conference, Snowbird. 372
Bengio, Y., Ducharme, R., and Vincent, P. (2001a). A neural probabilistic language
model. In NIPS’00, pages 932–938. MIT Press. 16, 380
Bengio, Y., Ducharme, R., and Vincent, P. (2001b). A neural probabilistic language
model. In NIPS’2000, pages 932–938. 394, 395, 396, 405
Bengio, Y., Ducharme, R., and Vincent, P. (2001c). A neural probabilistic language
model. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, NIPS’2000 , pages
932–938. MIT Press. 508, 509
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003a). A neural probabilistic
language model. JMLR, 3, 1137–1155. 395, 399, 405
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003b). A neural probabilistic
language model. Journal of Machine Learning Research, 3, 1137–1155. 508, 509
Bengio, Y., Le Roux, N., Vincent, P., Delalleau, O., and Marcotte, P. (2006a). Convex
neural networks. In NIPS’2005 , pages 123–130. 227
Bengio, Y., Delalleau, O., and Le Roux, N. (2006b). The curse of highly variable functions
for local kernel machines. In NIPS’2005 . 147
Bengio, Y., Larochelle, H., and Vincent, P. (2006c). Non-local manifold Parzen windows.
In NIPS’2005 . MIT Press. 150, 506
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007a). Greedy layer-wise
training of deep networks. In NIPS’2006 . 12, 16, 472, 473
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007b). Greedy layer-wise
training of deep networks. In NIPS 19 , pages 153–160. MIT Press. 190
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In
ICML’09 . 272, 273
Bengio, Y., Léonard, N., and Courville, A. (2013a). Estimating or propagating gradients
through stochastic neurons for conditional computation. arXiv:1308.3432. 188, 381,
435
Bengio, Y., Yao, L., Alain, G., and Vincent, P. (2013b). Generalized denoising auto-
encoders as generative models. In NIPS’2013. 468, 584, 588
Bengio, Y., Courville, A., and Vincent, P. (2013c). Representation learning: A review and
new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI),
35(8), 1798–1828. 498, 582
Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014a). Deep generative
stochastic networks trainable by backprop. Technical Report arXiv:1306.1091. 435
Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014b). Deep generative
stochastic networks trainable by backprop. In ICML’2014 . 435, 585, 586, 587, 589,
590
Bennett, C. (1976). Efficient estimation of free energy differences from Monte Carlo data.
Journal of Computational Physics, 22(2), 245–268. 540
Berger, A. L., Della Pietra, V. J., and Della Pietra, S. A. (1996). A maximum entropy
approach to natural language processing. Computational Linguistics, 22, 39–71. 406
Berglund, M. and Raiko, T. (2013). Stochastic gradient estimate variance in contrastive
divergence and persistent contrastive divergence. CoRR, abs/1312.6002. 526
Bergstra, J. (2011). Incorporating Complex Cells into Neural Networks for Pattern Clas-
sification. Ph.D. thesis, Université de Montréal. 226, 447
Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. J.
Machine Learning Res., 13, 281–305. 369, 370, 371
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian,
J., Warde-Farley, D., and Bengio, Y. (2010a). Theano: a CPU and GPU math ex-
pression compiler. In Proceedings of the Python for Scientific Computing Conference
(SciPy). Oral Presentation. 78, 379
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian,
J., Warde-Farley, D., and Bengio, Y. (2010b). Theano: a CPU and GPU math expres-
sion compiler. In Proc. SciPy. 186
Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-
parameter optimization. In NIPS'2011. 372
Bertsekas, D. P. (2004). Nonlinear programming. Athena Scientific, 2 edition. 250
Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24(3), 179–195.
528
Bishop, C. M. (1994). Mixture density networks. 171
Bishop, C. M. (1995a). Regularization and complexity control in feed-forward networks.
In Proceedings International Conference on Artificial Neural Networks ICANN’95 , vol-
ume 1, pages 141–148. 212, 221
Bishop, C. M. (1995b). Training with noise is equivalent to Tikhonov regularization.
Neural Computation, 7(1), 108–116. 212
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. 92, 138
Blum, A. L. and Rivest, R. L. (1992). Training a 3-node neural network is NP-complete.
249
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1989). Learnability
and the Vapnik–Chervonenkis dimension. Journal of the ACM, 36(4), 929–965. 106
Bonnet, G. (1964). Transformations des signaux aléatoires à travers les systèmes non
linéaires sans mémoire. Annales des Télécommunications, 19(9–10), 203–220. 187
Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2012). Joint learning of words and
meaning representations for open-text semantic parsing. AISTATS’2012 . 328
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal
margin classifiers. In COLT ’92: Proceedings of the fifth annual workshop on Com-
putational learning theory, pages 144–152, New York, NY, USA. ACM. 16, 134, 148,
164
Bottou, L. (1991). Une approche théorique de l'apprentissage connexioniste; applications
à la reconnaissance de la parole. Ph.D. thesis, Université de Paris XI. 358
Bottou, L. (1998). Online algorithms and stochastic approximations. In D. Saad, editor,
Online Learning in Neural Networks. Cambridge University Press, Cambridge, UK.
252
Bottou, L. (2011). From machine learning to machine reasoning. Technical report,
arXiv.1102.1808. 326, 328
Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In NIPS’2008.
251, 253
Bottou, L., Fogelman-Soulié, F., Blanchet, P., and Lienard, J. S. (1990). Speaker inde-
pendent isolated digit recognition: multilayer perceptrons vs dynamic time warping.
Neural Networks, 3, 453–465. 358
Bottou, L., Bengio, Y., and LeCun, Y. (1997). Global training of document processing
systems using graph transformer networks. In Proceedings of the Computer Vision and
Pattern Recognition Conference (CVPR’97), pages 490–494, Puerto Rico. IEEE. 349,
357, 358, 359, 360, 361
Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal de-
pendencies in high-dimensional sequences: Application to polyphonic music generation
and transcription. In ICML’12 . 408
Boureau, Y., Ponce, J., and LeCun, Y. (2010). A theoretical analysis of feature pooling in
vision algorithms. In Proc. International Conference on Machine learning (ICML’10).
287
Boureau, Y., Le Roux, N., Bach, F., Ponce, J., and LeCun, Y. (2011). Ask the locals:
multi-way local pooling for image recognition. In Proc. International Conference on
Computer Vision (ICCV’11). IEEE. 287
Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer perceptrons and
singular value decomposition. Biological Cybernetics, 59, 291–294. 444
Bourlard, H. and Morgan, N. (1993). Connectionist Speech Recognition. A Hybrid Ap-
proach, volume 247 of The Kluwer international series in engineering and computer
science. Kluwer Academic Publishers, Boston. 358
Bourlard, H. and Wellekens, C. (1989). Speech pattern discrimination and multi-layered
perceptrons. Computer Speech and Language, 3, 1–19. 391
Bourlard, H. and Wellekens, C. (1990). Links between hidden Markov models and multi-
layer perceptrons. IEEE Transactions on Pattern Analysis and Machine Intelligence,
12, 1167–1178. 358
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University
Press, New York, NY, USA. 88
Brady, M. L., Raghavan, R., and Slawny, J. (1989). Back-propagation fails to separate
where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36, 665–674.
242
Brand, M. (2003). Charting a manifold. In NIPS’2002 , pages 961–968. MIT Press. 152,
504
Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140. 226
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and
Regression Trees. Wadsworth International Group, Belmont, CA. 137
Bridle, J. S. (1990). Alphanets: a recurrent ‘neural’ network architecture with a hidden
Markov model interpretation. Speech Communication, 9(1), 83–92. 167
Brown, P. (1987). The Acoustic-Modeling problem in Automatic Speech Recognition.
Ph.D. thesis, Dept. of Computer Science, Carnegie-Mellon University. 356
Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D.,
Mercer, R. L., and Roossin, P. S. (1990). A statistical approach to machine translation.
Computational linguistics, 16(2), 79–85. 19
Brown, P. F., Pietra, V. J. D., DeSouza, P. V., Lai, J. C., and Mercer, R. L. (1992).
Class-based n-gram models of natural language. Computational Linguistics, 18, 467–
479. 395
Bryson, A. and Ho, Y. (1969). Applied optimal control: optimization, estimation, and
control. Blaisdell Pub. Co. 195
Bryson, Jr., A. E. and Denham, W. F. (1961). A steepest-ascent method for solving
optimum programming problems. Technical Report BR-1303, Raytheon Company,
Missle and Space Division. 195
Buchberger, B., Collins, G. E., Loos, R., and Albrecht, R. (1983). Computer Algebra.
Springer-Verlag. 186
Buciluǎ, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model compression. In Pro-
ceedings of the 12th ACM SIGKDD international conference on Knowledge discovery
and data mining, pages 535–541. ACM. 380
Cai, M., Shi, Y., and Liu, J. (2013). Deep maxout neural networks for speech recognition.
In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop
on, pages 291–296. IEEE. 194
Carreira-Perpiñán, M. A. and Hinton, G. E. (2005). On contrastive divergence learning.
In R. G. Cowell and Z. Ghahramani, editors, AISTATS'2005, pages 33–40. Society for
Artificial Intelligence and Statistics. 524, 562
Caruana, R. (1993). Multitask connectionist learning. In Proc. 1993 Connectionist Models
Summer School, pages 372–379. 232
Cauchy, A. (1847a). Méthode générale pour la résolution de systèmes d'équations si-
multanées. In Compte rendu des séances de l'académie des sciences, pages 536–538.
80
Cauchy, L. A. (1847b). Méthode générale pour la résolution des systèmes d'équations
simultanées. Compte Rendu à l'Académie des Sciences. 194
Cayton, L. (2005). Algorithms for manifold learning. Technical Report CS2008-0923,
UCSD. 152, 501
Chapelle, O., Weston, J., and Schölkopf, B. (2003). Cluster kernels for semi-supervised
learning. In NIPS'02, pages 585–592, Cambridge, MA. MIT Press. 486
Chapelle, O., Schölkopf, B., and Zien, A., editors (2006). Semi-Supervised Learning. MIT
Press, Cambridge, MA. 486
Chellapilla, K., Puri, S., and Simard, P. (2006). High Performance Convolutional Neu-
ral Networks for Document Processing. In Guy Lorette, editor, Tenth International
Workshop on Frontiers in Handwriting Recognition, La Baule (France). Université de
Rennes 1, Suvisoft. http://www.suvisoft.com. 21, 24, 378
Chen, S. F. and Goodman, J. T. (1999). An empirical study of smoothing techniques for
language modeling. Computer, Speech and Language, 13(4), 359–393. 348, 406
Chen, T., Du, Z., Sun, N., Wang, J., Wu, C., Chen, Y., and Temam, O. (2014a). DianNao:
A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Pro-
ceedings of the 19th international conference on Architectural support for programming
languages and operating systems, pages 269–284. ACM. 383
Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N.,
et al. (2014b). DaDianNao: A machine-learning supercomputer. In Microarchitecture
(MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pages 609–622.
IEEE. 383
Chilimbi, T., Suzue, Y., Apacible, J., and Kalyanaraman, K. (2014). Project Adam:
Building an efficient and scalable deep learning training system. In 11th USENIX
Symposium on Operating Systems Design and Implementation (OSDI'14). 380
Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio,
Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical
machine translation. In Proceedings of the Empiricial Methods in Natural Language
Processing (EMNLP 2014). 323, 341, 407
Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2014). The
loss surface of multilayer networks. 242, 475
Chorowski, J., Bahdanau, D., Cho, K., and Bengio, Y. (2014). End-to-end continuous
speech recognition using attention-based recurrent nn: First results. arXiv:1412.1602.
393
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated
recurrent neural networks on sequence modeling. NIPS’2014 Deep Learning workshop,
arXiv 1412.3555. 341, 393
Ciresan, D., Meier, U., Masci, J., and Schmidhuber, J. (2012). Multi-column deep neural
network for traffic sign classification. Neural Networks, 32, 333–338. 22, 190
Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2010). Deep big
simple neural nets for handwritten digit recognition. Neural Computation, 22, 1–14.
21, 24, 378
Coates, A. and Ng, A. Y. (2011). The importance of encoding versus training with sparse
coding and vector quantization. In ICML’2011 . 24
Coates, A., Lee, H., and Ng, A. Y. (2011). An analysis of single-layer networks in un-
supervised feature learning. In Proceedings of the Thirteenth International Conference
on Artificial Intelligence and Statistics (AISTATS 2011). 386
Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., and Andrew, N. (2013). Deep
learning with COTS HPC systems. In S. Dasgupta and D. McAllester, editors, Proceedings
of the 30th International Conference on Machine Learning (ICML-13), volume 28 (3),
pages 1337–1345. JMLR Workshop and Conference Proceedings. 21, 24, 299, 380
Collobert, R. (2004). Large Scale Machine Learning. Ph.D. thesis, Université de Paris
VI, LIP6. 164
Collobert, R. (2011). Deep learning for efficient discriminative parsing. In AISTATS’2011.
95
Collobert, R. and Weston, J. (2008). A unified architecture for natural language process-
ing: Deep neural networks with multitask learning. In ICML’2008 . 404
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P.
(2011a). Natural language processing (almost) from scratch. Journal of Machine
Learning Research, 12, 2493–2537. 273
Collobert, R., Kavukcuoglu, K., and Farabet, C. (2011b). Torch7: A MATLAB-like envi-
ronment for machine learning. In BigLearn, NIPS Workshop. 379
Comon, P. (1994). Independent component analysis - a new concept? Signal Processing,
36, 287–314. 454, 455
Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning, 20,
273–297. 16, 134, 148
Couprie, C., Farabet, C., Najman, L., and LeCun, Y. (2013). Indoor semantic segmenta-
tion using depth information. In International Conference on Learning Representations
(ICLR2013). 22, 190
Courbariaux, M., Bengio, Y., and David, J.-P. (2015). Low precision arithmetic for deep
learning. In Arxiv:1412.7024, ICLR’2015 Workshop. 384
Courville, A., Bergstra, J., and Bengio, Y. (2011). Unsupervised models of images by
spike-and-slab RBMs. In ICML’11 . 415, 579
Courville, A., Desjardins, G., Bergstra, J., and Bengio, Y. (2014). The spike-and-slab
RBM and extensions to discrete and sparse data distributions. Pattern Analysis and
Machine Intelligence, IEEE Transactions on, 36(9), 1874–1887. 580
Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd Edition.
Wiley-Interscience. 56
Cox, D. and Pinto, N. (2011). Beyond simple features: A large-scale feature search
approach to unconstrained face recognition. In Automatic Face & Gesture Recognition
and Workshops (FG 2011), 2011 IEEE International Conference on, pages 8–15. IEEE.
298
Cramér, H. (1946). Mathematical methods of statistics. Princeton University Press. 126
Crick, F. H. C. and Mitchison, G. (1983). The function of dream sleep. Nature, 304,
111–114. 522
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathe-
matics of Control, Signals, and Systems, 2, 303–314. 188, 495
Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition
with the mean-covariance restricted Boltzmann machine. In NIPS’2010 . 22
Dahl, G. E., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained
deep neural networks for large vocabulary speech recognition. IEEE Transactions on
Audio, Speech, and Language Processing, 20(1), 33–42. 392
Dahl, G. E., Jaitly, N., and Salakhutdinov, R. (2014). Multi-task neural networks for
QSAR predictions. arXiv:1406.1231. 23
Dauphin, Y. and Bengio, Y. (2013). Stochastic ratio matching of RBMs for sparse high-
dimensional inputs. In NIPS26 . NIPS Foundation. 532
Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Large-scale learning of embeddings with
reconstruction sampling. In ICML’2011 . 403
Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014).
Identifying and attacking the saddle point problem in high-dimensional non-convex
optimization. In NIPS’2014 . 242, 475
Davis, A., Rubinstein, M., Wadhwa, N., Mysore, G., Durand, F., and Freeman, W. T.
(2014). The visual microphone: Passive recovery of sound from video. ACM Transac-
tions on Graphics (Proc. SIGGRAPH), 33(4), 79:1–79:10. 384
Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q., Mao, M., Ranzato, M.,
Senior, A., Tucker, P., Yang, K., and Ng, A. Y. (2012). Large scale distributed deep
networks. In NIPS’2012 . 380
Dean, T. and Kanazawa, K. (1989). A model for reasoning about persistence and causa-
tion. Computational Intelligence, 5(3), 142–150. 347
Delalleau, O. and Bengio, Y. (2011). Shallow vs. deep sum-product networks. In NIPS.
17, 189, 495, 496
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A
Large-Scale Hierarchical Image Database. In CVPR09 . 19, 142
Deng, J., Berg, A. C., Li, K., and Fei-Fei, L. (2010a). What does classifying more than
10,000 image categories tell us? In Proceedings of the 11th European Conference on
Computer Vision: Part V , ECCV’10, pages 71–84, Berlin, Heidelberg. Springer-Verlag.
19
Deng, J., Ding, N., Jia, Y., Frome, A., Murphy, K., Bengio, S., Li, Y., Neven, H., and
Adam, H. (2014). Large-scale object classification using label relation graphs. In
ECCV’2014 , pages 48–64. 349
Deng, L. and Yu, D. (2014). Deep learning – methods and applications. Foundations and
Trends in Signal Processing. 392
Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton, G. (2010b). Bi-
nary coding of speech spectrograms using a deep auto-encoder. In Interspeech 2010 ,
Makuhari, Chiba, Japan. 22
Desjardins, G. and Bengio, Y. (2008). Empirical evaluation of convolutional RBMs
for vision. Technical Report 1327, Département d'Informatique et de Recherche
Opérationnelle, Université de Montréal. 580
Desjardins, G., Courville, A., and Bengio, Y. (2011). On tracking the partition function.
In NIPS’2011 . 540
Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., and Makhoul, J. (2014). Fast
and robust neural network joint models for statistical machine translation. In Proc.
ACL’2014 . 406
DiCarlo, J. J. (2013). Mechanisms underlying visual object recognition: Humans vs.
neurons vs. machines. NIPS Tutorial. 23, 301
Do, T.-M.-T. and Artières, T. (2010). Neural conditional random fields. In International
Conference on Artificial Intelligence and Statistics, pages 177–184. 349
Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko,
K., and Darrell, T. (2014). Long-term recurrent convolutional networks for visual
recognition and description. arXiv:1411.4389. 95
Donoho, D. L. and Grimes, C. (2003). Hessian eigenmaps: new locally linear embed-
ding techniques for high-dimensional data. Technical Report 2003-08, Dept. Statistics,
Stanford University. 152, 504
Doya, K. (1993). Bifurcations of recurrent neural networks in gradient descent learning.
IEEE Transactions on Neural Networks, 1, 75–80. 249, 334
Dreyfus, S. E. (1962). The numerical solution of variational problems. Journal of Math-
ematical Analysis and Applications, 5(1), 30–45. 195
Dreyfus, S. E. (1973). The computational solution of optimal control problems with time
lag. IEEE Transactions on Automatic Control, 18(4), 383–385. 195
Dugas, C., Bengio, Y., Bélisle, F., and Nadeau, C. (2001). Incorporating second-order
functional knowledge for better option pricing. In NIPS'00, pages 472–478. MIT Press.
65, 164
Eggensperger, K., Feurer, M., Hutter, F., Bergstra, J., Snoek, J., Hoos, H., and Leyton-
Brown, K. (2013). Towards an empirical foundation for assessing Bayesian optimization
of hyperparameters. NIPS workshop on Bayesian Optimization in Theory and Practice.
372
El Hihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term
dependencies. In NIPS 8 . MIT Press. 326, 347
ElHihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term
dependencies. In NIPS’1995 . 337
Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., and Vincent, P. (2009). The difficulty
of training deep architectures and the effect of unsupervised pre-training. In Proceedings
of AISTATS’2009. 190
Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., and Bengio, S. (2010).
Why does unsupervised pre-training help deep learning? J. Machine Learning Res.
473, 475, 476, 477
Fang, H., Gupta, S., Iandola, F., Srivastava, R., Deng, L., Dollár, P., Gao, J., He, X.,
Mitchell, M., Platt, J. C., Zitnick, C. L., and Zweig, G. (2015). From captions to visual
concepts and back. arXiv:1411.4952. 95
Farabet, C., LeCun, Y., Kavukcuoglu, K., Culurciello, E., Martini, B., Akselrod, P.,
and Talay, S. (2011). Large-scale FPGA-based convolutional networks. In R. Bekker-
man, M. Bilenko, and J. Langford, editors, Scaling up Machine Learning: Parallel and
Distributed Approaches. Cambridge University Press. 462
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013a). Learning hierarchical
features for scene labeling. IEEE Transactions on Pattern Analysis and Machine In-
telligence. 22, 190
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013b). Learning hierarchical
features for scene labeling. IEEE Transactions on Pattern Analysis and Machine In-
telligence, 35(8), 1915–1929. 349
Fei-Fei, L., Fergus, R., and Perona, P. (2006). One-shot learning of object categories.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611.
482
Fischer, A. and Igel, C. (2011). Bounding the bias of contrastive divergence learning.
Neural Computation, 23(3), 664–73. 562
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals
of Eugenics, 7, 179–188. 19, 98
Frasconi, P., Gori, M., and Sperduti, A. (1997). On the efficient classification of data
structures by neural networks. In Proc. Int. Joint Conf. on Artificial Intelligence. 326,
328
Frasconi, P., Gori, M., and Sperduti, A. (1998). A general framework for adaptive pro-
cessing of data structures. IEEE Transactions on Neural Networks, 9(5), 768–786.
326, 328
Freund, Y. and Schapire, R. E. (1996a). Experiments with a new boosting algorithm. In
Machine Learning: Proceedings of Thirteenth International Conference, pages 148–156,
USA. ACM. 227
Freund, Y. and Schapire, R. E. (1996b). Game theory, on-line prediction and boosting.
In Proceedings of the Ninth Annual Conference on Computational Learning Theory,
pages 325–332. 227
Frey, B. J. (1998). Graphical models for machine learning and digital communication.
MIT Press. 329
Frey, B. J., Hinton, G. E., and Dayan, P. (1996). Does the wake-sleep algorithm learn
good density estimators? In NIPS’95 , pages 661–670. MIT Press, Cambridge, MA.
329
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mech-
anism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36,
193–202. 14, 21, 24, 302
Garson, J. (1900). The metric system of identification of criminals, as used in Great
Britain and Ireland. The Journal of the Anthropological Institute of Great Britain and
Ireland, (2), 177–227. 19
Gers, F. A., Schmidhuber, J., and Cummins, F. (2000). Learning to forget: Continual
prediction with LSTM. Neural computation, 12(10), 2451–2471. 342
Glorot, X. and Bengio, Y. (2010a). Understanding the difficulty of training deep feedfor-
ward neural networks. In AISTATS’2010 . 163
Glorot, X. and Bengio, Y. (2010b). Understanding the difficulty of training deep feedfor-
ward neural networks. In JMLR W&CP: Proceedings of the Thirteenth International
Conference on Artificial Intelligence and Statistics (AISTATS 2010), volume 9, pages
249–256. 266
Glorot, X., Bordes, A., and Bengio, Y. (2011a). Deep sparse rectifier neural networks. In
AISTATS’2011 . 14, 164, 461
Glorot, X., Bordes, A., and Bengio, Y. (2011b). Deep sparse rectifier neural networks.
In JMLR W&CP: Proceedings of the Fourteenth International Conference on Artificial
Intelligence and Statistics (AISTATS 2011). 193, 461
Glorot, X., Bordes, A., and Bengio, Y. (2011c). Domain adaptation for large-scale senti-
ment classification: A deep learning approach. In ICML’2011 . 461, 481
Gong, S., McKenna, S., and Psarrou, A. (2000). Dynamic Vision: From Images to Face
Recognition. Imperial College Press. 505, 507
Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep
networks. In NIPS’2009 , pages 646–654. 226, 448, 460
Goodfellow, I., Koenig, N., Muja, M., Pantofaru, C., Sorokin, A., and Takayama, L.
(2010). Help me help you: Interfaces for personal robots. In Proc. of Human Robot
Interaction (HRI), Osaka, Japan. ACM Press, ACM Press. 93
Goodfellow, I., Courville, A., and Bengio, Y. (2012). Large-scale feature learning with
spike-and-slab sparse coding. In ICML’2012 . 457
Goodfellow, I. J. (2010). Technical report: Multidimensional, downsampled convolution
for autoencoders. Technical report, Universit´e de Montr´eal. 293
Goodfellow, I. J., Courville, A., and Bengio, Y. (2011). Spike-and-slab sparse coding
for unsupervised feature discovery. In NIPS Workshop on Challenges in Learning
Hierarchical Models. 190, 481
Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a).
Maxout networks. In S. Dasgupta and D. McAllester, editors, ICML’13 , pages 1319–
1327. 193, 230, 300, 386
Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction
deep Boltzmann machines. In NIPS26 . NIPS Foundation. 94, 530, 576, 578
Goodfellow, I. J., Courville, A., and Bengio, Y. (2013c). Scaling up spike-and-slab models
for unsupervised feature learning. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 35(8), 1902–1914. 580
Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2014a). An em-
pirical investigation of catastrophic forgeting in gradient-based neural networks. In
ICLR’2014 . 194
Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014b). Explaining and harnessing adver-
sarial examples. CoRR, abs/1412.6572. 235
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,
Courville, A., and Bengio, Y. (2014c). Generative adversarial networks. In NIPS’2014.
188
Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., and Shet, V. (2014d). Multi-
digit number recognition from Street View imagery using deep convolutional neural
networks. In International Conference on Learning Representations. 22, 94, 190, 365,
382
Goodman, J. (2001). Classes for fast maximum entropy training. In International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP), Utah. 400
Gori, M. and Tesi, A. (1992). On the problem of local minima in backpropagation. IEEE
Transactions on Pattern Analysis and Machine Intelligence, PAMI-14(1), 76–86. 242
Gosset, W. S. (1908). The probable error of a mean. Biometrika, 6(1), 1–25. Originally
published under the pseudonym “Student”. 19
Gouws, S., Bengio, Y., and Corrado, G. (2014). Bilbowa: Fast bilingual distributed
representations without word alignments. Technical report, arXiv:1410.2455. 408, 484
Graf, H. P. and Jackel, L. D. (1989). Analog electronic neural network circuits. Circuits
and Devices Magazine, IEEE , 5(4), 44–49. 383
Graves, A. (2011a). Practical variational inference for neural networks. In J. Shawe-
Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in
Neural Information Processing Systems 24 , pages 2348–2356. Curran Associates, Inc.
211, 212
Graves, A. (2011b). Practical variational inference for neural networks. In NIPS’2011 .
214
Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks. Studies
in Computational Intelligence. Springer. 309, 323, 340, 341, 349, 393
Graves, A. (2013). Generating sequences with recurrent neural networks. Technical
report, arXiv:1308.0850. 172, 340
Graves, A. and Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent
neural networks. In ICML’2014 . 340
Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirec-
tional LSTM and other neural network architectures. Neural Networks, 18(5), 602–610.
323
Graves, A. and Schmidhuber, J. (2009). Offline handwriting recognition with multidi-
mensional recurrent neural networks. In D. Koller, D. Schuurmans, Y. Bengio, and
L. Bottou, editors, NIPS’2008 , pages 545–552. 323
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist tempo-
ral classification: Labelling unsegmented sequence data with recurrent neural networks.
In ICML'2006, pages 369–376, Pittsburgh, USA. 349, 393
Graves, A., Liwicki, M., Bunke, H., Schmidhuber, J., and Fernández, S. (2008). Uncon-
strained on-line handwriting recognition with recurrent neural networks. In J. Platt,
D. Koller, Y. Singer, and S. Roweis, editors, NIPS'2007, pages 577–584. 323
Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., and Schmidhuber,
J. (2009). A novel connectionist system for unconstrained handwriting recognition.
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5), 855–868.
340
Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep re-
current neural networks. In ICASSP’2013 , pages 6645–6649. 323, 325, 326, 340, 341,
392, 393
Graves, A., Wayne, G., and Danihelka, I. (2014a). Neural Turing machines.
arXiv:1410.5401. 23
Graves, A., Wayne, G., and Danihelka, I. (2014b). Neural Turing machines. arXiv preprint
arXiv:1410.5401. 343
Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., and Schmidhuber, J. (2015).
LSTM: a search space odyssey. arXiv preprint arXiv:1503.04069. 342
Gregor, K. and LeCun, Y. (2010). Emergence of complex-like cells in a temporal product
network with local receptive fields. Technical report, arXiv:1006.0448. 292
Gülçehre, Ç. and Bengio, Y. (2013). Knowledge matters: Importance of prior infor-
mation for optimization. In International Conference on Learning Representations
(ICLR'2013). 22, 270
Guo, H. and Gelfand, S. B. (1992). Classification trees with neural network feature
extraction. Neural Networks, IEEE Transactions on, 3(6), 923–933. 382
Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. (2015). Deep learning
with limited numerical precision. CoRR, abs/1502.02551. 384
Gutmann, M. and Hyvarinen, A. (2010). Noise-contrastive estimation: A new estima-
tion principle for unnormalized statistical models. In Proceedings of The Thirteenth
International Conference on Artificial Intelligence and Statistics (AISTATS’10). 532
Hadsell, R., Sermanet, P., Ben, J., Erkan, A., Han, J., Muller, U., and LeCun, Y.
(2007). Online learning for offroad robots: Spatial label propagation to learn long-
range traversability. In Proceedings of Robotics: Science and Systems, Atlanta, GA,
USA. 385
Haffner, P., Franzini, M., and Waibel, A. (1991). Integrating time alignment and neural
networks for high performance continuous speech recognition. In International Confer-
ence on Acoustics, Speech and Signal Processing (ICASSP), pages 105–108, Toronto.
358
Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings
of the 18th annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley,
California. ACM Press. 189, 495
Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits.
Computational Complexity, 1, 113–129. 189, 495
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The elements of statistical learning:
data mining, inference and prediction. Springer Series in Statistics. Springer Verlag.
138
Hebb, D. O. (1949). The Organization of Behavior. Wiley, New York. 15
Henaff, M., Jarrett, K., Kavukcuoglu, K., and LeCun, Y. (2011). Unsupervised learning
of sparse features for scalable audio classification. In ISMIR’11 . 462
Herault, J. and Ans, B. (1984). Circuits neuronaux à synapses modifiables: Décodage de
messages composites par apprentissage non supervisé. Comptes Rendus de l'Académie
des Sciences, 299(III-13), 525–528. 454
Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures. 257
Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V.,
Nguyen, P., Sainath, T., and Kingsbury, B. (2012a). Deep neural networks for acoustic
modeling in speech recognition. IEEE Signal Processing Magazine, 29(6), 82–97. 22,
392
Hinton, G. E. (2000). Training products of experts by minimizing contrastive divergence.
Technical Report GCNU TR 2000-004, Gatsby Unit, University College London. 523
Hinton, G. E. and Roweis, S. (2003). Stochastic neighbor embedding. In NIPS’2002 . 504
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with
neural networks. Science, 313(5786), 504–507. 450, 472, 473
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the Dimensionality of Data with
Neural Networks. Science, 313, 504–507. 475
Hinton, G. E. and Zemel, R. S. (1994). Autoencoders, minimum description length, and
Helmholtz free energy. In NIPS’1993 . 444
Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief
nets. Neural Computation, 18, 1527–1554. 12, 16, 24, 135, 472, 473, 563
Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A.,
Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012b). Deep neural
networks for acoustic modeling in speech recognition: The shared views of four research
groups. IEEE Signal Process. Mag., 29(6), 82–97. 94
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.
(2012c). Improving neural networks by preventing co-adaptation of feature detectors.
Technical report, arXiv:1207.0580. 208
Hinton, G. E., Vinyals, O., and Dean, J. (2014). Dark knowledge. Invited talk at the
BayLearn Bay Area Machine Learning Symposium. 381
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma
thesis, T.U. München. 247, 334, 343
Hochreiter, S. and Schmidhuber, J. (1995). Simplifying neural nets by discovering flat
minima. In Advances in Neural Information Processing Systems 7 , pages 529–536.
MIT Press. 215
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computa-
tion, 9(8), 1735–1780. 23, 340, 341
Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2000).
Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In
J. Kolen and S. Kremer, editors, Field Guide to Dynamical Recurrent Networks. IEEE
Press. 341
Holi, J. L. and Hwang, J.-N. (1993). Finite precision error analysis of neural network
hardware implementations. Computers, IEEE Transactions on, 42(3), 281–290. 383
Holt, J. L. and Baker, T. E. (1991). Back propagation simulations using limited pre-
cision calculations. In Neural Networks, 1991., IJCNN-91-Seattle International Joint
Conference on, volume 2, pages 121–126. IEEE. 383
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are
universal approximators. Neural Networks, 2, 359–366. 188, 495
Hornik, K., Stinchcombe, M., and White, H. (1990). Universal approximation of an
unknown mapping and its derivatives using multilayer feedforward networks. Neural
networks, 3(5), 551–560. 188
Horst, R., Pardalos, P., and Thoai, N. (2000). Introduction to Global Optimization.
Kluwer Academic Publishers. Second Edition. 271
Hsu, F.-H. (2002). Behind Deep Blue: Building the Computer That Defeated the World
Chess Champion. Princeton University Press, Princeton, NJ, USA. 2
Huang, F. and Ogata, Y. (2002). Generalized pseudo-likelihood estimates for Markov
random fields on lattice. Annals of the Institute of Statistical Mathematics, 54(1),
1–18. 529
Hubel, D. and Wiesel, T. (1968). Receptive fields and functional architecture of monkey
striate cortex. Journal of Physiology (London), 195, 215–243. 299
Hubel, D. H. and Wiesel, T. N. (1959). Receptive fields of single neurons in the cat’s
striate cortex. Journal of Physiology, 148, 574–591. 299
Hubel, D. H. and Wiesel, T. N. (1962). Receptive fields, binocular interaction, and
functional architecture in the cat’s visual cortex. Journal of Physiology (London),
160, 106–154. 299
Hutter, F., Hoos, H., and Leyton-Brown, K. (2011). Sequential model-based optimization
for general algorithm configuration. In LION-5 . Extended version as UBC Tech report
TR-2010-10. 372
Hyotyniemi, H. (1996). Turing machines are recurrent neural networks. In STeP’96,
pages 13–24. 311
Hyvärinen, A. (1999). Survey on independent component analysis. Neural Computing
Surveys, 2, 94–128. 454
Hyvärinen, A. (2005a). Estimation of non-normalized statistical models using score
matching. J. Machine Learning Res., 6. 465
Hyvärinen, A. (2005b). Estimation of non-normalized statistical models using score
matching. Journal of Machine Learning Research, 6, 695–709. 530
Hyvärinen, A. (2007a). Connections between score matching, contrastive divergence,
and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural
Networks, 18, 1529–1531. 531
Hyvärinen, A. (2007b). Some extensions of score matching. Computational Statistics and
Data Analysis, 51, 2499–2512. 531
Hyvärinen, A. and Pajunen, P. (1999). Nonlinear independent component analysis: Ex-
istence and uniqueness results. Neural Networks, 12(3), 429–439. 455
Hyvärinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis.
Wiley-Interscience. 454
Hyvärinen, A., Hurri, J., and Hoyer, P. O. (2009). Natural Image Statistics: A proba-
bilistic approach to early computational vision. Springer-Verlag. 305
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training
by reducing internal covariate shift. 22, 93, 263
Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation.
Neural networks, 1(4), 295–307. 256
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixture
of local experts. Neural Computation, 3, 79–87. 171
Jaeger, H. (2003). Adaptive nonlinear system identification with echo state networks. In
Advances in Neural Information Processing Systems 15 . 335
Jaeger, H. (2007a). Discovering multiscale dynamical features with hierarchical echo state
networks. Technical report, Jacobs University. 326
Jaeger, H. (2007b). Echo state network. Scholarpedia, 2(9), 2330. 334
Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and
saving energy in wireless communication. Science, 304(5667), 78–80. 24, 334
Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., Mooij, J. M., and Schölkopf, B. (2012).
On causal and anticausal learning. In ICML'2012, pages 1255–1262. 487, 489
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009a). What is the best
multi-stage architecture for object recognition? In ICCV’09. 14, 164, 462
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009b). What is the best
multi-stage architecture for object recognition? In Proc. International Conference on
Computer Vision (ICCV’09), pages 2146–2153. IEEE. 21, 24, 192, 193, 298, 299
Jarzynski, C. (1997). Nonequilibrium equality for free energy differences. Phys. Rev.
Lett., 78, 2690–2693. 539
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University
Press. 48
Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2014). On using very large target
vocabulary for neural machine translation. arXiv:1412.2007. 407
Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters
from sparse data. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in
Practice. North-Holland, Amsterdam. 348, 406
Jia, Y., Huang, C., and Darrell, T. (2012). Beyond spatial pyramids: Receptive field learn-
ing for pooled image features. In Computer Vision and Pattern Recognition (CVPR),
2012 IEEE Conference on, pages 3370–3377. IEEE. 287
Jim, K.-C., Giles, C. L., and Horne, B. G. (1996). An analysis of noise in recurrent neural
networks: convergence and generalization. IEEE Transactions on Neural Networks,
7(6), 1424–1438. 211, 214
Jordan, M. I. (1998). Learning in Graphical Models. Kluwer, Dordrecht, Netherlands. 16
Jozefowicz, R., Zaremba, W., and Sutskever, I. (2015a). An empirical evaluation of
recurrent network architectures. In ICML’2015 . 342
Jozefowicz, R., Zaremba, W., and Sutskever, I. (2015b). An empirical exploration of
recurrent network architectures. In Proceedings of The 32nd International Conference
on Machine Learning, pages 2342–2350. 268
Juang, B. H. and Katagiri, S. (1992). Discriminative learning for minimum error classi-
fication. IEEE Transactions on Signal Processing, 40(12), 3043–3054. 356
Judd, J. S. (1989). Neural Network Design and the Complexity of Learning. MIT press.
249
Jutten, C. and Herault, J. (1991). Blind separation of sources, part I: an adaptive algo-
rithm based on neuromimetic architecture. Signal Processing, 24, 1–10. 454
Kahou, S. E., Pal, C., Bouthillier, X., Froumenty, P., Gülçehre, Ç., Memisevic, R., Vin-
cent, P., Courville, A., Bengio, Y., Ferrari, R. C., Mirza, M., Jean, S., Carrier, P.-L.,
Dauphin, Y., Boulanger-Lewandowski, N., Aggarwal, A., Zumer, J., Lamblin, P., Ray-
mond, J.-P., Desjardins, G., Pascanu, R., Warde-Farley, D., Torabi, A., Sharma, A.,
Bengio, E., Côté, M., Konda, K. R., and Wu, Z. (2013). Combining modality specific
deep neural networks for emotion recognition in video. In Proceedings of the 15th ACM
on International Conference on Multimodal Interaction. 190
Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In
EMNLP’2013 . 407
Kamyshanska, H. and Memisevic, R. (2015). The potential energy of an autoencoder.
IEEE Transactions on Pattern Analysis and Machine Intelligence. 468
Kanazawa, K., Koller, D., and Russell, S. (1995). Stochastic simulation algorithms for
dynamic probabilistic networks. In Proc. UAI’1995 , pages 346–351. 347
Karpathy, A. and Li, F.-F. (2015). Deep visual-semantic alignments for generating image
descriptions. In CVPR’2015 . arXiv:1412.2306. 95
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014).
Large-scale video classification with convolutional neural networks. In CVPR. 19
Karush, W. (1939). Minima of Functions of Several Variables with Inequalities as Side
Constraints. Master's thesis, Dept. of Mathematics, Univ. of Chicago. 90
Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model
component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal
Processing, ASSP-35(3), 400–401. 348, 406
Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008a). Fast inference in sparse coding
algorithms with applications to object recognition. CBLL-TR-2008-12-01, NYU. 447
Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008b). Fast inference in sparse coding
algorithms with applications to object recognition. Technical report, Computational
and Biological Learning Lab, Courant Institute, NYU. Tech Report CBLL-TR-2008-
12-01. 462
Kavukcuoglu, K., Ranzato, M.-A., Fergus, R., and LeCun, Y. (2009). Learning invariant
features through topographic filter maps. In CVPR’2009. 462
Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y.
(2010a). Learning convolutional feature hierarchies for visual recognition. In Advances
in Neural Information Processing Systems 23 (NIPS’10), pages 1090–1098. 299
Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and Le-
Cun, Y. (2010b). Learning convolutional feature hierarchies for visual recognition.
In NIPS’2010 . 462
Kelley, H. J. (1960). Gradient theory of optimal flight paths. ARS Journal, 30(10),
947–954. 195
Khan, F., Zhu, X., and Mutlu, B. (2011). How do humans teach: On curriculum learning
and teaching dimension. In Advances in Neural Information Processing Systems 24
(NIPS’11), pages 1449–1457. 273
Kim, S. K., McAfee, L. C., McMahon, P. L., and Olukotun, K. (2009). A highly scalable
restricted Boltzmann machine FPGA implementation. In Field Programmable Logic
and Applications, 2009. FPL 2009. International Conference on, pages 367–372. IEEE.
383
Kindermann, R. (1980). Markov Random Fields and Their Applications (Contemporary
Mathematics; V. 1). American Mathematical Society. 419
Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980 . 258
Kingma, D. and LeCun, Y. (2010a). Regularized estimation of image statistics by score
matching. In NIPS’2010 . 465
Kingma, D. and LeCun, Y. (2010b). Regularized estimation of image statistics by score
matching. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta,
editors, Advances in Neural Information Processing Systems 23, pages 1126–1134. 532
Kingma, D., Rezende, D., Mohamed, S., and Welling, M. (2014). Semi-supervised learning
with deep generative models. In NIPS’2014. 435
Kingma, D. P. (2013). Fast gradient-based inference with continuous latent variable
models in auxiliary form. Technical report, arxiv:1306.0733. 188, 435
Kingma, D. P. and Welling, M. (2014a). Auto-encoding variational Bayes. In Proceedings
of the International Conference on Learning Representations (ICLR). 188, 435, 507,
508
Kingma, D. P. and Welling, M. (2014b). Efficient gradient-based inference through trans-
formations between Bayes nets and neural nets. Technical report, arXiv:1402.0480. 188,
434, 435
Kirkpatrick, S., Gelatt Jr., C. D., and Vecchi, M. P. (1983). Optimization by simulated
annealing. Science, 220, 671–680. 271
Kiros, R., Salakhutdinov, R., and Zemel, R. (2014a). Multimodal neural language models.
In ICML’2014 . 95
Kiros, R., Salakhutdinov, R., and Zemel, R. (2014b). Unifying visual-semantic embed-
dings with multimodal neural language models. arXiv:1411.2539 [cs.LG]. 95, 340
Klementiev, A., Titov, I., and Bhattarai, B. (2012). Inducing crosslingual distributed
representations of words. In Proceedings of COLING 2012 . 408, 484
Knowles-Barley, S., Jones, T. R., Morgan, J., Lee, D., Kasthuri, N., Lichtman, J. W., and
Pfister, H. (2014). Deep learning for the connectome. GPU Technology Conference. 23
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and
Techniques. MIT Press. 354, 433, 440
Konig, Y., Bourlard, H., and Morgan, N. (1996). REMAP: Recursive estimation and max-
imization of a posteriori probabilities - application to transition-based connectionist
speech recognition. In NIPS'95. MIT Press, Cambridge, MA. 391
Koren, Y. (2009). The BellKor solution to the Netflix Grand Prize. 227
Koutnik, J., Greff, K., Gomez, F., and Schmidhuber, J. (2014). A clockwork RNN. In
ICML’2014 . 326, 347
Kočiský, T., Hermann, K. M., and Blunsom, P. (2014). Learning Bilingual Word Repre-
sentations by Marginalizing Alignments. In Proceedings of ACL. 408
Krause, O., Fischer, A., Glasmachers, T., and Igel, C. (2013). Approximation properties
of DBNs with binary hidden units and real-valued visible units. In ICML’2013. 495
Krizhevsky, A. (2010). Convolutional deep belief networks on CIFAR-
10. Technical report, University of Toronto. Unpublished manuscript:
http://www.cs.utoronto.ca/~kriz/conv-cifar10-aug2010.pdf. 379
Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny
images. Technical report, University of Toronto. 19, 415
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012a). ImageNet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems
25 (NIPS’2012). 21, 24, 93, 306, 385
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012b). ImageNet classification with deep
convolutional neural networks. In NIPS’2012 . 22, 190, 461
Kuhn, H. W. and Tucker, A. W. (1951). Nonlinear programming. In Proceedings of the
Second Berkeley Symposium on Mathematical Statistics and Probability, pages 481–
492, Berkeley, Calif. University of California Press. 90
Kumar, M. P., Packer, B., and Koller, D. (2010). Self-paced learning for latent variable
models. In NIPS’2010 . 273
Lafferty, J., McCallum, A., and Pereira, F. C. N. (2001). Conditional random fields:
Probabilistic models for segmenting and labeling sequence data. In C. E. Brodley and
A. P. Danyluk, editors, ICML 2001 . Morgan Kaufmann. 349, 357
Lang, K. J. and Hinton, G. E. (1988). The development of the time-delay neural net-
work architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-
Mellon University. 302, 309, 336
Lappalainen, H., Giannakopoulos, X., Honkela, A., and Karhunen, J. (2000). Nonlinear
independent component analysis using ensemble learning: Experiments and discussion.
In Proc. ICA. Citeseer. 455
Larochelle, H. and Bengio, Y. (2008a). Classification using discriminative restricted Boltz-
mann machines. In ICML’2008 . 226, 448, 591
Larochelle, H. and Bengio, Y. (2008b). Classification using discriminative restricted
Boltzmann machines. In ICML’08 , pages 536–543. ACM. 486
Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator.
In AISTATS’2011 . 329, 332
Larochelle, H., Erhan, D., and Bengio, Y. (2008). Zero-data learning of new tasks. In
AAAI Conference on Artificial Intelligence. 482
Lasserre, J. A., Bishop, C. M., and Minka, T. P. (2006). Principled hybrids of generative
and discriminative models. In Proceedings of the Computer Vision and Pattern Recog-
nition Conference (CVPR’06), pages 87–94, Washington, DC, USA. IEEE Computer
Society. 223, 486
Le, Q., Ngiam, J., Chen, Z., hao Chia, D. J., Koh, P. W., and Ng, A. (2010). Tiled
convolutional neural networks. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor,
R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems
23 (NIPS’10), pages 1279–1287. 292
Le, Q., Ranzato, M., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J., and Ng,
A. (2012). Building high-level features using large scale unsupervised learning. In
ICML’2012 . 21, 24
Le Roux, N. and Bengio, Y. (2010). Deep belief networks are compact universal approx-
imators. Neural Computation, 22(8), 2192–2207. 495
Le Roux, N., Manzagol, P.-A., and Bengio, Y. (2008). Topmoumoute online natural
gradient algorithm. In NIPS 20 . MIT Press. 241
LeCun, Y. (1985). Une procédure d’apprentissage pour réseau à seuil asymétrique. In
Cognitiva 85: À la Frontière de l’Intelligence Artificielle, des Sciences de la Connais-
sance et des Neurosciences, pages 599–604, Paris 1985. CESTA, Paris. 195
LeCun, Y. (1987). Modèles connexionistes de l’apprentissage. Ph.D. thesis, Université de
Paris VI. 16, 444
LeCun, Y., Jackel, L. D., Boser, B., Denker, J. S., Graf, H. P., Guyon, I., Henderson, D.,
Howard, R. E., and Hubbard, W. (1989). Handwritten digit recognition: Applications
of neural network chips and automatic learning. IEEE Communications Magazine,
27(11), 41–46. 302
LeCun, Y., Bottou, L., Orr, G. B., and Müller, K.-R. (1998a). Efficient backprop. In
Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science LNCS 1524.
Springer Verlag. 365
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998b). Gradient-based learning
applied to document recognition. Proceedings of the IEEE , 86(11), 2278–2324. 14, 24
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998c). Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. 16, 19,
349, 357, 358, 359, 392
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998d). Gradient based learning
applied to document recognition. Proc. IEEE. 305
LeCun, Y., Kavukcuoglu, K., and Farabet, C. (2010). Convolutional networks and ap-
plications in vision. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE
International Symposium on, pages 253–256. IEEE. 306
Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area
V2. In NIPS’07 . 226, 448
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). Convolutional deep belief
networks for scalable unsupervised learning of hierarchical representations. In L. Bottou
and M. Littman, editors, ICML 2009. ACM, Montreal, Canada. 298, 580, 581
Lee, Y. J. and Grauman, K. (2011). Learning the easy things first: self-paced visual
category discovery. In CVPR’2011 . 273
Leibniz, G. W. (1676). Memoir using the chain rule. (Cited in TMME 7:2&3 p 321-332,
2010). 194
Lenat, D. B. and Guha, R. V. (1989). Building large knowledge-based systems; represen-
tation and inference in the Cyc project. Addison-Wesley Longman Publishing Co., Inc.
2
Leprieur, H. and Haffner, P. (1995). Discriminant learning with minimum memory loss
for improved non-vocabulary rejection. In EUROSPEECH’95, Madrid, Spain. 356
L’Hôpital, G. F. A. (1696). Analyse des infiniment petits, pour l’intelligence des lignes
courbes. Paris: L’Imprimerie Royale. 194
Lin, T., Horne, B. G., Tino, P., and Giles, C. L. (1996). Learning long-term dependencies
is not as difficult with NARX recurrent neural networks. IEEE Transactions on Neural
Networks, 7(6), 1329–1338. 336, 337
Linde, N. (1992). The machine that changed the world, episode 3. Documentary minis-
eries. 2
Lindsey, C. and Lindblad, T. (1994). Review of hardware neural networks: a user’s
perspective. In Proc. Third Workshop on Neural Networks: From Biology to High
Energy Physics, pages 195–202, Isola d’Elba, Italy. 383
Linnainmaa, S. (1976). Taylor expansion of the accumulated rounding error. BIT Nu-
merical Mathematics, 16(2), 146–160. 195
Long, P. M. and Servedio, R. A. (2010). Restricted Boltzmann machines are hard to ap-
proximately evaluate or simulate. In Proceedings of the 27th International Conference
on Machine Learning (ICML’10). 557
Lovelace, A. (1842). Notes upon L. F. Menabrea’s “Sketch of the Analytical Engine
invented by Charles Babbage”. 1
Lowerre, B. (1976). The Harpy Speech Recognition System. Ph.D. thesis. 350, 356, 362
Lukoševičius, M. and Jaeger, H. (2009). Reservoir computing approaches to recurrent
neural network training. Computer Science Review, 3(3), 127–149. 334
Luo, H., Carrier, P.-L., Courville, A., and Bengio, Y. (2013). Texture modeling with
convolutional spike-and-slab RBMs and deep extensions. In AISTATS’2013 . 95
Lyness, J. N. and Moler, C. B. (1967). Numerical differentiation of analytic functions.
SIAM J. Numer. Anal., 4, 202–210. 184
Lyu, S. (2009). Interpretation and generalization of score matching. In UAI’09 . 531
Maass, W., Natschlaeger, T., and Markram, H. (2002). Real-time computing without
stable states: A new framework for neural computation based on perturbations. Neural
Computation, 14(11), 2531–2560. 334
MacKay, D. (2003). Information Theory, Inference and Learning Algorithms. Cambridge
University Press. 56
Maclaurin, D., Duvenaud, D., and Adams, R. P. (2015). Gradient-based hyperparameter
optimization through reversible learning. arXiv preprint arXiv:1502.03492 . 372
Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., and Yuille, A. L. (2015). Deep captioning
with multimodal recurrent neural networks. In ICLR’2015 . arXiv:1410.1090. 95
Marlin, B., Swersky, K., Chen, B., and de Freitas, N. (2010). Inductive principles for
restricted Boltzmann machine learning. In Proceedings of The Thirteenth International
Conference on Artificial Intelligence and Statistics (AISTATS’10), volume 9, pages
509–516. 526, 531, 559
Marr, D. and Poggio, T. (1976). Cooperative computation of stereo disparity. Science,
194. 302
Martens, J. (2010). Deep learning via Hessian-free optimization. In L. Bottou and
M. Littman, editors, Proceedings of the Twenty-seventh International Conference on
Machine Learning (ICML-10), pages 735–742. ACM. 267
Martens, J. and Medabalimi, V. (2014). On the expressive efficiency of sum product
networks. arXiv:1411.7717 . 496
Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-
free optimization. In Proc. ICML’2011 . ACM. 344
Mase, S. (1995). Consistency of the maximum pseudo-likelihood estimator of continuous
state space Gibbsian processes. The Annals of Applied Probability, 5(3), 603–612.
529
Matan, O., Burges, C. J. C., LeCun, Y., and Denker, J. S. (1992). Multi-digit recognition
using a space displacement neural network. In NIPS’91 , pages 488–495, San Mateo
CA. Morgan Kaufmann. 358
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall,
London. 165
McCulloch, W. S. and Pitts, W. (1943). A logical calculus of ideas immanent in nervous
activity. Bulletin of Mathematical Biophysics, 5, 115–133. 13
Mead, C. and Ismail, M. (2012). Analog VLSI implementation of neural systems, vol-
ume 80. Springer Science & Business Media. 383
Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E.,
Muller, X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra,
J. (2011). Unsupervised and transfer learning challenge: a deep learning approach. In
JMLR W&CP: Proc. Unsupervised and Transfer Learning, volume 7. 190, 481
Mesnil, G., Rifai, S., Dauphin, Y., Bengio, Y., and Vincent, P. (2012). Surfing on the
manifold. Learning Workshop, Snowbird. 584
Miikkulainen, R. and Dyer, M. G. (1991). Natural language processing with modular
PDP networks and distributed lexicon. Cognitive Science, 15, 343–399. 394
Mikolov, T. (2012). Statistical Language Models based on Neural Networks. Ph.D. thesis,
Brno University of Technology. 172, 345
Mikolov, T., Deoras, A., Kombrink, S., Burget, L., and Cernocky, J. (2011a). Empiri-
cal evaluation and combination of advanced language modeling techniques. In Proc.
12th annual conference of the international speech communication association (INTER-
SPEECH 2011). 405
Mikolov, T., Deoras, A., Povey, D., Burget, L., and Cernocky, J. (2011b). Strategies for
training large scale neural network language models. In Proc. ASRU’2011. 273, 405
Mikolov, T., Le, Q. V., and Sutskever, I. (2013). Exploiting similarities among languages
for machine translation. Technical report, arXiv:1309.4168. 484
Minka, T. (2005). Divergence measures and message passing. Technical Report MSR-
TR-2005-173, Microsoft Research, Cambridge, UK. 536
Minsky, M. L. and Papert, S. A. (1969). Perceptrons. MIT Press, Cambridge. 13
Misra, J. and Saha, I. (2010). Artificial neural networks in hardware: A survey of two
decades of progress. Neurocomputing, 74(1), 239–255. 383
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York. 92
Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-
contrastive estimation. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and
K. Weinberger, editors, Advances in Neural Information Processing Systems 26 , pages
2265–2273. Curran Associates, Inc. 404, 534
Mnih, A. and Teh, Y. W. (2012). A fast and simple algorithm for training neural proba-
bilistic language models. In ICML’2012 , pages 1751–1758. 404
Mnih, V. and Hinton, G. (2010). Learning to detect roads in high-resolution aerial images.
In Proceedings of the 11th European Conference on Computer Vision (ECCV). 95
Mobahi, H. and Fisher III, J. W. (2015). A theoretical analysis of optimization by Gaussian
continuation. In AAAI’2015. 272
Mohamed, A., Dahl, G., and Hinton, G. (2012). Acoustic modeling using deep belief
networks. IEEE Trans. on Audio, Speech and Language Processing, 20(1), 14–22. 392
Montúfar, G. (2014). Universal approximation depth and errors of narrow belief networks
with discrete units. Neural Computation, 26. 495
Montúfar, G. and Ay, N. (2011). Refinements of universal approximation results for
deep belief networks and restricted Boltzmann machines. Neural Computation, 23(5),
1306–1319. 495
Montúfar, G. and Morton, J. (2014). When does a mixture of products contain a product
of mixtures? SIAM Journal on Discrete Mathematics, 29(1), 321–347. 494
Montúfar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear
regions of deep neural networks. In NIPS’2014 . 17, 493, 496, 497
Mor-Yosef, S., Samueloff, A., Modan, B., Navot, D., and Schenker, J. G. (1990). Ranking
the risk factors for cesarean: logistic regression analysis of a nationwide study. Obstet
Gynecol, 75(6), 944–7. 2
Morin, F. and Bengio, Y. (2005). Hierarchical probabilistic neural network language
model. In AISTATS’2005. 400, 402
Mozer, M. C. (1992). The induction of multiscale temporal structure. In NIPS’91 , pages
275–282, San Mateo, CA. Morgan Kaufmann. 337, 338, 347
Murphy, K. P. (2012). Machine Learning: a Probabilistic Perspective. MIT Press, Cam-
bridge, MA, USA. 92, 138
Murray, B. U. I. and Larochelle, H. (2014). A deep and tractable density estimator. In
ICML’2014 . 172, 333, 334
Nadas, A., Nahamoo, D., and Picheny, M. A. (1988). On a model-robust training method
for speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing,
ASSP-36(9), 1432–1436. 356
Nair, V. and Hinton, G. (2010a). Rectified linear units improve restricted Boltzmann
machines. In ICML’2010 . 164, 461
Nair, V. and Hinton, G. E. (2010b). Rectified linear units improve restricted Boltzmann
machines. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh
International Conference on Machine Learning (ICML-10), pages 807–814. ACM. 14
Narayanan, H. and Mitter, S. (2010). Sample complexity of testing the manifold hypoth-
esis. In NIPS’2010 . 152, 501
Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56,
71–113. 582
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics.
Springer. 231
Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2),
125–139. 538, 539
Neal, R. M. (2005). Estimating ratios of normalizing constants using linked importance
sampling. 540
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Read-
ing digits in natural images with unsupervised feature learning. Deep Learning and
Unsupervised Feature Learning Workshop, NIPS. 19
Ney, H. and Kneser, R. (1993). Improved clustering techniques for class-based statistical
language modelling. In European Conference on Speech Communication and Technol-
ogy (Eurospeech), pages 973–976, Berlin. 395
Niesler, T. R., Whittaker, E. W. D., and Woodland, P. C. (1998). Comparison of
part-of-speech and automatically derived category-based language models for speech
recognition. In International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 177–180. 397
Niranjan, M. and Fallside, F. (1990). Neural networks and radial basis functions in
classifying static speech patterns. Computer Speech and Language, 4, 275–289. 164
Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer. 85, 90
Olshausen, B. and Field, D. J. (2005). How close are we to understanding V1? Neural
Computation, 17, 1665–1699. 14
Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field prop-
erties by learning a sparse code for natural images. Nature, 381, 607–609. 226, 305,
447, 500
Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set:
a strategy employed by V1? Vision Research, 37, 3311–3325. 389, 391, 460
Opper, M. and Archambeau, C. (2009). The variational Gaussian approximation revisited.
Neural Computation, 21(3), 786–792. 188
Parker, D. B. (1985). Learning-logic. Technical Report TR-47, Center for Comp. Research
in Economics and Management Sci., MIT. 195
Pascanu, R. (2014). On recurrent and deep networks. Ph.D. thesis, Université de
Montréal. 244, 245
Pascanu, R. and Bengio, Y. (2012). On the difficulty of training recurrent neural networks.
Technical Report arXiv:1211.5063, Universite de Montreal. 172
Pascanu, R., Mikolov, T., and Bengio, Y. (2013a). On the difficulty of training recurrent
neural networks. In ICML’2013 . 172, 249, 334, 338, 345, 346, 347
Pascanu, R., Montufar, G., and Bengio, Y. (2013b). On the number of inference regions
of deep feed forward networks with piece-wise linear activations. Technical report, U.
Montreal, arXiv:1312.6098. 189
Pascanu, R., Gülçehre, Ç., Cho, K., and Bengio, Y. (2014a). How to construct deep
recurrent neural networks. In ICLR’2014. 17, 231, 325, 326, 340, 393, 496, 497
Pascanu, R., Montufar, G., and Bengio, Y. (2014b). On the number of inference regions
of deep feed forward networks with piece-wise linear activations. In ICLR’2014 . 494
Pearl, J. (1985). Bayesian networks: A model of self-activated memory for evidential
reasoning. In Proceedings of the 7th Conference of the Cognitive Science Society, Uni-
versity of California, Irvine, pages 329–334. 417
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann. 49
Petersen, K. B. and Pedersen, M. S. (2006). The matrix cookbook. Version 20051003. 28
Pinto, N., Cox, D. D., and DiCarlo, J. J. (2008). Why is real-world visual object recog-
nition hard? PLoS Comput Biol, 4. 389, 581
Pinto, N., Stone, Z., Zickler, T., and Cox, D. (2011). Scaling up biologically-inspired
computer vision: A case study in unconstrained face recognition on facebook. In Com-
puter Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer
Society Conference on, pages 35–42. IEEE. 298
Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1),
77–105. 326
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods.
USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17. 253
Poole, B., Sohl-Dickstein, J., and Ganguli, S. (2014). Analyzing noise in autoencoders
and deep networks. CoRR, abs/1406.1831. 211
Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In
UAI’2011 , Barcelona, Spain. 189, 496
Poundstone, W. (2005). Fortune’s Formula: The untold story of the scientific betting
system that beat the casinos and Wall Street. Macmillan. 57
Powell, M. (1987). Radial basis functions for multivariable interpolation: A review. 164
Presley, R. K. and Haggard, R. L. (1994). A fixed point implementation of the backprop-
agation learning algorithm. In Southeastcon ’94: Creative Technology Transfer – A
Global Affair, Proceedings of the 1994 IEEE, pages 136–138. IEEE. 383
Price, R. (1958). A useful theorem for nonlinear devices having Gaussian inputs. IEEE
Transactions on Information Theory, 4(2), 69–72. 187
Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., and Fried, I. (2005). Invariant visual
representation by single neurons in the human brain. Nature, 435(7045), 1102–1107.
300
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE, 77(2), 257–286. 354, 391
Rabiner, L. R. and Juang, B. H. (1986). An introduction to hidden Markov models. IEEE
ASSP Magazine, pages 257–285. 308, 354
Raiko, T., Yao, L., Cho, K., and Bengio, Y. (2014). Iterative neural autoregressive
distribution estimator (NADE-k). Technical report, arXiv:1406.1485. 333
Raina, R., Madhavan, A., and Ng, A. Y. (2009). Large-scale deep unsupervised learning
using graphics processors. In L. Bottou and M. Littman, editors, ICML 2009 , pages
873–880, New York, NY, USA. ACM. 24, 378
Rall, L. B. (1981). Automatic Differentiation: Techniques and Applications. Lecture
Notes in Computer Science 120, Springer. 184
Ramsey, F. P. (1926). Truth and probability. In R. B. Braithwaite, editor, The Founda-
tions of Mathematics and other Logical Essays, chapter 7, pages 156–198. McMaster
University Archive for the History of Economic Thought. 50
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007a). Efficient learning of
sparse representations with an energy-based model. In NIPS’2006 . 12, 16, 460, 472,
473
Ranzato, M., Huang, F., Boureau, Y., and LeCun, Y. (2007b). Unsupervised learning
of invariant feature hierarchies with applications to object recognition. In Proceedings
of the Computer Vision and Pattern Recognition Conference (CVPR’07). IEEE Press.
299
Ranzato, M., Boureau, Y., and LeCun, Y. (2008). Sparse feature learning for deep belief
networks. In NIPS’2007 . 460
Rao, C. (1945). Information and the accuracy attainable in the estimation of statistical
parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–89. 126
Recht, B., Re, C., Wright, S., and Niu, F. (2011). Hogwild: A lock-free approach to
parallelizing stochastic gradient descent. In NIPS’2011 . 380
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and
approximate inference in deep generative models. In ICML’2014. 188, 434, 435
Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. Y. (2013). Zero-shot learning
through cross-modal transfer. In 27th Annual Conference on Neural Information Pro-
cessing Systems (NIPS 2013). 482
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011a). Contractive auto-
encoders: Explicit invariance during feature extraction. In ICML’2011 . 468, 470,
503
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X.
(2011b). Higher order contractive auto-encoder. In European Conference on Machine
Learning and Principles and Practice of Knowledge Discovery in Databases (ECML
PKDD). 448
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X.
(2011c). Higher order contractive auto-encoder. In ECML PKDD. 468
Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011d). The manifold
tangent classifier. In NIPS’2011 . 516
Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for
sampling contractive auto-encoders. In ICML’2012. 584
Ringach, D. and Shapley, R. (2004). Reverse correlation in neurophysiology. Cognitive
Science, 28(2), 147–166. 302
Roberts, S. and Everson, R. (2001). Independent component analysis: principles and
practice. Cambridge University Press. 455
Robinson, A. J. and Fallside, F. (1991). A recurrent error propagation network speech
recognition system. Computer Speech and Language, 5(3), 259–274. 24, 391
Rockafellar, R. T. (1997). Convex Analysis. Princeton Landmarks in Mathematics. 88
Romero, A., Ballas, N., Ebrahimi Kahou, S., Chassang, A., Gatta, C., and Bengio, Y.
(2015). Fitnets: Hints for thin deep nets. In ICLR’2015, arXiv:1412.6550. 271
Rosen, J. B. (1960). The gradient projection method for nonlinear programming. Part
I. Linear constraints. Journal of the Society for Industrial and Applied Mathematics,
8(1), 181–217. 88
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage
and organization in the brain. Psychological Review, 65, 386–408. 12, 13, 24
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York. 13, 24
Roweis, S. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear
embedding. Science, 290(5500). 152, 153, 504
Rumelhart, D., Hinton, G., and Williams, R. (1986a). Learning representations by back-
propagating errors. Nature, 323, 533–536. 12, 16, 22, 195, 394
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986b). Learning internal repre-
sentations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors,
Parallel Distributed Processing, volume 1, chapter 8, pages 318–362. MIT Press, Cam-
bridge. 19, 24, 195
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986c). Learning representations
by back-propagating errors. Nature, 323, 533–536. 158, 308
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986d). Parallel
Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press,
Cambridge. 15, 195
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986e). Parallel
Distributed Processing: Explorations in the Microstructure of Cognition, volume 1.
MIT Press, Cambridge. 158
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,
A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2014a). ImageNet Large
Scale Visual Recognition Challenge. 19
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,
A., Khosla, A., Bernstein, M., et al. (2014b). ImageNet large scale visual recognition
challenge. arXiv preprint arXiv:1409.0575 . 25
Rust, N., Schwartz, O., Movshon, J. A., and Simoncelli, E. (2005). Spatiotemporal
elements of macaque V1 receptive fields. Neuron, 46(6), 945–956. 301
Sainath, T., Mohamed, A.-r., Kingsbury, B., and Ramabhadran, B. (2013). Deep
convolutional neural networks for LVCSR. In ICASSP 2013 . 392
Salakhutdinov, R. and Hinton, G. (2009a). Deep Boltzmann machines. In Proceedings of
the International Conference on Artificial Intelligence and Statistics, volume 5, pages
448–455. 21, 24, 473, 566, 569, 574, 576
Salakhutdinov, R. and Hinton, G. (2009b). Deep Boltzmann machines. In Proceedings
of the Twelfth International Conference on Artificial Intelligence and Statistics (AIS-
TATS 2009), volume 8. 573, 577, 589
Salakhutdinov, R. and Hinton, G. E. (2008). Using deep belief nets to learn covariance
kernels for Gaussian processes. In NIPS’07 , pages 1249–1256, Cambridge, MA. MIT
Press. 487
Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief
networks. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008 ,
volume 25, pages 872–879. ACM. 539
Saul, L. K., Jaakkola, T., and Jordan, M. I. (1996). Mean field theory for sigmoid belief
networks. Journal of Artificial Intelligence Research, 4, 61–76. 24
Savich, A. W., Moussa, M., and Areibi, S. (2007). The impact of arithmetic representation
on implementing MLP-BP on FPGAs: a study. Neural Networks, IEEE Transactions on,
18(1), 240–252. 383
Saxe, A. M., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., and Ng, A. (2011). On
random weights and unsupervised feature learning. In Proc. ICML’2011 . ACM. 298
Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear
dynamics of learning in deep linear neural networks. In ICLR. 266
Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of
history compression. Neural Computation, 4(2), 234–242. 326
Schmidhuber, J. (1996). Sequential neural text compression. IEEE Transactions on
Neural Networks, 7(1), 142–146. 394
Schölkopf, B. and Smola, A. (2002). Learning with kernels. MIT Press. 148
Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a
kernel eigenvalue problem. Neural Computation, 10, 1299–1319. 152, 504
Schölkopf, B., Burges, C. J. C., and Smola, A. J. (1999). Advances in Kernel Methods –
Support Vector Learning. MIT Press, Cambridge, MA. 16, 164, 191
Schulz, H. and Behnke, S. (2012). Learning two-layer contractive encodings. In
ICANN’2012 , pages 620–628. 470
Schuster, M. and Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE
Transactions on Signal Processing, 45(11), 2673–2681. 323
Schwenk, H. (2007). Continuous space language models. Computer Speech and Language,
21, 492–518. 395, 399
Schwenk, H. (2010). Continuous space language models for statistical machine translation.
The Prague Bulletin of Mathematical Linguistics, 93, 137–146. 395, 405
Schwenk, H. (2014). Cleaned subset of WMT ’14 dataset. 19
Schwenk, H. and Bengio, Y. (1998). Training methods for adaptive boosting of neural
networks. In NIPS’97 , pages 647–653. MIT Press. 227
Schwenk, H. and Gauvain, J.-L. (2002). Connectionist language modeling for large vocab-
ulary continuous speech recognition. In International Conference on Acoustics, Speech
and Signal Processing (ICASSP), volume 1, pages 765–768. 395
Schwenk, H. and Gauvain, J.-L. (2005). Building continuous space language models for
transcribing european languages. In Interspeech, pages 737–740. 395
Schwenk, H., Costa-jussà, M. R., and Fonollosa, J. A. R. (2006). Continuous space lan-
guage models for the IWSLT 2006 task. In International Workshop on Spoken Language
Translation, pages 166–173. 395, 405
Seide, F., Li, G., and Yu, D. (2011). Conversational speech transcription using context-
dependent deep neural networks. In Interspeech 2011 , pages 437–440. 22
Sermanet, P., Chintala, S., and LeCun, Y. (2012). Convolutional neural networks applied
to house numbers digit classification. CoRR, abs/1204.3968. 389
Sermanet, P., Kavukcuoglu, K., Chintala, S., and LeCun, Y. (2013). Pedestrian detection
with unsupervised multi-stage feature learning. In Proc. International Conference on
Computer Vision and Pattern Recognition (CVPR’13). IEEE. 22, 190
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical
Journal, 27(3), 379–423. 57
Shannon, C. E. (1949). Communication in the presence of noise. Proceedings of the
Institute of Radio Engineers, 37(1), 10–21. 57
Shilov, G. (1977). Linear Algebra. Dover Books on Mathematics Series. Dover Publica-
tions. 28
Siegelmann, H. (1995). Computation beyond the Turing limit. Science, 268(5210), 545–
548. 311
Siegelmann, H. and Sontag, E. (1991). Turing computability with neural nets. Applied
Mathematics Letters, 4(6), 77–80. 311
Siegelmann, H. T. and Sontag, E. D. (1995). On the computational power of neural nets.
Journal of Computer and Systems Sciences, 50(1), 132–150. 249
Simard, P. Y., Steinkraus, D., and Platt, J. C. (2003). Best practices for convolutional
neural networks. In ICDAR’2003 . 306
Simard, P. and Graf, H. P. (1994). Backpropagation without multiplication. In Advances
in Neural Information Processing Systems, pages 232–239. 383
Simard, P., Victorri, B., LeCun, Y., and Denker, J. (1992). Tangent prop - A formalism
for specifying selected invariances in an adaptive network. In NIPS’1991 . 210, 515,
516
Simard, P. Y., LeCun, Y., and Denker, J. (1993). Efficient pattern recognition using a
new transformation distance. In NIPS’92 . 514
Simard, P. Y., LeCun, Y. A., Denker, J. S., and Victorri, B. (1998). Transformation
invariance in pattern recognition — tangent distance and tangent propagation. Lecture
Notes in Computer Science, 1524. 514
Sjöberg, J. and Ljung, L. (1995). Overtraining, regularization and searching for a min-
imum, with application to neural networks. International Journal of Control, 62(6),
1391–1407. 221
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of
harmony theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed
Processing, volume 1, chapter 6, pages 194–281. MIT Press, Cambridge. 424, 437
Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of
machine learning algorithms. In NIPS’2012 . 372
Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011a).
Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In
NIPS’2011 . 326, 328
Socher, R., Manning, C., and Ng, A. Y. (2011b). Parsing natural scenes and natural
language with recursive neural networks. In Proceedings of the Twenty-Eighth Inter-
national Conference on Machine Learning (ICML’2011). 326
Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D. (2011c).
Semi-supervised recursive autoencoders for predicting sentiment distributions. In
EMNLP’2011 . 326
Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C.
(2013). Recursive deep models for semantic compositionality over a sentiment treebank.
In EMNLP’2013 . 326, 328
Solla, S. A., Levin, E., and Fleisher, M. (1988). Accelerated learning in layered neural
networks. Complex Systems, 2, 625–639. 168
Sontag, E. D. and Sussman, H. J. (1989). Backpropagation can give rise to spurious local
minima even for networks without hidden layers. Complex Systems, 3, 91–106. 242
Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturba-
tion gradient approximation. IEEE Transactions on Automatic Control, 37, 332–341.
184
Spitkovsky, V. I., Alshawi, H., and Jurafsky, D. (2010). From baby steps to leapfrog:
how ”less is more” in unsupervised dependency parsing. In HLT’10 . 273
Srivastava, N. and Salakhutdinov, R. (2012). Multimodal learning with deep Boltzmann
machines. In NIPS’2012 . 485
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).
Dropout: A simple way to prevent neural networks from overfitting. Journal of Ma-
chine Learning Research, 15, 1929–1958. 229, 231, 576
Steinkraus, D., Simard, P. Y., and Buck, I. (2005). Using GPUs for machine learning algo-
rithms. In International Conference on Document Analysis and Recognition (ICDAR
2005), pages 1115–1119. 378
Stewart, L., He, X., and Zemel, R. S. (2007). Learning flexible features for conditional
random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence,
30(8), 1415–1426. 349
Supancic, J. and Ramanan, D. (2013). Self-paced learning for long-term tracking. In
CVPR’2013 . 273
Sussillo, D. (2014). Random walks: Training very deep nonlinear feed-forward networks
with smart initialization. CoRR, abs/1412.6558. 266, 268
Sutskever, I. (2012). Training Recurrent Neural Networks. Ph.D. thesis, Department of
computer science, University of Toronto. 335, 336, 344
Sutskever, I. and Tieleman, T. (2010). On the Convergence Properties of Contrastive
Divergence. In Y. W. Teh and M. Titterington, editors, Proc. of the International
Conference on Artificial Intelligence and Statistics (AISTATS), volume 9, pages 789–
795. 524
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of
initialization and momentum in deep learning. In ICML. 255, 335, 336, 344
Sutskever, I., Vinyals, O., and Le, Q. V. (2014a). Sequence to sequence learning with
neural networks. Technical report, arXiv:1409.3215. 23, 95, 340, 341
Sutskever, I., Vinyals, O., and Le, Q. V. (2014b). Sequence to sequence learning with
neural networks. In NIPS’2014 . 324, 407
Swersky, K. (2010). Inductive Principles for Learning Restricted Boltzmann Machines.
Master’s thesis, University of British Columbia. 466
Swersky, K., Ranzato, M., Buchman, D., Marlin, B., and de Freitas, N. (2011). On
autoencoders and score matching for energy based models. In ICML’2011 . ACM. 532
Swersky, K., Snoek, J., and Adams, R. P. (2014). Freeze-thaw Bayesian optimization.
arXiv preprint arXiv:1406.3896. 373
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Van-
houcke, V., and Rabinovich, A. (2014a). Going deeper with convolutions. Technical
report, arXiv:1409.4842. 21, 22, 24, 190, 235, 270, 288
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and
Fergus, R. (2014b). Intriguing properties of neural networks. ICLR, abs/1312.6199.
234
Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014). Deepface: Closing the gap to
human-level performance in face verification. In CVPR’2014 . 93
Tang, Y. and Eliasmith, C. (2010). Deep networks for robust visual recognition. In
Proceedings of the 27th International Conference on Machine Learning, June 21-24,
2010, Haifa, Israel. 211
Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines
for modeling motion style. In L. Bottou and M. Littman, editors, ICML 2009 , pages
1025–1032. ACM. 408
Taylor, G., Hinton, G. E., and Roweis, S. (2007). Modeling human motion using binary
latent variables. In NIPS’06 , pages 1345–1352. MIT Press, Cambridge, MA. 408, 409
Tenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global geometric framework
for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323. 152, 153, 476,
477, 504
Thrun, S. (1995). Learning to play the game of chess. In NIPS’1994 . 516
Tibshirani, R. J. (1995). Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society B, 58, 267–288. 205
Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to
the likelihood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors,
ICML 2008 , pages 1064–1071. ACM. 526, 563
Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal components analysis.
Journal of the Royal Statistical Society B, 61(3), 611–622. 454
Schaul, T., Antonoglou, I., and Silver, D. (2014). Unit tests for stochastic optimization. In
International Conference on Learning Representations. 260
Torabi, A., Pal, C., Larochelle, H., and Courville, A. (2015). Using descriptive video
services to create a large data source for video annotation research. arXiv preprint
arXiv: 1503.01070. 142
Tu, K. and Honavar, V. (2011). On the utility of curricula in unsupervised learning of
probabilistic grammars. In IJCAI’2011 . 273
Uria, B., Murray, I., and Larochelle, H. (2013). RNADE: The real-valued neural autore-
gressive density-estimator. In NIPS’2013. 331, 333
van der Maaten, L. and Hinton, G. E. (2008a). Visualizing data using t-SNE. J. Machine
Learning Res., 9. 395, 476, 504, 508
van der Maaten, L. and Hinton, G. E. (2008b). Visualizing data using t-SNE. Journal of
Machine Learning Research, 9, 2579–2605. 477
Vanhoucke, V., Senior, A., and Mao, M. Z. (2011). Improving the speed of neural networks
on cpus. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop.
377, 384
Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer-
Verlag, Berlin. 106
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York.
106
Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative
frequencies of events to their probabilities. Theory of Probability and Its Applications,
16, 264–280. 106
Vincent, P. (2011a). A connection between score matching and denoising autoencoders.
Neural Computation, 23(7). 465, 466, 468, 584
Vincent, P. (2011b). A connection between score matching and denoising autoencoders.
Neural Computation, 23(7), 1661–1674. 532, 585
Vincent, P. and Bengio, Y. (2003). Manifold Parzen windows. In NIPS’2002 . MIT Press.
506
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and
composing robust features with denoising autoencoders. In ICML 2008 . 211, 463
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked
denoising autoencoders: Learning useful representations in a deep network with a local
denoising criterion. J. Machine Learning Res., 11. 463
Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., and Hinton, G. (2014a). Gram-
mar as a foreign language. Technical report, arXiv:1412.7449. 340
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2014b). Show and tell: a neural
image caption generator. arXiv 1411.4555. 340
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and tell: a neural image
caption generator. In CVPR’2015 . arXiv:1411.4555. 95
Viola, P. and Jones, M. (2001). Robust real-time object detection. In International
Journal of Computer Vision. 382
Von Melchner, L., Pallas, S. L., and Sur, M. (2000). Visual behaviour mediated by retinal
projections directed to the auditory pathway. Nature, 404(6780), 871–876. 14
Wager, S., Wang, S., and Liang, P. (2013). Dropout training as adaptive regularization.
In Advances in Neural Information Processing Systems 26 , pages 351–359. 232
Waibel, A., Hanazawa, T., Hinton, G. E., Shikano, K., and Lang, K. (1989). Phoneme
recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech,
and Signal Processing, 37, 328–339. 309, 385, 391
Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of
neural networks using dropconnect. In ICML’2013. 232
Wang, S. and Manning, C. (2013). Fast dropout training. In ICML’2013. 232
Warde-Farley, D., Goodfellow, I. J., Lamblin, P., Desjardins, G., Bastien, F., and Bengio,
Y. (2011). pylearn2. http://deeplearning.net/software/pylearn2. 379
Warde-Farley, D., Goodfellow, I. J., Courville, A., and Bengio, Y. (2014). An empirical
analysis of dropout in piecewise linear networks. In ICLR’2014 . 232
Wawrzynek, J., Asanovic, K., Kingsbury, B., Johnson, D., Beck, J., and Morgan, N.
(1996). Spert-II: A vector microprocessor system. Computer, 29(3), 79–86. 383
Weinberger, K. Q. and Saul, L. K. (2004). Unsupervised learning of image manifolds by
semidefinite programming. In CVPR’2004 , pages 988–995. 152, 504
Werbos, P. J. (1981). Applications of advances in nonlinear sensitivity analysis. In
Proceedings of the 10th IFIP Conference, 31.8 - 4.9, NYC , pages 762–770. 195
Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised em-
bedding. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008 , pages
1168–1175, New York, NY, USA. ACM. 486
Weston, J., Bengio, S., and Usunier, N. (2010). Large scale image annotation: learning
to rank with joint word-image embeddings. Machine Learning, 81(1), 21–35. 328
Weston, J., Chopra, S., and Bordes, A. (2014). Memory networks. arXiv preprint
arXiv:1410.3916 . 343
Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. In 1960 IRE WESCON
Convention Record, volume 4, pages 96–104. IRE, New York. 13, 19, 21, 24
Wikipedia (2015). List of animals by number of neurons. Wikipedia, the free encyclo-
pedia. [Online; accessed 4-March-2015]. 21, 24
Williams, C. K. I. and Rasmussen, C. E. (1996). Gaussian processes for regression. In
NIPS’95 , pages 514–520. MIT Press, Cambridge, MA. 191
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Machine Learning, 8, 229–256. 188, 343
Wolpert, D. and MacReady, W. (1997). No free lunch theorems for optimization. IEEE
Transactions on Evolutionary Computation, 1, 67–82. 249
Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms.
Neural Computation, 8(7), 1341–1390. 110
Wu, R., Yan, S., Shan, Y., Dang, Q., and Sun, G. (2015). Deep image: Scaling up image
recognition. arXiv:1501.02876. 22, 380
Wu, Z. (1997). Global continuation for distance geometry problems. SIAM Journal of
Optimization, 7, 814–836. 271
Xiong, H. Y., Barash, Y., and Frey, B. J. (2011). Bayesian prediction of tissue-regulated
splicing using RNA sequence and cellular context. Bioinformatics, 27(18), 2554–2562.
231
Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. S.,
and Bengio, Y. (2015a). Show, attend and tell: Neural image caption generation with
visual attention. In ICML’2015 . 95
Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. S.,
and Bengio, Y. (2015b). Show, attend and tell: Neural image caption generation with
visual attention. arXiv:1502.03044. 340
Xu, L. and Jordan, M. I. (1996). On convergence properties of the EM algorithm for
Gaussian mixtures. Neural Computation, 8, 129–151. 355
Younes, L. (1998). On the convergence of Markovian stochastic algorithms with rapidly
decreasing ergodicity rates. In Stochastics and Stochastics Models, pages 177–228. 524,
563
Zaremba, W. and Sutskever, I. (2014). Learning to execute. arXiv 1410.4615. 273
Zaremba, W. and Sutskever, I. (2015). Reinforcement learning neural turing machines.
arXiv preprint arXiv:1505.00521. 343
Zaslavsky, T. (1975). Facing Up to Arrangements: Face-Count Formulas for Partitions
of Space by Hyperplanes. Number no. 154 in Memoirs of the American Mathematical
Society. American Mathematical Society. 494
Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional net-
works. In ECCV’14 . 6
Zhou, J. and Troyanskaya, O. G. (2014). Deep supervised and convolutional generative
stochastic network for protein secondary structure prediction. In ICML’2014 . 590, 591
Zöhrer, M. and Pernkopf, F. (2014). General stochastic networks for classification. In
NIPS’2014. 590