Bibliography
Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985). A learning algorithm for
Boltzmann machines. Cognitive Science, 9, 147–169. 513
Alain, G. and Bengio, Y. (2012). What regularized auto-encoders learn from the data gen-
erating distribution. Technical Report arXiv:1211.4246, Université de Montréal.
426
Alain, G. and Bengio, Y. (2013). What regularized auto-encoders learn from the data
generating distribution. In ICLR’2013. Also arXiv report 1211.4246. 408, 426, 428
Alain, G., Bengio, Y., Yao, L., Thibodeau-Laufer, É., Yosinski, J., and Vincent, P.
(2015). GSNs: Generative stochastic networks. arXiv:1503.05571. 411
Amari, S. (1997). Neural learning in structured parameter spaces - natural Riemannian
gradient. In Advances in Neural Information Processing Systems, pages 127–133. MIT
Press. 166
Anderson, E. (1935). The Irises of the Gaspé Peninsula. Bulletin of the American Iris
Society, 59, 2–5. 19
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly
learning to align and translate. Technical report, arXiv:1409.0473. 22, 91, 359, 368,
369
Bahl, L. R., Brown, P., de Souza, P. V., and Mercer, R. L. (1987). Speech recognition
with continuous-parameter hidden Markov models. Computer Speech and Language,
2, 219–234. 62, 325
Baldi, P. and Brunak, S. (1998). Bioinformatics: The Machine Learning Approach. MIT
Press. 328
Baldi, P. and Sadowski, P. J. (2013). Understanding dropout. In Advances in Neural
Information Processing Systems 26 , pages 2814–2822. 221
Baldi, P., Brunak, S., Frasconi, P., Soda, G., and Pollastri, G. (1999). Exploiting the
past and the future in protein secondary structure prediction. Bioinformatics, 15(11),
937–946. 296
Baldi, P., Sadowski, P., and Whiteson, D. (2014). Searching for exotic particles in high-
energy physics with deep learning. Nature communications, 5. 22
Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal
function. IEEE Trans. on Information Theory, 39, 930–945. 181
Bartholomew, D. J. (1987). Latent variable models and factor analysis. Oxford University
Press. 413
Basilevsky, A. (1994). Statistical Factor Analysis and Related Methods: Theory and
Applications. Wiley. 413
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A.,
Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements.
Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop. 75, 178, 342
Basu, S. and Christensen, J. (2013). Teaching classification boundaries to humans. In
AAAI’2013 . 247
Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of
finite state Markov chains. Ann. Math. Stat., 37, 1559–1563. 323
Baxter, J. (1995). Learning internal representations. In Proceedings of the 8th Inter-
national Conference on Computational Learning Theory (COLT’95), pages 311–320,
Santa Cruz, California. ACM Press. 222
Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2015). Automatic
differentiation in machine learning: a survey. arXiv preprint arXiv:1502.05767 . 176
Becker, S. and Hinton, G. (1992). A self-organizing neural network that discovers surfaces
in random-dot stereograms. Nature, 355, 161–163. 460
Belkin, M. and Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for em-
bedding and clustering. In NIPS’01, Cambridge, MA. MIT Press. 446
Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and
data representation. Neural Computation, 15(6), 1373–1396. 145, 464
Bengio, S. and Bengio, Y. (2000a). Taking on the curse of dimensionality in joint distri-
butions using neural networks. IEEE Transactions on Neural Networks, special issue
on Data Mining and Knowledge Discovery, 11(3), 550–557. 302
Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015). Scheduled sampling for se-
quence prediction with recurrent neural networks. Technical report, arXiv:1506.03099.
287
Bengio, Y. (1991). Artificial Neural Networks and their Application to Sequence Recog-
nition. Ph.D. thesis, McGill University, (Computer Science), Montreal, Canada. 307,
328
Bengio, Y. (1993). A connectionist approach to speech recognition. International Journal
on Pattern Recognition and Artificial Intelligence, 7(4), 647–668. 325
Bengio, Y. (1999a). Markovian models for sequential data. Neural Computing Surveys,
2, 129–162. 325
Bengio, Y. (1999b). Markovian models for sequential data. Neural Computing Surveys,
2, 129–162. 328
Bengio, Y. (2002). New distributed probabilistic language models. Technical Report
1215, Dept. IRO, Université de Montréal. 361
Bengio, Y. (2009). Learning deep architectures for AI . Now Publishers. 139, 182, 184
Bengio, Y. (2013). Estimating or propagating gradients through stochastic neurons.
Technical Report arXiv:1305.2982, Universite de Montreal. 395
Bengio, Y. and Bengio, S. (2000b). Modeling high-dimensional discrete data with multi-
layer neural networks. In NIPS’99 , pages 400–406. MIT Press. 300, 302, 303, 304
Bengio, Y. and Delalleau, O. (2009). Justifying and generalizing contrastive divergence.
Neural Computation, 21(6), 1601–1621. 426, 484, 522
Bengio, Y. and Frasconi, P. (1996). Input/Output HMMs for sequence processing. IEEE
Transactions on Neural Networks, 7(5), 1231–1249. 328
Bengio, Y. and Grandvalet, Y. (2004). No unbiased estimator of the variance of k-fold
cross-validation. In NIPS’03, Cambridge, MA. MIT Press, Cambridge. 109
Bengio, Y. and LeCun, Y. (2007a). Scaling learning algorithms towards AI. In L. Bottou,
O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT
Press. 17, 185
Bengio, Y. and LeCun, Y. (2007b). Scaling learning algorithms towards AI. In Large
Scale Kernel Machines. 139
Bengio, Y. and Monperrus, M. (2005). Non-local manifold tangent learning. In NIPS’04 ,
pages 129–136. MIT Press. 143, 466, 467
Bengio, Y. and Senécal, J.-S. (2003). Quick training of probabilistic neural nets by
importance sampling. In Proceedings of AISTATS 2003 . 364
Bengio, Y. and Senécal, J.-S. (2008). Adaptive importance sampling to accelerate training
of a neural probabilistic language model. IEEE Trans. Neural Networks, 19(4), 713–
722. 364
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1991). Phonetically motivated
acoustic parameters for continuous speech recognition using artificial neural networks.
In Proceedings of EuroSpeech’91 . 23, 352
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992a). Global optimization of a
neural network-hidden Markov model hybrid. IEEE Transactions on Neural Networks,
3(2), 252–259. 325, 328
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992b). Neural network - Gaussian
mixture hybrid for speech recognition or density estimation. In NIPS 4, pages 175–182.
Morgan Kaufmann. 352
Bengio, Y., Frasconi, P., and Simard, P. (1993). The problem of learning long-term depen-
dencies in recurrent networks. In IEEE International Conference on Neural Networks,
pages 1183–1195, San Francisco. IEEE Press. (invited paper). 234, 313
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with
gradient descent is difficult. IEEE Tr. Neural Nets. 234, 235, 236, 305, 312, 313
Bengio, Y., LeCun, Y., Nohl, C., and Burges, C. (1995). Lerec: A NN/HMM hybrid for
on-line handwriting recognition. Neural Computation, 7(6), 1289–1303. 328
Bengio, Y., Ducharme, R., and Vincent, P. (2001a). A neural probabilistic language
model. In NIPS’00, pages 932–938. MIT Press. 16, 343
Bengio, Y., Ducharme, R., and Vincent, P. (2001b). A neural probabilistic language
model. In NIPS’2000, pages 932–938. 355, 356, 357, 366
Bengio, Y., Ducharme, R., and Vincent, P. (2001c). A neural probabilistic language
model. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, NIPS’2000 , pages
932–938. MIT Press. 468, 469
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003a). A neural probabilistic
language model. JMLR, 3, 1137–1155. 356, 360, 366
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003b). A neural probabilistic
language model. Journal of Machine Learning Research, 3, 1137–1155. 468, 469
Bengio, Y., Delalleau, O., and Le Roux, N. (2006a). The curse of highly variable functions
for local kernel machines. In NIPS’2005 . 139
Bengio, Y., Larochelle, H., and Vincent, P. (2006b). Non-local manifold Parzen windows.
In NIPS’2005 . MIT Press. 143, 466
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007a). Greedy layer-wise
training of deep networks. In NIPS’2006 . 12, 16, 184, 432, 433
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007b). Greedy layer-wise
training of deep networks. In NIPS 19 , pages 153–160. MIT Press. 182
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In
ICML’09 . 167, 246
Bengio, Y., Léonard, N., and Courville, A. (2013a). Estimating or propagating gradients
through stochastic neurons for conditional computation. arXiv:1308.3432. 180, 366,
395
Bengio, Y., Yao, L., Alain, G., and Vincent, P. (2013b). Generalized denoising auto-
encoders as generative models. In NIPS’2013. 428, 544, 548
Bengio, Y., Courville, A., and Vincent, P. (2013c). Representation learning: A review and
new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI),
35(8), 1798–1828. 458, 542
Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014a). Deep generative
stochastic networks trainable by backprop. Technical Report arXiv:1306.1091. 395
Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014b). Deep generative
stochastic networks trainable by backprop. In ICML’2014 . 395, 545, 546, 547, 549,
550
Bennett, C. (1976). Efficient estimation of free energy differences from Monte Carlo data.
Journal of Computational Physics, 22(2), 245–268. 500
Berger, A. L., Della Pietra, V. J., and Della Pietra, S. A. (1996). A maximum entropy
approach to natural language processing. Computational Linguistics, 22, 39–71. 367
Berglund, M. and Raiko, T. (2013). Stochastic gradient estimate variance in contrastive
divergence and persistent contrastive divergence. CoRR, abs/1312.6002. 486
Bergstra, J. (2011). Incorporating Complex Cells into Neural Networks for Pattern Clas-
sification. Ph.D. thesis, Université de Montréal. 407
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian,
J., Warde-Farley, D., and Bengio, Y. (2010a). Theano: a CPU and GPU math ex-
pression compiler. In Proceedings of the Python for Scientific Computing Conference
(SciPy). Oral Presentation. 75, 342
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian,
J., Warde-Farley, D., and Bengio, Y. (2010b). Theano: a CPU and GPU math expres-
sion compiler. In Proc. SciPy. 178
Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24(3), 179–195.
488
Bishop, C. M. (1994). Mixture density networks. Technical report, Aston University. 162
Bishop, C. M. (1995a). Regularization and complexity control in feed-forward networks.
In Proceedings International Conference on Artificial Neural Networks ICANN’95 , vol-
ume 1, page 141–148. 205, 213
Bishop, C. M. (1995b). Training with noise is equivalent to Tikhonov regularization.
Neural Computation, 7(1), 108–116. 205
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. 89, 138, 140
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1989). Learnability
and the Vapnik–Chervonenkis dimension. Journal of the ACM, 36(4), 929–965. 102,
103
Bonnet, G. (1964). Transformations des signaux aléatoires à travers les systèmes non
linéaires sans mémoire. Annales des Télécommunications, 19(9–10), 203–220. 180
Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2012). Joint learning of words and
meaning representations for open-text semantic parsing. AISTATS’2012 . 299
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal
margin classifiers. In COLT ’92: Proceedings of the fifth annual workshop on Com-
putational learning theory, pages 144–152, New York, NY, USA. ACM. 16, 129, 139,
155
Bottou, L. (1991). Une approche théorique de l’apprentissage connexioniste; applications
à la reconnaissance de la parole. Ph.D. thesis, Université de Paris XI. 328
Bottou, L. (2011). From machine learning to machine reasoning. Technical report,
arXiv:1102.1808. 299
Bottou, L., Fogelman-Soulié, F., Blanchet, P., and Liénard, J. S. (1990). Speaker inde-
pendent isolated digit recognition: multilayer perceptrons vs dynamic time warping.
Neural Networks, 3, 453–465. 328
Bottou, L., Bengio, Y., and LeCun, Y. (1997). Global training of document processing
systems using graph transformer networks. In Proceedings of the Computer Vision and
Pattern Recognition Conference (CVPR’97), pages 490–494, Puerto Rico. IEEE. 318,
326, 327, 328, 329, 330, 331
Boureau, Y., Ponce, J., and LeCun, Y. (2010). A theoretical analysis of feature pooling in
vision algorithms. In Proc. International Conference on Machine learning (ICML’10).
261
Boureau, Y., Le Roux, N., Bach, F., Ponce, J., and LeCun, Y. (2011). Ask the locals:
multi-way local pooling for image recognition. In Proc. International Conference on
Computer Vision (ICCV’11). IEEE. 261
Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer perceptrons and
singular value decomposition. Biological Cybernetics, 59, 291–294. 404
Bourlard, H. and Morgan, N. (1993). Connectionist Speech Recognition. A Hybrid Ap-
proach, volume 247 of The Kluwer international series in engineering and computer
science. Kluwer Academic Publishers, Boston. 328
Bourlard, H. and Wellekens, C. (1989). Speech pattern discrimination and multi-layered
perceptrons. Computer Speech and Language, 3, 1–19. 352
Bourlard, H. and Wellekens, C. (1990). Links between hidden Markov models and multi-
layer perceptrons. IEEE Transactions on Pattern Analysis and Machine Intelligence,
12, 1167–1178. 328
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University
Press, New York, NY, USA. 85
Brady, M. L., Raghavan, R., and Slawny, J. (1989). Back-propagation fails to separate
where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36, 665–674.
229
Brand, M. (2003). Charting a manifold. In NIPS’2002 , pages 961–968. MIT Press. 145,
464
Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140. 215
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and
Regression Trees. Wadsworth International Group, Belmont, CA. 140
Bridle, J. S. (1990). Alphanets: a recurrent ‘neural’ network architecture with a hidden
Markov model interpretation. Speech Communication, 9(1), 83–92. 158
Brown, P. (1987). The Acoustic-Modeling problem in Automatic Speech Recognition.
Ph.D. thesis, Dept. of Computer Science, Carnegie-Mellon University. 325
Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D.,
Mercer, R. L., and Roossin, P. S. (1990). A statistical approach to machine translation.
Computational linguistics, 16(2), 79–85. 19
Brown, P. F., Pietra, V. J. D., DeSouza, P. V., Lai, J. C., and Mercer, R. L. (1992).
Class-based n-gram models of natural language. Computational Linguistics, 18, 467–
479. 356
Bryson, A. and Ho, Y. (1969). Applied optimal control: optimization, estimation, and
control. Blaisdell Pub. Co. 188
Bryson, Jr., A. E. and Denham, W. F. (1961). A steepest-ascent method for solving
optimum programming problems. Technical Report BR-1303, Raytheon Company,
Missile and Space Division. 188
Buchberger, B., Collins, G. E., Loos, R., and Albrecht, R. (1983). Computer Algebra.
Springer-Verlag. 178
Buciluǎ, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model compression. In Pro-
ceedings of the 12th ACM SIGKDD international conference on Knowledge discovery
and data mining, pages 535–541. ACM. 343
Cai, M., Shi, Y., and Liu, J. (2013). Deep maxout neural networks for speech recognition.
In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop
on, pages 291–296. IEEE. 187
Carreira-Perpiñán, M. A. and Hinton, G. E. (2005). On contrastive divergence learning.
In R. G. Cowell and Z. Ghahramani, editors, AISTATS’2005, pages 33–40. Society for
Artificial Intelligence and Statistics. 484, 522
Caruana, R. (1993). Multitask connectionist learning. In Proc. 1993 Connectionist Models
Summer School, pages 372–379. 222
Cauchy, A. (1847a). Méthode générale pour la résolution de systèmes d’équations si-
multanées. In Compte rendu des séances de l’académie des sciences, pages 536–538.
77
Cauchy, L. A. (1847b). Méthode générale pour la résolution des systèmes d’équations
simultanées. Compte Rendu à l’Académie des Sciences. 188
Cayton, L. (2005). Algorithms for manifold learning. Technical Report CS2008-0923,
UCSD. 145, 461
Chapelle, O., Weston, J., and Schölkopf, B. (2003). Cluster kernels for semi-supervised
learning. In NIPS’02 , pages 585–592, Cambridge, MA. MIT Press. 446
Chapelle, O., Sch¨olkopf, B., and Zien, A., editors (2006). Semi-Supervised Learning. MIT
Press, Cambridge, MA. 446
Chellapilla, K., Puri, S., and Simard, P. (2006). High Performance Convolutional Neu-
ral Networks for Document Processing. In Guy Lorette, editor, Tenth International
Workshop on Frontiers in Handwriting Recognition, La Baule (France). Université de
Rennes 1, Suvisoft. http://www.suvisoft.com. 20, 23, 341
Chen, S. F. and Goodman, J. T. (1999). An empirical study of smoothing techniques for
language modeling. Computer Speech and Language, 13(4), 359–393. 317, 318, 367
Chilimbi, T., Suzue, Y., Apacible, J., and Kalyanaraman, K. (2014). Project Adam:
Building an efficient and scalable deep learning training system. In OSDI’14. 343
Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio,
Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical
machine translation. In Proceedings of the Empiricial Methods in Natural Language
Processing (EMNLP 2014). 312, 368
Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2014). The
loss surface of multilayer networks. 229, 435
Chorowski, J., Bahdanau, D., Cho, K., and Bengio, Y. (2014). End-to-end continuous
speech recognition using attention-based recurrent nn: First results. arXiv:1412.1602.
353
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated
recurrent neural networks on sequence modeling. NIPS’2014 Deep Learning workshop,
arXiv 1412.3555. 353
Ciresan, D., Meier, U., Masci, J., and Schmidhuber, J. (2012). Multi-column deep neural
network for traffic sign classification. Neural Networks, 32, 333–338. 21, 182, 184
Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2010). Deep big
simple neural nets for handwritten digit recognition. Neural Computation, 22, 1–14.
20, 23, 341
Coates, A. and Ng, A. Y. (2011). The importance of encoding versus training with sparse
coding and vector quantization. In ICML’2011 . 23
Coates, A., Lee, H., and Ng, A. Y. (2011). An analysis of single-layer networks in un-
supervised feature learning. In Proceedings of the Thirteenth International Conference
on Artificial Intelligence and Statistics (AISTATS 2011). 347
Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., and Andrew, N. (2013). Deep
learning with COTS HPC systems. In S. Dasgupta and D. McAllester, editors, Proceedings
of the 30th International Conference on Machine Learning (ICML-13), volume 28 (3),
pages 1337–1345. JMLR Workshop and Conference Proceedings. 20, 23, 272, 343
Collobert, R. (2004). Large Scale Machine Learning. Ph.D. thesis, Université de Paris
VI, LIP6. 155
Collobert, R. (2011). Deep learning for efficient discriminative parsing. In AISTATS’2011.
92
Collobert, R. and Weston, J. (2008). A unified architecture for natural language process-
ing: Deep neural networks with multitask learning. In ICML’2008 . 365
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P.
(2011a). Natural language processing (almost) from scratch. Journal of Machine
Learning Research, 12, 2493–2537. 247
Collobert, R., Kavukcuoglu, K., and Farabet, C. (2011b). Torch7: A Matlab-like envi-
ronment for machine learning. In BigLearn, NIPS Workshop. 342
Comon, P. (1994). Independent component analysis - a new concept? Signal Processing,
36, 287–314. 414, 415
Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning, 20,
273–297. 16, 129, 139
Couprie, C., Farabet, C., Najman, L., and LeCun, Y. (2013). Indoor semantic segmenta-
tion using depth information. In International Conference on Learning Representations
(ICLR2013). 21, 182, 184
Courville, A., Bergstra, J., and Bengio, Y. (2011). Unsupervised models of images by
spike-and-slab RBMs. In ICML’11 . 375, 539
Courville, A., Desjardins, G., Bergstra, J., and Bengio, Y. (2014). The spike-and-slab
RBM and extensions to discrete and sparse data distributions. Pattern Analysis and
Machine Intelligence, IEEE Transactions on, 36(9), 1874–1887. 540
Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd Edition.
Wiley-Interscience. 54
Cox, D. and Pinto, N. (2011). Beyond simple features: A large-scale feature search
approach to unconstrained face recognition. In Automatic Face & Gesture Recognition
and Workshops (FG 2011), 2011 IEEE International Conference on, pages 8–15. IEEE.
272
Cox, R. T. (1946). Probability, frequency and reasonable expectation. American Journal
of Physics, 14, 1–10. 47
Cramér, H. (1946). Mathematical methods of statistics. Princeton University Press. 121
Crick, F. H. C. and Mitchison, G. (1983). The function of dream sleep. Nature, 304,
111–114. 482
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathe-
matics of Control, Signals, and Systems, 2, 303–314. 180, 455
Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition
with the mean-covariance restricted Boltzmann machine. In NIPS’2010 . 21
Dahl, G. E., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained
deep neural networks for large vocabulary speech recognition. IEEE Transactions on
Audio, Speech, and Language Processing, 20(1), 33–42. 352
Dahl, G. E., Jaitly, N., and Salakhutdinov, R. (2014). Multi-task neural networks for
QSAR predictions. arXiv:1406.1231. 22
Dauphin, Y. and Bengio, Y. (2013). Stochastic ratio matching of RBMs for sparse high-
dimensional inputs. In NIPS26 . NIPS Foundation. 492
Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Large-scale learning of embeddings with
reconstruction sampling. In ICML’2011 . 365
Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014).
Identifying and attacking the saddle point problem in high-dimensional non-convex
optimization. In NIPS’2014 . 79, 229, 435
Davis, A., Rubinstein, M., Wadhwa, N., Mysore, G., Durand, F., and Freeman, W. T.
(2014). The visual microphone: Passive recovery of sound from video. ACM Transac-
tions on Graphics (Proc. SIGGRAPH), 33(4), 79:1–79:10. 345
de Finetti, B. (1937). La prévision: ses lois logiques, ses sources subjectives. Annales de
l’institut Henri Poincaré, 7, 1–68. 47
Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q., Mao, M., Ranzato, M.,
Senior, A., Tucker, P., Yang, K., and Ng, A. Y. (2012). Large scale distributed deep
networks. In NIPS’2012 . 343
Delalleau, O. and Bengio, Y. (2011). Shallow vs. deep sum-product networks. In NIPS.
17, 182, 455, 456
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A
Large-Scale Hierarchical Image Database. In CVPR09 . 19, 134
Deng, J., Berg, A. C., Li, K., and Fei-Fei, L. (2010a). What does classifying more than
10,000 image categories tell us? In Proceedings of the 11th European Conference on
Computer Vision: Part V , ECCV’10, pages 71–84, Berlin, Heidelberg. Springer-Verlag.
19
Deng, J., Ding, N., Jia, Y., Frome, A., Murphy, K., Bengio, S., Li, Y., Neven, H., and
Adam, H. (2014). Large-scale object classification using label relation graphs. In
ECCV’2014 , pages 48–64. 318
Deng, L. and Yu, D. (2014). Deep learning – methods and applications. Foundations and
Trends in Signal Processing. 353
Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton, G. (2010b). Bi-
nary coding of speech spectrograms using a deep auto-encoder. In Interspeech 2010 ,
Makuhari, Chiba, Japan. 21
Desjardins, G. and Bengio, Y. (2008). Empirical evaluation of convolutional RBMs
for vision. Technical Report 1327, Département d’Informatique et de Recherche
Opérationnelle, Université de Montréal. 540
Desjardins, G., Courville, A., and Bengio, Y. (2011). On tracking the partition function.
In NIPS’2011 . 500
Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., and Makhoul, J. (2014). Fast
and robust neural network joint models for statistical machine translation. In Proc.
ACL’2014 . 368
DiCarlo, J. J. (2013). Mechanisms underlying visual object recognition: Humans vs.
neurons vs. machines. NIPS Tutorial. 22, 275
Do, T.-M.-T. and Arti`eres, T. (2010). Neural conditional random fields. In International
Conference on Artificial Intelligence and Statistics, pages 177–184. 318
Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko,
K., and Darrell, T. (2014). Long-term recurrent convolutional networks for visual
recognition and description. arXiv:1411.4389. 92
Donoho, D. L. and Grimes, C. (2003). Hessian eigenmaps: new locally linear embed-
ding techniques for high-dimensional data. Technical Report 2003-08, Dept. Statistics,
Stanford University. 145, 464
Doob, J. (1953). Stochastic processes. Wiley: New York. 47
Doya, K. (1993). Bifurcations of recurrent neural networks in gradient descent learning.
IEEE Transactions on Neural Networks, 1, 75–80. 236, 305
Dreyfus, S. E. (1962). The numerical solution of variational problems. Journal of Math-
ematical Analysis and Applications, 5(1), 30–45. 188
Dreyfus, S. E. (1973). The computational solution of optimal control problems with time
lag. IEEE Transactions on Automatic Control, 18(4), 383–385. 188
Dugas, C., Bengio, Y., Bélisle, F., and Nadeau, C. (2001). Incorporating second-order
functional knowledge for better option pricing. In NIPS’00 , pages 472–478. MIT Press.
62, 155
El Hihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term
dependencies. In NIPS 8 . MIT Press. 297, 316, 317
El Hihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term
dependencies. In NIPS’1995 . 308
Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., and Vincent, P. (2009). The difficulty
of training deep architectures and the effect of unsupervised pre-training. In Proceedings
of AISTATS’2009. 182, 184
Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., and Bengio, S. (2010).
Why does unsupervised pre-training help deep learning? J. Machine Learning Res.
433, 435, 436, 437
Fang, H., Gupta, S., Iandola, F., Srivastava, R., Deng, L., Dollár, P., Gao, J., He, X.,
Mitchell, M., Platt, J. C., Zitnick, C. L., and Zweig, G. (2015). From captions to visual
concepts and back. arXiv:1411.4952. 92
Farabet, C., LeCun, Y., Kavukcuoglu, K., Culurciello, E., Martini, B., Akselrod, P.,
and Talay, S. (2011). Large-scale FPGA-based convolutional networks. In R. Bekker-
man, M. Bilenko, and J. Langford, editors, Scaling up Machine Learning: Parallel and
Distributed Approaches. Cambridge University Press. 422
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013a). Learning hierarchical
features for scene labeling. IEEE Transactions on Pattern Analysis and Machine In-
telligence. 21, 182, 184
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013b). Learning hierarchical
features for scene labeling. IEEE Transactions on Pattern Analysis and Machine In-
telligence, 35(8), 1915–1929. 318
Fei-Fei, L., Fergus, R., and Perona, P. (2006). One-shot learning of object categories.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611.
442
Fischer, A. and Igel, C. (2011). Bounding the bias of contrastive divergence learning.
Neural Computation, 23(3), 664–73. 522
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals
of Eugenics, 7, 179–188. 19, 94
Frasconi, P., Gori, M., and Sperduti, A. (1997). On the efficient classification of data
structures by neural networks. In Proc. Int. Joint Conf. on Artificial Intelligence. 299
Frasconi, P., Gori, M., and Sperduti, A. (1998). A general framework for adaptive pro-
cessing of data structures. IEEE Transactions on Neural Networks, 9(5), 768–786.
299
Frey, B. J. (1998). Graphical models for machine learning and digital communication.
MIT Press. 300, 301
Frey, B. J., Hinton, G. E., and Dayan, P. (1996). Does the wake-sleep algorithm learn
good density estimators? In NIPS’95 , pages 661–670. MIT Press, Cambridge, MA.
300
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mech-
anism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36,
193–202. 14, 20, 23, 276
Garson, J. (1900). The metric system of identification of criminals, as used in Great
Britain and Ireland. The Journal of the Anthropological Institute of Great Britain and
Ireland, (2), 177–227. 19
Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedfor-
ward neural networks. In AISTATS’2010 . 154
Glorot, X., Bordes, A., and Bengio, Y. (2011a). Deep sparse rectifier neural networks. In
AISTATS’2011 . 14, 155, 421
Glorot, X., Bordes, A., and Bengio, Y. (2011b). Deep sparse rectifier neural networks.
In JMLR W&CP: Proceedings of the Fourteenth International Conference on Artificial
Intelligence and Statistics (AISTATS 2011). 186, 421
Glorot, X., Bordes, A., and Bengio, Y. (2011c). Domain adaptation for large-scale senti-
ment classification: A deep learning approach. In ICML’2011 . 421, 441
Gong, S., McKenna, S., and Psarrou, A. (2000). Dynamic Vision: From Images to Face
Recognition. Imperial College Press. 465, 467
Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep
networks. In NIPS’2009 , pages 646–654. 408, 420
Goodfellow, I., Koenig, N., Muja, M., Pantofaru, C., Sorokin, A., and Takayama, L.
(2010). Help me help you: Interfaces for personal robots. In Proc. of Human Robot
Interaction (HRI), Osaka, Japan. ACM Press, ACM Press. 90
Goodfellow, I., Courville, A., and Bengio, Y. (2012). Large-scale feature learning with
spike-and-slab sparse coding. In ICML’2012 . 417
Goodfellow, I. J. (2010). Technical report: Multidimensional, downsampled convolution
for autoencoders. Technical report, Université de Montréal. 267
Goodfellow, I. J., Courville, A., and Bengio, Y. (2011). Spike-and-slab sparse coding
for unsupervised feature discovery. In NIPS Workshop on Challenges in Learning
Hierarchical Models. 182, 184, 441
Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a).
Maxout networks. In S. Dasgupta and D. McAllester, editors, ICML’13 , pages 1319–
1327. 187, 220, 274, 347
Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction
deep Boltzmann machines. In NIPS26 . NIPS Foundation. 91, 490, 536, 538
Goodfellow, I. J., Courville, A., and Bengio, Y. (2013c). Scaling up spike-and-slab models
for unsupervised feature learning. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 35(8), 1902–1914. 540
Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2014a). An em-
pirical investigation of catastrophic forgetting in gradient-based neural networks. In
ICLR’2014 . 187
Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014b). Explaining and harnessing adver-
sarial examples. CoRR, abs/1412.6572. 223, 225
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,
Courville, A., and Bengio, Y. (2014c). Generative adversarial networks. In NIPS’2014.
180
Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., and Shet, V. (2014d). Multi-
digit number recognition from Street View imagery using deep convolutional neural
networks. In International Conference on Learning Representations. 21, 91, 182, 183,
184, 334
Goodman, J. (2001). Classes for fast maximum entropy training. In International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP), Utah. 361
Gori, M. and Tesi, A. (1992). On the problem of local minima in backpropagation. IEEE
Transactions on Pattern Analysis and Machine Intelligence, PAMI-14(1), 76–86. 229
Gosset, W. S. (1908). The probable error of a mean. Biometrika, 6(1), 1–25. Originally
published under the pseudonym “Student”. 19
Gouws, S., Bengio, Y., and Corrado, G. (2014). BilBOWA: Fast bilingual distributed
representations without word alignments. Technical report, arXiv:1410.2455. 444
Graves, A. (2011a). Practical variational inference for neural networks. In J. Shawe-
Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in
Neural Information Processing Systems 24 , pages 2348–2356. Curran Associates, Inc.
204
Graves, A. (2011b). Practical variational inference for neural networks. In NIPS’2011 .
206
Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks. Studies
in Computational Intelligence. Springer. 282, 296, 311, 312, 318, 353
Graves, A. (2013). Generating sequences with recurrent neural networks. Technical
report, arXiv:1308.0850. 163, 311
Graves, A. and Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent
neural networks. In ICML’2014 . 311
Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirec-
tional LSTM and other neural network architectures. Neural Networks, 18(5), 602–610.
296
Graves, A. and Schmidhuber, J. (2009). Offline handwriting recognition with multidi-
mensional recurrent neural networks. In D. Koller, D. Schuurmans, Y. Bengio, and
L. Bottou, editors, NIPS’2008 , pages 545–552. 296
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist tempo-
ral classification: Labelling unsegmented sequence data with recurrent neural networks.
In ICML’2006 , pages 369–376, Pittsburgh, USA. 318, 353
Graves, A., Liwicki, M., Bunke, H., Schmidhuber, J., and Fernández, S. (2008). Uncon-
strained on-line handwriting recognition with recurrent neural networks. In J. Platt,
D. Koller, Y. Singer, and S. Roweis, editors, NIPS’2007 , pages 577–584. 296
Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., and Schmidhuber,
J. (2009). A novel connectionist system for unconstrained handwriting recognition.
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5), 855–868.
311
Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep recur-
rent neural networks. In ICASSP’2013 , pages 6645–6649. 296, 297, 311, 312, 353
Graves, A., Wayne, G., and Danihelka, I. (2014). Neural Turing machines.
arXiv:1410.5401. 22
Gregor, K. and LeCun, Y. (2010). Emergence of complex-like cells in a temporal product
network with local receptive fields. Technical report, arXiv:1006.0448. 266
Gülçehre, Ç. and Bengio, Y. (2013). Knowledge matters: Importance of prior infor-
mation for optimization. In International Conference on Learning Representations
(ICLR’2013). 21, 243
Gutmann, M. and Hyvärinen, A. (2010). Noise-contrastive estimation: A new estima-
tion principle for unnormalized statistical models. In Proceedings of The Thirteenth
International Conference on Artificial Intelligence and Statistics (AISTATS’10). 492
Hadsell, R., Sermanet, P., Ben, J., Erkan, A., Han, J., Muller, U., and LeCun, Y.
(2007). Online learning for offroad robots: Spatial label propagation to learn long-
range traversability. In Proceedings of Robotics: Science and Systems, Atlanta, GA,
USA. 346
Haffner, P., Franzini, M., and Waibel, A. (1991). Integrating time alignment and neural
networks for high performance continuous speech recognition. In International Confer-
ence on Acoustics, Speech and Signal Processing (ICASSP), pages 105–108, Toronto.
328
Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings
of the 18th annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley,
California. ACM Press. 182, 455
Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits.
Computational Complexity, 1, 113–129. 182, 455
Hebb, D. O. (1949). The Organization of Behavior. Wiley, New York. 15
Henaff, M., Jarrett, K., Kavukcuoglu, K., and LeCun, Y. (2011). Unsupervised learning
of sparse features for scalable audio classification. In ISMIR’11 . 422
Herault, J. and Ans, B. (1984). Circuits neuronaux à synapses modifiables: Décodage de
messages composites par apprentissage non supervisé. Comptes Rendus de l’Académie
des Sciences, 299(III-13), 525–528. 414
Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V.,
Nguyen, P., Sainath, T., and Kingsbury, B. (2012a). Deep neural networks for acoustic
modeling in speech recognition. IEEE Signal Processing Magazine, 29(6), 82–97. 21,
353
Hinton, G. E. (2000). Training products of experts by minimizing contrastive divergence.
Technical Report GCNU TR 2000-004, Gatsby Unit, University College London. 483
Hinton, G. E. and Roweis, S. (2003). Stochastic neighbor embedding. In NIPS’2002 . 464
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with
neural networks. Science, 313(5786), 504–507. 410, 432, 433
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the Dimensionality of Data with
Neural Networks. Science, 313, 504–507. 435
Hinton, G. E. and Zemel, R. S. (1994). Autoencoders, minimum description length, and
Helmholtz free energy. In NIPS’1993 . 404
Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief
nets. Neural Computation, 18, 1527–1554. 12, 16, 23, 130, 432, 433, 523
Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A.,
Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012b). Deep neural
networks for acoustic modeling in speech recognition: The shared views of four research
groups. IEEE Signal Process. Mag., 29(6), 82–97. 91
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.
(2012c). Improving neural networks by preventing co-adaptation of feature detectors.
Technical report, arXiv:1207.0580. 201
Hinton, G. E., Vinyals, O., and Dean, J. (2014). Dark knowledge. Invited talk at the
BayLearn Bay Area Machine Learning Symposium. 344
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma
thesis, T.U. München. 234, 305, 313
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computa-
tion, 9(8), 1735–1780. 22, 311, 312
Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2000).
Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In
J. Kolen and S. Kremer, editors, Field Guide to Dynamical Recurrent Networks. IEEE
Press. 312
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are
universal approximators. Neural Networks, 2, 359–366. 180, 455
Hornik, K., Stinchcombe, M., and White, H. (1990). Universal approximation of an
unknown mapping and its derivatives using multilayer feedforward networks. Neural
Networks, 3(5), 551–560. 180
Horst, R., Pardalos, P., and Thoai, N. (2000). Introduction to Global Optimization.
Kluwer Academic Publishers. Second Edition. 245
Hsu, F.-H. (2002). Behind Deep Blue: Building the Computer That Defeated the World
Chess Champion. Princeton University Press, Princeton, NJ, USA. 2
Huang, F. and Ogata, Y. (2002). Generalized pseudo-likelihood estimates for Markov
random fields on lattice. Annals of the Institute of Statistical Mathematics, 54(1),
1–18. 489
Hubel, D. and Wiesel, T. (1968). Receptive fields and functional architecture of monkey
striate cortex. Journal of Physiology (London), 195, 215–243. 273
Hubel, D. H. and Wiesel, T. N. (1959). Receptive fields of single neurons in the cat’s
striate cortex. Journal of Physiology, 148, 574–591. 273
Hubel, D. H. and Wiesel, T. N. (1962). Receptive fields, binocular interaction, and
functional architecture in the cat’s visual cortex. Journal of Physiology (London),
160, 106–154. 273
Hyötyniemi, H. (1996). Turing machines are recurrent neural networks. In STeP’96,
pages 13–24. 284
Hyvärinen, A. (1999). Survey on independent component analysis. Neural Computing
Surveys, 2, 94–128. 414
Hyvärinen, A. (2005a). Estimation of non-normalized statistical models using score
matching. J. Machine Learning Res., 6. 425
Hyvärinen, A. (2005b). Estimation of non-normalized statistical models using score
matching. Journal of Machine Learning Research, 6, 695–709. 490
Hyvärinen, A. (2007a). Connections between score matching, contrastive divergence,
and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural
Networks, 18, 1529–1531. 491
Hyvärinen, A. (2007b). Some extensions of score matching. Computational Statistics and
Data Analysis, 51, 2499–2512. 491
Hyvärinen, A. and Pajunen, P. (1999). Nonlinear independent component analysis: Ex-
istence and uniqueness results. Neural Networks, 12(3), 429–439. 415
Hyvärinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis.
Wiley-Interscience. 414
Hyvärinen, A., Hurri, J., and Hoyer, P. O. (2009). Natural Image Statistics: A proba-
bilistic approach to early computational vision. Springer-Verlag. 279
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training
by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 . 21, 90
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixture
of local experts. Neural Computation, 3, 79–87. 162
Jaeger, H. (2003). Adaptive nonlinear system identification with echo state networks. In
Advances in Neural Information Processing Systems 15 . 306
Jaeger, H. (2007a). Discovering multiscale dynamical features with hierarchical echo state
networks. Technical report, Jacobs University. 297
Jaeger, H. (2007b). Echo state network. Scholarpedia, 2(9), 2330. 305
Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and
saving energy in wireless communication. Science, 304(5667), 78–80. 23, 305
Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., Mooij, J. M., and Schölkopf, B. (2012).
On causal and anticausal learning. In ICML’2012 , pages 1255–1262. 447, 449
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009a). What is the best
multi-stage architecture for object recognition? In ICCV’09. 14, 155, 422
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009b). What is the best
multi-stage architecture for object recognition? In Proc. International Conference on
Computer Vision (ICCV’09), pages 2146–2153. IEEE. 20, 23, 186, 272
Jarzynski, C. (1997). Nonequilibrium equality for free energy differences. Phys. Rev.
Lett., 78, 2690–2693. 499
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University
Press. 46
Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2014). On using very large target
vocabulary for neural machine translation. arXiv:1412.2007. 368
Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters
from sparse data. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in
Practice. North-Holland, Amsterdam. 317, 367
Jia, Y., Huang, C., and Darrell, T. (2012). Beyond spatial pyramids: Receptive field learn-
ing for pooled image features. In Computer Vision and Pattern Recognition (CVPR),
2012 IEEE Conference on, pages 3370–3377. IEEE. 261
Jim, K.-C., Giles, C. L., and Horne, B. G. (1996). An analysis of noise in recurrent neural
networks: convergence and generalization. IEEE Transactions on Neural Networks,
7(6), 1424–1438. 204, 206
Jordan, M. I. (1998). Learning in Graphical Models. Kluwer, Dordrecht, Netherlands. 16
Juang, B. H. and Katagiri, S. (1992). Discriminative learning for minimum error classi-
fication. IEEE Transactions on Signal Processing, 40(12), 3043–3054. 325
Jutten, C. and Herault, J. (1991). Blind separation of sources, part I: an adaptive algo-
rithm based on neuromimetic architecture. Signal Processing, 24, 1–10. 414
Kahou, S. E., Pal, C., Bouthillier, X., Froumenty, P., Gülçehre, Ç., Memisevic, R., Vin-
cent, P., Courville, A., Bengio, Y., Ferrari, R. C., Mirza, M., Jean, S., Carrier, P.-L.,
Dauphin, Y., Boulanger-Lewandowski, N., Aggarwal, A., Zumer, J., Lamblin, P., Ray-
mond, J.-P., Desjardins, G., Pascanu, R., Warde-Farley, D., Torabi, A., Sharma, A.,
Bengio, E., Côté, M., Konda, K. R., and Wu, Z. (2013). Combining modality specific
deep neural networks for emotion recognition in video. In Proceedings of the 15th ACM
on International Conference on Multimodal Interaction. 182, 184
Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In
EMNLP’2013 . 368
Kamyshanska, H. and Memisevic, R. (2015). The potential energy of an autoencoder.
IEEE Transactions on Pattern Analysis and Machine Intelligence. 428
Karpathy, A. and Li, F.-F. (2015). Deep visual-semantic alignments for generating image
descriptions. In CVPR’2015 . arXiv:1412.2306. 92
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014).
Large-scale video classification with convolutional neural networks. In CVPR. 19
Karush, W. (1939). Minima of Functions of Several Variables with Inequalities as Side
Constraints. Master’s thesis, Dept. of Mathematics, Univ. of Chicago. 87
Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model
component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal
Processing, ASSP-35(3), 400–401. 317, 367
Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008a). Fast inference in sparse coding
algorithms with applications to object recognition. CBLL-TR-2008-12-01, NYU. 407
Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008b). Fast inference in sparse coding
algorithms with applications to object recognition. Technical report, Computational
and Biological Learning Lab, Courant Institute, NYU. Tech Report CBLL-TR-2008-
12-01. 422
Kavukcuoglu, K., Ranzato, M.-A., Fergus, R., and LeCun, Y. (2009). Learning invariant
features through topographic filter maps. In CVPR’2009. 422
Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y.
(2010a). Learning convolutional feature hierarchies for visual recognition. In Advances
in Neural Information Processing Systems 23 (NIPS’10), pages 1090–1098. 272
Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and Le-
Cun, Y. (2010b). Learning convolutional feature hierarchies for visual recognition.
In NIPS’2010 . 422
Kelley, H. J. (1960). Gradient theory of optimal flight paths. ARS Journal, 30(10),
947–954. 188
Khan, F., Zhu, X., and Mutlu, B. (2011). How do humans teach: On curriculum learning
and teaching dimension. In Advances in Neural Information Processing Systems 24
(NIPS’11), pages 1449–1457. 247
Kindermann, R. (1980). Markov Random Fields and Their Applications (Contemporary
Mathematics ; V. 1). American Mathematical Society. 379
Kingma, D. and LeCun, Y. (2010a). Regularized estimation of image statistics by score
matching. In NIPS’2010 . 425
Kingma, D. and LeCun, Y. (2010b). Regularized estimation of image statistics by score
matching. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta,
editors, Advances in Neural Information Processing Systems 23, pages 1126–1134. 492
Kingma, D., Rezende, D., Mohamed, S., and Welling, M. (2014). Semi-supervised learning
with deep generative models. In NIPS’2014. 395
Kingma, D. P. (2013). Fast gradient-based inference with continuous latent variable
models in auxiliary form. Technical report, arXiv:1306.0733. 180, 395
Kingma, D. P. and Welling, M. (2014a). Auto-encoding variational Bayes. In Proceedings
of the International Conference on Learning Representations (ICLR). 180, 395, 467,
468
Kingma, D. P. and Welling, M. (2014b). Efficient gradient-based inference through trans-
formations between Bayes nets and neural nets. Technical report, arXiv:1402.0480. 180,
394, 395
Kirkpatrick, S., Gelatt Jr., C. D., and Vecchi, M. P. (1983). Optimization by simulated
annealing. Science, 220, 671–680. 245
Kiros, R., Salakhutdinov, R., and Zemel, R. (2014a). Multimodal neural language models.
In ICML’2014 . 92
Kiros, R., Salakhutdinov, R., and Zemel, R. (2014b). Unifying visual-semantic embed-
dings with multimodal neural language models. arXiv:1411.2539 [cs.LG]. 92, 311
Klementiev, A., Titov, I., and Bhattarai, B. (2012). Inducing crosslingual distributed
representations of words. In Proceedings of COLING 2012 . 444
Knowles-Barley, S., Jones, T. R., Morgan, J., Lee, D., Kasthuri, N., Lichtman, J. W., and
Pfister, H. (2014). Deep learning for the connectome. GPU Technology Conference. 22
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and
Techniques. MIT Press. 323, 393, 400
Konig, Y., Bourlard, H., and Morgan, N. (1996). REMAP: Recursive estimation and max-
imization of a posteriori probabilities - application to transition-based connectionist
speech recognition. In NIPS’95 . MIT Press, Cambridge, MA. 352
Koren, Y. (2009). The BellKor solution to the Netflix Grand Prize. 218
Koutnik, J., Greff, K., Gomez, F., and Schmidhuber, J. (2014). A clockwork RNN. In
ICML’2014 . 297, 317
Kočiský, T., Hermann, K. M., and Blunsom, P. (2014). Learning Bilingual Word Repre-
sentations by Marginalizing Alignments. In Proceedings of ACL. 369
Krause, O., Fischer, A., Glasmachers, T., and Igel, C. (2013). Approximation properties
of DBNs with binary hidden units and real-valued visible units. In ICML’2013. 455
Krizhevsky, A. (2010). Convolutional deep belief networks on CIFAR-
10. Technical report, University of Toronto. Unpublished Manuscript:
http://www.cs.utoronto.ca/~kriz/conv-cifar10-aug2010.pdf. 342
Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny
images. Technical report, University of Toronto. 19, 375
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012a). ImageNet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems
25 (NIPS’2012). 20, 23, 90, 346
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012b). ImageNet classification with deep
convolutional neural networks. In NIPS’2012 . 21, 182, 184, 421
Kuhn, H. W. and Tucker, A. W. (1951). Nonlinear programming. In Proceedings of the
Second Berkeley Symposium on Mathematical Statistics and Probability, pages 481–
492, Berkeley, Calif. University of California Press. 87
Kumar, M. P., Packer, B., and Koller, D. (2010). Self-paced learning for latent variable
models. In NIPS’2010 . 247
Lafferty, J., McCallum, A., and Pereira, F. C. N. (2001). Conditional random fields:
Probabilistic models for segmenting and labeling sequence data. In C. E. Brodley and
A. P. Danyluk, editors, ICML 2001 . Morgan Kaufmann. 318, 326
Lang, K. J. and Hinton, G. E. (1988). The development of the time-delay neural net-
work architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-
Mellon University. 282, 307
Lappalainen, H., Giannakopoulos, X., Honkela, A., and Karhunen, J. (2000). Nonlinear
independent component analysis using ensemble learning: Experiments and discussion.
In Proc. ICA. Citeseer. 415
Larochelle, H. and Bengio, Y. (2008a). Classification using discriminative restricted Boltz-
mann machines. In ICML’2008 . 408, 551
Larochelle, H. and Bengio, Y. (2008b). Classification using discriminative restricted
Boltzmann machines. In ICML’08 , pages 536–543. ACM. 446
Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator.
In AISTATS’2011 . 300, 303
Larochelle, H., Erhan, D., and Bengio, Y. (2008). Zero-data learning of new tasks. In
AAAI Conference on Artificial Intelligence. 442
Lasserre, J. A., Bishop, C. M., and Minka, T. P. (2006). Principled hybrids of generative
and discriminative models. In Proceedings of the Computer Vision and Pattern Recog-
nition Conference (CVPR’06), pages 87–94, Washington, DC, USA. IEEE Computer
Society. 446
Le, Q., Ngiam, J., Chen, Z., hao Chia, D. J., Koh, P. W., and Ng, A. (2010). Tiled
convolutional neural networks. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor,
R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems
23 (NIPS’10), pages 1279–1287. 266
Le, Q., Ranzato, M., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J., and Ng,
A. (2012). Building high-level features using large scale unsupervised learning. In
ICML’2012 . 20, 23
Le Roux, N. and Bengio, Y. (2010). Deep belief networks are compact universal approx-
imators. Neural Computation, 22(8), 2192–2207. 455
Le Roux, N., Manzagol, P.-A., and Bengio, Y. (2008). Topmoumoute online natural
gradient algorithm. In NIPS’07 . 166
LeCun, Y. (1985). Une procédure d’apprentissage pour Réseau à seuil assymétrique. In
Cognitiva 85: À la Frontière de l’Intelligence Artificielle, des Sciences de la Connais-
sance et des Neurosciences, pages 599–604, Paris 1985. CESTA, Paris. 188
LeCun, Y. (1987). Modèles connexionistes de l’apprentissage. Ph.D. thesis, Université de
Paris VI. 16, 404
LeCun, Y., Jackel, L. D., Boser, B., Denker, J. S., Graf, H. P., Guyon, I., Henderson, D.,
Howard, R. E., and Hubbard, W. (1989). Handwritten digit recognition: Applications
of neural network chips and automatic learning. IEEE Communications Magazine,
27(11), 41–46. 215, 276
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998a). Gradient-based learning
applied to document recognition. Proceedings of the IEEE , 86(11), 2278–2324. 14, 23
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998b). Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. 16, 19,
318, 326, 327, 328, 353
Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area
V2. In NIPS’07 . 408
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). Convolutional deep belief
networks for scalable unsupervised learning of hierarchical representations. In L. Bottou
and M. Littman, editors, ICML 2009. ACM, Montreal, Canada. 272, 540, 541
Lee, Y. J. and Grauman, K. (2011). Learning the easy things first: self-paced visual
category discovery. In CVPR’2011 . 247
Leibniz, G. W. (1676). Memoir using the chain rule. (Cited in TMME 7:2&3 p 321-332,
2010). 188
Lenat, D. B. and Guha, R. V. (1989). Building large knowledge-based systems; represen-
tation and inference in the Cyc project. Addison-Wesley Longman Publishing Co., Inc.
2
Leprieur, H. and Haffner, P. (1995). Discriminant learning with minimum memory loss
for improved non-vocabulary rejection. In EUROSPEECH’95, Madrid, Spain. 325
L’Hôpital, G. F. A. (1696). Analyse des infiniment petits, pour l’intelligence des lignes
courbes. Paris: L’Imprimerie Royale. 188
Lin, T., Horne, B. G., Tino, P., and Giles, C. L. (1996). Learning long-term dependencies
is not as difficult with NARX recurrent neural networks. IEEE Transactions on Neural
Networks, 7(6), 1329–1338. 308
Linde, N. (1992). The machine that changed the world, episode 3. Documentary minis-
eries. 2
Linnainmaa, S. (1976). Taylor expansion of the accumulated rounding error. BIT Nu-
merical Mathematics, 16(2), 146–160. 188
Long, P. M. and Servedio, R. A. (2010). Restricted Boltzmann machines are hard to ap-
proximately evaluate or simulate. In Proceedings of the 27th International Conference
on Machine Learning (ICML’10). 517
Lovelace, A. (1842). Notes upon L. F. Menabrea’s “Sketch of the Analytical Engine
invented by Charles Babbage”. 1
Lowerre, B. (1976). The Harpy Speech Recognition System. Ph.D. thesis, Carnegie Mellon
University. 319, 325, 332
Lukoševičius, M. and Jaeger, H. (2009). Reservoir computing approaches to recurrent
neural network training. Computer Science Review, 3(3), 127–149. 305
Luo, H., Carrier, P.-L., Courville, A., and Bengio, Y. (2013). Texture modeling with
convolutional spike-and-slab RBMs and deep extensions. In AISTATS’2013 . 92
Lyness, J. N. and Moler, C. B. (1967). Numerical differentiation of analytic functions.
SIAM J. Numer. Anal., 4, 202–210. 176
Lyu, S. (2009). Interpretation and generalization of score matching. In UAI’09 . 491
Maass, W., Natschlaeger, T., and Markram, H. (2002). Real-time computing without
stable states: A new framework for neural computation based on perturbations. Neural
Computation, 14(11), 2531–2560. 305
MacKay, D. (2003). Information Theory, Inference and Learning Algorithms. Cambridge
University Press. 54
Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., and Yuille, A. L. (2015). Deep captioning
with multimodal recurrent neural networks. In ICLR’2015 . arXiv:1410.1090. 92
Marlin, B., Swersky, K., Chen, B., and de Freitas, N. (2010). Inductive principles for
restricted Boltzmann machine learning. In Proceedings of The Thirteenth International
Conference on Artificial Intelligence and Statistics (AISTATS’10), volume 9, pages
509–516. 486, 491, 519
Martens, J. and Medabalimi, V. (2014). On the expressive efficiency of sum product
networks. arXiv:1411.7717 . 456
Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-
free optimization. In Proc. ICML’2011 . ACM. 313, 314
Mase, S. (1995). Consistency of the maximum pseudo-likelihood estimator of continuous
state space Gibbsian processes. The Annals of Applied Probability, 5(3), 603–612.
489
Matan, O., Burges, C. J. C., LeCun, Y., and Denker, J. S. (1992). Multi-digit recognition
using a space displacement neural network. In NIPS'91, pages 488–495, San Mateo,
CA. Morgan Kaufmann. 328
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall,
London. 157
McCulloch, W. S. and Pitts, W. (1943). A logical calculus of ideas immanent in nervous
activity. Bulletin of Mathematical Biophysics, 5, 115–133. 13
Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E.,
Muller, X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra,
J. (2011). Unsupervised and transfer learning challenge: a deep learning approach. In
JMLR W&CP: Proc. Unsupervised and Transfer Learning, volume 7. 182, 184, 441
Mesnil, G., Rifai, S., Dauphin, Y., Bengio, Y., and Vincent, P. (2012). Surfing on the
manifold. Learning Workshop, Snowbird. 544
Miikkulainen, R. and Dyer, M. G. (1991). Natural language processing with modular
PDP networks and distributed lexicon. Cognitive Science, 15, 343–399. 355
Mikolov, T. (2012). Statistical Language Models based on Neural Networks. Ph.D. thesis,
Brno University of Technology. 163, 315
Mikolov, T., Deoras, A., Kombrink, S., Burget, L., and Cernocky, J. (2011a). Empiri-
cal evaluation and combination of advanced language modeling techniques. In Proc.
12th annual conference of the international speech communication association (INTER-
SPEECH 2011). 366
Mikolov, T., Deoras, A., Povey, D., Burget, L., and Cernocky, J. (2011b). Strategies for
training large scale neural network language models. In Proc. ASRU’2011. 247, 366
Mikolov, T., Le, Q. V., and Sutskever, I. (2013). Exploiting similarities among languages
for machine translation. Technical report, arXiv:1309.4168. 444
Minka, T. (2005). Divergence measures and message passing. Technical Report MSR-
TR-2005-173, Microsoft Research, Cambridge, UK. 496
Minsky, M. L. and Papert, S. A. (1969). Perceptrons. MIT Press, Cambridge. 13
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York. 89
Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-
contrastive estimation. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and
K. Weinberger, editors, Advances in Neural Information Processing Systems 26 , pages
2265–2273. Curran Associates, Inc. 366, 494
Mnih, A. and Teh, Y. W. (2012). A fast and simple algorithm for training neural proba-
bilistic language models. In ICML’2012 , pages 1751–1758. 366
Mnih, V. and Hinton, G. (2010). Learning to detect roads in high-resolution aerial images.
In Proceedings of the 11th European Conference on Computer Vision (ECCV). 92
Mobahi, H. and Fisher III, J. W. (2015). A theoretical analysis of optimization by Gaussian
continuation. In AAAI’2015 . 246
Mohamed, A., Dahl, G., and Hinton, G. (2012). Acoustic modeling using deep belief
networks. IEEE Trans. on Audio, Speech and Language Processing, 20(1), 14–22. 352
Montúfar, G. (2014). Universal approximation depth and errors of narrow belief networks
with discrete units. Neural Computation, 26. 455
Montúfar, G. and Ay, N. (2011). Refinements of universal approximation results for
deep belief networks and restricted Boltzmann machines. Neural Computation, 23(5),
1306–1319. 455
Montúfar, G. and Morton, J. (2014). When does a mixture of products contain a product
of mixtures? SIAM Journal on Discrete Mathematics, 29(1), 321–347. 454
Montúfar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear
regions of deep neural networks. In NIPS’2014 . 17, 453, 456, 457
Mor-Yosef, S., Samueloff, A., Modan, B., Navot, D., and Schenker, J. G. (1990). Ranking
the risk factors for cesarean: logistic regression analysis of a nationwide study. Obstet
Gynecol, 75(6), 944–947. 2
Morin, F. and Bengio, Y. (2005). Hierarchical probabilistic neural network language
model. In AISTATS’2005. 361, 363
Mozer, M. C. (1992). The induction of multiscale temporal structure. In NIPS’91 , pages
275–282, San Mateo, CA. Morgan Kaufmann. 308, 309, 317
Murphy, K. P. (2012). Machine Learning: a Probabilistic Perspective. MIT Press, Cam-
bridge, MA, USA. 89, 138, 140
Murray, B. U. I. and Larochelle, H. (2014). A deep and tractable density estimator. In
ICML’2014 . 163, 304, 305
Nadas, A., Nahamoo, D., and Picheny, M. A. (1988). On a model-robust training method
for speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing,
ASSP-36(9), 1432–1436. 325
Nair, V. and Hinton, G. (2010a). Rectified linear units improve restricted Boltzmann
machines. In ICML’2010 . 155, 421
Nair, V. and Hinton, G. E. (2010b). Rectified linear units improve restricted Boltzmann
machines. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh
International Conference on Machine Learning (ICML-10), pages 807–814. ACM. 14
Narayanan, H. and Mitter, S. (2010). Sample complexity of testing the manifold hypoth-
esis. In NIPS’2010 . 145, 461
Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56,
71–113. 542
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics.
Springer. 221
Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2),
125–139. 498, 499
Neal, R. M. (2005). Estimating ratios of normalizing constants using linked importance
sampling. 500
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Read-
ing digits in natural images with unsupervised feature learning. Deep Learning and
Unsupervised Feature Learning Workshop, NIPS. 19
Ney, H. and Kneser, R. (1993). Improved clustering techniques for class-based statistical
language modelling. In European Conference on Speech Communication and Technol-
ogy (Eurospeech), pages 973–976, Berlin. 356
Niesler, T. R., Whittaker, E. W. D., and Woodland, P. C. (1998). Comparison of
part-of-speech and automatically derived category-based language models for speech
recognition. In International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 177–180. 356
Niranjan, M. and Fallside, F. (1990). Neural networks and radial basis functions in
classifying static speech patterns. Computer Speech and Language, 4, 275–289. 155
Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer. 85, 87
Olshausen, B. and Field, D. J. (2005). How close are we to understanding V1? Neural
Computation, 17, 1665–1699. 14
Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field prop-
erties by learning a sparse code for natural images. Nature, 381, 607–609. 277, 407,
460
Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set:
a strategy employed by V1? Vision Research, 37, 3311–3325. 350, 420
Opper, M. and Archambeau, C. (2009). The variational Gaussian approximation revisited.
Neural Computation, 21(3), 786–792. 180
Park, H., Amari, S.-I., and Fukumizu, K. (2000). Adaptive natural gradient learning
algorithms for various stochastic models. Neural Networks, 13(7), 755–764. 166
Parker, D. B. (1985). Learning-logic. Technical Report TR-47, Center for Comp. Research
in Economics and Management Sci., MIT. 188
Pascanu, R. (2014). On recurrent and deep networks. Ph.D. thesis, Université de
Montréal. 231, 232
Pascanu, R. and Bengio, Y. (2012). On the difficulty of training recurrent neural networks.
Technical Report arXiv:1211.5063, Universite de Montreal. 163
Pascanu, R. and Bengio, Y. (2013). Revisiting natural gradient for deep networks. Tech-
nical report, arXiv:1301.3584. 166
Pascanu, R., Mikolov, T., and Bengio, Y. (2013a). On the difficulty of training recurrent
neural networks. In ICML’2013 . 163, 236, 305, 309, 315, 316, 317
Pascanu, R., Montufar, G., and Bengio, Y. (2013b). On the number of inference regions
of deep feed forward networks with piece-wise linear activations. Technical report, U.
Montreal, arXiv:1312.6098. 182
Pascanu, R., Gülçehre, Ç., Cho, K., and Bengio, Y. (2014a). How to construct deep
recurrent neural networks. In ICLR’2014 . 17, 221, 297, 298, 311, 353, 456, 457
Pascanu, R., Montufar, G., and Bengio, Y. (2014b). On the number of inference regions
of deep feed forward networks with piece-wise linear activations. In ICLR’2014 . 454
Pearl, J. (1985). Bayesian networks: A model of self-activated memory for evidential
reasoning. In Proceedings of the 7th Conference of the Cognitive Science Society, Uni-
versity of California, Irvine, pages 329–334. 377
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann. 47
Petersen, K. B. and Pedersen, M. S. (2006). The matrix cookbook. Version 20051003. 27
Pinto, N., Cox, D. D., and DiCarlo, J. J. (2008). Why is real-world visual object recog-
nition hard? PLoS Comput Biol, 4. 350, 541
Pinto, N., Stone, Z., Zickler, T., and Cox, D. (2011). Scaling up biologically-inspired
computer vision: A case study in unconstrained face recognition on facebook. In Com-
puter Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer
Society Conference on, pages 35–42. IEEE. 272
Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1),
77–105. 299
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods.
USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17. 238
Poole, B., Sohl-Dickstein, J., and Ganguli, S. (2014). Analyzing noise in autoencoders
and deep networks. CoRR, abs/1406.1831. 203
Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In
UAI’2011 , Barcelona, Spain. 182, 456
Poundstone, W. (2005). Fortune’s Formula: The untold story of the scientific betting
system that beat the casinos and Wall Street. Macmillan. 55
Powell, M. (1987). Radial basis functions for multivariable interpolation: A review. 155
Price, R. (1958). A useful theorem for nonlinear devices having Gaussian inputs. IEEE
Transactions on Information Theory, 4(2), 69–72. 180
Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., and Fried, I. (2005). Invariant visual
representation by single neurons in the human brain. Nature, 435(7045), 1102–1107.
274
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE, 77(2), 257–286. 323, 352
Rabiner, L. R. and Juang, B. H. (1986). An introduction to hidden Markov models. IEEE
ASSP Magazine, pages 257–285. 281, 323
Raiko, T., Yao, L., Cho, K., and Bengio, Y. (2014). Iterative neural autoregressive
distribution estimator (NADE-k). Technical report, arXiv:1406.1485. 304
Raina, R., Madhavan, A., and Ng, A. Y. (2009). Large-scale deep unsupervised learning
using graphics processors. In L. Bottou and M. Littman, editors, ICML 2009 , pages
873–880, New York, NY, USA. ACM. 23, 341
Rall, L. B. (1981). Automatic Differentiation: Techniques and Applications. Lecture
Notes in Computer Science 120, Springer. 176
Ramsey, F. P. (1926). Truth and probability. In R. B. Braithwaite, editor, The Founda-
tions of Mathematics and other Logical Essays, chapter 7, pages 156–198. McMaster
University Archive for the History of Economic Thought. 48
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007a). Efficient learning of
sparse representations with an energy-based model. In NIPS’2006 . 12, 16, 420, 432,
433
Ranzato, M., Huang, F., Boureau, Y., and LeCun, Y. (2007b). Unsupervised learning
of invariant feature hierarchies with applications to object recognition. In Proceedings
of the Computer Vision and Pattern Recognition Conference (CVPR’07). IEEE Press.
272
Ranzato, M., Boureau, Y., and LeCun, Y. (2008). Sparse feature learning for deep belief
networks. In NIPS’2007 . 420
Rao, C. (1945). Information and the accuracy attainable in the estimation of statistical
parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–89. 121
Recht, B., Re, C., Wright, S., and Niu, F. (2011). Hogwild: A lock-free approach to
parallelizing stochastic gradient descent. In NIPS’2011 . 343
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and
approximate inference in deep generative models. In ICML’2014. 180, 394, 395
Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. Y. (2013). Zero-shot learning
through cross-modal transfer. In 27th Annual Conference on Neural Information Pro-
cessing Systems (NIPS 2013). 442
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011a). Contractive auto-
encoders: Explicit invariance during feature extraction. In ICML’2011. 428, 430,
463
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X.
(2011b). Higher order contractive auto-encoder. In European Conference on Machine
Learning and Principles and Practice of Knowledge Discovery in Databases (ECML
PKDD). 408
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X.
(2011c). Higher order contractive auto-encoder. In ECML PKDD. 428
Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011d). The manifold
tangent classifier. In NIPS’2011 . 476
Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for
sampling contractive auto-encoders. In ICML’2012. 544
Ringach, D. and Shapley, R. (2004). Reverse correlation in neurophysiology. Cognitive
Science, 28(2), 147–166. 276
Roberts, S. and Everson, R. (2001). Independent component analysis: principles and
practice. Cambridge University Press. 415
Robinson, A. J. and Fallside, F. (1991). A recurrent error propagation network speech
recognition system. Computer Speech and Language, 5(3), 259–274. 23, 352
Rockafellar, R. T. (1997). Convex Analysis. Princeton Landmarks in Mathematics. Princeton University Press. 85
Romero, A., Ballas, N., Ebrahimi Kahou, S., Chassang, A., Gatta, C., and Bengio, Y.
(2015). Fitnets: Hints for thin deep nets. In ICLR’2015, arXiv:1412.6550. 244, 245
Rosen, J. B. (1960). The gradient projection method for nonlinear programming. Part
I. Linear constraints. Journal of the Society for Industrial and Applied Mathematics,
8(1), 181–217. 85
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage
and organization in the brain. Psychological Review, 65, 386–408. 12, 13, 23
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York. 13, 23
Roweis, S. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear
embedding. Science, 290(5500). 145, 464
Rumelhart, D., Hinton, G., and Williams, R. (1986a). Learning representations by back-
propagating errors. Nature, 323, 533–536. 12, 16, 21, 188, 355
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986b). Learning internal repre-
sentations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors,
Parallel Distributed Processing, volume 1, chapter 8, pages 318–362. MIT Press, Cam-
bridge. 19, 23, 188
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986c). Learning representations
by back-propagating errors. Nature, 323, 533–536. 149, 281
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986d). Parallel
Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press,
Cambridge. 15, 188
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986e). Parallel
Distributed Processing: Explorations in the Microstructure of Cognition, volume 1.
MIT Press, Cambridge. 149
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,
A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2014a). ImageNet Large
Scale Visual Recognition Challenge. 19
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,
A., Khosla, A., Bernstein, M., et al. (2014b). Imagenet large scale visual recognition
challenge. arXiv preprint arXiv:1409.0575 . 24
Rust, N., Schwartz, O., Movshon, J. A., and Simoncelli, E. (2005). Spatiotemporal
elements of macaque V1 receptive fields. Neuron, 46(6), 945–956. 275
Sainath, T., Mohamed, A.-r., Kingsbury, B., and Ramabhadran, B. (2013). Deep
convolutional neural networks for LVCSR. In ICASSP 2013 . 353
Salakhutdinov, R. and Hinton, G. (2009a). Deep Boltzmann machines. In Proceedings of
the International Conference on Artificial Intelligence and Statistics, volume 5, pages
448–455. 20, 23, 433, 526, 529, 534, 536
Salakhutdinov, R. and Hinton, G. (2009b). Deep Boltzmann machines. In Proceedings
of the Twelfth International Conference on Artificial Intelligence and Statistics (AIS-
TATS 2009), volume 8. 533, 537, 549
Salakhutdinov, R. and Hinton, G. E. (2008). Using deep belief nets to learn covariance
kernels for Gaussian processes. In NIPS’07 , pages 1249–1256, Cambridge, MA. MIT
Press. 447
Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief
networks. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008 ,
volume 25, pages 872–879. ACM. 499
Saul, L. K., Jaakkola, T., and Jordan, M. I. (1996). Mean field theory for sigmoid belief
networks. Journal of Artificial Intelligence Research, 4, 61–76. 23
Saxe, A. M., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., and Ng, A. (2011). On
random weights and unsupervised feature learning. In Proc. ICML’2011 . ACM. 272
Schaul, T., Zhang, S., and LeCun, Y. (2012). No More Pesky Learning Rates. Technical
report, New York University, arXiv:1206.1106. 243
Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of
history compression. Neural Computation, 4(2), 234–242. 297
Schmidhuber, J. (1996). Sequential neural text compression. IEEE Transactions on
Neural Networks, 7(1), 142–146. 355
Schölkopf, B. and Smola, A. (2002). Learning with Kernels. MIT Press. 139
Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a
kernel eigenvalue problem. Neural Computation, 10, 1299–1319. 145, 464
Schölkopf, B., Burges, C. J. C., and Smola, A. J. (1999). Advances in Kernel Methods
Support Vector Learning. MIT Press, Cambridge, MA. 16, 155, 184
Schulz, H. and Behnke, S. (2012). Learning two-layer contractive encodings. In
ICANN’2012 , pages 620–628. 430
Schuster, M. and Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE
Transactions on Signal Processing, 45(11), 2673–2681. 296
Schwenk, H. (2007). Continuous space language models. Computer Speech and Language,
21, 492–518. 356, 360
Schwenk, H. (2010). Continuous space language models for statistical machine translation.
The Prague Bulletin of Mathematical Linguistics, 93, 137–146. 356, 367
Schwenk, H. (2014). Cleaned subset of WMT '14 dataset. 19
Schwenk, H. and Gauvain, J.-L. (2002). Connectionist language modeling for large vocab-
ulary continuous speech recognition. In International Conference on Acoustics, Speech
and Signal Processing (ICASSP), volume 1, pages 765–768. 356
Schwenk, H. and Gauvain, J.-L. (2005). Building continuous space language models for
transcribing european languages. In Interspeech, pages 737–740. 356
Schwenk, H., Costa-juss`a, M. R., and Fonollosa, J. A. R. (2006). Continuous space lan-
guage models for the IWSLT 2006 task. In International Workshop on Spoken Language
Translation, pages 166–173. 356, 367
Seide, F., Li, G., and Yu, D. (2011). Conversational speech transcription using context-
dependent deep neural networks. In Interspeech 2011 , pages 437–440. 21
Sermanet, P., Chintala, S., and LeCun, Y. (2012). Convolutional neural networks applied
to house numbers digit classification. CoRR, abs/1204.3968. 350
Sermanet, P., Kavukcuoglu, K., Chintala, S., and LeCun, Y. (2013). Pedestrian detection
with unsupervised multi-stage feature learning. In Proc. International Conference on
Computer Vision and Pattern Recognition (CVPR’13). IEEE. 21, 182, 184
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical
Journal, 27(3), 379–423. 55
Shannon, C. E. (1949). Communication in the presence of noise. Proceedings of the
Institute of Radio Engineers, 37(1), 10–21. 55
Shilov, G. (1977). Linear Algebra. Dover Books on Mathematics Series. Dover Publica-
tions. 27
Siegelmann, H. (1995). Computation beyond the Turing limit. Science, 268(5210), 545–
548. 284
Siegelmann, H. and Sontag, E. (1991). Turing computability with neural nets. Applied
Mathematics Letters, 4(6), 77–80. 284
Siegelmann, H. T. and Sontag, E. D. (1995). On the computational power of neural nets.
Journal of Computer and Systems Sciences, 50(1), 132–150. 236
Simard, P., Victorri, B., LeCun, Y., and Denker, J. (1992). Tangent prop - A formalism
for specifying selected invariances in an adaptive network. In NIPS’1991 . 475, 476
Simard, P. Y., LeCun, Y., and Denker, J. (1993). Efficient pattern recognition using a
new transformation distance. In NIPS’92 . 474
Simard, P. Y., LeCun, Y. A., Denker, J. S., and Victorri, B. (1998). Transformation
invariance in pattern recognition — tangent distance and tangent propagation. Lecture
Notes in Computer Science, 1524. 474
Sjöberg, J. and Ljung, L. (1995). Overtraining, regularization and searching for a min-
imum, with application to neural networks. International Journal of Control, 62(6),
1391–1407. 213
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of
harmony theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed
Processing, volume 1, chapter 6, pages 194–281. MIT Press, Cambridge. 384, 397
Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011a).
Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In
NIPS’2011 . 299
Socher, R., Manning, C., and Ng, A. Y. (2011b). Parsing natural scenes and natural
language with recursive neural networks. In Proceedings of the Twenty-Eighth Inter-
national Conference on Machine Learning (ICML’2011). 299
Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D. (2011c).
Semi-supervised recursive autoencoders for predicting sentiment distributions. In
EMNLP’2011 . 299
Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C.
(2013). Recursive deep models for semantic compositionality over a sentiment treebank.
In EMNLP’2013 . 299
Solla, S. A., Levin, E., and Fleisher, M. (1988). Accelerated learning in layered neural
networks. Complex Systems, 2, 625–639. 159
Sontag, E. D. and Sussman, H. J. (1989). Backpropagation can give rise to spurious local
minima even for networks without hidden layers. Complex Systems, 3, 91–106. 229
Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturba-
tion gradient approximation. IEEE Transactions on Automatic Control, 37, 332–341.
176
Spitkovsky, V. I., Alshawi, H., and Jurafsky, D. (2010). From baby steps to leapfrog:
how ”less is more” in unsupervised dependency parsing. In HLT’10 . 247
Srivastava, N. and Salakhutdinov, R. (2012). Multimodal learning with deep Boltzmann
machines. In NIPS’2012 . 445
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).
Dropout: A simple way to prevent neural networks from overfitting. Journal of Ma-
chine Learning Research, 15, 1929–1958. 218, 220, 221, 536
Stewart, L., He, X., and Zemel, R. S. (2007). Learning flexible features for conditional
random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence,
30(8), 1415–1426. 319
Supancic, J. and Ramanan, D. (2013). Self-paced learning for long-term tracking. In
CVPR’2013 . 247
Sutskever, I. (2012). Training Recurrent Neural Networks. Ph.D. thesis, Department of
computer science, University of Toronto. 306, 307, 314
Sutskever, I. and Tieleman, T. (2010). On the Convergence Properties of Contrastive
Divergence. In Y. W. Teh and M. Titterington, editors, Proc. of the International
Conference on Artificial Intelligence and Statistics (AISTATS), volume 9, pages 789–
795. 484
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of
initialization and momentum in deep learning. In ICML. 238, 306, 307, 314
Sutskever, I., Vinyals, O., and Le, Q. V. (2014a). Sequence to sequence learning with
neural networks. Technical report, arXiv:1409.3215. 22, 91, 311, 312
Sutskever, I., Vinyals, O., and Le, Q. V. (2014b). Sequence to sequence learning with
neural networks. In NIPS’2014 . 368, 369
Swersky, K. (2010). Inductive Principles for Learning Restricted Boltzmann Machines.
Master’s thesis, University of British Columbia. 426
Swersky, K., Ranzato, M., Buchman, D., Marlin, B., and de Freitas, N. (2011). On
autoencoders and score matching for energy based models. In ICML’2011 . ACM. 492
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Van-
houcke, V., and Rabinovich, A. (2014a). Going deeper with convolutions. Technical
report, arXiv:1409.4842. 20, 21, 23, 182, 184, 225, 262
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and
Fergus, R. (2014b). Intriguing properties of neural networks. ICLR, abs/1312.6199.
223
Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014). Deepface: Closing the gap to
human-level performance in face verification. In CVPR’2014 . 90
Tang, Y. and Eliasmith, C. (2010). Deep networks for robust visual recognition. In
Proceedings of the 27th International Conference on Machine Learning, June 21-24,
2010, Haifa, Israel. 203
Tenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global geometric framework
for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323. 145, 436, 437,
464
Thrun, S. (1995). Learning to play the game of chess. In NIPS’1994 . 476
Tibshirani, R. J. (1995). Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society B, 58, 267–288. 198
Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to
the likelihood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors,
ICML 2008 , pages 1064–1071. ACM. 486, 523
Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal components analysis.
Journal of the Royal Statistical Society B, 61(3), 611–622. 414
Torabi, A., Pal, C., Larochelle, H., and Courville, A. (2015). Using descriptive video
services to create a large data source for video annotation research. arXiv preprint
arXiv:1503.01070. 134
Tu, K. and Honavar, V. (2011). On the utility of curricula in unsupervised learning of
probabilistic grammars. In IJCAI’2011 . 247
Uria, B., Murray, I., and Larochelle, H. (2013). RNADE: The real-valued neural autore-
gressive density-estimator. In NIPS’2013 . 302, 304
van der Maaten, L. and Hinton, G. E. (2008a). Visualizing data using t-SNE. J. Machine
Learning Res., 9. 356, 436, 464, 468
van der Maaten, L. and Hinton, G. E. (2008b). Visualizing data using t-SNE. Journal of
Machine Learning Research, 9, 2579–2605. 437
Vanhoucke, V., Senior, A., and Mao, M. Z. (2011). Improving the speed of neural networks
on cpus. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop.
340
Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer-
Verlag, Berlin. 102, 103
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York.
102, 103
Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative
frequencies of events to their probabilities. Theory of Probability and Its Applications,
16, 264–280. 102, 103
Vincent, P. (2011a). A connection between score matching and denoising autoencoders.
Neural Computation, 23(7). 425, 426, 428, 544
Vincent, P. (2011b). A connection between score matching and denoising autoencoders.
Neural Computation, 23(7), 1661–1674. 492, 545
Vincent, P. and Bengio, Y. (2003). Manifold Parzen windows. In NIPS’2002 . MIT Press.
466
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and
composing robust features with denoising autoencoders. In ICML 2008 . 423
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked
denoising autoencoders: Learning useful representations in a deep network with a local
denoising criterion. J. Machine Learning Res., 11. 423
Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., and Hinton, G. (2014a). Gram-
mar as a foreign language. Technical report, arXiv:1412.7449. 311
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2014b). Show and tell: a neural
image caption generator. arXiv 1411.4555. 311
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and tell: a neural image
caption generator. In CVPR’2015 . arXiv:1411.4555. 92
Von Melchner, L., Pallas, S. L., and Sur, M. (2000). Visual behaviour mediated by retinal
projections directed to the auditory pathway. Nature, 404(6780), 871–876. 14
Wager, S., Wang, S., and Liang, P. (2013). Dropout training as adaptive regularization.
In Advances in Neural Information Processing Systems 26 , pages 351–359. 221
Waibel, A., Hanazawa, T., Hinton, G. E., Shikano, K., and Lang, K. (1989). Phoneme
recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech,
and Signal Processing, 37, 328–339. 282, 346, 352
Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of
neural networks using dropconnect. In ICML’2013. 222
Wang, S. and Manning, C. (2013). Fast dropout training. In ICML’2013. 221
Warde-Farley, D., Goodfellow, I. J., Lamblin, P., Desjardins, G., Bastien, F., and Bengio,
Y. (2011). pylearn2. http://deeplearning.net/software/pylearn2. 342
Warde-Farley, D., Goodfellow, I. J., Courville, A., and Bengio, Y. (2014). An empirical
analysis of dropout in piecewise linear networks. In ICLR’2014 . 221
Weinberger, K. Q. and Saul, L. K. (2004). Unsupervised learning of image manifolds by
semidefinite programming. In CVPR’2004 , pages 988–995. 145, 464
Werbos, P. J. (1981). Applications of advances in nonlinear sensitivity analysis. In
Proceedings of the 10th IFIP Conference, 31.8 - 4.9, NYC , pages 762–770. 188
Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised em-
bedding. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008 , pages
1168–1175, New York, NY, USA. ACM. 446
Weston, J., Bengio, S., and Usunier, N. (2010). Large scale image annotation: learning
to rank with joint word-image embeddings. Machine Learning, 81(1), 21–35. 299
Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. In 1960 IRE WESCON
Convention Record, volume 4, pages 96–104. IRE, New York. 13, 19, 20, 23
Wikipedia (2015). List of animals by number of neurons. Wikipedia, the free encyclo-
pedia. [Online; accessed 4-March-2015]. 20, 23
Williams, C. K. I. and Rasmussen, C. E. (1996). Gaussian processes for regression. In
NIPS’95 , pages 514–520. MIT Press, Cambridge, MA. 184
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Machine Learning, 8, 229–256. 180
Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms.
Neural Computation, 8(7), 1341–1390. 104
Wu, R., Yan, S., Shan, Y., Dang, Q., and Sun, G. (2015). Deep image: Scaling up image
recognition. arXiv:1501.02876. 21, 343
Wu, Z. (1997). Global continuation for distance geometry problems. SIAM Journal of
Optimization, 7, 814–836. 245
Xiong, H. Y., Barash, Y., and Frey, B. J. (2011). Bayesian prediction of tissue-regulated
splicing using RNA sequence and cellular context. Bioinformatics, 27(18), 2554–2562.
221
Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. S.,
and Bengio, Y. (2015a). Show, attend and tell: Neural image caption generation with
visual attention. In ICML’2015 . 92
Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. S.,
and Bengio, Y. (2015b). Show, attend and tell: Neural image caption generation with
visual attention. arXiv:1502.03044. 311
Xu, L. and Jordan, M. I. (1996). On convergence properties of the EM algorithm for
Gaussian mixtures. Neural Computation, 8, 129–151. 325
Younes, L. (1998). On the convergence of Markovian stochastic algorithms with rapidly
decreasing ergodicity rates. In Stochastics and Stochastics Models, pages 177–228. 484,
523
Zaremba, W. and Sutskever, I. (2014). Learning to execute. arXiv 1410.4615. 247
Zaslavsky, T. (1975). Facing Up to Arrangements: Face-Count Formulas for Partitions
of Space by Hyperplanes. Number no. 154 in Memoirs of the American Mathematical
Society. American Mathematical Society. 454
Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional net-
works. In ECCV’14 . 6
Zhou, J. and Troyanskaya, O. G. (2014). Deep supervised and convolutional generative
stochastic network for protein secondary structure prediction. In ICML’2014 . 550, 551
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society, Series B, 67(2), 301–320. 165
Zöhrer, M. and Pernkopf, F. (2014). General stochastic networks for classification. In
NIPS’2014 . 550