Bibliography

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis,

A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M.,

Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R.,

Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I.,

Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden,

P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-scale

machine learning on heterogeneous systems. Software available from tensorflow.org. 25,

213, 449

Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985). A learning algorithm for

Boltzmann machines. Cognitive Science, 9, 147–169. 573, 659

Alain, G. and Bengio, Y. (2012). What regularized auto-encoders learn from the data

generating distribution. Technical Report arXiv report 1211.4246, Université de Montréal.

516

Alain, G. and Bengio, Y. (2013). What regularized auto-encoders learn from the data

generating distribution. In ICLR’2013 . also arXiv report 1211.4246. 510, 516, 524

Alain, G., Bengio, Y., Yao, L., Éric Thibodeau-Laufer, Yosinski, J., and Vincent, P. (2015).

GSNs: Generative stochastic networks. arXiv:1503.05571. 513, 721

Anderson, E. (1935). The Irises of the Gaspe Peninsula. Bulletin of the American Iris

Society, 59, 2–5. 21

Ba, J., Mnih, V., and Kavukcuoglu, K. (2014). Multiple object recognition with visual

attention. arXiv:1412.7755 . 699

Bachman, P. and Precup, D. (2015). Variational generative stochastic networks with

collaborative shaping. In Proceedings of the 32nd International Conference on Machine

Learning, ICML 2015, Lille, France, 6-11 July 2015 , pages 1964–1972. 725

Bacon, P.-L., Bengio, E., Pineau, J., and Precup, D. (2015). Conditional computation in

neural networks using a decision-theoretic approach. In 2nd Multidisciplinary Conference

on Reinforcement Learning and Decision Making (RLDM 2015). 453

Bagnell, J. A. and Bradley, D. M. (2009). Differentiable sparse coding. In D. Koller,

D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information

Processing Systems 21 (NIPS’08), pages 113–120. 501

Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly

learning to align and translate. In ICLR’2015, arXiv:1409.0473 . 25, 101, 400, 421, 423,

468, 478

Bahl, L. R., Brown, P., de Souza, P. V., and Mercer, R. L. (1987). Speech recognition

with continuous-parameter hidden Markov models. Computer, Speech and Language,

219–234. 461

Baldi, P. and Hornik, K. (1989). Neural networks and principal component analysis:

Learning from examples without local minima. Neural Networks, 2, 53–58. 288

Baldi, P., Brunak, S., Frasconi, P., Soda, G., and Pollastri, G. (1999). Exploiting the

past and the future in protein secondary structure prediction. Bioinformatics, (11),

937–946. 398

Baldi, P., Sadowski, P., and Whiteson, D. (2014). Searching for exotic particles in

high-energy physics with deep learning. Nature communications, 5. 26

Ballard, D. H., Hinton, G. E., and Sejnowski, T. J. (1983). Parallel vision computation.

Nature. 455

Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295–311. 147

Barron, A. E. (1993). Universal approximation bounds for superpositions of a sigmoidal

function. IEEE Trans. on Information Theory, 39, 930–945. 199

Bartholomew, D. J. (1987). Latent variable models and factor analysis. Oxford University

Press. 493

Basilevsky, A. (1994). Statistical Factor Analysis and Related Methods: Theory and

Applications. Wiley. 493

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A.,

Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements.

Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop. 25, 82, 213,

223, 449

Basu, S. and Christensen, J. (2013). Teaching classification boundaries to humans. In

AAAI’2013 . 332

Baxter, J. (1995). Learning internal representations. In Proceedings of the 8th International

Conference on Computational Learning Theory (COLT’95), pages 311–320, Santa Cruz,

California. ACM Press. 247

Bayer, J. and Osendorfer, C. (2014). Learning stochastic recurrent networks. arXiv

preprint arXiv:1411.7610 . 265

Becker, S. and Hinton, G. (1992). A self-organizing neural network that discovers surfaces

in random-dot stereograms. Nature, 355, 161–163. 544

Behnke, S. (2001). Learning iterative image reconstruction in the neural abstraction

pyramid. Int. J. Computational Intelligence and Applications, 1(4), 427–438. 518

Beiu, V., Quintana, J. M., and Avedillo, M. J. (2003). VLSI implementations of threshold

logic - a comprehensive survey. Neural Networks, IEEE Transactions on, (5), 1217–1243. 454

Belkin, M. and Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for

embedding and clustering. In T. Dietterich, S. Becker, and Z. Ghahramani, editors,

Advances in Neural Information Processing Systems 14 (NIPS’01), Cambridge, MA.

MIT Press. 245

Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and

data representation. Neural Computation, 15(6), 1373–1396. 164, 521

Bengio, E., Bacon, P.-L., Pineau, J., and Precup, D. (2015a). Conditional computation in

neural networks for faster models. arXiv:1511.06297. 453

Bengio, S. and Bengio, Y. (2000a). Taking on the curse of dimensionality in joint

distributions using neural networks. IEEE Transactions on Neural Networks, special

issue on Data Mining and Knowledge Discovery, 11(3), 550–557. 715

Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015b). Scheduled sampling for

sequence prediction with recurrent neural networks. Technical report, arXiv:1506.03099.

387

Bengio, Y. (1991). Artificial Neural Networks and their Application to Sequence Recognition.

Ph.D. thesis, McGill University, (Computer Science), Montreal, Canada. 410

Bengio, Y. (2000). Gradient-based optimization of hyperparameters. Neural Computation,

12(8), 1889–1900. 438

Bengio, Y. (2002). New distributed probabilistic language models. Technical Report 1215,

Dept. IRO, Université de Montréal. 470

Bengio, Y. (2009). Learning deep architectures for AI . Now Publishers. 201, 628

Bengio, Y. (2013). Deep learning of representations: looking forward. In Statistical

Language and Speech Processing, volume 7978 of Lecture Notes in Computer Science,

pages 1–37. Springer, also in arXiv at http://arxiv.org/abs/1305.0445. 451

Bengio, Y. (2015). Early inference in energy-based models approximates back-propagation.

Technical Report arXiv:1510.02777, Université de Montréal. 661

Bengio, Y. and Bengio, S. (2000b). Modeling high-dimensional discrete data with multi-

layer neural networks. In S. Solla, T. Leen, and K.-R. Müller, editors, Advances in

Neural Information Processing Systems 12 (NIPS’99), pages 400–406. MIT Press. 713,

715, 716, 718

Bengio, Y. and Delalleau, O. (2009). Justifying and generalizing contrastive divergence.

Neural Computation, 21(6), 1601–1621. 516, 615

Bengio, Y. and Grandvalet, Y. (2004). No unbiased estimator of the variance of k-fold

cross-validation. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural

Information Processing Systems 16 (NIPS’03), Cambridge, MA. MIT Press.

122

Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In Large Scale

Kernel Machines. 19

Bengio, Y. and Monperrus, M. (2005). Non-local manifold tangent learning. In L. Saul,

Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems

17 (NIPS’04), pages 129–136. MIT Press. 160, 522

Bengio, Y. and Sénécal, J.-S. (2003). Quick training of probabilistic neural nets by

importance sampling. In Proceedings of AISTATS 2003 . 473

Bengio, Y. and Sénécal, J.-S. (2008). Adaptive importance sampling to accelerate training

of a neural probabilistic language model. IEEE Trans. Neural Networks, (4), 713–722.

473

Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1991). Phonetically motivated

acoustic parameters for continuous speech recognition using artificial neural networks.

In Proceedings of EuroSpeech’91 . 27, 462

Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992). Neural network - Gaussian

mixture hybrid for speech recognition or density estimation. In NIPS 4 , pages 175–182.

Morgan Kaufmann. 462

Bengio, Y., Frasconi, P., and Simard, P. (1993). The problem of learning long-term

dependencies in recurrent networks. In IEEE International Conference on Neural

Networks, pages 1183–1195, San Francisco. IEEE Press. (invited paper). 406

Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with

gradient descent is difficult. IEEE Tr. Neural Nets. 18, 405, 406, 414

Bengio, Y., Latendresse, S., and Dugas, C. (1999). Gradient-based learning of hyper-

parameters. Learning Conference, Snowbird. 438

Bengio, Y., Ducharme, R., and Vincent, P. (2001). A neural probabilistic language model.

In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, NIPS’2000 , pages 932–938. MIT

Press. 18, 450, 466, 469, 475, 480, 485

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic

language model. JMLR, 3, 1137–1155. 469, 475

Bengio, Y., Le Roux, N., Vincent, P., Delalleau, O., and Marcotte, P. (2006a). Convex

neural networks. In NIPS’2005 , pages 123–130. 258

Bengio, Y., Delalleau, O., and Le Roux, N. (2006b). The curse of highly variable functions

for local kernel machines. In NIPS’2005 . 158

Bengio, Y., Larochelle, H., and Vincent, P. (2006c). Non-local manifold Parzen windows.

In NIPS’2005 . MIT Press. 160, 523

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise

training of deep networks. In NIPS’2006 . 14, 19, 201, 326, 327, 328, 531, 533

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In

ICML’09 . 331, 332

Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013a). Better mixing via deep

representations. In ICML’2013 . 608

Bengio, Y., Léonard, N., and Courville, A. (2013b). Estimating or propagating gradients

through stochastic neurons for conditional computation. arXiv:1308.3432. 451, 453,

696, 699

Bengio, Y., Yao, L., Alain, G., and Vincent, P. (2013c). Generalized denoising auto-

encoders as generative models. In NIPS’2013 . 510, 719, 722

Bengio, Y., Courville, A., and Vincent, P. (2013d). Representation learning: A review and

new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI),

35(8), 1798–1828. 558

Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014). Deep generative

stochastic networks trainable by backprop. In ICML’2014 . 719, 720, 721, 722, 723

Bennett, C. (1976). Efficient estimation of free energy differences from Monte Carlo data.

Journal of Computational Physics, 22(2), 245–268. 634

Bennett, J. and Lanning, S. (2007). The Netflix prize. 482

Berger, A. L., Della Pietra, V. J., and Della Pietra, S. A. (1996). A maximum entropy

approach to natural language processing. Computational Linguistics, 22, 39–71. 476

Berglund, M. and Raiko, T. (2013). Stochastic gradient estimate variance in contrastive

divergence and persistent contrastive divergence. CoRR, abs/1312.6002. 618

Bergstra, J. (2011). Incorporating Complex Cells into Neural Networks for Pattern

Classification. Ph.D. thesis, Université de Montréal. 255

Bergstra, J. and Bengio, Y. (2009). Slow, decorrelated features for pretraining complex

cell-like networks. In NIPS’2009 . 497

Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. J.

Machine Learning Res., 13, 281–305. 437, 438

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian,

J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression

compiler. In Proc. SciPy. 25, 82, 213, 223, 449

Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter

optimization. In NIPS’2011 . 439

Berkes, P. and Wiskott, L. (2005). Slow feature analysis yields a rich repertoire of complex

cell properties. Journal of Vision, 5(6), 579–602. 498

Bertsekas, D. P. and Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific.

106

Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, (3), 179–195.

621

Bishop, C. M. (1994). Mixture density networks. 189

Bishop, C. M. (1995a). Regularization and complexity control in feed-forward networks.

In Proceedings International Conference on Artificial Neural Networks ICANN’95,

volume 1, pages 141–148. 243, 250

Bishop, C. M. (1995b). Training with noise is equivalent to Tikhonov regularization.

Neural Computation, 7(1), 108–116. 243

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. 98, 146

Blum, A. L. and Rivest, R. L. (1992). Training a 3-node neural network is NP-complete.

295

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1989). Learnability and

the Vapnik–Chervonenkis dimension. Journal of the ACM, 36(4), 929–965. 114

Bonnet, G. (1964). Transformations des signaux aléatoires à travers les systèmes non

linéaires sans mémoire. Annales des Télécommunications, 19(9–10), 203–220. 696

Bordes, A., Weston, J., Collobert, R., and Bengio, Y. (2011). Learning structured

embeddings of knowledge bases. In AAAI 2011 . 487

Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2012). Joint learning of words and

meaning representations for open-text semantic parsing. AISTATS’2012 . 404, 487

Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2013a). A semantic matching energy

function for learning with multi-relational data. Machine Learning: Special Issue on

Learning Semantics. 486

Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O. (2013b).

Translating embeddings for modeling multi-relational data. In C. Burges, L. Bottou,

M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information

Processing Systems 26 , pages 2787–2795. Curran Associates, Inc. 487

Bornschein, J. and Bengio, Y. (2015). Reweighted wake-sleep. In ICLR’2015,

arXiv:1406.2751 . 701

Bornschein, J., Shabanian, S., Fischer, A., and Bengio, Y. (2015). Training bidirectional

Helmholtz machines. Technical report, arXiv:1506.03877. 701

Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal

margin classifiers. In COLT ’92: Proceedings of the fifth annual workshop on

Computational learning theory, pages 144–152, New York, NY, USA. ACM. 18, 141

Bottou, L. (1998). Online algorithms and stochastic approximations. In D. Saad, editor,

Online Learning in Neural Networks. Cambridge University Press, Cambridge, UK. 298

Bottou, L. (2011). From machine learning to machine reasoning. Technical report,

arXiv:1102.1808. 404

Bottou, L. (2015). Multilayer neural networks. Deep Learning Summer School. 443

Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In NIPS’2008 .

284, 298

Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal

dependencies in high-dimensional sequences: Application to polyphonic music generation

and transcription. In ICML’12 . 693

Boureau, Y., Ponce, J., and LeCun, Y. (2010). A theoretical analysis of feature pooling in

vision algorithms. In Proc. International Conference on Machine learning (ICML’10).

348

Boureau, Y., Le Roux, N., Bach, F., Ponce, J., and LeCun, Y. (2011). Ask the locals:

multi-way local pooling for image recognition. In Proc. International Conference on

Computer Vision (ICCV’11). IEEE. 348

Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer perceptrons and

singular value decomposition. Biological Cybernetics, 59, 291–294. 505

Bourlard, H. and Wellekens, C. (1989). Speech pattern discrimination and multi-layered

perceptrons. Computer Speech and Language, 3, 1–19. 462

Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University

Press, New York, NY, USA. 93

Boyd, S. and Vandenberghe, L. (2015). Convex optimization. Book in preparation. 31

Brady, M. L., Raghavan, R., and Slawny, J. (1989). Back-propagation fails to separate

where perceptrons succeed. IEEE Transactions on Circuits and Systems, 665–674.

286

Brakel, P., Stroobandt, D., and Schrauwen, B. (2013). Training energy-based models for

time-series imputation. Journal of Machine Learning Research, 2771–2797. 681,

706

Brand, M. (2003). Charting a manifold. In NIPS’2002 , pages 961–968. MIT Press. 164,

521

Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140. 256

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and

Regression Trees. Wadsworth International Group, Belmont, CA. 146

Bridle, J. S. (1990). Alphanets: a recurrent ‘neural’ network architecture with a hidden

Markov model interpretation. Speech Communication, 9(1), 83–92. 186

Briggman, K., Denk, W., Seung, S., Helmstaedter, M. N., and Turaga, S. C. (2009).

Maximin affinity learning of image segmentation. In NIPS’2009 , pages 1865–1873. 362

Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D.,

Mercer, R. L., and Roossin, P. S. (1990). A statistical approach to machine translation.

Computational linguistics, 16(2), 79–85. 21

Brown, P. F., Pietra, V. J. D., DeSouza, P. V., Lai, J. C., and Mercer, R. L. (1992). Class-

based n-gram models of natural language. Computational Linguistics, 467–479.

466

Bryson, A. and Ho, Y. (1969). Applied optimal control: optimization, estimation, and

control. Blaisdell Pub. Co. 226

Bryson, Jr., A. E. and Denham, W. F. (1961). A steepest-ascent method for solving

optimum programming problems. Technical Report BR-1303, Raytheon Company,

Missile and Space Division. 226

Buciluǎ, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model compression. In

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery

and data mining, pages 535–541. ACM. 451

Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). Importance weighted autoencoders.

arXiv preprint arXiv:1509.00519 . 706

Cai, M., Shi, Y., and Liu, J. (2013). Deep maxout neural networks for speech recognition.

In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop

on, pages 291–296. IEEE. 194

Carreira-Perpiñan, M. A. and Hinton, G. E. (2005). On contrastive divergence learning.

In R. G. Cowell and Z. Ghahramani, editors, Proceedings of the Tenth International

Workshop on Artificial Intelligence and Statistics (AISTATS’05), pages 33–40. Society

for Artificial Intelligence and Statistics. 615

Caruana, R. (1993). Multitask connectionist learning. In Proc. 1993 Connectionist Models

Summer School, pages 372–379. 246

Cauchy, A. (1847). Méthode générale pour la résolution de systèmes d’équations simultanées.

In Compte rendu des séances de l’académie des sciences, pages 536–538. 83,

225

Cayton, L. (2005). Algorithms for manifold learning. Technical Report CS2008-0923,

UCSD. 164

Chandola, V., Banerjee, A., and Kumar, V. (2009). Anomaly detection: A survey. ACM

computing surveys (CSUR), 41(3), 15. 102

Chapelle, O., Weston, J., and Schölkopf, B. (2003). Cluster kernels for semi-supervised

learning. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural

Information Processing Systems 15 (NIPS’02), pages 585–592, Cambridge, MA. MIT

Press. 245

Chapelle, O., Schölkopf, B., and Zien, A., editors (2006). Semi-Supervised Learning. MIT

Press, Cambridge, MA. 245, 544

Chellapilla, K., Puri, S., and Simard, P. (2006). High Performance Convolutional Neural

Networks for Document Processing. In Guy Lorette, editor, Tenth International

Workshop on Frontiers in Handwriting Recognition, La Baule (France). Université de

Rennes 1, Suvisoft. http://www.suvisoft.com. 24, 27, 448

Chen, B., Ting, J.-A., Marlin, B. M., and de Freitas, N. (2010). Deep learning of invariant

spatio-temporal features from video. NIPS*2010 Deep Learning and Unsupervised

Feature Learning Workshop. 363

Chen, S. F. and Goodman, J. T. (1999). An empirical study of smoothing techniques for

language modeling. Computer, Speech and Language, 13(4), 359–393. 465, 476

Chen, T., Du, Z., Sun, N., Wang, J., Wu, C., Chen, Y., and Temam, O. (2014a). DianNao:

A small-footprint high-throughput accelerator for ubiquitous machine-learning. In

Proceedings of the 19th international conference on Architectural support for programming

languages and operating systems, pages 269–284. ACM. 454

Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C.,

and Zhang, Z. (2015). MXNet: A flexible and efficient machine learning library for

heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 . 25

Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N.,

et al. (2014b). DaDianNao: A machine-learning supercomputer. In Microarchitecture

(MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pages 609–622.

IEEE. 454

Chilimbi, T., Suzue, Y., Apacible, J., and Kalyanaraman, K. (2014). Project Adam:

Building an efficient and scalable deep learning training system. In 11th USENIX

Symposium on Operating Systems Design and Implementation (OSDI’14). 450

Cho, K., Raiko, T., and Ilin, A. (2010). Parallel tempering is efficient for learning restricted

Boltzmann machines. In IJCNN’2010 . 607, 618

Cho, K., Raiko, T., and Ilin, A. (2011). Enhanced gradient and adaptive learning rate for

training restricted Boltzmann machines. In ICML’2011 , pages 105–112. 681

Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y.

(2014a). Learning phrase representations using RNN encoder-decoder for statistical

machine translation. In Proceedings of the Empirical Methods in Natural Language

Processing (EMNLP 2014). 400, 477, 478

Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014b). On the properties

of neural machine translation: Encoder-decoder approaches. ArXiv e-prints,

abs/1409.1259. 415

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2014). The

loss surface of multilayer networks. 287, 288

Chorowski, J., Bahdanau, D., Cho, K., and Bengio, Y. (2014). End-to-end continuous

speech recognition using attention-based recurrent NN: First results. arXiv:1412.1602.

463

Christianson, B. (1992). Automatic hessians by reverse accumulation. IMA Journal of

Numerical Analysis, 12(2), 135–150. 225

Chrupala, G., Kadar, A., and Alishahi, A. (2015). Learning language through pictures.

arXiv 1506.03694. 415

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated

recurrent neural networks on sequence modeling. NIPS’2014 Deep Learning workshop,

arXiv 1412.3555. 415, 463

Chung, J., Gülçehre, Ç., Cho, K., and Bengio, Y. (2015a). Gated feedback recurrent

neural networks. In ICML’15 . 415

Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A., and Bengio, Y. (2015b). A

recurrent latent variable model for sequential data. In NIPS’2015 . 706

Ciresan, D., Meier, U., Masci, J., and Schmidhuber, J. (2012). Multi-column deep neural

network for traffic sign classification. Neural Networks, 32, 333–338. 23, 201

Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2010). Deep big

simple neural nets for handwritten digit recognition. Neural Computation, 1–14.

24, 27, 449

Coates, A. and Ng, A. Y. (2011). The importance of encoding versus training with sparse

coding and vector quantization. In ICML’2011 . 27, 256, 501

Coates, A., Lee, H., and Ng, A. Y. (2011). An analysis of single-layer networks in

unsupervised feature learning. In Proceedings of the Thirteenth International Conference

on Artificial Intelligence and Statistics (AISTATS 2011). 366, 367, 458

Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., and Ng, A. (2013). Deep

learning with COTS HPC systems. In S. Dasgupta and D. McAllester, editors, Proceedings

of the 30th International Conference on Machine Learning (ICML-13), volume 28 (3),

pages 1337–1345. JMLR Workshop and Conference Proceedings. 24, 27, 367, 450

Cohen, N., Sharir, O., and Shashua, A. (2015). On the expressive power of deep learning:

A tensor analysis. arXiv:1509.05009. 557

Collobert, R. (2004). Large Scale Machine Learning. Ph.D. thesis, Université de Paris VI,

LIP6. 197

Collobert, R. (2011). Deep learning for efficient discriminative parsing. In AISTATS’2011 .

101, 480

Collobert, R. and Weston, J. (2008a). A unified architecture for natural language processing:

Deep neural networks with multitask learning. In ICML’2008 . 474, 480

Collobert, R. and Weston, J. (2008b). A unified architecture for natural language

processing: Deep neural networks with multitask learning. In ICML’2008 . 538

Collobert, R., Bengio, S., and Bengio, Y. (2001). A parallel mixture of SVMs for very

large scale problems. Technical Report IDIAP-RR-01-12, IDIAP. 453

Collobert, R., Bengio, S., and Bengio, Y. (2002). Parallel mixture of SVMs for very large

scale problems. Neural Computation, 14(5), 1105–1114. 453

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P.

(2011a). Natural language processing (almost) from scratch. Journal of Machine

Learning Research, 12, 2493–2537. 332, 480

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011b).

Natural language processing (almost) from scratch. The Journal of Machine Learning

Research, 12, 2493–2537. 538, 539

Collobert, R., Kavukcuoglu, K., and Farabet, C. (2011c). Torch7: A Matlab-like environment

for machine learning. In BigLearn, NIPS Workshop. 25, 211, 449

Comon, P. (1994). Independent component analysis - a new concept? Signal Processing,

36, 287–314. 494

Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning,

273–297. 18, 141

Couprie, C., Farabet, C., Najman, L., and LeCun, Y. (2013). Indoor semantic segmentation

using depth information. In International Conference on Learning Representations

(ICLR2013). 23, 201

Courbariaux, M., Bengio, Y., and David, J.-P. (2015). Low precision arithmetic for deep

learning. In arXiv:1412.7024, ICLR’2015 Workshop. 455

Courville, A., Bergstra, J., and Bengio, Y. (2011). Unsupervised models of images by

spike-and-slab RBMs. In ICML’11 . 564, 688

Courville, A., Desjardins, G., Bergstra, J., and Bengio, Y. (2014). The spike-and-slab

RBM and extensions to discrete and sparse data distributions. Pattern Analysis and

Machine Intelligence, IEEE Transactions on, 36(9), 1874–1887. 690

Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd Edition.

Wiley-Interscience. 73

Cox, D. and Pinto, N. (2011). Beyond simple features: A large-scale feature search

approach to unconstrained face recognition. In Automatic Face & Gesture Recognition

and Workshops (FG 2011), 2011 IEEE International Conference on, pages 8–15. IEEE.

366

Cramér, H. (1946). Mathematical methods of statistics. Princeton University Press. 135,

298

Crick, F. H. C. and Mitchison, G. (1983). The function of dream sleep. Nature, 304, 111–114. 613

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics

of Control, Signals, and Systems, 2, 303–314. 198

Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition

with the mean-covariance restricted Boltzmann machine. In NIPS’2010. 23

Dahl, G. E., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained deep

neural networks for large vocabulary speech recognition. IEEE Transactions on Audio,

Speech, and Language Processing, 20(1), 33–42. 462

Dahl, G. E., Sainath, T. N., and Hinton, G. E. (2013). Improving deep neural networks

for LVCSR using rectified linear units and dropout. In ICASSP’2013 . 462

Dahl, G. E., Jaitly, N., and Salakhutdinov, R. (2014). Multi-task neural networks for

QSAR predictions. arXiv:1406.1231. 26

Dauphin, Y. and Bengio, Y. (2013). Stochastic ratio matching of RBMs for sparse

high-dimensional inputs. In NIPS26 . NIPS Foundation. 624

Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Large-scale learning of embeddings with

reconstruction sampling. In ICML’2011 . 474

Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014).

Identifying and attacking the saddle point problem in high-dimensional non-convex

optimization. In NIPS’2014 . 287, 288, 290

Davis, A., Rubinstein, M., Wadhwa, N., Mysore, G., Durand, F., and Freeman, W. T.

(2014). The visual microphone: Passive recovery of sound from video. ACM Transactions

on Graphics (Proc. SIGGRAPH), 33(4), 79:1–79:10. 455

Dayan, P. (1990). Reinforcement comparison. In Connectionist Models: Proceedings of

the 1990 Connectionist Summer School , San Mateo, CA. 699

Dayan, P. and Hinton, G. E. (1996). Varieties of Helmholtz machine. Neural Networks,

9(8), 1385–1403. 701

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). The Helmholtz machine.

Neural computation, 7(5), 889–904. 701

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato,

M., Senior, A., Tucker, P., Yang, K., Le, Q. V., and Ng, A. Y. (2012a). Large scale

distributed deep networks. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger,

editors, Advances in Neural Information Processing Systems 25 , pages 1223–1231.

Curran Associates, Inc. 25

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q., Mao, M., Ranzato, M.,

Senior, A., Tucker, P., Yang, K., and Ng, A. Y. (2012b). Large scale distributed deep

networks. In NIPS’2012 . 450

Dean, T. and Kanazawa, K. (1989). A model for reasoning about persistence and causation.

Computational Intelligence, 5(3), 142–150. 668

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990).

Indexing by latent semantic analysis. Journal of the American Society for Information

Science, 41(6), 391–407. 479, 485

Delalleau, O. and Bengio, Y. (2011). Shallow vs. deep sum-product networks. In NIPS.

19, 557

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A

Large-Scale Hierarchical Image Database. In CVPR09 . 21

Deng, J., Berg, A. C., Li, K., and Fei-Fei, L. (2010a). What does classifying more than

10,000 image categories tell us? In Proceedings of the 11th European Conference on

Computer Vision: Part V , ECCV’10, pages 71–84, Berlin, Heidelberg. Springer-Verlag.

Deng, L. and Yu, D. (2014). Deep learning – methods and applications. Foundations and

Trends in Signal Processing. 463

Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton, G. (2010b). Binary

coding of speech spectrograms using a deep auto-encoder. In Interspeech 2010, Makuhari,

Chiba, Japan. 23

Denil, M., Bazzani, L., Larochelle, H., and de Freitas, N. (2012). Learning where to attend

with deep architectures for image tracking. Neural Computation, (8), 2151–2184. 370

Denton, E., Chintala, S., Szlam, A., and Fergus, R. (2015). Deep generative image models

using a Laplacian pyramid of adversarial networks. NIPS. 709, 710, 726

Desjardins, G. and Bengio, Y. (2008). Empirical evaluation of convolutional RBMs for

vision. Technical Report 1327, Département d’Informatique et de Recherche Opérationnelle,

Université de Montréal. 690

Desjardins, G., Courville, A. C., Bengio, Y., Vincent, P., and Delalleau, O. (2010).

Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines. In

International Conference on Artificial Intelligence and Statistics, pages 145–152. 607,

618

Desjardins, G., Courville, A., and Bengio, Y. (2011). On tracking the partition function.

In NIPS’2011 . 635

Desjardins, G., Simonyan, K., Pascanu, R., et al. (2015). Natural neural networks. In

Advances in Neural Information Processing Systems, pages 2062–2070. 323

Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., and Makhoul, J. (2014). Fast

and robust neural network joint models for statistical machine translation. In Proc.

ACL’2014 . 476

Devroye, L. (2013). Non-Uniform Random Variate Generation. SpringerLink : Bücher.

Springer New York. 702

DiCarlo, J. J. (2013). Mechanisms underlying visual object recognition: Humans vs.

neurons vs. machines. NIPS Tutorial. 26, 369

Dinh, L., Krueger, D., and Bengio, Y. (2014). NICE: Non-linear independent components

estimation. arXiv:1410.8516. 496

Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko,

K., and Darrell, T. (2014). Long-term recurrent convolutional networks for visual

recognition and description. arXiv:1411.4389. 102

Donoho, D. L. and Grimes, C. (2003). Hessian eigenmaps: new locally linear embedding

techniques for high-dimensional data. Technical Report 2003-08, Dept. Statistics,

Stanford University. 164, 522

Dosovitskiy, A., Tobias Springenberg, J., and Brox, T. (2015). Learning to generate

chairs with convolutional neural networks. In Proceedings of the IEEE Conference on

Computer Vision and Pattern Recognition, pages 1538–1546. 703, 712, 713

Doya, K. (1993). Bifurcations of recurrent neural networks in gradient descent learning.

IEEE Transactions on Neural Networks, 1, 75–80. 405, 406

Dreyfus, S. E. (1962). The numerical solution of variational problems. Journal of

Mathematical Analysis and Applications, 5(1), 30–45. 226

Dreyfus, S. E. (1973). The computational solution of optimal control problems with time

lag. IEEE Transactions on Automatic Control , 18(4), 383–385. 226

Drucker, H. and LeCun, Y. (1992). Improving generalisation performance using double

back-propagation. IEEE Transactions on Neural Networks, 3(6), 991–997. 271

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online

learning and stochastic optimization. Journal of Machine Learning Research. 309

Dudik, M., Langford, J., and Li, L. (2011). Doubly robust policy evaluation and learning.

In Proceedings of the 28th International Conference on Machine learning, ICML ’11.

485

Dugas, C., Bengio, Y., Bélisle, F., and Nadeau, C. (2001). Incorporating second-order

functional knowledge for better option pricing. In T. Leen, T. Dietterich, and V. Tresp,

editors, Advances in Neural Information Processing Systems 13 (NIPS’00), pages

472–478. MIT Press. 68, 197

Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. (2015). Training generative neural net-

works via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906 .

711

El Hihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term

dependencies. In NIPS’1995 . 401, 410, 411

Elkahky, A. M., Song, Y., and He, X. (2015). A multi-view deep learning approach for

cross domain user modeling in recommendation systems. In Proceedings of the 24th

International Conference on World Wide Web, pages 278–288. 483

Elman, J. L. (1993). Learning and development in neural networks: The importance of

starting small. Cognition, 48, 781–799. 332

Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., and Vincent, P. (2009). The diﬃculty

of training deep architectures and the eﬀect of unsupervised pre-training. In Proceedings

of AISTATS’2009 . 201

Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., and Bengio, S. (2010).

Why does unsupervised pre-training help deep learning? J. Machine Learning Res.

532, 536, 537

Fahlman, S. E., Hinton, G. E., and Sejnowski, T. J. (1983). Massively parallel architectures

for AI: NETL, thistle, and Boltzmann machines. In Proceedings of the National

Conference on Artiﬁcial Intelligence AAAI-83 . 573, 659

Fang, H., Gupta, S., Iandola, F., Srivastava, R., Deng, L., Dollár, P., Gao, J., He, X.,

Mitchell, M., Platt, J. C., Zitnick, C. L., and Zweig, G. (2015). From captions to visual

concepts and back. arXiv:1411.4952. 102

Farabet, C., LeCun, Y., Kavukcuoglu, K., Culurciello, E., Martini, B., Akselrod, P., and

Talay, S. (2011). Large-scale FPGA-based convolutional networks. In R. Bekkerman,

M. Bilenko, and J. Langford, editors, Scaling up Machine Learning: Parallel and

Distributed Approaches. Cambridge University Press. 526

Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013a). Learning hierarchical

features for scene labeling. IEEE Transactions on Pattern Analysis and Machine

Intelligence. 23, 201

Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013b). Learning hierarchical

features for scene labeling. IEEE Transactions on Pattern Analysis and Machine

Intelligence, 35(8), 1915–1929. 362

Fei-Fei, L., Fergus, R., and Perona, P. (2006). One-shot learning of object categories.

IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611. 541

Finn, C., Tan, X. Y., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. (2015). Learning

visual feature spaces for robotic manipulation with deep spatial autoencoders. arXiv

preprint arXiv:1509.06113 . 25

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals

of Eugenics, 7, 179–188. 21, 105

Földiák, P. (1989). Adaptive network for optimal linear feature extraction. In International

Joint Conference on Neural Networks (IJCNN), volume 1, pages 401–405, Washington

1989. IEEE, New York. 497

Franzius, M., Sprekeler, H., and Wiskott, L. (2007). Slowness and sparseness lead to place,

head-direction, and spatial-view cells. 498

Franzius, M., Wilbert, N., and Wiskott, L. (2008). Invariant object recognition with slow

feature analysis. In Artiﬁcial Neural Networks-ICANN 2008 , pages 961–970. Springer.

499

Frasconi, P., Gori, M., and Sperduti, A. (1997). On the eﬃcient classiﬁcation of data

structures by neural networks. In Proc. Int. Joint Conf. on Artiﬁcial Intelligence. 404

Frasconi, P., Gori, M., and Sperduti, A. (1998). A general framework for adaptive

processing of data structures. IEEE Transactions on Neural Networks, 9(5), 768–786. 404

Freund, Y. and Schapire, R. E. (1996a). Experiments with a new boosting algorithm. In

Machine Learning: Proceedings of Thirteenth International Conference, pages 148–156,

USA. ACM. 258

Freund, Y. and Schapire, R. E. (1996b). Game theory, on-line prediction and boosting. In

Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages

325–332. 258

Frey, B. J. (1998). Graphical models for machine learning and digital communication.

MIT Press. 713, 714

Frey, B. J., Hinton, G. E., and Dayan, P. (1996). Does the wake-sleep algorithm learn good

density estimators? In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances

in Neural Information Processing Systems 8 (NIPS’95), pages 661–670. MIT Press,

Cambridge, MA. 656

Frobenius, G. (1908). Über Matrizen aus positiven Elementen, S. B. Preuss. Akad. Wiss.

Berlin, Germany. 601

Fukushima, K. (1975). Cognitron: A self-organizing multilayered neural network. Biological

Cybernetics, 20, 121–136. 16, 227, 531

Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a

mechanism of pattern recognition unaﬀected by shift in position. Biological Cybernetics,

36, 193–202. 16, 24, 27, 227, 370

Gal, Y. and Ghahramani, Z. (2015). Bayesian convolutional neural networks with Bernoulli

approximate variational inference. arXiv preprint arXiv:1506.02158 . 264

Gallinari, P., LeCun, Y., Thiria, S., and Fogelman-Soulie, F. (1987). Memoires associatives

distribuees. In Proceedings of COGNITIVA 87 , Paris, La Villette. 518

Garcia-Duran, A., Bordes, A., Usunier, N., and Grandvalet, Y. (2015). Combining two

and three-way embeddings models for link prediction in knowledge bases. arXiv preprint

arXiv:1506.00999 . 487

Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., and Pallett, D. S. (1993).

DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1.

NASA STI/Recon Technical Report N , 93, 27403. 462

Garson, J. (1900). The metric system of identification of criminals, as used in Great

Britain and Ireland. The Journal of the Anthropological Institute of Great Britain and

Ireland, (2), 177–227. 21

Gers, F. A., Schmidhuber, J., and Cummins, F. (2000). Learning to forget: Continual

prediction with LSTM. Neural computation, 12(10), 2451–2471. 413, 415

Ghahramani, Z. and Hinton, G. E. (1996). The EM algorithm for mixtures of factor

analyzers. Technical Report CRG-TR-96-1, Dpt. of Comp. Sci., Univ. of Toronto. 492

Gillick, D., Brunk, C., Vinyals, O., and Subramanya, A. (2015). Multilingual language

processing from bytes. arXiv preprint arXiv:1512.00103 . 480

Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2015). Region-based convolutional

networks for accurate object detection and segmentation. 429

Giudice, M. D., Manera, V., and Keysers, C. (2009). Programmed to learn? the ontogeny

of mirror neurons. Dev. Sci., 12(2), 350–363. 661

Glorot, X. and Bengio, Y. (2010). Understanding the diﬃculty of training deep feedforward

neural networks. In AISTATS’2010 . 305

Glorot, X., Bordes, A., and Bengio, Y. (2011a). Deep sparse rectiﬁer neural networks. In

AISTATS’2011 . 16, 174, 197, 227

Glorot, X., Bordes, A., and Bengio, Y. (2011b). Domain adaptation for large-scale

sentiment classiﬁcation: A deep learning approach. In ICML’2011 . 510, 540

Goldberger, J., Roweis, S., Hinton, G. E., and Salakhutdinov, R. (2005). Neighbourhood

components analysis. In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural

Information Processing Systems 17 (NIPS’04). MIT Press. 115

Gong, S., McKenna, S., and Psarrou, A. (2000). Dynamic Vision: From Images to Face

Recognition. Imperial College Press. 165, 522

Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep

networks. In NIPS’2009 , pages 646–654. 255

Goodfellow, I., Koenig, N., Muja, M., Pantofaru, C., Sorokin, A., and Takayama, L. (2010).

Help me help you: Interfaces for personal robots. In Proc. of Human Robot Interaction

(HRI), Osaka, Japan. ACM Press. 100

Goodfellow, I. J. (2010). Technical report: Multidimensional, downsampled convolution

for autoencoders. Technical report, Université de Montréal. 360

Goodfellow, I. J. (2014). On distinguishability criteria for estimating generative models.

In International Conference on Learning Representations, Workshops Track . 628, 708,

709

Goodfellow, I. J., Courville, A., and Bengio, Y. (2011). Spike-and-slab sparse coding

for unsupervised feature discovery. In NIPS Workshop on Challenges in Learning

Hierarchical Models. 535, 541

Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a).

Maxout networks. In S. Dasgupta and D. McAllester, editors, ICML’13, pages 1319–

1327. 193, 264, 347, 368, 458

Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction deep

Boltzmann machines. In NIPS26 . NIPS Foundation. 100, 622, 678, 679, 680, 681, 683,

706

Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R.,

Bergstra, J., Bastien, F., and Bengio, Y. (2013c). Pylearn2: a machine learning research

library. arXiv preprint arXiv:1308.4214 . 25, 449

Goodfellow, I. J., Courville, A., and Bengio, Y. (2013d). Scaling up spike-and-slab models

for unsupervised feature learning. IEEE Transactions on Pattern Analysis and Machine

Intelligence, 35(8), 1902–1914. 500, 501, 502, 655, 690

Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2014a). An empirical

investigation of catastrophic forgetting in gradient-based neural networks. In ICLR’2014.

194

Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014b). Explaining and harnessing adver-

sarial examples. CoRR, abs/1412.6572. 268, 269, 271, 558, 559

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,

Courville, A., and Bengio, Y. (2014c). Generative adversarial networks. In NIPS’2014 .

547, 696, 708, 709, 712

Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., and Shet, V. (2014d). Multi-digit

number recognition from Street View imagery using deep convolutional neural networks.

In International Conference on Learning Representations. 25, 101, 201, 202, 203, 393,

425, 452

Goodfellow, I. J., Vinyals, O., and Saxe, A. M. (2015). Qualitatively characterizing neural

network optimization problems. In International Conference on Learning Representa-

tions. 287, 288, 289, 293

Goodman, J. (2001). Classes for fast maximum entropy training. In International

Conference on Acoustics, Speech and Signal Processing (ICASSP), Utah. 470

Gori, M. and Tesi, A. (1992). On the problem of local minima in backpropagation. IEEE

Transactions on Pattern Analysis and Machine Intelligence, PAMI-14(1), 76–86. 286

Gosset, W. S. (1908). The probable error of a mean. Biometrika, 6(1), 1–25. Originally

published under the pseudonym “Student”. 21

Gouws, S., Bengio, Y., and Corrado, G. (2014). Bilbowa: Fast bilingual distributed

representations without word alignments. Technical report, arXiv:1410.2455. 479, 542

Graf, H. P. and Jackel, L. D. (1989). Analog electronic neural network circuits. Circuits

and Devices Magazine, IEEE , 5(4), 44–49. 454

Graves, A. (2011). Practical variational inference for neural networks. In NIPS’2011 . 243

Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks. Studies

in Computational Intelligence. Springer. 377, 398, 414, 463

Graves, A. (2013). Generating sequences with recurrent neural networks. Technical report,

arXiv:1308.0850. 190, 413, 418, 423

Graves, A. and Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent

neural networks. In ICML’2014 . 413

Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with

bidirectional LSTM and other neural network architectures. Neural Networks, 18(5),

602–610. 398

Graves, A. and Schmidhuber, J. (2009). Oﬄine handwriting recognition with multidi-

mensional recurrent neural networks. In D. Koller, D. Schuurmans, Y. Bengio, and

L. Bottou, editors, NIPS’2008 , pages 545–552. 398

Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist temporal

classiﬁcation: Labelling unsegmented sequence data with recurrent neural networks. In

ICML’2006 , pages 369–376, Pittsburgh, USA. 463

Graves, A., Liwicki, M., Bunke, H., Schmidhuber, J., and Fernández, S. (2008). Uncon-

strained on-line handwriting recognition with recurrent neural networks. In J. Platt,

D. Koller, Y. Singer, and S. Roweis, editors, NIPS’2007, pages 577–584. 398

Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., and Schmidhuber, J.

(2009). A novel connectionist system for unconstrained handwriting recognition. Pattern

Analysis and Machine Intelligence, IEEE Transactions on, 31(5), 855–868. 413

Graves, A., Mohamed, A., and Hinton, G. (2013). Speech recognition with deep recurrent

neural networks. In ICASSP’2013 , pages 6645–6649. 398, 401, 413, 414, 463

Graves, A., Wayne, G., and Danihelka, I. (2014a). Neural Turing machines.

arXiv:1410.5401. 25

Graves, A., Wayne, G., and Danihelka, I. (2014b). Neural turing machines. arXiv preprint

arXiv:1410.5401 . 421

Grefenstette, E., Hermann, K. M., Suleyman, M., and Blunsom, P. (2015). Learning to

transduce with unbounded memory. In NIPS’2015. 421

Greﬀ, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., and Schmidhuber, J. (2015).

LSTM: a search space odyssey. arXiv preprint arXiv:1503.04069 . 415

Gregor, K. and LeCun, Y. (2010a). Emergence of complex-like cells in a temporal product

network with local receptive ﬁelds. Technical report, arXiv:1006.0448. 355

Gregor, K. and LeCun, Y. (2010b). Learning fast approximations of sparse coding. In

L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh International

Conference on Machine Learning (ICML-10). ACM. 658

Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. (2014). Deep

autoregressive networks. In International Conference on Machine Learning (ICML’2014).

701

Gregor, K., Danihelka, I., Graves, A., and Wierstra, D. (2015). DRAW: A recurrent neural

network for image generation. arXiv preprint arXiv:1502.04623 . 706

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A

kernel two-sample test. The Journal of Machine Learning Research, 13(1), 723–773. 711

Gülçehre, Ç. and Bengio, Y. (2013). Knowledge matters: Importance of prior information

for optimization. In International Conference on Learning Representations (ICLR’2013).

Guo, H. and Gelfand, S. B. (1992). Classiﬁcation trees with neural network feature

extraction. Neural Networks, IEEE Transactions on, 3(6), 923–933. 453

Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. (2015). Deep learning

with limited numerical precision. CoRR, abs/1502.02551. 455

Gutmann, M. and Hyvarinen, A. (2010). Noise-contrastive estimation: A new estima-

tion principle for unnormalized statistical models. In Proceedings of The Thirteenth

International Conference on Artiﬁcial Intelligence and Statistics (AISTATS’10). 625

Hadsell, R., Sermanet, P., Ben, J., Erkan, A., Han, J., Muller, U., and LeCun, Y.

(2007). Online learning for oﬀroad robots: Spatial label propagation to learn long-range

traversability. In Proceedings of Robotics: Science and Systems, Atlanta, GA, USA. 456

Hajnal, A., Maass, W., Pudlak, P., Szegedy, M., and Turan, G. (1993). Threshold circuits

of bounded depth. J. Comput. System. Sci., 46, 129–154. 199

Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings

of the 18th annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley,

California. ACM Press. 199

Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits.

Computational Complexity, 1, 113–129. 199

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The elements of statistical learning:

data mining, inference and prediction. Springer Series in Statistics. Springer Verlag.

146

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectiﬁers: Surpassing

human-level performance on ImageNet classiﬁcation. arXiv preprint arXiv:1502.01852 .

28, 193

Hebb, D. O. (1949). The Organization of Behavior. Wiley, New York. 14, 17, 661

Henaﬀ, M., Jarrett, K., Kavukcuoglu, K., and LeCun, Y. (2011). Unsupervised learning

of sparse features for scalable audio classiﬁcation. In ISMIR’11 . 526

Henderson, J. (2003). Inducing history representations for broad coverage statistical

parsing. In HLT-NAACL, pages 103–110. 480

Henderson, J. (2004). Discriminative training of a neural network statistical parser. In

Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics,

page 95. 480

Henniges, M., Puertas, G., Bornschein, J., Eggert, J., and Lücke, J. (2010). Binary sparse

coding. In Latent Variable Analysis and Signal Separation, pages 450–457. Springer.

645

Herault, J. and Ans, B. (1984). Circuits neuronaux à synapses modiﬁables: Décodage de

messages composites par apprentissage non supervisé. Comptes Rendus de l’Académie

des Sciences, 299(III-13), 525–528. 494

Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures. 310

Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V.,

Nguyen, P., Sainath, T., and Kingsbury, B. (2012a). Deep neural networks for acoustic

modeling in speech recognition. IEEE Signal Processing Magazine, 29(6), 82–97. 23, 463

Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network.

arXiv preprint arXiv:1503.02531 . 451

Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, 40,

185–234. 497

Hinton, G. E. (1990). Mapping part-whole hierarchies into connectionist networks. Artiﬁcial

Intelligence, 46(1), 47–75. 421

Hinton, G. E. (1999). Products of experts. In ICANN’1999 . 573

Hinton, G. E. (2000). Training products of experts by minimizing contrastive divergence.

Technical Report GCNU TR 2000-004, Gatsby Unit, University College London. 613,

682

Hinton, G. E. (2006). To recognize shapes, ﬁrst learn to generate images. Technical Report

UTML TR 2006-003, University of Toronto. 531, 599

Hinton, G. E. (2007a). How to do backpropagation in a brain. Invited talk at the

NIPS’2007 Deep Learning Workshop. 661

Hinton, G. E. (2007b). Learning multiple layers of representation. Trends in cognitive

sciences, 11(10), 428–434. 665

Hinton, G. E. (2010). A practical guide to training restricted Boltzmann machines.

Technical Report UTML TR 2010-003, Department of Computer Science, University of

Toronto. 613

Hinton, G. E. and Ghahramani, Z. (1997). Generative models for discovering sparse

distributed representations. Philosophical Transactions of the Royal Society of London.

147

Hinton, G. E. and McClelland, J. L. (1988). Learning representations by recirculation. In

NIPS’1987 , pages 358–366. 505

Hinton, G. E. and Roweis, S. (2003). Stochastic neighbor embedding. In NIPS’2002 . 522

Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with

neural networks. Science, 313(5786), 504–507. 512, 527, 531, 532, 537

Hinton, G. E. and Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines.

In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing,

volume 1, chapter 7, pages 282–317. MIT Press, Cambridge. 573, 659

Hinton, G. E. and Sejnowski, T. J. (1999). Unsupervised learning: foundations of neural

computation. MIT press. 544

Hinton, G. E. and Shallice, T. (1991). Lesioning an attractor network: investigations of

acquired dyslexia. Psychological review , 98(1), 74. 13

Hinton, G. E. and Zemel, R. S. (1994). Autoencoders, minimum description length, and

Helmholtz free energy. In NIPS’1993 . 505

Hinton, G. E., Sejnowski, T. J., and Ackley, D. H. (1984). Boltzmann machines: Constraint

satisfaction networks that learn. Technical Report TR-CMU-CS-84-119, Carnegie-Mellon

University, Dept. of Computer Science. 573, 659

Hinton, G. E., McClelland, J., and Rumelhart, D. (1986). Distributed representations.

In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing:

Explorations in the Microstructure of Cognition, volume 1, pages 77–109. MIT Press,

Cambridge. 17, 226, 529

Hinton, G. E., Revow, M., and Dayan, P. (1995a). Recognizing handwritten digits using

mixtures of linear models. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances

in Neural Information Processing Systems 7 (NIPS’94), pages 1015–1022. MIT Press,

Cambridge, MA. 492

Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995b). The wake-sleep algorithm

for unsupervised neural networks. Science, 268, 1158–1161. 507, 656

Hinton, G. E., Dayan, P., and Revow, M. (1997). Modelling the manifolds of images of

handwritten digits. IEEE Transactions on Neural Networks, 8, 65–74. 502

Hinton, G. E., Welling, M., Teh, Y. W., and Osindero, S. (2001). A new view of ICA. In

Proceedings of 3rd International Conference on Independent Component Analysis and

Blind Signal Separation (ICA’01), pages 746–751, San Diego, CA. 494

Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief

nets. Neural Computation, 18, 1527–1554. 14, 19, 27, 143, 531, 532, 665, 667

Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A.,

Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012b). Deep neural

networks for acoustic modeling in speech recognition: The shared views of four research

groups. IEEE Signal Process. Mag., 29(6), 82–97. 101

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012c).

Improving neural networks by preventing co-adaptation of feature detectors. Technical

report, arXiv:1207.0580. 240, 263, 267

Hinton, G. E., Vinyals, O., and Dean, J. (2014). Dark knowledge. Invited talk at the

BayLearn Bay Area Machine Learning Symposium. 451

Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma

thesis, T.U. München. 18, 405, 406

Hochreiter, S. and Schmidhuber, J. (1995). Simplifying neural nets by discovering ﬂat

minima. In Advances in Neural Information Processing Systems 7 , pages 529–536. MIT

Press. 244

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation,

9(8), 1735–1780. 18, 413, 414

Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2000).

Gradient ﬂow in recurrent nets: the diﬃculty of learning long-term dependencies. In

J. Kolen and S. Kremer, editors, Field Guide to Dynamical Recurrent Networks. IEEE

Press. 414

Holi, J. L. and Hwang, J.-N. (1993). Finite precision error analysis of neural network

hardware implementations. Computers, IEEE Transactions on, 42(3), 281–290. 454

Holt, J. L. and Baker, T. E. (1991). Back propagation simulations using limited preci-

sion calculations. In Neural Networks, 1991., IJCNN-91-Seattle International Joint

Conference on, volume 2, pages 121–126. IEEE. 454

Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are

universal approximators. Neural Networks, 2, 359–366. 198

Hornik, K., Stinchcombe, M., and White, H. (1990). Universal approximation of an

unknown mapping and its derivatives using multilayer feedforward networks. Neural

networks, 3(5), 551–560. 198

Hsu, F.-H. (2002). Behind Deep Blue: Building the Computer That Defeated the World

Chess Champion. Princeton University Press, Princeton, NJ, USA. 2

Huang, F. and Ogata, Y. (2002). Generalized pseudo-likelihood estimates for Markov

random fields on lattice. Annals of the Institute of Statistical Mathematics, 54(1), 1–18.

621

Huang, P.-S., He, X., Gao, J., Deng, L., Acero, A., and Heck, L. (2013). Learning deep

structured semantic models for web search using clickthrough data. In Proceedings of

the 22nd ACM International Conference on Information & Knowledge

Management, pages 2333–2338. ACM. 483

Hubel, D. and Wiesel, T. (1968). Receptive ﬁelds and functional architecture of monkey

striate cortex. Journal of Physiology (London), 195, 215–243. 367

Hubel, D. H. and Wiesel, T. N. (1959). Receptive ﬁelds of single neurons in the cat’s

striate cortex. Journal of Physiology, 148, 574–591. 367

Hubel, D. H. and Wiesel, T. N. (1962). Receptive ﬁelds, binocular interaction, and

functional architecture in the cat’s visual cortex. Journal of Physiology (London), 160,

106–154. 367

Huszar, F. (2015). How (not) to train your generative model: scheduled sampling, likelihood,

adversary? arXiv:1511.05101 . 705

Hutter, F., Hoos, H., and Leyton-Brown, K. (2011). Sequential model-based optimization

for general algorithm conﬁguration. In LION-5 . Extended version as UBC Tech report

TR-2010-10. 439

Hyotyniemi, H. (1996). Turing machines are recurrent neural networks. In STeP’96, pages

13–24. 382

Hyvärinen, A. (1999). Survey on independent component analysis. Neural Computing

Surveys, 2, 94–128. 494

Hyvärinen, A. (2005). Estimation of non-normalized statistical models using score matching.

Journal of Machine Learning Research, 6, 695–709. 516, 622

Hyvärinen, A. (2007a). Connections between score matching, contrastive divergence,

and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural

Networks, 18, 1529–1531. 623

Hyvärinen, A. (2007b). Some extensions of score matching. Computational Statistics and

Data Analysis, 51, 2499–2512. 624

Hyvärinen, A. and Hoyer, P. O. (1999). Emergence of topography and complex cell

properties from natural images using extensions of ICA. In NIPS, pages 827–833. 496

Hyvärinen, A. and Pajunen, P. (1999). Nonlinear independent component analysis:

Existence and uniqueness results. Neural Networks, 12(3), 429–439. 496

Hyvärinen, A., Karhunen, J., and Oja, E. (2001a). Independent Component Analysis.

Wiley-Interscience. 494

Hyvärinen, A., Hoyer, P. O., and Inki, M. O. (2001b). Topographic independent component

analysis. Neural Computation, 13(7), 1527–1558. 496

Hyvärinen, A., Hurri, J., and Hoyer, P. O. (2009). Natural Image Statistics: A probabilistic

approach to early computational vision. Springer-Verlag. 373

Iba, Y. (2001). Extended ensemble Monte Carlo. International Journal of Modern Physics,

C12, 623–656. 607

Inayoshi, H. and Kurita, T. (2005). Improved generalization by adding both auto-

association and hidden-layer noise to neural-network-based-classiﬁers. IEEE Workshop

on Machine Learning for Signal Processing, pages 141–146. 518

Ioﬀe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training

by reducing internal covariate shift. 100, 321, 324

Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation.

Neural networks, 1(4), 295–307. 309

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures

of local experts. Neural Computation, 3, 79–87. 189, 453

Jaeger, H. (2003). Adaptive nonlinear system identiﬁcation with echo state networks. In

Advances in Neural Information Processing Systems 15 . 407

Jaeger, H. (2007a). Discovering multiscale dynamical features with hierarchical echo state

networks. Technical report, Jacobs University. 401

Jaeger, H. (2007b). Echo state network. Scholarpedia, 2(9), 2330. 407

Jaeger, H. (2012). Long short-term memory in echo state networks: Details of a simulation

study. Technical report, Jacobs University Bremen. 408

Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and

saving energy in wireless communication. Science, 304(5667), 78–80. 27, 407

Jaeger, H., Lukosevicius, M., Popovici, D., and Siewert, U. (2007). Optimization and

applications of echo state networks with leaky-integrator neurons. Neural Networks,

20(3), 335–352. 411

Jain, V., Murray, J. F., Roth, F., Turaga, S., Zhigulin, V., Briggman, K. L., Helmstaedter,

M. N., Denk, W., and Seung, H. S. (2007). Supervised learning of image restoration

with convolutional networks. In Computer Vision, 2007. ICCV 2007. IEEE 11th

International Conference on, pages 1–8. IEEE. 362

Jaitly, N. and Hinton, G. (2011). Learning a better representation of speech soundwaves

using restricted Boltzmann machines. In Acoustics, Speech and Signal Processing

(ICASSP), 2011 IEEE International Conference on, pages 5884–5887. IEEE. 461

Jaitly, N. and Hinton, G. E. (2013). Vocal tract length perturbation (VTLP) improves

speech recognition. In ICML’2013 . 242

Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best

multi-stage architecture for object recognition? In ICCV’09 . 16, 24, 27, 174, 193, 227,

366, 367, 526

Jarzynski, C. (1997). Nonequilibrium equality for free energy diﬀerences. Phys. Rev. Lett.,

78, 2690–2693. 631, 633

Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University

Press. 53

Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2014). On using very large target

vocabulary for neural machine translation. arXiv:1412.2007. 477

Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters

from sparse data. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in

Practice. North-Holland, Amsterdam. 465, 476

Jia, Y. (2013). Caﬀe: An open source convolutional architecture for fast feature embedding.

http://caffe.berkeleyvision.org/. 25, 211

Jia, Y., Huang, C., and Darrell, T. (2012). Beyond spatial pyramids: Receptive ﬁeld

learning for pooled image features. In Computer Vision and Pattern Recognition

(CVPR), 2012 IEEE Conference on, pages 3370–3377. IEEE. 348

Jim, K.-C., Giles, C. L., and Horne, B. G. (1996). An analysis of noise in recurrent neural

networks: convergence and generalization. IEEE Transactions on Neural Networks,

7(6), 1424–1438. 243

Jordan, M. I. (1998). Learning in Graphical Models. Kluwer, Dordrecht, Netherlands. 18

Joulin, A. and Mikolov, T. (2015). Inferring algorithmic patterns with stack-augmented

recurrent nets. arXiv preprint arXiv:1503.01007 . 421

Jozefowicz, R., Zaremba, W., and Sutskever, I. (2015a). An empirical evaluation of

recurrent network architectures. In ICML’2015 . 415

Jozefowicz, R., Zaremba, W., and Sutskever, I. (2015b). An empirical exploration of

recurrent network architectures. In Proceedings of The 32nd International Conference

on Machine Learning, pages 2342–2350. 308, 415

Judd, J. S. (1989). Neural Network Design and the Complexity of Learning. MIT press.

295

Jutten, C. and Herault, J. (1991). Blind separation of sources, part I: an adaptive

algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10. 494

Kahou, S. E., Pal, C., Bouthillier, X., Froumenty, P., Gülçehre, Ç., Memisevic, R., Vincent,

P., Courville, A., Bengio, Y., Ferrari, R. C., Mirza, M., Jean, S., Carrier, P.-L., Dauphin,

Y., Boulanger-Lewandowski, N., Aggarwal, A., Zumer, J., Lamblin, P., Raymond, J.-P.,

Desjardins, G., Pascanu, R., Warde-Farley, D., Torabi, A., Sharma, A., Bengio, E., Côté,

M., Konda, K. R., and Wu, Z. (2013). Combining modality speciﬁc deep neural networks

for emotion recognition in video. In Proceedings of the 15th ACM on International

Conference on Multimodal Interaction. 201

Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In

EMNLP’2013 . 477

Kalchbrenner, N., Danihelka, I., and Graves, A. (2015). Grid long short-term memory.

arXiv preprint arXiv:1507.01526 . 398

Kamyshanska, H. and Memisevic, R. (2015). The potential energy of an autoencoder.

IEEE Transactions on Pattern Analysis and Machine Intelligence. 518

Karpathy, A. and Li, F.-F. (2015). Deep visual-semantic alignments for generating image

descriptions. In CVPR’2015 . arXiv:1412.2306. 102

Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014).

Large-scale video classiﬁcation with convolutional neural networks. In CVPR. 21

Karush, W. (1939). Minima of Functions of Several Variables with Inequalities as Side

Constraints. Master’s thesis, Dept. of Mathematics, Univ. of Chicago. 95

Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model

component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal

Processing, ASSP-35(3), 400–401. 465, 476

Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008). Fast inference in sparse coding

algorithms with applications to object recognition. Technical report, Computational and

Biological Learning Lab, Courant Institute, NYU. Tech Report CBLL-TR-2008-12-01.

526

Kavukcuoglu, K., Ranzato, M.-A., Fergus, R., and LeCun, Y. (2009). Learning invariant

features through topographic ﬁlter maps. In CVPR’2009. 526

Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y.

(2010). Learning convolutional feature hierarchies for visual recognition. In NIPS’2010 .

367, 526

Kelley, H. J. (1960). Gradient theory of optimal flight paths. ARS Journal, 30(10), 947–954. 226

Khan, F., Zhu, X., and Mutlu, B. (2011). How do humans teach: On curriculum learning

and teaching dimension. In Advances in Neural Information Processing Systems 24

(NIPS’11), pages 1449–1457. 332

Kim, S. K., McAfee, L. C., McMahon, P. L., and Olukotun, K. (2009). A highly scalable

restricted Boltzmann machine FPGA implementation. In Field Programmable Logic

and Applications, 2009. FPL 2009. International Conference on, pages 367–372. IEEE.

454

Kindermann, R. (1980). Markov Random Fields and Their Applications (Contemporary

Mathematics ; V. 1). American Mathematical Society. 569

Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv

preprint arXiv:1412.6980 . 311

Kingma, D. and LeCun, Y. (2010). Regularized estimation of image statistics by score

matching. In NIPS’2010 . 516, 625

Kingma, D., Rezende, D., Mohamed, S., and Welling, M. (2014). Semi-supervised learning

with deep generative models. In NIPS’2014 . 429

Kingma, D. P. (2013). Fast gradient-based inference with continuous latent variable

models in auxiliary form. Technical report, arxiv:1306.0733. 696, 704

Kingma, D. P. and Welling, M. (2014a). Auto-encoding variational Bayes. In Proceedings

of the International Conference on Learning Representations (ICLR). 696, 707

Kingma, D. P. and Welling, M. (2014b). Eﬃcient gradient-based inference through

transformations between Bayes nets and neural nets. Technical report, arxiv:1402.0480.

696

Kirkpatrick, S., Gelatt, C. D., Jr., and Vecchi, M. P. (1983). Optimization by simulated

annealing. Science, 220, 671–680. 331

Kiros, R., Salakhutdinov, R., and Zemel, R. (2014a). Multimodal neural language models.

In ICML’2014 . 102

Kiros, R., Salakhutdinov, R., and Zemel, R. (2014b). Unifying visual-semantic embeddings

with multimodal neural language models. arXiv:1411.2539 [cs.LG]. 102, 413

Klementiev, A., Titov, I., and Bhattarai, B. (2012). Inducing crosslingual distributed

representations of words. In Proceedings of COLING 2012 . 479, 542

Knowles-Barley, S., Jones, T. R., Morgan, J., Lee, D., Kasthuri, N., Lichtman, J. W., and

Pﬁster, H. (2014). Deep learning for the connectome. GPU Technology Conference. 26

Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and

Techniques. MIT Press. 586, 599, 650

Konig, Y., Bourlard, H., and Morgan, N. (1996). REMAP: Recursive estimation and

maximization of A posteriori probabilities – application to transition-based connectionist

speech recognition. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in

Neural Information Processing Systems 8 (NIPS’95). MIT Press, Cambridge, MA. 462

Koren, Y. (2009). The BellKor solution to the Netflix grand prize. 258, 482

Kotzias, D., Denil, M., de Freitas, N., and Smyth, P. (2015). From group to individual

labels using deep features. In ACM SIGKDD. 106

Koutnik, J., Greﬀ, K., Gomez, F., and Schmidhuber, J. (2014). A clockwork RNN. In

ICML’2014 . 411

Kočiský, T., Hermann, K. M., and Blunsom, P. (2014). Learning Bilingual Word Repre-

sentations by Marginalizing Alignments. In Proceedings of ACL. 479

Krause, O., Fischer, A., Glasmachers, T., and Igel, C. (2013). Approximation properties

of DBNs with binary hidden units and real-valued visible units. In ICML’2013 . 556

Krizhevsky, A. (2010). Convolutional deep belief networks on CIFAR-10. Technical report,

University of Toronto. Unpublished Manuscript:

http://www.cs.utoronto.ca/~kriz/conv-cifar10-aug2010.pdf. 449

Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny

images. Technical report, University of Toronto. 21, 564

Krizhevsky, A. and Hinton, G. E. (2011). Using very deep autoencoders for content-based

image retrieval. In ESANN . 528

Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classiﬁcation with deep

convolutional neural networks. In NIPS’2012 . 23, 24, 27, 100, 201, 374, 457, 461

Krueger, K. A. and Dayan, P. (2009). Flexible shaping: how learning in small steps helps.

Cognition, 110, 380–394. 332

Kuhn, H. W. and Tucker, A. W. (1951). Nonlinear programming. In Proceedings of the

Second Berkeley Symposium on Mathematical Statistics and Probability, pages 481–492,

Berkeley, Calif. University of California Press. 95

Kumar, A., Irsoy, O., Su, J., Bradbury, J., English, R., Pierce, B., Ondruska, P., Iyyer,

M., Gulrajani, I., and Socher, R. (2015). Ask me anything: Dynamic memory networks

for natural language processing. arXiv:1506.07285 . 421, 488

Kumar, M. P., Packer, B., and Koller, D. (2010). Self-paced learning for latent variable

models. In NIPS’2010 . 332

Lang, K. J. and Hinton, G. E. (1988). The development of the time-delay neural network

architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-Mellon

University. 370, 377, 410

Lang, K. J., Waibel, A. H., and Hinton, G. E. (1990). A time-delay neural network

architecture for isolated word recognition. Neural networks, 3(1), 23–43. 377

Langford, J. and Zhang, T. (2008). The epoch-greedy algorithm for contextual multi-armed

bandits. In NIPS’2008 , pages 1096–1103. 483

Lappalainen, H., Giannakopoulos, X., Honkela, A., and Karhunen, J. (2000). Nonlinear

independent component analysis using ensemble learning: Experiments and discussion.

In Proc. ICA. Citeseer. 496

Larochelle, H. and Bengio, Y. (2008). Classiﬁcation using discriminative restricted

Boltzmann machines. In ICML’2008 . 245, 255, 533, 694, 723

Larochelle, H. and Hinton, G. E. (2010). Learning to combine foveal glimpses with a

third-order Boltzmann machine. In Advances in Neural Information Processing Systems

23 , pages 1243–1251. 370

Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator.

In AISTATS’2011. 713, 716

Larochelle, H., Erhan, D., and Bengio, Y. (2008). Zero-data learning of new tasks. In

AAAI Conference on Artiﬁcial Intelligence. 542

Larochelle, H., Bengio, Y., Louradour, J., and Lamblin, P. (2009). Exploring strategies for

training deep neural networks. Journal of Machine Learning Research, 10, 1–40. 538

Lasserre, J. A., Bishop, C. M., and Minka, T. P. (2006). Principled hybrids of generative and

discriminative models. In Proceedings of the Computer Vision and Pattern Recognition

Conference (CVPR’06), pages 87–94, Washington, DC, USA. IEEE Computer Society.

245, 253

Le, Q., Ngiam, J., Chen, Z., hao Chia, D. J., Koh, P. W., and Ng, A. (2010). Tiled

convolutional neural networks. In J. Laﬀerty, C. K. I. Williams, J. Shawe-Taylor,

R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems

23 (NIPS’10), pages 1279–1287. 355

Le, Q., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. (2011). On optimization

methods for deep learning. In Proc. ICML’2011 . ACM. 318

Le, Q., Ranzato, M., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J., and Ng,

A. (2012). Building high-level features using large scale unsupervised learning. In

ICML’2012 . 24, 27

Le Roux, N. and Bengio, Y. (2008). Representational power of restricted Boltzmann

machines and deep belief networks. Neural Computation, 20(6), 1631–1649. 556, 660

Le Roux, N. and Bengio, Y. (2010). Deep belief networks are compact universal approxi-

mators. Neural Computation, 22(8), 2192–2207. 556

LeCun, Y. (1985). Une procédure d’apprentissage pour Réseau à seuil assymétrique. In

Cognitiva 85: A la Frontière de l’Intelligence Artiﬁcielle, des Sciences de la Connaissance

et des Neurosciences, pages 599–604, Paris 1985. CESTA, Paris. 226

LeCun, Y. (1986). Learning processes in an asymmetric threshold network. In F. Fogelman-

Soulié, E. Bienenstock, and G. Weisbuch, editors, Disordered Systems and Biological

Organization, pages 233–240. Springer-Verlag, Les Houches, France. 353

LeCun, Y. (1987). Modèles connexionistes de l’apprentissage. Ph.D. thesis, Université de

Paris VI. 18, 505, 518

LeCun, Y. (1989). Generalization and network design strategies. Technical Report

CRG-TR-89-4, University of Toronto. 333, 353

LeCun, Y., Jackel, L. D., Boser, B., Denker, J. S., Graf, H. P., Guyon, I., Henderson, D.,

Howard, R. E., and Hubbard, W. (1989). Handwritten digit recognition: Applications

of neural network chips and automatic learning. IEEE Communications Magazine,

27(11), 41–46. 371

LeCun, Y., Bottou, L., Orr, G. B., and Müller, K.-R. (1998a). Eﬃcient backprop. In

Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science LNCS 1524.

Springer Verlag. 312, 432

LeCun, Y., Bottou, L., Bengio, Y., and Haﬀner, P. (1998b). Gradient-based learning

applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. 16, 27, 461

LeCun, Y., Bottou, L., Bengio, Y., and Haﬀner, P. (1998c). Gradient based learning

applied to document recognition. Proc. IEEE. 18, 21, 374, 463

LeCun, Y., Kavukcuoglu, K., and Farabet, C. (2010). Convolutional networks and

applications in vision. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE

International Symposium on, pages 253–256. IEEE. 374

L’Ecuyer, P. (1994). Eﬃciency improvement and variance reduction. In Proceedings of

the 1994 Winter Simulation Conference, pages 122–132. 698

Lee, C.-Y., Xie, S., Gallagher, P., Zhang, Z., and Tu, Z. (2014). Deeply-supervised nets.

arXiv preprint arXiv:1409.5185 . 330

Lee, H., Battle, A., Raina, R., and Ng, A. (2007). Eﬃcient sparse coding algorithms.

In B. Schölkopf, J. Platt, and T. Hoﬀman, editors, Advances in Neural Information

Processing Systems 19 (NIPS’06), pages 801–808. MIT Press. 642

Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area

V2. In NIPS’07 . 255

Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). Convolutional deep belief

networks for scalable unsupervised learning of hierarchical representations. In L. Bottou

and M. Littman, editors, Proceedings of the Twenty-sixth International Conference on

Machine Learning (ICML’09). ACM, Montreal, Canada. 366, 691, 692

Lee, Y. J. and Grauman, K. (2011). Learning the easy things ﬁrst: self-paced visual

category discovery. In CVPR’2011 . 332

Leibniz, G. W. (1676). Memoir using the chain rule. (Cited in TMME 7:2&3 p 321-332,

2010). 225

Lenat, D. B. and Guha, R. V. (1989). Building large knowledge-based systems; representa-

tion and inference in the Cyc project. Addison-Wesley Longman Publishing Co., Inc.

Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. (1993). Multilayer feedforward

networks with a nonpolynomial activation function can approximate any function.

Neural Networks, 6, 861–867. 198, 199

Levenberg, K. (1944). A method for the solution of certain non-linear problems in least

squares. Quarterly Journal of Applied Mathematics, II(2), 164–168. 315

L’Hôpital, G. F. A. (1696). Analyse des inﬁniment petits, pour l’intelligence des lignes

courbes. Paris: L’Imprimerie Royale. 225

Li, Y., Swersky, K., and Zemel, R. S. (2015). Generative moment matching networks.

CoRR, abs/1502.02761. 711

Lin, T., Horne, B. G., Tino, P., and Giles, C. L. (1996). Learning long-term dependencies

is not as diﬃcult with NARX recurrent neural networks. IEEE Transactions on Neural

Networks, 7(6), 1329–1338. 410

Lin, Y., Liu, Z., Sun, M., Liu, Y., and Zhu, X. (2015). Learning entity and relation

embeddings for knowledge graph completion. In Proc. AAAI’15 . 487

Linde, N. (1992). The machine that changed the world, episode 3. Documentary miniseries.

Lindsey, C. and Lindblad, T. (1994). Review of hardware neural networks: a user’s

perspective. In Proc. Third Workshop on Neural Networks: From Biology to High

Energy Physics, pages 195–202, Isola d’Elba, Italy. 454

Linnainmaa, S. (1976). Taylor expansion of the accumulated rounding error. BIT

Numerical Mathematics, 16(2), 146–160. 226

LISA (2008). Deep learning tutorials: Restricted Boltzmann machines. Technical report,

LISA Lab, Université de Montréal. 592

Long, P. M. and Servedio, R. A. (2010). Restricted Boltzmann machines are hard to

approximately evaluate or simulate. In Proceedings of the 27th International Conference

on Machine Learning (ICML’10). 663

Lotter, W., Kreiman, G., and Cox, D. (2015). Unsupervised learning of visual structure

using predictive generative networks. arXiv preprint arXiv:1511.06380 . 547, 548

Lovelace, A. (1842). Notes upon L. F. Menabrea’s “Sketch of the Analytical Engine

invented by Charles Babbage”. 1

Lu, L., Zhang, X., Cho, K., and Renals, S. (2015). A study of the recurrent neural network

encoder-decoder for large vocabulary speech recognition. In Proc. Interspeech. 463

Lu, T., Pál, D., and Pál, M. (2010). Contextual multi-armed bandits. In International

Conference on Artiﬁcial Intelligence and Statistics, pages 485–492. 483

Luenberger, D. G. (1984). Linear and Nonlinear Programming. Addison Wesley. 319

Lukoševičius, M. and Jaeger, H. (2009). Reservoir computing approaches to recurrent

neural network training. Computer Science Review, 3(3), 127–149. 407

Luo, H., Shen, R., Niu, C., and Ullrich, C. (2011). Learning class-relevant features and

class-irrelevant features via a hybrid third-order RBM. In International Conference on

Artiﬁcial Intelligence and Statistics, pages 470–478. 694

Luo, H., Carrier, P.-L., Courville, A., and Bengio, Y. (2013). Texture modeling with

convolutional spike-and-slab RBMs and deep extensions. In AISTATS’2013 . 102

Lyu, S. (2009). Interpretation and generalization of score matching. In Proceedings of the

Twenty-ﬁfth Conference in Uncertainty in Artiﬁcial Intelligence (UAI’09). 624

Ma, J., Sheridan, R. P., Liaw, A., Dahl, G. E., and Svetnik, V. (2015). Deep neural nets

as a method for quantitative structure-activity relationships. J. Chemical Information

and Modeling. 533

Maas, A. L., Hannun, A. Y., and Ng, A. Y. (2013). Rectiﬁer nonlinearities improve neural

network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech, and

Language Processing. 193

Maass, W. (1992). Bounds for the computational power and learning complexity of analog

neural nets (extended abstract). In Proc. of the 25th ACM Symp. Theory of Computing,

pages 335–344. 199

Maass, W., Schnitger, G., and Sontag, E. D. (1994). A comparison of the computational

power of sigmoid and boolean threshold circuits. Theoretical Advances in Neural

Computation and Learning, pages 127–151. 199

Maass, W., Natschlaeger, T., and Markram, H. (2002). Real-time computing without

stable states: A new framework for neural computation based on perturbations. Neural

Computation, 14(11), 2531–2560. 407

MacKay, D. (2003). Information Theory, Inference and Learning Algorithms. Cambridge

University Press. 73

Maclaurin, D., Duvenaud, D., and Adams, R. P. (2015). Gradient-based hyperparameter

optimization through reversible learning. arXiv preprint arXiv:1502.03492 . 438

Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., and Yuille, A. L. (2015). Deep captioning

with multimodal recurrent neural networks. In ICLR’2015 . arXiv:1410.1090. 102

Marcotte, P. and Savard, G. (1992). Novel approaches to the discrimination problem.

Zeitschrift für Operations Research (Theory), 36, 517–545. 278

Marlin, B. and de Freitas, N. (2011). Asymptotic eﬃciency of deterministic estimators for

discrete energy-based models: Ratio matching and pseudolikelihood. In UAI’2011 . 622,

624

Marlin, B., Swersky, K., Chen, B., and de Freitas, N. (2010). Inductive principles for

restricted Boltzmann machine learning. In Proceedings of The Thirteenth International

Conference on Artiﬁcial Intelligence and Statistics (AISTATS’10), volume 9, pages

509–516. 618, 624

Marquardt, D. W. (1963). An algorithm for least-squares estimation of non-linear parameters.

Journal of the Society for Industrial and Applied Mathematics, 11(2), 431–441. 315

Marr, D. and Poggio, T. (1976). Cooperative computation of stereo disparity. Science,

194. 370

Martens, J. (2010). Deep learning via Hessian-free optimization. In L. Bottou and

M. Littman, editors, Proceedings of the Twenty-seventh International Conference on

Machine Learning (ICML-10), pages 735–742. ACM. 306

Martens, J. and Medabalimi, V. (2014). On the expressive eﬃciency of sum product

networks. arXiv:1411.7717 . 557

Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-free

optimization. In Proc. ICML’2011 . ACM. 416

Mase, S. (1995). Consistency of the maximum pseudo-likelihood estimator of continuous

state space Gibbsian processes. The Annals of Applied Probability, 5(3), 603–612. 621

McClelland, J., Rumelhart, D., and Hinton, G. (1995). The appeal of parallel distributed

processing. In Computation & intelligence, pages 305–341. American Association for

Artiﬁcial Intelligence. 17

McCulloch, W. S. and Pitts, W. (1943). A logical calculus of ideas immanent in nervous

activity. Bulletin of Mathematical Biophysics, 5, 115–133. 14, 15

Mead, C. and Ismail, M. (2012). Analog VLSI implementation of neural systems, volume 80.

Springer Science & Business Media. 454

Melchior, J., Fischer, A., and Wiskott, L. (2013). How to center binary deep Boltzmann

machines. arXiv preprint arXiv:1311.1354 . 681

Memisevic, R. and Hinton, G. E. (2007). Unsupervised learning of image transformations.

In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR’07).

694

Memisevic, R. and Hinton, G. E. (2010). Learning to represent spatial transformations

with factored higher-order Boltzmann machines. Neural Computation, 22(6), 1473–1492. 694

Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E.,

Muller, X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra,

J. (2011). Unsupervised and transfer learning challenge: a deep learning approach. In

JMLR W&CP: Proc. Unsupervised and Transfer Learning, volume 7. 201, 535, 541

Mesnil, G., Rifai, S., Dauphin, Y., Bengio, Y., and Vincent, P. (2012). Surﬁng on the

manifold. Learning Workshop, Snowbird. 719

Miikkulainen, R. and Dyer, M. G. (1991). Natural language processing with modular

PDP networks and distributed lexicon. Cognitive Science, 15, 343–399. 480

Mikolov, T. (2012). Statistical Language Models based on Neural Networks. Ph.D. thesis,

Brno University of Technology. 417

Mikolov, T., Deoras, A., Kombrink, S., Burget, L., and Cernocky, J. (2011a). Empirical

evaluation and combination of advanced language modeling techniques. In Proc. 12th an-

nual conference of the international speech communication association (INTERSPEECH

2011). 475

Mikolov, T., Deoras, A., Povey, D., Burget, L., and Cernocky, J. (2011b). Strategies for

training large scale neural network language models. In Proc. ASRU’2011. 332, 475

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Eﬃcient estimation of word rep-

resentations in vector space. In International Conference on Learning Representations:

Workshops Track. 539

Mikolov, T., Le, Q. V., and Sutskever, I. (2013b). Exploiting similarities among languages

for machine translation. Technical report, arXiv:1309.4168. 542

Minka, T. (2005). Divergence measures and message passing. Microsoft Research Cambridge

UK Tech Rep MSRTR2005173 , 72(TR-2005-173). 630

Minsky, M. L. and Papert, S. A. (1969). Perceptrons. MIT Press, Cambridge. 15

Mirza, M. and Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint

arXiv:1411.1784 . 709

Mishkin, D. and Matas, J. (2015). All you need is a good init. arXiv preprint

arXiv:1511.06422 . 307

Misra, J. and Saha, I. (2010). Artiﬁcial neural networks in hardware: A survey of two

decades of progress. Neurocomputing, 74(1), 239–255. 454

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York. 99

Miyato, T., Maeda, S., Koyama, M., Nakae, K., and Ishii, S. (2015). Distributional

smoothing with virtual adversarial training. In ICLR. Preprint: arXiv:1507.00677. 269

Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief

networks. In ICML’2014 . 699, 701

Mnih, A. and Hinton, G. E. (2007). Three new graphical models for statistical language

modelling. In Z. Ghahramani, editor, Proceedings of the Twenty-fourth International

Conference on Machine Learning (ICML’07), pages 641–648. ACM. 467

Mnih, A. and Hinton, G. E. (2009). A scalable hierarchical distributed language model.

In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural

Information Processing Systems 21 (NIPS’08), pages 1081–1088. 470

Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings eﬃciently with noise-

contrastive estimation. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and

K. Weinberger, editors, Advances in Neural Information Processing Systems 26 , pages

2265–2273. Curran Associates, Inc. 475, 627

Mnih, A. and Teh, Y. W. (2012). A fast and simple algorithm for training neural

probabilistic language models. In ICML’2012 , pages 1751–1758. 475

Mnih, V. and Hinton, G. (2010). Learning to detect roads in high-resolution aerial images.

In Proceedings of the 11th European Conference on Computer Vision (ECCV). 102

Mnih, V., Larochelle, H., and Hinton, G. (2011). Conditional restricted Boltzmann

machines for structured output prediction. In Proc. Conf. on Uncertainty in Artificial

Intelligence (UAI). 693

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., and Wierstra, D. (2013).

Playing Atari with deep reinforcement learning. Technical report, arXiv:1312.5602. 106

Mnih, V., Heess, N., Graves, A., and Kavukcuoglu, K. (2014). Recurrent models of visual

attention. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger,

editors, NIPS’2014 , pages 2204–2212. 699

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves,

A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A.,

Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015).

Human-level control through deep reinforcement learning. Nature, 518, 529–533. 25

Mobahi, H. and Fisher, III, J. W. (2015). A theoretical analysis of optimization by

Gaussian continuation. In AAAI’2015 . 331

Mobahi, H., Collobert, R., and Weston, J. (2009). Deep learning from temporal coherence

in video. In L. Bottou and M. Littman, editors, Proceedings of the 26th International

Conference on Machine Learning, pages 737–744, Montreal. Omnipress. 497

Mohamed, A., Dahl, G., and Hinton, G. (2012a). Acoustic modeling using deep belief

networks. IEEE Trans. on Audio, Speech and Language Processing, 20(1), 14–22. 462

Mohamed, A.-r., Dahl, G., and Hinton, G. (2009). Deep belief networks for phone

recognition. 462

Mohamed, A.-r., Sainath, T. N., Dahl, G., Ramabhadran, B., Hinton, G. E., and Picheny,

M. A. (2011). Deep belief networks using discriminative features for phone recognition. In

Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference

on, pages 5060–5063. IEEE. 462

Mohamed, A.-r., Hinton, G., and Penn, G. (2012b). Understanding how deep belief

networks perform acoustic modelling. In Acoustics, Speech and Signal Processing

(ICASSP), 2012 IEEE International Conference on, pages 4273–4276. IEEE. 462

Moller, M. F. (1993). A scaled conjugate gradient algorithm for fast supervised learning.

Neural Networks, 6, 525–533. 318

Montavon, G. and Müller, K.-R. (2012). Deep Boltzmann machines and the centering

trick. In G. Montavon, G. Orr, and K.-R. Müller, editors, Neural Networks: Tricks of

the Trade, volume 7700 of Lecture Notes in Computer Science, pages 621–637. Preprint:

http://arxiv.org/abs/1203.3783. 680

Montúfar, G. (2014). Universal approximation depth and errors of narrow belief networks

with discrete units. Neural Computation, 26. 556

Montúfar, G. and Ay, N. (2011). Reﬁnements of universal approximation results for

deep belief networks and restricted Boltzmann machines. Neural Computation, 23(5),

1306–1319. 556

Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear

regions of deep neural networks. In NIPS’2014 . 19, 200

Mor-Yosef, S., Samueloﬀ, A., Modan, B., Navot, D., and Schenker, J. G. (1990). Ranking

the risk factors for cesarean: logistic regression analysis of a nationwide study. Obstet

Gynecol, 75(6), 944–7. 3

Morin, F. and Bengio, Y. (2005). Hierarchical probabilistic neural network language

model. In AISTATS’2005 . 470, 472

Mozer, M. C. (1992). The induction of multiscale temporal structure. In J. M. S. Hanson

and R. Lippmann, editors, Advances in Neural Information Processing Systems 4

(NIPS’91), pages 275–282, San Mateo, CA. Morgan Kaufmann. 410, 411

Murphy, K. P. (2012). Machine Learning: a Probabilistic Perspective. MIT Press,

Cambridge, MA, USA. 62, 98, 146

Uria, B., Murray, I., and Larochelle, H. (2014). A deep and tractable density estimator. In

ICML’2014 . 190, 718

Nair, V. and Hinton, G. (2010). Rectiﬁed linear units improve restricted Boltzmann

machines. In ICML’2010 . 16, 174, 197

Nair, V. and Hinton, G. E. (2009). 3D object recognition with deep belief nets. In Y. Bengio,

D. Schuurmans, J. D. Laﬀerty, C. K. I. Williams, and A. Culotta, editors, Advances in

Neural Information Processing Systems 22 , pages 1339–1347. Curran Associates, Inc.

694

Narayanan, H. and Mitter, S. (2010). Sample complexity of testing the manifold hypothesis.

In NIPS’2010 . 164

Naumann, U. (2008). Optimal Jacobian accumulation is NP-complete. Mathematical

Programming, 112(2), 427–441. 222

Navigli, R. and Velardi, P. (2005). Structural semantic interconnections: a knowledge-

based approach to word sense disambiguation. IEEE Trans. Pattern Analysis and

Machine Intelligence, 27(7), 1075–1086. 487

Neal, R. and Hinton, G. (1999). A view of the EM algorithm that justiﬁes incremental,

sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models. MIT

Press, Cambridge, MA. 639

Neal, R. M. (1990). Learning stochastic feedforward networks. Technical report. 700

Neal, R. M. (1993). Probabilistic inference using Markov chain Monte-Carlo methods.

Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto. 687

Neal, R. M. (1994). Sampling from multimodal distributions using tempered transitions.

Technical Report 9421, Dept. of Statistics, University of Toronto. 607

Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics.

Springer. 265

Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2),

125–139. 631, 632, 633, 634

Neal, R. M. (2005). Estimating ratios of normalizing constants using linked importance

sampling. 634, 635

Nesterov, Y. (1983). A method of solving a convex programming problem with convergence

rate O(1/k²). Soviet Mathematics Doklady, 27, 372–376. 302

Nesterov, Y. (2004). Introductory lectures on convex optimization : a basic course. Applied

optimization. Kluwer Academic Publ., Boston, Dordrecht, London. 302

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Reading

digits in natural images with unsupervised feature learning. Deep Learning and

Unsupervised Feature Learning Workshop, NIPS. 21

Ney, H. and Kneser, R. (1993). Improved clustering techniques for class-based statistical

language modelling. In European Conference on Speech Communication and Technology

(Eurospeech), pages 973–976, Berlin. 466

Ng, A. (2015). Advice for applying machine learning.

https://see.stanford.edu/materials/aimlcs229/ML-advice.pdf. 424

Niesler, T. R., Whittaker, E. W. D., and Woodland, P. C. (1998). Comparison of part-of-

speech and automatically derived category-based language models for speech recognition.

In International Conference on Acoustics, Speech and Signal Processing (ICASSP),

pages 177–180. 466

Ning, F., Delhomme, D., LeCun, Y., Piano, F., Bottou, L., and Barbano, P. E. (2005).

Toward automatic phenotyping of developing embryos from videos. Image Processing,

IEEE Transactions on, 14(9), 1360–1371. 363

Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer. 92, 95

Norouzi, M. and Fleet, D. J. (2011). Minimal loss hashing for compact binary codes. In

ICML’2011 . 528

Nowlan, S. J. (1990). Competing experts: An experimental investigation of associative

mixture models. Technical Report CRG-TR-90-5, University of Toronto. 453

Nowlan, S. J. and Hinton, G. E. (1992). Simplifying neural networks by soft weight-sharing.

Neural Computation, 4(4), 473–493. 139

Olshausen, B. and Field, D. J. (2005). How close are we to understanding V1? Neural

Computation, 17, 1665–1699. 16

Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive ﬁeld properties

by learning a sparse code for natural images. Nature, 381, 607–609. 147, 255, 373, 499

Olshausen, B. A., Anderson, C. H., and Van Essen, D. C. (1993). A neurobiological

model of visual attention and invariant pattern recognition based on dynamic routing

of information. J. Neurosci., 13(11), 4700–4719. 453

Opper, M. and Archambeau, C. (2009). The variational Gaussian approximation revisited.

Neural computation, 21(3), 786–792. 696

Oquab, M., Bottou, L., Laptev, I., and Sivic, J. (2014). Learning and transferring mid-level

image representations using convolutional neural networks. In Computer Vision and

Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1717–1724. IEEE. 539

Osindero, S. and Hinton, G. E. (2008). Modeling image patches with a directed hierarchy

of Markov random ﬁelds. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors,

Advances in Neural Information Processing Systems 20 (NIPS’07), pages 1121–1128,

Cambridge, MA. MIT Press. 637

Ovid and Martin, C. (2004). Metamorphoses. W.W. Norton. 1

Paccanaro, A. and Hinton, G. E. (2000). Extracting distributed representations of concepts

and relations from positive and negative propositions. In International Joint Conference

on Neural Networks (IJCNN), Como, Italy. IEEE, New York. 487

Paine, T. L., Khorrami, P., Han, W., and Huang, T. S. (2014). An analysis of unsupervised

pre-training in light of recent advances. arXiv preprint arXiv:1412.6597 . 535

Palatucci, M., Pomerleau, D., Hinton, G. E., and Mitchell, T. M. (2009). Zero-shot

learning with semantic output codes. In Y. Bengio, D. Schuurmans, J. D. Laﬀerty,

C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing

Systems 22 , pages 1410–1418. Curran Associates, Inc. 542

Parker, D. B. (1985). Learning-logic. Technical Report TR-47, Center for Comp. Research

in Economics and Management Sci., MIT. 226

Pascanu, R., Mikolov, T., and Bengio, Y. (2013a). On the diﬃculty of training recurrent

neural networks. In ICML’2013 . 291, 405, 407, 411, 417, 419

Pascanu, R., Montufar, G., and Bengio, Y. (2013b). On the number of inference regions

of deep feed forward networks with piece-wise linear activations. Technical report, U.

Montreal, arXiv:1312.6098. 199

Pascanu, R., Gülçehre, Ç., Cho, K., and Bengio, Y. (2014a). How to construct deep

recurrent neural networks. In ICLR’2014 . 19, 200, 265, 401, 402, 403, 413, 463

Pascanu, R., Montufar, G., and Bengio, Y. (2014b). On the number of inference regions

of deep feed forward networks with piece-wise linear activations. In ICLR’2014 . 553

Pati, Y., Rezaiifar, R., and Krishnaprasad, P. (1993). Orthogonal matching pursuit:

Recursive function approximation with applications to wavelet decomposition. In

Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers,

pages 40–44. 255

Pearl, J. (1985). Bayesian networks: A model of self-activated memory for evidential

reasoning. In Proceedings of the 7th Conference of the Cognitive Science Society,

University of California, Irvine, pages 329–334. 566

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible

Inference. Morgan Kaufmann. 54

Perron, O. (1907). Zur Theorie der Matrices. Mathematische Annalen, 64(2), 248–263. 601

Petersen, K. B. and Pedersen, M. S. (2006). The matrix cookbook. Version 20051003. 31

Peterson, G. B. (2004). A day of great illumination: B. F. Skinner’s discovery of shaping.

Journal of the Experimental Analysis of Behavior , 82(3), 317–328. 332

Pham, D.-T., Garat, P., and Jutten, C. (1992). Separation of a mixture of independent

sources through a maximum likelihood approach. In EUSIPCO, pages 771–774. 494

Pham, P.-H., Jelaca, D., Farabet, C., Martini, B., LeCun, Y., and Culurciello, E. (2012).

NeuFlow: dataﬂow vision processing system-on-a-chip. In Circuits and Systems

(MWSCAS), 2012 IEEE 55th International Midwest Symposium on, pages 1044–1047. IEEE.

454

Pinheiro, P. H. O. and Collobert, R. (2014). Recurrent convolutional neural networks for

scene labeling. In ICML’2014 . 362

Pinheiro, P. H. O. and Collobert, R. (2015). From image-level to pixel-level labeling with

convolutional networks. In Conference on Computer Vision and Pattern Recognition

(CVPR). 362

Pinto, N., Cox, D. D., and DiCarlo, J. J. (2008). Why is real-world visual object recognition

hard? PLoS Comput Biol, 4. 459

Pinto, N., Stone, Z., Zickler, T., and Cox, D. (2011). Scaling up biologically-inspired

computer vision: A case study in unconstrained face recognition on facebook. In

Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer

Society Conference on, pages 35–42. IEEE. 366

Pollack, J. B. (1990). Recursive distributed representations. Artiﬁcial Intelligence, 46(1),

77–105. 404

Polyak, B. and Juditsky, A. (1992). Acceleration of stochastic approximation by averaging.

SIAM J. Control and Optimization, 30(4), 838–855. 325

Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods.

USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17. 298

Poole, B., Sohl-Dickstein, J., and Ganguli, S. (2014). Analyzing noise in autoencoders

and deep networks. CoRR, abs/1406.1831. 242

Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In

Proceedings of the Twenty-seventh Conference in Uncertainty in Artiﬁcial Intelligence

(UAI), Barcelona, Spain. 557

Presley, R. K. and Haggard, R. L. (1994). A ﬁxed point implementation of the

backpropagation learning algorithm. In Southeastcon’94. Creative Technology Transfer-A Global

Aﬀair., Proceedings of the 1994 IEEE , pages 136–138. IEEE. 454

Price, R. (1958). A useful theorem for nonlinear devices having Gaussian inputs. IEEE

Transactions on Information Theory, 4(2), 69–72. 696

Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., and Fried, I. (2005). Invariant visual

representation by single neurons in the human brain. Nature, 435(7045), 1102–1107. 369

Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with

deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 .

555, 709, 710

Raiko, T., Yao, L., Cho, K., and Bengio, Y. (2014). Iterative neural autoregressive

distribution estimator (NADE-k). Technical report, arXiv:1406.1485. 682, 717

Raina, R., Madhavan, A., and Ng, A. Y. (2009). Large-scale deep unsupervised learning

using graphics processors. In L. Bottou and M. Littman, editors, Proceedings of the

Twenty-sixth International Conference on Machine Learning (ICML’09), pages 873–880,

New York, NY, USA. ACM. 27, 449

Ramsey, F. P. (1926). Truth and probability. In R. B. Braithwaite, editor, The Foundations

of Mathematics and other Logical Essays, chapter 7, pages 156–198. McMaster University

Archive for the History of Economic Thought. 56

Ranzato, M. and Hinton, G. E. (2010). Modeling pixel means and covariances using

factorized third-order Boltzmann machines. In CVPR’2010 , pages 2551–2558. 687

Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007a). Eﬃcient learning of sparse

representations with an energy-based model. In NIPS’2006. 14, 19, 510, 531, 533

Ranzato, M., Huang, F., Boureau, Y., and LeCun, Y. (2007b). Unsupervised learning of

invariant feature hierarchies with applications to object recognition. In Proceedings of

the Computer Vision and Pattern Recognition Conference (CVPR’07). IEEE Press. 367

Ranzato, M., Boureau, Y., and LeCun, Y. (2008). Sparse feature learning for deep belief

networks. In NIPS’2007 . 510

Ranzato, M., Krizhevsky, A., and Hinton, G. E. (2010a). Factored 3-way restricted

Boltzmann machines for modeling natural images. In Proceedings of AISTATS 2010 .

685, 686

Ranzato, M., Mnih, V., and Hinton, G. (2010b). Generating more realistic images using

gated MRFs. In NIPS’2010 . 687, 688

Rao, C. (1945). Information and the accuracy attainable in the estimation of statistical

parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–89. 135, 298

Rasmus, A., Valpola, H., Honkala, M., Berglund, M., and Raiko, T. (2015). Semi-supervised

learning with ladder network. arXiv preprint arXiv:1507.02672 . 429, 533

Recht, B., Re, C., Wright, S., and Niu, F. (2011). Hogwild: A lock-free approach to

parallelizing stochastic gradient descent. In NIPS’2011 . 450

Reichert, D. P., Seriès, P., and Storkey, A. J. (2011). Neuronal adaptation for sampling-

based probabilistic inference in perceptual bistability. In Advances in Neural Information

Processing Systems, pages 2357–2365. 671

Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation

and approximate inference in deep generative models. In ICML’2014. Preprint:

arXiv:1401.4082. 696, 704

Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011a). Contractive

auto-encoders: Explicit invariance during feature extraction. In ICML’2011 . 524, 525,

526

Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X.

(2011b). Higher order contractive auto-encoder. In ECML PKDD. 524, 525

Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011c). The manifold

tangent classiﬁer. In NIPS’2011 . 271, 272

Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for

sampling contractive auto-encoders. In ICML’2012 . 719

Ringach, D. and Shapley, R. (2004). Reverse correlation in neurophysiology. Cognitive

Science, 28(2), 147–166. 371

Roberts, S. and Everson, R. (2001). Independent component analysis: principles and

practice. Cambridge University Press. 496

Robinson, A. J. and Fallside, F. (1991). A recurrent error propagation network speech

recognition system. Computer Speech and Language, 5(3), 259–274. 27, 462

Rockafellar, R. T. (1997). Convex Analysis. Princeton Landmarks in Mathematics. 93

Romero, A., Ballas, N., Ebrahimi Kahou, S., Chassang, A., Gatta, C., and Bengio, Y.

(2015). FitNets: Hints for thin deep nets. In ICLR’2015, arXiv:1412.6550. 328

Rosen, J. B. (1960). The gradient projection method for nonlinear programming. Part I.

Linear constraints. Journal of the Society for Industrial and Applied Mathematics,

8(1), pp. 181–217. 93

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and

organization in the brain. Psychological Review , 65, 386–408. 14, 15, 27

Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York. 15, 27

Roweis, S. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear

embedding. Science, 290(5500). 164, 521

Roweis, S., Saul, L., and Hinton, G. (2002). Global coordination of local linear models. In

T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information

Processing Systems 14 (NIPS’01), Cambridge, MA. MIT Press. 492

Rubin, D. B. et al. (1984). Bayesianly justiﬁable and relevant frequency calculations for

the applied statistician. The Annals of Statistics, 12(4), 1151–1172. 724

Rumelhart, D., Hinton, G., and Williams, R. (1986a). Learning representations by

back-propagating errors. Nature, 323, 533–536. 14, 18, 23, 204, 226, 376, 479, 485

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986b). Learning internal

representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel

Distributed Processing, volume 1, chapter 8, pages 318–362. MIT Press, Cambridge. 21,

27, 226

Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986c). Parallel

Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press,

Cambridge. 17

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,

A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2014a). ImageNet Large

Scale Visual Recognition Challenge. 21

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,

A., Khosla, A., Bernstein, M., et al. (2014b). Imagenet large scale visual recognition

challenge. arXiv preprint arXiv:1409.0575 . 28

Russell, S. J. and Norvig, P. (2003). Artiﬁcial Intelligence: a Modern Approach. Prentice

Hall. 86

Rust, N., Schwartz, O., Movshon, J. A., and Simoncelli, E. (2005). Spatiotemporal

elements of macaque V1 receptive ﬁelds. Neuron, 46(6), 945–956. 370

Sainath, T., Mohamed, A., Kingsbury, B., and Ramabhadran, B. (2013). Deep

convolutional neural networks for LVCSR. In ICASSP 2013. 463

Salakhutdinov, R. (2010). Learning in Markov random ﬁelds using tempered transitions. In

Y. Bengio, D. Schuurmans, C. Williams, J. Laﬀerty, and A. Culotta, editors, Advances

in Neural Information Processing Systems 22 (NIPS’09). 607

Salakhutdinov, R. and Hinton, G. (2009a). Deep Boltzmann machines. In Proceedings of

the International Conference on Artiﬁcial Intelligence and Statistics, volume 5, pages

448–455. 24, 27, 532, 668, 672, 675, 679

Salakhutdinov, R. and Hinton, G. (2009b). Deep Boltzmann machines. In Proceedings of

the Twelfth International Conference on Artiﬁcial Intelligence and Statistics (AISTATS

2009), volume 8. 674

Salakhutdinov, R. and Hinton, G. (2009c). Semantic hashing. In International Journal of

Approximate Reasoning. 528

Salakhutdinov, R. and Hinton, G. E. (2007a). Learning a nonlinear embedding by

preserving class neighbourhood structure. In Proceedings of the Eleventh International

Conference on Artiﬁcial Intelligence and Statistics (AISTATS’07), San Juan, Puerto

Rico. Omnipress. 530

Salakhutdinov, R. and Hinton, G. E. (2007b). Semantic hashing. In SIGIR’2007 . 528

Salakhutdinov, R. and Hinton, G. E. (2008). Using deep belief nets to learn covariance

kernels for Gaussian processes. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors,

Advances in Neural Information Processing Systems 20 (NIPS’07), pages 1249–1256,

Cambridge, MA. MIT Press. 245

Salakhutdinov, R. and Larochelle, H. (2010). Eﬃcient learning of deep Boltzmann machines.

In Proceedings of the Thirteenth International Conference on Artiﬁcial Intelligence and

Statistics (AISTATS 2010), JMLR W&CP, volume 9, pages 693–700. 657

Salakhutdinov, R. and Mnih, A. (2008). Probabilistic matrix factorization. In NIPS’2008 .

482

Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief

networks. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Proceedings of

the Twenty-ﬁfth International Conference on Machine Learning (ICML’08), volume 25,

pages 872–879. ACM. 634, 668

Salakhutdinov, R., Mnih, A., and Hinton, G. (2007). Restricted Boltzmann machines for

collaborative ﬁltering. In ICML. 482

Sanger, T. D. (1994). Neural network learning control of robot manipulators using

gradually increasing task diﬃculty. IEEE Transactions on Robotics and Automation,

10(3). 332

Saul, L. K. and Jordan, M. I. (1996). Exploiting tractable substructures in intractable

networks. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural

Information Processing Systems 8 (NIPS’95). MIT Press, Cambridge, MA. 643

Saul, L. K., Jaakkola, T., and Jordan, M. I. (1996). Mean ﬁeld theory for sigmoid belief

networks. Journal of Artiﬁcial Intelligence Research, 4, 61–76. 27, 700

Savich, A. W., Moussa, M., and Areibi, S. (2007). The impact of arithmetic representation

on implementing MLP-BP on FPGAs: A study. Neural Networks, IEEE Transactions on,

18(1), 240–252. 454

Saxe, A. M., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., and Ng, A. (2011). On random

weights and unsupervised feature learning. In Proc. ICML’2011 . ACM. 366

Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear

dynamics of learning in deep linear neural networks. In ICLR. 287, 288, 305

Schaul, T., Antonoglou, I., and Silver, D. (2014). Unit tests for stochastic optimization.

In International Conference on Learning Representations. 312

Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of

history compression. Neural Computation, 4(2), 234–242. 401

Schmidhuber, J. (1996). Sequential neural text compression. IEEE Transactions on Neural

Networks, 7(1), 142–146. 480

Schmidhuber, J. (2012). Self-delimiting neural networks. arXiv preprint arXiv:1210.0118 .

393

Schölkopf, B. and Smola, A. J. (2002). Learning with kernels: Support vector machines,

regularization, optimization, and beyond . MIT press. 711

Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a

kernel eigenvalue problem. Neural Computation, 10, 1299–1319. 164, 521

Schölkopf, B., Burges, C. J. C., and Smola, A. J. (1999). Advances in Kernel Methods —

Support Vector Learning. MIT Press, Cambridge, MA. 18, 142

Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. (2012). On

causal and anticausal learning. In ICML’2012 , pages 1255–1262. 548

Schuster, M. (1999). On supervised learning from sequential data with applications for

speech recognition. 190

Schuster, M. and Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE

Transactions on Signal Processing, 45(11), 2673–2681. 398

Schwenk, H. (2007). Continuous space language models. Computer speech and language,

21, 492–518. 469

Schwenk, H. (2010). Continuous space language models for statistical machine translation.

The Prague Bulletin of Mathematical Linguistics, 93, 137–146. 476

Schwenk, H. (2014). Cleaned subset of WMT ’14 dataset. 21

Schwenk, H. and Bengio, Y. (1998). Training methods for adaptive boosting of neural

networks. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information

Processing Systems 10 (NIPS’97), pages 647–653. MIT Press. 258

Schwenk, H. and Gauvain, J.-L. (2002). Connectionist language modeling for large

vocabulary continuous speech recognition. In International Conference on Acoustics,

Speech and Signal Processing (ICASSP), pages 765–768, Orlando, Florida. 469

Schwenk, H., Costa-jussà, M. R., and Fonollosa, J. A. R. (2006). Continuous space

language models for the IWSLT 2006 task. In International Workshop on Spoken Language

Translation, pages 166–173. 476

Seide, F., Li, G., and Yu, D. (2011). Conversational speech transcription using context-

dependent deep neural networks. In Interspeech 2011 , pages 437–440. 23

Sejnowski, T. (1987). Higher-order Boltzmann machines. In AIP Conference Proceedings

151 on Neural Networks for Computing, pages 398–403. American Institute of Physics

Inc. 694

Series, P., Reichert, D. P., and Storkey, A. J. (2010). Hallucinations in Charles Bonnet

syndrome induced by homeostasis: a deep Boltzmann machine model. In Advances in

Neural Information Processing Systems, pages 2020–2028. 671

Sermanet, P., Chintala, S., and LeCun, Y. (2012). Convolutional neural networks applied

to house numbers digit classiﬁcation. CoRR, abs/1204.3968. 459

Sermanet, P., Kavukcuoglu, K., Chintala, S., and LeCun, Y. (2013). Pedestrian detection

with unsupervised multi-stage feature learning. In Proc. International Conference on

Computer Vision and Pattern Recognition (CVPR’13). IEEE. 23, 201

Shilov, G. (1977). Linear Algebra. Dover Books on Mathematics Series. Dover Publications.

Siegelmann, H. (1995). Computation beyond the Turing limit. Science, 268(5210),

545–548. 382

Siegelmann, H. and Sontag, E. (1991). Turing computability with neural nets. Applied

Mathematics Letters, 4(6), 77–80. 382

Siegelmann, H. T. and Sontag, E. D. (1995). On the computational power of neural nets.

Journal of Computer and Systems Sciences, 50(1), 132–150. 382, 406

Sietsma, J. and Dow, R. (1991). Creating artiﬁcial neural networks that generalize. Neural

Networks, 4(1), 67–79. 242

Simard, P. Y., Steinkraus, D., and Platt, J. C. (2003). Best practices for convolutional

neural networks. In ICDAR’2003. 374

Simard, P. and Graf, H. P. (1994). Backpropagation without multiplication. In Advances

in Neural Information Processing Systems, pages 232–239. 454

Simard, P., Victorri, B., LeCun, Y., and Denker, J. (1992a). Tangent prop - A formalism

for specifying selected invariances in an adaptive network. In NIPS’1991 . 270, 271, 272

Simard, P., Victorri, B., LeCun, Y., and Denker, J. (1992b). Tangent prop - A formalism

for specifying selected invariances in an adaptive network. In J. M. S. Hanson and

R. Lippmann, editors, Advances in Neural Information Processing Systems 4 (NIPS’91),

pages 895–903, San Mateo, CA. Morgan Kaufmann. 359

Simard, P. Y., LeCun, Y., and Denker, J. (1993). Eﬃcient pattern recognition using a

new transformation distance. In NIPS’92 . 270

Simard, P. Y., LeCun, Y. A., Denker, J. S., and Victorri, B. (1998). Transformation

invariance in pattern recognition — tangent distance and tangent propagation. Lecture

Notes in Computer Science, 1524. 270

Simons, D. J. and Levin, D. T. (1998). Failure to detect changes to people during a

real-world interaction. Psychonomic Bulletin & Review, 5(4), 644–649. 546

Simonyan, K. and Zisserman, A. (2015). Very deep convolutional networks for large-scale

image recognition. In ICLR. 328

Sjöberg, J. and Ljung, L. (1995). Overtraining, regularization and searching for a minimum,

with application to neural networks. International Journal of Control, 62(6), 1391–1407.

250

Skinner, B. F. (1958). Reinforcement today. American Psychologist, 13, 94–99. 332

Smolensky, P. (1986). Information processing in dynamical systems: Foundations of

harmony theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed

Processing, volume 1, chapter 6, pages 194–281. MIT Press, Cambridge. 574, 590, 661

Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of

machine learning algorithms. In NIPS’2012 . 439

Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011a). Dynamic

pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS’2011.

404

Socher, R., Manning, C., and Ng, A. Y. (2011b). Parsing natural scenes and natural

language with recursive neural networks. In Proceedings of the Twenty-Eighth International

Conference on Machine Learning (ICML’2011). 404

Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D. (2011c).

Semi-supervised recursive autoencoders for predicting sentiment distributions. In

EMNLP’2011 . 404

Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., and Potts,

C. (2013a). Recursive deep models for semantic compositionality over a sentiment

treebank. In EMNLP’2013 . 404

Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. Y. (2013b). Zero-shot learning through

cross-modal transfer. In 27th Annual Conference on Neural Information Processing

Systems (NIPS 2013). 542

Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. (2015). Deep

unsupervised learning using nonequilibrium thermodynamics. 724

Sohn, K., Zhou, G., and Lee, H. (2013). Learning and selecting features jointly with

point-wise gated Boltzmann machines. In ICML’2013 . 694

Solomonoﬀ, R. J. (1989). A system for incremental learning based on algorithmic

probability. 332

Sontag, E. D. (1998). VC dimension of neural networks. NATO ASI Series F Computer

and Systems Sciences, 168, 69–96. 550, 554

Sontag, E. D. and Sussmann, H. J. (1989). Backpropagation can give rise to spurious local

minima even for networks without hidden layers. Complex Systems, 3, 91–106. 286

Sparkes, B. (1996). The Red and the Black: Studies in Greek Pottery. Routledge. 1

Spitkovsky, V. I., Alshawi, H., and Jurafsky, D. (2010). From baby steps to leapfrog: how

"less is more" in unsupervised dependency parsing. In HLT’10. 332

Squire, W. and Trapp, G. (1998). Using complex variables to estimate derivatives of real

functions. SIAM Rev., 40(1), 110–112. 442

Srebro, N. and Shraibman, A. (2005). Rank, trace-norm and max-norm. In Proceedings of

the 18th Annual Conference on Learning Theory, pages 545–560. Springer-Verlag. 240

Srivastava, N. (2013). Improving Neural Networks With Dropout. Master’s thesis, U.

Toronto. 538

Srivastava, N. and Salakhutdinov, R. (2012). Multimodal learning with deep Boltzmann

machines. In NIPS’2012 . 544

Srivastava, N., Salakhutdinov, R. R., and Hinton, G. E. (2013). Modeling documents with

deep Boltzmann machines. arXiv preprint arXiv:1309.6865. 668

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).

Dropout: A simple way to prevent neural networks from overﬁtting. Journal of Machine

Learning Research, 15, 1929–1958. 258, 265, 267, 679

Srivastava, R. K., Greﬀ, K., and Schmidhuber, J. (2015). Highway networks.

arXiv:1505.00387 . 330

Steinkraus, D., Simard, P. Y., and Buck, I. (2005). Using GPUs for machine learning

algorithms. 2013 12th International Conference on Document Analysis and Recognition,

0, 1115–1119. 448

Stoyanov, V., Ropson, A., and Eisner, J. (2011). Empirical risk minimization of graphical

model parameters given approximate inference, decoding, and model structure. In

Proceedings of the 14th International Conference on Artiﬁcial Intelligence and Statistics

(AISTATS), volume 15 of JMLR Workshop and Conference Proceedings, pages 725–733,

Fort Lauderdale. Supplementary material (4 pages) also available. 681, 706

Sukhbaatar, S., Szlam, A., Weston, J., and Fergus, R. (2015). Weakly supervised memory

networks. arXiv preprint arXiv:1503.08895 . 421

Supancic, J. and Ramanan, D. (2013). Self-paced learning for long-term tracking. In

CVPR’2013 . 332

Sussillo, D. (2014). Random walks: Training very deep nonlinear feed-forward networks

with smart initialization. CoRR, abs/1412.6558. 292, 306, 308, 406

Sutskever, I. (2012). Training Recurrent Neural Networks. Ph.D. thesis, Department of

computer science, University of Toronto. 409, 416

Sutskever, I. and Hinton, G. E. (2008). Deep narrow sigmoid belief networks are universal

approximators. Neural Computation, 20(11), 2629–2636. 700

Sutskever, I. and Tieleman, T. (2010). On the Convergence Properties of Contrastive

Divergence. In Y. W. Teh and M. Titterington, editors, Proc. of the International

Conference on Artiﬁcial Intelligence and Statistics (AISTATS), volume 9, pages 789–795.

617

Sutskever, I., Hinton, G., and Taylor, G. (2009). The recurrent temporal restricted

Boltzmann machine. In NIPS’2008 . 693

Sutskever, I., Martens, J., and Hinton, G. E. (2011). Generating text with recurrent

neural networks. In ICML’2011 , pages 1017–1024. 480

Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of

initialization and momentum in deep learning. In ICML. 302, 409, 416

Sutskever, I., Vinyals, O., and Le, Q. V. (2014a). Sequence to sequence learning with

neural networks. In NIPS’2014 . arXiv:1409.3215. 25, 101, 413, 414

Sutskever, I., Vinyals, O., and Le, Q. V. (2014b). Sequence to sequence learning with

neural networks. In NIPS’2014 . 400, 477, 478

Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press.

106

Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods

for reinforcement learning with function approximation. In NIPS’1999, pages 1057–1063.

MIT Press. 699

Swersky, K., Ranzato, M., Buchman, D., Marlin, B., and de Freitas, N. (2011). On

autoencoders and score matching for energy based models. In ICML’2011 . ACM. 516

Swersky, K., Snoek, J., and Adams, R. P. (2014). Freeze-thaw Bayesian optimization.

arXiv preprint arXiv:1406.3896 . 439

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke,

V., and Rabinovich, A. (2014a). Going deeper with convolutions. Technical report,

arXiv:1409.4842. 24, 27, 201, 258, 269, 330, 350

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and

Fergus, R. (2014b). Intriguing properties of neural networks. ICLR, abs/1312.6199.

268, 271

Szegedy, C., Vanhoucke, V., Ioﬀe, S., Shlens, J., and Wojna, Z. (2015). Rethinking the

Inception Architecture for Computer Vision. ArXiv e-prints. 245, 326

Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014). DeepFace: Closing the gap to

human-level performance in face veriﬁcation. In CVPR’2014 . 100

Tandy, D. W. (1997). Works and Days: A Translation and Commentary for the Social

Sciences. University of California Press. 1

Tang, Y. and Eliasmith, C. (2010). Deep networks for robust visual recognition. In

Proceedings of the 27th International Conference on Machine Learning, June 21-24,

2010, Haifa, Israel. 242

Tang, Y., Salakhutdinov, R., and Hinton, G. (2012). Deep mixtures of factor analysers.

arXiv preprint arXiv:1206.4635 . 492

Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines

for modeling motion style. In L. Bottou and M. Littman, editors, Proceedings of

the Twenty-sixth International Conference on Machine Learning (ICML’09), pages

1025–1032, Montreal, Quebec, Canada. ACM. 693

Taylor, G., Hinton, G. E., and Roweis, S. (2007). Modeling human motion using binary

latent variables. In B. Schölkopf, J. Platt, and T. Hoﬀman, editors, Advances in Neural

Information Processing Systems 19 (NIPS’06), pages 1345–1352. MIT Press, Cambridge,

MA. 693

Teh, Y., Welling, M., Osindero, S., and Hinton, G. E. (2003). Energy-based models

for sparse overcomplete representations. Journal of Machine Learning Research, 4,

1235–1260. 494

Tenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global geometric framework

for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323. 164, 521, 536

Theis, L., van den Oord, A., and Bethge, M. (2015). A note on the evaluation of generative

models. arXiv:1511.01844. 705, 727

Thompson, J., Jain, A., LeCun, Y., and Bregler, C. (2014). Joint training of a convolutional

network and a graphical model for human pose estimation. In NIPS’2014 . 363

Thrun, S. (1995). Learning to play the game of chess. In NIPS’1994. 271

Tibshirani, R. J. (1995). Regression shrinkage and selection via the lasso. Journal of the

Royal Statistical Society B, 58, 267–288. 237

Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to

the likelihood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors,

Proceedings of the Twenty-ﬁfth International Conference on Machine Learning (ICML’08),

pages 1064–1071. ACM. 617

Tieleman, T. and Hinton, G. (2009). Using fast weights to improve persistent contrastive

divergence. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-sixth

International Conference on Machine Learning (ICML’09), pages 1033–1040. ACM.

619

Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal components analysis.

Journal of the Royal Statistical Society B, 61(3), 611–622. 494

Torralba, A., Fergus, R., and Weiss, Y. (2008). Small codes and large databases for

recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference

(CVPR’08), pages 1–8. 528

Touretzky, D. S. and Hinton, G. E. (1985). Symbols among the neurons: Details of

a connectionist inference architecture. In Proceedings of the 9th International Joint

Conference on Artificial Intelligence - Volume 1, IJCAI’85, pages 238–243, San Francisco,

CA, USA. Morgan Kaufmann Publishers Inc. 17

Tu, K. and Honavar, V. (2011). On the utility of curricula in unsupervised learning of

probabilistic grammars. In IJCAI’2011. 332

Turaga, S. C., Murray, J. F., Jain, V., Roth, F., Helmstaedter, M., Briggman, K., Denk,

W., and Seung, H. S. (2010). Convolutional networks can learn to generate affinity

graphs for image segmentation. Neural Computation, 22(2), 511–538. 362

Turian, J., Ratinov, L., and Bengio, Y. (2010). Word representations: A simple and

general method for semi-supervised learning. In Proc. ACL’2010, pages 384–394. 538

Töscher, A., Jahrer, M., and Bell, R. M. (2009). The BigChaos solution to the Netflix

grand prize. 482

Uria, B., Murray, I., and Larochelle, H. (2013). RNADE: The real-valued neural

autoregressive density-estimator. In NIPS’2013. 717

van den Oord, A., Dieleman, S., and Schrauwen, B. (2013). Deep content-based music

recommendation. In NIPS’2013. 483

van der Maaten, L. and Hinton, G. E. (2008). Visualizing data using t-SNE. J. Machine

Learning Res., 9. 480, 522

Vanhoucke, V., Senior, A., and Mao, M. Z. (2011). Improving the speed of neural networks

on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop.

447, 455

Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer-

Verlag, Berlin. 114

Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York.

114

Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative

frequencies of events to their probabilities. Theory of Probability and Its Applications,

16, 264–280. 114

Vincent, P. (2011). A connection between score matching and denoising autoencoders.

Neural Computation, 23(7). 516, 518, 720

Vincent, P. and Bengio, Y. (2003). Manifold Parzen windows. In NIPS’2002. MIT Press.

523

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and

composing robust features with denoising autoencoders. In ICML 2008. 242, 518

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked

denoising autoencoders: Learning useful representations in a deep network with a local

denoising criterion. J. Machine Learning Res., 11. 518

Vincent, P., de Brébisson, A., and Bouthillier, X. (2015). Efficient exact gradient update

for training deep networks with very large sparse targets. In C. Cortes, N. D. Lawrence,

D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information

Processing Systems 28, pages 1108–1116. Curran Associates, Inc. 468

Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., and Hinton, G. (2014a).

Grammar as a foreign language. Technical report, arXiv:1412.7449. 413

Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2014b). Show and tell: a neural image

caption generator. arXiv 1411.4555. 413

Vinyals, O., Fortunato, M., and Jaitly, N. (2015a). Pointer networks. arXiv preprint

arXiv:1506.03134. 421

Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015b). Show and tell: a neural image

caption generator. In CVPR’2015. arXiv:1411.4555. 102

Viola, P. and Jones, M. (2001). Robust real-time object detection. In International

Journal of Computer Vision. 452

Visin, F., Kastner, K., Cho, K., Matteucci, M., Courville, A., and Bengio, Y. (2015).

ReNet: A recurrent neural network based alternative to convolutional networks. arXiv

preprint arXiv:1505.00393. 398

Von Melchner, L., Pallas, S. L., and Sur, M. (2000). Visual behaviour mediated by retinal

projections directed to the auditory pathway. Nature, 404(6780), 871–876. 16

Wager, S., Wang, S., and Liang, P. (2013). Dropout training as adaptive regularization.

In Advances in Neural Information Processing Systems 26, pages 351–359. 265

Waibel, A., Hanazawa, T., Hinton, G. E., Shikano, K., and Lang, K. (1989). Phoneme

recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech,

and Signal Processing, 37, 328–339. 377, 456, 462

Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of neural

networks using DropConnect. In ICML’2013. 266

Wang, S. and Manning, C. (2013). Fast dropout training. In ICML’2013. 266

Wang, Z., Zhang, J., Feng, J., and Chen, Z. (2014a). Knowledge graph and text jointly

embedding. In Proc. EMNLP’2014. 487

Wang, Z., Zhang, J., Feng, J., and Chen, Z. (2014b). Knowledge graph embedding by

translating on hyperplanes. In Proc. AAAI’2014. 487

Warde-Farley, D., Goodfellow, I. J., Courville, A., and Bengio, Y. (2014). An empirical

analysis of dropout in piecewise linear networks. In ICLR’2014. 262, 266, 267

Wawrzynek, J., Asanovic, K., Kingsbury, B., Johnson, D., Beck, J., and Morgan, N.

(1996). Spert-II: A vector microprocessor system. Computer, 29(3), 79–86. 454

Weaver, L. and Tao, N. (2001). The optimal reward baseline for gradient-based

reinforcement learning. In Proc. UAI’2001, pages 538–545. 699

Weinberger, K. Q. and Saul, L. K. (2004). Unsupervised learning of image manifolds by

semidefinite programming. In CVPR’2004, pages 988–995. 164, 522

Weiss, Y., Torralba, A., and Fergus, R. (2008). Spectral hashing. In NIPS, pages

1753–1760. 528

Welling, M., Zemel, R. S., and Hinton, G. E. (2002). Self supervised boosting. In Advances

in Neural Information Processing Systems, pages 665–672. 711

Welling, M., Hinton, G. E., and Osindero, S. (2003a). Learning sparse topographic

representations with products of Student-t distributions. In NIPS’2002. 687

Welling, M., Zemel, R., and Hinton, G. E. (2003b). Self-supervised boosting. In S. Becker,

S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing

Systems 15 (NIPS’02), pages 665–672. MIT Press. 628

Welling, M., Rosen-Zvi, M., and Hinton, G. E. (2005). Exponential family harmoniums

with an application to information retrieval. In L. Saul, Y. Weiss, and L. Bottou,

editors, Advances in Neural Information Processing Systems 17 (NIPS’04), volume 17,

Cambridge, MA. MIT Press. 684

Werbos, P. J. (1981). Applications of advances in nonlinear sensitivity analysis. In

Proceedings of the 10th IFIP Conference, 31.8 - 4.9, NYC, pages 762–770. 226

Weston, J., Bengio, S., and Usunier, N. (2010). Large scale image annotation: learning to

rank with joint word-image embeddings. Machine Learning, 81(1), 21–35. 404

Weston, J., Chopra, S., and Bordes, A. (2014). Memory networks. arXiv preprint

arXiv:1410.3916. 421, 488

Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. In 1960 IRE WESCON

Convention Record, volume 4, pages 96–104. IRE, New York. 15, 21, 24, 27

Wikipedia (2015). List of animals by number of neurons. Wikipedia, the free encyclopedia.

[Online; accessed 4-March-2015]. 24, 27

Williams, C. K. I. and Agakov, F. V. (2002). Products of Gaussians and Probabilistic

Minor Component Analysis. Neural Computation, 14(5), 1169–1182. 690

Williams, C. K. I. and Rasmussen, C. E. (1996). Gaussian processes for regression. In

D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information

Processing Systems 8 (NIPS’95), pages 514–520. MIT Press, Cambridge, MA. 142

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist

reinforcement learning. Machine Learning, 8, 229–256. 696, 697

Williams, R. J. and Zipser, D. (1989). A learning algorithm for continually running fully

recurrent neural networks. Neural Computation, 1, 270–280. 223

Wilson, D. R. and Martinez, T. R. (2003). The general inefficiency of batch training for

gradient descent learning. Neural Networks, 16(10), 1429–1451. 281

Wilson, J. R. (1984). Variance reduction techniques for digital simulation. American

Journal of Mathematical and Management Sciences, 4(3), 277–312. 698

Wiskott, L. and Sejnowski, T. J. (2002). Slow feature analysis: Unsupervised learning of

invariances. Neural Computation, 14(4), 715–770. 497

Wolpert, D. and MacReady, W. (1997). No free lunch theorems for optimization. IEEE

Transactions on Evolutionary Computation, 1, 67–82. 295

Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms. Neural

Computation, 8(7), 1341–1390. 116

Wu, R., Yan, S., Shan, Y., Dang, Q., and Sun, G. (2015). Deep image: Scaling up image

recognition. arXiv:1501.02876. 450

Wu, Z. (1997). Global continuation for distance geometry problems. SIAM Journal on

Optimization, 7, 814–836. 330

Xiong, H. Y., Barash, Y., and Frey, B. J. (2011). Bayesian prediction of tissue-regulated

splicing using RNA sequence and cellular context. Bioinformatics, 27(18), 2554–2562.

265

Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. S., and

Bengio, Y. (2015a). Show, attend and tell: Neural image caption generation with visual

attention. In ICML’2015. 102, 699

Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. S., and

Bengio, Y. (2015b). Show, attend and tell: Neural image caption generation with visual

attention. arXiv:1502.03044. 413

Yildiz, I. B., Jaeger, H., and Kiebel, S. J. (2012). Re-visiting the echo state property.

Neural Networks, 35, 1–9. 408

Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How transferable are features

in deep neural networks? In NIPS’2014. 328, 539

Younes, L. (1998). On the convergence of Markovian stochastic algorithms with rapidly

decreasing ergodicity rates. In Stochastics and Stochastics Models, pages 177–228. 617

Yu, D., Wang, S., and Deng, L. (2010). Sequential labeling using deep-structured

conditional random fields. IEEE Journal of Selected Topics in Signal Processing. 328

Zaremba, W. and Sutskever, I. (2014). Learning to execute. arXiv 1410.4615. 332

Zaremba, W. and Sutskever, I. (2015). Reinforcement learning neural turing machines.

arXiv:1505.00521. 422

Zaslavsky, T. (1975). Facing Up to Arrangements: Face-Count Formulas for Partitions

of Space by Hyperplanes. Number no. 154 in Memoirs of the American Mathematical

Society. American Mathematical Society. 553

Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks.

In ECCV’14. 6

Zeiler, M. D., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q., Nguyen, P., Senior,

A., Vanhoucke, V., Dean, J., and Hinton, G. E. (2013). On rectified linear units for

speech processing. In ICASSP 2013. 462

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. (2015). Object detectors

emerge in deep scene CNNs. ICLR’2015, arXiv:1412.6856. 554

Zhou, J. and Troyanskaya, O. G. (2014). Deep supervised and convolutional generative

stochastic network for protein secondary structure prediction. In ICML’2014. 723

Zhou, Y. and Chellappa, R. (1988). Computation of optical flow using a neural network.

In Neural Networks, 1988., IEEE International Conference on, pages 71–78. IEEE. 342

Zöhrer, M. and Pernkopf, F. (2014). General stochastic networks for classification. In

NIPS’2014. 723
