Bibliography
Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985). A learning algorithm for
Boltzmann machines. Cognitive Science, 9, 147–169. 478
Alain, G. and Bengio, Y. (2012). What regularized auto-encoders learn from the data gen-
erating distribution. Technical report, arXiv:1211.4246, Université de Montréal.
390
Alain, G. and Bengio, Y. (2013). What regularized auto-encoders learn from the data
generating distribution. In ICLR’2013. Also arXiv:1211.4246. 373, 390, 392
Alain, G., Bengio, Y., Yao, L., Thibodeau-Laufer, É., Yosinski, J., and Vincent, P.
(2015). GSNs: Generative stochastic networks. arXiv:1503.05571. 377
Amari, S. (1997). Neural learning in structured parameter spaces - natural Riemannian
gradient. In Advances in Neural Information Processing Systems, pages 127–133. MIT
Press. 158
Anderson, E. (1935). The Irises of the Gaspe Peninsula. Bulletin of the American Iris
Society, 59, 2–5. 19
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly
learning to align and translate. Technical report, arXiv:1409.0473. 22, 86, 325, 334,
335
Bahl, L. R., Brown, P., de Souza, P. V., and Mercer, R. L. (1987). Speech recognition
with continuous-parameter hidden Markov models. Computer, Speech and Language,
2, 219–234. 62, 288
Baldi, P. and Brunak, S. (1998). Bioinformatics, the Machine Learning Approach. MIT
Press. 290
Baldi, P. and Sadowski, P. J. (2013). Understanding dropout. In Advances in Neural
Information Processing Systems 26 , pages 2814–2822. 201
Baldi, P., Brunak, S., Frasconi, P., Soda, G., and Pollastri, G. (1999). Exploiting the
past and the future in protein secondary structure prediction. Bioinformatics, 15(11),
937–946. 258
Baldi, P., Sadowski, P., and Whiteson, D. (2014). Searching for exotic particles in high-
energy physics with deep learning. Nature communications, 5. 22
Barron, A. E. (1993). Universal approximation bounds for superpositions of a sigmoidal
function. IEEE Trans. on Information Theory, 39, 930–945. 170
Bartholomew, D. J. (1987). Latent variable models and factor analysis. Oxford University
Press. 378
Basilevsky, A. (1994). Statistical Factor Analysis and Related Methods: Theory and
Applications. Wiley. 378
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A.,
Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements.
Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop. 70
Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of
finite state Markov chains. Ann. Math. Stat., 37, 1559–1563. 286
Baxter, J. (1995). Learning internal representations. In Proceedings of the 8th Inter-
national Conference on Computational Learning Theory (COLT’95), pages 311–320,
Santa Cruz, California. ACM Press. 202
Becker, S. and Hinton, G. (1992). A self-organizing neural network that discovers surfaces
in random-dot stereograms. Nature, 355, 161–163. 425
Belkin, M. and Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for em-
bedding and clustering. In NIPS’01, Cambridge, MA. MIT Press. 411
Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and
data representation. Neural Computation, 15(6), 1373–1396. 139, 429
Bengio, S. and Bengio, Y. (2000a). Taking on the curse of dimensionality in joint distri-
butions using neural networks. IEEE Transactions on Neural Networks, special issue
on Data Mining and Knowledge Discovery, 11(3), 550–557. 263
Bengio, Y. (1991). Artificial Neural Networks and their Application to Sequence Recog-
nition. Ph.D. thesis, McGill University, (Computer Science), Montreal, Canada. 269,
290
Bengio, Y. (1993). A connectionist approach to speech recognition. International Journal
on Pattern Recognition and Artificial Intelligence, 7(4), 647–668. 288
Bengio, Y. (1999a). Markovian models for sequential data. Neural Computing Surveys,
2, 129–162. 288
Bengio, Y. (1999b). Markovian models for sequential data. Neural Computing Surveys,
2, 129–162. 290
Bengio, Y. (2002). New distributed probabilistic language models. Technical Report
1215, Dept. IRO, Université de Montréal. 326
Bengio, Y. (2009). Learning deep architectures for AI . Now Publishers. 133, 171
Bengio, Y. (2013). Estimating or propagating gradients through stochastic neurons.
Technical Report arXiv:1305.2982, Universite de Montreal. 360
Bengio, Y. and Bengio, S. (2000b). Modeling high-dimensional discrete data with multi-
layer neural networks. In NIPS’99 , pages 400–406. MIT Press. 263, 265, 266
Bengio, Y. and Delalleau, O. (2009). Justifying and generalizing contrastive divergence.
Neural Computation, 21(6), 1601–1621. 390, 449, 486
Bengio, Y. and Frasconi, P. (1996). Input/Output HMMs for sequence processing. IEEE
Transactions on Neural Networks, 7(5), 1231–1249. 290
Bengio, Y. and Grandvalet, Y. (2004). No unbiased estimator of the variance of k-fold
cross-validation. In NIPS’03, Cambridge, MA. MIT Press, Cambridge. 102
Bengio, Y. and LeCun, Y. (2007a). Scaling learning algorithms towards AI. In L. Bottou,
O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT
Press. 17, 172
Bengio, Y. and LeCun, Y. (2007b). Scaling learning algorithms towards AI. In Large
Scale Kernel Machines. 133
Bengio, Y. and Monperrus, M. (2005). Non-local manifold tangent learning. In NIPS’04 ,
pages 129–136. MIT Press. 137, 431, 432
Bengio, Y. and Senécal, J.-S. (2003). Quick training of probabilistic neural nets by
importance sampling. In Proceedings of AISTATS 2003 . 330
Bengio, Y. and Senécal, J.-S. (2008). Adaptive importance sampling to accelerate training
of a neural probabilistic language model. IEEE Trans. Neural Networks, 19(4), 713–
722. 330
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1991). Phonetically motivated
acoustic parameters for continuous speech recognition using artificial neural networks.
In Proceedings of EuroSpeech’91 . 23, 318
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992a). Global optimization of a
neural network-hidden Markov model hybrid. IEEE Transactions on Neural Networks,
3(2), 252–259. 288, 290
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992b). Neural network - gaussian
mixture hybrid for speech recognition or density estimation. In NIPS 4, pages 175–182.
Morgan Kaufmann. 318
Bengio, Y., Frasconi, P., and Simard, P. (1993). The problem of learning long-term depen-
dencies in recurrent networks. In IEEE International Conference on Neural Networks,
pages 1183–1195, San Francisco. IEEE Press. (invited paper). 213, 276
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with
gradient descent is difficult. IEEE Tr. Neural Nets. 213, 214, 267, 274, 276, 277
Bengio, Y., LeCun, Y., Nohl, C., and Burges, C. (1995). LeRec: A NN/HMM hybrid for
on-line handwriting recognition. Neural Computation, 7(6), 1289–1303. 290
Bengio, Y., Ducharme, R., and Vincent, P. (2001a). A neural probabilistic language
model. In NIPS’00, pages 932–938. MIT Press. 16
Bengio, Y., Ducharme, R., and Vincent, P. (2001b). A neural probabilistic language
model. In NIPS’2000, pages 932–938. 319, 321, 322, 332
Bengio, Y., Ducharme, R., and Vincent, P. (2001c). A neural probabilistic language
model. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, NIPS’2000 , pages
932–938. MIT Press. 433, 434
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003a). A neural probabilistic
language model. JMLR, 3, 1137–1155. 321, 325, 332
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003b). A neural probabilistic
language model. Journal of Machine Learning Research, 3, 1137–1155. 433, 434
Bengio, Y., Delalleau, O., and Le Roux, N. (2006a). The curse of highly variable functions
for local kernel machines. In NIPS’2005 . 133
Bengio, Y., Larochelle, H., and Vincent, P. (2006b). Non-local manifold Parzen windows.
In NIPS’2005 . MIT Press. 137, 431
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise
training of deep networks. In NIPS’2006 . 12, 16, 396, 397
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In
ICML’09 . 158
Bengio, Y., Léonard, N., and Courville, A. (2013a). Estimating or propagating gradients
through stochastic neurons for conditional computation. arXiv:1308.3432. 332, 360
Bengio, Y., Yao, L., Alain, G., and Vincent, P. (2013b). Generalized denoising auto-
encoders as generative models. In NIPS’2013. 392, 508, 512
Bengio, Y., Courville, A., and Vincent, P. (2013c). Representation learning: A review and
new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI),
35(8), 1798–1828. 423, 506
Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014a). Deep generative
stochastic networks trainable by backprop. Technical Report arXiv:1306.1091. 360
Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014b). Deep generative
stochastic networks trainable by backprop. In ICML’2014 . 360, 509, 510, 511, 513,
514
Bennett, C. (1976). Efficient estimation of free energy differences from Monte Carlo data.
Journal of Computational Physics, 22(2), 245–268. 465
Berger, A. L., Della Pietra, V. J., and Della Pietra, S. A. (1996). A maximum entropy
approach to natural language processing. Computational Linguistics, 22, 39–71. 333
Berglund, M. and Raiko, T. (2013). Stochastic gradient estimate variance in contrastive
divergence and persistent contrastive divergence. CoRR, abs/1312.6002. 451
Bergstra, J. (2011). Incorporating Complex Cells into Neural Networks for Pattern Clas-
sification. Ph.D. thesis, Université de Montréal. 373
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian,
J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression
compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy).
Oral Presentation. 70
Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24(3), 179–195.
453
Bishop, C. M. (1994). Mixture density networks. 154
Bishop, C. M. (1995). Regularization and complexity control in feed-forward networks.
In Proceedings International Conference on Artificial Neural Networks ICANN’95 , vol-
ume 1, page 141–148. 196
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. 84, 132, 134
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1989). Learnability
and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4), 929–965. 97
Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2012). Joint learning of words and
meaning representations for open-text semantic parsing. AISTATS’2012 . 261
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal
margin classifiers. In COLT ’92: Proceedings of the fifth annual workshop on Com-
putational learning theory, pages 144–152, New York, NY, USA. ACM. 16, 123, 133,
149
Bottou, L. (1991). Une approche théorique de l'apprentissage connexioniste; applications
à la reconnaissance de la parole. Ph.D. thesis, Université de Paris XI. 290
Bottou, L. (2011). From machine learning to machine reasoning. Technical report,
arXiv.1102.1808. 260, 261
Bottou, L., Fogelman-Soulié, F., Blanchet, P., and Lienard, J. S. (1990). Speaker inde-
pendent isolated digit recognition: multilayer perceptrons vs dynamic time warping.
Neural Networks, 3, 453–465. 290
Bottou, L., Bengio, Y., and LeCun, Y. (1997). Global training of document processing
systems using graph transformer networks. In Proceedings of the Computer Vision and
Pattern Recognition Conference (CVPR’97), pages 490–494, Puerto Rico. IEEE. 282,
289, 291, 300, 301, 302
Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer perceptrons and
singular value decomposition. Biological Cybernetics, 59, 291–294. 369
Bourlard, H. and Morgan, N. (1993). Connectionist Speech Recognition. A Hybrid Ap-
proach, volume 247 of The Kluwer international series in engineering and computer
science. Kluwer Academic Publishers, Boston. 290
Bourlard, H. and Wellekens, C. (1989). Speech pattern discrimination and multi-layered
perceptrons. Computer Speech and Language, 3, 1–19. 318
Bourlard, H. and Wellekens, C. (1990). Links between hidden Markov models and multi-
layer perceptrons. IEEE Transactions on Pattern Analysis and Machine Intelligence,
12, 1167–1178. 290
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University
Press, New York, NY, USA. 80
Brady, M. L., Raghavan, R., and Slawny, J. (1989). Back-propagation fails to separate
where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36, 665–674.
208
Brand, M. (2003). Charting a manifold. In NIPS’2002 , pages 961–968. MIT Press. 139,
429
Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140. 188
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and
Regression Trees. Wadsworth International Group, Belmont, CA. 134
Brown, P. (1987). The Acoustic-Modeling problem in Automatic Speech Recognition.
Ph.D. thesis, Dept. of Computer Science, Carnegie-Mellon University. 288
Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D.,
Mercer, R. L., and Roossin, P. S. (1990). A statistical approach to machine translation.
Computational linguistics, 16(2), 79–85. 19
Brown, P. F., Pietra, V. J. D., DeSouza, P. V., Lai, J. C., and Mercer, R. L. (1992).
Class-based n-gram models of natural language. Computational Linguistics, 18, 467–
479. 323
Buciluǎ, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model compression. In Pro-
ceedings of the 12th ACM SIGKDD international conference on Knowledge discovery
and data mining, pages 535–541. ACM. 305
Carreira-Perpiñán, M. A. and Hinton, G. E. (2005). On contrastive divergence learning.
In R. G. Cowell and Z. Ghahramani, editors, AISTATS’2005, pages 33–40. Society for
Artificial Intelligence and Statistics. 449, 486
Caruana, R. (1993). Multitask connectionist learning. In Proc. 1993 Connectionist Models
Summer School, pages 372–379. 202
Cauchy, A. (1847). Méthode générale pour la résolution de systèmes d'équations simul-
tanées. In Compte rendu des séances de l'académie des sciences, pages 536–538. 72
Cayton, L. (2005). Algorithms for manifold learning. Technical Report CS2008-0923,
UCSD. 139, 426
Chapelle, O., Weston, J., and Schölkopf, B. (2003). Cluster kernels for semi-supervised
learning. In NIPS’02 , pages 585–592, Cambridge, MA. MIT Press. 411
Chapelle, O., Schölkopf, B., and Zien, A., editors (2006). Semi-Supervised Learning. MIT
Press, Cambridge, MA. 411
Chellapilla, K., Puri, S., and Simard, P. (2006). High Performance Convolutional Neu-
ral Networks for Document Processing. In Guy Lorette, editor, Tenth International
Workshop on Frontiers in Handwriting Recognition, La Baule (France). Université de
Rennes 1, Suvisoft. http://www.suvisoft.com. 20, 23, 304
Chen, S. F. and Goodman, J. T. (1999). An empirical study of smoothing techniques for
language modeling. Computer, Speech and Language, 13(4), 359–393. 280, 281, 333
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H.,
and Bengio, Y. (2014). Learning phrase representations using rnn encoder-decoder for
statistical machine translation. In Proceedings of the Empiricial Methods in Natural
Language Processing (EMNLP 2014). 274, 334
Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2014). The
loss surface of multilayer networks. 208, 399
Chorowski, J., Bahdanau, D., Cho, K., and Bengio, Y. (2014). End-to-end continuous
speech recognition using attention-based recurrent nn: First results. arXiv:1412.1602.
319
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated
recurrent neural networks on sequence modeling. NIPS’2014 Deep Learning workshop,
arXiv 1412.3555. 319
Ciresan, D., Meier, U., Masci, J., and Schmidhuber, J. (2012). Multi-column deep neural
network for traffic sign classification. Neural Networks, 32, 333–338. 21, 171
Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2010). Deep big
simple neural nets for handwritten digit recognition. Neural Computation, 22, 1–14.
20, 23
Coates, A. and Ng, A. Y. (2011). The importance of encoding versus training with sparse
coding and vector quantization. In ICML’2011 . 23
Coates, A., Lee, H., and Ng, A. Y. (2011). An analysis of single-layer networks in un-
supervised feature learning. In Proceedings of the Thirteenth International Conference
on Artificial Intelligence and Statistics (AISTATS 2011). 314
Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., and Andrew, N. (2013). Deep
learning with COTS HPC systems. In S. Dasgupta and D. McAllester, editors, Proceedings
of the 30th International Conference on Machine Learning (ICML-13), volume 28 (3),
pages 1337–1345. JMLR Workshop and Conference Proceedings. 20, 23
Collobert, R. (2004). Large Scale Machine Learning. Ph.D. thesis, Université de Paris
VI, LIP6. 149
Collobert, R. and Weston, J. (2008). A unified architecture for natural language process-
ing: Deep neural networks with multitask learning. In ICML’2008 . 331
Comon, P. (1994). Independent component analysis - a new concept? Signal Processing,
36, 287–314. 379, 380
Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning, 20,
273–297. 16, 123, 133
Couprie, C., Farabet, C., Najman, L., and LeCun, Y. (2013). Indoor semantic segmenta-
tion using depth information. In International Conference on Learning Representations
(ICLR2013). 21, 171
Courville, A., Bergstra, J., and Bengio, Y. (2011). Unsupervised models of images by
spike-and-slab RBMs. In ICML’11 . 341, 503
Courville, A., Desjardins, G., Bergstra, J., and Bengio, Y. (2014). The spike-and-slab
RBM and extensions to discrete and sparse data distributions. Pattern Analysis and
Machine Intelligence, IEEE Transactions on, 36(9), 1874–1887. 504
Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd Edition.
Wiley-Interscience. 54
Cox, R. T. (1946). Probability, frequency and reasonable expectation. American Journal
of Physics, 14, 1–10. 47
Cramér, H. (1946). Mathematical methods of statistics. Princeton University Press. 114
Crick, F. H. C. and Mitchison, G. (1983). The function of dream sleep. Nature, 304,
111–114. 447
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathe-
matics of Control, Signals, and Systems, 2, 303–314. 420
Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition
with the mean-covariance restricted Boltzmann machine. In NIPS’2010 . 21
Dahl, G. E., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained
deep neural networks for large vocabulary speech recognition. IEEE Transactions on
Audio, Speech, and Language Processing, 20(1), 33–42. 318
Dahl, G. E., Jaitly, N., and Salakhutdinov, R. (2014). Multi-task neural networks for
QSAR predictions. arXiv:1406.1231. 22
Dauphin, Y. and Bengio, Y. (2013). Stochastic ratio matching of RBMs for sparse high-
dimensional inputs. In NIPS26 . NIPS Foundation. 457
Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Large-scale learning of embeddings with
reconstruction sampling. In ICML’2011 . 330
Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014).
Identifying and attacking the saddle point problem in high-dimensional non-convex
optimization. In NIPS’2014 . 74, 208, 399
Davis, A., Rubinstein, M., Wadhwa, N., Mysore, G., Durand, F., and Freeman, W. T.
(2014). The visual microphone: Passive recovery of sound from video. ACM Transac-
tions on Graphics (Proc. SIGGRAPH), 33(4), 79:1–79:10. 311
de Finetti, B. (1937). La prévision: ses lois logiques, ses sources subjectives. Annales de
l'institut Henri Poincaré, 7, 1–68. 47
Delalleau, O. and Bengio, Y. (2011). Shallow vs. deep sum-product networks. In NIPS.
17, 171, 420, 421
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A
Large-Scale Hierarchical Image Database. In CVPR09 . 19, 128
Deng, J., Berg, A. C., Li, K., and Fei-Fei, L. (2010a). What does classifying more than
10,000 image categories tell us? In Proceedings of the 11th European Conference on
Computer Vision: Part V , ECCV’10, pages 71–84, Berlin, Heidelberg. Springer-Verlag.
19
Deng, J., Ding, N., Jia, Y., Frome, A., Murphy, K., Bengio, S., Li, Y., Neven, H., and
Adam, H. (2014). Large-scale object classification using label relation graphs. In
ECCV’2014 , pages 48–64. 282
Deng, L. and Yu, D. (2014). Deep learning – methods and applications. Foundations and
Trends in Signal Processing. 318
Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton, G. (2010b). Bi-
nary coding of speech spectrograms using a deep auto-encoder. In Interspeech 2010 ,
Makuhari, Chiba, Japan. 21
Desjardins, G. and Bengio, Y. (2008). Empirical evaluation of convolutional RBMs
for vision. Technical Report 1327, Département d'Informatique et de Recherche
Opérationnelle, Université de Montréal. 504
Desjardins, G., Courville, A., and Bengio, Y. (2011). On tracking the partition function.
In NIPS’2011 . 465
Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., and Makhoul, J. (2014). Fast
and robust neural network joint models for statistical machine translation. In Proc.
ACL’2014 . 334
DiCarlo, J. J. (2013). Mechanisms underlying visual object recognition: Humans vs.
neurons vs. machines. NIPS Tutorial. 22, 247
Do, T.-M.-T. and Artières, T. (2010). Neural conditional random fields. In International
Conference on Artificial Intelligence and Statistics, pages 177–184. 282
Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko,
K., and Darrell, T. (2014). Long-term recurrent convolutional networks for visual
recognition and description. arXiv:1411.4389. 86
Donoho, D. L. and Grimes, C. (2003). Hessian eigenmaps: new locally linear embed-
ding techniques for high-dimensional data. Technical Report 2003-08, Dept. Statistics,
Stanford University. 139, 429
Doob, J. (1953). Stochastic processes. Wiley: New York. 47
Doya, K. (1993). Bifurcations of recurrent neural networks in gradient descent learning.
IEEE Transactions on Neural Networks, 1, 75–80. 214, 267
Dugas, C., Bengio, Y., Bélisle, F., and Nadeau, C. (2001). Incorporating second-order
functional knowledge for better option pricing. In NIPS’00 , pages 472–478. MIT Press.
62, 149
Ebrahimi, S., Pal, C., Bouthillier, X., Froumenty, P., Jean, S., Konda, K. R., Vincent,
P., Courville, A., and Bengio, Y. (2013). Combining modality specific deep neural
network models for emotion recognition in video. In Emotion Recognition In The Wild
Challenge and Workshop (Emotiw2013). 171
El Hihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term
dependencies. In NIPS 8 . MIT Press. 275, 279, 280
ElHihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term
dependencies. In NIPS’1995 . 270
Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., and Bengio, S. (2010).
Why does unsupervised pre-training help deep learning? J. Machine Learning Res.
397, 399, 400, 401
Fang, H., Gupta, S., Iandola, F., Srivastava, R., Deng, L., Dollár, P., Gao, J., He, X.,
Mitchell, M., Platt, J. C., Zitnick, C. L., and Zweig, G. (2015). From captions to visual
concepts and back. arXiv:1411.4952. 86
Farabet, C., LeCun, Y., Kavukcuoglu, K., Culurciello, E., Martini, B., Akselrod, P.,
and Talay, S. (2011). Large-scale FPGA-based convolutional networks. In R. Bekker-
man, M. Bilenko, and J. Langford, editors, Scaling up Machine Learning: Parallel and
Distributed Approaches. Cambridge University Press. 386
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013a). Learning hierarchical
features for scene labeling. IEEE Transactions on Pattern Analysis and Machine In-
telligence. 21, 171
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013b). Learning hierarchical
features for scene labeling. IEEE Transactions on Pattern Analysis and Machine In-
telligence, 35(8), 1915–1929. 282
Fei-Fei, L., Fergus, R., and Perona, P. (2006). One-shot learning of object categories.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611.
408
Fischer, A. and Igel, C. (2011). Bounding the bias of contrastive divergence learning.
Neural Computation, 23(3), 664–73. 486
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals
of Eugenics, 7, 179–188. 19, 89
Frasconi, P., Gori, M., and Sperduti, A. (1997). On the efficient classification of data
structures by neural networks. In Proc. Int. Joint Conf. on Artificial Intelligence. 261
Frasconi, P., Gori, M., and Sperduti, A. (1998). A general framework for adaptive pro-
cessing of data structures. IEEE Transactions on Neural Networks, 9(5), 768–786.
261
Frey, B. J. (1998). Graphical models for machine learning and digital communication.
MIT Press. 262
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mech-
anism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36,
193–202. 14, 20, 23, 248
Garson, J. (1900). The metric system of identification of criminals, as used in Great
Britain and Ireland. The Journal of the Anthropological Institute of Great Britain and
Ireland, (2), 177–227. 19
Girosi, F. (1994). Regularization theory, radial basis functions and networks. In
V. Cherkassky, J. Friedman, and H. Wechsler, editors, From Statistics to Neural Net-
works, volume 136 of NATO ASI Series, pages 166–187. Springer Berlin Heidelberg.
170
Glorot, X., Bordes, A., and Bengio, Y. (2011a). Deep sparse rectifier neural networks. In
AISTATS’2011 . 14, 149, 385
Glorot, X., Bordes, A., and Bengio, Y. (2011b). Deep sparse rectifier neural networks.
In JMLR W&CP: Proceedings of the Fourteenth International Conference on Artificial
Intelligence and Statistics (AISTATS 2011). 174, 385
Glorot, X., Bordes, A., and Bengio, Y. (2011c). Domain adaptation for large-scale senti-
ment classification: A deep learning approach. In ICML’2011 . 385, 405
Gong, S., McKenna, S., and Psarrou, A. (2000). Dynamic Vision: From Images to Face
Recognition. Imperial College Press. 430, 432
Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep
networks. In NIPS’2009 , pages 646–654. 373, 385
Goodfellow, I., Koenig, N., Muja, M., Pantofaru, C., Sorokin, A., and Takayama, L.
(2010). Help me help you: Interfaces for personal robots. In Proc. of Human Robot
Interaction (HRI), Osaka, Japan. ACM Press, ACM Press. 85
Goodfellow, I., Courville, A., and Bengio, Y. (2012). Large-scale feature learning with
spike-and-slab sparse coding. In ICML’2012 . 381
Goodfellow, I. J. (2010). Technical report: Multidimensional, downsampled convolution
for autoencoders. Technical report, Université de Montréal. 241
Goodfellow, I. J., Courville, A., and Bengio, Y. (2011). Spike-and-slab sparse coding
for unsupervised feature discovery. In NIPS Workshop on Challenges in Learning
Hierarchical Models. 171, 405
Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a).
Maxout networks. In S. Dasgupta and D. McAllester, editors, ICML’13 , pages 1319–
1327. 174, 200, 246, 314
Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction
deep Boltzmann machines. In NIPS26 . NIPS Foundation. 86, 455, 500, 501
Goodfellow, I. J., Courville, A., and Bengio, Y. (2013c). Scaling up spike-and-slab models
for unsupervised feature learning. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 35(8), 1902–1914. 504
Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., and Shet, V. (2014). Multi-digit
number recognition from Street View imagery using deep convolutional neural net-
works. In International Conference on Learning Representations. 21, 307
Goodman, J. (2001). Classes for fast maximum entropy training. In International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP), Utah. 326
Gori, M. and Tesi, A. (1992). On the problem of local minima in backpropagation. IEEE
Transactions on Pattern Analysis and Machine Intelligence, PAMI-14(1), 76–86. 208
Gosset, W. S. (1908). The probable error of a mean. Biometrika, 6(1), 1–25. Originally
published under the pseudonym “Student”. 19
Gouws, S., Bengio, Y., and Corrado, G. (2014). Bilbowa: Fast bilingual distributed
representations without word alignments. Technical report, arXiv:1410.2455. 409
Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks. Studies
in Computational Intelligence. Springer. 258, 273, 274, 282, 319
Graves, A. (2013). Generating sequences with recurrent neural networks. Technical
report, arXiv:1308.0850. 155, 273, 275
Graves, A. and Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent
neural networks. In ICML’2014 . 273
Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirec-
tional LSTM and other neural network architectures. Neural Networks, 18(5), 602–610.
258
Graves, A. and Schmidhuber, J. (2009). Offline handwriting recognition with multidi-
mensional recurrent neural networks. In D. Koller, D. Schuurmans, Y. Bengio, and
L. Bottou, editors, NIPS’2008 , pages 545–552. 258
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist tempo-
ral classification: Labelling unsegmented sequence data with recurrent neural networks.
In ICML’2006 , pages 369–376, Pittsburgh, USA. 282, 319
Graves, A., Liwicki, M., Bunke, H., Schmidhuber, J., and Fernández, S. (2008). Uncon-
strained on-line handwriting recognition with recurrent neural networks. In J. Platt,
D. Koller, Y. Singer, and S. Roweis, editors, NIPS’2007 , pages 577–584. 258
Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., and Schmidhuber,
J. (2009). A novel connectionist system for unconstrained handwriting recognition.
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5), 855–868.
273
Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep recur-
rent neural networks. In ICASSP’2013 , pages 6645–6649. 258, 273, 274, 319
Graves, A., Wayne, G., and Danihelka, I. (2014). Neural Turing machines.
arXiv:1410.5401. 22
Gülçehre, Ç. and Bengio, Y. (2013). Knowledge matters: Importance of prior infor-
mation for optimization. In International Conference on Learning Representations
(ICLR’2013). 21
Gutmann, M. and Hyvärinen, A. (2010). Noise-contrastive estimation: A new estima-
tion principle for unnormalized statistical models. In Proceedings of The Thirteenth
International Conference on Artificial Intelligence and Statistics (AISTATS’10). 457
Hadsell, R., Sermanet, P., Ben, J., Erkan, A., Han, J., Muller, U., and LeCun, Y.
(2007). Online learning for offroad robots: Spatial label propagation to learn long-
range traversability. In Proceedings of Robotics: Science and Systems, Atlanta, GA,
USA. 312
Haffner, P., Franzini, M., and Waibel, A. (1991). Integrating time alignment and neural
networks for high performance continuous speech recognition. In International Confer-
ence on Acoustics, Speech and Signal Processing (ICASSP), pages 105–108, Toronto.
290
Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings
of the 18th annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley,
California. ACM Press. 171, 420
Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits.
Computational Complexity, 1, 113–129. 171, 420
Hebb, D. O. (1949). The Organization of Behavior. Wiley, New York. 15
Henaff, M., Jarrett, K., Kavukcuoglu, K., and LeCun, Y. (2011). Unsupervised learning
of sparse features for scalable audio classification. In ISMIR’11 . 386
Herault, J. and Ans, B. (1984). Circuits neuronaux à synapses modifiables: Décodage de
messages composites par apprentissage non supervisé. Comptes Rendus de l'Académie
des Sciences, 299(III-13), 525–528. 379
Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V.,
Nguyen, P., Sainath, T., and Kingsbury, B. (2012a). Deep neural networks for acoustic
modeling in speech recognition. IEEE Signal Processing Magazine, 29(6), 82–97. 21,
318
Hinton, G. E. (2000). Training products of experts by minimizing contrastive divergence.
Technical Report GCNU TR 2000-004, Gatsby Unit, University College London. 448
Hinton, G. E. and Roweis, S. (2003). Stochastic neighbor embedding. In NIPS’2002 . 429
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with
neural networks. Science, 313(5786), 504–507. 375, 396, 397
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the Dimensionality of Data with
Neural Networks. Science, 313, 504–507. 399
Hinton, G. E. and Zemel, R. S. (1994). Autoencoders, minimum description length, and
Helmholtz free energy. In NIPS’1993 . 369
Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief
nets. Neural Computation, 18, 1527–1554. 12, 16, 23, 124, 396, 397, 487
Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A.,
Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012b). Deep neural
networks for acoustic modeling in speech recognition: The shared views of four research
groups. IEEE Signal Process. Mag., 29(6), 82–97. 86
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.
(2012c). Improving neural networks by preventing co-adaptation of feature detectors.
Technical report, arXiv:1207.0580. 185
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma
thesis, T.U. München. 213, 267, 276
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computa-
tion, 9(8), 1735–1780. 22, 273, 274
Hochreiter, S., Informatik, F. F., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2000).
Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In
J. Kolen and S. Kremer, editors, Field Guide to Dynamical Recurrent Networks. IEEE
Press. 274
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are
universal approximators. Neural Networks, 2, 359–366. 420
Hsu, F.-H. (2002). Behind Deep Blue: Building the Computer That Defeated the World
Chess Champion. Princeton University Press, Princeton, NJ, USA. 2
Huang, F. and Ogata, Y. (2002). Generalized pseudo-likelihood estimates for Markov
random fields on lattice. Annals of the Institute of Statistical Mathematics, 54(1),
1–18. 454
Hubel, D. and Wiesel, T. (1968). Receptive fields and functional architecture of monkey
striate cortex. Journal of Physiology (London), 195, 215–243. 245
Hubel, D. H. and Wiesel, T. N. (1959). Receptive fields of single neurons in the cat’s
striate cortex. Journal of Physiology, 148, 574–591. 245
Hubel, D. H. and Wiesel, T. N. (1962). Receptive fields, binocular interaction, and
functional architecture in the cat’s visual cortex. Journal of Physiology (London),
160, 106–154. 245
Hyotyniemi, H. (1996). Turing machines are recurrent neural networks. In STeP’96,
pages 13–24. 253
Hyvärinen, A. (1999). Survey on independent component analysis. Neural Computing
Surveys, 2, 94–128. 379
Hyvärinen, A. (2005a). Estimation of non-normalized statistical models using score
matching. J. Machine Learning Res., 6. 390
Hyvärinen, A. (2005b). Estimation of non-normalized statistical models using score
matching. Journal of Machine Learning Research, 6, 695–709. 455
Hyvärinen, A. (2007a). Connections between score matching, contrastive divergence,
and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural
Networks, 18, 1529–1531. 456
Hyvärinen, A. (2007b). Some extensions of score matching. Computational Statistics and
Data Analysis, 51, 2499–2512. 456
Hyvärinen, A. and Pajunen, P. (1999). Nonlinear independent component analysis: Ex-
istence and uniqueness results. Neural Networks, 12(3), 429–439. 380
Hyvärinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis.
Wiley-Interscience. 379
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training
by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 . 85
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixture
of local experts. Neural Computation, 3, 79–87. 154
Jaeger, H. (2003). Adaptive nonlinear system identification with echo state networks. In
Advances in Neural Information Processing Systems 15 . 268
Jaeger, H. (2007a). Discovering multiscale dynamical features with hierarchical echo state
networks. Technical report, Jacobs University. 275
Jaeger, H. (2007b). Echo state network. Scholarpedia, 2(9), 2330. 267
Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and
saving energy in wireless communication. Science, 304(5667), 78–80. 23, 267
Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., Mooij, J. M., and Schölkopf, B. (2012).
On causal and anticausal learning. In ICML’2012 , pages 1255–1262. 412, 414
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009a). What is the best
multi-stage architecture for object recognition? In ICCV’09. 14, 149, 386
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009b). What is the best
multi-stage architecture for object recognition? In Proc. International Conference on
Computer Vision (ICCV’09), pages 2146–2153. IEEE. 20, 23, 173
Jarzynski, C. (1997). Nonequilibrium equality for free energy differences. Phys. Rev.
Lett., 78, 2690–2693. 464
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University
Press. 46
Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2014). On using very large target
vocabulary for neural machine translation. arXiv:1412.2007. 334
Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters
from sparse data. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in
Practice. North-Holland, Amsterdam. 280, 333
Jordan, M. I. (1998). Learning in Graphical Models. Kluwer, Dordrecht, Netherlands. 16
Juang, B. H. and Katagiri, S. (1992). Discriminative learning for minimum error classi-
fication. IEEE Transactions on Signal Processing, 40(12), 3043–3054. 288
Jutten, C. and Herault, J. (1991). Blind separation of sources, part I: an adaptive algo-
rithm based on neuromimetic architecture. Signal Processing, 24, 1–10. 379
Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In
EMNLP’2013 . 334
Kamyshanska, H. and Memisevic, R. (2015). The potential energy of an autoencoder.
IEEE Transactions on Pattern Analysis and Machine Intelligence. 392
Karpathy, A. and Li, F.-F. (2015). Deep visual-semantic alignments for generating image
descriptions. In CVPR’2015 . arXiv:1412.2306. 86
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014).
Large-scale video classification with convolutional neural networks. In CVPR. 19
Karush, W. (1939). Minima of Functions of Several Variables with Inequalities as Side
Constraints. Master’s thesis, Dept. of Mathematics, Univ. of Chicago. 82
Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model
component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal
Processing, ASSP-35(3), 400–401. 280, 333
Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008a). Fast inference in sparse coding
algorithms with applications to object recognition. CBLL-TR-2008-12-01, NYU. 372
Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008b). Fast inference in sparse coding
algorithms with applications to object recognition. Technical report, Computational
and Biological Learning Lab, Courant Institute, NYU. Tech Report CBLL-TR-2008-
12-01. 386
Kavukcuoglu, K., Ranzato, M.-A., Fergus, R., and LeCun, Y. (2009). Learning invariant
features through topographic filter maps. In CVPR’2009. 386
Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y.
(2010). Learning convolutional feature hierarchies for visual recognition. In NIPS’2010 .
386
Kindermann, R. (1980). Markov Random Fields and Their Applications (Contemporary
Mathematics ; V. 1). American Mathematical Society. 345
Kingma, D. and LeCun, Y. (2010a). Regularized estimation of image statistics by score
matching. In NIPS’2010 . 390
Kingma, D. and LeCun, Y. (2010b). Regularized estimation of image statistics by score
matching. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta,
editors, Advances in Neural Information Processing Systems 23, pages 1126–1134. 457
Kingma, D., Rezende, D., Mohamed, S., and Welling, M. (2014). Semi-supervised learning
with deep generative models. In NIPS’2014. 360
Kingma, D. P. (2013). Fast gradient-based inference with continuous latent variable
models in auxiliary form. Technical report, arxiv:1306.0733. 360
Kingma, D. P. and Welling, M. (2014a). Auto-encoding variational bayes. In Proceedings
of the International Conference on Learning Representations (ICLR). 360, 432, 433
Kingma, D. P. and Welling, M. (2014b). Efficient gradient-based inference through trans-
formations between bayes nets and neural nets. Technical report, arxiv:1402.0480. 360
Kiros, R., Salakhutdinov, R., and Zemel, R. (2014a). Multimodal neural language models.
In ICML’2014 . 86
Kiros, R., Salakhutdinov, R., and Zemel, R. (2014b). Unifying visual-semantic embed-
dings with multimodal neural language models. arXiv:1411.2539 [cs.LG]. 86, 273
Klementiev, A., Titov, I., and Bhattarai, B. (2012). Inducing crosslingual distributed
representations of words. In Proceedings of COLING 2012 . 409
Knowles-Barley, S., Jones, T. R., Morgan, J., Lee, D., Kasthuri, N., Lichtman, J. W., and
Pfister, H. (2014). Deep learning for the connectome. GPU Technology Conference. 22
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and
Techniques. MIT Press. 286, 358, 365
Konig, Y., Bourlard, H., and Morgan, N. (1996). REMAP: Recursive estimation and max-
imization of A posteriori probabilities application to transition-based connectionist
speech recognition. In NIPS’95 . MIT Press, Cambridge, MA. 318
Koren, Y. (2009). The BellKor solution to the Netflix Grand Prize. 191
Koutnik, J., Greff, K., Gomez, F., and Schmidhuber, J. (2014). A clockwork RNN. In
ICML’2014 . 275, 280
Kočiský, T., Hermann, K. M., and Blunsom, P. (2014). Learning Bilingual Word Repre-
sentations by Marginalizing Alignments. In Proceedings of ACL. 335
Krause, O., Fischer, A., Glasmachers, T., and Igel, C. (2013). Approximation properties
of DBNs with binary hidden units and real-valued visible units. In ICML’2013. 420
Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny
images. Technical report, University of Toronto. 19, 341
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012a). ImageNet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems
25 (NIPS’2012). 20, 23, 85, 312
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012b). ImageNet classification with deep
convolutional neural networks. In NIPS’2012 . 21, 171, 385
Kuhn, H. W. and Tucker, A. W. (1951). Nonlinear programming. In Proceedings of the
Second Berkeley Symposium on Mathematical Statistics and Probability, pages 481–
492, Berkeley, Calif. University of California Press. 82
Lafferty, J., McCallum, A., and Pereira, F. C. N. (2001). Conditional random fields:
Probabilistic models for segmenting and labeling sequence data. In C. E. Brodley and
A. P. Danyluk, editors, ICML 2001 . Morgan Kaufmann. 282, 289
Lang, K. J. and Hinton, G. E. (1988). The development of the time-delay neural net-
work architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-
Mellon University. 250, 269
Lappalainen, H., Giannakopoulos, X., Honkela, A., and Karhunen, J. (2000). Nonlinear
independent component analysis using ensemble learning: Experiments and discussion.
In Proc. ICA. Citeseer. 380
Larochelle, H. and Bengio, Y. (2008a). Classification using discriminative restricted Boltz-
mann machines. In ICML’2008 . 373, 515
Larochelle, H. and Bengio, Y. (2008b). Classification using discriminative restricted
Boltzmann machines. In ICML’08 , pages 536–543. ACM. 411
Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator.
In AISTATS’2011 . 262, 265
Larochelle, H., Erhan, D., and Bengio, Y. (2008). Zero-data learning of new tasks. In
AAAI Conference on Artificial Intelligence. 409
Lasserre, J. A., Bishop, C. M., and Minka, T. P. (2006). Principled hybrids of generative
and discriminative models. In Proceedings of the Computer Vision and Pattern Recog-
nition Conference (CVPR’06), pages 87–94, Washington, DC, USA. IEEE Computer
Society. 411
Le, Q., Ranzato, M., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J., and Ng,
A. (2012). Building high-level features using large scale unsupervised learning. In
ICML’2012 . 20, 23
Le Roux, N. and Bengio, Y. (2010). Deep belief networks are compact universal approx-
imators. Neural Computation, 22(8), 2192–2207. 420
Le Roux, N., Manzagol, P.-A., and Bengio, Y. (2008). Topmoumoute online natural
gradient algorithm. In NIPS’07 . 158
LeCun, Y. (1987). Modèles connexionistes de l'apprentissage. Ph.D. thesis, Université de
Paris VI. 16, 369
LeCun, Y., Jackel, L. D., Boser, B., Denker, J. S., Graf, H. P., Guyon, I., Henderson, D.,
Howard, R. E., and Hubbard, W. (1989). Handwritten digit recognition: Applications
of neural network chips and automatic learning. IEEE Communications Magazine,
27(11), 41–46. 248
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998a). Gradient-based learning
applied to document recognition. Proceedings of the IEEE , 86(11), 2278–2324. 14, 23
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998b). Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. 16, 19,
282, 289, 291, 319
Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area
V2. In NIPS’07 . 373
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). Convolutional deep belief
networks for scalable unsupervised learning of hierarchical representations. In L. Bottou
and M. Littman, editors, ICML 2009. ACM, Montreal, Canada. 504, 505
Lenat, D. B. and Guha, R. V. (1989). Building large knowledge-based systems; represen-
tation and inference in the Cyc project. Addison-Wesley Longman Publishing Co., Inc.
2
Leprieur, H. and Haffner, P. (1995). Discriminant learning with minimum memory loss
for improved non-vocabulary rejection. In EUROSPEECH’95, Madrid, Spain. 288
Lin, T., Horne, B. G., Tino, P., and Giles, C. L. (1996). Learning long-term dependencies
is not as difficult with NARX recurrent neural networks. IEEE Transactions on Neural
Networks, 7(6), 1329–1338. 270
Linde, N. (1992). The machine that changed the world, episode 3. Documentary minis-
eries. 2
Long, P. M. and Servedio, R. A. (2010). Restricted Boltzmann machines are hard to ap-
proximately evaluate or simulate. In Proceedings of the 27th International Conference
on Machine Learning (ICML’10). 482
Lovelace, A. (1842). Notes upon L. F. Menabrea’s “Sketch of the Analytical Engine
invented by Charles Babbage”. 1
Lowerre, B. (1976). The Harpy Speech Recognition System. Ph.D. thesis. 282, 288, 292
Lukoševičius, M. and Jaeger, H. (2009). Reservoir computing approaches to recurrent
neural network training. Computer Science Review, 3(3), 127–149. 267
Luo, H., Carrier, P.-L., Courville, A., and Bengio, Y. (2013). Texture modeling with
convolutional spike-and-slab RBMs and deep extensions. In AISTATS’2013 . 87
Lyu, S. (2009). Interpretation and generalization of score matching. In UAI’09 . 456
Maass, W., Natschlaeger, T., and Markram, H. (2002). Real-time computing without
stable states: A new framework for neural computation based on perturbations. Neural
Computation, 14(11), 2531–2560. 267
MacKay, D. (2003). Information Theory, Inference and Learning Algorithms. Cambridge
University Press. 54
Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., and Yuille, A. L. (2015). Deep captioning
with multimodal recurrent neural networks. In ICLR’2015 . arXiv:1410.1090. 86
Marlin, B., Swersky, K., Chen, B., and de Freitas, N. (2010). Inductive principles for
restricted Boltzmann machine learning. In Proceedings of The Thirteenth International
Conference on Artificial Intelligence and Statistics (AISTATS’10), volume 9, pages
509–516. 451, 456, 484
Martens, J. and Medabalimi, V. (2014). On the expressive efficiency of sum product
networks. arXiv:1411.7717 . 421
Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-
free optimization. In Proc. ICML’2011 . ACM. 277
Mase, S. (1995). Consistency of the maximum pseudo-likelihood estimator of continuous
state space Gibbsian processes. The Annals of Applied Probability, 5(3), pp. 603–612.
454
Matan, O., Burges, C. J. C., LeCun, Y., and Denker, J. S. (1992). Multi-digit recognition
using a space displacement neural network. In NIPS’91 , pages 488–495, San Mateo
CA. Morgan Kaufmann. 290
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall,
London. 150
McCulloch, W. S. and Pitts, W. (1943). A logical calculus of ideas immanent in nervous
activity. Bulletin of Mathematical Biophysics, 5, 115–133. 13
Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E.,
Muller, X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra,
J. (2011). Unsupervised and transfer learning challenge: a deep learning approach. In
JMLR W&CP: Proc. Unsupervised and Transfer Learning, volume 7. 171, 405
Mesnil, G., Rifai, S., Dauphin, Y., Bengio, Y., and Vincent, P. (2012). Surfing on the
manifold. Learning Workshop, Snowbird. 508
Miikkulainen, R. and Dyer, M. G. (1991). Natural language processing with modular
PDP networks and distributed lexicon. Cognitive Science, 15, 343–399. 321
Mikolov, T. (2012). Statistical Language Models based on Neural Networks. Ph.D. thesis,
Brno University of Technology. 155, 278
Mikolov, T., Deoras, A., Kombrink, S., Burget, L., and Cernocky, J. (2011a). Empiri-
cal evaluation and combination of advanced language modeling techniques. In Proc.
12th annual conference of the international speech communication association (INTER-
SPEECH 2011). 332
Mikolov, T., Deoras, A., Povey, D., Burget, L., and Cernocky, J. (2011b). Strategies for
training large scale neural network language models. In Proc. ASRU’2011. 332
Mikolov, T., Le, Q. V., and Sutskever, I. (2013). Exploiting similarities among languages
for machine translation. Technical report, arXiv:1309.4168. 409
Minka, T. (2005). Divergence measures and message passing. Technical Report
MSR-TR-2005-173, Microsoft Research, Cambridge, UK. 461
Minsky, M. L. and Papert, S. A. (1969). Perceptrons. MIT Press, Cambridge. 13
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York. 84
Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-
contrastive estimation. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and
K. Weinberger, editors, Advances in Neural Information Processing Systems 26 , pages
2265–2273. Curran Associates, Inc. 331, 459
Mnih, A. and Teh, Y. W. (2012). A fast and simple algorithm for training neural proba-
bilistic language models. In ICML’2012 , pages 1751–1758. 331
Mohamed, A., Dahl, G., and Hinton, G. (2012). Acoustic modeling using deep belief
networks. IEEE Trans. on Audio, Speech and Language Processing, 20(1), 14–22. 318
Montúfar, G. (2014). Universal approximation depth and errors of narrow belief networks
with discrete units. Neural Computation, 26. 420
Montúfar, G. and Ay, N. (2011). Refinements of universal approximation results for
deep belief networks and restricted Boltzmann machines. Neural Computation, 23(5),
1306–1319. 420
Montufar, G. and Morton, J. (2014). When does a mixture of products contain a product
of mixtures? SIAM Journal on Discrete Mathematics, 29(1), 321–347. 419
Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear
regions of deep neural networks. In NIPS’2014 . 17, 418, 421, 422
Mor-Yosef, S., Samueloff, A., Modan, B., Navot, D., and Schenker, J. G. (1990). Ranking
the risk factors for cesarean: logistic regression analysis of a nationwide study. Obstet
Gynecol, 75(6), 944–7. 2
Morin, F. and Bengio, Y. (2005). Hierarchical probabilistic neural network language
model. In AISTATS’2005. 326, 329
Mozer, M. C. (1992). The induction of multiscale temporal structure. In NIPS’91 , pages
275–282, San Mateo, CA. Morgan Kaufmann. 270, 271, 280
Murphy, K. P. (2012). Machine Learning: a Probabilistic Perspective. MIT Press, Cam-
bridge, MA, USA. 84, 132, 134
Murray, B. U. I. and Larochelle, H. (2014). A deep and tractable density estimator. In
ICML’2014 . 155, 266, 267
Nadas, A., Nahamoo, D., and Picheny, M. A. (1988). On a model-robust training method
for speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing,
ASSP-36(9), 1432–1436. 288
Nair, V. and Hinton, G. (2010a). Rectified linear units improve restricted Boltzmann
machines. In ICML’2010 . 149, 385
Nair, V. and Hinton, G. E. (2010b). Rectified linear units improve restricted Boltzmann
machines. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh
International Conference on Machine Learning (ICML-10), pages 807–814. ACM. 14
Narayanan, H. and Mitter, S. (2010). Sample complexity of testing the manifold hypoth-
esis. In NIPS’2010 . 139, 426
Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56,
71–113. 506
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics.
Springer. 201
Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2),
125–139. 463, 464
Neal, R. M. (2005). Estimating ratios of normalizing constants using linked importance
sampling. 465
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Read-
ing digits in natural images with unsupervised feature learning. Deep Learning and
Unsupervised Feature Learning Workshop, NIPS. 19
Ney, H. and Kneser, R. (1993). Improved clustering techniques for class-based statistical
language modelling. In European Conference on Speech Communication and Technol-
ogy (Eurospeech), pages 973–976, Berlin. 323
Niesler, T. R., Whittaker, E. W. D., and Woodland, P. C. (1998). Comparison of
part-of-speech and automatically derived category-based language models for speech
recognition. In International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 177–180. 323
Niranjan, M. and Fallside, F. (1990). Neural networks and radial basis functions in
classifying static speech patterns. Computer Speech and Language, 4, 275–289. 149
Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer. 80, 82
Olshausen, B. and Field, D. J. (2005). How close are we to understanding V1? Neural
Computation, 17, 1665–1699. 14
Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field prop-
erties by learning a sparse code for natural images. Nature, 381, 607–609. 372, 425
Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set:
a strategy employed by V1? Vision Research, 37, 3311–3325. 317, 385
Park, H., Amari, S.-I., and Fukumizu, K. (2000). Adaptive natural gradient learning
algorithms for various stochastic models. Neural Networks, 13(7), 755–764. 158
Pascanu, R. (2014). On recurrent and deep networks. Ph.D. thesis, Université de
Montréal. 210, 211
Pascanu, R. and Bengio, Y. (2012). On the difficulty of training recurrent neural networks.
Technical Report arXiv:1211.5063, Universite de Montreal. 155
Pascanu, R. and Bengio, Y. (2013). Revisiting natural gradient for deep networks. Tech-
nical report, arXiv:1301.3584. 158
Pascanu, R., Mikolov, T., and Bengio, Y. (2013a). On the difficulty of training recurrent
neural networks. In ICML’2013 . 155, 214, 267, 271, 278, 279, 280
Pascanu, R., Montufar, G., and Bengio, Y. (2013b). On the number of inference regions
of deep feed forward networks with piece-wise linear activations. Technical report, U.
Montreal, arXiv:1312.6098. 171
Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. (2014a). How to construct deep
recurrent neural networks. In ICLR’2014 . 17, 273, 275, 319, 421, 422
Pascanu, R., Gülçehre, Ç., Cho, K., and Bengio, Y. (2014b). How to construct deep
recurrent neural networks. In ICLR’2014 . 200
Pascanu, R., Montufar, G., and Bengio, Y. (2014c). On the number of inference regions
of deep feed forward networks with piece-wise linear activations. In ICLR’2014 . 419
Pearl, J. (1985). Bayesian networks: A model of self-activated memory for evidential
reasoning. In Proceedings of the 7th Conference of the Cognitive Science Society, Uni-
versity of California, Irvine, pages 329–334. 343
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann. 47
Petersen, K. B. and Pedersen, M. S. (2006). The matrix cookbook. Version 20051003. 27
Pinto, N., Cox, D. D., and DiCarlo, J. J. (2008). Why is real-world visual object recog-
nition hard? PLoS Comput Biol, 4. 315, 505
Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1),
77–105. 260
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods.
USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17. 217
Poole, B., Sohl-Dickstein, J., and Ganguli, S. (2014). Analyzing noise in autoencoders
and deep networks. CoRR, abs/1406.1831. 188
Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In
UAI’2011 , Barcelona, Spain. 171, 421
Poundstone, W. (2005). Fortune’s Formula: The untold story of the scientific betting
system that beat the casinos and Wall Street. Macmillan. 55
Powell, M. (1987). Radial basis functions for multivariable interpolation: A review. 149
Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., and Fried, I. (2005). Invariant visual
representation by single neurons in the human brain. Nature, 435(7045), 1102–1107.
246
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE, 77(2), 257–286. 286, 318
Rabiner, L. R. and Juang, B. H. (1986). An introduction to hidden Markov models. IEEE
ASSP Magazine, pages 257–285. 250, 286
Raiko, T., Yao, L., Cho, K., and Bengio, Y. (2014). Iterative neural autoregressive
distribution estimator (NADE-k). Technical report, arXiv:1406.1485. 266
Raina, R., Madhavan, A., and Ng, A. Y. (2009). Large-scale deep unsupervised learning
using graphics processors. In L. Bottou and M. Littman, editors, ICML 2009 , pages
873–880, New York, NY, USA. ACM. 23
Ramsey, F. P. (1926). Truth and probability. In R. B. Braithwaite, editor, The Founda-
tions of Mathematics and other Logical Essays, chapter 7, pages 156–198. McMaster
University Archive for the History of Economic Thought. 48
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse
representations with an energy-based model. In NIPS’2006. 12, 16, 385, 396, 397
Ranzato, M., Boureau, Y., and LeCun, Y. (2008). Sparse feature learning for deep belief
networks. In NIPS’2007 . 385
Rao, C. (1945). Information and the accuracy attainable in the estimation of statistical
parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–89. 114
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and
approximate inference in deep generative models. In ICML’2014. 360
Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. Y. (2013). Zero-shot learning
through cross-modal transfer. In 27th Annual Conference on Neural Information Pro-
cessing Systems (NIPS 2013). 409
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011a). Contractive auto-
encoders: Explicit invariance during feature extraction. In ICML’2011. 392, 394,
428
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X.
(2011b). Higher order contractive auto-encoder. In European Conference on Machine
Learning and Principles and Practice of Knowledge Discovery in Databases (ECML
PKDD). 373
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X.
(2011c). Higher order contractive auto-encoder. In ECML PKDD. 392
Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011d). The manifold
tangent classifier. In NIPS’2011 . 441
Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for
sampling contractive auto-encoders. In ICML’2012. 507, 508
Roberts, S. and Everson, R. (2001). Independent component analysis: principles and
practice. Cambridge University Press. 380
Robinson, A. J. and Fallside, F. (1991). A recurrent error propagation network speech
recognition system. Computer Speech and Language, 5(3), 259–274. 23, 318
Rockafellar, R. T. (1997). Convex Analysis. Princeton Landmarks in Mathematics. 80
Rosen, J. B. (1960). The gradient projection method for nonlinear programming. Part I:
Linear constraints. Journal of the Society for Industrial and Applied Mathematics,
8(1), 181–217. 80
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage
and organization in the brain. Psychological Review, 65, 386–408. 12, 13, 23
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York. 13, 23
Roweis, S. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear
embedding. Science, 290(5500). 139, 429
Rumelhart, D., Hinton, G., and Williams, R. (1986a). Learning representations by back-
propagating errors. Nature, 323, 533–536. 12, 16, 21, 319
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986b). Learning internal repre-
sentations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors,
Parallel Distributed Processing, volume 1, chapter 8, pages 318–362. MIT Press, Cam-
bridge. 19, 23
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986c). Learning representations
by back-propagating errors. Nature, 323, 533–536. 144, 250
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986d). Parallel
Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press,
Cambridge. 15
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986e). Parallel
Distributed Processing: Explorations in the Microstructure of Cognition, volume 1.
MIT Press, Cambridge. 144
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,
A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2014a). ImageNet Large
Scale Visual Recognition Challenge. 19
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy,
A., Khosla, A., Bernstein, M., et al. (2014b). Imagenet large scale visual recognition
challenge. arXiv preprint arXiv:1409.0575 . 24
Rust, N., Schwartz, O., Movshon, J. A., and Simoncelli, E. (2005). Spatiotemporal
elements of macaque V1 receptive fields. Neuron, 46(6), 945–956. 248
Sainath, T., Mohamed, A.-r., Kingsbury, B., and Ramabhadran, B. (2013). Deep
convolutional neural networks for LVCSR. In ICASSP 2013 . 319
Salakhutdinov, R. and Hinton, G. (2009a). Deep Boltzmann machines. In Proceedings of
the International Conference on Artificial Intelligence and Statistics, volume 5, pages
448–455. 20, 23, 397, 489, 493, 498, 500
Salakhutdinov, R. and Hinton, G. (2009b). Deep Boltzmann machines. In Proceedings
of the Twelfth International Conference on Artificial Intelligence and Statistics (AIS-
TATS 2009), volume 8. 496, 502, 513
Salakhutdinov, R. and Hinton, G. E. (2008). Using deep belief nets to learn covariance
kernels for Gaussian processes. In NIPS’07 , pages 1249–1256, Cambridge, MA. MIT
Press. 412
Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief
networks. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008 ,
volume 25, pages 872–879. ACM. 464
Saul, L. K., Jaakkola, T., and Jordan, M. I. (1996). Mean field theory for sigmoid belief
networks. Journal of Artificial Intelligence Research, 4, 61–76. 23
Schaul, T., Zhang, S., and LeCun, Y. (2012). No More Pesky Learning Rates. Technical
report, New York University, arxiv 1206.1106. 222
Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of
history compression. Neural Computation, 4(2), 234–242. 275
Schmidhuber, J. (1996). Sequential neural text compression. IEEE Transactions on
Neural Networks, 7(1), 142–146. 321
Schölkopf, B. and Smola, A. (2002). Learning with Kernels. MIT Press. 133
Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a
kernel eigenvalue problem. Neural Computation, 10, 1299–1319. 139, 429
Schölkopf, B., Burges, C. J. C., and Smola, A. J. (1999). Advances in Kernel Methods:
Support Vector Learning. MIT Press, Cambridge, MA. 16, 149, 172
Schulz, H. and Behnke, S. (2012). Learning two-layer contractive encodings. In
ICANN’2012 , pages 620–628. 394
Schuster, M. and Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE
Transactions on Signal Processing, 45(11), 2673–2681. 258
Schwenk, H. (2007). Continuous space language models. Computer speech and language,
21, 492–518. 321, 325
Schwenk, H. (2010). Continuous space language models for statistical machine translation.
The Prague Bulletin of Mathematical Linguistics, 93, 137–146. 321, 333
Schwenk, H. (2014). Cleaned subset of WMT ’14 dataset. 19
Schwenk, H. and Gauvain, J.-L. (2002). Connectionist language modeling for large vocab-
ulary continuous speech recognition. In International Conference on Acoustics, Speech
and Signal Processing (ICASSP), volume 1, pages 765–768. 321
Schwenk, H. and Gauvain, J.-L. (2005). Building continuous space language models for
transcribing european languages. In Interspeech, pages 737–740. 321
Schwenk, H., Costa-jussà, M. R., and Fonollosa, J. A. R. (2006). Continuous space lan-
guage models for the IWSLT 2006 task. In International Workshop on Spoken Language
Translation, pages 166–173. 321, 333
Seide, F., Li, G., and Yu, D. (2011). Conversational speech transcription using context-
dependent deep neural networks. In Interspeech 2011 , pages 437–440. 21
Sermanet, P., Chintala, S., and LeCun, Y. (2012). Convolutional neural networks applied
to house numbers digit classification. CoRR, abs/1204.3968. 316
Sermanet, P., Kavukcuoglu, K., Chintala, S., and LeCun, Y. (2013). Pedestrian detection
with unsupervised multi-stage feature learning. In Proc. International Conference on
Computer Vision and Pattern Recognition (CVPR’13). IEEE. 21, 171
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical
Journal, 27(3), 379–423. 55
Shannon, C. E. (1949). Communication in the presence of noise. Proceedings of the
Institute of Radio Engineers, 37(1), 10–21. 55
Shilov, G. (1977). Linear Algebra. Dover Books on Mathematics Series. Dover Publica-
tions. 27
Siegelmann, H. (1995). Computation beyond the Turing limit. Science, 268(5210), 545–
548. 253
Siegelmann, H. and Sontag, E. (1991). Turing computability with neural nets. Applied
Mathematics Letters, 4(6), 77–80. 253
Siegelmann, H. T. and Sontag, E. D. (1995). On the computational power of neural nets.
Journal of Computer and Systems Sciences, 50(1), 132–150. 214
Simard, P., Victorri, B., LeCun, Y., and Denker, J. (1992). Tangent prop - A formalism
for specifying selected invariances in an adaptive network. In NIPS’1991 . 440, 441
Simard, P. Y., LeCun, Y., and Denker, J. (1993). Efficient pattern recognition using a
new transformation distance. In NIPS’92 . 439
Simard, P. Y., LeCun, Y. A., Denker, J. S., and Victorri, B. (1998). Transformation
invariance in pattern recognition — tangent distance and tangent propagation. Lecture
Notes in Computer Science, 1524. 439
Sjöberg, J. and Ljung, L. (1995). Overtraining, regularization and searching for a min-
imum, with application to neural networks. International Journal of Control, 62(6),
1391–1407. 196
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of
harmony theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed
Processing, volume 1, chapter 6, pages 194–281. MIT Press, Cambridge. 350, 362
Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011a).
Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In
NIPS’2011 . 261
Socher, R., Manning, C., and Ng, A. Y. (2011b). Parsing natural scenes and natural
language with recursive neural networks. In Proceedings of the Twenty-Eighth Inter-
national Conference on Machine Learning (ICML’2011). 261
Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D. (2011c).
Semi-supervised recursive autoencoders for predicting sentiment distributions. In
EMNLP’2011 . 261
Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C.
(2013). Recursive deep models for semantic compositionality over a sentiment treebank.
In EMNLP’2013 . 261
Solla, S. A., Levin, E., and Fleisher, M. (1988). Accelerated learning in layered neural
networks. Complex Systems, 2, 625–639. 152
Sontag, E. D. and Sussman, H. J. (1989). Backpropagation can give rise to spurious local
minima even for networks without hidden layers. Complex Systems, 3, 91–106. 208
Srivastava, N. and Salakhutdinov, R. (2012). Multimodal learning with deep Boltzmann
machines. In NIPS’2012 . 410
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).
Dropout: A simple way to prevent neural networks from overfitting. Journal of Ma-
chine Learning Research, 15, 1929–1958. 198, 200, 201, 500
Stewart, L., He, X., and Zemel, R. S. (2007). Learning flexible features for conditional
random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence,
30(8), 1415–1426. 282
Sutskever, I. (2012). Training Recurrent Neural Networks. Ph.D. thesis, Department of
computer science, University of Toronto. 268, 269, 277
Sutskever, I. and Tieleman, T. (2010). On the Convergence Properties of Contrastive
Divergence. In Y. W. Teh and M. Titterington, editors, Proc. of the International
Conference on Artificial Intelligence and Statistics (AISTATS), volume 9, pages 789–
795. 449
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of
initialization and momentum in deep learning. In ICML. 217, 268, 269, 277
Sutskever, I., Vinyals, O., and Le, Q. V. (2014a). Sequence to sequence learning with
neural networks. Technical report, arXiv:1409.3215. 22, 86, 273, 274
Sutskever, I., Vinyals, O., and Le, Q. V. (2014b). Sequence to sequence learning with
neural networks. In NIPS’2014 . 334, 335
Swersky, K. (2010). Inductive Principles for Learning Restricted Boltzmann Machines.
Master’s thesis, University of British Columbia. 390
Swersky, K., Ranzato, M., Buchman, D., Marlin, B., and de Freitas, N. (2011). On
autoencoders and score matching for energy based models. In ICML’2011 . ACM. 457
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Van-
houcke, V., and Rabinovich, A. (2014). Going deeper with convolutions. Technical
report, arXiv:1409.4842. 20, 21, 23, 204, 236
Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014). Deepface: Closing the gap to
human-level performance in face verification. In CVPR’2014 . 85
Tang, Y. and Eliasmith, C. (2010). Deep networks for robust visual recognition. In
Proceedings of the 27th International Conference on Machine Learning, June 21-24,
2010, Haifa, Israel. 187
Tenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global geometric framework
for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323. 139, 400, 401,
429
Thrun, S. (1995). Learning to play the game of chess. In NIPS’1994 . 441
Tibshirani, R. J. (1995). Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society B, 58, 267–288. 183
Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to
the likelihood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors,
ICML 2008 , pages 1064–1071. ACM. 451, 487
Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal components analysis.
Journal of the Royal Statistical Society B, 61(3), 611–622. 378, 379
Torabi, A., Pal, C. J., Larochelle, H., and Courville, A. C. (2015). Using descriptive
video services to create a large data source for video annotation research. CoRR,
abs/1503.01070. 128
Uria, B., Murray, I., and Larochelle, H. (2013). RNADE: The real-valued neural autore-
gressive density-estimator. In NIPS’2013 . 264, 266
van der Maaten, L. and Hinton, G. E. (2008a). Visualizing data using t-SNE. J. Machine
Learning Res., 9. 321, 400, 429, 433
van der Maaten, L. and Hinton, G. E. (2008b). Visualizing data using t-SNE. Journal of
Machine Learning Research, 9, 2579–2605. 401
Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer-
Verlag, Berlin. 97
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York.
97
Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative
frequencies of events to their probabilities. Theory of Probability and Its Applications,
16, 264–280. 97
Vincent, P. (2011a). A connection between score matching and denoising autoencoders.
Neural Computation, 23(7). 390, 392, 507
Vincent, P. (2011b). A connection between score matching and denoising autoencoders.
Neural Computation, 23(7), 1661–1674. 457, 509
Vincent, P. and Bengio, Y. (2003). Manifold Parzen windows. In NIPS’2002 . MIT Press.
431
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and
composing robust features with denoising autoencoders. In ICML 2008 . 387
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked
denoising autoencoders: Learning useful representations in a deep network with a local
denoising criterion. J. Machine Learning Res., 11. 387
Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., and Hinton, G. (2014a). Gram-
mar as a foreign language. Technical report, arXiv:1412.7449. 273
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2014b). Show and tell: a neural
image caption generator. arXiv 1411.4555. 273
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and tell: a neural image
caption generator. In CVPR’2015 . arXiv:1411.4555. 86
Von Melchner, L., Pallas, S. L., and Sur, M. (2000). Visual behaviour mediated by retinal
projections directed to the auditory pathway. Nature, 404(6780), 871–876. 14
Wager, S., Wang, S., and Liang, P. (2013). Dropout training as adaptive regularization.
In Advances in Neural Information Processing Systems 26 , pages 351–359. 201
Waibel, A., Hanazawa, T., Hinton, G. E., Shikano, K., and Lang, K. (1989). Phoneme
recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech,
and Signal Processing, 37, 328–339. 250, 312, 318
Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of
neural networks using dropconnect. In ICML’2013. 202
Wang, S. and Manning, C. (2013). Fast dropout training. In ICML’2013. 201
Warde-Farley, D., Goodfellow, I. J., Courville, A., and Bengio, Y. (2014). An empirical
analysis of dropout in piecewise linear networks. In ICLR’2014 . 201
Weinberger, K. Q. and Saul, L. K. (2004). Unsupervised learning of image manifolds by
semidefinite programming. In CVPR’2004 , pages 988–995. 139, 429
Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised em-
bedding. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008 , pages
1168–1175, New York, NY, USA. ACM. 411
Weston, J., Bengio, S., and Usunier, N. (2010). Large scale image annotation: learning
to rank with joint word-image embeddings. Machine Learning, 81(1), 21–35. 261
White, H. (1990). Connectionist nonparametric regression: Multilayer feedforward net-
works can learn arbitrary mappings. Neural Networks, 3(5), 535–549. 170
Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. In 1960 IRE WESCON
Convention Record, volume 4, pages 96–104. IRE, New York. 13, 19, 20, 23
Wikipedia (2015). List of animals by number of neurons. Wikipedia, the free encyclopedia.
[Online; accessed 4-March-2015]. 20, 23
Williams, C. K. I. and Rasmussen, C. E. (1996). Gaussian processes for regression. In
NIPS’95 , pages 514–520. MIT Press, Cambridge, MA. 172
Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms.
Neural Computation, 8(7), 1341–1390. 99, 170
Wu, R., Yan, S., Shan, Y., Dang, Q., and Sun, G. (2015). Deep image: Scaling up image
recognition. arXiv:1501.02876. 21
Xiong, H. Y., Barash, Y., and Frey, B. J. (2011). Bayesian prediction of tissue-regulated
splicing using RNA sequence and cellular context. Bioinformatics, 27(18), 2554–2562.
201
Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. S.,
and Bengio, Y. (2015a). Show, attend and tell: Neural image caption generation with
visual attention. 86
Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. S.,
and Bengio, Y. (2015b). Show, attend and tell: Neural image caption generation with
visual attention. arXiv:1502.03044. 273
Xu, L. and Jordan, M. I. (1996). On convergence properties of the EM algorithm for
Gaussian mixtures. Neural Computation, 8, 129–151. 287
Younes, L. (1998). On the convergence of Markovian stochastic algorithms with rapidly
decreasing ergodicity rates. In Stochastics and Stochastics Models, pages 177–228. 449,
487
Zaslavsky, T. (1975). Facing Up to Arrangements: Face-Count Formulas for Partitions
of Space by Hyperplanes. Number no. 154 in Memoirs of the American Mathematical
Society. American Mathematical Society. 419
Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional net-
works. In ECCV’14 . 6
Zhou, J. and Troyanskaya, O. G. (2014). Deep supervised and convolutional generative
stochastic network for protein secondary structure prediction. In ICML’2014 . 514, 515
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society, Series B, 67(2), 301–320. 157
Zöhrer, M. and Pernkopf, F. (2014). General stochastic networks for classification. In
NIPS’2014 . 514
Index
L^p norm, 47
Active constraint, 116
ADALINE, see Adaptive Linear Element
Adaptive Linear Element, 18, 28, 32
Affine, 135
AIS, see annealed importance sampling
Almost everywhere, 92
Ancestral sampling, 505
ANN, see Artificial neural network
Annealed importance sampling, 638, 685
Approximate inference, 496
Artificial intelligence, 1
Artificial neural network, see Neural net-
work
Asymptotically unbiased, 150
Autoencoder, 6
Bagging, 269
Bayes’ rule, 90, 91
Bayesian hyperparameter optimization, 426
Bayesian network, see directed graphical model
Bayesian probability, 68
Bayesian statistics, 166
Beam Search, 416
Belief network, see directed graphical model
Bernoulli distribution, 82
Bias, 150
Boltzmann distribution, 482
Boltzmann machine, 483
Boltzmann Machines, 660
Broadcasting, 39
Calculus of variations, 654
Categorical distribution, see multinoulli distribution, 82
CD, see contrastive divergence
Centering trick (DBM), 690
Central limit theorem, 85
Chain rule of probability, 74
Chess, 2
Chord, 490
Chordal graph, 490
Classical regularization, 253
Classification, 121
Cliffs, 297
Clipping the gradient, 395
Clique potential, see factor (graphical model)
CNN, see convolutional neural network
Collider, see explaining away
Computer vision, 431
Conditional computation, see dynamically
structured nets, 423
Conditional Computation in Neural Nets,
460
Conditional independence, vi, 75
Conditional probability, 73
Connectionism, 20
consistency, 159
Constrained optimization, 114
Context-specific independence, 486
Contrast, 434
Contrastive divergence, 619, 685, 691
Convolution, 316, 696
Convolutional network, 20
Convolutional neural network, 316
Coordinate descent, 311, 690
Correlation, 77
Cost function, see objective function
Covariance, vi, 76
Covariance matrix, 77
Cross-entropy, 215
cross-entropy, 162
Cross-validation, 147
curse of dimensionality, 187
Cyc, 3
D-separation, 485
Data generating distribution, 136
Data generating process, 136
Dataset, 128
Dataset augmentation, 433, 440
DBM, see deep Boltzmann machine
Decoder, 6
Deep belief network, 32, 645, 661, 674, 697
Deep Blue, 2
Deep Boltzmann machine, 28, 32, 645, 661,
677, 691, 697
Deep learning, 2, 7
Denoising score matching, 631
Density estimation, 125
Derivative, vi, 102
Detector layer, 328
Diagonal matrix, 49
Dirac delta function, 86
Directed graphical model, 473
Directional derivative, 105
Distributed representation, 21, 574
domain adaptation, 559
Dot product, 41
Doubly block circulant matrix, 319
Dream sleep, 618, 659
DropConnect, 287
Dropout, 282, 426, 427, 691
Dynamically structured networks, 423
E-step, 649
Early stopping, 226, 272, 274, 276–278
EBM, see energy-based model
Echo state network, 28, 32
Effective number of parameters, 257
Efficiency, 164
Eigendecomposition, 51
Eigenvalue, 52
Eigenvector, 52
ELBO, see evidence lower bound
Element-wise product, see Hadamard product
EM, see expectation maximization
Embedding, 594
Empirical distribution, 86
Empirical risk, 292
Empirical risk minimization, 292
Encoder, 6
Energy function, 482
Energy-based model, 482, 678
Ensemble methods, 269
Epoch, 295, 307
Equality constraint, 115
Equivariance, 324
Error function, see objective function
Euclidean norm, 47
Euler-Lagrange equation, 655
Evidence lower bound, 647, 648, 650, 652,
676
Example, 128
Expectation, 76
Expectation maximization, 649
Expected value, see expectation
Explaining away, 487
Factor (graphical model), 478
Factor graph, 492
Factors of variation, 6
Feature, 128
Fourier transform, 344
Fovea, 348
Frequentist probability, 68
Frequentist statistics, 166
Functional derivatives, 654
Gaussian distribution, see Normal distribution, 83
Gaussian kernel, 179
Gaussian mixture, 88
GCN, see Global contrast normalization
Generalization, 135
Generalized Lagrange function, see Gener-
alized Lagrangian
Generalized Lagrangian, 115
Gibbs distribution, 479
Gibbs sampling, 507
Global contrast normalization, 434
GPU, see Graphics processing unit
Gradient, 105
Gradient clipping, 395
Gradient descent, 107
Graph, iv, v
Graph Transformer, 414
Graphical model, see structured probabilis-
tic model
Graphics processing unit, 419
Greedy layer-wise unsupervised pre-training,
550
Grid search, 426
Hadamard product, v, 41
Harmonium, see Restricted Boltzmann machine, 500
Harmony theory, 484
Helmholtz free energy, see evidence lower
bound
Hessian matrix, vi, 108
Hidden layer, 9
Hyperparameters, 144, 426
i.i.d., 148
i.i.d. assumptions, 136
Identity matrix, 43
Immorality, 489
Independence, vi, 75
Independent and identically distributed, 148
Inequality constraint, 115
Inference, 471, 496, 645, 646, 648–650, 652,
654, 658
Integral, vi
Invariance, 328
Jacobian matrix, vi, 93, 108
Joint probability, 70
Karush-Kuhn-Tucker conditions, 117
Karush–Kuhn–Tucker, 114
Kernel (convolution), 318, 319
Kernel trick, 178
KKT, see Karush–Kuhn–Tucker
KKT conditions, see Karush-Kuhn-Tucker
conditions
KL divergence, see Kullback-Leibler divergence, 81
Knowledge base, 3
Kullback-Leibler divergence, vi, 81
Lagrange multipliers, 114, 117, 656
Lagrangian, see Generalized Lagrangian, 115
Latent variable, 518
LCN, see local contrast normalization
Line search, 107
Linear combination, 45
Linear dependence, 46
Linear regression, 132, 135, 177
Local conditional probability distribution,
474
Local contrast normalization, 437
Logistic regression, 3, 178
Logistic sigmoid, 10, 88
Loop, 490
Loss function, see objective function
LSTM, 29
M-step, 649
Machine learning, 3
Main diagonal, 39
Manifold, 198
Manifold hypothesis, 200, 590
Manifold learning, 198, 590
MAP inference, 652
Marginal probability, 72
Markov chain, 505
Markov network, see undirected model, 476
Markov random field, see undirected model, 476
Matrix, iv, v, 38
Matrix inverse, 43
Matrix product, 40
Max pooling, 328
Maximum likelihood, 160
Mean field, 685, 691
Mean squared error, 133
Measure theory, 91
Measure zero, 92
Method of steepest descent, see gradient de-
scent
Missing inputs, 122
Mixing (Markov chain), 508
Mixture distribution, 87
MLP, see multilayer perceptron
MNIST, 691
Model averaging, 269
Model capacity, 426
Model compression, 422
Moore-Penrose pseudoinverse, 57, 265
Moralized graph, 490
MP-DBM, see multi-prediction DBM
MRF (Markov Random Field), see undirected model, 476
MSE, see mean squared error, 133
Multi-modal learning, 568
Multi-prediction DBM, 689, 690
Multi-task learning, 287
Multilayer perceptron, 7, 32
Multinomial distribution, 82
Multinoulli distribution, 82
Naive Bayes, 4, 93
Nat, 79
natural image, 468
Negative definite, 109
Negative phase, 615, 617
Neocognitron, 20, 28, 32
Nesterov momentum, 308
Netflix Grand Prize, 272
Neural network, 16
Neuroscience, 18
Noise-contrastive estimation, 632
Non-parametric, 141
Norm, vi, 47
Normal distribution, 83, 85
Normal equations, 257
Object detection, 432
Object recognition, 432
Objective function, 101
Offset, 211
one-shot learning, 565
Orthodox statistics, see frequentist statis-
tics
Orthogonal matrix, 51
Orthogonality, 50
Overfitting, 426
Parallel distributed processing, 20
Parameter sharing, 321
Parametric, 141
Partial derivative, 105
Partition function, 480, 613, 685
PCA, see principal components analysis
PCD, see stochastic maximum likelihood
Perceptron, 17, 32
Perplexity, 164
Persistent contrastive divergence, see stochas-
tic maximum likelihood
Point Estimator, 148
Pooling, 316, 696
Positive definite, 109
Positive phase, 615, 617
Pre-training, 550
Precision (of a normal distribution), 83, 86
Predictive sparse decomposition, 516, 532
Preprocessing, 432
Primary visual cortex, 346
Principal components analysis, 59, 183–186, 201, 438, 645
Prior, 166
Prior probability distribution, 166
Probabilistic max pooling, 696
Probability density function, 71
Probability distribution, 70
Probability function estimation, 125
Probability mass function, 70
Product rule of probability, see chain rule
of probability
PSD, see predictive sparse decomposition
Pseudolikelihood, 625
Random search, 426
Random variable, 69
Ratio matching, 630
RBM, see restricted Boltzmann machine
Receptive field, 322
Recurrent network, 32
Regression, 123
Regularization, 250, 426
Representation learning, 4
Restricted Boltzmann machine, 500, 645,
661, 664, 691, 693, 694, 696
Ridge regression, 254
Risk, 292
Sample mean, 151
Scalar, iv, v, 37
Score matching, 628
Second derivative, 108
Second derivative test, 108
Self-information, 79
Semi-supervised learning, 181
Separable convolution, 344
Separation (probabilistic modeling), 484
Set, iv, v
SGD, see stochastic gradient descent
Shannon entropy, vi, 79, 655
Sigmoid, vi, see logistic sigmoid
Sigmoid belief network, 32
Simple cell, 347
Singular value, see singular value decompo-
sition
Singular value decomposition, 55, 184, 185
Singular vector, see singular value decom-
position
SML, see stochastic maximum likelihood
Softmax, 217
Softplus, vi, 88
Spam detection, 4
Sparse coding, 527, 646
Spearmint, 426
spectral radius, 381
Sphering, see Whitening, 436
Spike and slab restricted Boltzmann ma-
chine, 694
Square matrix, 46
ssRBM, see spike and slab restricted Boltz-
mann machine
Standard deviation, 76
Statistic, 148
Statistical learning theory, 136
Steepest descent, see gradient descent
Stochastic gradient descent, 18, 295, 307,
691
Stochastic maximum likelihood, 622, 686,
691
Stochastic pooling, 287
Structure learning, 495
Structured output, 123, 124
Structured probabilistic model, 466
Sum rule of probability, 73
Supervised learning, 129
Support vector machine, 178
Surrogate loss function, 293
SVD, see singular value decomposition
Symmetric matrix, 50, 54
Tangent plane, 594
Tensor, iv, v, 39
Test set, 136
Tiled convolution, 338
Toeplitz matrix, 319
Trace operator, 58
Training error, 136
Transcription, 123
Transfer learning, 559
Transpose, v, 39
Triangle inequality, 47
Triangulated graph, see chordal graph
Unbiased, 150
Undirected model, 476
Uniform distribution, 71
Unit norm, 50
Unit vector, 50
Unnormalized probability distribution, 478
Unsupervised learning, 128, 181
Unsupervised pre-training, 550
V-structure, see explaining away
V1, 346
Variance, vi, 76
Variational derivatives, see functional deriva-
tives
Variational free energy, see evidence lower
bound
Vector, iv, v, 37
Visible layer, 9
Viterbi decoding, 406
Weight decay, 254, 427
Weights, 17, 132
Whitening, 436, 438
ZCA, see zero-phase components analysis
zero-data learning, 565
Zero-phase components analysis, 438
zero-shot learning, 565