Bibliography
Alain, G. and Bengio, Y. (2012). What regularized auto-encoders learn from the data generating
distribution. Technical report, arXiv:1211.4246, Université de Montréal. 224
Alain, G. and Bengio, Y. (2013). What regularized auto-encoders learn from the data generating
distribution. In ICLR’2013. Also published as arXiv:1211.4246. 224, 226
Amari, S. (1997). Neural learning in structured parameter spaces - natural Riemannian gradient.
In Advances in Neural Information Processing Systems, pages 127–133. MIT Press. 105
Bahl, L. R., Brown, P., de Souza, P. V., and Mercer, R. L. (1987). Speech recognition with
continuous-parameter hidden Markov models. Computer Speech and Language, 2, 219–234.
47
Baldi, P. and Sadowski, P. J. (2013). Understanding dropout. In Advances in Neural Information
Processing Systems 26 , pages 2814–2822. 128
Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function.
IEEE Trans. on Information Theory, 39, 930–945. 112
Bartholomew, D. J. (1987). Latent variable models and factor analysis. Oxford University Press.
159
Basilevsky, A. (1994). Statistical Factor Analysis and Related Methods: Theory and Applications.
Wiley. 159
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard,
N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning
and Unsupervised Feature Learning NIPS 2012 Workshop. 55
Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Computation, 15(6), 1373–1396. 91, 211
Bengio, S. and Bengio, Y. (2000a). Taking on the curse of dimensionality in joint distributions
using neural networks. IEEE Transactions on Neural Networks, special issue on Data Mining
and Knowledge Discovery, 11(3), 550–557. 202
Bengio, Y. (2009). Learning deep architectures for AI . Now Publishers. 19, 87, 114
Bengio, Y. (2011). Deep learning of representations for unsupervised and transfer learning. In
JMLR W&CP: Proc. Unsupervised and Transfer Learning. 18
Bengio, Y. and Bengio, S. (2000b). Modeling high-dimensional discrete data with multi-layer
neural networks. In NIPS’99, pages 400–406. MIT Press. 202, 204, 205
Bengio, Y. and Delalleau, O. (2009). Justifying and generalizing contrastive divergence. Neural
Computation, 21(6), 1601–1621. 240
Bengio, Y. and LeCun, Y. (2007a). Scaling learning algorithms towards AI. In Large Scale
Kernel Machines. 87
Bengio, Y. and LeCun, Y. (2007b). Scaling learning algorithms towards AI. In L. Bottou,
O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press.
115
Bengio, Y. and Monperrus, M. (2005). Non-local manifold tangent learning. In NIPS’04, pages
129–136. MIT Press. 90, 212, 213
Bengio, Y., Ducharme, R., and Vincent, P. (2001a). A neural probabilistic language model. In
NIPS’00 , pages 932–938. MIT Press. 15
Bengio, Y., Ducharme, R., and Vincent, P. (2001b). A neural probabilistic language model. In
T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in NIPS 13, pages 932–938.
MIT Press. 215, 216
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language
model. Journal of Machine Learning Research, 3, 1137–1155. 215, 216
Bengio, Y., Delalleau, O., and Le Roux, N. (2006a). The curse of highly variable functions for
local kernel machines. In NIPS’2005. 87
Bengio, Y., Larochelle, H., and Vincent, P. (2006b). Non-local manifold Parzen windows. In
NIPS’2005 . MIT Press. 90, 212
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of
deep networks. In NIPS’2006. 15, 121, 164
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In
ICML’09 . 106
Bengio, Y., Yao, L., Alain, G., and Vincent, P. (2013). Generalized denoising auto-encoders as
generative models. In Advances in Neural Information Processing Systems 26 (NIPS’13). 227
Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014). Deep generative stochastic
networks trainable by backprop. In Proceedings of the 30th International Conference on
Machine Learning (ICML’14). 227, 228
Bennett, C. (1976). Efficient estimation of free energy differences from Monte Carlo data. Journal
of Computational Physics, 22(2), 245–268. 236
Berglund, M. and Raiko, T. (2013). Stochastic gradient estimate variance in contrastive diver-
gence and persistent contrastive divergence. CoRR, abs/1312.6002. 243
Bergstra, J. (2011). Incorporating Complex Cells into Neural Networks for Pattern Classification.
Ph.D. thesis, Université de Montréal. 155
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J.,
Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler.
In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.
55
Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24(3), 179–195. 245
Bishop, C. M. (1994). Mixture density networks. 102
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1989). Learnability and the
Vapnik–Chervonenkis dimension. Journal of the ACM, 36(4), 929–965. 75, 76
Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2012). Joint learning of words and meaning
representations for open-text semantic parsing. AISTATS’2012 . 200
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal margin
classifiers. In COLT ’92: Proceedings of the fifth annual workshop on Computational learning
theory, pages 144–152, New York, NY, USA. ACM. 15, 87, 98
Bottou, L. (2011). From machine learning to machine reasoning. Technical report,
arXiv:1102.1808. 199, 200
Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer perceptrons and singular
value decomposition. Biological Cybernetics, 59, 291–294. 152
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press, New
York, NY, USA. 61
Brand, M. (2003). Charting a manifold. In NIPS’2002 , pages 961–968. MIT Press. 91, 211
Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140. 123
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regres-
sion Trees. Wadsworth International Group, Belmont, CA. 87
Carreira-Perpiñán, M. A. and Hinton, G. E. (2005). On contrastive divergence learning. In
R. G. Cowell and Z. Ghahramani, editors, AISTATS’2005 , pages 33–40. Society for Artificial
Intelligence and Statistics. 240
Cayton, L. (2005). Algorithms for manifold learning. Technical Report CS2008-0923, UCSD.
14, 91, 207
Ciresan, D., Meier, U., Masci, J., and Schmidhuber, J. (2012). Multi-column deep neural network
for traffic sign classification. Neural Networks, 32, 333–338. 16, 114
Coates, A., Lee, H., and Ng, A. Y. (2011). An analysis of single-layer networks in unsuper-
vised feature learning. In Proceedings of the Thirteenth International Conference on Artificial
Intelligence and Statistics (AISTATS 2011). 278
Collobert, R. (2004). Large Scale Machine Learning. Ph.D. thesis, Université de Paris VI, LIP6.
99
Comon, P. (1994). Independent component analysis - a new concept? Signal Processing, 36,
287–314. 162
Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273–297.
15, 87
Couprie, C., Farabet, C., Najman, L., and LeCun, Y. (2013). Indoor semantic segmentation using
depth information. In International Conference on Learning Representations (ICLR2013). 16,
114
Courville, A., Bergstra, J., and Bengio, Y. (2011). Unsupervised models of images by spike-and-
slab RBMs. In ICML’11. 134
Cover, T. (2006). Elements of Information Theory. Wiley-Interscience. 41
Crick, F. H. C. and Mitchison, G. (1983). The function of dream sleep. Nature, 304, 111–114.
239
Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition with the
mean-covariance restricted Boltzmann machine. In NIPS’2010 . 15
Dauphin, Y. and Bengio, Y. (2013a). Big neural networks waste capacity. In ICLR’2013 work-
shops track (oral presentation), arXiv:1301.3583. 18
Dauphin, Y. and Bengio, Y. (2013b). Stochastic ratio matching of RBMs for sparse high-
dimensional inputs. In NIPS26. NIPS Foundation. 248
Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying
and attacking the saddle point problem in high-dimensional non-convex optimization. In
NIPS’2014 . 59
Davis, A., Rubinstein, M., Wadhwa, N., Mysore, G., Durand, F., and Freeman, W. T. (2014).
The visual microphone: Passive recovery of sound from video. ACM Transactions on Graphics
(Proc. SIGGRAPH), 33(4), 79:1–79:10. 276
Delalleau, O. and Bengio, Y. (2011). Shallow vs. deep sum-product networks. In NIPS. 114
Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton, G. (2010). Binary coding
of speech spectrograms using a deep auto-encoder. In Interspeech 2010 , Makuhari, Chiba,
Japan. 15
Desjardins, G. and Bengio, Y. (2008). Empirical evaluation of convolutional RBMs for vision.
Technical Report 1327, Département d'Informatique et de Recherche Opérationnelle, Univer-
sité de Montréal. 268
Desjardins, G., Courville, A., and Bengio, Y. (2011). On tracking the partition function. In
NIPS’2011 . 237
Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., and Makhoul, J. (2014). Fast and
robust neural network joint models for statistical machine translation. In Proc. ACL’2014.
10
Donoho, D. L. and Grimes, C. (2003). Hessian eigenmaps: new locally linear embedding tech-
niques for high-dimensional data. Technical Report 2003-08, Dept. Statistics, Stanford Uni-
versity. 91, 211
Dugas, C., Bengio, Y., Bélisle, F., and Nadeau, C. (2001). Incorporating second-order functional
knowledge for better option pricing. In NIPS’00, pages 472–478. MIT Press. 99
Ebrahimi, S., Pal, C., Bouthillier, X., Froumenty, P., Jean, S., Konda, K. R., Vincent, P.,
Courville, A., and Bengio, Y. (2013). Combining modality specific deep neural network mod-
els for emotion recognition in video. In Emotion Recognition In The Wild Challenge and
Workshop (Emotiw2013). 7, 114
Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S. (2010). Why
does unsupervised pre-training help deep learning? JMLR, 11, 625–660. 18
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013). Learning hierarchical features for
scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence. in press.
16, 114
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of
Eugenics, 7, 179–188. 70
Frasconi, P., Gori, M., and Sperduti, A. (1997). On the efficient classification of data structures
by neural networks. In Proc. Int. Joint Conf. on Artificial Intelligence. 199, 200
Frasconi, P., Gori, M., and Sperduti, A. (1998). A general framework for adaptive processing of
data structures. IEEE Transactions on Neural Networks, 9(5), 768–786. 200
Frey, B. J. (1998). Graphical models for machine learning and digital communication. MIT
Press. 201
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism
of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202. 15
Girosi, F. (1994). Regularization theory, radial basis functions and networks. In V. Cherkassky,
J. Friedman, and H. Wechsler, editors, From Statistics to Neural Networks, volume 136 of
NATO ASI Series, pages 166–187. Springer Berlin Heidelberg. 112
Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward
neural networks. In AISTATS’2010. 15
Glorot, X., Bordes, A., and Bengio, Y. (2011a). Deep sparse rectifier neural networks. In
AISTATS’2011 . 15, 99, 221, 222
Glorot, X., Bordes, A., and Bengio, Y. (2011b). Domain adaptation for large-scale sentiment
classification: A deep learning approach. In ICML’2011. 221
Gong, S., McKenna, S., and Psarrou, A. (2000). Dynamic Vision: From Images to Face Recog-
nition. Imperial College Press. 211, 214
Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep networks. In
NIPS’2009 , pages 646–654. 155, 221
Goodfellow, I., Koenig, N., Muja, M., Pantofaru, C., Sorokin, A., and Takayama, L. (2010).
Help me help you: Interfaces for personal robots. In Proc. of Human Robot Interaction
(HRI), Osaka, Japan. ACM Press, ACM Press. 68
Goodfellow, I., Courville, A., and Bengio, Y. (2012). Large-scale feature learning with spike-
and-slab sparse coding. In ICML’2012 . 164
Goodfellow, I. J. (2010). Technical report: Multidimensional, downsampled convolution for
autoencoders. Technical report, Université de Montréal. 183
Goodfellow, I. J., Courville, A., and Bengio, Y. (2011). Spike-and-slab sparse coding for unsu-
pervised feature discovery. In NIPS Workshop on Challenges in Learning Hierarchical Models.
10, 18, 114, 166
Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a). Maxout
networks. In ICML’2013 . 15
Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013b). Maxout
networks. In S. Dasgupta and D. McAllester, editors, ICML’13 , pages 1319–1327. 126, 278
Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013c). Multi-prediction deep
Boltzmann machines. In NIPS26. NIPS Foundation. 246, 265, 267
Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks. Studies in
Computational Intelligence. Springer. 197
Graves, A. (2013). Generating sequences with recurrent neural networks. Technical report, arXiv
preprint arXiv:1308.0850. 103
Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional
LSTM and other neural network architectures. Neural Networks, 18(5), 602–610. 197
Graves, A. and Schmidhuber, J. (2009). Offline handwriting recognition with multidimensional
recurrent neural networks. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors,
NIPS’2008 , pages 545–552. 197
Graves, A., Liwicki, M., Bunke, H., Schmidhuber, J., and Fernández, S. (2008). Unconstrained
on-line handwriting recognition with recurrent neural networks. In J. Platt, D. Koller,
Y. Singer, and S. Roweis, editors, NIPS’2007 , pages 577–584. 197
Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep recurrent
neural networks. In ICASSP’2013, pages 6645–6649. IEEE. 198
Gulcehre, C. and Bengio, Y. (2013). Knowledge matters: Importance of prior information for
optimization. In International Conference on Learning Representations (ICLR’2013). 19
Gutmann, M. and Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation prin-
ciple for unnormalized statistical models. In Proceedings of The Thirteenth International
Conference on Artificial Intelligence and Statistics (AISTATS’10). 248
Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of
the 18th annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley, California.
ACM Press. 114
Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Compu-
tational Complexity, 1, 113–129. 114
Herault, J. and Ans, B. (1984). Circuits neuronaux à synapses modifiables: Décodage de mes-
sages composites par apprentissage non supervisé. Comptes Rendus de l'Académie des Sci-
ences, 299(III-13), 525–528. 162
Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen,
P., Sainath, T., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling in
speech recognition. IEEE Signal Processing Magazine, 29(6), 82–97. 10, 16
Hinton, G. E. (2000). Training products of experts by minimizing contrastive divergence. Tech-
nical Report GCNU TR 2000-004, Gatsby Unit, University College London. 239
Hinton, G. E. and Roweis, S. (2003). Stochastic neighbor embedding. In NIPS’2002 . 211
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural
networks. Science, 313(5786), 504–507. 121, 157, 164, 165
Hinton, G. E. and Zemel, R. S. (1994). Autoencoders, minimum description length, and
Helmholtz free energy. In NIPS’1993 . 152
Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets.
Neural Computation, 18, 1527–1554. 15, 121, 164, 165, 262
Hsu, F.-H. (2002). Behind Deep Blue: Building the Computer That Defeated the World Chess
Champion. Princeton University Press, Princeton, NJ, USA. 2
Huang, F. and Ogata, Y. (2002). Generalized pseudo-likelihood estimates for Markov random
fields on lattice. Annals of the Institute of Statistical Mathematics, 54(1), 1–18. 245
Hyotyniemi, H. (1996). Turing machines are recurrent neural networks. In STeP’96, pages
13–24. 189
Hyvärinen, A. (1999). Survey on independent component analysis. Neural Computing Surveys,
2, 94–128. 162
Hyvärinen, A. (2005). Estimation of non-normalized statistical models using score matching.
Journal of Machine Learning Research, 6, 695–709. 246
Hyvärinen, A. (2007a). Connections between score matching, contrastive divergence, and pseu-
dolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks, 18,
1529–1531. 247
Hyvärinen, A. (2007b). Some extensions of score matching. Computational Statistics and Data
Analysis, 51, 2499–2512. 247
Hyvärinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis. Wiley-
Interscience. 162
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixture of local
experts. Neural Computation, 3, 79–87. 102
Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage
architecture for object recognition? In ICCV’09 . 99
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. 34
Jordan, M. I. (1998). Learning in Graphical Models. Kluwer, Dordrecht, Netherlands. 15
Jutten, C. and Herault, J. (1991). Blind separation of sources, part I: an adaptive algorithm
based on neuromimetic architecture. Signal Processing, 24, 1–10. 162
Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008). Fast inference in sparse coding algorithms
with applications to object recognition. CBLL-TR-2008-12-01, NYU. 155
Kindermann, R. (1980). Markov Random Fields and Their Applications (Contemporary Mathe-
matics ; V. 1). American Mathematical Society. 138
Kingma, D. and LeCun, Y. (2010). Regularized estimation of image statistics by score matching.
In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances
in Neural Information Processing Systems 23 , pages 1126–1134. 248
Kingma, D. P. and Welling, M. (2014). Auto-encoding variational bayes. In Proceedings of the
International Conference on Learning Representations (ICLR). 214, 215
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques.
MIT Press. 145, 147
Koren, Y. (2009). The BellKor solution to the Netflix Grand Prize. 125
Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images.
Technical report, University of Toronto. 134
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012a). ImageNet classification with deep convo-
lutional neural networks. In NIPS’2012. 7, 10, 15, 114
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012b). ImageNet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems 25
(NIPS’2012). 68
Lake, B., Salakhutdinov, R., and Tenenbaum, J. (2013). One-shot learning by inverting a
compositional causal process. In NIPS’2013 . 18
Lang, K. J. and Hinton, G. E. (1988). The development of the time-delay neural network archi-
tecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-Mellon University.
187
Larochelle, H. and Bengio, Y. (2008). Classification using discriminative restricted Boltzmann
machines. In ICML’2008 . 155
Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator. In
AISTATS’2011 . 200, 204
Larochelle, H., Erhan, D., and Bengio, Y. (2008). Zero-data learning of new tasks. In AAAI
Conference on Artificial Intelligence. 18
Le Roux, N., Manzagol, P.-A., and Bengio, Y. (2008). Topmoumoute online natural gradient
algorithm. In NIPS’07. 105
LeCun, Y. (1987). Modèles connexionistes de l'apprentissage. Ph.D. thesis, Université de Paris
VI. 14, 152
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel,
L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Compu-
tation, 1(4), 541–551. 15
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998a). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11), 2278–2324. 15
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998b). Gradient based learning applied to
document recognition. Proc. IEEE . 15
Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In
NIPS’07 . 155
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). Convolutional deep belief networks for
scalable unsupervised learning of hierarchical representations. In L. Bottou and M. Littman,
editors, ICML 2009 . ACM, Montreal, Canada. 268, 269
Linde, N. (1992). The machine that changed the world, episode 3. Documentary miniseries. 3
Long, P. M. and Servedio, R. A. (2010). Restricted Boltzmann machines are hard to approxi-
mately evaluate or simulate. In Proceedings of the 27th International Conference on Machine
Learning (ICML’10). 260
Lovelace, A. (1842). Notes upon L. F. Menabrea’s “Sketch of the Analytical Engine invented by
Charles Babbage”. 2
Luo, H., Carrier, P.-L., Courville, A., and Bengio, Y. (2013). Texture modeling with convolu-
tional spike-and-slab RBMs and deep extensions. In AISTATS’2013. 69
Lyu, S. (2009). Interpretation and generalization of score matching. In UAI’09. 247
Marlin, B., Swersky, K., Chen, B., and de Freitas, N. (2010). Inductive principles for restricted
Boltzmann machine learning. In Proceedings of The Thirteenth International Conference on
Artificial Intelligence and Statistics (AISTATS’10), volume 9, pages 509–516. 243, 247
Mase, S. (1995). Consistency of the maximum pseudo-likelihood estimator of continuous state
space Gibbsian processes. The Annals of Applied Probability, 5(3), pp. 603–612. 245
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall, London.
100
Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E., Muller,
X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra, J. (2011).
Unsupervised and transfer learning challenge: a deep learning approach. In JMLR W&CP:
Proc. Unsupervised and Transfer Learning, volume 7. 10, 18, 114, 166
Mesnil, G., Rifai, S., Dauphin, Y., Bengio, Y., and Vincent, P. (2012). Surfing on the manifold.
Learning Workshop, Snowbird. 226
Mikolov, T. (2012). Statistical Language Models based on Neural Networks. Ph.D. thesis, Brno
University of Technology. 103
Minka, T. (2005). Divergence measures and message passing. Technical Report MSR-TR-2005-173,
Microsoft Research, Cambridge, UK. 233
Minsky, M. L. and Papert, S. A. (1969). Perceptrons. MIT Press, Cambridge. 14
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York. 67
Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-
contrastive estimation. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Wein-
berger, editors, Advances in Neural Information Processing Systems 26 , pages 2265–2273.
Curran Associates, Inc. 250
Mor-Yosef, S., Samueloff, A., Modan, B., Navot, D., and Schenker, J. G. (1990). Ranking the
risk factors for cesarean: logistic regression analysis of a nationwide study. Obstet Gynecol,
75(6), 944–7. 3
Murphy, K. P. (2012). Machine Learning: a Probabilistic Perspective. MIT Press, Cambridge,
MA, USA. 101
Uria, B., Murray, I., and Larochelle, H. (2014). A deep and tractable density estimator. In
ICML’2014. 103, 204, 205
Nair, V. and Hinton, G. (2010a). Rectified linear units improve restricted Boltzmann machines.
In ICML’2010. 99
Nair, V. and Hinton, G. E. (2010b). Rectified linear units improve restricted Boltzmann ma-
chines. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh International
Conference on Machine Learning (ICML-10), pages 807–814. ACM. 15
Narayanan, H. and Mitter, S. (2010). Sample complexity of testing the manifold hypothesis. In
NIPS’2010 . 14, 91, 207
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics.
Springer. 127
Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2), 125–139.
235, 236
Neal, R. M. (2005). Estimating ratios of normalizing constants using linked importance sampling.
236, 237
Niranjan, M. and Fallside, F. (1990). Neural networks and radial basis functions in classifying
static speech patterns. Computer Speech and Language, 4, 275–289. 98
Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer. 65
Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field properties by
learning a sparse code for natural images. Nature, 381, 607–609. 154, 155
Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set: a
strategy employed by V1? Vision Research, 37, 3311–3325. 221
Park, H., Amari, S.-I., and Fukumizu, K. (2000). Adaptive natural gradient learning algorithms
for various stochastic models. Neural Networks, 13(7), 755–764. 105
Pascanu, R. and Bengio, Y. (2012). On the difficulty of training recurrent neural networks.
Technical Report arXiv:1211.5063, Université de Montréal. 103
Pascanu, R. and Bengio, Y. (2013). Revisiting natural gradient for deep networks. Technical
report, arXiv:1301.3584. 105
Pascanu, R., Mikolov, T., and Bengio, Y. (2013a). On the difficulty of training recurrent neural
networks. In ICML’2013 . 103
Pascanu, R., Montufar, G., and Bengio, Y. (2013b). On the number of inference regions of
deep feed forward networks with piece-wise linear activations. Technical report, U. Montreal,
arXiv:1312.6098. 114
Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. (2014). How to construct deep recurrent
neural networks. In ICLR’2014. 127
Pearl, J. (1985). Bayesian networks: A model of self-activated memory for evidential reasoning.
In Proceedings of the 7th Conference of the Cognitive Science Society, University of California,
Irvine, pages 329–334. 136
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufmann. 35
Petersen, K. B. and Pedersen, M. S. (2006). The matrix cookbook. Version 20051003. 20
Pinto, N., Cox, D. D., and DiCarlo, J. J. (2008). Why is real-world visual object recognition
hard? PLoS Comput Biol, 4. 269
Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1), 77–
105. 199
Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In
UAI’2011 , Barcelona, Spain. 114
Powell, M. (1987). Radial basis functions for multivariable interpolation: A review. 98
Rabiner, L. R. and Juang, B. H. (1986). An introduction to hidden Markov models. IEEE ASSP
Magazine, pages 257–285. 187
Raiko, T., Yao, L., Cho, K., and Bengio, Y. (2014). Iterative neural autoregressive distribution
estimator (NADE-k). Technical report, arXiv preprint arXiv:1406.1485. 204
Ramsey, F. P. (1926). Truth and probability. In R. B. Braithwaite, editor, The Foundations
of Mathematics and other Logical Essays, chapter 7, pages 156–198. McMaster University
Archive for the History of Economic Thought. 36
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse
representations with an energy-based model. In NIPS’2006 . 15, 121, 122, 164
Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. Y. (2013). Zero-shot learning through
cross-modal transfer. In 27th Annual Conference on Neural Information Processing Systems
(NIPS 2013). 18
Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X. (2011).
Higher order contractive auto-encoder. In European Conference on Machine Learning and
Principles and Practice of Knowledge Discovery in Databases (ECML PKDD). 155
Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling
contractive auto-encoders. In ICML’2012 . 226
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and
organization in the brain. Psychological Review, 65, 386–408. 14
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York. 14
Roweis, S. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embed-
ding. Science, 290(5500). 91, 211
Rumelhart, D., Hinton, G., and Williams, R. (1986a). Learning representations by back-
propagating errors. Nature, 323, 533–536. 14
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986b). Learning representations by
back-propagating errors. Nature, 323, 533–536. 95, 187
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986c). Parallel Dis-
tributed Processing: Explorations in the Microstructure of Cognition, volume 1. MIT Press,
Cambridge. 95
Salakhutdinov, R. and Hinton, G. (2009a). Deep Boltzmann machines. In Proceedings of the
International Conference on Artificial Intelligence and Statistics, volume 5, pages 448–455.
165, 263, 265
Salakhutdinov, R. and Hinton, G. (2009b). Deep Boltzmann machines. In Proceedings of the
Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS 2009),
volume 8. 266
Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks.
In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, volume 25, pages
872–879. ACM. 236
Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of history
compression. Neural Computation, 4(2), 234–242. 15
Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation, 10, 1299–1319. 91, 211
Schölkopf, B., Burges, C. J. C., and Smola, A. J. (1999). Advances in Kernel Methods: Support
Vector Learning. MIT Press, Cambridge, MA. 15, 98, 114
Schuster, M. and Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE Transactions
on Signal Processing, 45(11), 2673–2681. 197
Schölkopf, B. and Smola, A. (2002). Learning with kernels. MIT Press. 87
Seide, F., Li, G., and Yu, D. (2011). Conversational speech transcription using context-dependent
deep neural networks. In Interspeech 2011 , pages 437–440. 15
Sermanet, P., Kavukcuoglu, K., Chintala, S., and LeCun, Y. (2013). Pedestrian detection with
unsupervised multi-stage feature learning. In Proc. International Conference on Computer
Vision and Pattern Recognition (CVPR’13). IEEE. 16, 114
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. (2014). Overfeat:
Integrated recognition, localization and detection using convolutional networks. International
Conference on Learning Representations. 68
Shilov, G. (1977). Linear Algebra. Dover Books on Mathematics Series. Dover Publications. 20
Siegelmann, H. (1995). Computation beyond the Turing limit. Science, 268(5210), 545–548.
189
Siegelmann, H. and Sontag, E. (1991). Turing computability with neural nets. Applied Mathe-
matics Letters, 4(6), 77–80. 189
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony
theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing,
volume 1, chapter 6, pages 194–281. MIT Press, Cambridge. 141, 148
Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011a). Dynamic
pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS’2011. 200
Socher, R., Manning, C., and Ng, A. Y. (2011b). Parsing natural scenes and natural language
with recursive neural networks. In Proceedings of the Twenty-Eighth International Conference
on Machine Learning (ICML’2011). 200
Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D. (2011c). Semi-
supervised recursive autoencoders for predicting sentiment distributions. In EMNLP’2011.
200
Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C.
(2013). Recursive deep models for semantic compositionality over a sentiment treebank. In
EMNLP’2013 . 200
Solla, S. A., Levin, E., and Fleisher, M. (1988). Accelerated learning in layered neural networks.
Complex Systems, 2, 625–639. 101
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout:
A simple way to prevent neural networks from overfitting. Journal of Machine Learning
Research, 15, 1929–1958. 125, 127, 265
Sutskever, I. and Tieleman, T. (2010). On the Convergence Properties of Contrastive Divergence.
In Y. W. Teh and M. Titterington, editors, Proc. of the International Conference on Artificial
Intelligence and Statistics (AISTATS), volume 9, pages 789–795. 242
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural
networks. Technical report, arXiv preprint arXiv:1409.3215. 10
Swersky, K., Ranzato, M., Buchman, D., Marlin, B., and de Freitas, N. (2011). On autoencoders
and score matching for energy based models. In ICML’2011. ACM. 248
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V.,
and Rabinovich, A. (2014). Going deeper with convolutions. Technical report, arXiv preprint
arXiv:1409.4842. 10
Tenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global geometric framework for
nonlinear dimensionality reduction. Science, 290(5500), 2319–2323. 91, 211
Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the
likelihood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008 ,
pages 1064–1071. ACM. 243
Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal components analysis. Journal
of the Royal Statistical Society B, 61(3), 611–622. 159, 160
Uria, B., Murray, I., and Larochelle, H. (2013). RNADE: The real-valued neural autoregressive
density-estimator. In NIPS’2013 . 203, 204
Utgoff, P. E. and Stracuzzi, D. J. (2002). Many-layered learning. Neural Computation, 14,
2497–2539. 15
van der Maaten, L. and Hinton, G. E. (2008). Visualizing data using t-SNE. J. Machine Learning
Res., 9. 211, 215
Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer-Verlag,
Berlin. 75, 76
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York. 75, 76
Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies
of events to their probabilities. Theory of Probability and Its Applications, 16, 264–280. 75,
76
Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural
Computation, 23(7), 1661–1674. 227, 248
Vincent, P. and Bengio, Y. (2003). Manifold Parzen windows. In NIPS’2002 . MIT Press. 212
Wager, S., Wang, S., and Liang, P. (2013). Dropout training as adaptive regularization. In
Advances in Neural Information Processing Systems 26 , pages 351–359. 127
Waibel, A., Hanazawa, T., Hinton, G. E., Shikano, K., and Lang, K. (1989). Phoneme recogni-
tion using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal
Processing, 37, 328–339. 187
Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of neural
networks using dropconnect. In ICML’2013 . 128
Wang, S. and Manning, C. (2013). Fast dropout training. In ICML’2013. 128
Warde-Farley, D., Goodfellow, I. J., Courville, A., and Bengio, Y. (2014). An empirical analysis
of dropout in piecewise linear networks. In ICLR’2014. 128
Weinberger, K. Q. and Saul, L. K. (2004). Unsupervised learning of image manifolds by semidef-
inite programming. In CVPR’2004, pages 988–995. 91, 211
Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding.
In ICML 2008 . 19
Weston, J., Bengio, S., and Usunier, N. (2010). Large scale image annotation: learning to rank
with joint word-image embeddings. Machine Learning, 81(1), 21–35. 200
White, H. (1990). Connectionist nonparametric regression: Multilayer feedforward networks can
learn arbitrary mappings. Neural Networks, 3(5), 535–549. 112
Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. In 1960 IRE WESCON
Convention Record, volume 4, pages 96–104. IRE, New York. 14
Williams, C. K. I. and Rasmussen, C. E. (1996). Gaussian processes for regression. In NIPS’95,
pages 514–520. MIT Press, Cambridge, MA. 114
Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms. Neural
Computation, 8(7), 1341–1390. 113
Xiong, H. Y., Barash, Y., and Frey, B. J. (2011). Bayesian prediction of tissue-regulated splicing
using RNA sequence and cellular context. Bioinformatics, 27(18), 2554–2562. 127
Younes, L. (1998). On the convergence of Markovian stochastic algorithms with rapidly decreas-
ing ergodicity rates. In Stochastics and Stochastics Models, pages 177–228. 243
Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In
ECCV’14 . 6, 68
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal
of the Royal Statistical Society, Series B, 67(2), 301–320. 105
Index
L^p norm, 26
Active constraint, 65
AIS, see annealed importance sampling
Almost everywhere, 51
Ancestral sampling, 145
Annealed importance sampling, 233, 264
Artificial intelligence, 2
Asymptotically unbiased, 79
Bagging, 123
Bayes’ rule, 50
Bayesian network, see directed graphical model
Bayesian probability, 36
Belief network, see directed graphical model
Bernoulli distribution, 43
Boltzmann distribution, 140
Boltzmann machine, 140
Calculus of variations, 257
CD, see contrastive divergence
Centering trick (DBM), 266
Central limit theorem, 44
Chain rule of probability, 39
Chess, 2
Classification, 68
Clique potential, see factor (graphical model)
CNN, see convolutional neural network
Collider, see explaining away
Computer vision, 276
Conditional computation, see dynamically structured nets, 271
Conditional independence, 39
Conditional probability, 38
Constrained optimization, 64
Context-specific independence, 143
Contrast, 277
Contrastive divergence, 239, 264, 265
Convolution, 169, 268
Convolutional neural network, 169
Correlation, 40
Cost function, see objective function
Covariance, 40
Covariance matrix, 40
Curse of dimensionality, 91
Cyc, 2
D-separation, 143
Dataset augmentation, 277, 282
DBM, see deep Boltzmann machine
Deep belief network, 251, 261, 262, 269
Deep Blue, 2
Deep Boltzmann machine, 251, 261, 263, 265, 269
Deep learning, 2, 5
Denoising score matching, 248
Density estimation, 68
Derivative, 56
Detector layer, 174
Dirac delta function, 46
Directed graphical model, 136
Directional derivative, 58
Domain adaptation, 165
Dot product, 22
Doubly block circulant matrix, 171
Dream sleep, 239, 259
DropConnect, 128
Dropout, 125, 265
Dynamically structured networks, 271
E-step, 254
Early stopping, 105, 120, 129
EBM, see energy-based model
Eigendecomposition, 28
Eigenvalue, 28
Eigenvector, 28
ELBO, see evidence lower bound
Element-wise product, see Hadamard product
EM, see expectation maximization
Empirical distribution, 46
Energy function, 140
Energy-based model, 140, 263
Ensemble methods, 123
Equality constraint, 64
Equivariance, 172
Error function, see objective function
Euclidean norm, 26
Euler-Lagrange equation, 257
Evidence lower bound, 251, 253–255, 263
Expectation, 40
Expectation maximization, 253
Expected value, see expectation
Explaining away, 144
Factor (graphical model), 138
Factor graph, 143
Factors of variation, 5
Frequentist probability, 36
Functional derivatives, 257
Gaussian distribution, see Normal distribution, 44
Gaussian mixture, 47
GCN, see Global contrast normalization
Gibbs distribution, 139
Gibbs sampling, 147
Global contrast normalization, 278
Global minimum, 13
Gradient, 58
Gradient descent, 58
Graphical model, see structured probabilistic model
Hadamard product, 22
Harmonium, see Restricted Boltzmann machine, 148
Harmony theory, 141
Helmholtz free energy, see evidence lower bound
Hessian matrix, 60
Identity matrix, 24
Independence, 39
Inequality constraint, 64
Inference, 133, 251, 253–256, 258, 259
Invariance, 177
Jacobian matrix, 60
Joint probability, 37
Karush-Kuhn-Tucker conditions, 65
Kernel (convolution), 170
KKT conditions, see Karush-Kuhn-Tucker conditions
KL divergence, see Kullback-Leibler divergence, 41
Kullback-Leibler divergence, 41
Lagrange function, see Lagrangian
Lagrange multipliers, 64, 258
Lagrangian, 64
Learner, 3
Line search, 58
Linear combination, 25
Linear dependence, 26
Local conditional probability distribution, 136
Local minimum, 13
Logistic regression, 3
Logistic sigmoid, 47
Loss function, see objective function
M-step, 254
Machine learning, 3
Manifold hypothesis, 91, 207
Manifold learning, 90, 207
MAP inference, 255
Marginal probability, 38
Markov chain, 145
Markov network, see undirected model, 138
Markov random field, see undirected model, 138
Matrix, 21
Matrix inverse, 24
Matrix product, 22
Max pooling, 177
Mean field, 264, 265
Measure theory, 50
Measure zero, 50
Method of steepest descent, see gradient descent
Missing inputs, 68
Mixing (Markov chain), 150
Mixture distribution, 47
MNIST, 265
Model averaging, 123
MP-DBM, see multi-prediction DBM
Multi-prediction DBM, 264, 266
Multinomial distribution, 43
Multinoulli distribution, 43
Naive Bayes, 51
Nat, 41
Negative definite, 60
Negative phase, 238
Netflix Grand Prize, 125
Noise-contrastive estimation, 248
Norm, 26
Normal distribution, 44, 46
Normalized probability distribution, 139
Object detection, 276
Object recognition, 276
Objective function, 13, 55
Orthogonality, 28
Overfitting, 75
Parameter sharing, 172
Partial derivative, 58
Partition function, 94, 139, 231, 264
PCA, see principal components analysis
PCD, see stochastic maximum likelihood
Persistent contrastive divergence, see stochastic maximum likelihood
Pooling, 169, 268
Positive definite, 60
Positive phase, 238
Precision (of a normal distribution), 44, 46
Predictive sparse decomposition, 155, 220
Preprocessing, 277
Principal components analysis, 31, 251, 279
Probabilistic max pooling, 268
Probability density function, 37
Probability distribution, 36
Probability mass function, 36
Product rule of probability, see chain rule of probability
PSD, see predictive sparse decomposition
Pseudolikelihood, 244
Random variable, 36
Ratio matching, 247
RBM, see restricted Boltzmann machine
Receptive field, 173
Regression, 68
Representation learning, 3
Restricted Boltzmann machine, 148, 164, 251, 260, 261, 265, 266, 268
Scalar, 20
Score matching, 246
Second derivative, 60
Second derivative test, 60
Self-information, 41
Separable convolution, 186
Separation (probabilistic modeling), 141
Shannon entropy, 41, 257
Sigmoid, see logistic sigmoid
SML, see stochastic maximum likelihood
Softmax, 101
Softplus, 47
Spam detection, 3
Sparse coding, 163, 251
Sphering, see Whitening, 279
Square matrix, 26
Standard deviation, 40
Statistic, 79
Steepest descent, see gradient descent
Stochastic gradient descent, 265
Stochastic maximum likelihood, 243, 264, 265
Stochastic pooling, 128
Structure learning, 147
Structured output, 68
Structured probabilistic model, 132
Sum rule of probability, 38
Surrogate loss function, 129
Symmetric matrix, 28
Tangent plane, 210
Tensor, 21
Test example, 13
Tiled convolution, 182
Toeplitz matrix, 171
Trace operator, 30
Training criterion, 13
Transcription, 68
Transfer learning, 165
Transpose, 21
Unbiased, 79
Underfitting, 75
Undirected model, 138
Uniform distribution, 37
Unit norm, 28
Unnormalized probability distribution, 138
V-structure, see explaining away
Variance, 40
Variational derivatives, see functional derivatives
Variational free energy, see evidence lower bound
Vector, 20
Whitening, 279
ZCA, see zero-phase components analysis
Zero-phase components analysis, 279