Course Notes and Assignments

Spring 2016
Mondays and Wednesdays, 14:30-15:45
DL 220

Instructor: Taylor Arnold
E-mail: taylor.arnold@yale.edu
Office Hours: Wednesdays, 16:00-17:00 (24 HH Classroom)
Teaching Assistants: Elena Khusainova, Yu Lu, Jason Klusowski
TA Sessions: Tuesdays, 13:00-15:00 (24 HH Basement); Wednesdays, 19:00-20:30 (24 HH Classroom); Thursdays, 19:00-20:30 (24 HH Classroom)
Date | Description | Resources | References
2016-01-20 Course Introduction [Syllabus]
[Lecture 01]
  • W. N. Venables, D. M. Smith and the R Core Team. An Introduction to R. [pdf notes]
  • Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. (EoSL) [pdf book]
  • Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. (DL) [html book]
  • Yoshua Bengio. Learning Deep Architectures for AI. [pdf paper]
  • Juergen Schmidhuber. Deep Learning in Neural Networks: An Overview. [pdf paper]
  • L.J.P. van der Maaten, E.O. Postma, H.J. van den Herik. Dimensionality Reduction: A Comparative Review. [pdf paper]
  • Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation Learning: A Review and New Perspectives. [pdf paper]
2016-01-22 Linear Classification Methods I [Lecture 02]
[knn video]
[kernel smoother video]
[regression video]
[lowess video]
  • EoSL 3 & 4
  • Cosma Rohilla Shalizi. Advanced Data Analysis from an Elementary Point of View. (ADAfaEPoV) Book in preparation. [book pdf]
  • Gyorfi, Laszlo, Michael Kohler, Adam Krzyzak and Harro Walk (2002). A Distribution-Free Theory of Nonparametric Regression. New York: Springer-Verlag. [book pdf]
  • Simonoff, Jeffrey S. (1996). Smoothing Methods in Statistics. Berlin: Springer-Verlag. [book pdf]
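
For readers who want to experiment alongside the videos, a minimal k-nearest-neighbors smoother in plain Python/numpy (our own sketch on synthetic data, not one of the course scripts):

    import numpy as np

    def knn_smoother(x_train, y_train, x_new, k=5):
        """Predict at each x_new as the mean of the k nearest training responses."""
        preds = []
        for x0 in x_new:
            idx = np.argsort(np.abs(x_train - x0))[:k]   # k nearest neighbors in x
            preds.append(y_train[idx].mean())
        return np.array(preds)

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 1, 100))
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 100)  # noisy sine curve
    grid = np.linspace(0, 1, 50)
    yhat = knn_smoother(x, y, grid, k=10)                # larger k = smoother fit
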
2016-01-25 Linear Classification Methods II [Lecture 03]
[linear smoother video]
[validation video]
  • EoSL 3 & 4
  • Buja, Andreas, Trevor Hastie and Robert Tibshirani (1989). “Linear Smoothers and Additive Models.” Annals of Statistics, 17: 453–555. [paper]
  • Ye, Jianming (1998). “On Measuring and Correcting the Effects of Data Mining and Model Selection.” Journal of the American Statistical Association, 93: 120–131. doi:10.1080/01621459.1998.10474094. [paper]
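
Both references treat estimators of this kind as linear smoothers, yhat = S y, whose complexity is summarized by the effective degrees of freedom trace(S). A small numpy sketch using ridge regression, where S has a closed form (synthetic data, our own code):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 5))
    lam = 2.0
    # Ridge hat matrix: S = X (X'X + lam I)^{-1} X'
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(5), X.T)
    df = np.trace(S)   # effective degrees of freedom, between 0 and 5
    print(df)          # shrinks toward 0 as lam grows, toward 5 as lam -> 0
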
2016-01-27 Least Squares and Classification [Lecture 04]
  • EoSL 2.4
  • Bellman, Richard, et al. Adaptive control processes: a guided tour. Vol. 4. Princeton: Princeton University Press, 1961.
  • Ledoux, Michel. The concentration of measure phenomenon. No. 89. American Mathematical Soc., 2005.
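
A quick empirical look at the concentration phenomenon these references treat formally: as the dimension grows, pairwise distances among uniform random points become nearly indistinguishable. A numpy sketch (our own illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    for d in [2, 10, 100, 1000]:
        X = rng.uniform(size=(n, d))
        diff = X[:, None, :] - X[None, :, :]        # all pairwise differences
        D = np.sqrt((diff ** 2).sum(axis=-1))
        pd = D[np.triu_indices(n, k=1)]             # distinct pairwise distances
        print(d, round(pd.min() / pd.max(), 2))     # ratio creeps toward 1 with d
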
2016-02-01 Additive Models [Lecture 05]
[script05.R]
  • EoSL 9
  • Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting (with discussion), Annals of Statistics 28: 337–407. [paper]
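
The backfitting idea behind additive models fits one coordinate function at a time against partial residuals. A toy sketch in Python with a crude nearest-neighbor smoother (our own code; real GAM software uses proper smoothers and convergence checks):

    import numpy as np

    def smooth(x, r, k=15):
        # k-nearest-neighbor running mean of residuals r against x
        out = np.empty_like(r)
        for i, x0 in enumerate(x):
            idx = np.argsort(np.abs(x - x0))[:k]
            out[i] = r[idx].mean()
        return out

    rng = np.random.default_rng(2)
    n = 200
    x1, x2 = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
    y = np.sin(3 * x1) + x2 ** 2 + rng.normal(0, 0.1, n)

    f1, f2 = np.zeros(n), np.zeros(n)
    for _ in range(20):                        # backfitting iterations
        f1 = smooth(x1, y - y.mean() - f2)     # update f1 on partial residuals
        f1 -= f1.mean()                        # center for identifiability
        f2 = smooth(x2, y - y.mean() - f1)
        f2 -= f2.mean()
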
2016-02-05 Problem Set #1 Due (via ClassesV2 by 13:00) [Problem Set #1]
[NYC Taxi Data]
[pset01_starter.py]
[pset01_starter.R]
[nyc_test.csv]
[nyc_train.csv]
[nyc_nta.csv]
[pset01_grading_rubric.txt]
[solutions01.csv]
2016-02-08 Decision Trees I [Lecture 06]
  • EoSL 10 & 15
  • Breiman, L. (2001). Random forests, Machine Learning 45: 5–32. [paper]
  • Buhlmann, P. and Hothorn, T. (2007). Boosting algorithms: regularization, prediction and model fitting (with discussion), Statistical Science 22(4): 477–505. [paper]
  • randomForest: Breiman and Cutler's Random Forests for Classification and Regression. [R package]
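
For readers working in Python rather than R, scikit-learn's RandomForestClassifier is a close analogue of the randomForest package listed above; a minimal sketch on synthetic data (our own, not a course script):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                oob_score=True, random_state=0)
    rf.fit(X_tr, y_tr)
    print(rf.oob_score_)            # out-of-bag estimate of accuracy
    print(rf.score(X_te, y_te))     # held-out accuracy
    print(rf.feature_importances_)  # variable importance, as in Breiman (2001)
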
2016-02-10 Dimensionality Reduction with PCA [Lecture 07]
[script07.R]
  • EoSL 14
  • ADAfaEPoV 16
  • Hand, David, Heikki Mannila and Padhraic Smyth (2001). Principles of Data Mining. Cambridge, Massachusetts: MIT Press.
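
PCA reduces to the singular value decomposition of the centered data matrix; a generic numpy sketch on random data (not taken from script07.R):

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 10))
    Xc = X - X.mean(axis=0)                  # center each column

    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T                       # principal component scores
    var_explained = s**2 / np.sum(s**2)      # proportion of variance per component
    X2 = scores[:, :2]                       # 2-D projection for plotting
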
2016-02-15 Decision Trees II [Lecture 08]
[script08.Rmd]
[script08.html]
  • EoSL 10
  • Friedman, J. (2001). Greedy function approximation: A gradient boosting machine, Annals of Statistics 29(5): 1189–1232. [paper]
  • Schapire, Robert E. "The boosting approach to machine learning: An overview." Nonlinear estimation and classification. Springer New York, 2003. 149-171. [pdf chapter]
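
The core loop of Friedman's gradient boosting machine, specialized to squared-error loss with depth-1 trees (stumps) as base learners; a Python sketch using scikit-learn stumps on synthetic data:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(4)
    X = rng.uniform(-2, 2, size=(300, 1))
    y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.2, 300)

    nu, n_rounds = 0.1, 200          # learning rate and number of boosting rounds
    F = np.full(300, y.mean())       # initialize the fit at the mean
    trees = []
    for _ in range(n_rounds):
        resid = y - F                                  # negative gradient of squared error
        stump = DecisionTreeRegressor(max_depth=1).fit(X, resid)
        F += nu * stump.predict(X)                     # take a small step
        trees.append(stump)
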
2016-02-17 Linear Classification Models and Support Vector Machines I [Lecture 09]
[script09.Rmd]
[script09.html]
  • EoSL 12
  • Burges, C. (1998). A tutorial on support vector machines for pattern recognition, Knowledge Discovery and Data Mining 2(2): 121–167. [pdf]
  • Vapnik, V. (1996). The Nature of Statistical Learning Theory, Springer, New York. [book pdf]
  • Wahba, G., Lin, Y. and Zhang, H. (2000). GACV for support vector machines, in A. Smola, P. Bartlett, B. Scholkopf and D. Schuurmans (eds), Advances in Large Margin Classifiers, MIT Press, Cambridge, MA., pp. 297–311. [pdf]
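
A minimal soft-margin SVM fit in Python via scikit-learn, whose SVC class wraps LIBSVM; synthetic two-class data, our own sketch:

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

    # Radial-basis kernel; C trades margin width against training errors
    clf = SVC(kernel="rbf", C=1.0, gamma=1.0)
    clf.fit(X, y)
    print(clf.score(X, y))
    print(clf.support_vectors_.shape)   # the fitted support vectors
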
2016-02-19 Problem Set #2 Due [Problem Set #2]
[pset02_starter.R]
[pset02_starter.py]
[train.csv]
[test.csv]
[results.csv]
[ca_rac_S000_JT00_2013.csv]
[ct_rac_S000_JT00_2013.csv]
[mt_rac_S000_JT00_2013.csv]
[ny_rac_S000_JT00_2013.csv]
[LODESTechDoc7.0.pdf]
2016-02-22 Support Vector Machines II [Lecture 10]
[script10.Rmd]
[script10.html]
  • EoSL 12
  • Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: A library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27. [pdf, website]
  • Karatzoglou, Alexandros, David Meyer, and Kurt Hornik. "Support Vector Machines in R." Journal of Statistical Software 15.09 (2006). [pdf]
2016-02-24 Support Vector Machines III [Lecture 11]
[script11.Rmd]
[script11.html]
[svmCostsLinear.mp4]
[svmCostsPoly.mp4]
[svmCostsSigmoid.mp4]
[svmCostsRadial.mp4]
[mnist_train.psv]
[mnist_test.psv]
  • EoSL 12
  • Zhu, Kaihua, et al. "Parallelizing support vector machines on distributed computers." Advances in Neural Information Processing Systems. 2008. [pdf]
2016-02-26 Problem Set #3 Due [Problem Set #3]
[chiCrimeTest.psv]
[chiCrimeTrain.psv]
[chiCrimeTestSolutions.psv]
[testScript03.R]
[pset03_sample.csv]
2016-02-29 Introduction to Neural Networks [Lecture 12]
[script12.Rmd]
[script12.html]
  • DL 6.1-6.2
  • Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015. [html tutorial]
  • MNIST Database of Handwritten Digits [data files] http://yann.lecun.com/exdb/mnist/
2016-03-02 Back-propagation [Lecture 13]
  • DL 6.3-6.5
  • Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). "A fast learning algorithm for deep belief nets." Neural Computation 18(7): 1527–1554. [pdf paper]
  • Bengio, Yoshua; Lamblin, Pascal; Popovici, Dan; Larochelle, Hugo (2007). "Greedy Layer-Wise Training of Deep Networks". Advances in Neural Information Processing Systems: 153–160. [pdf paper]
  • Ranzato, MarcAurelio; Poultney, Christopher; Chopra, Sumit; LeCun, Yann (2007). "Efficient Learning of Sparse Representations with an Energy-Based Model". Advances in Neural Information Processing Systems. [pdf paper]
  • Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. "Learning representations by back-propagating errors." Cognitive modeling 5 (1988): 3. [pdf paper]
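
The Rumelhart, Hinton, and Williams updates, written out by hand for a one-hidden-layer sigmoid network on XOR (a self-contained numpy sketch, not a course script):

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)
    W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)
    sigmoid = lambda z: 1 / (1 + np.exp(-z))

    eta = 1.0                                  # learning rate
    for _ in range(10000):
        h = sigmoid(X @ W1 + b1)               # forward pass
        p = sigmoid(h @ W2 + b2)
        d2 = (p - y) * p * (1 - p)             # backward pass, squared-error loss
        d1 = (d2 @ W2.T) * h * (1 - h)         # chain rule through the hidden layer
        W2 -= eta * h.T @ d2; b2 -= eta * d2.sum(axis=0)
        W1 -= eta * X.T @ d1; b1 -= eta * d1.sum(axis=0)

    print(p.round(2))                          # should approach [0, 1, 1, 0]
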
2016-03-04 Problem Set #4 Due [Problem Set #4]
[pset04_start.R]
[pset04_start.py]
[testScript04.R]
[testScript04.py]
[pset04_solutions.R]
2016-03-07 Cost Functions and Regularization [Lecture 14]
[script14.Rmd]
[script14.html]
  • DL 8.1-8.3
  • Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In Proceedings of the 30th international conference on machine learning (ICML-13) (pp. 1139-1147). [pdf paper]
  • Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR. [pdf paper]
  • Goodfellow, I. J., Vinyals, O., and Saxe, A. M. (2015). Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations. [pdf paper]
  • Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2014). The loss surface of multilayer networks. [pdf paper]
  • Dauphin, Y. and Bengio, Y. (2013). Stochastic ratio matching of RBMs for sparse high-dimensional inputs. [pdf paper]
2016-03-09 Weight Initialization, Momentum, and Learning Rate Schedules [Lecture 15]
[script15.Rmd]
[script15.html]
  • DL 8.4
  • Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural networks, 1(4), 295–307. [pdf paper]
  • Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. [pdf paper]
  • Schaul, T., Antonoglou, I., and Silver, D. (2014). Unit tests for stochastic optimization. International Conference on Learning Representations. [pdf paper]
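
The classical momentum update discussed in Sutskever et al., combined with a simple step-decay learning rate schedule, on a toy ill-conditioned quadratic (a numpy sketch with our own constants):

    import numpy as np

    A = np.diag([1.0, 10.0])            # ill-conditioned quadratic: f(w) = w'Aw / 2
    grad = lambda w: A @ w

    w = np.array([5.0, 5.0])
    v = np.zeros(2)                     # velocity
    eta0, mu = 0.05, 0.9                # initial learning rate, momentum coefficient
    for t in range(100):
        eta = eta0 * 0.5 ** (t // 25)   # halve the learning rate every 25 steps
        v = mu * v - eta * grad(w)      # classical momentum
        w = w + v
    print(w)                            # close to the minimizer at the origin
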
2016-03-14 Problem Set #5 Due [Problem Set #5]
[pset05.zip]
2016-03-28 Neural Network Software [Lecture 16]
[notebook16.ipynb]
[notebook16.html]
  • Bergstra, James, et al. "Theano: a CPU and GPU math expression compiler." Proceedings of the Python for scientific computing conference (SciPy). Vol. 4. 2010. [pdf paper]
  • Bastien, Frédéric, et al. "Theano: new features and speed improvements." arXiv preprint arXiv:1211.5590 (2012). [pdf paper]
  • van Merriënboer, B., Bahdanau, D., Dumoulin, V., Serdyuk, D., Warde-Farley, D., Chorowski, J. and Bengio, Y., 2015. Blocks and fuel: Frameworks for deep learning. arXiv preprint arXiv:1506.00619. [pdf paper]
  • Theano library [website]
  • Keras library [website]
  • Caffe model zoo [website]
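
For orientation, a minimal Keras model in the spirit of these libraries (a sketch against the present-day Sequential API; the Keras of spring 2016 spelled some arguments differently, e.g. nb_epoch rather than epochs; the data here are random stand-ins):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    # stand-in data: 784-dimensional inputs, 10 one-hot classes
    X = np.random.normal(size=(1000, 784)).astype("float32")
    y = np.eye(10)[np.random.randint(0, 10, 1000)]

    model = Sequential()
    model.add(Dense(128, activation="relu", input_shape=(784,)))
    model.add(Dense(10, activation="softmax"))
    model.compile(loss="categorical_crossentropy", optimizer="sgd",
                  metrics=["accuracy"])
    model.fit(X, y, batch_size=32, epochs=5)
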
2016-03-30 Convolutional Neural Networks [Lecture 17]
[notebook17.ipynb]
[notebook17.html]
  • DL 9.1-9.4
  • Ciresan, Dan; Meier, Ueli; Schmidhuber, Jürgen (June 2012). "Multi-column deep neural networks for image classification". CVPR 2012. [pdf paper]
  • Deng, Jia, et al. "Imagenet: A large-scale hierarchical image database." Computer Vision and Pattern Recognition, 2009. CVPR 2009. [pdf paper]
  • Boureau, Y., Ponce, J., and LeCun, Y. (2010). A theoretical analysis of feature pooling in vision algorithms. In Proc. International Conference on Machine learning (ICML 2010). [pdf paper]
  • Boureau, Y., Le Roux, N., Bach, F., Ponce, J., and LeCun, Y. (2011). Ask the locals: multi-way local pooling for image recognition. In Proc. International Conference on Computer Vision (ICCV 2011) [pdf paper]
  • Jia, Y., Huang, C., and Darrell, T. (2012). Beyond spatial pyramids: Receptive field learning for pooled image features. In Computer Vision and Pattern Recognition (CVPR). [pdf paper]
  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2014). Going deeper with convolutions. Technical report, arXiv:1409.4842. [pdf paper]
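
The convolution-pooling pattern from DL 9.1-9.4 as a small Keras stack (again a sketch in the modern API; 2016-era Keras called the layer Convolution2D):

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)))
    model.add(MaxPooling2D((2, 2)))          # downsample the feature maps
    model.add(Conv2D(64, (3, 3), activation="relu"))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(10, activation="softmax"))
    model.compile(loss="categorical_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    model.summary()
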
2016-04-04 Transfer Learning and Computer Vision I [Lecture 18]
[notebook18.ipynb]
[notebook18.html]
  • DL 14.2, 12.2
  • Glorot, X., Bordes, A., and Bengio, Y. Deep sparse rectifier neural networks. (AISTATS 2011). [pdf paper]
  • Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E., Muller, X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra, J. (2011). Unsupervised and transfer learning challenge: a deep learning approach. [pdf paper]
  • Davis, A., Rubinstein, M., Wadhwa, N., Mysore, G., Durand, F., and Freeman, W. T. (2014). The visual microphone: Passive recovery of sound from video. ACM Transactions on Graphics (Proc. SIGGRAPH), 33(4), 79:1–79:10. [pdf paper]
  • Krizhevsky, A., Sutskever, I., and Hinton, G. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS 2012). [pdf paper]
  • Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative adversarial nets." Advances in Neural Information Processing Systems, pp. 2672-2680. 2014. [pdf paper]
  • He, Kaiming, et al. "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385 (2015). [pdf paper]
  • Misra, Ishan, Abhinav Shrivastava, and Martial Hebert. "Watch and learn: Semi-supervised learning of object detectors from videos." arXiv:1505.05769 (2015). [pdf paper]
  • Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." NIPS (2015). [pdf paper]
  • Dai, Jifeng, Kaiming He, and Jian Sun. "Instance-aware Semantic Segmentation via Multi-task Network Cascades." arXiv preprint arXiv:1512.04412 (2015). [pdf paper]
2016-04-06 Computer Vision II [Lecture 19]
[notebook19.ipynb]
[notebook19.html]
  • DL 12.2
  • Goodfellow, I. J., Courville, A., and Bengio, Y. (2011). Spike-and-slab sparse coding for unsupervised feature discovery. In NIPS Workshop on Challenges in Learning Hierarchical Models. [pdf paper]
  • Yosinski, Jason, et al. "How transferable are features in deep neural networks?." Advances in Neural Information Processing Systems. 2014. [pdf paper]
  • Donahue, Jeff, et al. "Decaf: A deep convolutional activation feature for generic visual recognition." arXiv preprint arXiv:1310.1531 (2013). [pdf paper]
  • Razavian, Ali S., et al. "CNN features off-the-shelf: an astounding baseline for recognition." Computer Vision and Pattern Recognition Workshops (CVPRW). [pdf paper]
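
The "off-the-shelf features" recipe from Donahue et al. and Razavian et al., sketched with a pretrained VGG16 from keras.applications and a linear classifier on top (the random images array is a stand-in for real data, and VGG16 is our choice of backbone, not the network those papers used):

    import numpy as np
    from keras.applications.vgg16 import VGG16, preprocess_input
    from sklearn.linear_model import LogisticRegression

    # stand-ins: n images of shape (224, 224, 3) with binary labels
    images = np.random.uniform(0, 255, size=(20, 224, 224, 3))
    labels = np.random.randint(0, 2, 20)

    base = VGG16(weights="imagenet", include_top=False, pooling="avg")
    feats = base.predict(preprocess_input(images))   # one 512-dim feature per image

    clf = LogisticRegression(max_iter=1000).fit(feats, labels)
    print(clf.score(feats, labels))
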
2016-04-08 Problem Set #6 Due [Problem Set #6]
[pset06_starter.py]
[pset06_solutions.py]
2016-04-15 Problem Set #7 Due [Problem Set #7]
[pset07_starter.py]
[pset07_solutions.py]
2016-04-18 Word Embeddings [Lecture 20]
[notebook20.ipynb]
[notebook20.html]
  • Hinton, G.E., Osindero, S. and Teh, Y.W., 2006. A fast learning algorithm for deep belief nets. Neural computation, 18(7), pp.1527-1554. [pdf paper]
  • Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N. and Kingsbury, B., 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6), pp.82-97. [pdf paper]
  • Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." In Advances in neural information processing systems, pp. 3111-3119. (2013). [pdf paper]
  • Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013). [pdf paper]
  • Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation." In EMNLP, vol. 14, pp. 1532-1543. (2014). [website]
  • Huang, Eric H., Richard Socher, Christopher D. Manning, and Andrew Y. Ng. "Improving word representations via global context and multiple word prototypes." In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pp. 873-882. Association for Computational Linguistics, 2012. [pdf paper]
  • Mnih, A. and Kavukcuoglu, K. Learning word embeddings efficiently with noise contrastive estimation (2013). [pdf paper]
  • Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Large-scale learning of embeddings with reconstruction sampling. (ICML 2011). [pdf paper]
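
With trained vectors in hand (word2vec or GloVe), most of the operations in this lecture reduce to cosine similarity between embedding rows; a toy numpy sketch with a made-up three-word vocabulary:

    import numpy as np

    vocab = ["king", "queen", "apple"]        # toy vocabulary
    E = np.array([[0.90, 0.10, 0.30],         # made-up embedding rows
                  [0.80, 0.20, 0.35],
                  [0.10, 0.90, 0.00]])

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    for i in range(3):
        for j in range(i + 1, 3):
            print(vocab[i], vocab[j], round(cosine(E[i], E[j]), 3))
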
2016-04-25 Problem Set #8 Due [Problem Set #8]
[pset08_preprocess_stl10.py]
[pset08_starter_chicago.py]
[pset08_starter_stl.py]
2016-04-25 Recurrent Neural Networks [Lecture 21]
[notebook21.ipynb]
[notebook21.html]
  • DL 10
  • Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015). Scheduled sampling for sequence prediction with recurrent neural networks. Technical report, arXiv:1506.03099. [pdf paper]
  • Pascanu, R., Gulcehre, Ç., Cho, K., and Bengio, Y. How to construct deep recurrent neural networks. (ICLR 2014). [pdf paper]
  • Cho, Kyunghyun, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. "On the properties of neural machine translation: Encoder-decoder approaches." [pdf paper]
  • Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9, no. 8 (1997): 1735-1780. [pdf paper]
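
The recurrence at the heart of DL chapter 10, written out for a vanilla (tanh) RNN forward pass in numpy with random weights and inputs; LSTM, per Hochreiter and Schmidhuber, replaces this cell with gated state:

    import numpy as np

    rng = np.random.default_rng(5)
    d_in, d_hid, T = 4, 8, 10                 # input size, hidden size, sequence length
    Wx = rng.normal(0, 0.5, (d_in, d_hid))
    Wh = rng.normal(0, 0.5, (d_hid, d_hid))
    b = np.zeros(d_hid)

    xs = rng.normal(size=(T, d_in))           # a random input sequence
    h = np.zeros(d_hid)
    for t in range(T):
        # the same weights are reused at every time step
        h = np.tanh(xs[t] @ Wx + h @ Wh + b)
    print(h)                                  # final hidden state
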
2016-04-27 Theory, Depth, Representation, Future [Lecture 22]
[notebook22.ipynb]
[notebook22.html]
[notebook_addition.ipynb]
[notebook_addition.html]
  • Barron, Andrew. "Universal Approximation Bounds for Superpositions of a Sigmoidal Function." IEEE Transactions on Information Theory, Vol. 39, No. 3, May 1993. [pdf paper]
  • Bengio, Yoshua, et al. "Generalized denoising auto-encoders as generative models." Advances in Neural Information Processing Systems. 2013. [pdf paper]
  • Alain, Guillaume, and Yoshua Bengio. "What regularized auto-encoders learn from the data-generating distribution." The Journal of Machine Learning Research 15.1 (2014): 3563-3593. [pdf paper]
  • Gregor, Karol, et al. "DRAW: A recurrent neural network for image generation." arXiv preprint arXiv:1502.04623 (2015). [pdf paper]
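
A minimal denoising autoencoder in the sense of the Bengio et al. paper above, sketched in Keras (random stand-in data; the architecture and noise level are our own choices):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    X = np.random.normal(size=(1000, 20)).astype("float32")            # stand-in data
    X_noisy = X + np.random.normal(0, 0.5, X.shape).astype("float32")  # corrupted input

    # train the network to reconstruct the clean X from its corrupted version
    ae = Sequential()
    ae.add(Dense(8, activation="relu", input_shape=(20,)))   # bottleneck encoder
    ae.add(Dense(20, activation="linear"))                   # decoder
    ae.compile(loss="mse", optimizer="adam")
    ae.fit(X_noisy, X, epochs=10, batch_size=32)
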
2016-05-02 Problem Set #9 Due [Problem Set #9]
[mdsd.zip]