This course covers the methodology, major software tools, and applications of data mining. By introducing the principal ideas of statistical learning, the course helps students understand the conceptual underpinnings of data mining methods. The emphasis is on using existing software packages (mainly in R) rather than on having students develop the algorithms themselves. Students are required to complete projects that give them practice applying this software. Topics include statistical learning; resampling methods; linear regression; variable selection; regression shrinkage; dimension reduction; non-linear methods; logistic regression; discriminant analysis; nearest neighbors; decision trees; bagging; boosting; support vector machines; principal components analysis; and clustering.
- STAT 501 (Regression Methods) or a similar course that covers the analysis of research data through simple and multiple regression and correlation; polynomial models; indicator variables; and stepwise, piecewise, and logistic regression.
- Basics of probability, expectation, and conditional distributions. Review the Basic Statistical Concepts notes on the STAT online site.
- Matrix algebra and multivariate calculus are beneficial but not required. Review the Matrix Algebra Review notes on the STAT online site.
- The examples in the course use R, and students will complete weekly R Labs that apply statistical learning methods to real-world data. Extensive guidance in using R will be provided, but basic programming experience in R, or exposure to another programming language such as MATLAB or Python, will be useful. Introductions to R are available at Statistical R Tutorial and CRAN R Project Intro Manual.
Required: An Introduction to Statistical Learning: with Applications in R (2013), by G. James, D. Witten, T. Hastie, and R. Tibshirani (Springer).
- The Elements of Statistical Learning, 2nd edition, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
- Pattern Recognition and Machine Learning by C. M. Bishop
- All of Statistics: A Concise Course in Statistical Inference by L. Wasserman
- Classification and Regression Trees by L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone
- Principles of Data Mining by D. J. Hand, H. Mannila, and P. Smyth
- Pattern Recognition and Neural Networks by B. Ripley