About
Data mining and statistical learning methods use a variety of computational tools for understanding large, complex datasets. In some cases, the focus is on building models to predict a quantitative or qualitative output based on a collection of inputs. In others, the goal is simply to find relationships and structure in data with no specific output variable. This course takes an applied approach to understanding the methodology, motivation, assumptions, strengths, and weaknesses of the most widely applicable methods in this field.
Course Topics
This graduate-level course covers the following topics:
- Understanding statistical learning and model selection
- Using resampling methods such as cross-validation and bootstrap
- Using linear regression methods
- Examining variable selection in building regression models
- Using regression shrinkage methods such as ridge regression and LASSO
- Using dimension reduction methods such as principal components regression and partial least squares
- Using methods for modeling non-linear relationships
- Using classification methods such as logistic regression, discriminant analysis, and nearest-neighbors
- Using decision tree methods including bagging and boosting
- Understanding the use of support vector machines
- Using principal components analysis methods
- Using cluster analysis methods
Course Author(s)
Dr. Jia Li is the original author of these course materials. They have been adapted and enhanced by Dr. Le Bao, Dr. Iain Pardoe, and Dr. Megan Romer.
Software
The examples in the course use R, and students will do weekly R Labs to apply statistical learning methods to real-world data. Extensive guidance in using R will be provided, but basic programming skills in R or exposure to a programming language such as MATLAB or Python will be useful.
R involves programming. Students should already feel comfortable using R at a basic level, be quick learners of software packages, or be able to work out how to do the required analyses in another package of their choice. Students who have no experience with programming or are anxious about manipulating software code are strongly encouraged to take the one-credit course in R to establish this foundation before taking this course.
R will be supported and sample programs will be supplied, but you will be required to do some programming on your own. Because of differences in software applications, versions, and platforms, there may be issues with running code. Students must be proactive in seeking advice and help from appropriate sources, including documentation resources, other students, the teaching assistant, the instructor, or the helpdesk.
Textbook
James, G., Witten, D., Hastie, T., and Tibshirani, R. An Introduction to Statistical Learning: with Applications in R. Springer, 2013.
Assessment Plan
- Weekly Quizzes: 20%
- R Labs: 20%
- Individual Projects (2): 25%
- Team Project: 25%
- Participation in online discussion forums: 10%
Prerequisites
- STAT 501 (Regression Methods) or a similar course that covers analysis of research data through simple and multiple regression and correlation; polynomial models; indicator variables; and stepwise, piecewise, and logistic regression.
- Basics of probability, expectation, and conditional distributions.
- Matrix algebra and multivariate calculus will be beneficial but are not required.