STAT 508: Applied Data Mining & Statistical Learning

About

Data mining and statistical learning methods use a variety of computational tools for understanding large, complex datasets. In some cases, the focus is on building models to predict a quantitative or qualitative output based on a collection of inputs. In others, the goal is simply to find relationships and structure in data with no specific output variable. This course takes an applied approach to understanding the methodology, motivation, assumptions, strengths, and weaknesses of the most widely applicable methods in this field.

Course Topics

This graduate-level course covers the following topics:

Understanding statistical learning and model selection
Using resampling methods such as cross-validation and bootstrap
Using linear regression methods
Examining variable selection in building regression models
Using regression shrinkage methods such as ridge regression and LASSO
Using dimension reduction methods such as principal components regression and partial least squares
Using methods for modeling non-linear relationships
Using classification methods such as logistic regression, discriminant analysis, and nearest-neighbors
Using decision tree methods including bagging and boosting
Understanding the use of support vector machines
Using principal components analysis methods
Using cluster analysis methods

Course Author(s)

Dr. Jia Li is the original author of these course materials. They have been adapted and enhanced by Dr. Le Bao, Dr. Iain Pardoe, and Dr. Megan Romer.

Software

The examples in the course use R, and students will complete weekly R Labs to apply statistical learning methods to real-world data. Extensive guidance in using R will be provided, but basic programming skills in R or prior exposure to a programming language such as MATLAB or Python will be useful.

R involves programming. Students should already feel comfortable using R at a basic level, be quick learners of software packages, or be able to figure out how to do the required analyses in another package of their choice. Students who have no programming experience or are anxious about working with code are strongly encouraged to take the one-credit course in R to establish this foundation before taking this course.

R will be supported and sample programs will be supplied, but you will be required to do some programming on your own. Because software applications, versions, and platforms differ, you may run into issues when running code. Students must be proactive in seeking advice and help from appropriate sources, including documentation resources, other students, the teaching assistant, the instructor, or the helpdesk.

Textbook

James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). An Introduction to Statistical Learning: with Applications in R, 2nd Edition. Springer. ISBN-13: 978-1071614174

(*) This textbook will also be readily available through the Penn State Libraries E-Book program at no cost to the student. Students do not need to purchase a physical copy of the book. Instructions for accessing the E-Book will be provided in the course.

Last updated: FA23

Assessment Plan

Weekly Quizzes: Two types of short weekly quizzes will ensure that you are progressing at an appropriate pace throughout the term. Although there is no time limit on these quizzes, please allow sufficient time to take each quiz without interruption.

“Theory”: quizzes based on the assigned reading material.
“R Labs”: quizzes based on the assigned R labs, usually to replicate and modify them.

Weekly Assignments: In the weekly data analysis assignments, you will apply the week’s concepts to an assigned dataset, work toward writing your own code, and practice synthesizing your analyses and findings in concise statistical reports that could be read by a peer or a supervisor.

Projects: There will be one midterm project and one final project. The midterm project will be an individual project. For the final project, teams of 2–3 students are expected.

Grading:

Weekly Quizzes (25%): Knowledge Checks (15%) + R Labs (10%)
Weekly Assignments (25%)
Individual Midterm Project (25%)
Final Team Project (25%)
Participation (Up to 2% extra credit)

Prerequisites

STAT 501 (Regression Methods) or a similar course that covers analysis of research data through simple and multiple regression and correlation; polynomial models; indicator variables; and stepwise, piecewise, and logistic regression.
Basic knowledge of probability, expectation, and conditional distributions.
Matrix algebra and multivariate calculus will be beneficial but are not required.