Welcome to STAT 508!

About this Course

Welcome to the course notes for STAT 508: Applied Data Mining and Statistical Learning. These notes were designed and developed by Penn State's Department of Statistics and are offered as open educational resources, free to use under the Creative Commons license CC BY-NC 4.0.

This course is part of the Online Master of Applied Statistics program offered by Penn State's World Campus.

Currently enrolled?

If you are a current student in this course, please see Canvas for your syllabus, assignments, lesson videos, and communication from your instructor.

How to enroll?

If you would like to enroll and take the entire course for credit, please see 'How to enroll in a course' on the World Campus website.

This course covers methodology, major software tools, and applications in data mining. By introducing the principal ideas of statistical learning, the course helps students understand the conceptual underpinnings of data mining methods. The emphasis is on using existing software packages (mainly in R) rather than on having students develop the algorithms themselves. Students are required to work on projects to practice applying this software. Topics include statistical learning; resampling methods; linear regression; variable selection; regression shrinkage; dimension reduction; non-linear methods; logistic regression; discriminant analysis; nearest neighbors; decision trees; bagging; boosting; support vector machines; principal components analysis; and clustering.

Prerequisites

  • STAT 501 (Regression Methods) or a similar course that covers analysis of research data through simple and multiple regression and correlation; polynomial models; indicator variables; and stepwise, piecewise, and logistic regression.
  • Basics of probability, expectation, and conditional distributions. Review the Basic Statistical Concepts notes on the STAT online site.
  • Matrix algebra and multivariate calculus are beneficial but not required. Review the Matrix Algebra Review notes on the STAT online site.
  • The examples in the course use R, and students will complete weekly R Labs that apply statistical learning methods to real-world data. Extensive guidance in using R will be provided, but basic programming skills in R or exposure to a programming language such as MATLAB or Python will be useful. Introductions to R are available at Statistical R Tutorial and CRAN R Project Intro Manual.

Textbooks

Required: An Introduction to Statistical Learning: with Applications in R (2013) by G. James, D. Witten, T. Hastie, and R. Tibshirani (Springer).

Recommended Reading

  • The Elements of Statistical Learning, 2nd edition, by T. Hastie, R. Tibshirani, and J. Friedman
  • Pattern Recognition and Machine Learning by C. M. Bishop
  • All of Statistics: A Concise Course in Statistical Inference by L. Wasserman
  • Classification and Regression Trees by L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone
  • Principles of Data Mining by D. J. Hand, H. Mannila, and P. Smyth
  • Pattern Recognition and Neural Networks by B. D. Ripley

Other Resources

Example datasets