STAT 508 | Applied Data Mining and Statistical Learning

About this course

Welcome to the course notes for STAT 508: Applied Data Mining and Statistical Learning. These notes are designed and developed by Penn State’s Department of Statistics and offered as open educational resources. These notes are free to use under Creative Commons license CC BY-NC 4.0.

This course is part of the Online Master of Applied Statistics program offered by Penn State’s World Campus.

Ready to enroll?

Currently enrolled?

If you are a current student in this course, please see Canvas for your syllabus, assignments, lesson videos, and communication from your instructor.

Course Structure

STAT 508 is structured to maximize learning through a ‘hands-on’ approach to data mining and statistical learning. Each lesson features instructional videos that cover key statistical concepts, complemented by guided notes to aid understanding and retention. Practical application is emphasized with R code examples and datasets provided for hands-on learning and active engagement. Weekly assignments are assigned to apply the week’s learning, while each unit culminates in a substantial project requiring integration of multiple skills and concepts learned throughout the unit. This structured progression ensures students not only grasp theoretical foundations but also develop practical statistical analysis skills using R, enhancing their ability to solve real-world problems and interpret data effectively.

Text and Materials

Text

[ISLR2] James G, Witten D, Hastie T, Tibshirani R (2021). An Introduction to Statistical Learning with Applications in R, 2nd ed. Springer: New York, NY. (Due to errata as well as differences in R versions, try to get the latest printing you can find.)

You can download a pdf of the book from the [textbook’s website][https://www.statlearning.com/].

Technology

Throughout the course, students will actively apply statistical concepts using R through guided exercises, assignments, and unit projects.

R can be downloaded from: https://cran.r-project.org
RStudio can be downloaded from: https://posit.co/download/rstudio-desktop/

Course Content

Unit 1: Introduction to Statistical Learning and Exploratory Data Analysis

Lesson 1: Introduction to Statistical Learning, EDA, and Unsupervised Learning

Objectives

Upon completion of this lesson, you should be able to:

Explain the difference between supervised and unsupervised learning.
Explain the difference between regression and classification.
Explain the difference between inference and prediction.
Gain proficiency in R programming by understanding its distinction from RStudio, using built-in functions, installing packages and loading libraries, and loading datasets from various sources into R.
Perform an exploratory data analysis, including the calculation of summary statistics and data visualization, to gain insights from the data.
Perform data wrangling tasks such as subsetting a dataset and creating new variables.

Sample Lesson 1 Content

Here is an excerpt from Lesson 1 that students receive when enrolling in this course.

Video: Introduction to Statistical Learning

Here are the accompanying notes for this video:

STAT 508 Lesson 1 Intro to Statistical Learning and Exploratory Data Analysis guided notes.

Lesson 2: Principal Components Analysis

Objectives

Upon completion of this lesson, you should be able to:

Describe how principal component analysis (PCA) fits into the larger framework of statistical learning.
Perform PCA using statistical software, including analyzing the proportion of variance explained by each principal component.
Use PCA to reduce the dimensionality of high dimensional data, visualize the result, and extract insights.

Lesson 3: Clustering

Objectives

Upon completion of this lesson, you should be able to:

Describe how clustering fits into the larger framework of statistical learning.
Explain a K-means algorithm and potential implementation issues, such as sensitivity to the initialization and locally optimal solutions.
Perform K-means clustering using statistical software, including choosing a value for K.
Describe common types of linkage used in hierarchical clustering.
Interpret a dendrogram.
Perform hierarchical clustering using statistical software, including the creation of a dendrogram for visualizing the results.
Explore the subgroups discovered in a clustering problem and extract meaningful insights

Unit 2: Supervised Learning Regression

Lesson 4: Linear Regression

Objectives

Upon completion of this lesson, you should be able to:

Describe how linear regression fits into the larger framework of statistical learning.
Explain how parameters are estimated using the least-squares criterion.
Fit a linear regression model using statistical software, including situations involving polynomial terms, categorical predictions, and/or interaction terms
Interpret the output of a linear regression model, including coefficient estimates and quality of fit metrics.
Explain why cross validation techniques, such as a training/validation split, are used.
Assess the predictive ability of the linear regression model on new data using measures such as the root mean square error.

Lesson 5: Feature Selection and Regularization

Objectives

Upon completion of this lesson, you should be able to:

Identify the best subset of predictors for a linear regression model using subset selection methods, such as best subsets or stepwise regression, using statistical software.
Compare competing linear regression models using a variety of approaches that estimate test set error directly, such as cross validation techniques, or indirectly, such as AIC.
Explain the concept of bias-variance tradeoff.
Compare and contrast the methods of ridge regression and LASSO.
Apply shrinkage/regularization methods for estimating parameters using statistical software, including selecting a value for the tuning parameter using cross validation.
Interpret the results of LASSO in the context of feature selection

Lesson 6: Dimension Reduction

Objectives

Upon completion of this lesson, you should be able to:

Explain the difference between feature selection and dimension reduction in the context of regression.
Describe the procedure of principal component regression.
Perform principal component regression using statistical software, including using cross validation to select the number of principal components
Describe the procedure of partial least squares.
Perform partial least squares (PLS) using statistical software, including using cross validation to select an appropriate number of PLS directions.
Compare the performance of PCR and PLS to other regression models using cross validation techniques.

Lesson 7: Nonlinear Regression

Objectives

Upon completion of this lesson, you should be able to:

Explain limitations of linear regression.
Compare and contrast regression splines, smoothing splines, and natural spline regression.
Use statistical software to build a generalized additive model for predicting the response based on multiple predictors.
Compare the performance of the nonlinear methods to each other and other regression models using cross validation techniques.

Unit 3: Supervised Learning Classification

Lesson 8: Linear Classification

Objectives

Upon completion of this lesson, you should be able to:

Describe how logistic regression and linear discriminant analysis fits into the larger framework of statistical learning.
Describe the different forms (logit, odds, and probability) of a logistic regression model.
Fit a logistic regression model using statistical software.
Interpret the output of a logistic regression model, including coefficient and probability estimates.
Perform linear discriminant analysis using statistical software.
Use tools, such as a receiver operating characteristic curve and/or confusion matrix, to assess the performance of a linear classification model.
Compare the performance of linear classification models using cross validation techniques.

Lesson 9: Nonlinear Classification

Objectives

Upon completion of this lesson, you should be able to:

Describe how k-nearest neighbors classification and classification trees fit into the larger framework of statistical learning.
Explain the k-nearest neighbors procedure for classification.
Perform k-nearest neighbors classification using statistical software, including using cross validation to select the number of neighbors.
Build a classification tree using statistical software, visualize the results, and explain the variable importance values.
Explain how we guard against overfitting in the context of classification trees.
Compare the performance of the nonlinear methods to other classification models using cross validation techniques.

Unit 4: Advanced Supervised Learning

Lesson 10: Ensemble Learning

Objectives

Upon completion of this lesson, you should be able to:

Compare three classes of ensemble learning techniques: bagging, stacking, and boosting.
Describe how random forest and fits into the larger framework of statistical learning.
Describe how bagging decision trees is different than random forest.
Explain a random forest algorithm.
Perform random forest regression and classification using statistical software.
Evaluate the performance of random forest using out-of-bag observations and validation data and explain the variable importance values.
Compare the performance of the ensemble learning methods to other models using cross validation techniques.

Lesson 11: Support Vector Machines

Objectives

Upon completion of this lesson, you should be able to:

Describe how support vector machines fit into the larger framework of statistical learning.
Define the maximal margin classifier in the context of classes that are separable by a linear boundary.
Explain how support vector classifiers, or soft margin classifiers, differ from the maximal margin classifier.
Explain how support vector machines differ from the support vector classifier.
Fit a support vector machine classifier using statistical software.
Compare the performance of a support vector machine to other classification models using cross validation techniques.

Lesson 12: Neural Networks

Objectives

Upon completion of this lesson, you should be able to:

Describe how neural networks fit into the larger framework of statistical learning.
Describe the architecture of single and multilevel neural networks.
Explain the role of activation functions in neural networks and provide examples of commonly used activation functions.

Supplementary Content

Note!
The following resources are NOT the course content for STAT 508 offered through World Campus. These are supplementary and optional resources only.

Variable Selection

Regression Shrinkage Methods

Ridge Regression

The Lasso

Principal Components Analysis

SVD

PCA

Dimension Reduction Methods

PCR

PLS

Modeling Non-linear Relationships

Non-linear relationships

Classification

Logistic Regression

Binary Classification

Discriminant Analysis

Nearest-Neighbor

Support Vector Machines

Support Vector

Multiclass SVM

Tree-based Methods

Tree-based

Impurity Function

Random Forests

--- title-block-banner: 508banner.png title-block-banner-color: light section-divs: FALSE listing: id: lessons contents: - Lesson01.qmd - Lesson02.qmd - Lesson03.qmd - Lesson04.qmd - Lesson05.qmd - Lesson06.qmd - Lesson07.qmd - Lesson08.qmd - Lesson09.qmd - Lesson10.qmd - Lesson11.qmd - Lesson12.qmd type: grid fields: [image,title, categories] #categories: true # this will add the categories to the right below the toc sort: "filename" --- ## About this course {.unnumbered} Welcome to the course notes for STAT 508: Applied Data Mining and Statistical Learning. These notes are designed and developed by [Penn State's Department of Statistics](https://science.psu.edu/stat) and offered as open educational resources. These notes are free to use under Creative Commons license [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/). This course is part of the Online Master of Applied Statistics program offered by Penn State's World Campus. <center><a href="https://www.worldcampus.psu.edu/register-or-schedule-courses" class="btn btn-primary btn-lg active" role="button" aria-pressed="true">Ready to enroll?</a></center> > ### Currently enrolled? > > If you are a current student in this course, please see [Canvas](https://psu.instructure.com/) for your syllabus, assignments, lesson videos, and communication from your instructor. ::: content-hidden > ### How to enroll? > > If you would like to enroll and experience the entire course for credit please see ['How to enroll in a course'](https://www.worldcampus.psu.edu/register-or-schedule-courses) on the World Campus website. ::: ### Course Structure STAT 508 is structured to maximize learning through a 'hands-on' approach to data mining and statistical learning. Each lesson features instructional videos that cover key statistical concepts, complemented by guided notes to aid understanding and retention. Practical application is emphasized with R code examples and datasets provided for hands-on learning and active engagement. Weekly assignments are assigned to apply the week’s learning, while each unit culminates in a substantial project requiring integration of multiple skills and concepts learned throughout the unit. This structured progression ensures students not only grasp theoretical foundations but also develop practical statistical analysis skills using R, enhancing their ability to solve real-world problems and interpret data effectively. ### Text and Materials :::: ms-3 **Text** ![An Introduction to Statistical Learning](assets/ISLRtextbook.png){.lightbox .float-end .d-flex fig-alt="ISLR Textbook cover" fig-align="center" width="10%"} \[ISLR2\] James G, Witten D, Hastie T, Tibshirani R (2021). An Introduction to Statistical Learning with Applications in R, 2nd ed. Springer: New York, NY. (Due to errata as well as differences in R versions, try to get the latest printing you can find.) You can download a pdf of the book from the \[textbook's website\]\[https://www.statlearning.com/\]. ::: {.ricon .float-end .d-flex} ::: **Technology** Throughout the course, students will actively apply statistical concepts using R through guided exercises, assignments, and unit projects. - R can be downloaded from: <https://cran.r-project.org> - RStudio can be downloaded from: <https://posit.co/download/rstudio-desktop/> :::: ### Course Content ```{=html} <div class="accordion" id="508-accordion"> <div class="accordion-item"> <button class="accordion-button" type="button" data-bs-toggle="collapse" data-bs-target="#unit1-panel" aria-expanded="false" aria-controls="unit1-panel"> <h3 class="accordion-header" id="unit1-header">Unit 1: Introduction to Statistical Learning and Exploratory Data Analysis</h3> </button> <div id="unit1-panel" class="accordion-collapse collapse show" aria-labelledby="unit1-header"> <div class="accordion-body"> ``` ### Lesson 1: Introduction to Statistical Learning, EDA, and Unsupervised Learning {.unnumbered .unlisted} ::: objectiveblock <i class="bi bi-check2-circle"></i>[Objectives]{.callout-header} Upon completion of this lesson, you should be able to: 1. Explain the difference between supervised and unsupervised learning. 2. Explain the difference between regression and classification. 3. Explain the difference between inference and prediction. 4. Gain proficiency in R programming by understanding its distinction from RStudio, using built-in functions, installing packages and loading libraries, and loading datasets from various sources into R. 5. Perform an exploratory data analysis, including the calculation of summary statistics and data visualization, to gain insights from the data. 6. Perform data wrangling tasks such as subsetting a dataset and creating new variables. ::: ::::: {.callout-warning appearance="minimal"} ### Sample Lesson 1 Content {.unnumbered .unlisted} Here is an excerpt from Lesson 1 that students receive when enrolling in this course. #### Video: Introduction to Statistical Learning :::: text-center ::: {style="position:relative;padding-bottom:60%"} <iframe id="kaltura_player" type="text/javascript" src="https://cdnapisec.kaltura.com/p/2356971/embedPlaykitJs/uiconf_id/54598732?iframeembed=true&entry_id=1_92gjt6p2&config[provider]={"widgetId":"1_4rwn9jv7"}&config[playback]={"startTime":0}" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" sandbox="allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation" title="STAT 508: Lesson 1 Part 1" style="position:absolute;top:0;left:0;width:100%;height:100%;border:0"> </iframe> ::: :::: \ Here are the accompanying notes for this video: [<i class="bi bi-file-earmark-pdf"></i> STAT 508 Lesson 1 Intro to Statistical Learning and Exploratory Data Analysis guided notes.](/lesson1guidednotes.pdf){download="" target="_blank"} ::::: ### Lesson 2: Principal Components Analysis {.unnumbered .unlisted} ::: objectiveblock <i class="bi bi-check2-circle"></i>[Objectives]{.callout-header} Upon completion of this lesson, you should be able to: 1. Describe how principal component analysis (PCA) fits into the larger framework of statistical learning. 2. Perform PCA using statistical software, including analyzing the proportion of variance explained by each principal component. 3. Use PCA to reduce the dimensionality of high dimensional data, visualize the result, and extract insights. ::: ### Lesson 3: Clustering {.unnumbered .unlisted} ::: objectiveblock <i class="bi bi-check2-circle"></i>[Objectives]{.callout-header} Upon completion of this lesson, you should be able to: 1. Describe how clustering fits into the larger framework of statistical learning. 2. Explain a K-means algorithm and potential implementation issues, such as sensitivity to the initialization and locally optimal solutions. 3. Perform K-means clustering using statistical software, including choosing a value for K. 4. Describe common types of linkage used in hierarchical clustering. 5. Interpret a dendrogram. 6. Perform hierarchical clustering using statistical software, including the creation of a dendrogram for visualizing the results. 7. Explore the subgroups discovered in a clustering problem and extract meaningful insights ::: ```{=html} </div> </div> </div> <div class="accordion-item"> <button class="accordion-button collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#unit2-panel" aria-expanded="false" aria-controls="unit2-panel"> <h3 class="accordion-header" id="unit2-header">Unit 2: Supervised Learning Regression</h3> </button> <div id="unit2-panel" class="accordion-collapse collapse" aria-labelledby="unit2-header"> <div class="accordion-body"> ``` ### Lesson 4: Linear Regression {.unnumbered .unlisted} ::: objectiveblock <i class="bi bi-check2-circle"></i>[Objectives]{.callout-header} Upon completion of this lesson, you should be able to: 1. Describe how linear regression fits into the larger framework of statistical learning. 2. Explain how parameters are estimated using the least-squares criterion. 3. Fit a linear regression model using statistical software, including situations involving polynomial terms, categorical predictions, and/or interaction terms 4. Interpret the output of a linear regression model, including coefficient estimates and quality of fit metrics. 5. Explain why cross validation techniques, such as a training/validation split, are used. 6. Assess the predictive ability of the linear regression model on new data using measures such as the root mean square error. ::: ### Lesson 5: Feature Selection and Regularization {.unnumbered .unlisted} ::: objectiveblock <i class="bi bi-check2-circle"></i>[Objectives]{.callout-header} Upon completion of this lesson, you should be able to: 1. Identify the best subset of predictors for a linear regression model using subset selection methods, such as best subsets or stepwise regression, using statistical software. 2. Compare competing linear regression models using a variety of approaches that estimate test set error directly, such as cross validation techniques, or indirectly, such as AIC. 3. Explain the concept of bias-variance tradeoff. 4. Compare and contrast the methods of ridge regression and LASSO. 5. Apply shrinkage/regularization methods for estimating parameters using statistical software, including selecting a value for the tuning parameter using cross validation. 6. Interpret the results of LASSO in the context of feature selection ::: ### Lesson 6: Dimension Reduction {.unnumbered .unlisted} ::: objectiveblock <i class="bi bi-check2-circle"></i>[Objectives]{.callout-header} Upon completion of this lesson, you should be able to: 1. Explain the difference between feature selection and dimension reduction in the context of regression. 2. Describe the procedure of principal component regression. 3. Perform principal component regression using statistical software, including using cross validation to select the number of principal components 4. Describe the procedure of partial least squares. 5. Perform partial least squares (PLS) using statistical software, including using cross validation to select an appropriate number of PLS directions. 6. Compare the performance of PCR and PLS to other regression models using cross validation techniques. ::: ### Lesson 7: Nonlinear Regression {.unnumbered .unlisted} ::: objectiveblock <i class="bi bi-check2-circle"></i>[Objectives]{.callout-header} Upon completion of this lesson, you should be able to: 1. Explain limitations of linear regression. 2. Compare and contrast regression splines, smoothing splines, and natural spline regression. 3. Use statistical software to build a generalized additive model for predicting the response based on multiple predictors. 4. Compare the performance of the nonlinear methods to each other and other regression models using cross validation techniques. ::: ```{=html} </div> </div> </div> <div class="accordion-item"> <button class="accordion-button collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#unit3-panel" aria-expanded="false" aria-controls="unit3-panel"> <h3 class="accordion-header" id="unit3-header">Unit 3: Supervised Learning Classification</h3> </button> <div id="unit3-panel" class="accordion-collapse collapse" aria-labelledby="unit3-header"> <div class="accordion-body"> ``` ### Lesson 8: Linear Classification {.unnumbered .unlisted} ::: objectiveblock <i class="bi bi-check2-circle"></i>[Objectives]{.callout-header} Upon completion of this lesson, you should be able to: 1. Describe how logistic regression and linear discriminant analysis fits into the larger framework of statistical learning. 2. Describe the different forms (logit, odds, and probability) of a logistic regression model. 3. Fit a logistic regression model using statistical software. 4. Interpret the output of a logistic regression model, including coefficient and probability estimates. 5. Perform linear discriminant analysis using statistical software. 6. Use tools, such as a receiver operating characteristic curve and/or confusion matrix, to assess the performance of a linear classification model. 7. Compare the performance of linear classification models using cross validation techniques. ::: ### Lesson 9: Nonlinear Classification {.unnumbered .unlisted} ::: objectiveblock <i class="bi bi-check2-circle"></i>[Objectives]{.callout-header} Upon completion of this lesson, you should be able to: 1. Describe how k-nearest neighbors classification and classification trees fit into the larger framework of statistical learning. 2. Explain the k-nearest neighbors procedure for classification. 3. Perform k-nearest neighbors classification using statistical software, including using cross validation to select the number of neighbors. 4. Build a classification tree using statistical software, visualize the results, and explain the variable importance values. 5. Explain how we guard against overfitting in the context of classification trees. 6. Compare the performance of the nonlinear methods to other classification models using cross validation techniques. ::: ```{=html} </div> </div> </div> <div class="accordion-item"> <button class="accordion-button collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#unit4-panel" aria-expanded="false" aria-controls="unit4-panel"> <h3 class="accordion-header" id="unit4-header">Unit 4: Advanced Supervised Learning</h3> </button> <div id="unit4-panel" class="accordion-collapse collapse" aria-labelledby="unit4-header"> <div class="accordion-body"> ``` ### Lesson 10: Ensemble Learning {.unnumbered .unlisted} ::: objectiveblock <i class="bi bi-check2-circle"></i>[Objectives]{.callout-header} Upon completion of this lesson, you should be able to: 1. Compare three classes of ensemble learning techniques: bagging, stacking, and boosting. 2. Describe how random forest and fits into the larger framework of statistical learning. 3. Describe how bagging decision trees is different than random forest. 4. Explain a random forest algorithm. 5. Perform random forest regression and classification using statistical software. 6. Evaluate the performance of random forest using out-of-bag observations and validation data and explain the variable importance values. 7. Compare the performance of the ensemble learning methods to other models using cross validation techniques. ::: ### Lesson 11: Support Vector Machines {.unnumbered .unlisted} ::: objectiveblock <i class="bi bi-check2-circle"></i>[Objectives]{.callout-header} Upon completion of this lesson, you should be able to: 1. Describe how support vector machines fit into the larger framework of statistical learning. 2. Define the maximal margin classifier in the context of classes that are separable by a linear boundary. 3. Explain how support vector classifiers, or soft margin classifiers, differ from the maximal margin classifier. 4. Explain how support vector machines differ from the support vector classifier. 5. Fit a support vector machine classifier using statistical software. 6. Compare the performance of a support vector machine to other classification models using cross validation techniques. ::: ### Lesson 12: Neural Networks {.unnumbered .unlisted} ::: objectiveblock <i class="bi bi-check2-circle"></i>[Objectives]{.callout-header} Upon completion of this lesson, you should be able to: 1. Describe how neural networks fit into the larger framework of statistical learning. 2. Describe the architecture of single and multilevel neural networks. 3. Explain the role of activation functions in neural networks and provide examples of commonly used activation functions. ::: ```{=html} </div> </div> </div> </div> ``` ## Supplementary Content ::: {.callout-caution appearance="minimal"} **Note!**\ The following resources are **NOT** the course content for STAT 508 offered through World Campus. These are supplementary and optional resources only. ::: ::: {#lessons} :::

About this course

Currently enrolled?

Course Structure

Text and Materials

Course Content

Unit 1: Introduction to Statistical Learning and Exploratory Data Analysis

Lesson 1: Introduction to Statistical Learning, EDA, and Unsupervised Learning

Video: Introduction to Statistical Learning

Lesson 2: Principal Components Analysis

Lesson 3: Clustering

Unit 2: Supervised Learning Regression

Lesson 4: Linear Regression

Lesson 5: Feature Selection and Regularization

Lesson 6: Dimension Reduction

Lesson 7: Nonlinear Regression

Unit 3: Supervised Learning Classification

Lesson 8: Linear Classification

Lesson 9: Nonlinear Classification

Unit 4: Advanced Supervised Learning

Lesson 10: Ensemble Learning

Lesson 11: Support Vector Machines

Lesson 12: Neural Networks

Supplementary Content

Introduction to Data Mining

Exploratory Data Analysis (EDA)

Statistical Learning and Model Selection

Linear Regression

Variable Selection

Regression Shrinkage Methods

Principal Components Analysis

Dimension Reduction Methods

Modeling Non-linear Relationships

Classification

Support Vector Machines

Tree-based Methods