Lesson 2: Statistical Learning and Model Selection
An Overview of the Learning Process
Key Learning Goals for this Lesson:
Textbook reading: Consult Course Schedule
Statistical learning theory was introduced in the late 1960s, but until the 1990s it remained a largely theoretical analysis of the problem of function estimation from a given collection of data. In the mid-1990s, new types of learning algorithms based on the developed theory (e.g., support vector machines) were proposed. This made statistical learning theory not only a tool for theoretical analysis but also a tool for creating practical algorithms for estimating multidimensional functions.
Statistical learning plays a key role in many areas of science, finance, and industry. A few examples were already considered in Lesson 1. Some more examples of learning problems are:
- Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The prediction is to be based on demographic, diet and clinical measurements for that patient.
- Predict the price of a stock 6 months from now, on the basis of company performance measures and economic data.
- Estimate the amount of glucose in the blood of a diabetic person, from the infrared absorption spectrum of that person’s blood.
- Identify the risk factors for prostate cancer, based on clinical and demographic variables.
The science of learning plays a key role in the fields of statistics, data mining, and artificial intelligence, intersecting with areas of engineering and other disciplines. The abstract learning theory of the 1960s established more general conditions than those discussed in the classical statistical paradigm. Understanding these conditions inspired new algorithmic approaches to function estimation problems.
In essence, a statistical learning problem is one of learning from data. In a typical scenario, we have an outcome measurement, usually quantitative (such as a stock price) or categorical (such as heart attack/no heart attack), that we wish to predict based on a set of features (such as diet and clinical measurements). We have a Training Set, in which we observe both the outcome and the feature measurements for a set of objects. Using this data we build a Prediction Model, or Statistical Learner, which enables us to predict the outcome for new, unseen objects.
A good learner is one that accurately predicts such an outcome.
The examples considered above are all instances of supervised learning.
All statistical learning problems may be formulated as the minimization of an expected loss. Mathematically, the problem of learning is that of choosing, from a given set of functions, the one that predicts the response in the best possible way. To choose the best available function, a risk functional is minimized in a situation where the joint distribution of the predictors and the response is unknown and the only available information comes from the training data.
This formulation of the learning problem is quite general. However, the two main types of problems are:
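The minimization described above can be written out as follows. The notation is an assumption introduced here for illustration and is not taken from the lesson: $X$ denotes the predictors, $Y$ the response, $L$ a loss function, $\mathcal{F}$ the given set of functions, and $P(x, y)$ the unknown joint distribution. Because $P$ is unknown, a common approach is to replace the risk by its average over the $n$ training pairs $(x_i, y_i)$, shown on the last line as the empirical risk.

```latex
\begin{aligned}
R(f) &= \mathbb{E}\big[L(Y, f(X))\big] = \int L\big(y, f(x)\big)\, dP(x, y), \\
f^{*} &= \arg\min_{f \in \mathcal{F}} R(f), \\
R_{\text{emp}}(f) &= \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f(x_i)\big).
\end{aligned}
```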
- Regression Estimation
- Classification
In this course only these two are considered. Regression estimation is the problem of minimizing the risk functional under the squared error loss. In classification, the loss function is an indicator function that equals 1 when the predicted class differs from the true class and 0 otherwise, so the problem becomes that of finding a function that minimizes the misclassification error.
There are several aspects to the model building process, that is, the process of finding an appropriate learning function. One important aspect is the proportion of data allocated to each task, such as model building and evaluating model performance. How much data should be allocated to the training and test sets? It generally depends on the situation. If the pool of data is small, the data splitting decisions can be critical. With large data sets these decisions are less critical.
Before evaluating a model's predictive performance on the test data, quantitative assessments of the model using resampling techniques help us understand how alternative models are expected to perform on new data. Simple visualizations, such as a residual plot in the case of regression, also help.
It is always good practice to try out alternative models. No single model will always do better than every other model on all data sets. Because of this, a strong case can be made for trying a wide variety of techniques and then determining which model to focus on. Cross-validation results, together with a model's performance on the test data, help in making the final decision.
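The two loss functions named above can be illustrated with a minimal sketch that averages a loss over a small data set. The function names (`empirical_risk`, `squared_error`, `zero_one`), the toy data, and the hand-picked prediction rules are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence


def empirical_risk(loss: Callable, predict: Callable,
                   xs: Sequence, ys: Sequence) -> float:
    """Average loss of `predict` over the observed pairs (xs[i], ys[i])."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    return sum(loss(y, predict(x)) for x, y in zip(xs, ys)) / len(xs)


def squared_error(y: float, y_hat: float) -> float:
    """Loss used for regression estimation."""
    return (y - y_hat) ** 2


def zero_one(y, y_hat) -> float:
    """Indicator loss used for classification: 1 if wrong, 0 if right."""
    return 0.0 if y == y_hat else 1.0


# Toy regression data (illustrative only): y = 2x exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 2.0, 4.0, 6.0]
# A candidate function that is off by 1 everywhere.
mse = empirical_risk(squared_error, lambda x: 2.0 * x + 1.0, xs, ys)

# Toy classification labels (illustrative only) on the same inputs.
labels = ["yes", "no", "no", "yes"]
# A candidate rule that predicts "yes" when x is at least 2.
error_rate = empirical_risk(
    zero_one, lambda x: "yes" if x >= 2.0 else "no", xs, labels
)

print(mse)         # mean squared error of the regression candidate
print(error_rate)  # fraction of misclassified observations
```

With the indicator loss, the average is simply the fraction of misclassified observations, which is why minimizing it amounts to minimizing the misclassification error.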
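As a rough sketch of the data allocation step, the following splits a data set at random into training and test portions. The helper `train_test_split` here is a hypothetical standard-library implementation written for this example, and the 20% test fraction is an arbitrary assumption rather than a recommendation from the lesson.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Randomly partition `rows` into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    if len(rows) < 2:
        raise ValueError("need at least two rows to split")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    # Keep at least one row on each side of the split.
    n_test = min(len(rows) - 1, max(1, round(len(rows) * test_fraction)))
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


data = list(range(10))  # stand-in for ten observations
train, test = train_test_split(data, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

With only ten observations the test set holds just two points, which illustrates why splitting decisions matter more when the pool of data is small.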
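One way to compare alternative models by resampling is k-fold cross-validation, sketched below with the standard library only. Everything in the example is an assumption made for illustration: the two candidate models (a constant mean predictor and a one-variable least-squares line), the simulated data, and the choice of five folds.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_mean(xs, ys):
    """Candidate 1: always predict the training mean of y."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Candidate 2: simple least-squares line in one predictor."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def cv_mse(fit, xs, ys, k=5, seed=0):
    """Average held-out mean squared error over k folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train_x = [xs[i] for i in range(len(xs)) if i not in held]
        train_y = [ys[i] for i in range(len(ys)) if i not in held]
        model = fit(train_x, train_y)  # fit on the other k-1 folds
        errors = [(ys[i] - model(xs[i])) ** 2 for i in held_out]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


# Simulated data (illustrative only): a noisy linear relationship.
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [3.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

scores = {"mean": cv_mse(fit_mean, xs, ys), "line": cv_mse(fit_line, xs, ys)}
best = min(scores, key=scores.get)
print(scores, best)
```

Both candidates are scored on the same folds, so the comparison reflects the models rather than the luck of the split. The test set is left untouched for the final assessment of whichever model is chosen.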