1 Introduction to Data Mining
Overview
With rapid advances in information technology, we have witnessed explosive growth in data generation and data collection capabilities across all domains. In the business world, retailers and e-commerce sites have generated very large databases of commercial transactions. A huge amount of scientific data has been generated in various fields as well; one case in point is the Human Genome Project, which has aggregated gigabytes of data on the human genetic code. The World Wide Web provides another example, with billions of web pages of textual and multimedia information used by millions of people. Analyzing these huge bodies of data so that they can be understood and used efficiently remains a challenging problem. Data mining addresses this problem by providing techniques and software to automate the analysis and exploration of large and complex data sets. Research on data mining is pursued in a wide variety of fields, including statistics, computer science, machine learning, database management, and data visualization, to name a few.
This course on data mining covers commonly used techniques and applications in this field. Though the focus is on applying the methods through the software R, considerable effort is devoted to developing the mathematical basis. Data mining and learning techniques developed in fields other than statistics, e.g., machine learning and signal processing, are also introduced. After completing the course, students should be able to identify situations where the techniques are applicable, employ the techniques to derive results, interpret the results, and understand the limitations, if any, of the final outcome.
Objectives
Upon successful completion of this lesson, you should be able to:
- Explain the basic concepts of data mining: supervised vs. unsupervised learning with reference to classification, clustering, regression, etc.
- Recognize that the formulation of a real-world problem into a statistical learning problem is important, although in this course we focus on algorithms for solving already formulated problems.
- Recognize that a core issue in designing learning algorithms is to balance performance on the training data with robustness on unobserved test data.
- Discuss an overview of several major approaches to classification.
1.1 What is Data Mining?
Data mining refers to a set of methods applicable to large and complex databases for filtering out randomness and discovering hidden patterns. Data mining methods are almost always computationally intensive. Data mining is about the tools, methodologies, and theories for revealing patterns in data, which is a critical step in knowledge discovery. There are several driving forces behind why data mining has become such an important area of study.
The explosive growth of data in a great variety of fields in industry and academia is supported by:
- Cheaper storage devices with ever-larger capacities, such as cloud storage
- Faster communication with faster connection speeds
- Better database management systems and software support
- Rapidly increasing computing power
With such a high volume of varied data available, data mining techniques help to extract information from the data.
Statistical learning methods range from classical linear regression to recently developed, complex, and computation-intensive pattern recognition methods with roots in computer science. The main objective of learning methods is prediction, though that need not be the only objective. In this course, however, only prediction methods are considered.
1.2 Examples of Data Mining Applications
The phrase 'data mining' became popular in the early 1990s. Currently, statistical learning, data analytics, and data science are other commonly used terms. Since data have become very cheap and data collection methods almost automated, success in many fields, such as business, depends on the efficient and intelligent use of the collected data. In this respect, data mining efforts are omnipresent. The following examples are only indicative of a few interesting application areas. As more communication between different disciplines occurs, application areas are likely to evolve and new ones to emerge.
- Businesses benefit from data mining
Large retailers like Walmart utilize information on store footfall, advertising campaigns, and even weather forecasts to predict sales and stock up accordingly. Credit card companies mine transaction records for fraudulent use of their cards based on consumers' purchase patterns; they can deny access if your purchase patterns change drastically!
- Genomics research gathers speed using computational methods
The Human Genome Project piles up masses of data, but getting the data to work for humankind, to develop new drugs and weed out diseases, requires pattern recognition in the data, which is handled in bioinformatics. Scientists use microarray data to look at gene expressions, and sophisticated data analysis techniques are employed to account for background noise and to normalize the data.
- Information retrieval
Terabytes of data are accumulating on the internet, including Facebook and X data as well as Instagram and other social networking sites. This vast repository may be mined, and controlled to some extent, to sway public opinion in a candidate's favor (election strategy) or to evaluate a product's performance (marketing and sales strategy). Another aspect of social media is multimedia information, containing visual as well as audio data files. How do we manage these types of data efficiently? Mining non-alphanumeric data is not easy.
- Communication systems
Speech recognition is one area where important pattern recognition methods were developed and have since been transferred to other application areas. Image analysis is another important area of data mining application, and facial recognition techniques are part of security arrangements.
1.3 The Nature of The Problem
Even though this is an applied course on data mining, with the focus on data analysis through software and interpretation of results, familiarity with the mathematical underpinnings helps in understanding the applicability and limitations of the methods. Computational techniques have given data mining unprecedented power but, at the same time, have increased the chances of blindly applying a technique to a situation without paying heed to its applicability. Analytical insight does not come with a software application; a software application enhances analytical insight. Blind application of software to a large number of records will not necessarily provide insight into the data; rather, it is possible that in the mire of information all grains of truth will be inextricably lost.
Let us start with an overview of the data mining techniques that are going to be considered in this course. The focus is on the problem of Prediction. This is by no means the only problem that data mining can deal with. There are many other topics outside the scope of the current course. Data mining is multi-disciplinary and encompasses methods dealing with scaling up for high-dimensional data and high-speed data streams, distributed data mining, mining in a network setting and many other facets. Within this course, our focus is on Statistical Learning and Prediction.
The diagram here presents the main aspects of a statistical learning model.
In a learning (prediction) problem, there is a set of features (X) and a response (Y). X is usually a vector. For the purposes of this course, Y is a real number: either a quantitative variable or a label for a categorical variable. The predictor is a mathematical function (f) that maps X to Y.
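In symbols, a common way to write this in the regression setting (assuming an additive error term, which is a modeling assumption rather than something established in this lesson) is
\[ Y = f(X) + \varepsilon, \qquad \hat{Y} = \hat{f}(X), \]
where \(\varepsilon\) is an error term capturing what X cannot explain, and \(\hat{f}\) is the estimate of f produced by a learning algorithm.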
The problem is: how do we find the function f?
There may be different approaches to solve this problem. For instance, researchers in medical domains form their prediction functions based on individual expertise and domain knowledge. Physicians ask their patients about the symptoms, and then based on their experience, they will identify the disease.
Such a problem of human prediction function is not of interest in this course. We are interested in studying predictors generated by learning algorithms.
The approach considered in this course is purely data-driven. The first step in any model-building process is to understand the data, which can be done graphically or analytically; when the data are complex, a combination of visual and analytical processes gives the best results. This step is often called EDA, or Exploratory Data Analysis. The second step is to build and evaluate a model (or a set of candidate models) on the data. A standard approach is to take a random sample of the data for model building and use the rest of the data to evaluate the model's performance. The part of the sample that is used to build a model is called the Training Sample (training set or training data) and the other part the Test Sample (test set or test data). The training sample is used to develop the relationship between X and Y, and the model parameters are estimated on this data. The test sample is used only when one model among a few strong candidate models has been finalized; repeatedly using the test sample in the model-building process negates its utility as a final arbiter of the model.
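As a concrete illustration, here is a minimal sketch in base R of such a split, using the built-in mtcars data and a linear model purely as a placeholder for a candidate model (neither is prescribed by the lesson):

```r
set.seed(1)                                    # for a reproducible split
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))  # random 70% of rows for training

train <- mtcars[train_idx, ]                   # training sample: fit models here
test  <- mtcars[-train_idx, ]                  # test sample: touch only at the end

fit  <- lm(mpg ~ wt + hp, data = train)        # candidate model built on training data
pred <- predict(fit, newdata = test)
mean((test$mpg - pred)^2)                      # test-set mean squared error
```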
The learning algorithms explore the given data set and extract a relationship between X and Y. The output of a learning algorithm is a function mapping X to Y. This is known as a supervised learning algorithm. In unsupervised learning algorithms, the response Y is not known, or is not considered in developing the algorithm.
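As a minimal contrast, here is a hedged sketch of an unsupervised algorithm in base R, using the built-in iris data and k-means clustering purely for illustration: the algorithm is given only X and never sees the response.

```r
# Unsupervised: the algorithm sees only the four measurement columns (X),
# never the species label (Y)
clusters <- kmeans(iris[, 1:4], centers = 3, nstart = 20)$cluster

# The labels are used only afterwards, to inspect how the clusters line up
table(cluster = clusters, species = iris$Species)
```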
At face value, model building appears straightforward. Once the data are available, with the help of software, several techniques are applied to the training data and one final model emerges after examining its performance on the test data. However, to achieve a reliable and trustworthy predictive model, an understanding of the features in the data and the objective of the modeling is essential. In fact, reality is often complicated, and the formulation of a practical problem into a data mining problem may be the real challenge. Sometimes only raw data is provided for analysis. In other cases, researchers have the freedom to collect the data. Collection of relevant data is costly and requires domain knowledge. Between the raw data and model building, there is a data simplification step known as feature extraction. Very often the raw data are not easy to handle, or there are layers of hidden information that have to be revealed before the data are submitted to a learning algorithm.
1.4 Terminology
Notation
Input X: X is often multidimensional.
Each dimension of X is denoted by \(X_j\) and is referred to as a feature, an independent (predictor) variable, or simply a variable (depending on your area of study).
Output Y: called the response or dependent variable.
The response is available only when learning is supervised.
1.4.1 The Nature of Data Sets
Quantitative
Measurements or counts, recorded as numerical values, e.g. Height, Temperature, # of Red M&M’s in a bag
Qualitative
Groups or categories
- Ordinal: possesses a natural ordering, e.g. Shirt sizes (S, M, L, XL)
- Nominal: just names of the categories, e.g. Marital Status, Gender, Color of M&M's in a bag
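In R, the software used in this course, this distinction is reflected in how categorical variables are stored; a brief sketch with made-up values:

```r
# Nominal: categories with no inherent ordering
marital <- factor(c("single", "married", "divorced", "married"))

# Ordinal: categories with a natural ordering, declared via ordered = TRUE
shirt <- factor(c("M", "S", "XL", "L"),
                levels = c("S", "M", "L", "XL"), ordered = TRUE)

shirt < "L"   # comparisons are meaningful only for ordered factors
```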
Try it! Identify the response and the predictor variables as well as the variable type.
Does a person’s hometown influence the amount they would pay for a single CD?
Response: the amount they would pay (Quantitative)
Predictor: hometown (Qualitative: nominal)
Do men with higher education levels have a higher income?
Response: income (Quantitative)
Predictors: education level (Qualitative: ordinal), gender (Qualitative: nominal)
Is there a relationship between gender and favorite kind of music?
Response: favorite music (Qualitative: nominal)
Predictor: gender (Qualitative: nominal)
Supervised Learning versus Unsupervised Learning
If Y is available in the training data, then the learning method is supervised. If Y is unavailable (or ignored even if available), then the method is unsupervised.
Supervised Learning is categorized into two types:
Regression
The response Y is quantitative.
Classification
The response Y is qualitative, or categorical.
Regression (Quantitative) versus Classification (Qualitative)
- Is Y quantitative or qualitative?
When Y is qualitative or categorical, then it is just a label. The set of labels can be designated by \[G \in \mathcal{G} = \{1, 2, \ldots, K\}\]
If Y is quantitative, then the learning problem is a regression problem. If Y is qualitative, the learning problem is a classification problem.
Ideally, a learning algorithm will have the following properties:
- The algorithm will be a good fit to the data. Since the model is developed using the training data, the model is expected to fit the training data well.
- The algorithm will be as robust as possible. A robust algorithm is expected to do well on test data; i.e., the algorithm will have high predictive power. A 'good' predictive model developed using the training data must also work well on the test data. This may seem to be true by default, but it is not! When fitting the training data, the model should not follow the data too closely, because there is no guarantee that new data obtained in the future will be an exact replica of the training data. Hence the need for the model to be robust.
A simpler model, therefore, tends to be more robust than a complex one, in the sense that it may have higher predictive power on new data. A complex model may follow the pattern in the training data too closely, and hence its performance on the test data may be worse. On the other hand, a simple model does not fit the training data aggressively. Hence there is always a trade-off, which manifests itself through the following equivalent concepts.
- Training error versus Test error
Training error reflects whether the model fits the training data well. Test error reflects whether the predictor actually works on new data. A model with the smallest training error will not necessarily provide the smallest test error.
- Bias versus Variance
Bias is a measure of how closely a model resembles reality. If a linear model is proposed when the true relationship between X and Y is quadratic, the proposed model is biased. If the same learning algorithm is applied to multiple independent training data sets, a different predictor estimate will result each time. If the average of these predictors is the same as the true value of the quantity in consideration, then the prediction is unbiased. Bias tends to be smaller when a model includes more parameters and a more complex relationship. Complex models have more 'tuning knobs' to fine-tune the model, but it is harder to find the right position for more 'knobs'. Bias is the systematic part of the difference between model and truth.
Variance, on the other hand, is a measure of how much the predictor estimate differs when different training data are used. Variance tends to be higher for more complex models. Finding a balance between bias and variance is the objective of developing an optimal predictive model, because the accuracy of a model is affected by both.
- Fitting versus Overfitting
An overfitted model follows the training data too closely. It may have low bias, but its variance will be high. This indicates that the predictor works very well on the training data but performs substantially worse on the test data.
- Empirical Risk vs. Model Complexity
The empirical risk is the error rate based on the training data. A more complex model tends to yield a smaller empirical risk, but at the same time it is less robust, i.e., has higher variance. Some classification methods, such as the support vector machine, directly trade off empirical risk and model complexity.
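For squared-error loss, this trade-off can be stated precisely. Under the additive-error model \(Y = f(X) + \varepsilon\) with \(E(\varepsilon) = 0\), the expected test error of an estimate \(\hat{f}\) at a fixed input \(x_0\) decomposes as (a standard identity, quoted here without derivation)
\[ E\big[(Y - \hat{f}(x_0))^2\big] = \big(E[\hat{f}(x_0)] - f(x_0)\big)^2 + \operatorname{Var}\big(\hat{f}(x_0)\big) + \operatorname{Var}(\varepsilon), \]
that is, squared bias plus variance plus an irreducible error that no model can remove.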
Note that all of the above concepts summarize one single idea: a learning algorithm has to strike a balance between complexity and robustness so that it performs as well as possible on the training sample as well as on the test sample.
Below is a very interesting figure from “The Elements of Statistical Learning” which tries to explain the above ideas. Note that the static diagram is attempting to capture something that is very dynamic.
The Truth at the center of the blue circle is what the data mining process tries to get at. Unfortunately, the truth is not revealed to the practitioner! What is given is a sample dataset which has an empirical distribution, possibly contained anywhere in the blue circle.
A large (more complex) model space is compared with a smaller (more restricted) one. The two yellow circles indicate the range of the estimated models obtained under the two model spaces. Considering the larger model space first, the average model obtained is indicated by the center of the large yellow circle. The difference between this center and the truth is the bias for the larger model space. Similarly, the difference between the truth and the center of the smaller yellow circle is the bias for the smaller model space. The smaller model space has a larger bias. On the other hand, the resulting model from the smaller space does not vary as much as that from the larger space, that is, the variance is smaller.
Although the larger model space is on average better (smaller bias), the particular model is more likely to be poor because the variation from the average is high.
1.4.2 The Learning Spectrum
From a historical perspective, there are two ends of the spectrum regarding learning or classification. At one end lie simple models that are very constrained. On the other end, there are very complex models that could be extremely flexible. Over the years research activities have been pushing forward a compromise or balance between model complexity and flexibility. Regularization is added to the complex models on one side, and model extension is made to the simple models on the other. Striking the perfect balance is the name of the game!
1.5 Classification Problems in Real Life
Here are a few interesting examples to illustrate the widespread application of prediction algorithms.
1 - Email Spam
The goal is to predict whether an email is spam and should be delivered to the Junk folder.
There is more than one method of identifying an email as spam; a simple method is discussed here.
The raw data comprises only the text part but ignores all images. Text is a simple sequence of words which is the input (X). The goal is to predict the binary response Y: spam or not.
The first step is to process the raw data into a vector, which can be done in several ways. The method followed here is based on the relative frequencies of the most common words and punctuation marks in e-mail messages. A set of 57 such words and punctuation marks was pre-selected by researchers. This is where domain knowledge plays an important role.
Given these 57 most commonly occurring words and punctuation marks, we then compute, for every e-mail message, the relative frequency of each word, i.e., the percentage of times the word appears relative to the total number of words in the message.
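A hedged sketch of this pre-processing step in base R, with a tiny made-up keyword list standing in for the 57 pre-selected words and characters and a simplified treatment of punctuation:

```r
# Hypothetical stand-ins for the 57 pre-selected words and punctuation marks
keywords <- c("free", "money", "george", "!", "$")

# Relative frequency (in %) of each keyword among the tokens of one message
keyword_freq <- function(message, keywords) {
  tokens <- unlist(strsplit(tolower(message), "[[:space:]]+"))
  100 * sapply(keywords, function(w) sum(tokens == w)) / length(tokens)
}

keyword_freq("win free money now !", keywords)
```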
In the current example, 4601 email messages were considered in the training sample. These e-mail messages were identified as either good e-mail or spam after reading them, assuming implicitly that the human decision is perfect (an arguable point!). The relative frequencies of the 57 most commonly used words and punctuation marks were computed for this set of emails. This is an example of supervised learning, since the response Y is known in the training data.
In the future when a new email message is received, the algorithm will analyze the text sequence and compute the relative frequency for these 57 identified words. This is the new input vector to be classified into spam or not through the learning algorithm.
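To make the supervised step concrete, here is a hedged sketch on simulated data standing in for the real 4601-message collection; logistic regression is used purely as one possible classifier, not as the method prescribed by the lesson.

```r
set.seed(42)
# Simulated stand-in: n messages, p relative-frequency features, 0/1 spam label
n <- 200; p <- 5
X <- matrix(runif(n * p, 0, 10), n, p,
            dimnames = list(NULL, paste0("word", 1:p)))
spam <- rbinom(n, 1, plogis(0.5 * X[, 1] - 0.4 * X[, 2]))
dat <- data.frame(X, spam = spam)

train_idx <- sample(n, 150)
fit <- glm(spam ~ ., data = dat[train_idx, ], family = binomial)

# Classify held-out messages: predicted probability > 0.5 means "spam"
prob <- predict(fit, newdata = dat[-train_idx, ], type = "response")
table(predicted = prob > 0.5, actual = dat$spam[-train_idx])
```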
2 - Handwritten Digit Recognition
The goal is to identify images of single digits 0 - 9 correctly.
The raw data comprises images that are scaled segments from five-digit ZIP codes. In the diagram below every green box is one image. The original images are very small, containing only 16 × 16 pixels. For convenience the images below are enlarged, hence the pixelation or ‘boxiness’ of the numbers.
Every image is to be identified as 0 or 1 or 2 … or 9. Since the numbers are handwritten, the task is not trivial. For instance, a ‘5’ sometimes can very much look like a ‘6’, and ‘7’ is sometimes confused with ‘1’.
To the computer, an image is a matrix, and every pixel in the image corresponds to one entry in the matrix. Every entry is an integer ranging from a pixel intensity of 0 (black) to 255 (white). Hence the raw data can be submitted to the computer directly without any feature extraction. The image matrix was scanned row by row and then arranged into a large 256-dimensional vector. This is used as the input to train the classifier. Note that this is also a supervised learning algorithm where Y, the response, is multi-level and can take 10 values.
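The row-by-row scan can be mimicked in a couple of lines of R, here on a made-up image since the actual ZIP-code data are not loaded; note that R stores matrices column by column, so a transpose is needed to read the rows first.

```r
set.seed(1)
# Hypothetical 16 x 16 grayscale digit image with intensities 0-255
img <- matrix(sample(0:255, 16 * 16, replace = TRUE), nrow = 16, ncol = 16)

x <- as.vector(t(img))   # transpose so the matrix is read row by row
length(x)                # 256, the dimension of the classifier's input vector
```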
3 - Image segmentation
Here is a more complex example of an image processing problem. The satellite images are to be segmented into man-made and natural regions. For instance, in the aerial images shown below, buildings are labeled as man-made, and the vegetation areas are labeled as natural.
These grayscale images are much larger than those in the previous example. The images are 512 × 512 pixels, and again, because they are grayscale, pixel intensity can be represented by the numbers 0 to 255.
In the previous example of handwritten digit identification, no feature extraction was done because of the small size of the images. In this problem, however, feature extraction is necessary. A standard method of feature extraction in image processing is to divide the image into blocks of pixels or to form a neighborhood around each pixel. As shown in the following diagram, after dividing the image into blocks of pixels or forming a neighborhood around each pixel, each block may be described by several features. As we saw in the previous example, a grayscale image can be represented by one matrix, in which every entry is an integer ranging from a pixel intensity of 0 (black) to 255 (white). Color images are represented by values of R, G, and B (red, green, and blue); a color image, therefore, is represented by three such matrices, as seen below.
For each block, a few features (or statistics) may be computed using the color values of the pixels in the block; together these form a feature vector for the block, as sketched after the list of example features below.
Examples of features:
- Average of R, G and B values for pixels in one block
- Variance of the brightness of the pixels (brightness is the average of RGB color values). Small variance indicates the block is visually smooth.
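A sketch of how such block features might be computed in base R, on a made-up color image; the block size and the specific features are illustrative choices, not prescribed by the lesson.

```r
set.seed(1)
# Hypothetical 512 x 512 color image: R, G, B channels with intensities 0-255
img <- array(sample(0:255, 512 * 512 * 3, replace = TRUE), dim = c(512, 512, 3))

# Features for one square block with its top-left corner at (row, col)
block_features <- function(img, row, col, size = 8) {
  block <- img[row:(row + size - 1), col:(col + size - 1), ]
  channel_means <- apply(block, 3, mean)       # average R, G, B over the block
  brightness <- apply(block, c(1, 2), mean)    # per-pixel average of R, G, B
  c(mean_R = channel_means[1], mean_G = channel_means[2],
    mean_B = channel_means[3], var_brightness = var(as.vector(brightness)))
}

block_features(img, 1, 1)
```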
The feature vectors for the blocks are sometimes treated as independent samples from an unknown distribution. Ignoring the spatial dependence among feature vectors, however, results in a loss of performance. To make the learning algorithm efficient, the spatial dependence needs to be exploited; only then will the classification accuracy improve.
4 - Speech Recognition
Another interesting example of data mining deals with speech recognition. For instance, if you call the University Park Airport, the system might ask you your flight number, or your origin and destination cities. The system does a very good job recognizing city names. This is a classification problem, in which each city name is a class. The number of classes is very big but finite.
The raw data involves voice amplitude sampled at discrete time points (a time sequence), which may be represented in the waveforms as shown above. In speech recognition, a very popular method is the Hidden Markov Model.
At every time point, one or more features, such as frequencies, are computed. The speech signal essentially becomes a sequence of frequency vectors. This sequence is assumed to be an instance of a hidden Markov model (HMM). An HMM can be estimated using multiple sample sequences under the same class (e.g., city name).
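For reference, the joint distribution that an HMM assumes for a sequence of hidden states \(z_1, \ldots, z_T\) (e.g., sub-word units of the spoken city name) and observed feature vectors \(x_1, \ldots, x_T\) is the standard factorization
\[ p(x_1, \ldots, x_T, z_1, \ldots, z_T) = p(z_1)\, p(x_1 \mid z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1})\, p(x_t \mid z_t), \]
and one HMM per class (city name) is estimated from the training sequences of that class.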
5 - DNA Expression Microarray
Our goal here is to identify disease or tissue types based on the gene expression levels.
For each sample taken from a tissue of a particular disease type, the expression levels of a very large collection of genes are measured. The input data goes through a data cleaning process. Data cleaning may include, but is certainly not limited to, normalization, elimination of noise, and perhaps log-scale transformations. A large volume of literature exists on the topic of cleaning microarray data.
In the example considered, 96 samples were taken from 9 classes, or types of tissue. It was expensive to collect the tissue samples, at least in the early days; therefore, the sample size is often small while the dimensionality of the data is very high. Every sample is measured on 4026 genes. Microarray data analysis very often has its own challenges, with a small number of observations and a very large number of features per observation.
6 - DNA Sequence Classification
Each genome is made up of DNA sequences and each DNA segment has specific biological functions. However there are DNA segments which are non-coding, i.e. they do not have any biological function (or their functionalities are not yet known). One problem in DNA sequencing is to label the sampled segments as coding or non-coding (with a biological function or without).
The raw DNA data comprises sequences of letters, e.g., A, C, G, T for each of the DNA sequences. One method of classification assumes the sequences to be realizations of random processes. Different random processes are assumed for different classes of sequences.
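One simple, concrete instance of this idea, chosen here for illustration only (the lesson does not commit to a specific random-process model), is to fit a first-order Markov chain to each class and classify a new segment by comparing log-likelihoods; a sketch in base R on simulated sequences:

```r
bases <- c("A", "C", "G", "T")

# Estimate a first-order transition matrix from a list of training sequences
# (each sequence is a character vector of bases); add-one smoothing avoids zeros
estimate_transitions <- function(seqs) {
  counts <- matrix(1, 4, 4, dimnames = list(bases, bases))
  for (s in seqs) {
    for (i in seq_len(length(s) - 1)) {
      counts[s[i], s[i + 1]] <- counts[s[i], s[i + 1]] + 1
    }
  }
  counts / rowSums(counts)   # each row sums to 1
}

# Log-likelihood of a sequence under a given transition matrix
loglik <- function(s, P) sum(log(P[cbind(s[-length(s)], s[-1])]))

# Simulated stand-ins for labeled training segments (real labels come from biology)
set.seed(1)
coding     <- replicate(20, sample(bases, 100, replace = TRUE,
                                   prob = c(0.2, 0.3, 0.3, 0.2)), simplify = FALSE)
non_coding <- replicate(20, sample(bases, 100, replace = TRUE), simplify = FALSE)

P_coding <- estimate_transitions(coding)
P_non    <- estimate_transitions(non_coding)

# Classify a new segment by the class whose model gives the higher log-likelihood
new_seq <- sample(bases, 100, replace = TRUE, prob = c(0.2, 0.3, 0.3, 0.2))
if (loglik(new_seq, P_coding) > loglik(new_seq, P_non)) "coding" else "non-coding"
```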
In the above examples on classification, several simple and complex real-life problems are considered. Classification problems are faced in a wide range of research areas. The raw data can come in all sizes, shapes, and varieties. A critical step in data mining is to formulate a mathematical problem from a real problem. In this course, the focus is on learning algorithms. The formulation step is largely left out.