Lesson 1(a): Introduction to Data Mining

Overview

With rapid advances in information technology, data generation and data collection capabilities have grown explosively across all domains. In the business world, retailers and e-commerce companies have generated very large databases of commercial transactions. Huge amounts of scientific data have been generated in various fields as well. One case in point is the Human Genome Project, which has aggregated gigabytes of data on the human genetic code. The World Wide Web provides another example, with billions of web pages of textual and multimedia information used by millions of people. Analyzing such huge bodies of data so that they can be understood and used efficiently remains a challenging problem. Data mining addresses this problem by providing techniques and software to automate the analysis and exploration of large and complex data sets. Research on data mining is being pursued in a wide variety of fields, including statistics, computer science, machine learning, database management, and data visualization.

This course on data mining will cover commonly used techniques and applications in this field. Though the focus is on applying the methods using the R software, considerable effort is devoted to developing their mathematical basis. Data mining and learning techniques developed in fields other than statistics, e.g., machine learning and signal processing, are also introduced. After completing the course, students should be able to identify situations in which the techniques apply, employ the techniques to derive results, interpret the results, and understand the limitations, if any, of the final outcome.

Objectives

Upon successful completion of this lesson, you should be able to:

  • Explain the basic concepts of data mining: supervised vs. unsupervised learning with reference to classification, clustering, regression, etc.
  • Recognize that formulating a real-world problem as a statistical learning problem is important, although this course focuses on algorithms for solving problems that have already been formulated.
  • Recognize that a core issue in designing learning algorithms is balancing performance on the training data against robustness on unobserved test data.
  • Give an overview of several major approaches to classification.
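
The trade-off between performance on training data and robustness on unobserved test data can be illustrated with a small sketch. It is written in Python with only the standard library for self-containment, although the course itself uses R. Everything in it is an assumption made for illustration and not part of the lesson: the synthetic data, the 20% label noise, the choice of a k-nearest-neighbor classifier, and the helper names `make_data`, `knn_predict`, and `error_rate`.

```python
import random

def make_data(n, rng, noise=0.2):
    """Draw n points: x uniform on [0, 1], label 1 if x > 0.5,
    with each label flipped with probability `noise` (hypothetical setup)."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x > 0.5 else 0
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

def knn_predict(train, x, k):
    """Classify x by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(y for _, y in neighbors)
    return 1 if 2 * votes > k else 0  # k is odd here, so no ties occur

def error_rate(train, data, k):
    """Fraction of points in `data` misclassified by k-NN fitted on `train`."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in data)
    return wrong / len(data)

rng = random.Random(0)          # fixed seed so runs are repeatable
train = make_data(200, rng)     # data the classifier is fitted on
test = make_data(1000, rng)     # fresh data from the same distribution

for k in (1, 15):
    print(f"k={k}: training error={error_rate(train, train, k):.3f}, "
          f"test error={error_rate(train, test, k):.3f}")
```

With k = 1 every training point is its own nearest neighbor, so the training error is zero by construction, yet the classifier has memorized the noisy labels. Comparing the printed test errors for the two values of k shows why training performance alone is not a reliable guide to performance on unseen data.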