Data mining is a critical step in knowledge discovery. It involves theories, methodologies, and tools for revealing patterns in data. Understanding the rationale behind the methods matters, because the chosen tool must fit both the data and the objective of the analysis. For a given dataset, several tools may be suitable.
When a bank receives a loan application, it must decide, based on the applicant’s profile, whether or not to approve the loan. Two types of risk are associated with the bank’s decision:
- If the applicant is a good credit risk, i.e., is likely to repay the loan, then not approving the loan results in a loss of business to the bank.
- If the applicant is a bad credit risk, i.e., is not likely to repay the loan, then approving the loan results in a financial loss to the bank.
Objective of Analysis:
Minimization of risk and maximization of profit on behalf of the bank.
To minimize loss from the bank’s perspective, the bank needs a decision rule for which applicants to approve and which to reject. Loan managers consider an applicant’s demographic and socio-economic profile before deciding on his/her loan application.
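One way to picture such a decision rule is to weigh the two risks described above by their costs. The sketch below is illustrative only: the cost values (1 and 5), the function name `approve_loan`, and the example probabilities are assumptions, not figures from this case.

```r
# Hypothetical costs (assumptions for illustration, not taken from the case):
# rejecting a good applicant forfeits 1 unit of business,
# approving a bad applicant loses 5 units.
cost_reject_good <- 1
cost_approve_bad <- 5

# Approve only when the expected loss from approving is smaller than
# the expected loss from rejecting.
approve_loan <- function(p_bad) {
  expected_loss_approve <- p_bad * cost_approve_bad
  expected_loss_reject  <- (1 - p_bad) * cost_reject_good
  expected_loss_approve < expected_loss_reject
}

# Hypothetical predicted probabilities of being a bad credit risk
p_bad <- c(0.05, 0.20, 0.60)
decisions <- approve_loan(p_bad)
print(decisions)  # TRUE FALSE FALSE
```

With these assumed costs the rule approves an applicant only when the predicted probability of being a bad risk is below 1/6. A different cost ratio would move that threshold.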
The German Credit Data contains 20 variables for each of 1000 loan applicants, together with a classification of each applicant as a Good or a Bad credit risk. A predictive model developed on this data is expected to guide a bank manager in deciding whether to approve a loan to a prospective applicant based on his/her profile.
Data files for this case:
- German Credit data - german_credit.csv
- Training dataset - Training50.csv
- Test dataset - Test.csv
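The training and test files are supplied with the case, so no split has to be made by hand. For readers who want to see how such a split could be produced, here is a minimal sketch. It assumes a simple random 50/50 split (the name Training50.csv suggests this but does not confirm it) and uses a simulated stand-in data frame with made-up column names rather than the real data.

```r
set.seed(508)  # arbitrary seed so the split is reproducible

# Stand-in for the full dataset: 1000 rows with a made-up binary response.
# The column names and the 0.7 probability are assumptions.
credit <- data.frame(
  id = 1:1000,
  creditability = rbinom(1000, size = 1, prob = 0.7)
)

# Draw half of the row indices at random for training; the rest form the test set
train_rows <- sample(nrow(credit), size = nrow(credit) / 2)
training <- credit[train_rows, ]
test     <- credit[-train_rows, ]

nrow(training)
nrow(test)
```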
The following analytical approaches are taken:
- Logistic regression: The response is binary (Good credit risk or Bad) and several predictors are available.
- Discriminant analysis
- Tree-based methods and random forest
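As a rough illustration of the first approach in the list above, the sketch below fits a logistic regression with base R's `glm`. It runs on simulated data, not the German Credit data: the variable names (`bad`, `duration`, `amount`) and the coefficients used to simulate the response are assumptions chosen only to make the example run.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated stand-in data; the column names are hypothetical
n <- 500
duration <- round(runif(n, 6, 60))        # loan duration in months
amount   <- round(runif(n, 500, 10000))   # loan amount

# Simulate a response in which longer, larger loans are more often Bad (1)
p_bad <- plogis(-3 + 0.04 * duration + 0.0002 * amount)
bad   <- rbinom(n, size = 1, prob = p_bad)
loans <- data.frame(bad, duration, amount)

# Logistic regression for the binary response
fit <- glm(bad ~ duration + amount, data = loans, family = binomial)
print(summary(fit)$coefficients)

# Predicted probability of Bad for a new hypothetical applicant
new_applicant <- data.frame(duration = 24, amount = 3000)
p_hat <- predict(fit, newdata = new_applicant, type = "response")
print(p_hat)
```

The fitted probability could then be compared against a cutoff to classify the applicant as a Good or a Bad credit risk. The choice of cutoff depends on the relative cost of the two kinds of error.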
Sample R code for reading a .csv file:
read.csv("C:/Users/sbasu/Desktop/Stat_508/German Credit", header = TRUE, sep = ",")
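After reading a file it is usually worth checking that it loaded as expected. The sketch below writes a tiny made-up CSV to a temporary file so that it runs anywhere, then reads and inspects it. The column names and values are hypothetical and are not taken from german_credit.csv.

```r
# Write a tiny made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Creditability,Duration,Amount",
  "1,18,1049",
  "0,24,4870",
  "1,12,2096"
), csv_path)

# Read the file and assign the result so it can be used later
credit <- read.csv(csv_path, header = TRUE, sep = ",")

str(credit)                  # column names and types
dim(credit)                  # number of rows and columns
table(credit$Creditability)  # count of applicants in each class
```

Assigning the result of `read.csv` to a name, as done here, keeps the data frame available for later steps. Calling `read.csv` on its own only prints the data.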