GCD.1 - Exploratory Data Analysis (EDA) and Data Pre-processing

Before any sophisticated analysis, the first step is exploratory data analysis (EDA) and data cleaning. Since the data set includes both categorical and continuous variables, appropriate tables and summary statistics are provided for each type.

  Sample R code for creating marginal proportional tables

# Build the 8-way table of proportions once (assumes the data frame columns are attached)
tab <- prop.table(table(Duration.in.Current.address, Most.valuable.available.asset, Concurrent.Credits, No.of.Credits.at.this.Bank, Occupation, No.of.dependents, Telephone, Foreign.Worker))
# Print the one-way marginal proportions for each of the 8 variables in turn
for (k in 1:8) print(margin.table(tab, k))

The proportions of applicants in each level of each categorical variable are shown in the table below. Pink shading indicates levels with too few observations; these levels are merged for the final analysis.

Predictor (Categorical): Levels (Proportions)
Account Balance: No Account (27.4%); None (26.9%); Below 200 DM (6.3%); 200 DM or Above (39.4%)
Payment Status: Delayed (4.0%); Other Credits (4.9%); Paid Up (53.0%); No Problem with Current Credits (8.8%); Previous Credits Paid (29.3%)
Savings/Stock Value: None (60.3%); Below 100 DM (10.3%); [100, 500) (6.3%); [500, 1000) (4.8%); Above 1000 (18.3%)
Length of Current Employment: Unemployed (6.2%); <1 Year (17.2%); [1, 4) (33.9%); [4, 7) (17.4%); Above 7 (25.3%)
Installments %: Above 35% (13.6%); (25%, 35%) (23.1%); [20%, 25%) (15.7%); Below 20% (47.6%)
Occupation: Unemployed, unskilled (2.2%); Unskilled Permanent Resident (20.0%); Skilled (63.0%); Executive (14.8%)
Sex and Marital Status: Male, Divorced (5.0%); Male, Single (31.0%); Male, Married/Widowed (54.8%); Female (9.2%)
Duration in Current Address: <1 Year (13.0%); [1, 4) (30.8%); [4, 7) (14.9%); Above 7 (41.3%)
Type of Apartment: Free (17.9%); Rented (71.4%); Owned (10.7%)
Most Valuable Asset: None (28.2%); Car (23.2%); Life Insurance (33.2%); Real Estate (15.4%)
No. of Credits at Bank: 1 (63.3%); 2 or 3 (33.3%); 4 or 5 (2.8%); Above 6 (0.6%)
Guarantor: None (90.7%); Co-applicant (4.1%); Guarantor (5.2%)
Concurrent Credits: Other Banks (13.9%); Dept. Store (4.7%); None (81.4%)
No. of Dependents: 3 or More (84.5%); Less than 3 (15.5%)
Telephone: Yes (40.4%); No (59.6%)
Foreign Worker: Yes (3.7%); No (96.3%)
Purpose of Credit: New Car (10.3%); Used Car (18.1%); Furniture (28%); Radio/TV (1.2%); Appliances (2.2%); Repair (5.0%); Vacation (0.9%); Retraining (9.7%); Business (1.2%); Other (23.4%)

Since most of the predictors are categorical with several levels, the full cross-classification of all variables will lead to zero observations in many cells. Hence we need to reduce the table size. For details of variable names and classification see Appendix 1.
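To get a feel for how sparse the full cross-classification would be, one can multiply the number of levels of each categorical predictor in the one-way table above. The following is a minimal sketch in R; the level counts are read off that table, and the vector name n_levels and its element names are illustrative only.

```r
# Number of levels per categorical predictor, as listed in the one-way table
n_levels <- c(AccountBalance = 4, PaymentStatus = 5, Savings = 5,
              Employment = 5, Installment = 4, Occupation = 4,
              SexMarital = 4, AddressDuration = 4, Apartment = 3,
              Asset = 4, CreditsAtBank = 4, Guarantor = 3,
              ConcurrentCredits = 3, Dependents = 2, Telephone = 2,
              ForeignWorker = 2, Purpose = 10)

n_cells <- prod(n_levels)   # cells in the full cross-classification
n_cells                     # total number of cells
n_cells / 1000              # cells per applicant in a sample of 1000
```

The product is in the billions, so with only 1000 applicants almost every cell is necessarily empty.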

Based on the cell proportions in the one-way table above, two or more cells are merged for several categorical predictors. The final classification for the predictors that may influence Creditability is presented below.

  • Account Balance: No account (1), None (No balance) (2), Some Balance (3)
  • Payment Status: Some Problems (1), Paid Up (2), No Problems (in this bank) (3)
  • Savings/Stock Value: None, Below 100 DM, [100, 1000) DM, Above 1000 DM
  • Employment Length: Below 1 year (including unemployed), [1, 4), [4, 7), Above 7
  • Sex/Marital Status: Male Divorced/Single, Male Married/Widowed, Female
  • No of Credits at this bank: 1, More than 1
  • Guarantor: None, Yes
  • Concurrent Credits: Other Banks or Dept Stores, None
  • Foreign Worker: this variable may be dropped from the study
  • Purpose of Credit: New car, Used car, Home Related, Other
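Merging levels as listed above can be done by reassigning factor levels. The sketch below illustrates this for Account Balance on a small made-up vector; it assumes the four original levels are coded 1 to 4 in the order of the one-way table, which the text does not state explicitly.

```r
# Made-up codes standing in for the four original Account Balance levels
account_balance <- factor(c(1, 2, 3, 4, 4, 2, 1, 3))

# Collapse levels 3 and 4 into a single "Some Balance" level coded 3
levels(account_balance) <- list("1" = "1", "2" = "2", "3" = c("3", "4"))

table(account_balance)               # counts per merged level
prop.table(table(account_balance))   # proportions per merged level
```

The same pattern applies to the other predictors: each element of the list names a new level and lists the old levels that feed into it.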

Cross-tabulation of the 9 predictors as defined above with Creditability is shown below. The proportions shown in the cells are column proportions, and so are the marginal proportions. For example, 30% of the 1000 applicants have no account and another 30% have no balance, while 40% have some balance in their account. Among those who have no account, 135 are found to be Creditable and 139 are found to be Non-Creditable. In the group with no balance in their account, 40% were found to be Non-Creditable, whereas in the group having some balance only 1% are found to be Non-Creditable.

  Sample R code for creating K1 x K2 contingency table.

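library(gmodels) # CrossTable() is provided by the gmodels package
# Show column proportions only and add a chi-square test of independence
CrossTable(Creditability, Account.Balance, digits=1, prop.r=FALSE, prop.t=FALSE, prop.chisq=FALSE, chisq=TRUE)
CrossTable(Creditability, Payment.Status.of.Previous.Credit, digits=1, prop.r=FALSE, prop.t=FALSE, prop.chisq=FALSE, chisq=TRUE)
CrossTable(Creditability, Purpose, digits=1, prop.r=FALSE, prop.t=FALSE, prop.chisq=FALSE, chisq=TRUE)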
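The column proportions and the chi-square test can also be obtained with base R alone. The sketch below uses hypothetical counts chosen purely for illustration (they are not the German credit figures) to show what a column proportion means.

```r
# Hypothetical 2 x 2 counts, for illustration only
tab <- as.table(matrix(c(30, 20, 10, 40), nrow = 2,
                       dimnames = list(Creditability = c("0", "1"),
                                       Account.Balance = c("1", "2"))))

col_prop <- prop.table(tab, margin = 2)  # divide each cell by its column total
col_prop                                 # each column sums to 1
chisq.test(tab)                          # Pearson chi-square test of independence
```

Reading down a column of col_prop gives the split between Non-Creditable and Creditable applicants within that level of the predictor.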

R output

Summary for the continuous variables:

  Sample R code for Descriptive Statistics.

attach(German.Credit) # Once the data frame is attached, its columns can be referenced directly by name
# Use the column name exactly as it appears in names(German.Credit)
summary(Duration.of.Credit.Month.) # Prints min, quartiles, median, mean and max for this variable
brksCredit <- seq(0, 80, 10) # Bin edges every 10 months, covering the observed range
hist(Duration.of.Credit.Month., breaks=brksCredit, xlab = "Credit Month", ylab = "Frequency", main = " ", cex=0.4) # Histogram
boxplot(Duration.of.Credit.Month., bty="n", xlab = "Credit Month", cex=0.4) # Boxplot

Predictors (Continuous)      Min   Q1     Median   Q3     Max     Mean    SD
Duration of Credit (Month)   4     12     18       24     72      20.9    12.06
Amount of Credit (DM)        250   1366   2320     3972   18420   3271    2822.75
Age (of Applicant)           19    27     33       42     75      35.54   11.35
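Note that summary() does not report the standard deviation, so the SD column needs a separate call to sd(). A minimal sketch of a helper that returns one row of the table above follows; the function name describe and the input values are hypothetical.

```r
# Hypothetical helper: one row of the summary table for a numeric vector
describe <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  c(Min = min(x), Q1 = q[1], Median = q[2], Q3 = q[3],
    Max = max(x), Mean = mean(x), SD = sd(x))
}

x <- c(6, 12, 12, 18, 24, 24, 36, 48)  # made-up durations in months
round(describe(x), 2)
```

Applied to each continuous column of the data frame in turn, such a helper would reproduce the layout of the table.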

 

Distribution of the continuous variables:

plots of german credit data

All three variables show marked positive skewness. The boxplots bear this out even more clearly.

plots of german credit data
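Positive skewness can also be checked numerically. In the summary table the mean exceeds the median for all three variables (20.9 vs 18, 3271 vs 2320, 35.54 vs 33), which is typical of a long right tail. The sketch below computes a moment-based skewness coefficient, one common definition among several, on made-up values; the function name skewness is defined here and is not part of base R.

```r
# Moment-based skewness: third central moment over the 1.5 power of the second
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

x <- c(250, 900, 1400, 2300, 4000, 9000, 18000)  # made-up credit amounts
skewness(x)           # positive for a long right tail
mean(x) > median(x)   # the right tail pulls the mean above the median
```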

To prepare predictors for building a logistic regression model, we consider the bivariate association of the response (Creditability) with the categorical predictors.
