
Supervised Machine Learning

(CART, Random Forest and Neural Network Modelling)

- Tejas Rajendra Nayak


Contents

1 PROJECT OBJECTIVE
2 DATABASE OVERVIEW
3 EXPLORATORY DATA ANALYSIS AND DESCRIPTIVE STATISTICS
3.1 KEY OBSERVATIONS
3.2 MISSING VALUES CHECK
3.3 PROPORTION OF RESPONDERS VS NON-RESPONDERS
3.4 SUMMARY STATISTICS OF THE DATASET
4 CREATION OF THE TEST AND TRAIN DATA SETS
5 MODEL BUILDING – CART
5.1 THE CART MODEL OUTPUT
5.2 THE CART MODEL GRAPHICAL OUTPUT
5.3 PERFORMANCE MEASURES ON TRAINING DATA SET
5.4 PERFORMANCE MEASURES ON VALIDATION DATA SET
5.5 CART – CONCLUSION
6 MODEL BUILDING – RANDOM FOREST
6.1 THE INITIAL BUILD & OPTIMAL NO OF TREES
6.2 VARIABLE IMPORTANCE
6.3 OPTIMAL MTRY VALUE
6.4 PERFORMANCE MEASURES ON TRAINING DATA SET
6.5 PERFORMANCE MEASURES ON VALIDATION DATA SET
6.6 RANDOM FOREST – CONCLUSION
7 MODEL BUILDING – ARTIFICIAL NEURAL NETWORK
7.1 CREATION OF DUMMY VARIABLES
7.2 SCALING OF THE VARIABLES
7.3 BUILDING THE ARTIFICIAL NEURAL NETWORK
7.4 THE ARTIFICIAL NEURAL NETWORK – GRAPHICAL REPRESENTATION
7.5 HISTOGRAM OF PROBABILITIES IN TRAINING DATASET
7.6 PERFORMANCE MEASURES ON TRAINING DATA SET
7.7 PERFORMANCE MEASURES ON VALIDATION DATA SET
7.8 ARTIFICIAL NEURAL NETWORK – CONCLUSION
8 MODEL COMPARISON
9 APPENDIX – SAMPLE SOURCE CODE


1 Project Objective
The data provides details from MyBank about a Personal Loan Campaign that was executed by the bank. 20,000 customers were targeted with an offer of a personal loan at a 10% interest rate, out of which 2,512 customers responded positively. The data is to be used to create classification model(s) that predict the response of a new set of customers in the future, based on the attributes available in the data.

We are expected to build the classification models using the following Supervised Machine Learning techniques:
1. Classification and Regression Tree
2. Random Forest
3. Artificial Neural Network

After building the Classification Models we are to compare the three models with each other and
provide our observations.
2 Database overview
Details about the Personal Loan Campaign Dataset are as follows:

Sr No.  Feature Name               Type     Description
1       CUST_ID                    Factor   Unique Customer ID
2       TARGET                     Factor   Target Field - 1: Responder, 0: Non-Responder
3       AGE                        Integer  Age of the customer in years
4       GENDER                     Factor   Gender: Male / Female / Other
5       BALANCE                    Numeric  Monthly Average Balance
6       OCCUPATION                 Factor   Occupation: Professional / Salaried / Self Employed / Self Employed Non Professional
7       AGE_BKT                    Factor   Age Bucket
8       SCR                        Integer  Generic Marketing Score
9       HOLDING_PERIOD             Integer  Ability to hold money in the account (Range 0 - 31)
10      ACC_TYPE                   Factor   Account Type - Saving / Current
11      ACC_OP_DATE                Date     Account Open Date
12      LEN_OF_RLTN_IN_MNTH        Integer  Length of Relationship in Months
13      NO_OF_L_CR_TXNS            Integer  Number of Credit Transactions
14      NO_OF_L_DR_TXNS            Integer  Number of Debit Transactions
15      TOT_NO_OF_L_TXNS           Integer  Total Number of Transactions
16      NO_OF_BR_CSH_WDL_DR_TXNS   Integer  Number of Branch Cash Withdrawal Debit Transactions
17      NO_OF_ATM_DR_TXNS          Integer  Number of ATM Debit Transactions
18      NO_OF_NET_DR_TXNS          Integer  Number of Net Debit Transactions
19      NO_OF_MOB_DR_TXNS          Integer  Number of Mobile Banking Debit Transactions
20      NO_OF_CHQ_DR_TXNS          Integer  Number of Cheque Debit Transactions
21      FLG_HAS_CC                 Integer  Has Credit Card - 1: Yes, 0: No
22      AMT_ATM_DR                 Integer  Amount Withdrawn from ATM
23      AMT_BR_CSH_WDL_DR          Integer  Amount of cash withdrawn from Branch
24      AMT_CHQ_DR                 Integer  Amount debited by Cheque Transactions
25      AMT_NET_DR                 Numeric  Amount debited by Net Transactions
26      AMT_MOB_DR                 Integer  Amount debited by Mobile Banking Transactions
27      AMT_L_DR                   Numeric  Total Amount Debited
28      FLG_HAS_ANY_CHGS           Integer  Has any banking charges
29      AMT_OTH_BK_ATM_USG_CHGS    Integer  Amount charged by way of Other Bank ATM usage
30      AMT_MIN_BAL_NMC_CHGS       Integer  Amount charged by way of Minimum Balance not maintained
31      NO_OF_IW_CHQ_BNC_TXNS      Integer  Number of Inward Cheque Bounce Transactions
32      NO_OF_OW_CHQ_BNC_TXNS      Integer  Number of Outward Cheque Bounce Transactions
33      AVG_AMT_PER_ATM_TXN        Numeric  Average Amount withdrawn per ATM Transaction
34      AVG_AMT_PER_CSH_WDL_TXN    Numeric  Average Amount withdrawn per Cash Withdrawal Transaction
35      AVG_AMT_PER_CHQ_TXN        Numeric  Average Amount debited per Cheque Transaction
36      AVG_AMT_PER_NET_TXN        Numeric  Average Amount per Net Banking Transaction
37      AVG_AMT_PER_MOB_TXN        Numeric  Average Amount per Mobile Transaction
38      FLG_HAS_NOMINEE            Integer  Flag: Has Nominee - 1: Yes, 0: No
39      FLG_HAS_OLD_LOAN           Integer  Flag: Has any earlier loan - 1: Yes, 0: No
40      random                     Numeric  Random Number
3 Exploratory Data Analysis and Descriptive Statistics
3.1 Key observations:
Number of Rows and Columns:
a. The number of rows in the dataset is 20000
b. The number of columns (Features) in the dataset is 40

3.2 Missing values check:


There are no missing values in the dataset.

3.3 Proportion of Responders Vs Non Responders:


a. Total Responder Records: 2512 (12.56%)
b. Total Non-Responder Records: 17,488 (87.44%)

Inference: The Non-Responder class dominates heavily. We may need to balance the dataset to
achieve better classification results.

3.4 Summary Statistics of the dataset:


4 Creation of the TEST and TRAIN Data sets
The provided data set is split into TRAIN and TEST data sets in a 70:30 proportion.

The distribution of the Responder and Non-Responder classes is verified in both data sets and
confirmed to be close to equal.

The below table summarizes the total number of records as well as percentage distribution of
Responder Vs Non Responder Records within each TEST & TRAIN Data-Sets.

DIMENSIONS of the TEST & TRAIN DATA-SETS


Parameter            TEST DATA   TRAIN DATA
Number of Records    6000        14000
No of Features       40          40
% Responders         13.07%      12.34%
% Non-Responders     86.93%      87.66%
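The 70:30 partition above can be reproduced with base R sampling. A minimal sketch, where the data frame name `PL_Data` and its single TARGET column are illustrative stand-ins for the actual campaign data:

```r
set.seed(111)  # fix the seed so the split is reproducible

# Illustrative stand-in for the campaign data: 20000 rows, ~12.56% responders
PL_Data <- data.frame(TARGET = rbinom(20000, 1, 0.1256))

# Send 70% of the row indices to TRAIN and the remaining 30% to TEST
train_idx  <- sample(seq_len(nrow(PL_Data)), size = 0.7 * nrow(PL_Data))
cart.train <- PL_Data[train_idx, ]
cart.test  <- PL_Data[-train_idx, ]

# Verify that the responder proportion is close to equal in both partitions
prop.table(table(cart.train$TARGET))
prop.table(table(cart.test$TARGET))
```

A simple random sample keeps the class proportions approximately equal only by chance; a stratified split (sampling within each TARGET class separately) would guarantee it.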
5 Model Building – CART
Decision Trees are used in data-mining with an objective to create a model that predicts the value of a
target (or dependent variable) based on the values of many inputs (or independent variables).

The Classification and Regression Tree refers to the following decision trees:

a. Classification Trees: where the target variable is categorical and the tree is used to identify the
"class" within which a target variable would likely fall into.
b. Regression Trees: where the target variable is continuous and tree is used to predict its value.

The CART algorithm is structured as a sequence of questions, the answers to which determine what the
next question, if any, should be. The result of these questions is a tree-like structure whose ends are
terminal nodes, at which point there are no more questions.

5.1 The CART Model output

The initial CART model is built by setting up the control parameters as follows:

# Setting the control parameter inputs for rpart
library(rpart)
r.ctrl <- rpart.control(minsplit = 100, minbucket = 10, cp = 0, xval = 10)

# Build the model on the training dataset,
# excluding the Customer ID (col 1) and Account Opening Date (col 11) columns.
# names(cart.train)
M1 <- rpart(formula = TARGET ~ .,
            data = cart.train[, -c(1, 11)],
            method = "class",
            control = r.ctrl)

The Model Outcome is as follows:

n= 14000

node), split, n, loss, yval, (yprob)


* denotes terminal node

1) root 14000 1728 0 (0.87657143 0.12342857)


2) NO_OF_L_DR_TXNS< 5.5 8419 764 0 (0.90925288 0.09074712)
4) FLG_HAS_CC< 0.5 5853 410 0 (0.92995045 0.07004955)
8) NO_OF_L_CR_TXNS< 16.5 5090 298 0 (0.94145383 0.05854617)
16) HOLDING_PERIOD>=12.5 3435 118 0 (0.96564774 0.03435226) *
17) HOLDING_PERIOD< 12.5 1655 180 0 (0.89123867 0.10876133)
34) AMT_ATM_DR< 13350 1389 116 0 (0.91648668 0.08351332) *
35) AMT_ATM_DR>=13350 266 64 0 (0.75939850 0.24060150)
70) AVG_AMT_PER_CSH_WDL_TXN< 897000 248 51 0 (0.79435484 0.20564516)
140) AMT_NET_DR< 821965.5 223 37 0 (0.83408072 0.16591928) *
141) AMT_NET_DR>=821965.5 25 11 1 (0.44000000 0.56000000) *
71) AVG_AMT_PER_CSH_WDL_TXN>=897000 18 5 1 (0.27777778 0.72222222) *
9) NO_OF_L_CR_TXNS>=16.5 763 112 0 (0.85321101 0.14678899)
18) OCCUPATION=SAL 245 13 0 (0.94693878 0.05306122) *
19) OCCUPATION=PROF,SELF-EMP,SENP 518 99 0 (0.80888031 0.19111969)
38) SCR< 553 370 52 0 (0.85945946 0.14054054)
76) BALANCE>=62991.51 268 21 0 (0.92164179 0.07835821) *
77) BALANCE< 62991.51 102 31 0 (0.69607843 0.30392157)
154) SCR>=152.5 78 14 0 (0.82051282 0.17948718) *
155) SCR< 152.5 24 7 1 (0.29166667 0.70833333) *
39) SCR>=553 148 47 0 (0.68243243 0.31756757)
78) NO_OF_BR_CSH_WDL_DR_TXNS< 1.5 94 16 0 (0.82978723 0.17021277) *
79) NO_OF_BR_CSH_WDL_DR_TXNS>=1.5 54 23 1 (0.42592593 0.57407407) *
5) FLG_HAS_CC>=0.5 2566 354 0 (0.86204209 0.13795791)
10) GENDER=F,M 2547 342 0 (0.86572438 0.13427562)
20) AMT_L_DR< 435905 1258 118 0 (0.90620032 0.09379968) *
21) AMT_L_DR>=435905 1289 224 0 (0.82622188 0.17377812)
42) BALANCE>=82341.66 947 136 0 (0.85638860 0.14361140)
84) NO_OF_L_CR_TXNS< 10.5 697 77 0 (0.88952654 0.11047346) *
85) NO_OF_L_CR_TXNS>=10.5 250 59 0 (0.76400000 0.23600000)
170) HOLDING_PERIOD>=15.5 235 46 0 (0.80425532 0.19574468)
340) ACC_TYPE=CA 105 3 0 (0.97142857 0.02857143) *
341) ACC_TYPE=SA 130 43 0 (0.66923077 0.33076923)
682) BALANCE>=327507.5 77 15 0 (0.80519481 0.19480519) *
683) BALANCE< 327507.5 53 25 1 (0.47169811 0.52830189) *
171) HOLDING_PERIOD< 15.5 15 2 1 (0.13333333 0.86666667) *
43) BALANCE< 82341.66 342 88 0 (0.74269006 0.25730994)
86) NO_OF_ATM_DR_TXNS< 1.5 309 66 0 (0.78640777 0.21359223)
172) AMT_L_DR>=466834.5 297 56 0 (0.81144781 0.18855219) *
173) AMT_L_DR< 466834.5 12 2 1 (0.16666667 0.83333333) *
87) NO_OF_ATM_DR_TXNS>=1.5 33 11 1 (0.33333333 0.66666667) *
11) GENDER=O 19 7 1 (0.36842105 0.63157895) *
3) NO_OF_L_DR_TXNS>=5.5 5581 964 0 (0.82727110 0.17272890)
6) NO_OF_L_CR_TXNS< 48.5 5311 857 0 (0.83863679 0.16136321)
12) NO_OF_L_CR_TXNS>=3.5 5292 838 0 (0.84164777 0.15835223)
24) OCCUPATION=PROF,SAL,SENP 4347 595 0 (0.86312399 0.13687601)
48) SCR< 535.5 2840 311 0 (0.89049296 0.10950704)
96) AMT_ATM_DR< 68300 2822 299 0 (0.89404678 0.10595322)
192) FLG_HAS_CC< 0.5 2009 173 0 (0.91388751 0.08611249)
384) BALANCE>=263.835 1997 166 0 (0.91687531 0.08312469) *
385) BALANCE< 263.835 12 5 1 (0.41666667 0.58333333) *
193) FLG_HAS_CC>=0.5 813 126 0 (0.84501845 0.15498155)
386) AVG_AMT_PER_MOB_TXN< 126207.5 688 79 0 (0.88517442 0.11482558) *
387) AVG_AMT_PER_MOB_TXN>=126207.5 125 47 0 (0.62400000 0.37600000)
774) AMT_NET_DR>=281042 50 7 0 (0.86000000 0.14000000) *
775) AMT_NET_DR< 281042 75 35 1 (0.46666667 0.53333333) *
97) AMT_ATM_DR>=68300 18 6 1 (0.33333333 0.66666667) *
49) SCR>=535.5 1507 284 0 (0.81154612 0.18845388)
98) AVG_AMT_PER_MOB_TXN< 146359 1380 237 0 (0.82826087 0.17173913)
196) LEN_OF_RLTN_IN_MNTH>=44.5 1270 200 0 (0.84251969 0.15748031)
392) AVG_AMT_PER_NET_TXN< 971167 1259 193 0 (0.84670373 0.15329627)
784) AMT_BR_CSH_WDL_DR>=3090 1123 156 0 (0.86108638 0.13891362)
1568) AMT_L_DR< 1053490 612 59 0 (0.90359477 0.09640523) *
1569) AMT_L_DR>=1053490 511 97 0 (0.81017613 0.18982387)
3138) SCR>=793 187 17 0 (0.90909091 0.09090909) *
3139) SCR< 793 324 80 0 (0.75308642 0.24691358)
6278) NO_OF_OW_CHQ_BNC_TXNS< 0.5 308 68 0 (0.77922078 0.22077922)
12556) AMT_L_DR>=1071220 298 60 0 (0.79865772 0.20134228)
25112) AVG_AMT_PER_ATM_TXN< 22975 285 52 0 (0.81754386 0.18245614)
50224) LEN_OF_RLTN_IN_MNTH>=182.5 68 2 0 (0.97058824 0.02941176) *
50225) LEN_OF_RLTN_IN_MNTH< 182.5 217 50 0 (0.76958525 0.23041475)
100450) LEN_OF_RLTN_IN_MNTH< 72 55 1 0 (0.98181818 0.01818182) *
100451) LEN_OF_RLTN_IN_MNTH>=72 162 49 0 (0.69753086 0.30246914)
200902) LEN_OF_RLTN_IN_MNTH>=75.5 151 41 0 (0.72847682 0.27152318) *
200903) LEN_OF_RLTN_IN_MNTH< 75.5 11 3 1 (0.27272727 0.72727273) *
25113) AVG_AMT_PER_ATM_TXN>=22975 13 5 1 (0.38461538 0.61538462) *
12557) AMT_L_DR< 1071220 10 2 1 (0.20000000 0.80000000) *
6279) NO_OF_OW_CHQ_BNC_TXNS>=0.5 16 4 1 (0.25000000 0.75000000) *
785) AMT_BR_CSH_WDL_DR< 3090 136 37 0 (0.72794118 0.27205882)
1570) AVG_AMT_PER_ATM_TXN< 15450 109 20 0 (0.81651376 0.18348624) *
1571) AVG_AMT_PER_ATM_TXN>=15450 27 10 1 (0.37037037 0.62962963) *
393) AVG_AMT_PER_NET_TXN>=971167 11 4 1 (0.36363636 0.63636364) *
197) LEN_OF_RLTN_IN_MNTH< 44.5 110 37 0 (0.66363636 0.33636364)
394) AVG_AMT_PER_NET_TXN>=77545.62 58 9 0 (0.84482759 0.15517241) *
395) AVG_AMT_PER_NET_TXN< 77545.62 52 24 1 (0.46153846 0.53846154) *
99) AVG_AMT_PER_MOB_TXN>=146359 127 47 0 (0.62992126 0.37007874) *
25) OCCUPATION=SELF-EMP 945 243 0 (0.74285714 0.25714286)
50) BALANCE>=43360.9 704 148 0 (0.78977273 0.21022727)
100) HOLDING_PERIOD>=2.5 614 111 0 (0.81921824 0.18078176)
200) AVG_AMT_PER_NET_TXN< 893360.5 583 96 0 (0.83533448 0.16466552)
400) NO_OF_L_CR_TXNS>=5.5 566 86 0 (0.84805654 0.15194346)
800) NO_OF_L_CR_TXNS< 16.5 421 47 0 (0.88836105 0.11163895) *
801) NO_OF_L_CR_TXNS>=16.5 145 39 0 (0.73103448 0.26896552)
1602) AMT_BR_CSH_WDL_DR< 846825 119 24 0 (0.79831933 0.20168067) *
1603) AMT_BR_CSH_WDL_DR>=846825 26 11 1 (0.42307692 0.57692308) *
401) NO_OF_L_CR_TXNS< 5.5 17 7 1 (0.41176471 0.58823529) *
201) AVG_AMT_PER_NET_TXN>=893360.5 31 15 0 (0.51612903 0.48387097) *
101) HOLDING_PERIOD< 2.5 90 37 0 (0.58888889 0.41111111) *
51) BALANCE< 43360.9 241 95 0 (0.60580913 0.39419087)
102) AGE_BKT=<25,>50,26-30,31-35 101 21 0 (0.79207921 0.20792079)
204) BALANCE< 40070.04 89 11 0 (0.87640449 0.12359551) *
205) BALANCE>=40070.04 12 2 1 (0.16666667 0.83333333) *
103) AGE_BKT=36-40,41-45,46-50 140 66 1 (0.47142857 0.52857143)
206) HOLDING_PERIOD>=13.5 36 8 0 (0.77777778 0.22222222) *
207) HOLDING_PERIOD< 13.5 104 38 1 (0.36538462 0.63461538)
414) NO_OF_L_CR_TXNS< 10.5 31 12 0 (0.61290323 0.38709677) *
415) NO_OF_L_CR_TXNS>=10.5 73 19 1 (0.26027397 0.73972603) *
13) NO_OF_L_CR_TXNS< 3.5 19 0 1 (0.00000000 1.00000000) *
7) NO_OF_L_CR_TXNS>=48.5 270 107 0 (0.60370370 0.39629630)
14) OCCUPATION=PROF,SAL 210 66 0 (0.68571429 0.31428571) *
15) OCCUPATION=SELF-EMP,SENP 60 19 1 (0.31666667 0.68333333) *
5.2 The CART Model Graphical Output:
5.3 Performance Measures on Training Data Set

The following model performance measures will be calculated on the training set to gauge the goodness
of the model:
a. Rank Ordering
b. KS
c. Area Under Curve (AUC)
d. Gini Coefficient
e. Classification Error
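The KS, AUC and Gini figures reported in the following subsections can be computed from the predicted probabilities in base R. A minimal sketch, where `prob` and `y` are assumed to be the model's predicted probabilities and the actual TARGET values:

```r
# KS, AUC and Gini from predicted probabilities, without extra packages.
# prob: predicted probability of response; y: actual class (1 = responder, 0 = non-responder)
ks_auc <- function(prob, y) {
  ord <- order(prob, decreasing = TRUE)
  y   <- y[ord]
  tpr <- cumsum(y) / sum(y)          # cumulative % of responders captured
  fpr <- cumsum(1 - y) / sum(1 - y)  # cumulative % of non-responders captured
  ks  <- max(tpr - fpr)              # maximum separation between the two curves
  auc <- sum(diff(c(0, fpr)) * tpr)  # area under the ROC curve (step sum)
  c(KS = ks, AUC = auc, Gini = 2 * auc - 1)
}

# A perfectly separating score yields KS = 1 and AUC = 1
ks_auc(prob = c(0.9, 0.8, 0.2, 0.1), y = c(1, 1, 0, 0))
```

With tied scores the step sum only approximates the trapezoidal AUC; packages such as ROCR handle ties exactly.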

5.3.1 Model Performance Measure - Rank Ordering

The rank order table is provided below:

Interpretation:
a. The baseline Response Rate is 12.34%, whereas the response rate in the top deciles is above 74%.
b. The KS is 38.81%, which is close to 40%, indicating a good model.

5.3.2 Model Performance Measure – KS and Area under Curve

The KS & AUC values are provided below:

Measures Values
KS 84.59%
Area Under Curve (AUC) 98.87%

An AUC value above 90% indicates excellent performance of the model.

Graphical representation of the Area Under Curve is as follows:


5.3.3 Model Performance Measure – Gini Coefficient

For our given dataset and the model we built, the Gini Coefficient is 52%, indicating a good model.

5.3.4 Model Performance Measure – Confusion Matrix

Confusion Matrix for our given Model is as follows:

predict.class
TARGET 0 1
0 5193 23
1 738 46

Accuracy = (5193 + 46) / (5193 + 46 + 738 + 23) = 87.32%

Classification Error Rate = 1 - Accuracy = 12.68%.

The lower the classification error rate, the higher the model accuracy, resulting in a better model.
The classification error rate could be reduced if more independent variables were available for
modelling.
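The accuracy arithmetic follows directly from the confusion-matrix counts. A base-R sketch using the counts reported in this section:

```r
# Accuracy and classification error from the 2x2 confusion matrix above
# (rows: actual TARGET, columns: predicted class).
cm <- matrix(c(5193,  23,
                738,  46),
             nrow = 2, byrow = TRUE,
             dimnames = list(TARGET = c("0", "1"),
                             predict.class = c("0", "1")))

accuracy   <- sum(diag(cm)) / sum(cm)  # (5193 + 46) / 6000
error.rate <- 1 - accuracy
round(c(accuracy = accuracy, error.rate = error.rate), 4)  # 0.8732, 0.1268
```

`sum(diag(cm))` picks out the correctly classified counts (5193 true non-responders and 46 true responders).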

5.3.5 Model Performance Measure – Summary

The summary of the various model performance measures on our training data set is as below:

Measures                     Values
KS                           84.59%
Area Under Curve (AUC)       98.87%
Gini Coefficient             51.69%
Accuracy                     87.32%
Classification Error Rate    12.68%

The model is observed to perform as per expectations on a majority of the model performance
measures, indicating a good model.

5.4 Performance Measures on Validation Data Set

The following model performance measures will be calculated on the validation data set to gauge the
goodness of the model:
a. Rank Ordering
b. KS
c. Area Under Curve (AUC)
d. Gini Coefficient
e. Classification Error

5.4.1 Model Performance Measure - Rank Ordering

The rank order table is provided below:


Interpretation:
a. The baseline Response Rate is 12.34%, whereas the response rates in the top three deciles are 26%,
11% and 9% respectively.
b. Within the top 3 deciles, the KS works out to 24%, indicating a good model.

5.4.2 Model Performance Measure – KS and Area under Curve

The KS & AUC values are provided below:

Measures Values
KS 70.62%
Area Under Curve (AUC) 90.04%

The AUC value of around 90% indicates good performance of the model.

Graphical representation of the Area Under Curve is as follows:


5.4.3 Model Performance Measure – Gini Coefficient

For our given dataset and the model we built, the Gini Coefficient on testing data set is 54.18%,
indicating a good model.

5.4.4 Model Performance Measure – Confusion Matrix

Confusion Matrix for our given Model on testing dataset is as follows:

predict.class
TARGET 0 1
0 4172 1044
1 256 528

Accuracy = (4172 + 528) / (4172 + 528 + 256 + 1044) = 78.33%

Classification Error Rate = 1 - Accuracy = 21.67%.

5.4.5 Model Performance Measure – Summary

The summary of the various model performance measures on the TEST set are as follows:
Measures                     Values
KS                           70.62%
Area Under Curve (AUC)       90.04%
Gini Coefficient             40.04%
Accuracy                     78.33%
Classification Error Rate    21.67%

The model is observed to perform above expectations on a majority of the model performance
measures, indicating a good model.

5.5 CART – Conclusion

Comparative Summary of the CART Model on the Training and Testing Datasets is as follows:
MEASURES                   TRAIN Data Values   TEST Data Values   % Variances
KS                         84.59%              70.62%             16.51%
Area Under Curve (AUC)     98.87%              90.04%             8.93%
Gini Coefficient           51.69%              40.04%             22.54%
Accuracy                   87.32%              78.33%             10.30%
Classification Error Rate  12.68%              21.67%             -70.90%
We observe that most of the model performance values for the TRAIN & TEST sets differ by more than
the maximum tolerance deviation of +/- 10%. Hence, the model is over-fitting the training data.

The variance in the performance measures between the TRAIN and TEST data sets indicates that the
predictions of our CART model may not generalize well to new data.

6 Model Building – Random Forest


Random Forest is a supervised classification algorithm that, as the name suggests, builds a forest of
decision trees, each grown on a random sample of the data.

In general, the more trees in the forest, the more robust the classifier: a higher number of trees in a
random forest classifier tends to give higher accuracy.

Some advantages of using Random Forest are as follows:

a. The same random forest algorithm can be used for both classification and regression tasks.
b. The random forest classifier can handle missing values.
c. With enough trees in the forest, the random forest classifier is unlikely to over-fit the model.
d. The random forest classifier can also model categorical values.

6.1 The initial build & Optimal No of Trees

The model is built with TARGET as the dependent variable, considering all independent variables except
Customer ID and Account Opening Date.

> Random_Forest <- randomForest(as.factor(TARGET) ~ .,
+                               data = RF_Train[, -c(1, 11)],
+                               ntree = 500, mtry = 7, nodesize = 100,
+                               importance = TRUE)
> print(Random_Forest)

# Call:
#  randomForest(formula = as.factor(TARGET) ~ ., data = RF_Train[, -c(1, 11)],
#               ntree = 500, mtry = 7, nodesize = 100, importance = TRUE)
#                Type of random forest: classification
#                      Number of trees: 500
# No. of variables tried at each split: 7
#
#         OOB estimate of error rate: 12.06%
# Confusion matrix:
#       0  1  class.error
# 0 12270  2 0.0001629726
# 1  1686 42 0.9756944444
Out of Bag Error Rate:

The Random Forest algorithm is a classifier based primarily on two methods - bagging and the random
subspace method.

Suppose we decide to have S trees in our forest. We first create S datasets of the "same size as
original", each created by random resampling of the data with replacement. Each of these datasets is
called a bootstrap dataset.

Due to the "with replacement" option, every dataset can contain duplicate data records and, at the
same time, can be missing several data records from the original dataset. This is called bagging.

To grow each tree, the algorithm uses m (approximately sqrt(M)) randomly selected sub-features out of
the M possible features. This is called the random subspace method.

After creating the S classifier trees, for each record there is a subset of the bootstrap datasets that does
not contain that record, so the trees grown on those datasets never saw it during training. Such held-out
records are called out-of-bag examples; there are n such subsets (one for each data record in the original
dataset T). The OOB classifier for a record is the aggregation of the votes of only those trees.

Out-of-bag estimate for the generalization error is the error rate of the out-of-bag classifier on the
training set (compare it with known yi's).

Out-of-bag (OOB) error, also called out-of-bag estimate, is a method of measuring the prediction error
of random forests, boosted decision trees, and other machine learning models utilizing bootstrap
aggregating to sub-sample data samples used for training.

Out-of-bag estimates help in avoiding the need for an independent validation dataset.

The Out-of-Bag Estimate of Error Rate for our given Random Forest in our case is 12.06%

The graphical output for the OOB estimate of error rate is provided below:
The output in tabular form for the OOB estimate of error rate is provided below:

It is observed that as the number of trees increases, the OOB error rate decreases until it reaches its
minimum of 0.1169 at around the 19th tree. Beyond this point the OOB error does not decrease further
and remains largely steady. Hence, the optimal number of trees is around 19.
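The minimum can also be read off programmatically from the forest's OOB error trace. A sketch on an illustrative error vector; on the fitted model the trace would be `Random_Forest$err.rate[, "OOB"]`:

```r
# Illustrative OOB error rates after growing 1, 2, ... trees
oob <- c(0.2000, 0.1500, 0.1250, 0.1169, 0.1170, 0.1171, 0.1170)

best.ntree <- which.min(oob)  # smallest forest attaining the minimum OOB error
best.ntree                    # 4 in this toy trace
```

`which.min` returns the first index at which the minimum occurs, i.e. the smallest forest that achieves the lowest OOB error.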

6.2 Variable Importance

To understand the important variables in Random Forest, the following measures are generally used:
A. Mean Decrease in Accuracy is based on permutation
• Randomly permuted values of a variable for which importance is to be computed in the OOB
sample
• Compute the Error Rate with permuted values
• Compute decrease in OOB Error rate (Permuted - Not permuted)
• Average the decrease over all the trees
B. Mean Decrease in Gini is computed as “total decrease in node impurities from splitting on the
variable, averaged over all trees”

The variable importance is computed as follows:


# List the importance of the variables.
> impVar <- round(randomForest::importance(Random_Forest), 2)
> impVar[order(impVar[,1], decreasing=TRUE),]
0 1 MeanDecreaseAccuracy MeanDecreaseGini
AMT_L_DR 23.49 -3.49 24.67 39.59
TOT_NO_OF_L_TXNS 21.90 -7.89 23.48 49.71
OCCUPATION 21.64 26.45 29.44 36.28
NO_OF_L_CR_TXNS 21.31 -0.98 23.00 51.98
SCR 20.88 30.83 29.75 59.42
FLG_HAS_CC 19.99 28.04 27.81 23.34
BALANCE 19.50 33.23 34.39 68.85
HOLDING_PERIOD 19.22 22.19 23.36 51.74
NO_OF_L_DR_TXNS 19.16 -7.30 19.79 30.26
AVG_AMT_PER_CSH_WDL_TXN 15.51 4.46 16.95 27.79
AVG_AMT_PER_CHQ_TXN 15.13 -3.05 16.28 25.38
AMT_BR_CSH_WDL_DR 14.81 5.82 16.49 30.81
AMT_CHQ_DR 14.74 -3.77 15.54 26.27
AMT_ATM_DR 13.76 -2.05 14.88 31.99
LEN_OF_RLTN_IN_MNTH 13.67 21.50 21.23 34.32
NO_OF_ATM_DR_TXNS 13.30 -10.40 13.40 13.51
AGE_BKT 12.78 23.06 21.64 30.95
AVG_AMT_PER_ATM_TXN 12.78 -5.71 13.41 30.13
AVG_AMT_PER_NET_TXN 12.09 -0.73 12.97 20.95
GENDER 11.48 11.65 14.66 10.16
AMT_NET_DR 11.32 -0.89 12.01 19.42
AGE 10.98 14.67 17.19 21.75
NO_OF_CHQ_DR_TXNS 10.71 2.54 11.31 11.49
NO_OF_BR_CSH_WDL_DR_TXNS 9.39 7.95 11.67 15.73
AMT_MOB_DR 8.36 0.04 10.04 16.03
AVG_AMT_PER_MOB_TXN 8.28 2.76 9.90 16.34
ACC_TYPE 7.68 -3.59 8.43 4.88
NO_OF_NET_DR_TXNS 6.61 2.70 7.39 7.68
FLG_HAS_ANY_CHGS 6.27 4.46 7.93 3.79
NO_OF_IW_CHQ_BNC_TXNS 4.50 6.22 6.94 3.46
FLG_HAS_OLD_LOAN 4.18 5.51 5.68 1.86
AMT_MIN_BAL_NMC_CHGS 3.37 6.41 6.03 1.92
NO_OF_MOB_DR_TXNS 3.35 -1.58 3.51 2.65
NO_OF_OW_CHQ_BNC_TXNS 3.25 5.67 5.42 2.97
FLG_HAS_NOMINEE 1.62 3.83 2.80 1.57
AMT_OTH_BK_ATM_USG_CHGS -0.78 1.76 0.18 0.49
random -2.13 -1.77 -2.62 15.30

>

6.3 Optimal mtry value

In the random forests literature, the number of variables available for splitting at each tree node is
referred to as the mtry parameter.

The optimum number of variables is obtained using tuneRF function as follows:

# Tuning the Random Forest
> tune_RandomForest <- tuneRF(x = RF_Train[, -c(1, 2, 11)],
+                             y = as.factor(RF_Train$TARGET),
+                             mtryStart  = 6,      # approx. sqrt of total no. of variables
+                             ntreeTry   = 51,
+                             stepFactor = 1.5,
+                             improve    = 0.0001,
+                             trace      = TRUE,
+                             plot       = TRUE,
+                             doBest     = TRUE,
+                             nodesize   = 100,
+                             importance = TRUE)

mtry = 19   OOB error = 11.61%
-0.00308642 1e-04

Output: OOB Error Vs Mtry:

Graphical Representation of the Output of OOB Error vs MTRY as below:

As can be seen, the optimum number of Variables is 13.

6.4 Performance Measures on Training Data Set

The following model performance measures will be calculated on the training set to gauge the goodness
of the model:
a. Rank Ordering
b. KS
c. Area Under Curve (AUC)
d. Gini Coefficient
e. Classification Error

6.4.1 Model Performance Measure - Rank Ordering

Rank Ordering as a model performance measure helps to:

a. Assess the ability of the model to rank customers relative to one another
b. Gauge how well the model separates the Responder class from the Non-Responder class
c. Track the utility of the model on an ongoing basis

The rank order table is provided below:

Interpretation:
a. The baseline Response Rate is 12.34%, whereas the response rates in the top two deciles are 78% and
32% respectively.
b. Within the top 2 deciles, the KS is 81%, indicating an excellent model.

6.4.2 Model Performance Measure – KS and Area under Curve


The KS & AUC values are provided below:

Measures Values
KS 81.01%
Area Under Curve
(AUC) 96.47%

An AUC value of about 96.5% indicates excellent performance of the model.

Graphical representation of the Area Under Curve is as follows:


6.4.3 Model Performance Measure – Gini Coefficient

The Gini Coefficient is the ratio of the area between the line of perfect equality and the observed Lorenz
curve to the area between the line of perfect equality and the line of perfect inequality. The higher the
coefficient, the more unequal the distribution is.

Gini coefficient can be straight away derived from the AUC ROC number.

Gini above 70% is a good model.

For our given dataset and the model we built, the Gini Coefficient is 72.76%, indicating a robust model.
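A common way to derive it is the identity Gini = 2 x AUC - 1; note that decile- or Lorenz-curve-based computations (which some figures in this report appear to use) can give somewhat different values. A minimal sketch:

```r
# Gini coefficient derived from the area under the ROC curve
gini_from_auc <- function(auc) 2 * auc - 1

gini_from_auc(0.75)  # a model with AUC 0.75 has Gini 0.50
gini_from_auc(0.50)  # a random model (AUC 0.5) has Gini 0
```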

6.4.4 Model Performance Measure – Confusion Matrix

A confusion matrix is an N X N matrix, where N is the number of classes being predicted. For the
problem in hand, we have N=2, and hence we get a 2 X 2 matrix.

Few performance parameters we can obtain with the help of confusion matrix are as follows:

a. Accuracy: the proportion of the total number of predictions that were correct.
b. Positive Predictive Value or Precision: the proportion of positive cases that were correctly
identified.
c. Negative Predictive Value: the proportion of negative cases that were correctly identified.
d. Sensitivity or Recall: the proportion of actual positive cases which are correctly identified.
e. Specificity: the proportion of actual negative cases which are correctly identified.
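These five measures can be computed directly from the 2x2 counts. A base-R sketch using the counts from the accuracy calculation in this section:

```r
# Confusion-matrix counts (rows: actual 0/1, columns: predicted 0/1)
cm <- matrix(c(12271,    1,   # actual 0: TN, FP
                1603,  123),  # actual 1: FN, TP
             nrow = 2, byrow = TRUE)
TN <- cm[1, 1]; FP <- cm[1, 2]; FN <- cm[2, 1]; TP <- cm[2, 2]

accuracy    <- (TP + TN) / sum(cm)
precision   <- TP / (TP + FP)  # positive predictive value
npv         <- TN / (TN + FN)  # negative predictive value
sensitivity <- TP / (TP + FN)  # recall
specificity <- TN / (TN + FP)
round(c(accuracy = accuracy, precision = precision, npv = npv,
        sensitivity = sensitivity, specificity = specificity), 4)
```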

Confusion Matrix for our given Model is as follows:

Accuracy = (12271 + 123) / (12271 + 123 + 1603 + 1) = 88.54%.

Classification Error Rate = 1 - Accuracy = 11.46%.

The lower the classification error rate, the higher the model accuracy, resulting in a better model. The
classification error rate could be reduced if more independent variables were available for modelling.

6.4.5 Model Performance Measure – Summary

The summary of the various model performance measures on the training set are as follows:

Measures                     Values
KS                           81.01%
Area Under Curve (AUC)       96.47%
Gini Coefficient             72.76%
Accuracy                     88.54%
Classification Error Rate    11.46%

The model is observed to perform above expectations on a majority of the model performance
measures, indicating a good model.

6.5 Performance Measures on Validation Data Set

The following model performance measures will be calculated on the validation data set to gauge the
goodness of the model:
a. Rank Ordering
b. KS
c. Area Under Curve (AUC)
d. Gini Coefficient
e. Classification Error

6.5.1 Model Performance Measure - Rank Ordering

Interpretation:
- The baseline Response Rate is 12.34%, whereas the response rate in top three deciles is 58%,
34%, 18% respectively.
- With top 3 deciles, the KS is 64%, indicating it to be a good model.

6.5.2 Model Performance Measure – KS and Area under Curve

The KS & AUC values are provided below:

Measures Values
KS 63.70%
Area Under Curve
(AUC) 89.33%

The AUC value around 90% indicates the good performance of the model.

Graphical representation of the Area Under Curve is as follows:


6.5.3 Model Performance Measure – Gini Coefficient

For our given dataset and the model we built, the Gini Coefficient on testing data set is 67.51%,
indicating a great model.

6.5.4 Model Performance Measure – Confusion Matrix

Confusion Matrix for our given Model on TEST dataset is as follows:

Accuracy = (5213 + 37) / (5213 + 37 + 747 + 3) = 87.5%

Classification Error Rate = 1 - Accuracy = 12.5%

6.5.5 Model Performance Measure – Summary

The summary of the various model performance measures on the validation set is as follows:
Measures                     Values
KS                           63.70%
Area Under Curve (AUC)       89.33%
Gini Coefficient             67.51%
Accuracy                     87.50%
Classification Error Rate    12.50%

The model is observed to perform above expectations on a majority of the model performance
measures, indicating a good model.

6.6 Random Forest – Conclusion

Comparative Summary of the Random Forest Model on Training and Testing Dataset is as follows:

MEASURES                   TRAIN Data Values   TEST Data Values   % Variances
KS                         81.01%              63.70%             21.37%
Area Under Curve (AUC)     96.47%              89.33%             7.40%
Gini Coefficient           72.76%              67.51%             7.22%
Accuracy                   88.54%              87.50%             1.17%
Classification Error Rate  11.46%              12.50%             -9.08%

It can be observed that most of the Model Performance values for TRAIN & TEST Data-Sets are within
the maximum tolerance deviation of +/- 10%. Hence, the model is not over-fitting.

At an overall level, the model is observed to perform above expectations on the various model
performance measures, such as a KS value > 60 and a Gini coefficient value > 60.

The strong showing on the model performance measures indicates good predictive capability of the
developed Random Forest model.
7 Model Building – Artificial Neural Network
Artificial neural networks (ANNs) are statistical models directly inspired by, and partially modeled on
biological neural networks. They are capable of modeling and processing nonlinear relationships
between inputs and outputs in parallel.

Artificial neural networks are characterized by containing adaptive weights along paths between
neurons that can be tuned by a learning algorithm that learns from observed data in order to improve
the model. In addition to the learning algorithm itself, one must choose an appropriate cost function.

The cost function is what’s used to learn the optimal solution to the problem being solved. This involves
determining the best values for all of the model parameters, with neuron path adaptive weights being
the primary target, along with algorithm tuning parameters such as the learning rate. It’s usually done
through optimization techniques such as gradient descent or stochastic gradient descent.

These optimization techniques basically try to make the ANN solution be as close as possible to the
optimal solution, which when successful means that the ANN is able to solve the intended problem with
high performance.
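The gradient-descent idea can be illustrated on a one-parameter cost function. A toy sketch of the optimization loop, not the network's actual training procedure:

```r
# Minimise the quadratic cost J(w) = (w - 3)^2 by gradient descent.
cost.grad <- function(w) 2 * (w - 3)  # derivative dJ/dw

w  <- 0    # initial weight
lr <- 0.1  # learning rate
for (step in 1:100) {
  w <- w - lr * cost.grad(w)  # step against the gradient
}
round(w, 4)  # converges to the minimiser w = 3
```

Each iteration moves the weight a small step in the direction that reduces the cost, exactly the update rule a network applies to every adaptive weight.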

7.1 Creation of Dummy variables

Create Dummy Variables for Categorical Variables.

# Create Dummy Variables for Categorical Variables


#
# Gender
Gender_Matrix <- model.matrix(~ GENDER - 1, data = Neural_Input)
Neural_Input <- data.frame(Neural_Input, Gender_Matrix)

#
# Occupation
occ.matrix <- model.matrix(~ OCCUPATION - 1, data = Neural_Input)
Neural_Input <- data.frame(Neural_Input, occ.matrix)

#
# AGE_BKT
AGEBKT.matrix <- model.matrix(~ AGE_BKT - 1, data = Neural_Input)
Neural_Input <- data.frame(Neural_Input, AGEBKT.matrix)

#
# ACC_TYPE
ACCTYP.matrix <- model.matrix(~ ACC_TYPE - 1, data = Neural_Input)
Neural_Input <- data.frame(Neural_Input, ACCTYP.matrix)
#
dim(Neural_Input)
## [1] 20000 56
7.2 Scaling of the variables
#
# Scaling the Dataset and Variables
#
# names(NN.Train_Data)

x <- subset(NN.Train_Data,
select = c("AGE", "BALANCE", "SCR",
"HOLDING_PERIOD", "LEN_OF_RLTN_IN_MNTH",
"NO_OF_L_CR_TXNS", "NO_OF_L_DR_TXNS",
"TOT_NO_OF_L_TXNS", "NO_OF_BR_CSH_WDL_DR_TXNS",
"NO_OF_ATM_DR_TXNS", "NO_OF_NET_DR_TXNS",
"NO_OF_MOB_DR_TXNS", "NO_OF_CHQ_DR_TXNS",
"FLG_HAS_CC", "AMT_ATM_DR", "AMT_BR_CSH_WDL_DR",
"AMT_CHQ_DR", "AMT_NET_DR", "AMT_MOB_DR",
"AMT_L_DR", "FLG_HAS_ANY_CHGS", "AMT_OTH_BK_ATM_USG_CHGS",
"AMT_MIN_BAL_NMC_CHGS", "NO_OF_IW_CHQ_BNC_TXNS",
"NO_OF_OW_CHQ_BNC_TXNS", "AVG_AMT_PER_ATM_TXN",
"AVG_AMT_PER_CSH_WDL_TXN", "AVG_AMT_PER_CHQ_TXN",
"AVG_AMT_PER_NET_TXN", "AVG_AMT_PER_MOB_TXN",
"FLG_HAS_NOMINEE", "FLG_HAS_OLD_LOAN", "random",
"GENDERF", "GENDERM", "GENDERO", "OCCUPATIONPROF",
"OCCUPATIONSAL", "OCCUPATIONSELF.EMP", "OCCUPATIONSENP",
"AGE_BKT.25", "AGE_BKT.50", "AGE_BKT26.30", "AGE_BKT31.35",
"AGE_BKT36.40", "AGE_BKT41.45", "AGE_BKT46.50",
"ACC_TYPECA", "ACC_TYPESA" ) )
#
nn.devscaled <- scale(x)
nn.devscaled <- cbind(NN.Train_Data[2], nn.devscaled)
# names(nn.devscaled)

7.3 Building the Artificial Neural Network
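The network was fitted with the `neuralnet` package (the full call, spelling out all scaled inputs, is in the Appendix). A minimal self-contained sketch with the same tuning choices — one hidden layer of 3 neurons, an SSE cost function, and a sigmoid (non-linear) output — is shown below on simulated data, purely to make the call concrete:

```r
library(neuralnet)

# Minimal sketch of the fit used in the Appendix (same tuning knobs),
# shown on a tiny simulated frame so the call is self-contained.
set.seed(111)
toy <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
toy$TARGET <- as.numeric(toy$x1 + toy$x2 + rnorm(300, sd = 0.3) > 0)

nn_sketch <- neuralnet(TARGET ~ x1 + x2,
                       data = toy,
                       hidden = 3,            # one hidden layer, 3 neurons
                       err.fct = "sse",       # sum-of-squares cost function
                       linear.output = FALSE, # sigmoid applied at the output node
                       threshold = 0.1, stepmax = 2000)
head(nn_sketch$net.result[[1]])               # fitted probabilities
```

In the actual model, the same call is made on `nn.devscaled`, the scaled training frame built in section 7.2.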


7.4 The Artificial Neural Network – Graphical Representation

7.5 Histogram of Probabilities in Training Dataset


7.6 Performance Measures on Training Data Set

The following model performance measures will be calculated on the training set to gauge the goodness
of the model:

a. Rank Ordering
b. KS
c. Area Under Curve (AUC)
d. Gini Coefficient
e. Classification Error

7.6.1 Model Performance Measure - Rank Ordering

The rank order table is provided below:

Interpretation:
- The baseline Response Rate is 12.34%, whereas the response rates in the top three deciles are
39%, 18% and 18% respectively.
- Within the top 4 deciles, the KS is 39%, which is close to the indicator of a well-fitting model.
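KS here is the maximum gap between the cumulative responder and non-responder percentages when customers are ranked by predicted score, as computed in the Appendix ranking code. A compact sketch on simulated scores (illustrative data only, not the report's):

```r
# KS = max |cumulative % responders - cumulative % non-responders| across deciles
set.seed(111)
score  <- runif(1000)
target <- rbinom(1000, 1, score^2)                    # responders concentrate at high scores

decile <- cut(score, quantile(score, 0:10 / 10),
              labels = 10:1, include.lowest = TRUE)   # label 1 = top-scoring decile
tab <- table(decile, target)
tab <- tab[order(as.numeric(rownames(tab))), ]        # best decile first
ks <- max(abs(cumsum(tab[, "1"]) / sum(tab[, "1"]) -
              cumsum(tab[, "0"]) / sum(tab[, "0"]))) * 100
ks
```

The decile at which this maximum occurs is where the model separates responders from non-responders best.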

7.6.2 Model Performance Measure – KS and Area under Curve

The KS & AUC values are provided below:

Measures Values
KS 39.13%
Area Under Curve
(AUC) 72.90%

The AUC value of 72.90% indicates good performance of the model.
Graphical representation of the Area Under Curve (AUC) is as follows:

7.6.3 Model Performance Measure – Gini Coefficient

The Gini Coefficient is the ratio of the area between the line of perfect equality and the observed Lorenz
curve to the area between the line of perfect equality and the line of perfect inequality. The higher the
coefficient, the more unequal the distribution is.

The Gini coefficient can be derived directly from the AUC of the ROC curve.

A Gini above 60% indicates a good model.

For our given dataset and the model we built, the Gini Coefficient is 54.74%, hence there is scope for
improvement.
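Two Gini conventions are in circulation: `2 × AUC − 1` computed from the ROC curve, and the Lorenz-curve Gini of the predicted-score distribution, which is what `ineq::ineq(..., type = "Gini")` in the Appendix returns. The two generally differ, as a small sketch shows (the scores and labels here are simulated, purely for illustration):

```r
library(ineq)

# Simulated scores and responses, for illustration only
set.seed(111)
score  <- runif(1000)
target <- rbinom(1000, 1, score)

# Convention 1: Gini derived from the ROC AUC
# (rank-based AUC: probability a responder outscores a non-responder)
auc <- mean(outer(score[target == 1], score[target == 0], ">"))
gini_from_auc <- 2 * auc - 1

# Convention 2: Lorenz-curve Gini of the score distribution (Appendix method)
gini_lorenz <- ineq(score, type = "Gini")

c(gini_from_auc = gini_from_auc, gini_lorenz = gini_lorenz)
```

The Gini values reported in this section follow the second convention, so they are not simply `2 × AUC − 1` of the AUC values quoted above.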

7.6.4 Model Performance Measure – Confusion Matrix

Confusion Matrix for our given Model is as follows:

## predict.class
## TARGET 0 1
## 0 11981 291
## 1 1296 432

Accuracy = (11981 + 432) / (11981 + 291 + 1296 + 432) = 88.66%

Classification Error Rate = 1 - Accuracy = 11.34%

The lower the classification error rate, the higher the model accuracy, and the better the model.
The classification error rate could be reduced if more independent variables were available for
modeling.
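The arithmetic above generalises to any 2×2 confusion matrix; a small sketch using the training-set counts reported above:

```r
# Accuracy and classification error from a 2x2 confusion matrix
cm <- matrix(c(11981, 1296, 291, 432), nrow = 2,
             dimnames = list(TARGET = c("0", "1"), predict.class = c("0", "1")))

accuracy   <- sum(diag(cm)) / sum(cm)   # (11981 + 432) / 14000
error_rate <- 1 - accuracy

round(c(accuracy = accuracy, error_rate = error_rate), 4)   # 0.8866, 0.1134
```

The diagonal of the matrix holds the correctly classified customers; everything off the diagonal contributes to the error rate.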

7.6.5 Model Performance Measure – Summary

The summary of the various model performance measures on the training set are as follows:

Measures Values
KS 39.13%
Area Under Curve
(AUC) 72.90%
Gini Coefficient 54.74%
Accuracy 88.66%
Classification Error
Rate 11.34%

The model is observed to perform at par with expectations on the majority of the model performance
measures, indicating it to be a good model with scope for improvement.

7.7 Performance Measures on Validation Data Set

The following model performance measures will be calculated on the validation data set to gauge the
goodness of the model:

a. Rank Ordering
b. KS
c. Area Under Curve (AUC)
d. Gini Coefficient
e. Classification Error

7.7.1 Model Performance Measure - Rank Ordering


Interpretation:
- The baseline Response Rate is 12.34%, whereas the response rates in the top three deciles are
38%, 20% and 17% respectively.
- Within the top 4 deciles, the KS is 35%, indicating scope for improvement.

7.7.2 Model Performance Measure – KS and Area under Curve

The KS & AUC values are provided below:

Measures Values
KS 35.45%
Area Under Curve
(AUC) 72.14%

The AUC value of around 72% indicates good performance of the model, with further scope for
improvement.

Graphical representation of the Area Under Curve (AUC) is as follows:


7.7.3 Model Performance Measure – Gini Coefficient

For our given dataset and the model we built, the Gini Coefficient on testing data set is 54.62%.

7.7.4 Model Performance Measure – Confusion Matrix

Confusion Matrix for our given Model on testing dataset is as follows:

Accuracy = (5065 + 145) / (5065 + 151 + 145 + 639) = 86.83%

Classification Error Rate = 1 - Accuracy = 13.17%

7.7.5 Model Performance Measure – Summary

The summary of the various model performance measures on the validation set is as follows:
Measures Values
KS 35.45%
Area Under Curve
(AUC) 72.14%
Gini Coefficient 54.62%
Accuracy 86.83%
Classification Error
Rate 13.17%

The model is observed to perform normally on the majority of the model performance measures,
indicating it to be a good model with scope for improvement.

7.8 Artificial Neural Network – Conclusion

Comparative Summary of the Neural Network Model on TRAIN and TEST Dataset is as follows:

MEASURES                     TRAIN Data   TEST Data   % Variance
KS                           39.13%       35.45%       9.40%
Area Under Curve (AUC)       72.90%       72.14%       1.04%
Gini Coefficient             54.74%       54.62%       0.22%
Accuracy                     88.66%       86.83%       2.07%
Classification Error Rate    11.34%       13.17%     -16.15%

It can be observed that most of the Model Performance values for Training & Testing sets are within the
maximum tolerance deviation of +/- 10%. Hence, the model is not over-fitting.

At an overall level, the model is observed to perform normally on the various model performance
measures, with a KS value close to 40, a Gini coefficient value > 50, etc.

The good scores on these performance measures indicate the good predictive capability of the
developed Neural Network model.
8 Model Comparison
The main objective of the project was to develop a predictive model to predict if MyBank customers will
respond positively to a promotion or an offer using tools of Machine Learning. In this context, the key
parameter for model evaluation was ‘Accuracy’, i.e., the proportion of the total number of predictions
that were correct (i.e. % of the customers that were correctly predicted).

The predictive models were developed using the following Machine Learning techniques:
a. Classification Tree – CART
b. Random Forest
c. Neural Network

A snapshot of the performance of all the models on accuracy, over-fitting and other model
performance measures is provided below:

                             CART                         RANDOM FOREST                ARTIFICIAL NEURAL NETWORK
MEASURES                     TRAIN    TEST     % Var      TRAIN    TEST     % Var      TRAIN    TEST     % Var
KS                           84.59%   70.62%   16.51%     81.01%   63.70%   21.37%     39.13%   35.45%    9.40%
Area Under Curve (AUC)       98.87%   90.04%    8.93%     96.47%   89.33%    7.40%     72.90%   72.14%    1.04%
Gini Coefficient             51.69%   40.04%   22.54%     72.76%   67.51%    7.22%     54.74%   54.62%    0.22%
Accuracy                     87.27%   78.33%   10.24%     88.54%   87.50%    1.17%     88.66%   86.83%    2.07%
Classification Error Rate    12.73%   21.67%  -70.16%     11.46%   12.50%   -9.08%     11.34%   13.17%  -16.15%

Interpretation:
- The CART method performed poorly compared to Random Forest and ANN. The percentage
deviations between the Training and Testing datasets suggest that the model is over-fit.
- The Random Forest method has the best performance among the three models. Its percentage
deviation between the Training and Testing datasets is also reasonably under control,
suggesting a robust model.
- The Neural Network is marginally behind Random Forest in performance, but better than CART.
Its percentage deviation between the Training and Testing datasets is the smallest of the
three models.
- Overall, maximum accuracy on the training data is achieved with the ANN model; the Random
Forest model is marginally behind on the training data but edges out the ANN on the test
data. Hence neither Random Forest nor ANN decisively outperforms the other; however, since
the variation between training and test performance is lowest for the ANN, we suggest it as
the overall winner.
9 Appendix – Sample Source Code
install.packages("lattice")
install.packages("ggplot2")

install.packages("caret")
install.packages("rpart")
install.packages("rpart.plot")
install.packages("randomForest")
install.packages("plotrix")
library(caret)
library(rpart)
library(rpart.plot)
library(randomForest)
library(plotrix)

setwd("C:/Users/tejas/Downloads/BACP/Module 4 - Machine Learning/Week 4 - Mini Project")


getwd()
# Read Input File

PLX_SELL=read.csv("PL_XSELL.csv")
attach(PLX_SELL)

table(TARGET)
prop.table(table(TARGET))

pie3D(prop.table(table(PLX_SELL$TARGET)), main = "Respondent Vs Non Respondent - Input Data set",
      labels = c("Non Resp", "Resp"), col = c("Light Green", "Yellow"))

prop.table(table(GENDER))
prop.table(table(GENDER,TARGET),margin = 1)

#Creating Training and Testing Datasets


set.seed(111)
Train_Index <- createDataPartition(TARGET, p = .7, list = FALSE, times = 1)
Train_Data <- PLX_SELL[Train_Index,]
Test_Data <- PLX_SELL[-Train_Index,]

dim(Train_Data)
dim(Test_Data)

#Check if distribution of partition data is correct


prop.table((table(Train_Data$TARGET)))
#0.8749286 0.1250714

prop.table((table(Test_Data$TARGET)))
#0.8731667 0.1268333

par(mfrow=c(1,2))
pie3D(prop.table((table(Train_Data$TARGET))), main='Resp Vs Non Resp - Training Data set',
#explode=0.1,
labels=c("Non Resp", "Resp"), col = c("Light Blue", "Yellow") )

pie3D(prop.table(table(Test_Data$TARGET)), main = 'Resp Vs Non Resp - Testing Data set',
      #explode=0.1,
      labels = c("Non Resp", "Resp"), col = c("Light Green", "Blue"))

# MODEL BUILDING - CART.


# Setting the control paramter inputs for rpart.
Train_Cart <- Train_Data
Test_Cart <- Test_Data

r.ctrl = rpart.control(minsplit=100, minbucket = 10, cp = 0, xval = 10)

# Build the Model on Training Dataset.


# Exclude Columns Customer ID and Acct Opening Date.

names(Train_Cart)
m1 <- rpart(formula = TARGET ~ ., data = Train_Cart[,-c(1,11)], method = "class",
control = r.ctrl)

m1

install.packages("rattle")
install.packages("RColorBrewer")
library(rattle)
library(RColorBrewer)
fancyRpartPlot(m1)

printcp(m1)
plotcp(m1)

# Pruning the Tree


## Pruning Code
Prune_Tree <- prune(m1, cp = 0.0017)
printcp(Prune_Tree)

fancyRpartPlot(Prune_Tree, uniform=TRUE, main="Pruned Classification Tree")

fancyRpartPlot(Prune_Tree,
uniform = TRUE, main = "Final Tree", palettes = c("Blues", "Oranges"))

# Measure Model Performance

Train_Cart$predict.class <- predict(Prune_Tree, Train_Cart, type="class")


Train_Cart$predict.score <- predict(Prune_Tree, Train_Cart, type="prob")

# head(Train_Cart)

## deciling code
decile <- function(x) {
  deciles <- vector(length = 10)
  for (i in seq(0.1, 1, 0.1)) {
    deciles[i * 10] <- quantile(x, i, na.rm = TRUE)
  }
  return(ifelse(x < deciles[1], 1,
         ifelse(x < deciles[2], 2,
         ifelse(x < deciles[3], 3,
         ifelse(x < deciles[4], 4,
         ifelse(x < deciles[5], 5,
         ifelse(x < deciles[6], 6,
         ifelse(x < deciles[7], 7,
         ifelse(x < deciles[8], 8,
         ifelse(x < deciles[9], 9, 10))))))))))
}
class(Train_Cart$predict.score)

## deciling
Train_Cart$deciles <- decile(Train_Cart$predict.score[,2])

# View(Train_Cart)

## Ranking code
install.packages("data.table")
install.packages("scales")

library(data.table)
library(scales)
temp_data = data.table(Train_Cart)
rank <- temp_data[,
list(cnt = length(TARGET),
cnt_resp = sum(TARGET),
cnt_non_resp = sum(TARGET == 0)),
by=deciles][order(-deciles)]
rank$rrate <- round(rank$cnt_resp / rank$cnt,4);
rank$cum_resp <- cumsum(rank$cnt_resp)
rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),4);
rank$cum_rel_non_resp <- round(rank$cum_non_resp / sum(rank$cnt_non_resp),4);
rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100;
rank$rrate <- percent(rank$rrate)
rank$cum_rel_resp <- percent(rank$cum_rel_resp)
rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)

View(rank)

install.packages("ROCR")
install.packages("ineq")

library(ROCR)
library(ineq)
predict <- prediction(Train_Cart$predict.score[,2], Train_Cart$TARGET)
perform <- performance(predict, "tpr", "fpr")
plot(perform)

KS <- max(attr(perform, 'y.values')[[1]]-attr(perform, 'x.values')[[1]])


auc <- performance(predict,"auc");
auc <- as.numeric(auc@y.values)

gini = ineq(Train_Cart$predict.score[,2], type="Gini")


with(Train_Cart, table(TARGET, predict.class))
auc
KS
gini

View(rank)

## Syntax to get the node path

Tree_Path <- path.rpart(Prune_Tree, nodes = c(2, 12))


nrow(Test_Cart)
## Scoring Holdout sample
Test_Cart$predict.class <- predict(Prune_Tree, Test_Cart, type="class")
Test_Cart$predict.score <- predict(Prune_Tree, Test_Cart, type="prob")

Test_Cart$deciles <- decile(Test_Cart$predict.score[,2])

# View(Test_Cart)

# Ranking code

temp_data = data.table(Test_Cart)

h_rank <- temp_data[, list(cnt = length(TARGET),
                           cnt_resp = sum(TARGET),
                           cnt_non_resp = sum(TARGET == 0)),
                    by=deciles][order(-deciles)]

h_rank$rrate <- round(h_rank$cnt_resp / h_rank$cnt,4);

h_rank$cum_resp <- cumsum(h_rank$cnt_resp)

h_rank$cum_non_resp <- cumsum(h_rank$cnt_non_resp)

h_rank$cum_rel_resp <- round(h_rank$cum_resp / sum(h_rank$cnt_resp),4);

h_rank$cum_rel_non_resp <- round(h_rank$cum_non_resp/sum(h_rank$cnt_non_resp),4);

h_rank$ks <- abs(h_rank$cum_rel_resp - h_rank$cum_rel_non_resp)*100;

h_rank$rrate <- percent(h_rank$rrate)

h_rank$cum_rel_resp <- percent(h_rank$cum_rel_resp)

h_rank$cum_rel_non_resp <- percent(h_rank$cum_rel_non_resp)

View(h_rank)

predict <- prediction(Test_Cart$predict.score[,2], Test_Cart$TARGET)

perform <- performance(predict, "tpr", "fpr")

KS <- max(attr(perform, 'y.values')[[1]]-attr(perform, 'x.values')[[1]])

auc <- performance(predict,"auc");


auc <- as.numeric(auc@y.values)

gini = ineq(Test_Cart$predict.score[,2], type="Gini")

with(Test_Cart, table(TARGET, predict.class))

auc

KS

gini

#=========================================================================
# Oversampling the data (using ROSE Algorithm)
#=========================================================================
install.packages("ROSE")
library(ROSE)

table(Train_Cart$TARGET)

N=12272*2
N

Train_Cart.over <- ovun.sample(TARGET~.,data=Train_Cart, method="over",N=24544)$data

table(Train_Cart.over$TARGET)

#
#=========================================================================
# MODEL BUILDING - CART - With Oversampling
#=========================================================================
# Setting the control parameter inputs for rpart.

r.ctrl = rpart.control (minsplit=100, minbucket = 20, cp = 0, xval = 10)

# Build the Model on Training Dataset.


# Exclude Columns Customer ID and Acct Opening Date.

names(Train_Cart.over)

m1 <- rpart(formula = TARGET ~ .,
            data = Train_Cart.over[,-c(1,11,41,42,43)], method = "class",
            # data = Train_Cart.over[,-c(1)], method = "class",
            control = r.ctrl)

m1

## install.packages("rattle")

## install.packages("RColorBrewer")

library(rattle)

## install.packages

library(RColorBrewer)

# fancyRpartPlot(m1)

printcp(m1)

plotcp(m1)

## Pruning the Tree


## Pruning Code

Prune_Tree <- prune(m1, cp = 0.00051)

printcp(Prune_Tree)

fancyRpartPlot(Prune_Tree, uniform=TRUE, main="Pruned Classification Tree")

fancyRpartPlot(Prune_Tree, uniform = TRUE,
               main = "Final Tree", palettes = c("Blues", "Oranges"))

# Measure Model Performance

Train_Cart.over$predict.class <- predict(Prune_Tree, Train_Cart.over, type="class")

Train_Cart.over$predict.score <- predict(Prune_Tree, Train_Cart.over, type="prob")

# head(Train_Cart.over)

## deciling code
decile <- function(x){
deciles <- vector(length=10)
for (i in seq(0.1,1,.1)){
deciles[i*10] <- quantile(x, i, na.rm=T) }
return (
ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10 )))))))))) }

class(Train_Cart.over$predict.score)

## deciling

Train_Cart.over$deciles <- decile(Train_Cart.over$predict.score[,2])

View(Train_Cart.over)

## Ranking code

##install.packages("data.table")

##install.packages("scales")

library(data.table)

library(scales)

temp_data = data.table(Train_Cart.over)

rank <- temp_data[, list(cnt = length(TARGET),
                         cnt_resp = sum(TARGET),
                         cnt_non_resp = sum(TARGET == 0)),
                  by=deciles][order(-deciles)]

rank$rrate <- round(rank$cnt_resp / rank$cnt,4);

rank$cum_resp <- cumsum(rank$cnt_resp)

rank$cum_non_resp <- cumsum(rank$cnt_non_resp)


rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),4);

rank$cum_rel_non_resp <- round(rank$cum_non_resp / sum(rank$cnt_non_resp),4);

rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100;

rank$rrate <- percent(rank$rrate)

rank$cum_rel_resp <- percent(rank$cum_rel_resp)

rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)

View(rank)

# install.packages("ROCR")
# install.packages("ineq")
library(ROCR)
library(ineq)

predict <- prediction(Train_Cart.over$predict.score[,2], Train_Cart.over$TARGET)

perform <- performance(predict, "tpr", "fpr")

plot(perform)

KS <- max(attr(perform, 'y.values')[[1]]-attr(perform, 'x.values')[[1]])

auc <- performance(predict,"auc");


auc <- as.numeric(auc@y.values)

gini = ineq(Train_Cart.over$predict.score[,2], type="Gini")

with(Train_Cart.over, table(TARGET, predict.class))

auc

KS

gini

View(rank)

## Syntax to get the node path


Tree_Path <- path.rpart(Prune_Tree, node = c(2, 12))
nrow(Test_Cart)

## Scoring Holdout sample

Test_Cart$predict.class <- predict(Prune_Tree, Test_Cart, type="class")


Test_Cart$predict.score <- predict(Prune_Tree, Test_Cart, type="prob")

Test_Cart$deciles <- decile(Test_Cart$predict.score[,2])

# View(Test_Cart)

# Ranking code

temp_data = data.table(Test_Cart)

h_rank <- temp_data[, list(cnt = length(TARGET),
                           cnt_resp = sum(TARGET),
                           cnt_non_resp = sum(TARGET == 0)),
                    by=deciles][order(-deciles)]

h_rank$rrate <- round(h_rank$cnt_resp / h_rank$cnt,4);

h_rank$cum_resp <- cumsum(h_rank$cnt_resp)

h_rank$cum_non_resp <- cumsum(h_rank$cnt_non_resp)

h_rank$cum_rel_resp <- round(h_rank$cum_resp / sum(h_rank$cnt_resp),4);

h_rank$cum_rel_non_resp <- round(h_rank$cum_non_resp / sum(h_rank$cnt_non_resp),4);

h_rank$ks <- abs(h_rank$cum_rel_resp - h_rank$cum_rel_non_resp)*100;

h_rank$rrate <- percent(h_rank$rrate)

h_rank$cum_rel_resp <- percent(h_rank$cum_rel_resp)

h_rank$cum_rel_non_resp <- percent(h_rank$cum_rel_non_resp)

View(h_rank)

predict <- prediction(Test_Cart$predict.score[,2], Test_Cart$TARGET)


perform <- performance(predict, "tpr", "fpr")

KS <- max(attr(perform, 'y.values')[[1]]-attr(perform, 'x.values')[[1]])

auc <- performance(predict,"auc");

auc <- as.numeric(auc@y.values)

gini = ineq(Test_Cart$predict.score[,2], type="Gini")

with(Test_Cart, table(TARGET, predict.class))

auc

KS

gini

##=========================================================================
# MODEL BUILDING - RANDOM FOREST #
#=========================================================================

# install.packages("randomForest")

library(randomForest)

# # Copy datasets for Random_Forest

RF_Train <- Train_Data


RF_Test <- Test_Data

dim(RF_Train)

dim(RF_Test)

names(RF_Train)

str(RF_Train)

# Random Forest
Random_Forest = randomForest( as.factor(TARGET) ~ ., data = RF_Train[, -c(1, 11)],
ntree = 500, mtry = 7, nodesize = 100, importance = TRUE )

print(Random_Forest)

dev.off()

plot(Random_Forest, main="")

legend("topright", c("OOB", "0", "1"), text.col=1:3, lty=1:3, col=1:3)

title(main="Error Rates Random Forest PLX_SELL Training data")

Random_Forest$err.rate

# List the importance of the variables.

impVar <- round(randomForest::importance(Random_Forest), 2)

impVar[order(impVar[,1], decreasing=TRUE),]

# Tuning Random Forest


tune_RandomForest <- tuneRF(x = RF_Train[,-c(1,2,11)], y=as.factor(RF_Train$TARGET),
mtryStart = 6, # Approx. Sqrt of Tot. No of Variables
ntreeTry=51,
stepFactor = 1.5,
improve = 0.0001,
trace=TRUE,
plot = TRUE,
doBest = TRUE,
nodesize = 100,
importance=TRUE)

#
# Scoring syntax

RF_Train$predict.class <- predict(tune_RandomForest, RF_Train, type="class")

RF_Train$predict.score <- predict(tune_RandomForest, RF_Train, type="prob")

# head(RF_Train)
# class(RF_Train$predict.score)

# deciling code
decile <- function(x){ deciles <- vector(length=10)
for (i in seq(0.1,1,.1)){ deciles[i*10] <- quantile(x, i, na.rm=T) }
return ( ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10 )))))))))) }

RF_Train$deciles <- decile(RF_Train$predict.score[,2])

library(data.table)

temp_data = data.table(RF_Train)

rank <- temp_data[, list(cnt = length(TARGET),
                         cnt_resp = sum(TARGET),
                         cnt_non_resp = sum(TARGET == 0)),
                  by=deciles][order(-deciles)]

rank$rrate <- round (rank$cnt_resp / rank$cnt,2);

rank$cum_resp <- cumsum(rank$cnt_resp)

rank$cum_non_resp <- cumsum(rank$cnt_non_resp)

rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),2);

rank$cum_rel_non_resp <- round(rank$cum_non_resp / sum(rank$cnt_non_resp),2);

rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp);

library(scales)

rank$rrate <- percent(rank$rrate)

rank$cum_rel_resp <- percent(rank$cum_rel_resp)


rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)

View(rank)

# rank

# Baseline Response Rate


sum(RF_Train$TARGET) / nrow(RF_Train)

library(ROCR)

predict <- prediction(RF_Train$predict.score[,2], RF_Train$TARGET)

perform <- performance(predict, "tpr", "fpr")


plot(perform)

KS <- max(attr(perform, 'y.values')[[1]]-attr(perform, 'x.values')[[1]])

KS

## Area Under Curve

auc <- performance(predict,"auc");

auc <- as.numeric(auc@y.values)

auc

## Gini Coefficient
library(ineq)
gini = ineq(RF_Train$predict.score[,2], type="Gini")
gini

## Classification Error
with(RF_Train, table(TARGET, predict.class))
#
#=========================================================================
# Model Performance on Testing Data Set
#=========================================================================
# Rank Order
#
# Scoring syntax

RF_Test$predict.class <- predict(tune_RandomForest, RF_Test, type="class")


RF_Test$predict.score <- predict(tune_RandomForest, RF_Test, type="prob")

RF_Test$deciles <- decile(RF_Test$predict.score[,2])

temp_data = data.table(RF_Test)
h_rank <- temp_data[, list( cnt = length(TARGET),
cnt_resp = sum(TARGET),
cnt_non_resp = sum(TARGET == 0)),
by=deciles][order(-deciles)]

h_rank$rrate <- round (h_rank$cnt_resp / h_rank$cnt,2);

h_rank$cum_resp <- cumsum(h_rank$cnt_resp)

h_rank$cum_non_resp <- cumsum(h_rank$cnt_non_resp)

h_rank$cum_rel_resp <- round(h_rank$cum_resp / sum(h_rank$cnt_resp),2);

h_rank$cum_rel_non_resp <- round(h_rank$cum_non_resp / sum(h_rank$cnt_non_resp),2);

h_rank$ks <- abs(h_rank$cum_rel_resp - h_rank$cum_rel_non_resp);

library(scales)

h_rank$rrate <- percent(h_rank$rrate)

h_rank$cum_rel_resp <- percent(h_rank$cum_rel_resp)

h_rank$cum_rel_non_resp <- percent(h_rank$cum_rel_non_resp)

View(h_rank)

h_rank

pred1 <- prediction(RF_Test$predict.score[,2], RF_Test$TARGET)

perf1 <- performance (pred1, "tpr", "fpr")

plot(perf1)

KS1 <- max(attr(perf1, 'y.values')[[1]]-attr(perf1, 'x.values')[[1]])

KS1
## Area Under Curve

auc1 <- performance(pred1,"auc");


auc1 <- as.numeric(auc1@y.values)
auc1

## Gini Coefficient

library(ineq)

gini1 = ineq(RF_Test$predict.score[,2], type="Gini")

gini1

## Classification Error
with(RF_Test, table(TARGET, predict.class))

#
#=========================================================================
# MODEL BUILDING - NEURAL NETWORK
#=========================================================================
install.packages("neuralnet")
library(neuralnet)
#
# Read Input File
# detach(PLX_SELL)

Neural_Input=read.csv("PL_XSELL.csv")

attach(Neural_Input)

dim(Neural_Input)

str(Neural_Input)

#
# Create Dummy Variables for Categorical Variables

#
# Gender

Gender_Matrix <- model.matrix(~GENDER - 1, data = Neural_Input)

Neural_Input <- data.frame(Neural_Input, Gender_Matrix)

#
# Occupation

occ.matrix <- model.matrix(~ OCCUPATION - 1, data = Neural_Input)

Neural_Input <- data.frame(Neural_Input, occ.matrix)

#
# AGE_BKT
AGEBKT.matrix <- model.matrix(~ AGE_BKT - 1, data = Neural_Input)

Neural_Input <- data.frame(Neural_Input, AGEBKT.matrix)

#
# ACC_TYPE
ACCTYP.matrix <- model.matrix(~ ACC_TYPE - 1, data = Neural_Input)
Neural_Input <- data.frame(Neural_Input, ACCTYP.matrix)

names(Neural_Input)
#
dim(Neural_Input)

#Creating Training and Testing Datasets


set.seed(111)
Train_Index <- createDataPartition(TARGET, p = .7, list = FALSE, times = 1)

NN.Train_Data <- Neural_Input[Train_Index,]


NN.Test_Data <- Neural_Input[-Train_Index,]

dim(NN.Train_Data)
dim(NN.Test_Data)

#
# Scaling the Dataset and Variables

#
#names(NN.Train_Data)

x <- subset(NN.Train_Data,
select = c("AGE", "BALANCE", "SCR",
"HOLDING_PERIOD", "LEN_OF_RLTN_IN_MNTH",
"NO_OF_L_CR_TXNS", "NO_OF_L_DR_TXNS",
"TOT_NO_OF_L_TXNS", "NO_OF_BR_CSH_WDL_DR_TXNS",
"NO_OF_ATM_DR_TXNS", "NO_OF_NET_DR_TXNS",
"NO_OF_MOB_DR_TXNS", "NO_OF_CHQ_DR_TXNS",
"FLG_HAS_CC", "AMT_ATM_DR", "AMT_BR_CSH_WDL_DR",
"AMT_CHQ_DR", "AMT_NET_DR", "AMT_MOB_DR",
"AMT_L_DR", "FLG_HAS_ANY_CHGS", "AMT_OTH_BK_ATM_USG_CHGS",
"AMT_MIN_BAL_NMC_CHGS", "NO_OF_IW_CHQ_BNC_TXNS",
"NO_OF_OW_CHQ_BNC_TXNS", "AVG_AMT_PER_ATM_TXN",
"AVG_AMT_PER_CSH_WDL_TXN", "AVG_AMT_PER_CHQ_TXN",
"AVG_AMT_PER_NET_TXN", "AVG_AMT_PER_MOB_TXN",
"FLG_HAS_NOMINEE", "FLG_HAS_OLD_LOAN", "random",
"GENDERF", "GENDERM", "GENDERO", "OCCUPATIONPROF",
"OCCUPATIONSAL", "OCCUPATIONSELF.EMP", "OCCUPATIONSENP",
"AGE_BKT.25", "AGE_BKT.50", "AGE_BKT26.30", "AGE_BKT31.35",
"AGE_BKT36.40", "AGE_BKT41.45", "AGE_BKT46.50",
"ACC_TYPECA", "ACC_TYPESA" ) )

#
nn.devscaled <- scale(x)

nn.devscaled <- cbind(NN.Train_Data[2], nn.devscaled)

# names(nn.devscaled)

nn2 <- neuralnet(formula = TARGET ~ AGE + BALANCE + SCR +
HOLDING_PERIOD + LEN_OF_RLTN_IN_MNTH + NO_OF_L_CR_TXNS +
NO_OF_L_DR_TXNS + TOT_NO_OF_L_TXNS + NO_OF_BR_CSH_WDL_DR_TXNS +
NO_OF_ATM_DR_TXNS + NO_OF_NET_DR_TXNS + NO_OF_MOB_DR_TXNS +
NO_OF_CHQ_DR_TXNS + FLG_HAS_CC + AMT_ATM_DR +
AMT_BR_CSH_WDL_DR + AMT_CHQ_DR + AMT_NET_DR +
AMT_MOB_DR + AMT_L_DR + FLG_HAS_ANY_CHGS +
AMT_OTH_BK_ATM_USG_CHGS + AMT_MIN_BAL_NMC_CHGS +
NO_OF_IW_CHQ_BNC_TXNS + NO_OF_OW_CHQ_BNC_TXNS +
AVG_AMT_PER_ATM_TXN + AVG_AMT_PER_CSH_WDL_TXN +
AVG_AMT_PER_CHQ_TXN + AVG_AMT_PER_NET_TXN +
AVG_AMT_PER_MOB_TXN + FLG_HAS_NOMINEE + FLG_HAS_OLD_LOAN +
random + GENDERF + GENDERM + GENDERO + OCCUPATIONPROF +
OCCUPATIONSAL + OCCUPATIONSELF.EMP + OCCUPATIONSENP +
AGE_BKT.25 + AGE_BKT.50 + AGE_BKT26.30 + AGE_BKT31.35 +
AGE_BKT36.40 + AGE_BKT41.45 + AGE_BKT46.50 +
ACC_TYPECA + ACC_TYPESA ,
data = nn.devscaled, hidden = 3, err.fct = "sse",
linear.output = FALSE, lifesign = "full",
lifesign.step = 10, threshold = 0.1, stepmax = 2000 )

plot (nn2)

# Assigning the Probabilities to Dev Sample

NN.Train_Data$Prob = nn2$net.result[[1]]

# The distribution of the estimated probabilities

quantile(NN.Train_Data$Prob, c(0,1,5,10,25,50,75,90,95,98,99,100)/100)

hist(NN.Train_Data$Prob)

## deciling

NN.Train_Data$deciles <- decile(NN.Train_Data$Prob)

#
# Ranking code

##install.packages("data.table")

library(data.table)

library(scales)

temp_data = data.table(NN.Train_Data)

rank <- temp_data[, list(cnt = length(TARGET),
                         cnt_resp = sum(TARGET),
                         cnt_non_resp = sum(TARGET == 0)),
                  by=deciles][order(-deciles)]

rank$rrate <- round (rank$cnt_resp / rank$cnt,2);

rank$cum_resp <- cumsum(rank$cnt_resp)

rank$cum_non_resp <- cumsum(rank$cnt_non_resp)


rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),2);

rank$cum_rel_non_resp <- round(rank$cum_non_resp / sum(rank$cnt_non_resp),2);

rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp)

library(scales)

rank$rrate <- percent(rank$rrate)

rank$cum_rel_resp <- percent(rank$cum_rel_resp)

rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)

View(rank)

# Assigning 0 / 1 class based on certain threshold

NN.Train_Data$Class = ifelse(NN.Train_Data$Prob>0.5,1,0)
with (NN.Train_Data, table(TARGET, as.factor(Class)))

# We can use the confusionMatrix function of the caret package

# install.packages("caret")
library(caret)
confusionMatrix(as.factor(NN.Train_Data$Class), as.factor(NN.Train_Data$TARGET))

## Error Computation

sum((NN.Train_Data$TARGET - NN.Train_Data$Prob)^2)/2

## Other Model Performance Measures

library(ROCR)

str(NN.Train_Data)
detach(package:neuralnet)
pred3 <- prediction(NN.Train_Data$Prob, NN.Train_Data$TARGET)

perf3 <- performance(pred3, "tpr", "fpr")

plot(perf3)

KS3 <- max(attr(perf3, 'y.values')[[1]]-attr(perf3, 'x.values')[[1]])


KS3

auc3 <- performance(pred3,"auc");

auc3 <- as.numeric(auc3@y.values)

auc3

library(ineq)

gini3 = ineq(NN.Train_Data$Prob, type="Gini")

auc3
KS3
gini3

# Scoring another dataset using the Neural Net Model Object

# To score we will use the compute function


#
x <- subset (NN.Test_Data,
select = c("AGE", "BALANCE", "SCR", "HOLDING_PERIOD",
"LEN_OF_RLTN_IN_MNTH", "NO_OF_L_CR_TXNS", "NO_OF_L_DR_TXNS",
"TOT_NO_OF_L_TXNS", "NO_OF_BR_CSH_WDL_DR_TXNS", "NO_OF_ATM_DR_TXNS",
"NO_OF_NET_DR_TXNS", "NO_OF_MOB_DR_TXNS", "NO_OF_CHQ_DR_TXNS",
"FLG_HAS_CC", "AMT_ATM_DR", "AMT_BR_CSH_WDL_DR", "AMT_CHQ_DR",
"AMT_NET_DR", "AMT_MOB_DR", "AMT_L_DR", "FLG_HAS_ANY_CHGS",
"AMT_OTH_BK_ATM_USG_CHGS", "AMT_MIN_BAL_NMC_CHGS",
"NO_OF_IW_CHQ_BNC_TXNS", "NO_OF_OW_CHQ_BNC_TXNS",
"AVG_AMT_PER_ATM_TXN", "AVG_AMT_PER_CSH_WDL_TXN",
"AVG_AMT_PER_CHQ_TXN", "AVG_AMT_PER_NET_TXN", "AVG_AMT_PER_MOB_TXN",
"FLG_HAS_NOMINEE", "FLG_HAS_OLD_LOAN", "random",
"GENDERF", "GENDERM", "GENDERO", "OCCUPATIONPROF",
"OCCUPATIONSAL", "OCCUPATIONSELF.EMP", "OCCUPATIONSENP",
"AGE_BKT.25", "AGE_BKT.50", "AGE_BKT26.30", "AGE_BKT31.35",
"AGE_BKT36.40", "AGE_BKT41.45", "AGE_BKT46.50",
"ACC_TYPECA", "ACC_TYPESA") )

x.scaled <- scale(x)

# install.packages("neuralnet")  # installed earlier; re-attach below
library(neuralnet)
compute.output = compute(nn2, x.scaled)

# compute.output
NN.Test_Data$Predict.score = compute.output$net.result

# View(NN.Test_Data)

names(NN.Train_Data)

quantile(NN.Test_Data$Predict.score,
c(0,1,5,10,25,50,75,90,95,99,100)/100)

NN.Test_Data$deciles <- decile(NN.Test_Data$Predict.score)

library(data.table)

temp_data = data.table(NN.Test_Data)

h_rank <- temp_data[, list(cnt = length(TARGET),
                           cnt_resp = sum(TARGET),
                           cnt_non_resp = sum(TARGET == 0)),
                    by=deciles][order(-deciles)]

h_rank$rrate <- round (h_rank$cnt_resp / h_rank$cnt,2);

h_rank$cum_resp <- cumsum(h_rank$cnt_resp)

h_rank$cum_non_resp <- cumsum(h_rank$cnt_non_resp)

h_rank$cum_rel_resp <- round(h_rank$cum_resp / sum(h_rank$cnt_resp),2);

h_rank$cum_rel_non_resp <- round(h_rank$cum_non_resp / sum(h_rank$cnt_non_resp),2);

h_rank$ks <- abs(h_rank$cum_rel_resp - h_rank$cum_rel_non_resp);

# library(scales)

h_rank$rrate <- percent(h_rank$rrate)

h_rank$cum_rel_resp <- percent(h_rank$cum_rel_resp)

h_rank$cum_rel_non_resp <- percent(h_rank$cum_rel_non_resp)

View(h_rank)
#
#===============================================================
# Working on Holdout Sample
#===============================================================
# Assigning the Probabilities to Holdout Sample
NN.Test_Data$Prob = compute.output$net.result

# Assigning 0 / 1 class based on certain threshold

NN.Test_Data$Class = ifelse(NN.Test_Data$Prob>0.5,1,0)
with( NN.Test_Data, table(TARGET, as.factor(Class) ))

# We can use the confusionMatrix function of the caret package

#install.packages("caret")

library(caret)

confusionMatrix(as.factor(NN.Test_Data$Class), as.factor(NN.Test_Data$TARGET))

## Error Computation
sum((NN.Test_Data$TARGET - NN.Test_Data$Prob)^2)/2

## Other Model Performance Measures

library(ROCR)

# str(NN.Test_Data)
detach(package:neuralnet)
pred4 <- prediction(NN.Test_Data$Prob, NN.Test_Data$TARGET)

perf4 <- performance(pred4, "tpr", "fpr")

plot(perf4)

KS4 <- max(attr(perf4, 'y.values')[[1]]-attr(perf4, 'x.values')[[1]])

KS4

auc4 <- performance(pred4,"auc");

auc4 <- as.numeric(auc4@y.values)


auc4
library(ineq)
gini4 = ineq(NN.Test_Data$Prob, type="Gini")

auc4
KS4
gini4
