Assignment 2 1646112
Decision Tree:
The Python code for the decision tree portion was written for the bank data, which contains several fields such as age, balance, job, education and deposit.
The image below shows the data output for the three questions above:
After cleaning the data and removing all unknown cells, 2675 observations remain.
The image below shows the output of the code after data cleaning:
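The cleaning step described above could be sketched as follows. This is a minimal illustration, not the assignment's actual code: it assumes the data is held in a pandas DataFrame and that missing entries are stored as the literal string "unknown". The sample rows and column values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real bank data has more columns and rows.
df = pd.DataFrame({
    "age": [34, 51, 29, 42],
    "job": ["management", "unknown", "student", "retired"],
    "education": ["tertiary", "secondary", "unknown", "primary"],
    "balance": [1200, 300, 50, 2700],
})

# Assumption: unknown cells are encoded as the string "unknown".
# Convert them to NaN, then drop every row that contains a NaN.
cleaned = df.replace("unknown", np.nan).dropna().reset_index(drop=True)

# Number of observations left after cleaning.
print(len(cleaned))
```

On the real dataset, the printed count would correspond to the number of observations reported above.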
After data cleaning, dummy variables are created for the columns Marital, Education, Housing, Loan,
Contact, Poutcome and Deposit, such that the number of dummy variables for each column is one
less than its number of categories. For example, Marital has three possible values: divorced, single
and married. Therefore, two dummy variables are created for the Marital categorical variable.
This step brings the dataset to a total of 20 columns. A sample of the dataset is shown
in the output below:
When a categorical variable with k possible values is encoded as dummy variables, only (k-1)
dummy variables are needed to define it completely, because the omitted category is implied
when all the others are zero. Therefore, one dummy variable is dropped for each column.
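The (k-1) encoding can be illustrated with pandas, whose `get_dummies` function accepts `drop_first=True` to drop one dummy per encoded column. This is a sketch on a hypothetical three-row subset of columns; the assignment's own code may differ.

```python
import pandas as pd

# Hypothetical sample with two categorical columns and one numeric column.
df = pd.DataFrame({
    "marital": ["married", "single", "divorced", "single"],
    "housing": ["yes", "no", "yes", "no"],
    "balance": [1200, 300, 50, 2700],
})

# drop_first=True keeps k-1 dummies per column: the first category
# (in sorted order) becomes the implicit baseline.
encoded = pd.get_dummies(df, columns=["marital", "housing"], drop_first=True)

print(list(encoded.columns))
```

Here Marital (three categories) yields two dummy columns and Housing (two categories) yields one, with "divorced" and "no" acting as the baselines.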
The decision tree for the bank dataset will look like this:
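[Figure: decision tree sketch. The root node splits on Balance? with branches Low, Medium and High (>$2500). Lower nodes are Job? (branches Management, Student, Retired), Loan? (branches Yes, No) and Payment Days? (branches <100, >100). One branch is labelled Any.]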
The data is then split into training and test sets, with 30% of the data held out for testing the model.
The decision tree is then built using the sklearn DecisionTreeClassifier.
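A minimal sketch of the split and fit, using scikit-learn's `train_test_split` and `DecisionTreeClassifier`. The feature matrix and target here are synthetic stand-ins (the real inputs would be the encoded bank columns), and the `random_state` values are assumptions added only for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 observations, 5 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # hypothetical binary target

# Hold out 30% of the observations for testing, train on the other 70%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the tree on the training portion and predict on the held-out portion.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```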
The decision tree model is trained on 70% of the data, using the columns age, marital, education,
balance, housing, loan, contact, day, month, duration, campaign, pdays, previous, poutcome and
deposit. It then returns the confusion matrix, which shows the number of observations for which
the model's prediction matched the actual value and the number for which it differed.
From these predictions the classification report is generated, giving the accuracy, precision, recall,
f1-score and support values for the model.
The image below shows the confusion matrix and the classification report:
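In scikit-learn these outputs come from `confusion_matrix` and `classification_report` in `sklearn.metrics`. To show how the reported figures relate to the confusion matrix, the sketch below computes them by hand for a binary target. The labels are hypothetical and are not the assignment's results.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and summary metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],  # rows: actual, columns: predicted
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical actual and predicted labels for eight observations.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The diagonal of the matrix counts the observations where prediction and actual value agree; the off-diagonal entries count the disagreements.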
For the neural network portion, the following variables are used to predict default: education,
job, balance, loan, deposit and housing. The neural network used in the code is as follows:
[Figure: neural network diagram. Input layer: Education, Job, Balance, Loan, Deposit, Housing. Hidden layer: three nodes (1, 2, 3). Output layer: Default.]
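The architecture above (six inputs, one hidden layer of three nodes, one output) can be sketched as a forward pass in NumPy. This is illustrative only: the weights are random rather than trained, the sigmoid activation is an assumption, and the input values are hypothetical. The source does not state which library or activation the assignment used.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation; an assumed choice, not confirmed by the source."""
    return 1.0 / (1.0 + np.exp(-z))

# Random, untrained weights for a 6 -> 3 -> 1 network.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(6, 3))  # input layer to hidden layer
b1 = np.zeros(3)
W2 = rng.normal(size=(3, 1))  # hidden layer to output layer
b2 = np.zeros(1)

def forward(x):
    """Return the predicted probability of default for each row of x."""
    hidden = sigmoid(x @ W1 + b1)      # three hidden-node activations
    return sigmoid(hidden @ W2 + b2)   # single output node

# One hypothetical observation, already encoded and scaled:
# education, job, balance, loan, deposit, housing.
x = np.array([[1.0, 0.0, 0.35, 0.0, 1.0, 1.0]])
p_default = forward(x)
print(p_default.shape)
```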
The accuracy of the neural network is slightly lower than that of the decision tree. The likely reason is
that the neural network code uses only a limited number of variables to predict default, whereas the
decision tree considers all the variables.