4 Exploratory Data Analysis.

(3.12) Exercise:
1. Download the Haberman Cancer Survival dataset from Kaggle. You may have to create a Kaggle account to download the data.
(https://www.kaggle.com/gilsousa/habermans-survival-data-set)
2. Perform a similar analysis as above on this dataset with the following sections:
High-level statistics of the dataset: number of points, number of features, number of classes, data points per class.
Explain our objective.
Perform univariate analysis (PDF, CDF, box plot, violin plots) to understand which features are useful for classification.
Perform bivariate analysis (scatter plots, pair plots) to see if combinations of features are useful in classification.
Write your observations in English as crisply and unambiguously as possible. Always quantify your results.
Haberman’s Cancer Survival Dataset Info:

The dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's
Billings Hospital on the survival of patients who had undergone surgery for breast cancer.
Attribute Information:
Age of patient at the time of operation.
Patient’s year of operation (year - 1900).
Number of positive axillary nodes detected.
Survival status:
1 = the patient survived 5 years or longer
2 = the patient died within 5 years
Objective
To analyse the data and predict whether a patient who has undergone surgery for breast cancer survives 5 years or longer.
In [1]: # importing the essential libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
In [40]: #printing first 10 rows of dataframe
haberman = pd.read_csv("haberman.csv")
haberman.head(10)
Out[40]:
age year nodes status
0 30 64 1 1
1 30 62 3 1
2 30 65 0 1
3 31 59 2 1
4 31 65 4 1
5 33 58 10 1
6 33 60 0 1
7 34 59 0 2
8 34 66 9 2
9 34 58 30 1
In [41]: # data-points and features

print(haberman.shape)
(306, 4)
In [42]: #column names in our dataset

print(haberman.columns)
Index(['age', 'year', 'nodes', 'status'], dtype='object')
In [43]: #Count of patients according to survival status

haberman['status'].value_counts()
Out[43]: 1 225
2 81
Name: status, dtype: int64
Observation(s):
1. value_counts() displays the number of data points in each class.

2. Out of 306 patients, 225 (73.5%) survived 5 years or longer and 81 (26.5%) died within 5 years.
3. The dataset is imbalanced: there are roughly 2.8 patients in class 1 for every patient in class 2.
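The imbalance can be quantified as class proportions. The following is a minimal sketch, assuming `status` is coded as in the attribute list above. `class_balance` is a hypothetical helper, and the labels are rebuilt from the counts printed in Out[43] rather than read from the CSV.

```python
import pandas as pd

def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and percentage of each class label."""
    counts = labels.value_counts().sort_index()
    percent = (100 * counts / counts.sum()).round(1)
    return pd.DataFrame({"count": counts, "percent": percent})

# Labels rebuilt from the counts shown in Out[43]: 225 in class 1, 81 in class 2.
status = pd.Series([1] * 225 + [2] * 81, name="status")
balance = class_balance(status)
print(balance)
```

On the real data frame the same helper could be called as `class_balance(haberman['status'])`.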
In [44]: #printing dataframe information
haberman.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
age 306 non-null int64
year 306 non-null int64
nodes 306 non-null int64
status 306 non-null int64
dtypes: int64(4)
memory usage: 9.7 KB
Observation(s):
1. No missing values in dataset.

2. All columns are of integer datatype.
In [45]: #Understanding dataframe data
haberman.describe()
Out[45]:
age year nodes status
count 306.000000 306.000000 306.000000 306.000000
mean 52.457516 62.852941 4.026144 1.264706
std 10.803452 3.249405 7.189654 0.441899
min 30.000000 58.000000 0.000000 1.000000
25% 44.000000 60.000000 0.000000 1.000000
50% 52.000000 63.000000 1.000000 1.000000
75% 60.750000 65.750000 4.000000 2.000000
max 83.000000 69.000000 52.000000 2.000000
Observation(s):
1. count - Total number of values in each column.

2. mean - Mean of total values in respective columns.
3. std - Standard deviation of values in respective columns.
4. min - Minimum value in respective columns.
5. 25% / 50% / 75% - 25th, 50th (median) and 75th percentile values in respective columns.
6. max - Maximum value in respective columns.
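The table above pools both classes. Because the goal is to tell status 1 from status 2, the same statistics are often easier to interpret per class. This is a minimal sketch under the column names shown earlier. `summarize_by_status` is a hypothetical helper, and the demo frame contains only the ten rows displayed in Out[40], so its numbers are illustrative and are not results for the full dataset.

```python
import pandas as pd

def summarize_by_status(df: pd.DataFrame, feature: str) -> pd.DataFrame:
    """Count, mean and median of `feature` for each survival status."""
    return df.groupby("status")[feature].agg(["count", "mean", "median"])

# The ten rows displayed by haberman.head(10); NOT the full dataset.
sample = pd.DataFrame({
    "age":    [30, 30, 30, 31, 31, 33, 33, 34, 34, 34],
    "year":   [64, 62, 65, 59, 65, 58, 60, 59, 66, 58],
    "nodes":  [1, 3, 0, 2, 4, 10, 0, 0, 9, 30],
    "status": [1, 1, 1, 1, 1, 1, 1, 2, 2, 1],
})
nodes_summary = summarize_by_status(sample, "nodes")
print(nodes_summary)
```

With the full data, `summarize_by_status(haberman, 'nodes')` would give the per-class counterpart of the `nodes` column in the table above.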
2-D Scatter Plot (Bi-variate analysis)

Scatter plot for age and status relation analysis
In [46]: #2-D scatter plot:
x = haberman.plot(kind='scatter', x='age', y='status') ;

x.set_title('2-D scatter plot')
plt.show()
In [48]: sns.set_style("whitegrid");
sns.FacetGrid(haberman, hue="status", height=4) \
.map(plt.scatter, "age", "status") \
.add_legend();
plt.suptitle('2D Scatter plot(colored)')
plt.show();
Observation(s):
1. Patients younger than 40 appear slightly more likely to survive 5 years or longer.
2. For patients older than 40, survival status shows no clear dependence on age in this plot.
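Observations like these are easier to check when they are expressed as a survival rate per age band. The following is a sketch only: `survival_rate_by_age` and the band edges (30, 40, 60, 90) are assumptions chosen for illustration, and the demo uses just the ten rows shown in Out[40], all of which fall in the youngest band.

```python
import pandas as pd

def survival_rate_by_age(df: pd.DataFrame, edges=(30, 40, 60, 90)) -> pd.DataFrame:
    """Number of patients and share with status == 1 in each age band [lo, hi)."""
    bands = pd.cut(df["age"], bins=list(edges), right=False)
    survived = df["status"].eq(1)
    out = survived.groupby(bands, observed=True).agg(["count", "mean"])
    return out.rename(columns={"count": "patients", "mean": "survival_rate"})

# The ten rows displayed by haberman.head(10); NOT the full dataset.
sample = pd.DataFrame({
    "age":    [30, 30, 30, 31, 31, 33, 33, 34, 34, 34],
    "status": [1, 1, 1, 1, 1, 1, 1, 2, 2, 1],
})
rates = survival_rate_by_age(sample)
print(rates)
```

Applied to the full `haberman` frame, the same call would return one row per age band, which lets the "younger than 40" claim be stated as a number.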
Pair-plot
Pair plot for bivariate analysis
In [50]: # pairwise scatter plot: Pair-Plot

plt.close();
sns.set_style("whitegrid");
sns.pairplot(haberman, hue="status", height=3, vars=["age", "year", "nodes"]);
plt.suptitle('Pair plots')
plt.show()
Observation(s):
1. Major overlap is observed; the patients who did not survive 5 years are mostly in the age range 45-65.
Histogram, PDF, CDF (Univariate analysis)

In [26]: sns.FacetGrid(haberman,hue='status',height = 5) \
.map(sns.distplot,'age') \
.add_legend();
plt.suptitle('PDF of age');
plt.show();
Observation(s):
1. The two distributions mostly overlap, so survival is largely independent of age. However, patients aged 30 to 40 appear to have a better chance of survival than patients older than 40.
2. Age alone cannot decide survival status.
In [ ]: sns.FacetGrid(haberman,hue='status',height = 5) \
.map(sns.distplot,'year') \
.add_legend();
plt.suptitle('PDF of year');
plt.show();
Observation(s):
1. Major overlapping is observed.

2. Operation year alone cannot decide survival chances.
3. In 1960 and 1965 the status 2 curve spikes, suggesting that a larger share of the patients operated on in those years died within 5 years.
In [ ]: sns.FacetGrid(haberman,hue='status',height = 5) \
.map(sns.distplot,'nodes') \
.add_legend();
plt.suptitle('PDF of nodes');
plt.show();
Observation(s):
1. Patients with 0 or 1 node have a higher chance of survival.
2. Patients with more than 25 nodes have a very low chance of survival.
PDF for dataset
In [31]: # PDF and CDF of nodes for each class, estimated from a 10-bin histogram
status_1 = haberman[haberman['status']==1]
counts_1, bin_edges_1 = np.histogram(status_1['nodes'], bins=10,
density = True)
pdf_1 = counts_1/(sum(counts_1))  # normalise so that the bin masses sum to 1
print(pdf_1);
print(bin_edges_1);
cdf_1 = np.cumsum(pdf_1)
plt.plot(bin_edges_1[1:], pdf_1, label='PDF (survived)');
plt.plot(bin_edges_1[1:], cdf_1, label='CDF (survived)')
status_2 = haberman[haberman['status']==2]
counts_2, bin_edges_2 = np.histogram(status_2['nodes'], bins=10,
density = True)
pdf_2 = counts_2/(sum(counts_2))  # same normalisation for the second class
print(pdf_2);
print(bin_edges_2);
cdf_2 = np.cumsum(pdf_2)
plt.plot(bin_edges_2[1:], pdf_2, label='PDF (not survived)');
plt.plot(bin_edges_2[1:], cdf_2, label='CDF (not survived)')
plt.xlabel('nodes')
plt.suptitle('PDF and CDF of nodes')
plt.legend()
plt.show()
[0.83555556 0.08 0.02222222 0.02666667 0.01777778 0.00444444
0.00888889 0. 0. 0.00444444]
[ 0. 4.6 9.2 13.8 18.4 23. 27.6 32.2 36.8 41.4 46. ]
[0.56790123 0.14814815 0.13580247 0.04938272 0.07407407 0.
0.01234568 0. 0. 0.01234568]
[ 0. 5.2 10.4 15.6 20.8 26. 31.2 36.4 41.6 46.8 52. ]
Observation(s):
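1. About 83.6% of the patients who survived have nodes in the range 0 to 4.6 (the first histogram bin), while about 56.8% of the patients who died within 5 years have nodes in the range 0 to 5.2.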
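The bin edge 4.6 is an artefact of splitting the range into 10 equal bins; since node counts are integers, "0 to 4.6" means 0 to 4 nodes. The share of patients at or below any threshold can also be computed directly, without binning. The sketch below shows both routes. `fraction_at_or_below` and `binned_cdf` are hypothetical helpers, and the demo values are the node counts of the status 1 rows shown in Out[40], not the full class.

```python
import numpy as np

def fraction_at_or_below(values, threshold):
    """Empirical CDF at `threshold`: share of observations <= threshold."""
    values = np.asarray(values, dtype=float)
    return float(np.mean(values <= threshold))

def binned_cdf(values, bins=10):
    """CDF from a histogram: upper bin edges and cumulative bin masses."""
    counts, edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()  # bin masses that sum to 1
    return edges[1:], np.cumsum(pdf)

# Node counts of the status 1 rows in haberman.head(10); NOT the full class.
survivor_nodes = [1, 3, 0, 2, 4, 10, 0, 30]
share = fraction_at_or_below(survivor_nodes, 4)
upper_edges, cdf = binned_cdf(survivor_nodes)
print(share, cdf[-1])
```

The direct version avoids the dependence on bin width, which makes statements such as "x% of survivors have at most 4 nodes" comparable across the two classes.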
Box plot
In [32]: sns.boxplot(x='status',y='age', data=haberman)
plt.suptitle('Box plot for age');
plt.show()
Box plot 2 (year)
In [33]: sns.boxplot(x='status',y='year', data=haberman)

plt.suptitle('Box plot for year');
plt.show()
Box plot 3(nodes)
In [34]: sns.boxplot(x='status',y='nodes', data=haberman)

plt.suptitle('Box plot for nodes');
plt.show()
Violin plots
Violin plot 1 (Age)
In [35]: sns.violinplot(x="status", y="age", data=haberman)

plt.suptitle('Violin plot for age');
plt.show()
Violin plot 2 (year)
In [36]: sns.violinplot(x="status", y="year", data=haberman)

plt.suptitle('Violin plot for year');
plt.show()
Violin plot 3 (nodes)
In [37]: sns.violinplot(x="status", y="nodes", data=haberman)

plt.suptitle('Violin plot for nodes');
plt.show()
Observations for box and violin plots:
1. Most of the patients who died within 5 years were aged 45 to 65.
2. Age alone cannot decide survival status.
3. The year plots overlap heavily, but roughly speaking a significant number of the patients operated on in 1958 to 1960 and 1963 to 1965 died within 5 years.
4. Patients with survival status 1 have fewer nodes than those with status 2, meaning patients with more nodes have a lower chance of survival.
5. Most of the patients who survived have zero nodes, but many patients with zero nodes also died within 5 years; the absence of nodes does not guarantee survival.
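The numbers a box plot draws can be computed explicitly, which helps when quantifying observations like the ones above. This is a sketch assuming the common 1.5 x IQR whisker rule; `box_stats` is a hypothetical helper, and the demo values are the `nodes` column of the ten rows shown in Out[40], not the full dataset.

```python
import numpy as np

def box_stats(values):
    """Quartiles, IQR, 1.5*IQR fences and the points outside the fences."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = values[(values < low_fence) | (values > high_fence)]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "low_fence": low_fence, "high_fence": high_fence,
            "outliers": outliers.tolist()}

# nodes column of haberman.head(10); NOT the full dataset.
stats = box_stats([1, 3, 0, 2, 4, 10, 0, 0, 9, 30])
print(stats)
```

Calling `box_stats` once per class, for example on `haberman[haberman['status'] == 1]['nodes']`, would give the figures behind each box.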
Contour plot (Multivariate analysis)

Contour plot for the relation between operation year and age
In [38]: sns.jointplot(x="year", y="age", data=haberman, kind="kde");

plt.suptitle('Contour plot year vs age');
plt.show();
Observation(s):
1. From 1958 to 1964, operations were performed mostly on patients aged 45 to 55.
Conclusions:
1. The chance of survival is lower when the number of positive axillary nodes is higher, but the absence of positive axillary nodes does not guarantee survival.
2. Age alone cannot decide the chance of survival, although patients younger than 35 have a better chance of survival.
3. The operation year does not play a major role in deciding the chance of survival.
4. The dataset is imbalanced and the classes overlap on every feature, so survival status cannot be inferred directly from any single feature.
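One practical consequence of the imbalance: a rule that always predicts the majority class already reaches a fairly high accuracy, so any later classifier should be judged against that number. A minimal sketch, using only the class counts printed in Out[43]; `majority_baseline_accuracy` is a hypothetical helper.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

# Class counts from Out[43]: status 1 -> 225 patients, status 2 -> 81 patients.
counts = {1: 225, 2: 81}
baseline = majority_baseline_accuracy(counts)
print(f"{baseline:.3f}")
```

A model built on these features is only informative to the extent that it improves on this baseline.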
