Pandas is a Python library that provides easy-to-use data structures and data analysis tools.
What is NumPy?
NumPy is a Python package that provides functions for numerical computation. Its main
strength is array processing.
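As a small illustration of array processing, the sketch below applies arithmetic to a whole array at once. The income values are made up for the example.

```python
import numpy as np

# Hypothetical incomes, used only for illustration
incomes = np.array([2500, 4000, 6000])

# Arithmetic is applied element-wise to the whole array, no explicit loop needed
doubled = incomes * 2
average = incomes.mean()

print(doubled)
print(average)
```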
# pd.read_csv reads a .csv file and loads the data into a DataFrame that can be used for analysis.
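The read step itself is not shown in these notes. A minimal sketch of what it might look like follows; the inline CSV text and its rows are hypothetical stand-ins for the real training file, and only the column names are taken from the rest of the notes.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real training file
csv_text = """Loan_ID,Gender,ApplicantIncome,Loan_Status
LP001,Male,5849,Y
LP002,Female,4583,N
LP003,Male,3000,Y
"""

# pd.read_csv accepts a file path or any file-like object
train = pd.read_csv(io.StringIO(csv_text))

print(train.shape)            # (rows, columns)
print(list(train.columns))
```

With a real file, the file path would be passed to pd.read_csv in place of the StringIO object.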
train_original = train.copy()  # keep an unmodified copy of the original data
train.columns
train.shape
# train.shape shows the number of rows and columns in this data structure
train['Loan_Status'].value_counts()
Any Series of values, such as the result of value_counts(), can be plotted as a bar graph by appending .plot.bar()
Example:
train['Loan_Status'].value_counts(normalize=True).plot.bar()
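To make the effect of normalize=True concrete, here is a small sketch on made-up loan outcomes. It prints the numbers instead of plotting them.

```python
import pandas as pd

# Hypothetical loan outcomes, used only for illustration
status = pd.Series(['Y', 'N', 'Y', 'Y'], name='Loan_Status')

counts = status.value_counts()                 # absolute count per category
shares = status.value_counts(normalize=True)   # proportions that sum to 1

print(counts)
print(shares)
```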
# This will draw a boxplot with one box for each education qualification
Gender=pd.crosstab(train['Gender'],train['Loan_Status'])
Above is the most important function for bivariate analysis. crosstab creates a cross-tabulation: a table
that counts how often each combination of the two variables occurs. If we print it, it shows as below:
print(Gender)
Loan_Status N Y
Gender
Female 37 75
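To see how crosstab builds such a table, here is a minimal sketch on made-up rows (not the real loan data).

```python
import pandas as pd

# Hypothetical rows, not the real loan data
df = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Male', 'Female', 'Male'],
    'Loan_Status': ['Y', 'Y', 'N', 'N', 'Y'],
})

# Rows are Gender values, columns are Loan_Status values, cells are counts
table = pd.crosstab(df['Gender'], df['Loan_Status'])
print(table)
```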
train.groupby('Loan_Status')['ApplicantIncome'].mean().plot.bar()
# Above is a very useful function. It groups the training data by Loan_Status, calculates the
# mean applicant income within each group, and plots one bar per group.
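The grouping step can be checked on a few made-up rows before plotting. This sketch prints the group means instead of drawing the bar chart.

```python
import pandas as pd

# Hypothetical applicants, used only for illustration
df = pd.DataFrame({
    'Loan_Status': ['Y', 'N', 'Y', 'N'],
    'ApplicantIncome': [4000, 3000, 6000, 5000],
})

# One mean per Loan_Status category
mean_income = df.groupby('Loan_Status')['ApplicantIncome'].mean()
print(mean_income)
```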
# This is how we drop all rows that contain null / empty values
df=train.dropna()
train['Total_Income']=train['ApplicantIncome']+train['CoapplicantIncome']
bins=[0,2500,4000,6000,81000]
group=['low','medium','high','very high']
# We create one more column by assigning a group label to each Total_Income value
train['Total_Income_bin']=pd.cut(train['Total_Income'],bins,labels=group)
# If we need a crosstab between this new column and Loan_Status, we can do this:
IncomeBin_LoanStatus=pd.crosstab(train['Total_Income_bin'],train['Loan_Status'])
IncomeBin_LoanStatus.div(IncomeBin_LoanStatus.sum(axis=1).astype(float),
                         axis=0).plot(kind="bar",stacked=True)
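The binning and row-normalisation steps above can be sketched on made-up incomes. The bins and labels are the ones from the notes; the data rows are hypothetical, and the table is printed instead of plotted.

```python
import pandas as pd

# Hypothetical incomes and outcomes, used only for illustration
df = pd.DataFrame({
    'Total_Income': [2000, 3000, 5000, 7000, 2400, 9000],
    'Loan_Status': ['N', 'Y', 'Y', 'Y', 'Y', 'N'],
})
bins = [0, 2500, 4000, 6000, 81000]
group = ['low', 'medium', 'high', 'very high']

# Each income falls into one interval (left, right] and receives that interval's label
df['Total_Income_bin'] = pd.cut(df['Total_Income'], bins, labels=group)

table = pd.crosstab(df['Total_Income_bin'], df['Loan_Status'])

# Divide each row by its row total so that every row sums to 1
shares = table.div(table.sum(axis=1).astype(float), axis=0)
print(shares)
```

Dividing by the row totals is what makes the stacked bars comparable across income groups of different sizes.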
# Below is the way to count the null values in every column
train.isnull().sum()
# Fill missing Gender values with the most frequent value (the mode)
train['Gender']=train['Gender'].fillna(train['Gender'].mode()[0])
train=train.drop('Loan_ID',axis=1)
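The missing-value steps above can be tried on a tiny made-up frame. The rows are hypothetical; only the column names follow the notes.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with one missing Gender value
df = pd.DataFrame({
    'Loan_ID': ['LP001', 'LP002', 'LP003', 'LP004'],
    'Gender': ['Male', np.nan, 'Male', 'Female'],
})

missing_before = df.isnull().sum()   # number of missing values per column

# mode() returns a Series of the most frequent values; [0] takes the first one
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

# Drop the identifier column, which is a row label and not a feature
df = df.drop('Loan_ID', axis=1)

print(missing_before)
print(df)
```

Assigning the result of fillna back to the column avoids relying on inplace=True on a selected column, which recent pandas versions warn about.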
# How to replace categorical variables with dummy variables for logistic regression?
X = train.drop('Loan_Status',axis=1) # X contains all columns except Loan_Status
y = train.Loan_Status # y contains only Loan_Status
X=pd.get_dummies(X)
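What get_dummies does to a categorical column is easiest to see on a small made-up frame:

```python
import pandas as pd

# Hypothetical features: one categorical column, one numeric column
X = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'ApplicantIncome': [5000, 3000, 4000],
})

# The categorical column is replaced by one indicator column per category;
# numeric columns are left unchanged
X_encoded = pd.get_dummies(X)
print(list(X_encoded.columns))
print(X_encoded)
```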
What is sklearn?
scikit-learn (sklearn) is an open-source Python library for building machine learning models for data
analytics. The library is extensive, and learning all of it takes time.
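from sklearn.model_selection import train_test_split
x_train, x_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3)
Here, x_train holds the independent variables of the training data and x_cv holds the independent
variables of the validation data. y_train is the dependent variable (Loan_Status) for the training
data and y_cv is the dependent variable for the validation data. With test_size=0.3, 30% of the rows
go to the validation set.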
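A minimal sketch of the split on made-up arrays shows the resulting sizes. The data and the random_state value are assumptions added here so that the split is repeatable; they are not part of the original notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 feature columns
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# random_state is an assumption added for reproducibility
x_train, x_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3, random_state=0)

print(x_train.shape, x_cv.shape)
print(y_train.shape, y_cv.shape)
```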
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
model = LogisticRegression()
model.fit(x_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1, max_iter=100,
multi_class='ovr', n_jobs=1, penalty='l2', random_state=1,
solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
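The notes import accuracy_score but do not show it in use. One plausible way to evaluate the fitted model on the validation set is sketched below; the toy data and the name pred_cv are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical one-feature data, used only for illustration
x_train = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
x_cv = np.array([[0.5], [9.5]])
y_cv = np.array([0, 1])

model = LogisticRegression()
model.fit(x_train, y_train)

# Predict on the validation features, then compare with the known labels
pred_cv = model.predict(x_cv)
score = accuracy_score(y_cv, pred_cv)   # fraction of correct predictions
print(score)
```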