
What is Pandas?

Pandas is a Python library that provides easy-to-use data structures and data-analysis tools.
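As a quick taste of what pandas gives us, here is a minimal sketch (the names and incomes are made-up sample values, not the loan dataset used later in these notes):

```python
import pandas as pd

# A tiny DataFrame built from a dict of columns (hypothetical data).
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Mina"],
    "income": [4200, 3100, 5800],
})

print(df.shape)             # (3, 2) -> 3 rows, 2 columns
print(df["income"].mean())  # average of the income column
```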

What is numpy?

NumPy is a Python package that provides functions for fast mathematical calculations. Its main
strength is array processing.
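A small sketch of what "array processing" means in practice: operations apply element-wise to the whole array at once, with no explicit loop.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

print(a * 2)      # element-wise: [2 4 6 8]
print(a.mean())   # 2.5
print(np.sqrt(a)) # vectorized square root of every element
```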

What are “object” data types in pandas?

They are columns that hold strings or mixed Python objects. Text fields such as Married = Y / N load as "object" by default; pandas also has a separate category dtype that such columns are often converted to.
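A minimal sketch of the difference, using a made-up Y/N column:

```python
import pandas as pd

s = pd.Series(["Y", "N", "Y"])      # strings load as dtype 'object'
print(s.dtype)                      # object

cat = s.astype("category")          # explicit conversion to the category dtype
print(cat.dtype)                    # category
print(cat.cat.categories.tolist())  # the distinct levels: ['N', 'Y']
```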

Some of the pandas functions I started using are listed below:


import pandas as pd
train=pd.read_csv("C:/AnalyticsProjects/LoanPrediction/train_u6lujuX_CVtuZ9i.csv")

# This will read the .csv file and load the data into a DataFrame that can be used for analysis.

train_original=train.copy()

# This will create a copy of the original DataFrame.

train.columns

# This will display all the column names.

train.shape

# This will show the number of rows and columns in the DataFrame.

train['Loan_Status'].value_counts()

# This will display the frequency count of each category in the Loan_Status variable.

Ex: it might print Y 13 and N 10, meaning 13 approved and 10 rejected loans.

On whatever counts we have, we can always chain .plot.bar() to draw a bar graph.

Ex:

train['Loan_Status'].value_counts(normalize=True).plot.bar()

# The above plots a bar chart of the normalized (proportional) counts.

In place of .bar() we can use .box() to get box plot.

More box plots can be obtained using this:

train.boxplot(column='ApplicantIncome', by='Education')

# This will draw one box plot per education qualification.

Gender=pd.crosstab(train['Gender'],train['Loan_Status'])
Above is one of the most important functions for bivariate analysis. crosstab creates a
cross-tabulation (contingency table). If we print it, it shows as below:

print(Gender)

Loan_Status    N    Y
Gender
Female        37   75
Male         150  339
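To see what crosstab returns on its own, here is a self-contained sketch on a tiny made-up frame (the counts above come from the real dataset; these rows are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "Loan_Status": ["Y", "Y", "N", "Y"],
})

# Rows = Gender levels, columns = Loan_Status levels, cells = counts.
ct = pd.crosstab(df["Gender"], df["Loan_Status"])
print(ct.loc["Male", "Y"])  # 2 males with approved loans
```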

train.groupby('Loan_Status')['ApplicantIncome'].mean().plot.bar()
# Above is a very useful pattern. It groups the training data by Loan_Status,
# calculates the mean ApplicantIncome within each group, and plots the means.

# This is how we drop all rows that contain null / empty values

df=train.dropna()

#This is how we create a new column

train['Total_Income']=train['ApplicantIncome']+train['CoapplicantIncome']

#This is how we create bins

bins=[0,2500,4000,6000,81000]

group=['low','medium','high','very high']

# We create one more column by assigning a group label to each Total_Income

train['Total_Income_bin']=pd.cut(train['Total_Income'],bins,labels=group)
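To make the binning concrete, here is a small sketch with made-up incomes, one falling in each bin:

```python
import pandas as pd

incomes = pd.Series([1800, 3200, 5500, 70000])  # hypothetical values
bins = [0, 2500, 4000, 6000, 81000]
labels = ['low', 'medium', 'high', 'very high']

# Each value gets the label of the (lo, hi] interval it falls into.
binned = pd.cut(incomes, bins, labels=labels)
print(binned.tolist())  # ['low', 'medium', 'high', 'very high']
```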

# If we need to build a crosstab between this new column and Loan_Status, we can do this:
IncomeBin_LoanStatus=pd.crosstab(train['Total_Income_bin'],train['Loan_Status'])

# Dividing each row by its row sum turns counts into proportions before plotting:
IncomeBin_LoanStatus.div(IncomeBin_LoanStatus.sum(1).astype(float),
axis=0).plot(kind="bar",stacked=True)

# How to replace some values in the data? Here it is:


train['Dependents'].replace('3+', 3,inplace=True)

# Below is the way to check for null values in each field and sum them up

train.isnull().sum()

# Below is a way to fill nulls with the mode (the most frequent value)

train['Gender'].fillna(train['Gender'].mode()[0],inplace=True)

# How to drop columns from table?

train=train.drop('Loan_ID',axis=1)

# How to replace categorical variables with dummy variables for logistic regression?
X = train.drop('Loan_Status',axis=1) # X contains all columns except Loan_Status
y = train.Loan_Status # y contains only Loan_Status

X=pd.get_dummies(X)
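A quick sketch of what get_dummies does, on a made-up two-row frame: each categorical column is replaced by one 0/1 indicator column per level, while numeric columns pass through unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female"],       # categorical column
    "ApplicantIncome": [5000, 3000],    # numeric column, left as-is
})

dummies = pd.get_dummies(df)
print(dummies.columns.tolist())
# ['ApplicantIncome', 'Gender_Female', 'Gender_Male']
```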

What is sklearn?

scikit-learn (sklearn) is an open-source Python library for building data-analysis and
machine-learning models. It is vast, and learning all of it takes a long time.

Let us use some functions below:

from sklearn.model_selection import train_test_split

x_train, x_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3)

# Here, x_train holds the independent variables of the training data, x_cv holds the
# independent variables of the validation data, and y_train / y_cv hold the dependent
# variable (Loan_Status) for the two splits.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
model = LogisticRegression()
model.fit(x_train, y_train)
# In an interactive session, fit() echoes the estimator with its parameters:
# LogisticRegression(C=1.0, class_weight=None, dual=False,
#     fit_intercept=True, intercept_scaling=1, max_iter=100,
#     multi_class='ovr', n_jobs=1, penalty='l2', random_state=1,
#     solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
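accuracy_score was imported above but not yet used. A self-contained sketch of how the fitted model would be scored on the validation split, using sklearn's make_classification as a synthetic stand-in for the loan data (the real notes use the dataset loaded earlier):

```python
from sklearn.datasets import make_classification  # synthetic stand-in data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 200 synthetic rows with a binary target, playing the role of X and y.
X_demo, y_demo = make_classification(n_samples=200, random_state=1)
x_tr, x_cv, y_tr, y_cv = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=1)

clf = LogisticRegression()
clf.fit(x_tr, y_tr)

pred_cv = clf.predict(x_cv)             # predictions on the validation rows
print(accuracy_score(y_cv, pred_cv))    # fraction predicted correctly, 0.0 to 1.0
```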
