
What is Pandas?

Pandas is a Python library that provides easy-to-use data structures and data-analysis tools.
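As a quick taste of what pandas gives us, here is a minimal sketch (the names and incomes are made-up sample values, not the loan dataset used later in these notes):

```python
import pandas as pd

# A tiny DataFrame built from a dict of columns (hypothetical data).
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Mina"],
    "income": [4200, 3100, 5800],
})

print(df.shape)             # (3, 2) -> 3 rows, 2 columns
print(df["income"].mean())  # average of the income column
```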

What is numpy?

NumPy is a Python package that provides functions for fast mathematical calculations. Its main
strength is array processing.
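A small sketch of what "array processing" means in practice: operations apply element-wise to the whole array at once, with no explicit loop.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

print(a * 2)      # element-wise: [2 4 6 8]
print(a.mean())   # 2.5
print(np.sqrt(a)) # vectorized square root of every element
```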

What are “object” data types in pandas?

They are columns that hold strings or mixed Python objects. Text fields such as Married = Y / N load as "object" by default; pandas also has a separate category dtype that such columns are often converted to.
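A minimal sketch of the difference, using a made-up Y/N column:

```python
import pandas as pd

s = pd.Series(["Y", "N", "Y"])      # strings load as dtype 'object'
print(s.dtype)                      # object

cat = s.astype("category")          # explicit conversion to the category dtype
print(cat.dtype)                    # category
print(cat.cat.categories.tolist())  # the distinct levels: ['N', 'Y']
```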

Some of the pandas functions I started using are listed below:


import pandas as pd
train=pd.read_csv("C:/AnalyticsProjects/LoanPrediction/train_u6lujuX_CVtuZ9i.csv")

# This will read the .csv file and load the data into a DataFrame that can be used for analysis.

train_original=train.copy()

# This will create a copy of the original DataFrame.

train.columns

# This will display all the column names.

train.shape

# This will show the number of rows and columns in the DataFrame.

train['Loan_Status'].value_counts()

# This will display the frequency count of each category in the Loan_Status variable.

Ex: it might print Y 13 and N 10, meaning 13 approved and 10 rejected loans.

On whatever counts we have, we can always chain .plot.bar() to draw a bar graph.

Ex:

train['Loan_Status'].value_counts(normalize=True).plot.bar()

# The above plots a bar chart of the normalized (proportional) counts.

In place of .bar() we can use .box() to get box plot.

More box plots can be obtained using this:

train.boxplot(column='ApplicantIncome', by='Education')

# This will draw one box plot per education qualification.

Gender=pd.crosstab(train['Gender'],train['Loan_Status'])
Above is one of the most important functions for bivariate analysis. crosstab creates a
cross-tabulation (contingency table). If we print it, it shows as below:

print(Gender)

Loan_Status    N    Y
Gender
Female        37   75
Male         150  339
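To see what crosstab returns on its own, here is a self-contained sketch on a tiny made-up frame (the counts above come from the real dataset; these rows are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "Loan_Status": ["Y", "Y", "N", "Y"],
})

# Rows = Gender levels, columns = Loan_Status levels, cells = counts.
ct = pd.crosstab(df["Gender"], df["Loan_Status"])
print(ct.loc["Male", "Y"])  # 2 males with approved loans
```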

train.groupby('Loan_Status')['ApplicantIncome'].mean().plot.bar()
# Above is a very useful pattern. It groups the training data by Loan_Status,
# calculates the mean ApplicantIncome within each group, and plots the means.

# This is how we drop all rows that contain null / empty values

df=train.dropna()

#This is how we create a new column

train['Total_Income']=train['ApplicantIncome']+train['CoapplicantIncome']

#This is how we create bins

bins=[0,2500,4000,6000,81000]

group=['low','medium','high','very high']

# We create one more column by assigning a group label to each Total_Income

train['Total_Income_bin']=pd.cut(train['Total_Income'],bins,labels=group)
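To make the binning concrete, here is a small sketch with made-up incomes, one falling in each bin:

```python
import pandas as pd

incomes = pd.Series([1800, 3200, 5500, 70000])  # hypothetical values
bins = [0, 2500, 4000, 6000, 81000]
labels = ['low', 'medium', 'high', 'very high']

# Each value gets the label of the (lo, hi] interval it falls into.
binned = pd.cut(incomes, bins, labels=labels)
print(binned.tolist())  # ['low', 'medium', 'high', 'very high']
```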

# If we need to build a crosstab between this new column and Loan_Status, we can do this:
IncomeBin_LoanStatus=pd.crosstab(train['Total_Income_bin'],train['Loan_Status'])

# Dividing each row by its row sum turns counts into proportions before plotting:
IncomeBin_LoanStatus.div(IncomeBin_LoanStatus.sum(1).astype(float),
axis=0).plot(kind="bar",stacked=True)

# How to replace some values in the data? Here it is:


train['Dependents'].replace('3+', 3,inplace=True)

# Below is the way to check for null values in each field and sum them up

train.isnull().sum()

# Below is a way to fill nulls with the mode (the most frequent value)

train['Gender'].fillna(train['Gender'].mode()[0],inplace=True)

# How to drop columns from table?

train=train.drop('Loan_ID',axis=1)

# How to replace categorical variables with dummy variables for logistic regression?
X = train.drop('Loan_Status',axis=1) # X contains all columns except Loan_Status
y = train.Loan_Status # y contains only Loan_Status

X=pd.get_dummies(X)
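A quick sketch of what get_dummies does, on a made-up two-row frame: each categorical column is replaced by one 0/1 indicator column per level, while numeric columns pass through unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female"],       # categorical column
    "ApplicantIncome": [5000, 3000],    # numeric column, left as-is
})

dummies = pd.get_dummies(df)
print(dummies.columns.tolist())
# ['ApplicantIncome', 'Gender_Female', 'Gender_Male']
```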

What is sklearn?

scikit-learn (sklearn) is an open-source Python library for building data-analysis and
machine-learning models. It is vast, and learning all of it takes a long time.

Let us use some functions below:

from sklearn.model_selection import train_test_split

x_train, x_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3)

# Here, x_train holds the independent variables of the training data, x_cv holds the
# independent variables of the validation data, and y_train / y_cv hold the dependent
# variable (Loan_Status) for the two splits.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
model = LogisticRegression()
model.fit(x_train, y_train)
# In an interactive session, fit() echoes the estimator with its parameters:
# LogisticRegression(C=1.0, class_weight=None, dual=False,
#     fit_intercept=True, intercept_scaling=1, max_iter=100,
#     multi_class='ovr', n_jobs=1, penalty='l2', random_state=1,
#     solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
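accuracy_score was imported above but not yet used. A self-contained sketch of how the fitted model would be scored on the validation split, using sklearn's make_classification as a synthetic stand-in for the loan data (the real notes use the dataset loaded earlier):

```python
from sklearn.datasets import make_classification  # synthetic stand-in data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 200 synthetic rows with a binary target, playing the role of X and y.
X_demo, y_demo = make_classification(n_samples=200, random_state=1)
x_tr, x_cv, y_tr, y_cv = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=1)

clf = LogisticRegression()
clf.fit(x_tr, y_tr)

pred_cv = clf.predict(x_cv)             # predictions on the validation rows
print(accuracy_score(y_cv, pred_cv))    # fraction predicted correctly, 0.0 to 1.0
```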
