Eda Notes

EDA
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to

summarize their main characteristics, often with visual methods.
A statistical model can be used or not, but primarily EDA is for seeing what the data can tell
us beyond the formal modeling or hypothesis testing task.
Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore
the data, and possibly formulate hypotheses that could lead to new data collection and
experiments.
NOTE: To find the total no of rows which are missing in a column

cars.isnull().sum()
Problem
Analyze CARS.CSV file.
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# import cars dataset - this is downloaded from kaggle

# here dataframe by the name 'cars' is created
cars = pd.read_csv("f:/eda/cars.csv")
# display top 5 records

cars.head()
# display which column has how many missing values (or null values)
cars.isnull().sum()
# show how many rows and cols in the dataset

cars.shape
# dataset information
cars.info()
# remove unimportant cols: MSRP, Invoice
cars = cars.drop(['MSRP', 'Invoice'], axis=1)
# remove duplicate rows if any

# keep the first row and remove other duplicate rows of that row
cars = cars.drop_duplicates(keep='first')
# remove the rows having missing values (in Cylinders)

cars.dropna(inplace=True)
# shape of the dataset -- no. of rows, cols

cars.shape
# summary statistics
# if std is 0, that colum should be removed from analysis
cars.describe()
# remove any space in column names

cars.columns = cars.columns.str.replace(' ', '')
# sort the data w.r.t a column -- here we sort on 'MPG_City'

cars_sort = cars.sort_values(by='MPG_City', ascending=False)
cars_sort.head()
'''
iloc[] -> gives integer location based indexing / selection
cars.iloc[0] -> gives 0th row
cars.iloc[-1] -> gives last row
cars.iloc[0:5] -> select 0 to 4th rows
cars.iloc[:, 0] -> select all rows and 0th col
cars.iloc[:, 0:5] -> select all rows and cols from 0 to 4th
cars.iloc[0,2,4], [1,3,5]] -> select 0,2,4 rows with 1,3,5 cols
'''
# select first 10 rows in MPG_City column
x = cars.iloc[0:10, 8]
x
# draw histogram for MPG_City in the intervals of 10

num_bins = 10
plt.hist(cars['MPG_City'], num_bins, color='green', edgecolor='black')
# probability density function -- represents values in 0 to 1 range

# use seaborn - give a tuple to the cars array.
sns.distplot(cars[('MPG_City')], bins=10, color='green')
# draw histogram for Length in the intervals of 10

num_bins = 10
plt.hist(cars['Length'], num_bins)
# probability density function for Length

sns.distplot(cars[('Length')], bins=10)
'''
To select only numeric type of columns, we can give their names or datatypes as:
cars[['EngineSize','Cylinders','Horsepower','MPG_City','MPG_Highway','Weight','Wheelbase'
,'Length']]
cars.select_dtypes(include=['float64', 'int64'])
'''
cars_num = cars.select_dtypes(include=['float64', 'int64'])
cars_num
# display histograms for all numeric columns

cars_num.hist(bins=20)
'''
create correlations among numeric cols. method= pearson / kendall
correlation with itself is 1
negative value -> if one value increases, the other one decreases.
Ex: cars_num.corr(method ='pearson') # correlations among all numeric cols
Ex: cars_num.corr(method ='pearson')['MPG_City'] # correlations between other cols and
MPG_City
'''
cars_num.corr(method ='pearson')['MPG_City'] # correlations between other cols and
MPG_City
# draw pair plots to show correlations

sns.pairplot(cars_num, vars = ["EngineSize", "Cylinders", "Horsepower", "MPG_City",
"MPG_Highway"], hue='MPG_City')
# thin lines or dots very much closer (not spread apart) represent good correlations.
# box plots can be drawn only for categorical variables
box1 = sns.boxplot(x='Type', y='MPG_City', data=cars)
# SUV gives less mileage and Hybrid gives more mileasge
# sedan and wagon show more variation
box2 = sns.boxplot(x='Origin', y='MPG_City', data=cars)

# The origin of car is Asia gives slightly more mileage
box3 = sns.boxplot(x='DriveTrain', y='MPG_City', data=cars)

# Front drive train is giving more mileage
'''
regression plot is useful to show regression line
if many points are nearer to the line, then there is better correlation
regression line makes the difference between regression plot and correlation plot
'''
sns.regplot(cars['Length'], cars['MPG_City'])
Task on eda
Do EDA On titanic dataset

Eda Notes

Caricato da

Informazioni sul documento

Titolo originale

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

Eda Notes

Caricato da

Copyright:

Formati disponibili

EDA

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to

NOTE: To find the total no of rows which are missing in a column

Analyze CARS.CSV file.

# import cars dataset - this is downloaded from kaggle

# display top 5 records

# show how many rows and cols in the dataset

# remove duplicate rows if any

# remove the rows having missing values (in Cylinders)

# shape of the dataset -- no. of rows, cols

# remove any space in column names

# sort the data w.r.t a column -- here we sort on 'MPG_City'

# draw histogram for MPG_City in the intervals of 10

# probability density function -- represents values in 0 to 1 range

# draw histogram for Length in the intervals of 10

# probability density function for Length

# display histograms for all numeric columns

# draw pair plots to show correlations

box2 = sns.boxplot(x='Origin', y='MPG_City', data=cars)

box3 = sns.boxplot(x='DriveTrain', y='MPG_City', data=cars)

Do EDA On titanic dataset

Potrebbero piacerti anche