Sei sulla pagina 1di 4

EDA

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to


summarize their main characteristics, often with visual methods.

A statistical model can be used or not, but primarily EDA is for seeing what the data can tell
us beyond the formal modeling or hypothesis testing task.

Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore
the data, and possibly formulate hypotheses that could lead to new data collection and
experiments.

NOTE: To find the total no of rows which are missing in a column


cars.isnull().sum()

Problem

Analyze CARS.CSV file.

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# import cars dataset - this is downloaded from kaggle


# here dataframe by the name 'cars' is created
cars = pd.read_csv("f:/eda/cars.csv")

# display top 5 records


cars.head()

# display which column has how many missing values (or null values)
cars.isnull().sum()

# show how many rows and cols in the dataset


cars.shape

# dataset information
cars.info()
# remove unimportant cols: MSRP, Invoice
cars = cars.drop(['MSRP', 'Invoice'], axis=1)

# remove duplicate rows if any


# keep the first row and remove other duplicate rows of that row
cars = cars.drop_duplicates(keep='first')

# remove the rows having missing values (in Cylinders)


cars.dropna(inplace=True)

# shape of the dataset -- no. of rows, cols


cars.shape

# summary statistics
# if std is 0, that colum should be removed from analysis
cars.describe()

# remove any space in column names


cars.columns = cars.columns.str.replace(' ', '')

# sort the data w.r.t a column -- here we sort on 'MPG_City'


cars_sort = cars.sort_values(by='MPG_City', ascending=False)
cars_sort.head()

'''
iloc[] -> gives integer location based indexing / selection
cars.iloc[0] -> gives 0th row
cars.iloc[-1] -> gives last row
cars.iloc[0:5] -> select 0 to 4th rows
cars.iloc[:, 0] -> select all rows and 0th col
cars.iloc[:, 0:5] -> select all rows and cols from 0 to 4th
cars.iloc[0,2,4], [1,3,5]] -> select 0,2,4 rows with 1,3,5 cols
'''
# select first 10 rows in MPG_City column
x = cars.iloc[0:10, 8]
x

# draw histogram for MPG_City in the intervals of 10


num_bins = 10
plt.hist(cars['MPG_City'], num_bins, color='green', edgecolor='black')

# probability density function -- represents values in 0 to 1 range


# use seaborn - give a tuple to the cars array.
sns.distplot(cars[('MPG_City')], bins=10, color='green')

# draw histogram for Length in the intervals of 10


num_bins = 10
plt.hist(cars['Length'], num_bins)

# probability density function for Length


sns.distplot(cars[('Length')], bins=10)

'''
To select only numeric type of columns, we can give their names or datatypes as:
cars[['EngineSize','Cylinders','Horsepower','MPG_City','MPG_Highway','Weight','Wheelbase'
,'Length']]
cars.select_dtypes(include=['float64', 'int64'])

'''
cars_num = cars.select_dtypes(include=['float64', 'int64'])
cars_num

# display histograms for all numeric columns


cars_num.hist(bins=20)

'''
create correlations among numeric cols. method= pearson / kendall
correlation with itself is 1
negative value -> if one value increases, the other one decreases.
Ex: cars_num.corr(method ='pearson') # correlations among all numeric cols
Ex: cars_num.corr(method ='pearson')['MPG_City'] # correlations between other cols and
MPG_City
'''
cars_num.corr(method ='pearson')['MPG_City'] # correlations between other cols and
MPG_City

# draw pair plots to show correlations


sns.pairplot(cars_num, vars = ["EngineSize", "Cylinders", "Horsepower", "MPG_City",
"MPG_Highway"], hue='MPG_City')
# thin lines or dots very much closer (not spread apart) represent good correlations.
# box plots can be drawn only for categorical variables
box1 = sns.boxplot(x='Type', y='MPG_City', data=cars)
# SUV gives less mileage and Hybrid gives more mileasge
# sedan and wagon show more variation

box2 = sns.boxplot(x='Origin', y='MPG_City', data=cars)


# The origin of car is Asia gives slightly more mileage

box3 = sns.boxplot(x='DriveTrain', y='MPG_City', data=cars)


# Front drive train is giving more mileage

'''
regression plot is useful to show regression line
if many points are nearer to the line, then there is better correlation
regression line makes the difference between regression plot and correlation plot
'''
sns.regplot(cars['Length'], cars['MPG_City'])

Task on eda

Do EDA On titanic dataset

Potrebbero piacerti anche