Cleaning Dirty Data with Pandas & Python
Pandas is a popular Python library used for data science and analysis. Used in conjunction with other data science toolsets like SciPy, NumPy, and Matplotlib, a modeler can create end-to-end analytic workflows to solve business problems.

While you can do a lot of really powerful things with Python and data analysis, your analysis is only ever as good as your dataset. And many datasets have missing, malformed, or erroneous data. It's often unavoidable: anything from incomplete reporting to technical glitches can cause "dirty" data.

Thankfully, Pandas provides a robust library of functions to help you clean up, sort through, and make sense of your datasets, no matter what state they're in. For our example, we're going to use a dataset of 5,000 movies scraped from IMDB. It contains information on the actors, directors, budget, and gross, as well as the IMDB rating and release year. In practice, you'll be using much larger datasets consisting of potentially millions of rows, but this is a good sample dataset to start with.


Unfortunately, some of the fields in this dataset aren't filled in, and some of them have default values such as 0 or NaN (Not a Number).

No good. Let's go through some Pandas hacks you can use to clean up your dirty data.

Getting started
To get started with Pandas, first you will need to have it installed. You can do so by running:

$ pip install pandas

Then we need to load the data we downloaded into Pandas. You can do this with a few Python commands:

import pandas as pd

data = pd.read_csv('movie_metadata.csv')

Make sure you have your movie dataset in the same folder as the Python script you're running. If you have it stored elsewhere, you'll need to change the read_csv parameter to point to the file's location.

Look at your data


To check out the basic structure of the data we just read in, you can use the head() command to print out the first five rows. That should give you a general idea of the structure of the dataset.

data.head()

When we look at the dataset either in Pandas or in a more traditional program like Excel, we can start to note down the problems, and then we'll come up with solutions to fix those problems.
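Beyond head(), a couple of other built-in calls can help surface problems quickly. A minimal inspection sketch (nothing here is specific to the movie dataset):

# Column names, dtypes, and non-null counts in one view
data.info()

# Count the missing values in each column
print(data.isnull().sum())

# Basic descriptive statistics for the numeric columns
print(data.describe())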


Pandas has some selection methods which you can use to slice and dice the dataset based on your queries. Let's go through some quick examples before moving on:

Look at some basic stats for the 'imdb_score' column: data.imdb_score.describe()
Select a column: data['movie_title']
Select the first 10 rows of a column: data['duration'][:10]
Select multiple columns: data[['budget', 'gross']]
Select all movies over two hours long: data[data['duration'] > 120]

Deal with missing data


One of the most common problems is missing data. This could be because it was never filled out properly, the data wasn't available, or there was a computing error. Whatever the reason, if we leave the blank values in there, it will cause errors in analysis later on. There are a couple of ways to deal with missing data:

Add in a default value for the missing data
Get rid of (delete) the rows that have missing data
Get rid of (delete) the columns that have a high incidence of missing data

We’ll go through each of those in turn.

Add default values


First of all, we should probably get rid of all those nasty NaN values. But what to put in their place? Well, this is where you're going to have to eyeball the data a little bit. For our example, let's look at the 'country' column. It's straightforward enough, but some of the movies don't have a country provided, so the data shows up as NaN. In this case, we probably don't want to assume the country, so we can replace it with an empty string or some other default value.

data.country = data.country.fillna('')

This replaces the NaN entries in the 'country' column with the empty string, but we could just as easily tell it to replace with a default name such as "None Given". You can find more information on fillna() in the Pandas documentation.
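For instance, here's a sketch of filling with a readable label instead of an empty string, and of filling several columns at once by passing fillna() a dict that maps column names to defaults (the "None Given" label and the second column name are illustrative choices, not requirements):

# Use a readable placeholder for missing countries
data.country = data.country.fillna('None Given')

# Fill multiple columns in one call with a dict of column -> default
# ('language' is assumed to be another text column in this dataset)
data = data.fillna({'country': '', 'language': ''})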

With numerical data like the duration of the movie, a calculation like taking the mean duration can help us even the dataset out. It's not a great measure, but it's an estimate of what the duration could be based on the other data. That way we don't have crazy numbers like 0 or NaN throwing off our analysis.

data.duration = data.duration.fillna(data.duration.mean())
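If a handful of outliers skew the average (a few very long films, say), the median can be a safer fill value; the same pattern applies:

# Median is less sensitive to outliers than the mean
data.duration = data.duration.fillna(data.duration.median())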

Remove incomplete rows


Let’s say we want to get rid of any rows that have a missing
value. It’s a pretty aggressive technique, but there may be a
use case where that’s exactly what you want to do.

Dropping all rows with any NA values is easy:

data.dropna()

Of course, we can also drop rows that have all NA values:

data.dropna(how='all')

We can also put a limitation on how many non-null values need to be in a row in order to keep it (in this example, the data needs to have at least 5 non-null values):

data.dropna(thresh=5)

Let’s say for instance that we don’t want to include any movie
that doesn’t have information on when the movie came out:

data.dropna(subset=['title_year'])

The subset parameter allows you to choose which columns you want to look at. You can also pass it a list of column names here.
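Keep in mind that dropna(), like most of the calls in this post, returns a new DataFrame rather than modifying the original, so assign the result if you want to keep it:

# Keep only rows that have a release year, and save the result
data = data.dropna(subset=['title_year'])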

Deal with error-prone columns


We can apply the same kind of criteria to our columns. We just
need to use the parameter axis=1 in our code. That means to
operate on columns, not rows. (We could have used axis=0 in
our row examples, but it is 0 by default if you don’t enter
anything.)


Drop the columns that are all NA values:

data.dropna(axis=1, how='all')

Drop all columns with any NA values:

data.dropna(axis=1, how='any')

The same threshold and subset parameters from above apply as well. For more information and examples, visit the Pandas documentation.
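As a quick sketch, you could keep only columns that are mostly populated (the 90% cutoff here is an arbitrary choice for illustration):

# Drop any column with non-null values in fewer than 90% of rows
data = data.dropna(axis=1, thresh=int(len(data) * 0.9))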

Normalize data types


Sometimes, especially when you're reading in a CSV with a bunch of numbers, some of the numbers will read in as strings instead of numeric values, or vice versa. Here's a way you can fix that and normalize your data types:

data = pd.read_csv('movie_metadata.csv', dtype={'duration': int})

This tells Pandas that the column 'duration' needs to be an integer value. Similarly, if we want the release year to be a string and not a number, we can do the same kind of thing:

data = pd.read_csv('movie_metadata.csv', dtype={'title_year': str})

Keep in mind that this approach reads the CSV from disk again, so make sure you either normalize your data types first or dump your intermediary results to a file before doing so.
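One caveat: if a numeric column still contains NaN values, asking for a plain int dtype will raise an error, because NaN can't be stored in a standard integer column. A sketch of an in-memory alternative that converts after loading (the nullable 'Int64' dtype is a pandas extension type available in recent versions):

# Coerce anything unparseable to NaN, then use a nullable integer dtype
data['duration'] = pd.to_numeric(data['duration'], errors='coerce').astype('Int64')

# Or fill the gaps first and cast to a plain int
data['duration'] = data['duration'].fillna(data['duration'].mean()).astype(int)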

Change casing
Columns with user-provided data are ripe for corruption. People make typos, leave their caps lock on (or off), and add extra spaces where they shouldn't.

To change all our movie titles to uppercase:

data['movie_title'].str.upper()

Similarly, to get rid of trailing whitespace:

data['movie_title'].str.strip()
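As before, these string methods return a new Series rather than changing the column in place, so assign the result back if you want to keep it; a combined sketch:

# Trim stray whitespace and normalize casing, then store the result
data['movie_title'] = data['movie_title'].str.strip().str.upper()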


We won't be able to cover correcting spelling mistakes in this tutorial, but you can read up on fuzzy matching for more information.



Rename columns
Finally, if your data was generated by a computer program, it probably has some computer-generated column names, too. Those can be hard to read and understand while working, so if you want to rename a column to something more user-friendly, you can do it like this:

data.rename(columns={'title_year': 'release_date', 'movie_facebook_likes': 'facebook_likes'})

Here we've renamed 'title_year' to 'release_date' and 'movie_facebook_likes' to simply 'facebook_likes'. Since this is not an in-place operation, you'll need to save the DataFrame by assigning it to a variable.

data = data.rename(columns={'title_year': 'release_date', 'movie_facebook_likes': 'facebook_likes'})

Save your results


When you’re done cleaning your data, you may want to export
it back into CSV format for further processing in another
program. This is easy to do in Pandas:

data.to_csv('cleanfile.csv', encoding='utf-8')
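By default, to_csv() also writes the DataFrame's row index as an extra column; if you don't want that in the exported file, a common variation is:

# Write the cleaned data without the numeric row index
data.to_csv('cleanfile.csv', encoding='utf-8', index=False)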

More resources
Of course, this is only the tip of the iceberg. With variations in
user environments, languages, and user input, there are many
ways that a potential dataset may be dirty or corrupted. At this
point you should have learned some of the most common
ways to clean your dataset with Pandas and Python.

For more on Pandas and data cleaning, see these additional resources:

Pandas documentation
Messy Data Tutorial
Kaggle Datasets

Python for Data Analysis (“The Pandas Book”)

Al Nelson
Al is a geek about all things tech. He's a professional technical writer and software developer who loves writing for tech businesses and cultivating happy users. You can find him on the web at http://www.alnelsonwrites.com or on Twitter as @musegarden.
August 10th, 2017 | Python, Uncategorized
