
PANDAS

Lesson 1
Create Data - We begin by creating our own data set for analysis. This prevents the end user
reading this tutorial from having to download any files to replicate the results below. We will export
this data set to a text file so that you can get some experience pulling data from a text file.
Get Data - We will learn how to read in the text file. The data consists of baby names and the number
of babies born with that name in the year 1880.
Prepare Data - Here we will simply take a look at the data and make sure it is clean. By clean I
mean we will take a look inside the contents of the text file and look for any anomalies. These can
include missing data, inconsistencies in the data, or any other data that seems out of place. If any
are found we will then have to make decisions on what to do with these records.
Analyze Data - We will simply find the most popular name in a specific year.
Present Data - Through tabular data and a graph, clearly show the end user the most popular
name in a specific year.
The pandas library is used for all the data analysis excluding a small piece of the data presentation
section. The matplotlib library will only be needed for the data presentation section. Importing the
libraries is the first step we will take in the lesson.

# Import all libraries needed for the tutorial

# General syntax to import specific functions in a library:


##from (library) import (specific library function)
from pandas import DataFrame, read_csv

# General syntax to import a library but no functions:


##import (library) as (give the library a nickname/alias)
import matplotlib.pyplot as plt
import pandas as pd #this is how I usually import pandas
import sys #only needed to determine Python version number

# Enable inline plotting


%matplotlib inline

print 'Python version ' + sys.version

print 'Pandas version ' + pd.__version__


Python version 2.7.5 |Anaconda 2.1.0 (64-bit)| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]
Pandas version 0.15.2

Create Data
The data set will consist of 5 baby names and the number of births recorded for that year (1880).

# The initial set of baby names and birth counts


names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]

To merge these two lists together we will use the zip function.

zip?

BabyDataSet = zip(names,births)
BabyDataSet
[('Bob', 968), ('Jessica', 155), ('Mary', 77), ('John', 578), ('Mel', 973)]

We are basically done creating the data set. We now will use the pandas library to export this data
set into a csv file.
df will be a DataFrame object. You can think of this object as holding the contents of the BabyDataSet
in a format similar to a sql table or an excel spreadsheet. Let's take a look below at the contents
inside df.
df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])
df

   Names    Births
0  Bob      968
1  Jessica  155
2  Mary     77
3  John     578
4  Mel      973
Export the dataframe to a csv file. We can name the file births1880.csv. The function to_csv will be
used to export the file. The file will be saved in the same location of the notebook unless specified
otherwise.
In [7]:
df.to_csv?

The only parameters we will use are index and header. Setting these parameters to False will prevent
the index and header names from being exported. Change the values of these parameters to get a
better understanding of their use.
In [8]:
df.to_csv('births1880.csv',index=False,header=False)

Get Data
To pull in the csv file, we will use the pandas function read_csv. Let us take a look at this function
and what inputs it takes.
In [9]:
read_csv?

Even though this function has many parameters, we will simply pass it the location of the text file.
Location = C:\Users\ENTER_USER_NAME.xy\startups\births1880.csv
Note: Depending on where you save your notebooks, you may need to modify the location above.
In [10]:
Location = r'C:\Users\david\notebooks\pandas\births1880.csv'
df = pd.read_csv(Location)

Notice the r before the string. Since backslashes are special characters in Python strings, prefixing
the string with r makes it a raw string, so the backslashes are passed through literally instead of
being treated as escape sequences.
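As a quick illustration (a minimal sketch, separate from the lesson's data flow), compare the two forms:

# Without the r prefix, \n is read as a newline escape character
print('C:\new_folder')   # the \n breaks the path onto two lines
print(r'C:\new_folder')  # the raw string prints exactly as written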
In [11]:
df

Out[11]:
   Bob      968
0  Jessica  155
1  Mary     77
2  John     578
3  Mel      973
This brings us to our first problem of the exercise. The read_csv function treated the first record in
the csv file as the header names. This is obviously not correct since the text file did not provide us
with header names.
To correct this we will pass the header parameter to the read_csv function and set it to None
(which means null in Python).
In [12]:
df = pd.read_csv(Location, header=None)
df

Out[12]:
   0        1
0  Bob      968
1  Jessica  155
2  Mary     77
3  John     578
4  Mel      973
If we wanted to give the columns specific names, we would have to pass another parameter called
names. We can also omit the header parameter.
In [13]:
df = pd.read_csv(Location, names=['Names','Births'])
df

Out[13]:
   Names    Births
0  Bob      968
1  Jessica  155
2  Mary     77
3  John     578
4  Mel      973
You can think of the numbers [0,1,2,3,4] as the row numbers in an Excel file. In pandas these are
part of the index of the dataframe. You can think of the index as the primary key of a sql table, with
the exception that an index is allowed to have duplicates.
[Names, Births] can be thought of as column headers similar to the ones found in an Excel
spreadsheet or sql database.
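To see what "duplicates allowed" means in practice, here is a small hypothetical example (not part of the lesson's data flow):

# A dataframe whose index repeats the label 0
dup = pd.DataFrame({'Births': [968, 155, 77]}, index=[0, 0, 1])
dup.loc[0]  # returns BOTH rows labeled 0 - a sql primary key would forbid this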

Delete the csv file now that we are done using it.
In [14]:
import os
os.remove(Location)

Prepare Data
The data we have consists of baby names and the number of births in the year 1880. We already
know that we have 5 records and none of the records are missing (non-null values).
The Names column at this point is of no concern since it most likely is just composed of alphanumeric
strings (baby names). There is a chance of bad data in this column but we will not worry
about that at this point of the analysis. The Births column should just contain integers representing
the number of babies born in a specific year with a specific name. We can check if all the data is
of the data type integer. It would not make sense to have this column have a data type of float. I
would not worry about any possible outliers at this point of the analysis.
Realize that aside from the check we did on the "Names" column, briefly looking at the data inside
the dataframe should be as far as we need to go at this stage of the game. As we continue in the
data analysis life cycle we will have plenty of opportunities to find any issues with the data set.
In [15]:
# Check data type of the columns
df.dtypes

Out[15]:
Names     object
Births     int64
dtype: object

In [16]:
# Check data type of Births column
df.Births.dtype

Out[16]:
dtype('int64')

As you can see the Births column is of type int64; thus no floats (decimal numbers) or alphanumeric
characters will be present in this column.
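If you wanted to enforce this check programmatically rather than just eyeballing it, a simple assertion would do (a sketch; the pattern is generic and not part of the original lesson):

import numpy as np

# Stop the analysis early if the column is not integer-typed
assert df['Births'].dtype == np.int64, 'Births column should contain only integers'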

Analyze Data
To find the most popular name, or the baby name with the highest birth count, we can do one of the
following.

Sort the dataframe and select the top row


Use the max() attribute to find the maximum value
In [17]:

# Method 1:
Sorted = df.sort(['Births'], ascending=False)

Sorted.head(1)

Out[17]:
   Names  Births
4  Mel    973
In [18]:
# Method 2:
df['Births'].max()

Out[18]:
973

Present Data
Here we can plot the Births column and label the graph to show the end user the highest point on
the graph. In conjunction with the table, the end user has a clear picture that Mel is the most popular
baby name in the data set.
plot() is a convenient attribute where pandas lets you painlessly plot the data in your dataframe. We
learned how to find the maximum value of the Births column in the previous section. Finding the
actual baby name associated with the value 973 looks a bit tricky, so let's go over it.
Explain the pieces:
df['Names'] - This is the entire list of baby names, the entire Names column
df['Births'] - This is the entire list of Births in the year 1880, the entire Births column
df['Births'].max() - This is the maximum value found in the Births column
[df['Births'] == df['Births'].max()] IS EQUAL TO [Find all of the records in the Births column where it is
equal to 973]
df['Names'][df['Births'] == df['Births'].max()] IS EQUAL TO Select all of the records in the Names
column WHERE [The Births column is equal to 973]
An alternative way could have been to use the Sorted dataframe:
Sorted['Names'].head(1).values
The str() function simply converts an object into a string.
In [19]:
# Create graph
df['Births'].plot()

# Maximum value in the data set


MaxValue = df['Births'].max()

# Name associated with the maximum value


MaxName = df['Names'][df['Births'] == df['Births'].max()].values

# Text to display on graph


Text = str(MaxValue) + " - " + MaxName

# Add text to graph


plt.annotate(Text, xy=(1, MaxValue), xytext=(8, 0),
             xycoords=('axes fraction', 'data'), textcoords='offset points')

print "The most popular name"


df[df['Births'] == df['Births'].max()]
#Sorted.head(1) can also be used
The most popular name

Out[19]:
   Names  Births
4  Mel    973

Lesson 2
In [1]:
# The usual preamble
import pandas as pd
import matplotlib.pyplot as plt

# Make the graphs a bit prettier, and bigger
pd.set_option('display.mpl_style', 'default')
pd.set_option('display.line_width', 5000)
pd.set_option('display.max_columns', 60)

# The bare figsize() helper only exists under %pylab, so set the
# matplotlib default figure size directly instead
plt.rcParams['figure.figsize'] = (15, 5)

We're going to use a new dataset here, to demonstrate how to deal with larger datasets. This is a
subset of the 311 service requests from NYC Open Data.
In [2]:
complaints = pd.read_csv('../data/311-service-requests.csv')

2.1 What's even in it? (the summary)


When you look at a large dataframe, instead of showing you its full contents, pandas will show
you a summary. This includes all the columns, and how many non-null values there are in each
column.
In [3]:
complaints

Out[3]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 111069 entries, 0 to 111068
Data columns (total 52 columns):
Unique Key                        111069 non-null values
Created Date                      111069 non-null values
Closed Date                       60270 non-null values
Agency                            111069 non-null values
Agency Name                       111069 non-null values
Complaint Type                    111069 non-null values
Descriptor                        111068 non-null values
Location Type                     79048 non-null values
Incident Zip                      98813 non-null values
Incident Address                  84441 non-null values
Street Name                       84438 non-null values
Cross Street 1                    84728 non-null values
Cross Street 2                    84005 non-null values
Intersection Street 1             19364 non-null values
Intersection Street 2             19366 non-null values
Address Type                      102247 non-null values
City                              98860 non-null values
Landmark                          95 non-null values
Facility Type                     110938 non-null values
Status                            111069 non-null values
Due Date                          39239 non-null values
Resolution Action Updated Date    96507 non-null values
Community Board                   111069 non-null values
Borough                           111069 non-null values
X Coordinate (State Plane)        98143 non-null values
Y Coordinate (State Plane)        98143 non-null values
Park Facility Name                111069 non-null values
Park Borough                      111069 non-null values
School Name                       111069 non-null values
School Number                     111052 non-null values
School Region                     110524 non-null values
School Code                       110524 non-null values
School Phone Number               111069 non-null values
School Address                    111069 non-null values
School City                       111069 non-null values
School State                      111069 non-null values
School Zip                        111069 non-null values
School Not Found                  38984 non-null values
School or Citywide Complaint      0 non-null values
Vehicle Type                      99 non-null values
Taxi Company Borough              117 non-null values
Taxi Pick Up Location             1059 non-null values
Bridge Highway Name               185 non-null values
Bridge Highway Direction          185 non-null values
Road Ramp                         184 non-null values
Bridge Highway Segment            223 non-null values
Garage Lot Name                   49 non-null values
Ferry Direction                   37 non-null values
Ferry Terminal Name               336 non-null values
Latitude                          98143 non-null values
Longitude                         98143 non-null values
Location                          98143 non-null values
dtypes: float64(5), int64(1), object(46)

2.2 Selecting columns and rows


To select a column, we index with the name of the column, like this:
In [4]:
complaints['Complaint Type']

Out[4]:
0          Noise - Street/Sidewalk
1                  Illegal Parking
2               Noise - Commercial
3                  Noise - Vehicle
4                           Rodent
5               Noise - Commercial
6                 Blocked Driveway
7               Noise - Commercial
8               Noise - Commercial
9               Noise - Commercial
10        Noise - House of Worship
11              Noise - Commercial
12                 Illegal Parking
13                 Noise - Vehicle
14                          Rodent
...
111054     Noise - Street/Sidewalk
111055          Noise - Commercial
111056       Street Sign - Missing
111057                       Noise
111058          Noise - Commercial
111059     Noise - Street/Sidewalk
111060                       Noise
111061          Noise - Commercial
111062                Water System
111063                Water System
111064     Maintenance or Facility
111065             Illegal Parking
111066     Noise - Street/Sidewalk
111067          Noise - Commercial
111068            Blocked Driveway
Name: Complaint Type, Length: 111069, dtype: object
To get the first 5 rows of a dataframe, we can use a slice: df[:5].

This is a great way to get a sense for what kind of information is in the dataframe -- take a minute to
look at the contents and get a feel for this dataset.
In [5]:
complaints[:5]

Out[5]:
[Wide table output omitted: the first 5 rows of the dataframe across all 52 columns did not survive
extraction. The rows are the same five complaints seen above (Noise - Street/Sidewalk, Illegal
Parking, Noise - Commercial, Noise - Vehicle, Rodent) together with their dates, agencies,
addresses, boroughs, and coordinates.]
We can combine these to get the first 5 rows of a column:
In [6]:
complaints['Complaint Type'][:5]

Out[6]:
0    Noise - Street/Sidewalk
1            Illegal Parking
2         Noise - Commercial
3            Noise - Vehicle
4                     Rodent
Name: Complaint Type, dtype: object

and it doesn't matter which direction we do it in:


In [7]:


complaints[:5]['Complaint Type']

Out[7]:
0    Noise - Street/Sidewalk
1            Illegal Parking
2         Noise - Commercial
3            Noise - Vehicle
4                     Rodent
Name: Complaint Type, dtype: object

2.3 Selecting multiple columns


What if we just want to know the complaint type and the borough, but not the rest of the information?
Pandas makes it really easy to select a subset of the columns: just index with a list of the columns
you want.
In [8]:
complaints[['Complaint Type', 'Borough']]

Out[8]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 111069 entries, 0 to 111068
Data columns (total 2 columns):
Complaint Type    111069 non-null values
Borough           111069 non-null values
dtypes: object(2)

That showed us a summary, and then we can look at the first 10 rows:
In [9]:
complaints[['Complaint Type', 'Borough']][:10]

Out[9]:
   Complaint Type           Borough
0  Noise - Street/Sidewalk  QUEENS
1  Illegal Parking          QUEENS
2  Noise - Commercial       MANHATTAN
3  Noise - Vehicle          MANHATTAN
4  Rodent                   MANHATTAN
5  Noise - Commercial       QUEENS
6  Blocked Driveway         QUEENS
7  Noise - Commercial       QUEENS
8  Noise - Commercial       MANHATTAN
9  Noise - Commercial       BROOKLYN


2.4 What's the most common complaint type?

This is a really easy question to answer! There's a .value_counts() method that we can use:
In [10]:
complaints['Complaint Type'].value_counts()

Out[10]:
HEATING                     14200
GENERAL CONSTRUCTION         7471
Street Light Condition       7117
DOF Literature Request       5797
PLUMBING                     5373
PAINT - PLASTER              5149
Blocked Driveway             4590
NONCONST                     3998
Street Condition             3473
Illegal Parking              3343
Noise                        3321
Traffic Signal Condition     3145
Dirty Conditions             2653
Water System                 2636
Noise - Commercial           2578
...
Opinion for the Mayor                2
Window Guard                         2
DFTA Literature Request              2
Legal Services Provider Complaint    2
Open Flame Permit                    1
Snow                                 1
Municipal Parking Facility           1
X-Ray Machine/Equipment              1
Stalled Sites                        1
DHS Income Savings Requirement       1
Tunnel Condition                     1
Highway Sign - Damaged               1
Ferry Permit                         1
Trans Fat                            1
DWD                                  1
Length: 165, dtype: int64

If we just wanted the top 10 most common complaints, we can do this:


In [11]:
complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10]

Out[11]:
HEATING                   14200
GENERAL CONSTRUCTION       7471
Street Light Condition     7117
DOF Literature Request     5797
PLUMBING                   5373
PAINT - PLASTER            5149
Blocked Driveway           4590
NONCONST                   3998
Street Condition           3473
Illegal Parking            3343
dtype: int64

But it gets better! We can plot them!


In [12]:
complaint_counts[:10].plot(kind='bar')

Out[12]:
<matplotlib.axes.AxesSubplot at 0x7ba2290>


Lesson 3
Get Data - Our data set will consist of an Excel file containing customer counts per date. We will
learn how to read in the excel file for processing.
Prepare Data - The data is an irregular time series having duplicate dates. We will be challenged in
compressing the data and coming up with next year's forecasted customer count.
Analyze Data - We use graphs to visualize trends and spot outliers. Some built in computational
tools will be used to calculate next year's forecasted customer count.
Present Data - The results will be plotted.
NOTE: Make sure you have looked through all previous lessons, as the knowledge learned in
previous lessons will be needed for this exercise.
In [1]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy.random as np  # note: np here is an alias for numpy.random, not numpy
import sys

%matplotlib inline

In [2]:
print 'Python version ' + sys.version
print 'Pandas version: ' + pd.__version__
Python version 2.7.5 |Anaconda 2.1.0 (64-bit)| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]
Pandas version: 0.15.2

We will be creating our own test data for analysis.


In [3]:
# set seed
np.seed(111)

# Function to generate test data


def CreateDataSet(Number=1):

    Output = []

    for i in range(Number):

        # Create a weekly (mondays) date range
        rng = pd.date_range(start='1/1/2009', end='12/31/2012', freq='W-MON')

        # Create random data
        data = np.randint(low=25, high=1000, size=len(rng))

        # Status pool
        status = [1, 2, 3]

        # Make a random list of statuses
        random_status = [status[np.randint(low=0, high=len(status))] for i in range(len(rng))]

        # State pool
        states = ['GA', 'FL', 'fl', 'NY', 'NJ', 'TX']

        # Make a random list of states
        random_states = [states[np.randint(low=0, high=len(states))] for i in range(len(rng))]

        Output.extend(zip(random_states, random_status, data, rng))

    return Output

Now that we have a function to generate our test data, let's create some data and stick it into a
dataframe.
In [4]:
dataset = CreateDataSet(4)
df = pd.DataFrame(data=dataset, columns=['State','Status','CustomerCount','StatusDate'])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 836 entries, 0 to 835
Data columns (total 4 columns):
State            836 non-null object
Status           836 non-null int64
CustomerCount    836 non-null int64
StatusDate       836 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 32.7+ KB

In [5]:
df.head()

Out[5]:
   State  Status  CustomerCount  StatusDate
0  GA     1       877            2009-01-05
1  FL     1       901            2009-01-12
2  fl     3       749            2009-01-19
3  FL     3       111            2009-01-26
4  GA     1       300            2009-02-02
We are now going to save this dataframe into an Excel file, to then bring it back to a dataframe. We
simply do this to show you how to read and write to Excel files.
We do not write the index values of the dataframe to the Excel file, since they are not meant to be
part of our initial test data set.
In [6]:
# Save results to excel
df.to_excel('Lesson3.xlsx', index=False)
print 'Done'
Done

Grab Data from Excel


We will be using the read_excel function to read in data from an Excel file. The function allows you
to read in specific tabs by name or location.
In [7]:
pd.read_excel?


Note: The location of the Excel file will be in the same folder as the notebook, unless
specified otherwise.
In [8]:
# Location of file
Location = r'C:\Users\david\notebooks\pandas\Lesson3.xlsx'

# Parse a specific sheet


df = pd.read_excel(Location, 0, index_col='StatusDate')
df.dtypes

Out[8]:
State            object
Status            int64
CustomerCount     int64
dtype: object

In [9]:
df.index

Out[9]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2009-01-05, ..., 2012-12-31]
Length: 836, Freq: None, Timezone: None

In [10]:
df.head()

Out[10]:
            State  Status  CustomerCount
StatusDate
2009-01-05  GA     1       877
2009-01-12  FL     1       901
2009-01-19  fl     3       749
2009-01-26  FL     3       111
2009-02-02  GA     1       300

Prepare Data
This section attempts to clean up the data for analysis.
1. Make sure the state column is all in upper case
2. Only select records where the account status is equal to "1"
3. Merge (NJ and NY) to NY in the state column
4. Remove any outliers (any odd results in the data set)

Let's take a quick look at how some of the State values are upper case and some are lower case.
In [11]:
df['State'].unique()

Out[11]:
array([u'GA', u'FL', u'fl', u'TX', u'NY', u'NJ'], dtype=object)

To convert all the State values to upper case we will use the upper() function and the dataframe's
apply attribute. The lambda function will simply apply the upper function to each value in the State
column.
In [12]:
# Clean State Column, convert to upper case
df['State'] = df.State.apply(lambda x: x.upper())

In [13]:
df['State'].unique()

Out[13]:
array([u'GA', u'FL', u'TX', u'NY', u'NJ'], dtype=object)

In [14]:
# Only grab where Status == 1
mask = df['Status'] == 1
df = df[mask]

To turn the NJ states to NY we simply...


[df.State == 'NJ'] - Find all records in the State column where they are equal to NJ.
df.State[df.State == 'NJ'] = 'NY' - For all records in the State column where they are equal to NJ,
replace them with NY.
In [15]:
# Convert NJ to NY
mask = df.State == 'NJ'
df['State'][mask] = 'NY'

Now we can see we have a much cleaner data set to work with.
In [16]:


df['State'].unique()

Out[16]:
array([u'GA', u'FL', u'NY', u'TX'], dtype=object)

At this point we may want to graph the data to check for any outliers or inconsistencies in the data.
We will be using the plot() attribute of the dataframe.
As you can see from the graph below it is not very conclusive and is probably a sign that we need to
perform some more data preparation.
In [17]:
df['CustomerCount'].plot(figsize=(15,5));

If we take a look at the data, we begin to realize that there are multiple values for the same State,
StatusDate, and Status combination. It is possible that this means the data you are working with is
dirty/bad/inaccurate, but we will assume otherwise. We can assume this data set is a subset of a
bigger data set and if we simply add the values in the CustomerCount column per State,
StatusDate, and Status we will get the Total Customer Count per day.
In [18]:
sortdf = df[df['State']=='NY'].sort(axis=0)
sortdf.head(10)

Out[18]:
            State  Status  CustomerCount
StatusDate
2009-01-19  NY     1       522
2009-02-23  NY     1       710
2009-03-09  NY     1       992
2009-03-16  NY     1       355
2009-03-23  NY     1       728
2009-03-30  NY     1       863
2009-04-13  NY     1       520
2009-04-20  NY     1       820
2009-04-20  NY     1       937
2009-04-27  NY     1       447
Our task is now to create a new dataframe that compresses the data so we have daily customer
counts per State and StatusDate. We can ignore the Status column since all the values in this
column are equal to 1. To accomplish this we will use the dataframe's groupby and sum() functions.
Note that we had to use reset_index. If we did not, we would not have been able to group by both
the State and the StatusDate, since the groupby function expects only columns as inputs. The
reset_index function will bring the index StatusDate back to a column in the dataframe.
In [19]:
# Group by State and StatusDate


Daily = df.reset_index().groupby(['State','StatusDate']).sum()
Daily.head()

Out[19]:
                  Status  CustomerCount
State StatusDate
FL    2009-01-12  1       901
      2009-02-02  1       653
      2009-03-23  1       752
      2009-04-06  2       1086
      2009-06-08  1       649
The State and StatusDate columns are automatically placed in the index of the Daily dataframe.
You can think of the index as the primary key of a database table but without the constraint of
having unique values. Columns in the index as you will see allow us to easily select, plot, and
perform calculations on the data.
Below we delete the Status column since it is all equal to one and no longer necessary.
In [20]:
del Daily['Status']
Daily.head()

Out[20]:
CustomerCount
State StatusDate
FL 2009-01-12 901
2009-02-02 653
2009-03-23 752
2009-04-06 1086
2009-06-08 649
In [21]:
# What is the index of the dataframe
Daily.index

Out[21]:
MultiIndex(levels=[[u'FL', u'GA', u'NY', u'TX'], [2009-01-05 00:00:00, 2009-0
1-12 00:00:00, 2009-01-19 00:00:00, 2009-02-02 00:00:00, 2009-02-23 00:00:00,
2009-03-09 00:00:00, 2009-03-16 00:00:00, 2009-03-23 00:00:00, 2009-03-30 00:
00:00, 2009-04-06 00:00:00, 2009-04-13 00:00:00, 2009-04-20 00:00:00, 2009-04
-27 00:00:00, 2009-05-04 00:00:00, 2009-05-11 00:00:00, 2009-05-18 00:00:00,
2009-05-25 00:00:00, 2009-06-08 00:00:00, 2009-06-22 00:00:00, 2009-07-06 00:
00:00, 2009-07-13 00:00:00, 2009-07-20 00:00:00, 2009-07-27 00:00:00, 2009-08
-10 00:00:00, 2009-08-17 00:00:00, 2009-08-24 00:00:00, 2009-08-31 00:00:00,
2009-09-07 00:00:00, 2009-09-14 00:00:00, 2009-09-21 00:00:00, 2009-09-28 00:
00:00, 2009-10-05 00:00:00, 2009-10-12 00:00:00, 2009-10-19 00:00:00, 2009-10
-26 00:00:00, 2009-11-02 00:00:00, 2009-11-23 00:00:00, 2009-11-30 00:00:00,
2009-12-07 00:00:00, 2009-12-14 00:00:00, 2010-01-04 00:00:00, 2010-01-11 00:
00:00, 2010-01-18 00:00:00, 2010-01-25 00:00:00, 2010-02-08 00:00:00, 2010-02
-15 00:00:00, 2010-02-22 00:00:00, 2010-03-01 00:00:00, 2010-03-08 00:00:00,
2010-03-15 00:00:00, 2010-04-05 00:00:00, 2010-04-12 00:00:00, 2010-04-26 00:
00:00, 2010-05-03 00:00:00, 2010-05-10 00:00:00, 2010-05-17 00:00:00, 2010-05
-24 00:00:00, 2010-05-31 00:00:00, 2010-06-14 00:00:00, 2010-06-28 00:00:00,
2010-07-05 00:00:00, 2010-07-19 00:00:00, 2010-07-26 00:00:00, 2010-08-02 00:
00:00, 2010-08-09 00:00:00, 2010-08-16 00:00:00, 2010-08-30 00:00:00, 2010-09
-06 00:00:00, 2010-09-13 00:00:00, 2010-09-20 00:00:00, 2010-09-27 00:00:00,
2010-10-04 00:00:00, 2010-10-11 00:00:00, 2010-10-18 00:00:00, 2010-10-25 00:
00:00, 2010-11-01 00:00:00, 2010-11-08 00:00:00, 2010-11-15 00:00:00, 2010-11
-29 00:00:00, 2010-12-20 00:00:00, 2011-01-03 00:00:00, 2011-01-10 00:00:00,
2011-01-17 00:00:00, 2011-02-07 00:00:00, 2011-02-14 00:00:00, 2011-02-21 00:
00:00, 2011-02-28 00:00:00, 2011-03-07 00:00:00, 2011-03-14 00:00:00, 2011-03
-21 00:00:00, 2011-03-28 00:00:00, 2011-04-04 00:00:00, 2011-04-18 00:00:00,
2011-04-25 00:00:00, 2011-05-02 00:00:00, 2011-05-09 00:00:00, 2011-05-16 00:
00:00, 2011-05-23 00:00:00, 2011-05-30 00:00:00, 2011-06-06 00:00:00, ...]],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, ...], [1, 3, 7, 9, 17, 19, 20, 21, 23, 25, 27, 28, 29, 30, 31, 35, 3
8, 40, 41, 44, 45, 46, 47, 48, 49, 52, 54, 56, 57, 59, 60, 62, 66, 68, 69, 70
, 71, 72, 75, 76, 77, 78, 79, 85, 88, 89, 92, 96, 97, 99, 100, 101, 103, 104,
105, 108, 109, 110, 112, 114, 115, 117, 118, 119, 125, 126, 127, 128, 129, 13
1, 133, 134, 135, 136, 137, 140, 146, 150, 151, 152, 153, 157, 0, 3, 7, 22, 2
3, 24, 27, 28, 34, 37, 42, 47, 50, 55, 58, 66, 67, 69, ...]],
names=[u'State', u'StatusDate'])

In [22]:
# Select the State index
Daily.index.levels[0]

Out[22]:
Index([u'FL', u'GA', u'NY', u'TX'], dtype='object')

In [23]:
# Select the StatusDate index
Daily.index.levels[1]

Out[23]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2009-01-05, ..., 2012-12-10]
Length: 161, Freq: None, Timezone: None

Let's now plot the data per State.

As you can see, by breaking the graph up by the State column we get a much clearer picture of
what the data looks like. Can you spot any outliers?
In [24]:
Daily.loc['FL'].plot()
Daily.loc['GA'].plot()
Daily.loc['NY'].plot()
Daily.loc['TX'].plot();

We can also just plot the data for a specific period, like 2012. We can now clearly see that the data for
these states is all over the place. Since the data consists of weekly customer counts, the variability of
the data seems suspect. For this tutorial we will assume bad data and proceed.
In [25]:
Daily.loc['FL']['2012':].plot()
Daily.loc['GA']['2012':].plot()
Daily.loc['NY']['2012':].plot()
Daily.loc['TX']['2012':].plot();

We will assume that per month the customer count should remain relatively steady. Any data outside
a specific range in that month will be removed from the data set. The final result should have smooth
graphs with no spikes.
StateYearMonth - Here we group by State, Year of StatusDate, and Month of StatusDate.
Daily['Outlier'] - A boolean (True or False) value letting us know if the value in the CustomerCount
column is outside the acceptable range.
We will be using the attribute transform instead of apply. The reason is that transform will keep the
shape (number of rows and columns) of the dataframe the same and apply will not. By looking at the
previous graphs, we can see that they do not resemble a Gaussian distribution. This means we
cannot use summary statistics like the mean and standard deviation; we use percentiles instead.
Note that we run the risk of eliminating good data.
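To make the transform/apply difference concrete, here is a tiny sketch with hypothetical data (not our customer counts):

# transform returns one value per original row, aligned to the index;
# apply collapses each group down to a single row
demo = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 10]})
demo.groupby('key')['val'].transform(lambda x: x.max())  # length 3: 2, 2, 10
demo.groupby('key')['val'].apply(lambda x: x.max())      # length 2: a -> 2, b -> 10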
In [26]:


# Calculate Outliers
StateYearMonth = Daily.groupby([Daily.index.get_level_values(0), Daily.index.get_level_values(1).year, Daily.index.get_level_values(1).month])
Daily['Lower'] = StateYearMonth['CustomerCount'].transform( lambda x: x.quantile(q=.25) - (1.5*x.quantile(q=.75)-x.quantile(q=.25)) )
Daily['Upper'] = StateYearMonth['CustomerCount'].transform( lambda x: x.quantile(q=.75) + (1.5*x.quantile(q=.75)-x.quantile(q=.25)) )
Daily['Outlier'] = (Daily['CustomerCount'] < Daily['Lower']) | (Daily['CustomerCount'] > Daily['Upper'])

# Remove Outliers
Daily = Daily[Daily['Outlier'] == False]

The dataframe named Daily will hold customer counts that have been aggregated per day. The
original data (df) has multiple records per day. We are left with a data set that is indexed by both the
state and the StatusDate. The Outlier column should be equal to False signifying that the record is
not an outlier.
In [27]:
Daily.head()

Out[27]:
                  CustomerCount  Lower  Upper   Outlier
State StatusDate
FL    2009-01-12  901            450.5  1351.5  False
      2009-02-02  653            326.5  979.5   False
      2009-03-23  752            376.0  1128.0  False
      2009-04-06  1086           543.0  1629.0  False
      2009-06-08  649            324.5  973.5   False
We create a separate dataframe named ALL which groups the Daily dataframe by StatusDate. We
are essentially getting rid of the State column. The Max column represents the maximum customer
count per month. The Max column is used to smooth out the graph.
In [28]:
# Combine all markets

# Sum the customer counts over all states for each date


ALL = pd.DataFrame(Daily['CustomerCount'].groupby(Daily.index.get_level_values(1)).sum())
ALL.columns = ['CustomerCount'] # rename column

# Group by Year and Month


YearMonth = ALL.groupby([lambda x: x.year, lambda x: x.month])

# What is the max customer count per Year and Month


ALL['Max'] = YearMonth['CustomerCount'].transform(lambda x: x.max())
ALL.head()

Out[28]:
            CustomerCount  Max
StatusDate
2009-01-05  877            901
2009-01-12  901            901
2009-01-19  522            901
2009-02-02  953            953
2009-02-23  710            953
As you can see from the ALL dataframe above, in the month of January 2009 the maximum
customer count was 901. If we had used apply, we would have gotten a dataframe with (Year and
Month) as the index and just the Max column with the value of 901, as sketched below.
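For comparison, this is roughly what the apply version would look like (a sketch of the alternative described above, not used in the lesson):

# apply collapses each (Year, Month) group to a single row,
# instead of keeping one row per StatusDate like transform does
YearMonth['CustomerCount'].apply(lambda x: x.max())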
There is also an interest in gauging whether the current customer counts are reaching certain goals
the company has established. The task here is to visually show whether the current customer counts
are meeting the goals listed below. We will call the goals BHAG (Big Hairy Annual Goal).

12/31/2011 - 1,000 customers


12/31/2012 - 2,000 customers
12/31/2013 - 3,000 customers

We will be using the date_range function to create our dates.


Definition: date_range(start=None, end=None, periods=None, freq='D', tz=None, normalize=False,
name=None, closed=None)
Docstring: Return a fixed frequency datetime index, with day (calendar) as the default frequency
By choosing the frequency to be A or annual we will be able to get the three target dates from
above.
In [29]:
date_range?
Object `date_range` not found.

In [30]:
# Create the BHAG dataframe


data = [1000,2000,3000]
idx = pd.date_range(start='12/31/2011', end='12/31/2013', freq='A')
BHAG = pd.DataFrame(data, index=idx, columns=['BHAG'])
BHAG

Out[30]:
BHAG
2011-12-31 1000
2012-12-31 2000
2013-12-31 3000
Combining dataframes, as we have learned in previous lessons, is made simple using the concat
function. Remember, when we choose axis=0 we are appending row-wise.
In [31]:
# Combine the BHAG and the ALL data set
combined = pd.concat([ALL,BHAG], axis=0)
combined = combined.sort(axis=0)
combined.tail()

Out[31]:
            BHAG  CustomerCount  Max
2012-11-19  NaN   136            1115
2012-11-26  NaN   1115           1115
2012-12-10  NaN   1269           1269
2012-12-31  2000  NaN            NaN
2013-12-31  3000  NaN            NaN
In [32]:
fig, axes = plt.subplots(figsize=(12, 7))

combined['BHAG'].fillna(method='pad').plot(color='green', label='BHAG')
combined['Max'].plot(color='blue', label='All Markets')
plt.legend(loc='best');

There was also a need to forecast next year's customer count, and we can do this in a couple of
simple steps. We will first group the combined dataframe by Year and take the maximum customer
count for that year. This will give us one row per Year.
In [33]:


# Group by Year and then get the max value per year
Year = combined.groupby(lambda x: x.year).max()
Year

Out[33]:
      BHAG  CustomerCount  Max
2009  NaN   2452           2452
2010  NaN   2065           2065
2011  1000  2711           2711
2012  2000  2061           2061
2013  3000  NaN            NaN
In [34]:
# Add a column representing the percent change per year
Year['YR_PCT_Change'] = Year['Max'].pct_change(periods=1)
Year

Out[34]:
      BHAG  CustomerCount  Max   YR_PCT_Change
2009  NaN   2452           2452  NaN
2010  NaN   2065           2065  -0.157830
2011  1000  2711           2711  0.312833
2012  2000  2061           2061  -0.239764
2013  3000  NaN            NaN   NaN
To get next year's ending customer count we will assume our current growth rate remains constant.
We then grow this year's customer count by that rate, and that will be our forecast for next
year.
In [35]:
(1 + Year.ix[2012,'YR_PCT_Change']) * Year.ix[2012,'Max']

Out[35]:
1566.8465510881595

Present Data
Create individual Graphs per State.
In [36]:
# First Graph
ALL['Max'].plot(figsize=(10, 5));plt.title('ALL Markets')


# Last four Graphs


fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(20, 10))
fig.subplots_adjust(hspace=1.0) ## Create space between plots

Daily.loc['FL']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[0,0])
Daily.loc['GA']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[0,1])
Daily.loc['TX']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[1,0])
Daily.loc['NY']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[1,1])

# Add titles
axes[0,0].set_title('Florida')
axes[0,1].set_title('Georgia')
axes[1,0].set_title('Texas')
axes[1,1].set_title('North East');


Lesson 4
In this lesson we're going to go back to the basics. We will be working with a small data set so that
you can easily understand what I am trying to explain. We will be adding columns, deleting columns,
and slicing the data many different ways. Enjoy!
In [1]:
# Import libraries
import pandas as pd
import sys

In [2]:
print 'Python version ' + sys.version
print 'Pandas version: ' + pd.__version__
Python version 2.7.5 |Anaconda 2.1.0 (64-bit)| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]
Pandas version: 0.15.2

In [3]:
# Our small data set
d = [0,1,2,3,4,5,6,7,8,9]

# Create dataframe
df = pd.DataFrame(d)
df

Out[3]:
   0
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
In [4]:
# Let's change the name of the column


df.columns = ['Rev']
df

Out[4]:
   Rev
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
In [5]:
# Let's add a column
df['NewCol'] = 5
df

Out[5]:
   Rev  NewCol
0  0    5
1  1    5
2  2    5
3  3    5
4  4    5
5  5    5
6  6    5
7  7    5
8  8    5
9  9    5
In [6]:
# Let's modify our new column
df['NewCol'] = df['NewCol'] + 1
df

Out[6]:
   Rev  NewCol
0  0    6
1  1    6
2  2    6
3  3    6
4  4    6
5  5    6
6  6    6
7  7    6
8  8    6
9  9    6
In [7]:
# We can delete columns
del df['NewCol']
df

Out[7]:
   Rev
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
In [8]:
# Let's add a couple of columns
df['test'] = 3
df['col'] = df['Rev']
df

Out[8]:
   Rev  test  col
0  0    3     0
1  1    3     1
2  2    3     2
3  3    3     3
4  4    3     4
5  5    3     5
6  6    3     6
7  7    3     7
8  8    3     8
9  9    3     9
In [9]:


# If we wanted, we could change the name of the index


i = ['a','b','c','d','e','f','g','h','i','j']
df.index = i
df

Out[9]:
   Rev  test  col
a  0    3     0
b  1    3     1
c  2    3     2
d  3    3     3
e  4    3     4
f  5    3     5
g  6    3     6
h  7    3     7
i  8    3     8
j  9    3     9
We can now start to select pieces of the dataframe using loc.
In [10]:
df.loc['a']

Out[10]:
Rev     0
test    3
col     0
Name: a, dtype: int64

In [11]:
# df.loc[inclusive:inclusive]
df.loc['a':'d']

Out[11]:
   Rev  test  col
a  0    3     0
b  1    3     1
c  2    3     2
d  3    3     3
In [12]:
# df.iloc[inclusive:exclusive]
# Note: .iloc is strictly integer position based. It is available from version 0.11.0
# (http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#v0-11-0-april-22-2013)


df.iloc[0:3]

Out[12]:
   Rev  test  col
a  0    3     0
b  1    3     1
c  2    3     2
We can also select using the column name.
In [13]:
df['Rev']

Out[13]:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
i    8
j    9
Name: Rev, dtype: int64

In [14]:
df[['Rev', 'test']]

Out[14]:
   Rev  test
a  0    3
b  1    3
c  2    3
d  3    3
e  4    3
f  5    3
g  6    3
h  7    3
i  8    3
j  9    3
In [15]:
# df['ColumnName'][inclusive:exclusive]
df['Rev'][0:3]

Out[15]:
a    0
b    1
c    2
Name: Rev, dtype: int64

In [16]:
df['col'][5:]

Out[16]:
f    5
g    6
h    7
i    8
j    9
Name: col, dtype: int64

In [17]:
df[['col', 'test']][:3]

Out[17]:
   col  test
a  0    3
b  1    3
c  2    3
There are also some handy functions to select the top and bottom records of a dataframe.
In [18]:
# Select top N number of records (default = 5)
df.head()

Out[18]:
   Rev  test  col
a  0    3     0
b  1    3     1
c  2    3     2
d  3    3     3
e  4    3     4
In [19]:
# Select bottom N number of records (default = 5)
df.tail()

Out[19]:
   Rev  test  col
f  5    3     5
g  6    3     6
h  7    3     7
i  8    3     8
j  9    3     9

