
PANDAS

Lesson 1
Create Data - We begin by creating our own data set for analysis. This prevents the end user
reading this tutorial from having to download any files to replicate the results below. We will export
this data set to a text file so that you can get some experience pulling data from a text file.
Get Data - We will learn how to read in the text file. The data consists of baby names and the number
of babies born with that name in the year 1880.
Prepare Data - Here we will simply take a look at the data and make sure it is clean. By clean I
mean we will take a look inside the contents of the text file and look for any anomalies. These can
include missing data, inconsistencies in the data, or any other data that seems out of place. If any
are found we will then have to make decisions on what to do with these records.
Analyze Data - We will simply find the most popular name in a specific year.
Present Data - Through tabular data and a graph, clearly show the end user the most popular
name in a specific year.
The pandas library is used for all the data analysis excluding a small piece of the data presentation
section. The matplotlib library will only be needed for the data presentation section. Importing the
libraries is the first step we will take in the lesson.

# Import all libraries needed for the tutorial

# General syntax to import specific functions in a library:


##from (library) import (specific library function)
from pandas import DataFrame, read_csv

# General syntax to import a library but no functions:


##import (library) as (give the library a nickname/alias)
import matplotlib.pyplot as plt
import pandas as pd #this is how I usually import pandas
import sys #only needed to determine Python version number

# Enable inline plotting


%matplotlib inline

print 'Python version ' + sys.version

print 'Pandas version ' + pd.__version__


Python version 2.7.5 |Anaconda 2.1.0 (64-bit)| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]
Pandas version 0.15.2

Create Data
The data set will consist of 5 baby names and the number of births recorded for that year (1880).

# The initial set of baby names and birth counts


names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]

To merge these two lists together we will use the zip function.

zip?

BabyDataSet = zip(names,births)
BabyDataSet
[('Bob', 968), ('Jessica', 155), ('Mary', 77), ('John', 578), ('Mel', 973)]

We are basically done creating the data set. We now will use the pandas library to export this data
set into a csv file.
df will be a DataFrame object. You can think of this object as holding the contents of the BabyDataSet
in a format similar to a sql table or an excel spreadsheet. Let's take a look below at the contents
inside df.
df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])
df

   Names    Births
0  Bob      968
1  Jessica  155
2  Mary     77
3  John     578
4  Mel      973
Export the dataframe to a csv file. We can name the file births1880.csv. The function to_csv will be
used to export the file. The file will be saved in the same location of the notebook unless specified
otherwise.
In [7]:
df.to_csv?

The only parameters we will use are index and header. Setting these parameters to False will prevent
the index and header names from being exported. Change the values of these parameters to get a
better understanding of their use.
In [8]:
df.to_csv('births1880.csv',index=False,header=False)

Get Data
To pull in the csv file, we will use the pandas function read_csv. Let us take a look at this function
and what inputs it takes.
In [9]:
read_csv?

Even though this function has many parameters, we will simply pass it the location of the text file.
Location = C:\Users\ENTER_USER_NAME.xy\startups\births1880.csv
Note: Depending on where you save your notebooks, you may need to modify the location above.
In [10]:
Location = r'C:\Users\david\notebooks\pandas\births1880.csv'
df = pd.read_csv(Location)

Notice the r before the string. Since backslashes are special characters in Python strings, prefixing
the string with r makes it a raw string, so the backslashes are passed through literally instead of
being treated as escape sequences.
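As a quick illustration (a minimal sketch, separate from the lesson's data flow), compare the two forms:

# Without the r prefix, \n is read as a newline escape character
print('C:\new_folder')   # the \n breaks the path onto two lines
print(r'C:\new_folder')  # the raw string prints exactly as written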
In [11]:
df

Out[11]:
   Bob      968
0  Jessica  155
1  Mary     77
2  John     578
3  Mel      973
This brings us to our first problem of the exercise. The read_csv function treated the first record in
the csv file as the header names. This is obviously not correct since the text file did not provide us
with header names.
To correct this we will pass the header parameter to the read_csv function and set it to None
(which means null in Python).
In [12]:
df = pd.read_csv(Location, header=None)
df

Out[12]:
   0        1
0  Bob      968
1  Jessica  155
2  Mary     77
3  John     578
4  Mel      973
If we wanted to give the columns specific names, we would have to pass another parameter called
names. We can also omit the header parameter.
In [13]:
df = pd.read_csv(Location, names=['Names','Births'])
df

Out[13]:
   Names    Births
0  Bob      968
1  Jessica  155
2  Mary     77
3  John     578
4  Mel      973
You can think of the numbers [0,1,2,3,4] as the row numbers in an Excel file. In pandas these are
part of the index of the dataframe. You can think of the index as the primary key of a sql table, with
the exception that an index is allowed to have duplicates.
[Names, Births] can be thought of as column headers similar to the ones found in an Excel
spreadsheet or sql database.
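To see what "duplicates allowed" means in practice, here is a small hypothetical example (not part of the lesson's data flow):

# A dataframe whose index repeats the label 0
dup = pd.DataFrame({'Births': [968, 155, 77]}, index=[0, 0, 1])
dup.loc[0]  # returns BOTH rows labeled 0 - a sql primary key would forbid this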

Delete the csv file now that we are done using it.
In [14]:
import os
os.remove(Location)

Prepare Data
The data we have consists of baby names and the number of births in the year 1880. We already
know that we have 5 records and none of the records are missing (non-null values).
The Names column at this point is of no concern since it most likely is just composed of alphanumeric
strings (baby names). There is a chance of bad data in this column but we will not worry
about that at this point of the analysis. The Births column should just contain integers representing
the number of babies born in a specific year with a specific name. We can check if all the data is
of the data type integer. It would not make sense to have this column have a data type of float. I
would not worry about any possible outliers at this point of the analysis.
Realize that aside from the check we did on the "Names" column, briefly looking at the data inside
the dataframe should be as far as we need to go at this stage of the game. As we continue in the
data analysis life cycle we will have plenty of opportunities to find any issues with the data set.
In [15]:
# Check data type of the columns
df.dtypes

Out[15]:
Names     object
Births     int64
dtype: object

In [16]:
# Check data type of Births column
df.Births.dtype

Out[16]:
dtype('int64')

As you can see the Births column is of type int64; thus no floats (decimal numbers) or alphanumeric
characters will be present in this column.
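If you wanted to enforce this check programmatically rather than just eyeballing it, a simple assertion would do (a sketch; the pattern is generic and not part of the original lesson):

import numpy as np

# Stop the analysis early if the column is not integer-typed
assert df['Births'].dtype == np.int64, 'Births column should contain only integers'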

Analyze Data
To find the most popular name, or the baby name with the highest birth count, we can do one of the
following.

Sort the dataframe and select the top row


Use the max() attribute to find the maximum value
In [17]:

# Method 1:
Sorted = df.sort(['Births'], ascending=False)

Sorted.head(1)

Out[17]:
   Names  Births
4  Mel    973
In [18]:
# Method 2:
df['Births'].max()

Out[18]:
973

Present Data
Here we can plot the Births column and label the graph to show the end user the highest point on
the graph. In conjunction with the table, the end user has a clear picture that Mel is the most popular
baby name in the data set.
plot() is a convenient attribute where pandas lets you painlessly plot the data in your dataframe. We
learned how to find the maximum value of the Births column in the previous section. Finding the
actual baby name associated with the value 973 looks a bit tricky, so let's go over it.
Explain the pieces:
df['Names'] - This is the entire list of baby names, the entire Names column
df['Births'] - This is the entire list of Births in the year 1880, the entire Births column
df['Births'].max() - This is the maximum value found in the Births column
[df['Births'] == df['Births'].max()] IS EQUAL TO [Find all of the records in the Births column where it is
equal to 973]
df['Names'][df['Births'] == df['Births'].max()] IS EQUAL TO Select all of the records in the Names
column WHERE [The Births column is equal to 973]
An alternative way could have been to use the Sorted dataframe:
Sorted['Names'].head(1).values
The str() function simply converts an object into a string.
In [19]:
# Create graph
df['Births'].plot()

# Maximum value in the data set


MaxValue = df['Births'].max()

# Name associated with the maximum value


MaxName = df['Names'][df['Births'] == df['Births'].max()].values

# Text to display on graph


Text = str(MaxValue) + " - " + MaxName

# Add text to graph


plt.annotate(Text, xy=(1, MaxValue), xytext=(8, 0),
             xycoords=('axes fraction', 'data'), textcoords='offset points')

print "The most popular name"


df[df['Births'] == df['Births'].max()]
#Sorted.head(1) can also be used
The most popular name

Out[19]:
   Names  Births
4  Mel    973

Lesson 2
In [1]:
# The usual preamble
import pandas as pd
import matplotlib.pyplot as plt

# Make the graphs a bit prettier, and bigger
pd.set_option('display.mpl_style', 'default')
pd.set_option('display.line_width', 5000)
pd.set_option('display.max_columns', 60)

# The bare figsize() helper only exists under %pylab, so set the
# matplotlib default figure size directly instead
plt.rcParams['figure.figsize'] = (15, 5)

We're going to use a new dataset here, to demonstrate how to deal with larger datasets. This is a
subset of the 311 service requests from NYC Open Data.
In [2]:
complaints = pd.read_csv('../data/311-service-requests.csv')

2.1 What's even in it? (the summary)


When you look at a large dataframe, instead of showing you its full contents, pandas will show
you a summary. This includes all the columns, and how many non-null values there are in each
column.
In [3]:
complaints

Out[3]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 111069 entries, 0 to 111068
Data columns (total 52 columns):
Unique Key                        111069 non-null values
Created Date                      111069 non-null values
Closed Date                       60270 non-null values
Agency                            111069 non-null values
Agency Name                       111069 non-null values
Complaint Type                    111069 non-null values
Descriptor                        111068 non-null values
Location Type                     79048 non-null values
Incident Zip                      98813 non-null values
Incident Address                  84441 non-null values
Street Name                       84438 non-null values
Cross Street 1                    84728 non-null values
Cross Street 2                    84005 non-null values
Intersection Street 1             19364 non-null values
Intersection Street 2             19366 non-null values
Address Type                      102247 non-null values
City                              98860 non-null values
Landmark                          95 non-null values
Facility Type                     110938 non-null values
Status                            111069 non-null values
Due Date                          39239 non-null values
Resolution Action Updated Date    96507 non-null values
Community Board                   111069 non-null values
Borough                           111069 non-null values
X Coordinate (State Plane)        98143 non-null values
Y Coordinate (State Plane)        98143 non-null values
Park Facility Name                111069 non-null values
Park Borough                      111069 non-null values
School Name                       111069 non-null values
School Number                     111052 non-null values
School Region                     110524 non-null values
School Code                       110524 non-null values
School Phone Number               111069 non-null values
School Address                    111069 non-null values
School City                       111069 non-null values
School State                      111069 non-null values
School Zip                        111069 non-null values
School Not Found                  38984 non-null values
School or Citywide Complaint      0 non-null values
Vehicle Type                      99 non-null values
Taxi Company Borough              117 non-null values
Taxi Pick Up Location             1059 non-null values
Bridge Highway Name               185 non-null values
Bridge Highway Direction          185 non-null values
Road Ramp                         184 non-null values
Bridge Highway Segment            223 non-null values
Garage Lot Name                   49 non-null values
Ferry Direction                   37 non-null values
Ferry Terminal Name               336 non-null values
Latitude                          98143 non-null values
Longitude                         98143 non-null values
Location                          98143 non-null values
dtypes: float64(5), int64(1), object(46)

2.2 Selecting columns and rows


To select a column, we index with the name of the column, like this:
In [4]:
complaints['Complaint Type']

Out[4]:
0          Noise - Street/Sidewalk
1                  Illegal Parking
2               Noise - Commercial
3                  Noise - Vehicle
4                           Rodent
5               Noise - Commercial
6                 Blocked Driveway
7               Noise - Commercial
8               Noise - Commercial
9               Noise - Commercial
10        Noise - House of Worship
11              Noise - Commercial
12                 Illegal Parking
13                 Noise - Vehicle
14                          Rodent
...
111054     Noise - Street/Sidewalk
111055          Noise - Commercial
111056       Street Sign - Missing
111057                       Noise
111058          Noise - Commercial
111059     Noise - Street/Sidewalk
111060                       Noise
111061          Noise - Commercial
111062                Water System
111063                Water System
111064     Maintenance or Facility
111065             Illegal Parking
111066     Noise - Street/Sidewalk
111067          Noise - Commercial
111068            Blocked Driveway
Name: Complaint Type, Length: 111069, dtype: object
To get the first 5 rows of a dataframe, we can use a slice: df[:5].

This is a great way to get a sense for what kind of information is in the dataframe -- take a minute to
look at the contents and get a feel for this dataset.
In [5]:
complaints[:5]

Out[5]:
[Wide table output omitted: the first 5 rows of the dataframe across all 52 columns did not survive
extraction. The rows are the same five complaints seen above (Noise - Street/Sidewalk, Illegal
Parking, Noise - Commercial, Noise - Vehicle, Rodent) together with their dates, agencies,
addresses, boroughs, and coordinates.]
We can combine these to get the first 5 rows of a column:
In [6]:
complaints['Complaint Type'][:5]

Out[6]:
0    Noise - Street/Sidewalk
1            Illegal Parking
2         Noise - Commercial
3            Noise - Vehicle
4                     Rodent
Name: Complaint Type, dtype: object

and it doesn't matter which direction we do it in:


In [7]:


complaints[:5]['Complaint Type']

Out[7]:
0    Noise - Street/Sidewalk
1            Illegal Parking
2         Noise - Commercial
3            Noise - Vehicle
4                     Rodent
Name: Complaint Type, dtype: object

2.3 Selecting multiple columns


What if we just want to know the complaint type and the borough, but not the rest of the information?
Pandas makes it really easy to select a subset of the columns: just index with a list of the columns
you want.
In [8]:
complaints[['Complaint Type', 'Borough']]

Out[8]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 111069 entries, 0 to 111068
Data columns (total 2 columns):
Complaint Type    111069 non-null values
Borough           111069 non-null values
dtypes: object(2)

That showed us a summary, and then we can look at the first 10 rows:
In [9]:
complaints[['Complaint Type', 'Borough']][:10]

Out[9]:
   Complaint Type           Borough
0  Noise - Street/Sidewalk  QUEENS
1  Illegal Parking          QUEENS
2  Noise - Commercial       MANHATTAN
3  Noise - Vehicle          MANHATTAN
4  Rodent                   MANHATTAN
5  Noise - Commercial       QUEENS
6  Blocked Driveway         QUEENS
7  Noise - Commercial       QUEENS
8  Noise - Commercial       MANHATTAN
9  Noise - Commercial       BROOKLYN


2.4 What's the most common complaint type?

This is a really easy question to answer! There's a .value_counts() method that we can use:
In [10]:
complaints['Complaint Type'].value_counts()

Out[10]:
HEATING                     14200
GENERAL CONSTRUCTION         7471
Street Light Condition       7117
DOF Literature Request       5797
PLUMBING                     5373
PAINT - PLASTER              5149
Blocked Driveway             4590
NONCONST                     3998
Street Condition             3473
Illegal Parking              3343
Noise                        3321
Traffic Signal Condition     3145
Dirty Conditions             2653
Water System                 2636
Noise - Commercial           2578
...
Opinion for the Mayor                2
Window Guard                         2
DFTA Literature Request              2
Legal Services Provider Complaint    2
Open Flame Permit                    1
Snow                                 1
Municipal Parking Facility           1
X-Ray Machine/Equipment              1
Stalled Sites                        1
DHS Income Savings Requirement       1
Tunnel Condition                     1
Highway Sign - Damaged               1
Ferry Permit                         1
Trans Fat                            1
DWD                                  1
Length: 165, dtype: int64

If we just wanted the top 10 most common complaints, we can do this:


In [11]:
complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10]

Out[11]:
HEATING                   14200
GENERAL CONSTRUCTION       7471
Street Light Condition     7117
DOF Literature Request     5797
PLUMBING                   5373
PAINT - PLASTER            5149
Blocked Driveway           4590
NONCONST                   3998
Street Condition           3473
Illegal Parking            3343
dtype: int64

But it gets better! We can plot them!


In [12]:
complaint_counts[:10].plot(kind='bar')

Out[12]:
<matplotlib.axes.AxesSubplot at 0x7ba2290>


Lesson 3
Get Data - Our data set will consist of an Excel file containing customer counts per date. We will
learn how to read in the excel file for processing.
Prepare Data - The data is an irregular time series having duplicate dates. We will be challenged in
compressing the data and coming up with next year's forecasted customer count.
Analyze Data - We use graphs to visualize trends and spot outliers. Some built in computational
tools will be used to calculate next year's forecasted customer count.
Present Data - The results will be plotted.
NOTE: Make sure you have looked through all previous lessons, as the knowledge learned in
previous lessons will be needed for this exercise.
In [1]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy.random as np  # note: np here is an alias for numpy.random, not numpy
import sys

%matplotlib inline

In [2]:
print 'Python version ' + sys.version
print 'Pandas version: ' + pd.__version__
Python version 2.7.5 |Anaconda 2.1.0 (64-bit)| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]
Pandas version: 0.15.2

We will be creating our own test data for analysis.


In [3]:
# set seed
np.seed(111)

# Function to generate test data


def CreateDataSet(Number=1):

    Output = []

    for i in range(Number):

        # Create a weekly (mondays) date range
        rng = pd.date_range(start='1/1/2009', end='12/31/2012', freq='W-MON')

        # Create random data
        data = np.randint(low=25, high=1000, size=len(rng))

        # Status pool
        status = [1, 2, 3]

        # Make a random list of statuses
        random_status = [status[np.randint(low=0, high=len(status))] for i in range(len(rng))]

        # State pool
        states = ['GA', 'FL', 'fl', 'NY', 'NJ', 'TX']

        # Make a random list of states
        random_states = [states[np.randint(low=0, high=len(states))] for i in range(len(rng))]

        Output.extend(zip(random_states, random_status, data, rng))

    return Output

Now that we have a function to generate our test data, let's create some data and stick it into a
dataframe.
In [4]:
dataset = CreateDataSet(4)
df = pd.DataFrame(data=dataset, columns=['State','Status','CustomerCount','StatusDate'])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 836 entries, 0 to 835
Data columns (total 4 columns):
State            836 non-null object
Status           836 non-null int64
CustomerCount    836 non-null int64
StatusDate       836 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 32.7+ KB

In [5]:
df.head()

Out[5]:
   State  Status  CustomerCount  StatusDate
0  GA     1       877            2009-01-05
1  FL     1       901            2009-01-12
2  fl     3       749            2009-01-19
3  FL     3       111            2009-01-26
4  GA     1       300            2009-02-02
We are now going to save this dataframe into an Excel file, to then bring it back to a dataframe. We
simply do this to show you how to read and write to Excel files.
We do not write the index values of the dataframe to the Excel file, since they are not meant to be
part of our initial test data set.
In [6]:
# Save results to excel
df.to_excel('Lesson3.xlsx', index=False)
print 'Done'
Done

Grab Data from Excel


We will be using the read_excel function to read in data from an Excel file. The function allows you
to read in specific tabs by name or location.
In [7]:
pd.read_excel?


Note: The location of the Excel file will be in the same folder as the notebook, unless
specified otherwise.
In [8]:
# Location of file
Location = r'C:\Users\david\notebooks\pandas\Lesson3.xlsx'

# Parse a specific sheet


df = pd.read_excel(Location, 0, index_col='StatusDate')
df.dtypes

Out[8]:
State            object
Status            int64
CustomerCount     int64
dtype: object

In [9]:
df.index

Out[9]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2009-01-05, ..., 2012-12-31]
Length: 836, Freq: None, Timezone: None

In [10]:
df.head()

Out[10]:
            State  Status  CustomerCount
StatusDate
2009-01-05  GA     1       877
2009-01-12  FL     1       901
2009-01-19  fl     3       749
2009-01-26  FL     3       111
2009-02-02  GA     1       300

Prepare Data
This section attempts to clean up the data for analysis.
1. Make sure the state column is all in upper case
2. Only select records where the account status is equal to "1"
3. Merge (NJ and NY) to NY in the state column
4. Remove any outliers (any odd results in the data set)

Let's take a quick look at how some of the State values are upper case and some are lower case.
In [11]:
df['State'].unique()

Out[11]:
array([u'GA', u'FL', u'fl', u'TX', u'NY', u'NJ'], dtype=object)

To convert all the State values to upper case we will use the upper() function and the dataframe's
apply attribute. The lambda function will simply apply the upper function to each value in the State
column.
In [12]:
# Clean State Column, convert to upper case
df['State'] = df.State.apply(lambda x: x.upper())

In [13]:
df['State'].unique()

Out[13]:
array([u'GA', u'FL', u'TX', u'NY', u'NJ'], dtype=object)

In [14]:
# Only grab where Status == 1
mask = df['Status'] == 1
df = df[mask]

To turn the NJ states to NY we simply...


[df.State == 'NJ'] - Find all records in the State column where they are equal to NJ.
df.State[df.State == 'NJ'] = 'NY' - For all records in the State column where they are equal to NJ,
replace them with NY.
In [15]:
# Convert NJ to NY
mask = df.State == 'NJ'
df['State'][mask] = 'NY'

Now we can see we have a much cleaner data set to work with.
In [16]:


df['State'].unique()

Out[16]:
array([u'GA', u'FL', u'NY', u'TX'], dtype=object)

At this point we may want to graph the data to check for any outliers or inconsistencies in the data.
We will be using the plot() attribute of the dataframe.
As you can see from the graph below it is not very conclusive and is probably a sign that we need to
perform some more data preparation.
In [17]:
df['CustomerCount'].plot(figsize=(15,5));

If we take a look at the data, we begin to realize that there are multiple values for the same State,
StatusDate, and Status combination. It is possible that this means the data you are working with is
dirty/bad/inaccurate, but we will assume otherwise. We can assume this data set is a subset of a
bigger data set and if we simply add the values in the CustomerCount column per State,
StatusDate, and Status we will get the Total Customer Count per day.
In [18]:
sortdf = df[df['State']=='NY'].sort(axis=0)
sortdf.head(10)

Out[18]:
            State  Status  CustomerCount
StatusDate
2009-01-19  NY     1       522
2009-02-23  NY     1       710
2009-03-09  NY     1       992
2009-03-16  NY     1       355
2009-03-23  NY     1       728
2009-03-30  NY     1       863
2009-04-13  NY     1       520
2009-04-20  NY     1       820
2009-04-20  NY     1       937
2009-04-27  NY     1       447
Our task is now to create a new dataframe that compresses the data so we have daily customer
counts per State and StatusDate. We can ignore the Status column since all the values in this
column are equal to 1. To accomplish this we will use the dataframe's groupby and sum() functions.
Note that we had to use reset_index. If we did not, we would not have been able to group by both
the State and the StatusDate, since the groupby function expects only columns as inputs. The
reset_index function will bring the index StatusDate back to a column in the dataframe.
In [19]:
# Group by State and StatusDate


Daily = df.reset_index().groupby(['State','StatusDate']).sum()
Daily.head()

Out[19]:
                  Status  CustomerCount
State StatusDate
FL    2009-01-12  1       901
      2009-02-02  1       653
      2009-03-23  1       752
      2009-04-06  2       1086
      2009-06-08  1       649
The State and StatusDate columns are automatically placed in the index of the Daily dataframe.
You can think of the index as the primary key of a database table but without the constraint of
having unique values. Columns in the index as you will see allow us to easily select, plot, and
perform calculations on the data.
Below we delete the Status column since it is all equal to one and no longer necessary.
In [20]:
del Daily['Status']
Daily.head()

Out[20]:
CustomerCount
State StatusDate
FL 2009-01-12 901
2009-02-02 653
2009-03-23 752
2009-04-06 1086
2009-06-08 649
In [21]:
# What is the index of the dataframe
Daily.index

Out[21]:
MultiIndex(levels=[[u'FL', u'GA', u'NY', u'TX'], [2009-01-05 00:00:00, 2009-0
1-12 00:00:00, 2009-01-19 00:00:00, 2009-02-02 00:00:00, 2009-02-23 00:00:00,
2009-03-09 00:00:00, 2009-03-16 00:00:00, 2009-03-23 00:00:00, 2009-03-30 00:
00:00, 2009-04-06 00:00:00, 2009-04-13 00:00:00, 2009-04-20 00:00:00, 2009-04
-27 00:00:00, 2009-05-04 00:00:00, 2009-05-11 00:00:00, 2009-05-18 00:00:00,
2009-05-25 00:00:00, 2009-06-08 00:00:00, 2009-06-22 00:00:00, 2009-07-06 00:
00:00, 2009-07-13 00:00:00, 2009-07-20 00:00:00, 2009-07-27 00:00:00, 2009-08
-10 00:00:00, 2009-08-17 00:00:00, 2009-08-24 00:00:00, 2009-08-31 00:00:00,
2009-09-07 00:00:00, 2009-09-14 00:00:00, 2009-09-21 00:00:00, 2009-09-28 00:
00:00, 2009-10-05 00:00:00, 2009-10-12 00:00:00, 2009-10-19 00:00:00, 2009-10
-26 00:00:00, 2009-11-02 00:00:00, 2009-11-23 00:00:00, 2009-11-30 00:00:00,
2009-12-07 00:00:00, 2009-12-14 00:00:00, 2010-01-04 00:00:00, 2010-01-11 00:
00:00, 2010-01-18 00:00:00, 2010-01-25 00:00:00, 2010-02-08 00:00:00, 2010-02
-15 00:00:00, 2010-02-22 00:00:00, 2010-03-01 00:00:00, 2010-03-08 00:00:00,
2010-03-15 00:00:00, 2010-04-05 00:00:00, 2010-04-12 00:00:00, 2010-04-26 00:
00:00, 2010-05-03 00:00:00, 2010-05-10 00:00:00, 2010-05-17 00:00:00, 2010-05
-24 00:00:00, 2010-05-31 00:00:00, 2010-06-14 00:00:00, 2010-06-28 00:00:00,
2010-07-05 00:00:00, 2010-07-19 00:00:00, 2010-07-26 00:00:00, 2010-08-02 00:
00:00, 2010-08-09 00:00:00, 2010-08-16 00:00:00, 2010-08-30 00:00:00, 2010-09
-06 00:00:00, 2010-09-13 00:00:00, 2010-09-20 00:00:00, 2010-09-27 00:00:00,
2010-10-04 00:00:00, 2010-10-11 00:00:00, 2010-10-18 00:00:00, 2010-10-25 00:
00:00, 2010-11-01 00:00:00, 2010-11-08 00:00:00, 2010-11-15 00:00:00, 2010-11
-29 00:00:00, 2010-12-20 00:00:00, 2011-01-03 00:00:00, 2011-01-10 00:00:00,
2011-01-17 00:00:00, 2011-02-07 00:00:00, 2011-02-14 00:00:00, 2011-02-21 00:
00:00, 2011-02-28 00:00:00, 2011-03-07 00:00:00, 2011-03-14 00:00:00, 2011-03
-21 00:00:00, 2011-03-28 00:00:00, 2011-04-04 00:00:00, 2011-04-18 00:00:00,
2011-04-25 00:00:00, 2011-05-02 00:00:00, 2011-05-09 00:00:00, 2011-05-16 00:
00:00, 2011-05-23 00:00:00, 2011-05-30 00:00:00, 2011-06-06 00:00:00, ...]],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, ...], [1, 3, 7, 9, 17, 19, 20, 21, 23, 25, 27, 28, 29, 30, 31, 35, 3
8, 40, 41, 44, 45, 46, 47, 48, 49, 52, 54, 56, 57, 59, 60, 62, 66, 68, 69, 70
, 71, 72, 75, 76, 77, 78, 79, 85, 88, 89, 92, 96, 97, 99, 100, 101, 103, 104,
105, 108, 109, 110, 112, 114, 115, 117, 118, 119, 125, 126, 127, 128, 129, 13
1, 133, 134, 135, 136, 137, 140, 146, 150, 151, 152, 153, 157, 0, 3, 7, 22, 2
3, 24, 27, 28, 34, 37, 42, 47, 50, 55, 58, 66, 67, 69, ...]],
names=[u'State', u'StatusDate'])

In [22]:
# Select the State index
Daily.index.levels[0]

Out[22]:
Index([u'FL', u'GA', u'NY', u'TX'], dtype='object')

In [23]:
# Select the StatusDate index
Daily.index.levels[1]

Out[23]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2009-01-05, ..., 2012-12-10]
Length: 161, Freq: None, Timezone: None

Let's now plot the data per State.

As you can see, by breaking the graph up by the State column we get a much clearer picture of
what the data looks like. Can you spot any outliers?
In [24]:
Daily.loc['FL'].plot()
Daily.loc['GA'].plot()
Daily.loc['NY'].plot()
Daily.loc['TX'].plot();

We can also just plot the data for a specific period, like 2012. We can now clearly see that the data for
these states is all over the place. Since the data consists of weekly customer counts, the variability of
the data seems suspect. For this tutorial we will assume bad data and proceed.
In [25]:
Daily.loc['FL']['2012':].plot()
Daily.loc['GA']['2012':].plot()
Daily.loc['NY']['2012':].plot()
Daily.loc['TX']['2012':].plot();

We will assume that per month the customer count should remain relatively steady. Any data outside
a specific range in that month will be removed from the data set. The final result should have smooth
graphs with no spikes.
StateYearMonth - Here we group by State, Year of StatusDate, and Month of StatusDate.
Daily['Outlier'] - A boolean (True or False) value letting us know if the value in the CustomerCount
column is outside the acceptable range.
We will be using the attribute transform instead of apply. The reason is that transform will keep the
shape (number of rows and columns) of the dataframe the same and apply will not. By looking at the
previous graphs, we can see that they do not resemble a Gaussian distribution. This means we
cannot use summary statistics like the mean and standard deviation; we use percentiles instead.
Note that we run the risk of eliminating good data.
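To make the transform/apply difference concrete, here is a tiny sketch with hypothetical data (not our customer counts):

# transform returns one value per original row, aligned to the index;
# apply collapses each group down to a single row
demo = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 10]})
demo.groupby('key')['val'].transform(lambda x: x.max())  # length 3: 2, 2, 10
demo.groupby('key')['val'].apply(lambda x: x.max())      # length 2: a -> 2, b -> 10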
In [26]:


# Calculate Outliers
StateYearMonth = Daily.groupby([Daily.index.get_level_values(0), Daily.index.get_level_values(1).year, Daily.index.get_level_values(1).month])
Daily['Lower'] = StateYearMonth['CustomerCount'].transform( lambda x: x.quantile(q=.25) - (1.5*x.quantile(q=.75)-x.quantile(q=.25)) )
Daily['Upper'] = StateYearMonth['CustomerCount'].transform( lambda x: x.quantile(q=.75) + (1.5*x.quantile(q=.75)-x.quantile(q=.25)) )
Daily['Outlier'] = (Daily['CustomerCount'] < Daily['Lower']) | (Daily['CustomerCount'] > Daily['Upper'])

# Remove Outliers
Daily = Daily[Daily['Outlier'] == False]

The dataframe named Daily will hold customer counts that have been aggregated per day. The
original data (df) has multiple records per day. We are left with a data set that is indexed by both the
state and the StatusDate. The Outlier column should be equal to False signifying that the record is
not an outlier.
In [27]:
Daily.head()

Out[27]:
                  CustomerCount  Lower  Upper   Outlier
State StatusDate
FL    2009-01-12  901            450.5  1351.5  False
      2009-02-02  653            326.5  979.5   False
      2009-03-23  752            376.0  1128.0  False
      2009-04-06  1086           543.0  1629.0  False
      2009-06-08  649            324.5  973.5   False
We create a separate dataframe named ALL which groups the Daily dataframe by StatusDate. We
are essentially getting rid of the State column. The Max column represents the maximum customer
count per month. The Max column is used to smooth out the graph.
In [28]:
# Combine all markets

# Sum the customer counts over all states for each date


ALL = pd.DataFrame(Daily['CustomerCount'].groupby(Daily.index.get_level_values(1)).sum())
ALL.columns = ['CustomerCount'] # rename column

# Group by Year and Month


YearMonth = ALL.groupby([lambda x: x.year, lambda x: x.month])

# What is the max customer count per Year and Month


ALL['Max'] = YearMonth['CustomerCount'].transform(lambda x: x.max())
ALL.head()

Out[28]:
            CustomerCount  Max
StatusDate
2009-01-05  877            901
2009-01-12  901            901
2009-01-19  522            901
2009-02-02  953            953
2009-02-23  710            953
As you can see from the ALL dataframe above, in the month of January 2009 the maximum
customer count was 901. If we had used apply, we would have gotten a dataframe with (Year and
Month) as the index and just the Max column with the value of 901, as sketched below.
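For comparison, this is roughly what the apply version would look like (a sketch of the alternative described above, not used in the lesson):

# apply collapses each (Year, Month) group to a single row,
# instead of keeping one row per StatusDate like transform does
YearMonth['CustomerCount'].apply(lambda x: x.max())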
There is also an interest in gauging whether the current customer counts are reaching certain goals
the company has established. The task here is to visually show whether the current customer counts
are meeting the goals listed below. We will call the goals BHAG (Big Hairy Annual Goal).

12/31/2011 - 1,000 customers


12/31/2012 - 2,000 customers
12/31/2013 - 3,000 customers

We will be using the date_range function to create our dates.


Definition: date_range(start=None, end=None, periods=None, freq='D', tz=None, normalize=False,
name=None, closed=None)
Docstring: Return a fixed frequency datetime index, with day (calendar) as the default frequency
By choosing the frequency to be A or annual we will be able to get the three target dates from
above.
In [29]:
date_range?
Object `date_range` not found.

In [30]:
# Create the BHAG dataframe


data = [1000,2000,3000]
idx = pd.date_range(start='12/31/2011', end='12/31/2013', freq='A')
BHAG = pd.DataFrame(data, index=idx, columns=['BHAG'])
BHAG

Out[30]:
BHAG
2011-12-31 1000
2012-12-31 2000
2013-12-31 3000
Combining dataframes, as we have learned in previous lessons, is made simple using the concat
function. Remember, when we choose axis=0 we are appending row-wise.
In [31]:
# Combine the BHAG and the ALL data set
combined = pd.concat([ALL,BHAG], axis=0)
combined = combined.sort(axis=0)
combined.tail()

Out[31]:
            BHAG  CustomerCount  Max
2012-11-19  NaN   136            1115
2012-11-26  NaN   1115           1115
2012-12-10  NaN   1269           1269
2012-12-31  2000  NaN            NaN
2013-12-31  3000  NaN            NaN
In [32]:
fig, axes = plt.subplots(figsize=(12, 7))

combined['BHAG'].fillna(method='pad').plot(color='green', label='BHAG')
combined['Max'].plot(color='blue', label='All Markets')
plt.legend(loc='best');

There was also a need to forecast next year's customer count, and we can do this in a couple of
simple steps. We will first group the combined dataframe by Year and take the maximum customer
count for that year. This will give us one row per Year.
In [33]:


# Group by Year and then get the max value per year
Year = combined.groupby(lambda x: x.year).max()
Year

Out[33]:
      BHAG  CustomerCount  Max
2009  NaN   2452           2452
2010  NaN   2065           2065
2011  1000  2711           2711
2012  2000  2061           2061
2013  3000  NaN            NaN
In [34]:
# Add a column representing the percent change per year
Year['YR_PCT_Change'] = Year['Max'].pct_change(periods=1)
Year

Out[34]:
      BHAG  CustomerCount  Max   YR_PCT_Change
2009  NaN   2452           2452  NaN
2010  NaN   2065           2065  -0.157830
2011  1000  2711           2711  0.312833
2012  2000  2061           2061  -0.239764
2013  3000  NaN            NaN   NaN
To get next year's ending customer count we will assume our current growth rate remains constant.
We then grow this year's customer count by that rate, and that will be our forecast for next
year.
In [35]:
(1 + Year.ix[2012,'YR_PCT_Change']) * Year.ix[2012,'Max']

Out[35]:
1566.8465510881595

Present Data
Create individual Graphs per State.
In [36]:
# First Graph
ALL['Max'].plot(figsize=(10, 5));plt.title('ALL Markets')


# Last four Graphs


fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(20, 10))
fig.subplots_adjust(hspace=1.0) ## Create space between plots

Daily.loc['FL']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[0,0])
Daily.loc['GA']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[0,1])
Daily.loc['TX']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[1,0])
Daily.loc['NY']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[1,1])

# Add titles
axes[0,0].set_title('Florida')
axes[0,1].set_title('Georgia')
axes[1,0].set_title('Texas')
axes[1,1].set_title('North East');


Lesson 4
In this lesson we're going to go back to the basics. We will be working with a small data set so that
you can easily understand what I am trying to explain. We will be adding columns, deleting columns,
and slicing the data many different ways. Enjoy!
In [1]:
# Import libraries
import pandas as pd
import sys

In [2]:
print 'Python version ' + sys.version
print 'Pandas version: ' + pd.__version__
Python version 2.7.5 |Anaconda 2.1.0 (64-bit)| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]
Pandas version: 0.15.2

In [3]:
# Our small data set
d = [0,1,2,3,4,5,6,7,8,9]

# Create dataframe
df = pd.DataFrame(d)
df

Out[3]:
   0
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
In [4]:
# Let's change the name of the column


df.columns = ['Rev']
df

Out[4]:
   Rev
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
In [5]:
# Let's add a column
df['NewCol'] = 5
df

Out[5]:
   Rev  NewCol
0  0    5
1  1    5
2  2    5
3  3    5
4  4    5
5  5    5
6  6    5
7  7    5
8  8    5
9  9    5
In [6]:
# Let's modify our new column
df['NewCol'] = df['NewCol'] + 1
df

Out[6]:
   Rev  NewCol
0  0    6
1  1    6
2  2    6
3  3    6
4  4    6
5  5    6
6  6    6
7  7    6
8  8    6
9  9    6
In [7]:
# We can delete columns
del df['NewCol']
df

Out[7]:
   Rev
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
In [8]:
# Let's add a couple of columns
df['test'] = 3
df['col'] = df['Rev']
df

Out[8]:
   Rev  test  col
0  0    3     0
1  1    3     1
2  2    3     2
3  3    3     3
4  4    3     4
5  5    3     5
6  6    3     6
7  7    3     7
8  8    3     8
9  9    3     9
In [9]:


# If we wanted, we could change the name of the index


i = ['a','b','c','d','e','f','g','h','i','j']
df.index = i
df

Out[9]:
   Rev  test  col
a  0    3     0
b  1    3     1
c  2    3     2
d  3    3     3
e  4    3     4
f  5    3     5
g  6    3     6
h  7    3     7
i  8    3     8
j  9    3     9
We can now start to select pieces of the dataframe using loc.
In [10]:
df.loc['a']

Out[10]:
Rev     0
test    3
col     0
Name: a, dtype: int64

In [11]:
# df.loc[inclusive:inclusive]
df.loc['a':'d']

Out[11]:
   Rev  test  col
a  0    3     0
b  1    3     1
c  2    3     2
d  3    3     3
In [12]:
# df.iloc[inclusive:exclusive]
# Note: .iloc is strictly integer position based. It is available from version 0.11.0
# (http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#v0-11-0-april-22-2013)


df.iloc[0:3]

Out[12]:
   Rev  test  col
a  0    3     0
b  1    3     1
c  2    3     2
We can also select using the column name.
In [13]:
df['Rev']

Out[13]:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
i    8
j    9
Name: Rev, dtype: int64

In [14]:
df[['Rev', 'test']]

Out[14]:
   Rev  test
a  0    3
b  1    3
c  2    3
d  3    3
e  4    3
f  5    3
g  6    3
h  7    3
i  8    3
j  9    3
In [15]:
# df['ColumnName'][inclusive:exclusive]
df['Rev'][0:3]

Out[15]:
a    0
b    1
c    2
Name: Rev, dtype: int64

In [16]:
df['col'][5:]

Out[16]:
f    5
g    6
h    7
i    8
j    9
Name: col, dtype: int64

In [17]:
df[['col', 'test']][:3]

Out[17]:
   col  test
a  0    3
b  1    3
c  2    3
There are also some handy functions to select the top and bottom records of a dataframe.
In [18]:
# Select top N number of records (default = 5)
df.head()

Out[18]:
   Rev  test  col
a  0    3     0
b  1    3     1
c  2    3     2
d  3    3     3
e  4    3     4
In [19]:
# Select bottom N number of records (default = 5)
df.tail()

Out[19]:
   Rev  test  col
f  5    3     5
g  6    3     6
h  7    3     7
i  8    3     8
j  9    3     9

