
Bigger data analysis
http://bit.ly/bigrdata3

Hadley Wickham
@hadleywickham
Chief Scientist, RStudio

July 2013

Thursday, July 18, 13


1. What is data analysis?
2. Transforming data
3. Visualising data



What is data analysis?





Data analysis is the process
by which data becomes
understanding, knowledge
and insight



Visualise

Tidy
Transform

Model

Frequent data analysis → learn to program

http://www.flickr.com/photos/compleo/5414489782
http://www.flickr.com/photos/mutsmuts/4695658106

Cognition time vs. computation time


Visualise — ggplot2

Tidy — reshape2
Transform — plyr, stringr, lubridate

Model

Computation time vs. cognition time


Visualise — bigvis

Tidy
Transform — dplyr

Model

Data

Every commercial US flight 2000–2011:

~76 million flights
Total database: ~11 Gb

>100 variables, but I'll focus on a handful:
airline, delay, distance, flight time and speed.


Transformation



Split → Apply → Combine

Input:            Output (sum of n per name):

name  n           name  total
Al    2           Al    2
Bo    4           Bo    9
Bo    0           Ed    15
Bo    5
Ed    5
Ed    10
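In plyr the whole split–apply–combine diagram above collapses to one call; a sketch using the toy data from the slide:

```r
library(plyr)

df <- data.frame(
  name = c("Al", "Bo", "Bo", "Bo", "Ed", "Ed"),
  n    = c(2, 4, 0, 5, 5, 10)
)

# Split by name, apply sum() to each piece, combine into one data frame
ddply(df, "name", summarise, total = sum(n))
#   name total
# 1   Al     2
# 2   Bo     9
# 3   Ed    15
```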


Output:          array   data frame   list    nothing
Input:
array            aaply   adply        alply   a_ply
data frame       daply   ddply        dlply   d_ply
list             laply   ldply        llply   l_ply
n replicates     raply   rdply        rlply   r_ply
function args    maply   mdply        mlply   m_ply




[Bar chart: self-reported usage (never / occasionally / often / all the time)
for each plyr function; ddply, ldply, dlply and llply are used most,
the array functions (aaply, alply, a_ply) least. x-axis: count, 0–150.]
Data analysis verbs
select: subset variables
filter: subset rows
mutate: add new columns
summarise: reduce to a single row
arrange: re-order the rows



Data analysis verbs + group by

The same five verbs, applied per group once group_by() sets the grouping.
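A sketch of the five verbs plus group_by on a toy data frame (dplyr's interface has evolved in places since this 2013 talk; the variable names here are illustrative):

```r
library(dplyr)

flights <- data.frame(
  UniqueCarrier = c("AA", "AA", "UA"),
  Distance      = c(1000, 1200, 2500),
  ArrDelay      = c(10, -5, 30)
)

select(flights, UniqueCarrier, ArrDelay)      # subset variables
filter(flights, ArrDelay > 0)                 # subset rows
mutate(flights, LongHaul = Distance > 2000)   # add new columns
arrange(flights, ArrDelay)                    # re-order the rows

# group_by makes summarise reduce each group to a single row
by_carrier <- group_by(flights, UniqueCarrier)
summarise(by_carrier, mean_delay = mean(ArrDelay))
```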


h <- readRDS("houston.rdata")
# ~2,100,000 x 6, ~57 meg; not huge, but substantial

library(plyr)
ddply(h, c("Year", "Month", "DayofMonth"),
      summarise, n = length(Year))
#   user  system elapsed
#  2.320   0.330   2.649

count(h, c("Year", "Month", "DayofMonth"))
#   user  system elapsed
#  0.687   0.183   0.869


# Often work with the same grouping variables
# multiple times, so define upfront. Also refer
# to variables in the same way.
daily_df <- group_by(h, Year, Month, DayofMonth)

# Now summarise knows how to deal with grouped
# data frames
summarise(daily_df, n())
#   user  system elapsed
#  0.095   0.015   0.110

# 20x faster!


library(data.table)
h_dt <- data.table(h)

daily_dt <- group_by(h_dt, Year, Month, DayofMonth)
summarise(daily_dt, n())
#   user  system elapsed
#  0.045   0.000   0.045

# Exactly the same syntax, but 2.5x faster!
# Don't need to learn the idiosyncrasies of
# data.table; just 2 lines of code


# And dplyr also works seamlessly with databases:
ontime <- source_sqlite("flights.sqlite3", "ontime")

h_db <- filter(ontime, Origin == "IAH")
daily_db <- group_by(h_db, Year, Month, DayofMonth)
summarise(daily_db, n())
#   user  system elapsed
# 22.190   0.546  22.734
#   user  system elapsed
#  5.565   0.425   5.986

# Much slower, but not restricted to a predefined subset.
# Could speed up by carefully crafting indices.


# Behind the scenes
library(dplyr)
ontime <- source_sqlite("../flights.sqlite3", "ontime")

translate_sql(Year > 2005, ontime)
# <SQL> Year > 2005.0
translate_sql(Year > 2005L, ontime)
# <SQL> Year > 2005

translate_sql(Origin == "IAD" || Dest == "IAD", ontime)
# <SQL> Origin = 'IAD' OR Dest = 'IAD'

years <- 2000:2005
translate_sql(Year %in% years, ontime)
# <SQL> Year IN (2000, 2001, 2002, 2003, 2004, 2005)


Data sources
Data frames (dplyr)
Data tables (dplyr)
SQLite tables (dplyr)
PostgreSQL, MySQL, SQL Server, ...
MonetDB (planned)
Google BigQuery (bigrquery)


daily_df <- group_by(h, Year, Month, DayofMonth)
summarise(daily_df, n())

daily_dt <- group_by(h_dt, Year, Month, DayofMonth)
summarise(daily_dt, n())

daily_db <- group_by(h_db, Year, Month, DayofMonth)
summarise(daily_db, n())

# It doesn't matter how your data is stored


# It might even live on the web
library(dplyr)
library(bigrquery)

h_bq <- source_bigquery(billing_project, "ontime", "houston")

daily_bq <- group_by(h_bq, Year, Month, DayofMonth)
system.time(summarise(daily_bq, n()))
# ~2 seconds

# Storage = $80 / TB / month
# Query = $35 / TB (100 GB free)


dplyr
Currently experimental and incomplete,
but it works, and you're welcome to try it out.

library(devtools)
install_github("assertthat")
install_github("dplyr")
install_github("bigrquery")

Needs a development environment
(http://www.rstudio.com/ide/docs/packages/prerequisites)


Google for:
split apply combine
dplyr



Visualisation



library(ggplot2)
library(bigvis)

# Can't use data frames :(
dist <- readRDS("dist.rds")
delay <- readRDS("delay.rds")
time <- readRDS("time.rds")
speed <- dist / time * 60

# There's always bad data
time[time < 0] <- NA
speed[speed < 0] <- NA
speed[speed > 761.2] <- NA

qplot(dist, speed, colour = delay) +
  scale_colour_gradient2()
# One hour later...

x <- runif(2e5)
y <- runif(2e5)
system.time(plot(x, y))
#   user  system elapsed
#  2.785   0.010   2.806


Goals

Support exploratory analysis (e.g. in R)

Fast on commodity hardware:
100,000,000 obs in < 5 s
10^8 obs ≈ 0.8 Gb, so ~20 vars fit in 16 Gb


Insight

Bottleneck is the number of pixels:
1d: 3,000; 2d: 3,000,000

Process:
Condense (bin & summarise)
Smooth
Visualise
Bin

Each value of x is assigned to a fixed-width bin,
determined by an origin and a width.
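The binning step is simple arithmetic; a minimal sketch in base R (an illustration, not the bigvis internals):

```r
bin_index <- function(x, width, origin = 0) {
  # Integer index of the fixed-width bin each value falls in
  floor((x - origin) / width)
}

bin_index(c(0.2, 1.7, 2.3, 9.9), width = 2)
# [1] 0 0 1 4
```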


Summarise

Count     → histogram, KDE
Mean      → regression, loess
Std. dev.
Quantiles → boxplots, quantile regression, quantile smoothing
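Bin + summarise together give a condensed version of the data; the idea in base R (illustration only, not the bigvis API):

```r
set.seed(1)
x <- runif(1000, 0, 10)
z <- 2 * x + rnorm(1000)

bins <- floor(x)          # bin width 1, origin 0
tapply(z, bins, mean)     # one summary (the mean of z) per bin
tapply(x, bins, length)   # or a count per bin, as in a histogram
```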


[Histogram: .count (up to ~1,500,000) vs dist (0–5000)]

dist_s <- condense(bin(dist, 10))
autoplot(dist_s)
#   user  system elapsed
#  2.642   0.972   3.613
[Histogram: .count vs time (0–3000), dominated by a large NA bar]

time_s <- condense(bin(time, 1))
autoplot(time_s)
[Histogram: .count vs time (0–1000), NAs removed]

autoplot(time_s, na.rm = TRUE)
[Histogram: .count vs time (0–500)]

autoplot(time_s[time_s < 500, ])
[Histogram: .count vs time modulo 60 (0–60)]

autoplot(time_s %% 60)

[Plot: speed (200–600) vs dist (0–5000),
coloured by .count on a log scale (1e+00 to 1e+06)]

[Plot: speed (200–600) vs dist (0–5000), coloured by .count on a log scale]

sd1 <- condense(bin(dist, 10), z = speed)
autoplot(sd1) + ylab("speed")

# Timing for condense(bin(dist, 10), z = speed):
#   user  system elapsed
#  2.568   0.767   3.339
[Heatmap: speed (200–800) vs dist (0–5000),
2d-binned .count from 0e+00 to 6e+05]
[Heatmap: speed vs dist, 2d bins]

sd2 <- condense(bin(dist, 20), bin(speed, 20))
autoplot(sd2)
# Timing for the 2d condense:
#   user  system elapsed
#  7.366   1.190   8.552
Demo
shiny::runApp("mt/", 8002)


Google for:
bigvis



Conclusions



Visualise — bigvis

Tidy
Transform — dplyr

Model
