
Bigger data analysis
http://bit.ly/bigrdata3

Hadley Wickham
@hadleywickham
Chief Scientist, RStudio

July 2013

Thursday, July 18, 13


1. What is data analysis?
2. Transforming data
3. Visualising data



What is data analysis?





Data analysis is the process
by which data becomes
understanding, knowledge
and insight



Visualise

Tidy
Transform

Model

Frequent data analysis → learn to program

http://www.flickr.com/photos/compleo/5414489782
http://www.flickr.com/photos/mutsmuts/4695658106

Cognition time vs. computation time


Visualise — ggplot2

Tidy — reshape2
Transform — plyr, stringr, lubridate

Model

Computation time vs. cognition time


Visualise — bigvis

Tidy
Transform — dplyr

Model

Data

Every commercial US flight 2000–2011:

~76 million flights
Total database: ~11 Gb

>100 variables, but I'll focus on a handful:
airline, delay, distance, flight time and speed.


Transformation



Split → Apply → Combine

Input:            Output (sum of n per name):

name  n           name  total
Al    2           Al    2
Bo    4           Bo    9
Bo    0           Ed    15
Bo    5
Ed    5
Ed    10
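In plyr the whole split–apply–combine diagram above collapses to one call; a sketch using the toy data from the slide:

```r
library(plyr)

df <- data.frame(
  name = c("Al", "Bo", "Bo", "Bo", "Ed", "Ed"),
  n    = c(2, 4, 0, 5, 5, 10)
)

# Split by name, apply sum() to each piece, combine into one data frame
ddply(df, "name", summarise, total = sum(n))
#   name total
# 1   Al     2
# 2   Bo     9
# 3   Ed    15
```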


Output:          array   data frame   list    nothing
Input:
array            aaply   adply        alply   a_ply
data frame       daply   ddply        dlply   d_ply
list             laply   ldply        llply   l_ply
n replicates     raply   rdply        rlply   r_ply
function args    maply   mdply        mlply   m_ply




[Bar chart: self-reported usage (never / occasionally / often / all the time)
for each plyr function; ddply, ldply, dlply and llply are used most,
the array functions (aaply, alply, a_ply) least. x-axis: count, 0–150.]
Data analysis verbs
select: subset variables
filter: subset rows
mutate: add new columns
summarise: reduce to a single row
arrange: re-order the rows



Data analysis verbs + group by

The same five verbs, applied per group once group_by() sets the grouping.
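A sketch of the five verbs plus group_by on a toy data frame (dplyr's interface has evolved in places since this 2013 talk; the variable names here are illustrative):

```r
library(dplyr)

flights <- data.frame(
  UniqueCarrier = c("AA", "AA", "UA"),
  Distance      = c(1000, 1200, 2500),
  ArrDelay      = c(10, -5, 30)
)

select(flights, UniqueCarrier, ArrDelay)      # subset variables
filter(flights, ArrDelay > 0)                 # subset rows
mutate(flights, LongHaul = Distance > 2000)   # add new columns
arrange(flights, ArrDelay)                    # re-order the rows

# group_by makes summarise reduce each group to a single row
by_carrier <- group_by(flights, UniqueCarrier)
summarise(by_carrier, mean_delay = mean(ArrDelay))
```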


h <- readRDS("houston.rdata")
# ~2,100,000 x 6, ~57 meg; not huge, but substantial

library(plyr)
ddply(h, c("Year", "Month", "DayofMonth"),
      summarise, n = length(Year))
#   user  system elapsed
#  2.320   0.330   2.649

count(h, c("Year", "Month", "DayofMonth"))
#   user  system elapsed
#  0.687   0.183   0.869


# Often work with the same grouping variables
# multiple times, so define upfront. Also refer
# to variables in the same way.
daily_df <- group_by(h, Year, Month, DayofMonth)

# Now summarise knows how to deal with grouped
# data frames
summarise(daily_df, n())
#   user  system elapsed
#  0.095   0.015   0.110

# 20x faster!


library(data.table)
h_dt <- data.table(h)

daily_dt <- group_by(h_dt, Year, Month, DayofMonth)
summarise(daily_dt, n())
#   user  system elapsed
#  0.045   0.000   0.045

# Exactly the same syntax, but 2.5x faster!
# Don't need to learn the idiosyncrasies of
# data.table; just 2 lines of code


# And dplyr also works seamlessly with databases:
ontime <- source_sqlite("flights.sqlite3", "ontime")

h_db <- filter(ontime, Origin == "IAH")
daily_db <- group_by(h_db, Year, Month, DayofMonth)
summarise(daily_db, n())
#   user  system elapsed
# 22.190   0.546  22.734
#   user  system elapsed
#  5.565   0.425   5.986

# Much slower, but not restricted to a predefined subset.
# Could speed up by carefully crafting indices.


# Behind the scenes
library(dplyr)
ontime <- source_sqlite("../flights.sqlite3", "ontime")

translate_sql(Year > 2005, ontime)
# <SQL> Year > 2005.0
translate_sql(Year > 2005L, ontime)
# <SQL> Year > 2005

translate_sql(Origin == "IAD" || Dest == "IAD", ontime)
# <SQL> Origin = 'IAD' OR Dest = 'IAD'

years <- 2000:2005
translate_sql(Year %in% years, ontime)
# <SQL> Year IN (2000, 2001, 2002, 2003, 2004, 2005)


Data sources
Data frames (dplyr)
Data tables (dplyr)
SQLite tables (dplyr)
PostgreSQL, MySQL, SQL Server, ...
MonetDB (planned)
Google BigQuery (bigrquery)


daily_df <- group_by(h, Year, Month, DayofMonth)
summarise(daily_df, n())

daily_dt <- group_by(h_dt, Year, Month, DayofMonth)
summarise(daily_dt, n())

daily_db <- group_by(h_db, Year, Month, DayofMonth)
summarise(daily_db, n())

# It doesn't matter how your data is stored


# It might even live on the web
library(dplyr)
library(bigrquery)

h_bq <- source_bigquery(billing_project, "ontime", "houston")

daily_bq <- group_by(h_bq, Year, Month, DayofMonth)
system.time(summarise(daily_bq, n()))
# ~2 seconds

# Storage = $80 / TB / month
# Query = $35 / TB (100 GB free)


dplyr
Currently experimental and incomplete,
but it works, and you're welcome to try it out.

library(devtools)
install_github("assertthat")
install_github("dplyr")
install_github("bigrquery")

Needs a development environment
(http://www.rstudio.com/ide/docs/packages/prerequisites)


Google for:
split apply combine
dplyr



Visualisation



library(ggplot2)
library(bigvis)

# Can't use data frames :(
dist <- readRDS("dist.rds")
delay <- readRDS("delay.rds")
time <- readRDS("time.rds")
speed <- dist / time * 60

# There's always bad data
time[time < 0] <- NA
speed[speed < 0] <- NA
speed[speed > 761.2] <- NA

qplot(dist, speed, colour = delay) +
  scale_colour_gradient2()
# One hour later...

x <- runif(2e5)
y <- runif(2e5)
system.time(plot(x, y))
#   user  system elapsed
#  2.785   0.010   2.806


Goals

Support exploratory analysis (e.g. in R)

Fast on commodity hardware:
100,000,000 obs in < 5 s
10^8 obs ≈ 0.8 Gb, so ~20 vars fit in 16 Gb


Insight

Bottleneck is the number of pixels:
1d: 3,000; 2d: 3,000,000

Process:
Condense (bin & summarise)
Smooth
Visualise
Bin

Each value of x is assigned to a fixed-width bin,
determined by an origin and a width.
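The binning step is simple arithmetic; a minimal sketch in base R (an illustration, not the bigvis internals):

```r
bin_index <- function(x, width, origin = 0) {
  # Integer index of the fixed-width bin each value falls in
  floor((x - origin) / width)
}

bin_index(c(0.2, 1.7, 2.3, 9.9), width = 2)
# [1] 0 0 1 4
```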


Summarise

Count     → histogram, KDE
Mean      → regression, loess
Std. dev.
Quantiles → boxplots, quantile regression, quantile smoothing
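Bin + summarise together give a condensed version of the data; the idea in base R (illustration only, not the bigvis API):

```r
set.seed(1)
x <- runif(1000, 0, 10)
z <- 2 * x + rnorm(1000)

bins <- floor(x)          # bin width 1, origin 0
tapply(z, bins, mean)     # one summary (the mean of z) per bin
tapply(x, bins, length)   # or a count per bin, as in a histogram
```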


[Histogram: .count (up to ~1,500,000) vs dist (0–5000)]

dist_s <- condense(bin(dist, 10))
autoplot(dist_s)
#   user  system elapsed
#  2.642   0.972   3.613
[Histogram: .count vs time (0–3000), dominated by a large NA bar]

time_s <- condense(bin(time, 1))
autoplot(time_s)
[Histogram: .count vs time (0–1000), NAs removed]

autoplot(time_s, na.rm = TRUE)
[Histogram: .count vs time (0–500)]

autoplot(time_s[time_s < 500, ])
[Histogram: .count vs time modulo 60 (0–60)]

autoplot(time_s %% 60)

[Plot: speed (200–600) vs dist (0–5000),
coloured by .count on a log scale (1e+00 to 1e+06)]

[Plot: speed (200–600) vs dist (0–5000), coloured by .count on a log scale]

sd1 <- condense(bin(dist, 10), z = speed)
autoplot(sd1) + ylab("speed")

# Timing for condense(bin(dist, 10), z = speed):
#   user  system elapsed
#  2.568   0.767   3.339
[Heatmap: speed (200–800) vs dist (0–5000),
2d-binned .count from 0e+00 to 6e+05]
[Heatmap: speed vs dist, 2d bins]

sd2 <- condense(bin(dist, 20), bin(speed, 20))
autoplot(sd2)
# Timing for the 2d condense:
#   user  system elapsed
#  7.366   1.190   8.552
Demo
shiny::runApp("mt/", 8002)


Google for:
bigvis



Conclusions



Visualise — bigvis

Tidy
Transform — dplyr

Model
