




10 things statistics taught us about big data analysis
POSTED MAY 22 BY JEFF LEEK / UNCATEGORIZED







In my previous post I pointed out that a major problem with big data is that applied
statistics has been left out. But many cool ideas in applied statistics are really
relevant for big data analysis. So I thought I'd try to answer the second question in
my previous post: "When thinking about the big data era, what are some statistical
ideas we've already figured out?" Because the internet loves top 10 lists, I came up
with 10, but there are more if people find this interesting. Obviously mileage may
vary with these recommendations, but I think they are generally not a bad idea.
1. If the goal is prediction accuracy, average many prediction models
together. In general, the prediction algorithms that most frequently win Kaggle
competitions or the Netflix prize blend multiple models together. The idea is
that by averaging (or majority voting) multiple good prediction algorithms you
can reduce variability without giving up bias. One of the earliest descriptions of
this idea was of a much simplified version based on bootstrapping samples
and building multiple prediction functions - a process called bagging (short for
bootstrap aggregating). Random forests, another incredibly successful
prediction algorithm, are based on a similar idea with classification trees.
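As a rough sketch of the idea in R (my example, not from the post), bagging fits a tree to each bootstrap resample and averages the predictions. It assumes the rpart package for the base trees; the function name bag_predict is made up:

    # Fit a regression tree to each bootstrap resample, then average
    # the B models' predictions for the new data.
    library(rpart)
    bag_predict <- function(formula, data, newdata, B = 100) {
      preds <- replicate(B, {
        boot <- data[sample(nrow(data), replace = TRUE), , drop = FALSE]
        fit <- rpart(formula, data = boot)
        predict(fit, newdata = newdata)
      })
      rowMeans(preds)  # average over the B bootstrap models
    }
    # Example use: bag_predict(mpg ~ ., data = mtcars, newdata = mtcars)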
2. When testing many hypotheses, correct for multiple testing. This
comic points out the problem with standard hypothesis testing when many
tests are performed. Classic hypothesis tests are designed to call a set of data
significant 5% of the time, even when the null is true (e.g. nothing is going on).
One really common choice for correcting for multiple testing is to use the false
discovery rate to control the rate at which things you call significant are false
discoveries. People like this measure because you can think of it as the rate of
noise among the signals you have discovered. Benjamini and Hochberg gave the
first definition of the false discovery rate and provided a procedure to
control the FDR. There is also a really readable introduction to FDR by Storey
and Tibshirani.
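Base R's p.adjust() implements the Benjamini-Hochberg procedure; here is a small sketch on simulated p-values (the mixture of nulls and signals is made up):

    # 900 null p-values plus 100 small "signal" p-values.
    set.seed(1)
    p <- c(runif(900), rbeta(100, 1, 50))
    sum(p < 0.05)                      # naive calls: many false positives
    fdr <- p.adjust(p, method = "BH")  # Benjamini-Hochberg adjustment
    sum(fdr < 0.05)                    # discoveries at a 5% FDR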
3. When you have data measured over space, distance, or time, you
should smooth. This is one of the oldest ideas in statistics (regression is a form
of smoothing and Galton popularized that a while ago). I personally like locally
weighted scatterplot smoothing a lot. This paper by Cleveland is a good one
about loess. Here it is in a gif. But people also like
smoothing splines, Hidden Markov Models, moving averages and many other
smoothing choices.
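A minimal loess sketch in base R, on simulated data (the span value is just illustrative):

    # Locally weighted scatterplot smoothing on noisy data.
    set.seed(2)
    x <- seq(0, 10, length.out = 200)
    y <- sin(x) + rnorm(200, sd = 0.4)
    fit <- loess(y ~ x, span = 0.3)   # smaller span = wigglier fit
    plot(x, y, col = "grey")
    lines(x, predict(fit), lwd = 2)   # the smoothed curve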
4. Before you analyze your data with computers, be sure to plot it. A common
mistake made by amateur analysts is to immediately jump to fitting models to
big data sets with the fanciest computational tool. But you can miss pretty
obvious things like this if you don't plot your data. There are too many plots to talk
about individually, but one example of an incredibly important plot is the Bland-Altman plot (called an MA-plot in genomics) when comparing measurements
from multiple technologies. R provides tons of graphics for a reason
and ggplot2 makes them pretty.
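A hedged sketch of a Bland-Altman (MA) plot in base R; m1 and m2 are invented paired measurements from two hypothetical technologies:

    # Plot the difference between two technologies against their average.
    set.seed(3)
    m1 <- rnorm(500, mean = 10)
    m2 <- m1 + 0.1 + rnorm(500, sd = 0.3)  # second technology, slight bias
    A <- (m1 + m2) / 2                     # average of the measurements
    M <- m1 - m2                           # difference between them
    plot(A, M, xlab = "Average", ylab = "Difference")
    abline(h = mean(M), lty = 2)           # estimated systematic bias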
5. Interactive analysis is the best way to really figure out what is going on
in a data set. This is related to the previous point; if you want to understand a
data set you have to be able to play around with it and explore it. You need to
make tables, make plots, identify quirks, outliers, missing data patterns and
problems with the data. To do this you need to interact with the data quickly.
One way to do this is to analyze the whole data set at once using tools like Hive,
Hadoop, or Pig. But an often easier, better, and more cost effective approach is
to use random sampling. As Robert Gentleman put it, "make big data as small
as possible as quick as possible".
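For instance (a sketch, with `big` standing in for whatever large data frame you have loaded or queried):

    # Explore a random subsample instead of the full data set.
    set.seed(5)
    big <- data.frame(x = rnorm(1e6), y = rnorm(1e6))
    small <- big[sample(nrow(big), 1e4), ]  # 10,000-row random sample
    summary(small)                          # quick checks are now instant
    plot(small$x, small$y, pch = ".")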
6. Know what your real sample size is. It can be easy to be tricked by the size
of a data set. Imagine you have an image of a simple black circle on a white
background stored as pixels. As the resolution increases the size of the data
increases, but the amount of information may not (hence vector graphics).
Similarly in genomics, the number of reads you measure (which is a main
determinant of data size) is not the sample size; the number of individuals is.
In social networks, the number of people in the network may not be the sample
size. If the network is very dense, the sample size might be much less. In
general the bigger the sample size the better, and sample size and data size
aren't always tightly correlated.
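A toy illustration of the pixel example (mine, not from the post): the data size grows with the square of the resolution, but the image still encodes only a center and a radius.

    # Rasterize one circle at a given resolution; res^2 "data points".
    circle_pixels <- function(res) {
      g <- expand.grid(x = seq(-1, 1, length.out = res),
                       y = seq(-1, 1, length.out = res))
      matrix(g$x^2 + g$y^2 <= 0.5^2, nrow = res)
    }
    print(object.size(circle_pixels(100)))   # modest
    print(object.size(circle_pixels(1000)))  # ~100x larger, same one circle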
7. Unless you ran a randomized trial, potential confounders should keep
you up at night. Confounding is maybe the most fundamental idea in statistical
analysis. It is behind spurious correlations like these and the reason why
nutrition studies are so hard. It is very hard to hold people to a randomized
diet, and people who eat healthy diets might be different from people who don't
in other important ways. In big data sets confounders might be technical
variables about how the data were measured or they could be differences
over time in Google search terms. Any time you discover a cool new result,
your first thought should be, "what are the potential confounders?"
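A quick simulation (my illustration) of how a lurking variable manufactures a correlation:

    # z drives both x and y; x and y have no direct relationship.
    set.seed(4)
    z <- rnorm(1000)                 # the unmeasured confounder
    x <- z + rnorm(1000)
    y <- z + rnorm(1000)
    cor(x, y)                        # strong, but spurious, correlation
    coef(lm(y ~ x))["x"]             # biased "effect" of x on y
    coef(lm(y ~ x + z))["x"]         # near zero once z is adjusted for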

8. Define a metric for success up front. Maybe the simplest idea, but one that
is critical in statistics and decision theory. Sometimes your goal is to discover
new relationships and that is great if you define that up front. One thing that
applied statistics has taught us is that changing the criteria you are going for
after the fact is really dangerous. So when you find a correlation, don't assume
you can predict a new result or that you have discovered which way a causal
arrow goes.
9. Make your code and data available and have smart people check it. As
several people pointed out about my last post, the Reinhart and Rogoff problem
did not involve big data. But even in this small data example, there was a bug in
the code used to analyze the data. With big data and complex models this is even
more important. Mozilla Science is doing interesting work on code review for
data analysis in science. But in general if you just get a friend to look over your
code it will catch a huge fraction of the problems you might have.
10. Problem first, not solution backward. One temptation in applied statistics is to
take a tool you know well (regression) and use it to hit all the nails (epidemiology
problems). There is a similar temptation in big data to get fixated on a tool (Hadoop,
Pig, Hive, NoSQL databases, distributed computing, GPGPU, etc.) and ignore the
problem of whether we can infer that x relates to y or that x predicts y.


Brian
Great post! I am curious about #3 - are you thinking of this in terms of preparing predictor
variables for modeling or generally in terms of visualizing or expressing a relationship
descriptively?
Seref Arikan
Brilliant stuff. I have been looking for this type of material for quite some time now, but
statistics books rarely focus on this type of gem. As a PhD student who had to do lots of self-learning, I wish I had a lot more of this type of wisdom around, especially in the form of books.
Any suggestions you may have would be great.
Cheers,
Seref
Trust me, I'm a statistician
In #2, Prof. Hochberg's name is misspelled.

Chandrika B-Rao
Excellent summarization. However, half knowledge is dangerous and suggestion 1 can be
misused. It can be taken as a license to combine results of multiple heterogeneous models without
any understanding of the bases of the models and the right way to combine them. Exactly how
the averaging over multiple models is done is crucial to the reliability of the combined result. I
have seen misuse of suggestion 3 as well - 50% inhibitory constants that are lower than the
lowest tested concentration or higher than the highest tested concentration of a compound in
a bioassay. This comes from blind use of software predictions involving smoothing and
extrapolation that the user is often unaware of. Combined with suggestion 4, this problem
could be overcome.
SandraPickering
Thank you - some of my 'favourites' and a few new ones.
I appreciate having all of them gathered in one place.
http://choicewords.tumblr.com/ A Corresponder
Please credit the art! The bottom one is clearly Allie Brosh's from Hyperbole and a Half.
pls
Second this. Good art is good.
(Btw, surprisingly I wasn't sure about the association, because it seemed unlikely. Thx)
http://www.matrixdatalabs.com/ Matrix DataLabs
Thanks for this! The basic ideas in statistics relating data to information are always at the heart
of predictive analytics, knowledge discovery, data mining, etc., whether with big data or not. Just
one comment: in aggregation models it tends to work better to consider weighted
combinations (this is actually what bagging does, iteratively searching for a solution over the set of
convex combinations of very simple learners using a gradient algorithm).
Nick Jachowski
Great article Jeff! I like that you differentiate between predictive and descriptive modeling. Too
often practitioners conflate the two, which leads to sub-optimal results! If you're trying to make
a prediction - great. If you're trying to describe what you think is going on - that's also great.
But rarely are the two one and the same...
http://www.twentylys.com/ TwentyLYS
Great stuff. Although I'm not sure I get number 1 quite correctly: averaging prediction model
results to get better accuracy. What about the assumptions each prediction model makes? I
think if we use bootstrapping then it resamples the sample and repeats the process a large
number of times, i.e. runs the same model a large number of times to output the shape of
the distribution. As far as I know Random Forest applies the same bootstrap
aggregating technique and uses some sort of modified decision tree method to reduce
correlation among the split trees.
Gianluca Rossi
Great article! Unfortunately the link to the article by Cleveland about loess is broken.




© 2013 ROGER D. PENG, RAFAEL A. IRIZARRY, JEFFREY T. LEEK. ALL RIGHTS RESERVED.
