
3/5/19

Lecture 19 – Reproducibility
Michael Franklin and Dan Nicolae
CS-Stat 119, Intro to Data Science II, Winter 2019
March 5, 2019

Scientific Reproducibility in Action

Slide From Victoria Stodden “Life Cycle of Data”

Is “Data Science” a “Science”?

“Big Data Hubris”?

“Nowcasting”:
In 2010 Google was able to predict flu outbreaks 2 weeks earlier than the CDC.

In 2013 it missed the peak of the flu season by 140%
(impacted by prevalence of flu in the news, search algo changes…)

Adapted From Victoria Stodden “Life Cycle of Data”


Is “Data Science” a “Science”?

The “Ubiquity of Error”

• The central motivation for the scientific method is to root out error:

• Deductive science has mathematical proof

• Empirical science has hypothesis testing, statistical methods, accepted norms for communication of methods and protocols

• What about Data Science?

Adapted From Victoria Stodden “Life Cycle of Data”

Terminology: Let’s try not to get too hung up

From: Chen et al. “Open is not Enough”, Nature Physics, November 15, 2018

Aspects of Reproducibility

• Empirical

• Computational

• Statistical

Science is having its own reproducibility crisis

Slide From Victoria Stodden “Life Cycle of Data”


P-Hacking or Selective Reporting


• Try lots of things – selectively report those that produce significant results.
• Adjusting data collection mid-experiment
• Recording many response variables and selectively reporting
• Outlier handling “post-analysis”
• Adjusting treatment groups “post-analysis”
• Stopping data exploration once a significant p-value is obtained
•…

Slide From Victoria Stodden “Life Cycle of Data”
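To make the "try lots of things" bullet concrete, here is a minimal simulation sketch. The setup is an assumption chosen for illustration, not something from the slide: each experiment records 20 independent response variables, both groups come from the same N(0, 1) distribution (so there is no real effect), and only the smallest p-value is "reported". The function names are hypothetical.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def smallest_p(rng, n_outcomes=20, n_per_group=20):
    """Simulate one null experiment and return the smallest p-value found.

    Both groups are drawn from the same N(0, 1) distribution, so any
    "significant" difference is a false positive.
    """
    best = 1.0
    for _ in range(n_outcomes):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        # z-test with known unit variance: the standard error of the
        # difference in means is sqrt(2 / n).
        diff = sum(a) / n_per_group - sum(b) / n_per_group
        z = diff / math.sqrt(2.0 / n_per_group)
        best = min(best, two_sided_p(z))
    return best


def false_positive_rate(n_experiments=1000, alpha=0.05, seed=119):
    """Share of null experiments where at least one outcome looks significant."""
    rng = random.Random(seed)
    hits = sum(smallest_p(rng) < alpha for _ in range(n_experiments))
    return hits / n_experiments


if __name__ == "__main__":
    rate = false_positive_rate()
    print(f"null experiments with a 'significant' outcome: {rate:.2f}")
    print(f"expected for 20 independent tests: {1 - 0.95 ** 20:.2f}")
```

Under these assumptions the chance of at least one p-value below 0.05 is 1 − 0.95^20, roughly 64%, even though every individual test is valid on its own.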

What are new Challenges for Data Science Reproducibility?

• Q: What are all the components that go into an analysis?

• Data Sets
• Software
• Whose? Libraries, Stack Overflow, Stuff you write, Operating Systems, …
• Environment – e.g. random seeds
• Hardware
• Workflows

Tools to the Rescue?

Tools for Curation and Reproducibility
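As one small illustration of recording the components of an analysis (data sets, software, environment, hardware), the sketch below writes a manifest next to a result. It is a minimal example under stated assumptions, not a tool from the lecture: `capture_environment` and `run_analysis` are hypothetical names, the "analysis" is a stand-in, and the package list is whatever you choose to pass in.

```python
import hashlib
import json
import platform
import random
import sys
from importlib import metadata


def capture_environment(seed, packages=(), data=b""):
    """Collect facts needed to rerun an analysis: versions, platform, seed, data hash."""
    record = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "seed": seed,
        "data_sha256": hashlib.sha256(data).hexdigest(),
        "packages": {},
    }
    for name in packages:
        try:
            record["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly instead of failing.
            record["packages"][name] = None
    return record


def run_analysis(seed, values):
    """Stand-in analysis (hypothetical): mean of a seeded random sample."""
    rng = random.Random(seed)
    sample = rng.sample(values, k=5)
    return sum(sample) / len(sample)


if __name__ == "__main__":
    values = list(range(100))
    data = json.dumps(values).encode("utf-8")
    manifest = capture_environment(seed=2019, packages=["numpy"], data=data)
    manifest["result"] = run_analysis(2019, values)
    print(json.dumps(manifest, indent=2))
```

Fixing the seed makes the sampled result repeatable on the same Python version; the manifest covers only part of the list above (it says nothing about the operating system libraries or the workflow that produced the data).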


Workflow – It’s more complicated than that!
• Think about all the places in a typical analysis where decisions that could dramatically impact the outcome are made:
• Choice of Data Sets
• Choice of Sensing technology
• Data Cleaning
• Outlier handling
• Missing Data
• Entity Resolution….
• Sampling
• Choice of Algorithm
•…

Provenance and Lineage
Need efficient fine-grained lineage for machine learning and advanced analytics pipelines
Supports code debugging, result analysis, data anomaly removal and computation replay
Provides interactive answers to queries over lineage
Hippo (Z. Zhang et al. HPDC 17)
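The two slides above can be connected with a toy example: a pipeline that logs, for every step, its parameters plus a hash of its input and output, so that a changed decision (here the outlier cutoff) can be traced to the step where results diverge. This is a coarse, step-level sketch under my own assumptions; `LineageLog` and the step functions are hypothetical and are not Hippo's API, which targets fine-grained lineage.

```python
import hashlib
import json
import statistics


def fingerprint(obj):
    """Short content hash of a JSON-serializable value."""
    blob = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


class LineageLog:
    """Records one entry per pipeline step (hypothetical helper)."""

    def __init__(self):
        self.entries = []

    def run(self, step, func, data, **params):
        result = func(data, **params)
        self.entries.append({
            "step": step,
            "params": params,
            "input": fingerprint(data),
            "output": fingerprint(result),
        })
        return result


def drop_missing(rows):
    """Missing-data decision: discard None values."""
    return [x for x in rows if x is not None]


def drop_outliers(rows, max_abs):
    """Outlier decision: discard values whose magnitude exceeds max_abs."""
    return [x for x in rows if abs(x) <= max_abs]


def analyze(raw, max_abs):
    """Run the toy pipeline and return (mean, lineage entries)."""
    log = LineageLog()
    rows = log.run("drop_missing", drop_missing, raw)
    rows = log.run("drop_outliers", drop_outliers, rows, max_abs=max_abs)
    mean = log.run("mean", statistics.fmean, rows)
    return mean, log.entries


if __name__ == "__main__":
    raw = [1.0, 2.0, None, 3.0, 250.0]
    for cutoff in (10.0, 1000.0):
        mean, entries = analyze(raw, max_abs=cutoff)
        print(cutoff, mean, [e["output"] for e in entries])
```

With this made-up data the mean is 2.0 with a cutoff of 10 and 64.0 with a cutoff of 1000; the logged hashes match after `drop_missing` and differ from `drop_outliers` onward, which is the kind of question a lineage query is meant to answer.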


Community Efforts

The “Underwriters Laboratories” approach

Slide From Victoria Stodden “Life Cycle of Data”


Anything Else???

Slide From Victoria Stodden “Life Cycle of Data”
