
Should we all be teaching Intro to Data Science instead of Intro to Databases?

7/11/14

Bill Howe, UW

Mike Franklin, UC Berkeley


Juliana Freire, NYU
Jim Frew, UC Santa Barbara
Bill Howe, University of Washington
Tim Kraska, Brown
Raghu Ramakrishnan, Microsoft
couldn't make it


Plan
context (8 min)
panelists (5 x (5min + 2min))
discussion


What is Data Science?

The next sexy job
The ability to take data, to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it: that's going to be a hugely important skill.
-- Hal Varian, Google

Data science, as it's practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired statistics.
Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools & materials, coupled with a theoretical understanding of what's possible.
-- Mike Driscoll, CEO of Metamarkets

Drew Conway's Data Science Venn Diagram


Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.
-- Josh Wills, Cloudera


A data scientist is a computer scientist that understands error bars
-- unknown

A data scientist is a statistician that lives in Silicon Valley
-- paraphrase of Stephen Brobst, Teradata


What do data scientists do?


They need to find nuggets of truth in data and then explain it to the business leaders
-- Richard Snee, EMC

Data scientists tend to be hard scientists, particularly physicists, rather than computer science majors. Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem.
-- DJ Patil, Chief Scientist at LinkedIn


A data scientist is someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning. Data scientists not only are adept at working with data, but appreciate data itself as a first-class product.
-- Hilary Mason, chief scientist at bit.ly


I worry that the Data Scientist role is like the mythical webmaster of the 90s: master of all trades.
-- Aaron Kimball, CTO Wibidata


Mike Driscoll's three sexy skills of data geeks

Statistics
traditional analysis

Data Munging
parsing, scraping, and formatting data

Visualization
graphs, tools, etc.


Three types of tasks:

1) Preparing to run a model: 80% of the work (-- Aaron Kimball)
   Gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging
2) Running the model
3) Interpreting the results: the other 80% of the work

What are the abstractions of data science?
Data Jujitsu
Data Wrangling
Data Munging
Translation: We have no idea what this is all about

What are the abstractions of data science?
matrices and linear algebra?
relations and relational algebra?
objects and methods?
files and scripts?
data frames and functions?
Claim: Relational Algebra is at least as important as Linear Algebra


Huge number of relevant courses, new and existing.


Axes for comparing courses:
Tools: tools vs. abstractions
Math: structures vs. statistics
Scale: desktop vs. cloud
Audience: hackers vs. analysts


Machine Learning (William W. Cohen)

CSE 344 Introduction to Data Management (Dan Suciu, Magda Balazinska)

Jeff Hammerbacher, Mike Franklin

Introduction to Data Science (Rachel Schutt)


Bill Howe, Richard Sharp, Roger Barga

UW Big Data Education Efforts

Audiences:
Students, CS/Informatics: undergrads, grads
Students, Non-Major: undergrads, grads
Non-Students: professionals, researchers

Programs:
UWEO Data Science Certificate
IGERT: Big Data PhD Track
CS Courses
Bootcamps and workshops
Intro to Data Programming
Data Science Masters (planned)
MOOC: Intro to Data Science
Incubator: hands-on training


Bill Howe
Session 1, Spring 2013
Session 2 (starts Monday!)

Participation numbers

Registered:                     119,517  (totally irrelevant)
Clicked play in first 2 weeks:  78,589
Turned in 1st homework:         10,663
Completed all assignments:      ~9,000   (typical attrition for a MOOC)
Passed:                         7,022
Forum threads:                  4,661
Forum posts:                    22,900

Fairly consistent with Coursera data across hard courses

Who took the course?


Syllabus
Data Science Landscape (~1 week)
Data Manipulation at Scale
  Relational Databases (~1 week)
  MapReduce (~1 week)
  NoSQL (~1 week)
Analytics
  Statistics Pearls (~1 week)
    multiple hypothesis testing, effect size, Bayesian, bootstrap
  Machine Learning Pearls (~1 week)
    evaluation / overfitting, boosting / bagging, trees / forests, gradient descent
Visualization (~1 week)
Graph Analytics (~1 week)
Guest Lectures

Relational Algebra is the Calculus of Big Data

RA-flavored Hadoop-spawn: Pig, HIVE, blah
Hadoop contemporaries: Cascalog, Flume, blah
Post-Hadoop: Spark/Shark, Dremel, blah
It's all RA
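
One way to read the claim above: a MapReduce-style job breaks down into projection, grouping, and aggregation, which are operators of (extended) relational algebra. The following is a minimal sketch in plain Python, not the API of any system named on this slide. The toy relation and every function name are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy relation: one row per log record.
rows = [
    {"user": "ann", "bytes": 120},
    {"user": "bob", "bytes": 300},
    {"user": "ann", "bytes": 80},
]

def map_phase(row):
    # Projection: keep only the grouping key and the measured value.
    return (row["user"], row["bytes"])

def shuffle(pairs):
    # Grouping: collect all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregation: emit one output tuple per group.
    return {"user": key, "total_bytes": sum(values)}

def mapreduce_total(relation):
    groups = shuffle(map_phase(r) for r in relation)
    out = (reduce_phase(k, v) for k, v in groups.items())
    # Sort only to make the printed result deterministic.
    return sorted(out, key=lambda t: t["user"])

result = mapreduce_total(rows)
print(result)
```

In SQL terms the same computation would be roughly SELECT user, SUM(bytes) FROM rows GROUP BY user, which is the sense in which the job is "all RA".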


Relational Algebra is the Calculus of Small Data

Galaxy bioinformatics workflows
Operate on Genomics Intervals -> Join

Pandas (Python)
merge(left, right, on=key)

dplyr (R)
filter(x), select(x), arrange(x), group_by(x),
inner_join(x, y), left_join(x, y), ...

Manimal, Pyxis/StatusQuo, others
Extract RA operators implemented manually in Java code
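
To make the small-data point concrete, here is a minimal sketch of how a few pandas calls line up with relational algebra operators. The two tables, their column names, and the threshold are invented for illustration and do not come from the talk. Only standard pandas calls are used (pd.merge, boolean indexing, column selection, groupby).

```python
import pandas as pd

# Hypothetical toy relations, invented for illustration.
genes = pd.DataFrame({
    "gene": ["A", "B", "C"],
    "chrom": ["chr1", "chr1", "chr2"],
})
hits = pd.DataFrame({
    "gene": ["A", "A", "C", "D"],
    "score": [0.9, 0.4, 0.7, 0.8],
})

# Join on the shared column; an inner join is the pandas default.
joined = pd.merge(genes, hits, on="gene")

# Selection: keep rows that satisfy a predicate.
selected = joined[joined["score"] > 0.5]

# Projection: keep a subset of columns.
projected = selected[["gene", "chrom"]]

# Grouping with aggregation (extended relational algebra).
counts = joined.groupby("chrom", as_index=False)["score"].count()

print(projected)
print(counts)
```

The dplyr verbs listed above play the same roles in R: filter for selection, select for projection, inner_join for join, and group_by for grouping.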

The next hour


~5 minute talks
Discussion


Possible Responses
Data science is just a buzzword; there's no substance to it.
I'm already teaching all this stuff; there's nothing new here.
This is a job for statistics departments / B-schools / I-schools / applied math / anyone else.