
Data Science Complete Tutorial with Detail

What is data science?

Data Science: Applying advanced statistical tools to existing data to generate new insights
Service Change: Converting new data insights into (often small) changes to business processes
Smarter Work: More efficient and effective use of staff and resources
What complements data science? (and is really good stuff to do)

Approach: Performance Management
Process: Define, visualize, often using dashboards, and manage to KPIs
Outcome: Meet goals and KPI targets
Examples: SF Scorecard, Public Works Stat & Stat starter kit

Approach: Evaluation
Process: Assess a project, program or policy design or results
Outcome: Better investment of resources; better policy decisions
Examples: Evaluation of transitional-kindergarten in SF

Approach: Policy Analysis
Process: Define and assess alternatives using a broad range of tools
Outcome: Report or memo with policy or program recommendations
Examples: Shape Up SF Policy Analysis

Approach: Open Data
Process: Publish civic data for use by the City and the public
Outcome: Easier data sharing and reporting, new tools or services built on data
Examples: SFPUC Adopt a Drain

Approach: DataScienceSF
Process: Identify insights using advanced statistics tied to a service change
Outcome: Smarter work “on the ground” in real time
Examples: See rest of deck!
All approaches (Performance Management, Evaluation, Policy Analysis, Open Data, DataScienceSF) can lead to service improvement. It’s about choosing the right tool for the job (and sometimes combining them)!
What’s in the DataScienceSF Toolkit?
Statistical Methods
• Sentiment analysis
• Time series analysis
• Data mining
• Multilevel modeling
• Missing data imputations
• Classification and clustering
• Survival analysis
• Pattern recognition
• Principal component and factor analysis
• AB testing
• Machine learning
• Forecasting
• Propensity score matching
• Logistic, multinomial and multiple linear regression techniques
• Network analysis
What’s in the DataScienceSF Toolkit?
Tools
• Languages: Python, R, SQL, Javascript, NodeJS
• Libraries: SciPy, Pandas, Scikit-learn, GPText, OpenNLP, Mahout
• Data Engineering: Profiling, ETL, Job notices, APIs, Optimized data pipelines, Optimized data storage/access
• Visualization: D3.js, Gephi, R, Leaflet, PowerBI, ggplot2, shiny
• +many others
What’s in the DataScienceSF Toolkit?
User Experience Research
• Iterative prototyping
• Photo journaling and documenting
• Service blueprinting
• Journey mapping
• Ride-alongs
• Process mapping
• Ethnographic field research and user observation
• Usability testing
What is NOT data science?
This: Service change. Not that: Academic research
This: Small changes. Not that: Major overhauls / service disruptions
This: Use existing data. Not that: Collecting new data (mostly ;)
Data Science Project Types
Project Type: Find the needle in the haystack
What to target? Target areas, target categories, target individuals

Service Issue: Difficult to identify targets in a population
Data Science Process: Use existing data and predictive modeling to identify targets
Service Change: Engage with target subset of population

Result: Department resources are spent where most needed
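
The targeting idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the population is synthetic, the `score` field stands in for whatever a real predictive model would output, and the function names `hit_rate` and `top_k_by_score` are hypothetical, not part of any DataSF tool.

```python
import random

def hit_rate(visited):
    """Share of visited homes that actually needed the service."""
    return sum(home["needs_alarm"] for home in visited) / len(visited)

def top_k_by_score(homes, k):
    """Return the k homes with the highest predicted probability of need."""
    return sorted(homes, key=lambda h: h["score"], reverse=True)[:k]

# Synthetic population (assumption): each home gets a model score in [0, 1],
# and the true outcome is drawn so that higher scores mean need is more likely.
rng = random.Random(0)
homes = []
for i in range(1000):
    score = rng.random()
    homes.append({"id": i, "score": score, "needs_alarm": rng.random() < score})

k = 100  # number of visits the department can afford (assumption)
random_visits = rng.sample(homes, k)
targeted_visits = top_k_by_score(homes, k)

print(f"random hit rate:   {hit_rate(random_visits):.2f}")
print(f"targeted hit rate: {hit_rate(targeted_visits):.2f}")
```

With the same number of visits, the prioritized list concentrates effort on the homes most likely to need help, which is the kind of hit-rate gain the examples below describe.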


Examples: Free fire alarms in New Orleans
Service Issue: Fire alarms go to homes that already have them
Data Science: ID homes with high probability of no alarm
Service Change: Use list to shape outreach
Result: 2x increase in hit rate

Examples: Find the needle in the haystack

New Orleans Fire Alarms
Service Issue: New Orleans Fire Department (Nola FD) distributes free fire alarms to homes. But many homes they visited already had them, wasting Nola FD’s resources.
Data Science: Nola’s analytics team used public data to identify homes with a high probability of not having a fire alarm and provided Nola FD with a list.
Service Change: Nola FD used the list to determine where to offer fire alarms.
Result: With no increase in resources or patrols, Nola FD increased the hit rate of homes needing smoke alarms by 2x.

New York City Tax Compliance
Service Issue: New York City (NYC) conducts corporate tax audits. They are time consuming and 37% have no findings. They want to increase findings but maintain their number of audits.
Data Science: NYC analyzed historical audit records and identified patterns of businesses. Outliers were flagged as possible audit targets.
Service Change: The audit team targeted the flagged cases for audits.
Result: With the same staff levels, the audit team decreased the percent of cases with no finding from 37% to 22%, leading to increased revenues.
Project Type: Prioritize your backlog
What to prioritize?

Service Issue: Backlog is tackled via first in, first out (FIFO)
Data Science Process: Create a model to categorize and group past and current cases
Service Change: Prioritize cases based on categories in order of risk, need or opportunity

Result: Department addresses high priority cases first
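
A minimal sketch of the difference between FIFO and category-based ordering, assuming a model has already assigned each case a category. The category names, the `PRIORITY` mapping and the `prioritized_order` helper are hypothetical and only illustrate the idea.

```python
import heapq

# Hypothetical grades from a categorization model (lower number = more urgent).
PRIORITY = {"high": 0, "medium": 1, "low": 2}

def prioritized_order(cases):
    """Order cases by category priority, keeping arrival order within a category."""
    heap = [(PRIORITY[case["category"]], arrival, case["id"])
            for arrival, case in enumerate(cases)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, case_id = heapq.heappop(heap)
        order.append(case_id)
    return order

# Made-up backlog, listed in arrival order.
backlog = [
    {"id": "A", "category": "low"},
    {"id": "B", "category": "high"},
    {"id": "C", "category": "medium"},
    {"id": "D", "category": "high"},
]

fifo_order = [case["id"] for case in backlog]
print("FIFO:       ", fifo_order)
print("Prioritized:", prioritized_order(backlog))
```

Including the arrival index in each heap entry keeps the ordering fair within a category: two high-priority cases are still handled in the order they arrived.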


Examples: Blight backlog in New Orleans
Service Issue: Backlog in blight enforcement
Data Science: Use data to grade cases per prior decisions
Service Change: Results used as an abatement decision tool
Result: 1,500+ case backlog gone in less than 100 days
Examples: Prioritize your backlog

Boston Complaints
Service Issue: In Boston, they have a large list of residences with anti-social complaints filed against them.
Data Science: The analytics team pooled data from housing, police, and tax agencies to gauge the nature of complaints and identify the biggest contributors to complaints.
Service Change: The Air Pollution Control Commission expedited enforcement with the biggest contributors.
Result: With no change in resources, Boston saw a 55% reduction in police calls associated with the targeted residences.

New Orleans Blight
Service Issue: New Orleans (Nola) faced a significant backlog in blight enforcement due in part to bottlenecks in the decision making process and missing information.
Data Science: Nola used data on the outcomes of previous blight cases to grade cases in the backlog and to recommend additional data to collect by field teams.
Service Change: The enforcement team used the results as an abatement decision tool to speed the decision-making process of whether to demolish or foreclose a home.
Result: Nola eliminated the 1,500+ case backlog in less than 100 days.
Project Type: Flag “stuff” early
How to detect?

Service Issue: Hard to predict future condition, which leads to reactive services
Data Science Process: Use historical and current data to create estimate ranges for potential outcomes
Service Change: Use estimates to change and tailor intervention points

Result: Department provides pro-active early interventions
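
One simple way to turn historical data into an estimate range is a mean plus or minus a multiple of the standard deviation, with anything outside the range flagged for early attention. The sketch below assumes that approach (real projects may use forecasting or other models), and the weekly counts and function names are made up for illustration.

```python
from statistics import mean, stdev

def expected_range(history, width=2.0):
    """Return (low, high) as mean +/- width * sample standard deviation."""
    center = mean(history)
    spread = stdev(history)
    return center - width * spread, center + width * spread

def flag_early(history, current, width=2.0):
    """True when the current value falls outside the expected range."""
    low, high = expected_range(history, width)
    return current < low or current > high

# Hypothetical weekly counts of some indicator (for example, complaints).
weekly_counts = [10, 12, 11, 9, 13, 10, 11, 12]

low, high = expected_range(weekly_counts)
print(f"expected range: {low:.1f} to {high:.1f}")
print("flag 11?", flag_early(weekly_counts, 11))
print("flag 25?", flag_early(weekly_counts, 25))
```

The `width` parameter is the tuning knob: a narrower range flags more cases early but produces more false positives, which is the trade-off the Charlotte example below reports on.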


Examples: Use of force alerts in Charlotte
Service Issue: Excessive force has a negative impact on the community
Data Science: Identify patterns to refine early warning
Service Change: Flagged recurring complaints
Result: Accuracy up 15-20%; false positives down 55%
Examples: Flag “stuff” early

Charlotte Police Violence
Service Issue: Excessive force violations by police officers have huge negative repercussions in the community and for police careers.
Data Science: The analytics team refined an early warning system, identifying patterns that often led to officers having negative interactions with the public.
Service Change: The department flagged recurring complaints against officers and notified supervisors when certain thresholds were reached.
Result: The CMPD system increased accuracy by 15-20% while reducing false positives by 55%.

Lead Poisoning in Chicago
Service Issue: In Chicago, a large number of children are thought to be exposed to lead paint in older houses.
Data Science: The analytics team built a model of exposure using data on homes, history of children’s exposure at that address and conditions of neighborhood.
Service Change: They conducted targeted inspections and provided remediation funding to homes identified in the model.
Result: Chicago reached the most vulnerable families before severe health effects from lead contamination manifest.
Project Type: A/B test something
Which form? 62% respond vs. 78% respond

Service Issue: Costly outreach methods are not tested before implementation
Data Science Process: Statistical testing on outreach methods to identify which, when, and to whom to send
Service Change: Use statistically validated outreach method

Result: Department increases response rates
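
To check whether a difference like 62% versus 78% is more than chance, one common option is a two-proportion z-test. The sketch below uses only the Python standard library. The response rates come from the slide, but the sample size of 200 per form is an assumption made purely for illustration, and `two_proportion_z_test` is a hypothetical helper name.

```python
from math import erf, sqrt

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two response rates.

    Uses the pooled standard error and returns (z statistic, p-value).
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF computed from the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# 62% of 200 and 78% of 200; the group size of 200 is an assumed value.
z, p_value = two_proportion_z_test(success_a=124, n_a=200, success_b=156, n_b=200)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the better-performing form is genuinely better rather than lucky. With much smaller groups, the same percentages could easily fail this test, which is why the test is run before the new method is rolled out.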


Examples: NYC Summons Redesign
Service Issue: 40% of those cited are no-shows, leading to costly arrest warrants
Data Science: Redesigned and tested summons form
Service Change: Deployed new form and rescheduled timelines
Result: Currently evaluating impact
Examples: A/B test something

NOLA Community Health Program
Service Issue: In New Orleans, they have a low take up rate of free primary care appointments.
Data Science: The analytics team tested different SMS reminders to those eligible for appointments.
Service Change: The department implemented the most successful SMS text.
Result: 60% increase in clients using free primary care appointments.

NYC Summons Redesign
Service Issue: 40% of those cited for low-level violations did not take required next steps, leading to issuance of arrest warrants.
Data Science: Experiment and test redesign of summons process.
Service Change: Reschedule court timelines to facilitate greater access.
Result: Evaluating impact on use of costly arrest warrants (Project currently in progress).
Project Type: Optimize your resources
How to distribute?

Service Issue: Difficult to identify where to place or distribute resources to be most effective
Data Science Process: Use geospatial and/or other data to identify optimal distribution of resources
Service Change: Re-allocate resources to optimal distribution

Result: Department decreases response times; increases volume
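
As a rough sketch of the geospatial idea, the code below picks, from a few candidate sites, the one with the lowest average straight-line distance to past call locations. Everything here is an assumption for illustration: the coordinates are made up, real work would likely use travel times rather than straight-line distance, and `average_distance` and `best_site` are hypothetical helper names.

```python
from math import hypot

def average_distance(site, calls):
    """Mean straight-line distance from a candidate site to past call locations."""
    return sum(hypot(site[0] - x, site[1] - y) for x, y in calls) / len(calls)

def best_site(candidates, calls):
    """Return the name of the candidate site with the lowest average distance."""
    return min(candidates, key=lambda name: average_distance(candidates[name], calls))

# Hypothetical coordinates on an arbitrary grid (not real city data).
past_calls = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 8)]
candidate_sites = {
    "depot": (8, 8),
    "midtown": (5, 5),
    "north": (2, 2),
}

for name, site in candidate_sites.items():
    print(f"{name}: {average_distance(site, past_calls):.2f}")
print("best:", best_site(candidate_sites, past_calls))
```

Note that the geometric middle of the map ("midtown") is not the best choice here, because most calls cluster in one corner. That is the kind of insight that habit or instinct can miss.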


Examples: Chicago Pest Control
Service Issue: Challenging to predict outbreaks
Data Science: Analyze data associated with outbreaks
Service Change: Proactive targeting of leading indicators
Result: 15% drop in requests for service
Examples: Optimize your resources

Chicago Pest Control
Service Issue: Chicago’s rodent baiting program finds it challenging to predict rodent outbreaks and locations, leading to spikes in 311 complaints.
Data Science: Predicted potential danger of outbreaks by using leading indicators and other data correlated with previous outbreaks.
Service Change: Directed rodent baiting to areas identified by leading indicators, including events like water main breaks.
Result: Resident requests for rodent control services dropped by 15%.

NOLA Ambulance Stand-by Location
Service Issue: In New Orleans, ambulance standby locations are chosen based on dispatcher habits or instincts.
Data Science: Analytics team used city wide analysis of data on accident patterns, traffic patterns, and crew readiness to identify optimal standby locations.
Service Change: Ambulances deployed at new optimized locations.
Result: Targeting short response times to EMS calls (Project currently in progress).
What was the service change?
(From that → To this)

Fire Alarms: Random List → Prioritized List
Blight: Staff evaluates all cases → Tool evaluates easy cases
Early Warning: Focus on that set of officers → Focus on this set of officers
Summons: Send original form → Send new form
Pest Control: Arrive at location X too late → Arrive at location X early

Service Change = Small Business Process Change


Summary: The five project types
• Find the needle in the haystack
• Prioritize your backlog
• Flag “stuff” early
• A/B test something
• Optimize your resources
Or: some combination, or something else…


DataScienceSF Cohort 1
ASR: Increase property tax revenues
Service Issue: When a property sells in SF, we either accept the sales price or modify it to collect property taxes. So which sales should you accept and which should you dig into?
Data Science: Our regression model identifies which sale prices are unusual for the location, time and property details
http://www.markersf.com/blog/
Service Change: The model splits properties into two lists: normal sale prices to enroll directly in tax collection and outlier sales for manual review by appraisers
Result: Expected: Increased revenue, faster time to revenue, reduced backlog, and more consistency in assessments
Project type: Prioritize your backlog
Full write up at datasf.org/showcase/datascience/
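
The two-list split can be illustrated with a very small regression sketch. This is not the ASR model: it assumes a single made-up predictor (`sqft`) instead of location, time and property details, uses synthetic prices, and flags a sale for review when its residual exceeds an assumed threshold of two standard deviations.

```python
from statistics import mean, stdev

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

def split_sales(sales, threshold=2.0):
    """Split sale ids into (normal, review) lists by size of regression residual."""
    xs = [s["sqft"] for s in sales]
    ys = [s["price"] for s in sales]
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    spread = stdev(residuals)
    normal, review = [], []
    for sale, residual in zip(sales, residuals):
        if abs(residual) > threshold * spread:
            review.append(sale["id"])   # unusual price: send to an appraiser
        else:
            normal.append(sale["id"])   # typical price: enroll directly
    return normal, review

# Synthetic sales (assumption): roughly $500 per square foot, except S5.
sqft = [1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600, 2800]
price = [503000, 598000, 701000, 796000, 500000,
         1002000, 1099000, 1204000, 1297000, 1400000]
sales = [{"id": f"S{i + 1}", "sqft": x, "price": y}
         for i, (x, y) in enumerate(zip(sqft, price))]

normal, review = split_sales(sales)
print("enroll directly:", normal)
print("manual review:  ", review)
```

In this toy data only the sale priced far below the trend lands on the review list, so appraiser time goes to the one case that looks unusual.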
Evictions: Pro-actively prevent evictions
Service Issue: How can we make eviction prevention more proactive by identifying the most problematic eviction notices in real time?
Data Science: An algorithm combines data sources to identify eviction notice filings that are outside the norm
Service Change: A list of flagged eviction notices is sent to eviction prevention services to proactively review for service outreach
Result: Expected: Targeted eviction prevention that keeps residents in their homes
Project types: Find the needle in the haystack; Flag “stuff” early
Full write up at datasf.org/showcase/datascience/
ENV: Find new clients to help green our City
Service Issue: SF Environment offers financial incentives and technical assistance to help our constituents upgrade their lighting & refrigeration systems. But their list of leads is dwindling. How can they find new leads?
Data Science: Mashed together multiple data sources to identify characteristics of stronger leads
Service Change: New and longer list of property leads with enriched data for targeting marketing campaigns
Result: Expected: New customers and increased uptake of green subsidies
Project types: Find the needle in the haystack; Optimize your resources
Full write up at datasf.org/showcase/datascience/
DPH WIC: Help moms and babies stay in nutrition program
Service Issue: Since 2011, DPH has seen an increase in mothers dropping out of their nutrition program. Which moms are most at risk of dropout?
Data Science: Built a predictive model that identified moms and infants who are at greatest risk for dropping out
Service Change: Using the high-risk client profiles to conduct targeted interviews to identify program barriers and make service changes
Result: Expected: Reduce the dropout rate of moms, infants and children, leading to healthier outcomes for both
Project type: Flag “stuff” early
Full write up at datasf.org/showcase/datascience/
DPH BHS: Improve results and reduce costs in mental health care
Service Issue: A small fraction of mental health patients use a large % of resources. Can we identify high users early to improve their outcomes and reduce costs?
Data Science: Build predictive model to identify clients at greatest risk for becoming high users
Service Change: Expected: Targeted service model to direct high users to more stable and preventative services
Result: Expected: Reduction in high cost clients and use of high cost emergency services
Project types: Find the needle in the haystack; Flag “stuff” early
TTX: Increase response to tax letter
Service Issue: TTX wanted to use behavioral economics and A/B testing to increase the effectiveness of its collection letter for unsecured personal property (a difficult type to collect on).
Data Science: DataSF helped organize a Behavioral Insights Training (BIT) workshop and provided guidance on the A/B test
Service Change: Use whichever letter gets the best response
Result: Improved response rate by 17%. TTX is continuing to apply BIT principles to other taxpayer communications
Project type: A/B test something
Full write up at datasf.org/showcase/datascience/
ART: Preserve City art for the future
Service Issue: The Arts Commission needs to accurately and efficiently project long-term costs to budget for art preservation
Data Science: Revised cost formula and new tool to provide long-term projections and prioritization of conservation projects on demand
Service Change: Use tool to model cost scenarios instead of manual, one-time process
Result: Expected: Reduction in staff time, more accurate cost estimates, and earlier identification of pieces in need of conservation
Project type: Optimize your resources
Full write up at datasf.org/showcase/datascience/
Overview of Phases

Cohort 2: Jan – June

[Timeline figure. Phases: Solicitation, Application due, Selection, Notify applicants, Project refining, Analysis & service change, Present. Date labels: Oct - Nov, Nov 27, Nov 22 – Dec 13, Dec 13, Dec, January - May, June]


Phase: Solicitation
Opportunities to learn more
• Brown bags
• Office hours
• Invited presentations

Dates at datasf.org/science

April - Mid
May May June July - November Dec
May May
Phase: Solicitation
How to prepare
• Brainstorm projects using the project types
• Identify possible service changes
• Review data that could help
• Identify key staff members

Learn more at datasf.org/science


Phase: Application
Available at datasf.org/science

• Brief online form
– Problem statement (200 words max)
– Impact statement (100 words max)
– Service change statement
– Data overview
– Project champion

Phase: Application
Criteria to keep in mind
• Above all else: A viable path to service change
• Question / problem answerable by data science
• Solvable within cohort time frame
• Impact
• Department commitment
• Data readiness
Phase: Selection
Process
• Initial review
– Criteria assessment
– Application scoring
• Department follow-ups, as needed
– Be available for questions (email or in person)
• Estimating 5-10 projects per Cohort

Phase: Winners Announced
And gentle off-ramps for the rest…
Some projects may not be appropriate for data science or for our timeline. We will help identify other
opportunities that may be a better fit:
• Civic Bridge – pro bono opportunities via the Mayor’s Office of Civic Innovation
• STIR – startup technology engagements via the Mayor’s Office of Civic Innovation
• DataSF Dashboarding Services
• Controller's Performance Unit
• Data Academy classes
• External Data Science groups or volunteers
• Other technical assistance

Phase: Project refining
During this phase, we will:
• Meet to refine the scope
• Optionally, do initial site visits/interviews
• Prepare data for analysis
• Outputs
– Project charter
– Data exchanges and agreements, as needed

Phase: Analysis and service change
During this phase, we will:
• Conduct site visits, ride-alongs and interviews, as appropriate
• Conduct iterative analysis
• Implementation testing
• Handoff and training
[Diagram: cycle of Plan, Analysis, Review, Service]

Phase: Analysis and service change

What DataSF Brings
• Statistical Methods
• Tools
• User Experience Research
What You Bring
• Issue expertise
• A good question & data
• Project champion
Final Product is Algorithm + Tool: Algorithms that are scripted and automated (real time if needed), tied to some service change tool (e.g. list, service, alert), implemented together and maintained by department
Phase: Present (& Disseminate)
During this phase, we will:
• Present and celebrate the results with cohort
• As appropriate, write an article for DataSF
Speaks (datasf.org/blog) and/or other venues
• Disseminate method and approach (not data) for
other departments and cities to learn
• Data Scientist will continue to be available
during office hours for continued support
Visit datasf.org/science
At datasf.org/science:
• This powerpoint
• 1 pager
• Sign up for office hours
• Sign up for brown bag
• Apply!
Other Resources: Civic Bridge
THANK YOU
@datasf | datasf.org |datasf.org/blog
Activity
• Take 5 minutes by yourself
– Brainstorm ideas
– Take your best idea and complete the form
• With your neighbors
– Review each top idea and refine/iterate
• Report out
