
# Sports Predictive Analytics:

By
Dr. Ash Pahwa

## IEEE Computer Society

San Diego Chapter

## Copyright 2017 Dr. Ash Pahwa 1

Outline
Case Studies of Sports Analytics
Sports
Sports Analytics
Applications of Sports Analytics
Sports Analytics Literature
Data Sources
Sports Predictive Models
Regression Model
Multi Variable Regression with Lasso
NFL Prediction Model
Prediction for Super Bowl 2016
Prediction for NFL 2017 Playoffs
San Diego L.A. Chargers


Case Studies of Sports Analytics


What is Sports Analytics?
Which Pitcher is Better?

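2007 Baseball
Jake Peavy: San Diego Padres, National League, ERA 2.54
John Lackey: L.A. Angels, American League, ERA 3.01
ERA (earned run average) is the number of earned runs allowed per 9 innings.

• The National League does not allow a Designated Hitter (DH) for the pitcher; the pitcher must bat. ERA figures from the two leagues are therefore not directly comparable.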


Which Variable is the Best Predictor
of the Winning Percentage?


Pythagorean Theorem
Used in baseball. Proposed by Bill James.
Suppose
‘F’ represents a team’s runs scored
‘A’ represents a team’s runs allowed

$$\text{Pythagorean Winning Percentage} = \frac{F^2}{F^2 + A^2}$$

It is called the Pythagorean theorem because its form resembles the theorem from elementary geometry.

Example:
Year 2012: Detroit Tigers
Runs scored = F = 726
Runs allowed = A = 670

$$\text{Pythagorean Winning Percentage} = \frac{F^2}{F^2 + A^2} = \frac{726^2}{726^2 + 670^2} = \frac{527{,}076}{975{,}976} = 0.54$$

Total games won = 162 × 0.54 ≈ 87.5, i.e. 87 to 88 games
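The calculation above can be sketched in a few lines of Python. The function name and the `exponent` parameter are illustrative choices, not part of the slides; the exponent defaults to the value 2 used in the formula.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Pythagorean winning percentage: F^k / (F^k + A^k), with k = 2 by default."""
    f = runs_scored ** exponent
    a = runs_allowed ** exponent
    return f / (f + a)


# 2012 Detroit Tigers figures from the example: F = 726, A = 670.
pct = pythagorean_win_pct(726, 670)

# Expected wins over a 162-game season.
expected_wins = 162 * pct

print(f"winning percentage: {pct:.3f}")
print(f"expected wins: {expected_wins:.1f}")
```

Keeping the exponent as a parameter makes it easy to try values other than 2 against historical data.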


How to Convey Information to
Decision Makers

• Understand the Sport
• Understand the Players
• Understand Performance data

Raw Data → Selection of Statistical Models and Predictive Models → Meaningful Results

Decision makers:
• Management
• Operational Personnel


Goals of Sports Analytics
1. Apply Statistical Models to Sporting
Data
2. Ratings and Rankings
3. Predictive Models
4. Player and Team Assessment


Statistical Models
Predictive Models

| Indices of Central Tendencies and Variability | Statistics Used to Examine Relationships | Inferential Statistics | Ratings + Rankings | Predictive Models |
|---|---|---|---|---|
| Histogram (Frequency Distribution) | Normal Distribution (z-values and p-values) | Normality | Ratings + Rankings | Simple Linear Regression |
| Mean | Covariance | Outlier | Rank Aggregation | Multiple Linear Regression |
| Median | Correlation – Pearson | t-test | | Polynomial Regression |
| Mode | Rank Correlation – Spearman | ANOVA | | Logistic Regression |
| Range, Variance | Partial Correlation | Chi-Square | | |
| Standard Deviation | | | | |


Ranking of Players and Teams?


Pair-Wise Comparison


Pair-Wise Comparison
The Social Network


Pair-Wise Comparison
Can be Used for Ranking


Log 5 Method
Developed by Bill James in the 1970s
Computes the probability that Team A will beat Team B
The Log 5 formula has nothing to do with the mathematical function ‘log’


Log 5 Formula

$$p_{a,b} = \frac{p_a - p_a p_b}{p_a + p_b - 2 p_a p_b}$$

Suppose Team A has won 10 out of 16 games
True winning percentage of Team A = $p_a$ = 10/16 = 0.625
Suppose Team B has won 7 out of 16 games
True winning percentage of Team B = $p_b$ = 7/16 = 0.438

The probability that Team A will beat Team B:

$$p_{a,b} = \frac{p_a - p_a p_b}{p_a + p_b - 2 p_a p_b} = \frac{0.625 - 0.625 \times 0.438}{0.625 + 0.438 - 2 \times 0.625 \times 0.438} = \frac{0.625 - 0.274}{1.063 - 2 \times 0.274} = \frac{0.351}{0.515} = 0.681$$

The probability that Team B will beat Team A:

$$p_{b,a} = \frac{p_b - p_a p_b}{p_a + p_b - 2 p_a p_b} = \frac{0.438 - 0.625 \times 0.438}{0.625 + 0.438 - 2 \times 0.625 \times 0.438} = \frac{0.438 - 0.274}{1.063 - 2 \times 0.274} = \frac{0.164}{0.515} = 0.318$$

The two probabilities sum to 1 (up to rounding):

$$p_{a,b} + p_{b,a} = 1$$
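As a quick check of the worked example, the Log 5 formula can be written as a small Python function. The function name `log5` is an illustrative choice; the inputs are the 10/16 and 7/16 winning percentages from the example.

```python
def log5(p_a, p_b):
    """Probability that team A beats team B under the Log 5 formula.

    p_a and p_b are the teams' true winning percentages (between 0 and 1).
    The denominator is zero when both inputs are 0 or both are 1.
    """
    return (p_a - p_a * p_b) / (p_a + p_b - 2 * p_a * p_b)


p_a = 10 / 16  # Team A: 10 wins in 16 games
p_b = 7 / 16   # Team B: 7 wins in 16 games

p_ab = log5(p_a, p_b)  # A beats B
p_ba = log5(p_b, p_a)  # B beats A

print(f"P(A beats B) = {p_ab:.3f}")
print(f"P(B beats A) = {p_ba:.3f}")
print(f"sum = {p_ab + p_ba:.3f}")
```

With unrounded inputs the first probability comes out at about 0.682; the 0.681 in the worked example reflects rounding of the intermediate values.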


Arpad Elo
Physics professor at Marquette University, Milwaukee, Wisconsin
Chess player
Devised a method to rank chess players
The method was adopted by the US Chess Federation and the World Chess Federation


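Points Gained or Lost
Points gained by one player = points lost by the other player

Before the game
• Player A rating $r_A$
• Player B rating $r_B$
• Difference $d_{AB} = r_A - r_B$

After the game
• Player A rating $r_A'$
• Player B rating $r_B'$
• $r_A + r_B = r_A' + r_B'$

Expected score of player A against player B, where $L$ is the base-10 logistic function:

$$\mu_{AB} = L\left(\frac{d_{AB}}{400}\right) = \frac{1}{1 + 10^{-d_{AB}/400}}$$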
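The expected-score formula can be sketched in Python as follows. The function name and the ratings of 2400 and 2000 are illustrative inputs chosen for this sketch.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B: 1 / (1 + 10^(-d_AB / 400))."""
    d_ab = rating_a - rating_b
    return 1.0 / (1.0 + 10.0 ** (-d_ab / 400.0))


# Illustrative ratings: a 400-point gap in favour of player A.
mu_ab = expected_score(2400, 2000)
mu_ba = expected_score(2000, 2400)

print(f"mu_AB = {mu_ab:.2f}")
print(f"mu_BA = {mu_ba:.2f}")
```

The two expected scores always sum to 1, and equally rated players each get an expected score of 0.5.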


Example

| | If Player A wins | Draw | If Player B wins |
|---|---|---|---|
| S(AB) | 1 | 0.5 | 0 |
| S(BA) | 0 | 0.5 | 1 |

Before the game
• Player A rating $r_A = 2400$
• Player B rating $r_B = 2000$
• Difference $d_{AB} = r_A - r_B = 400$, so $\mu_{AB} = 0.91$
• Difference $d_{BA} = r_B - r_A = -400$, so $\mu_{BA} = 0.09$
Suppose K = 32 for chess

After the game, if Player A wins
• Player A rating $r_A' = r_A + K(S_{AB} - \mu_{AB}) = 2400 + 32(1 - 0.91) = 2403$
• Player B rating $r_B' = r_B + K(S_{BA} - \mu_{BA}) = 2000 + 32(0 - 0.09) = 1997$
• $r_A + r_B = r_A' + r_B'$
• 2400 + 2000 = 2403 + 1997

After the game, if Player B wins
• Player A rating $r_A' = r_A + K(S_{AB} - \mu_{AB}) = 2400 + 32(0 - 0.91) = 2371$
• Player B rating $r_B' = r_B + K(S_{BA} - \mu_{BA}) = 2000 + 32(1 - 0.09) = 2029$
• $r_A + r_B = r_A' + r_B'$
• 2400 + 2000 = 2371 + 2029
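The rating update in this example can be reproduced with a short Python sketch. The function names are illustrative; the ratings (2400 and 2000) and K = 32 are taken from the example, and the draw case is added only to show the third column of the score table.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B: 1 / (1 + 10^(-(r_A - r_B) / 400))."""
    return 1.0 / (1.0 + 10.0 ** (-(rating_a - rating_b) / 400.0))


def update_ratings(rating_a, rating_b, score_a, k=32):
    """Return the new (r_A', r_B') after one game.

    score_a is S(AB): 1 if A wins, 0.5 for a draw, 0 if B wins.
    """
    mu_ab = expected_score(rating_a, rating_b)
    mu_ba = 1.0 - mu_ab
    score_b = 1.0 - score_a
    new_a = rating_a + k * (score_a - mu_ab)
    new_b = rating_b + k * (score_b - mu_ba)
    return new_a, new_b


a_wins = update_ratings(2400, 2000, 1.0)
b_wins = update_ratings(2400, 2000, 0.0)
draw = update_ratings(2400, 2000, 0.5)

for label, (new_a, new_b) in [("A wins", a_wins), ("B wins", b_wins), ("draw", draw)]:
    print(f"{label}: r_A' = {new_a:.0f}, r_B' = {new_b:.0f}, total = {new_a + new_b:.0f}")
```

Because the two expected scores sum to 1 and the two actual scores sum to 1, the points one player gains are exactly the points the other loses, so the rating total stays at 4400.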


The Social Network
In the film, the Elo formula is used for pair-wise comparison of girls

Sports


Sports
Inherent part of Human Culture
Sports competition dates back to the dawn of
our species
Greek Olympics : 776 BC


Sportsman Spirit
Virtues of sports
fairness
self-control
courage
persistence
Sportsmanship has been associated with the interpersonal concepts of treating others fairly and being treated fairly
It also means maintaining self-control when dealing with others, and respect for both authority and opponents
GOOD SPORTSMEN HAVE ALWAYS BEEN HELD IN HIGH ESTEEM


Most Popular Sports in the
World


Most Popular Sports in America

Sports Analytics


Predictive Models
• Estimation
  • Regression
• Classification (win/loss)
  • Logistic Regression
  • Discriminant Analysis (Linear)
  • Support Vector Machine
Goals of Sports Analytics
Player
Discovering hidden talent in a new player
Player Evaluation
Assessing Player Performance
Which metric is most important for assessing a player’s performance
Assessing Player Value
How much value a player adds to the team’s value


Goals of Sports Analytics
Team
Ranking top teams
Assessing Team Performance
How to compute the value of a team
Which team members are best suited to play against the opposing team
Which strategy to use against a team
Anticipating Opponents’ Behavior
Assessing the probability of a win in a sporting event
Need for Prediction Results
Betting on a sporting event
People betting on sports need to see the
prediction results
Probability of a win
Fantasy Sports
DraftKings
FanDuel
Applications of Sports
Analytics


Application of Sports PA
Movies: Moneyball
The film is based on Michael Lewis' 2003 nonfiction book of the same name, an account of the Oakland Athletics baseball team's 2002 season and their general manager Billy Beane's attempts to assemble a competitive team.

In the film, Beane (Brad Pitt) and assistant GM Peter Brand (Jonah Hill), faced with the franchise's limited budget for players, build a team of undervalued talent by taking a sophisticated sabermetric approach towards scouting and analyzing players.

They acquire "submarine" pitcher Chad Bradford (Casey Bond) and former catcher Scott Hatteberg (Chris Pratt), and win 20 consecutive games, an American League record.


Team Selection: Moneyball


Moneyball – Billy Beane &
Paul DePodesta

William Lamar "Billy" Beane III (born March 29, 1962) is an American former professional baseball player and current front office executive.

He is the Executive Vice President of Baseball Operations and minority owner of the Oakland Athletics of Major League Baseball (MLB).

The character of Brand is an invention by the filmmakers; in the excellent Michael Lewis non-fiction book upon which the movie is based, the real-life “Brand” is identified as Paul DePodesta.

Unlike Brand, DePodesta is slender, fit and handsome. He’s also Harvard-educated (not a Yalie – screenwriter Aaron Sorkin’s private joke).


Application of Sports PA: Boston Red Sox
Using predictive analytics strategies, the Boston Red Sox won 3 World Series in baseball

In 2006, Time named Bill James in the Time 100 as one of the most influential people in the world. He is a Senior Advisor on Baseball Operations for the Boston Red Sox.


Sports Analytics Literature


Predictive Models for Sports
Literature


Statistical Learning
Gareth James
Daniela Witten
Trevor Hastie
Robert Tibshirani

Stanford University

Data Sources


Data Sources

Valid
Accurate
Complete
Contains derived variables


Data Sources
www.NFL.com
www.NBA.com
www.footballOutsiders.com
www.pro-football-reference.com
www.soccerstats.com


Data Source
Football
www.ArmChairAnalysis.com


Sports Predictive Models


Goals of Predictive Analytics Application: Estimation or Classification

Estimation
• Regression modeling technique is used
• Output is a number
  • House price
  • GNP growth for the next quarter
  • How many points a team will score

Classification
• Logistic Regression, Support Vector Machine, Discriminant Analysis (Linear), Naïve Bayes, Decision Trees etc. modeling techniques are used
• Output is a categorical variable
  • Sports team will win or lose
  • Email is junk or not
  • Tweet is positive or negative
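The difference between the two kinds of output can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the function names, the input variables (`offense_rating`, `rating_gap`) and the coefficients are made up for illustration and are not fitted to any data.

```python
import math


def estimate_points(offense_rating, slope=0.5, intercept=10.0):
    """Estimation: a linear model whose output is a number (points scored).

    slope and intercept are hypothetical coefficients.
    """
    return slope * offense_rating + intercept


def win_probability(rating_gap, weight=0.1):
    """Logistic model: maps a rating gap to a probability between 0 and 1.

    weight is a hypothetical coefficient.
    """
    return 1.0 / (1.0 + math.exp(-weight * rating_gap))


def classify_outcome(rating_gap, threshold=0.5):
    """Classification: the output is a category ("win" or "loss")."""
    return "win" if win_probability(rating_gap) >= threshold else "loss"


print(estimate_points(30.0))    # a number
print(classify_outcome(5.0))    # a category
print(classify_outcome(-5.0))   # a category
```

The estimation model returns a value on a continuous scale, while the classifier reduces a probability to one of two labels by comparing it with a threshold.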


Common PA Techniques
Regression
• Linear 2 variables
• Linear multi variables
• Logistic
• Polynomial
Clustering
Decision Trees
Neural Networks
Naïve Bayes
ARIMA
A few more …

Regression Model


History of Linear Regression
Sir Francis Galton and Karl Pearson
Developed the concepts of regression and correlation in the late 19th and early 20th centuries


Definition: Linear Regression
2 Variable Regression
How a response variable “y” changes as the predictor (explanatory) variable “x” changes

$$y = a_1 x + c$$

Multiple Regression
How a response variable “y” changes as the predictor (explanatory) variables “x1”, “x2”, … “xn” change

$$y = a_1 x_1 + a_2 x_2 + a_3 x_3 + \dots + a_n x_n + c$$
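For the 2-variable case, the slope and intercept can be estimated by ordinary least squares. The sketch below uses the textbook closed-form solution in plain Python; the function name `fit_line` and the five data points are hypothetical and chosen so that the answer is easy to verify by hand.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a1 * x + c; returns (a1, c)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = (sum of cross-deviations) / (sum of squared x-deviations).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    c = mean_y - a1 * mean_x
    return a1, c


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

a1, c = fit_line(xs, ys)
prediction = a1 * 6.0 + c  # predicted response for a new x = 6

print(f"a1 = {a1}, c = {c}, prediction at x = 6: {prediction}")
```

The multiple-regression case follows the same least-squares idea but needs a linear solver, which is why statistical packages are normally used for it.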
Single Variable Polynomial Regression
First Degree to Fifth Degree

The 5th degree polynomial regression looks good on the training set data
But it may not perform well on test data
PROBLEM: the model is overfit to this data
Overfitting
A model with estimated parameters w2 is overfit if there exists another model with estimated parameters w1 such that
• Training error (w2) < Training error (w1)
• Generalization (test) error (w2) > Generalization (test) error (w1)
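The two inequalities can be seen on a small synthetic example. In the Python sketch below, w1 is a least-squares straight line and w2 is the degree-5 polynomial that passes through all six training points (evaluated by Lagrange interpolation, which is used here only as a convenient stand-in for a high-degree polynomial fit). The data are made up: the training responses are y = x with alternating noise of +1 and -1, and the test responses are noise-free.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a1 * x + c; returns (a1, c)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    return a1, mean_y - a1 * mean_x


def lagrange_predict(xs, ys, x):
    """Evaluate at x the polynomial that passes through every (xs[k], ys[k])."""
    total = 0.0
    for k, (xk, yk) in enumerate(zip(xs, ys)):
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != k:
                basis *= (x - xj) / (xk - xj)
        total += yk * basis
    return total


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on a data set."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data: training set is y = x plus alternating noise, test set is y = x.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.0, 0.0, 3.0, 2.0, 5.0, 4.0]
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_y = [0.5, 1.5, 2.5, 3.5, 4.5]

a1, c = fit_line(train_x, train_y)


def line(x):
    """Model w1: straight line."""
    return a1 * x + c


def interpolant(x):
    """Model w2: degree-5 polynomial through all training points."""
    return lagrange_predict(train_x, train_y, x)


train_err_w1 = mse(line, train_x, train_y)
train_err_w2 = mse(interpolant, train_x, train_y)
test_err_w1 = mse(line, test_x, test_y)
test_err_w2 = mse(interpolant, test_x, test_y)

print(f"training error: w1 = {train_err_w1:.3f}, w2 = {train_err_w2:.3f}")
print(f"test error:     w1 = {test_err_w1:.3f}, w2 = {test_err_w2:.3f}")
```

The interpolating polynomial reproduces the training data exactly, so its training error is zero, but it swings away from the underlying trend between the training points and does worse than the line on the test set.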

Bias and Variance

The goal is to find the model complexity at which
• Generalization (validation) error is lowest
• Bias + variance is lowest

$$\text{Mean Square Error} = \text{Bias}^2 + \text{Variance}$$

Just like the generalization error, bias and variance cannot be computed exactly


1st Degree Polynomial Fit
$y = a_1 x + c$

> #result = lm(y1~x)

> result = lm(y1 ~ poly(x,1,raw=TRUE))
> summary(result)

Call:
lm(formula = y1 ~ poly(x, 1, raw = TRUE))

Residuals:
Min 1Q Median 3Q Max
-0.9872 -0.7277 -0.1394 0.7842 1.2692

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1919 0.4285 0.448 0.662
poly(x, 1, raw = TRUE) -0.0492 0.1120 -0.439 0.668

Residual standard error: 0.8449 on 12 degrees of freedom

Multiple R-squared: 0.01581, Adjusted R-squared: -0.0662
F-statistic: 0.1928 on 1 and 12 DF, p-value: 0.6684

> p = predict(result,list(x=xPredict))
> lines(xPredict,p,col='black',lwd=3)
>


16th Degree Polynomial Fit
$y = a_1 x^{16} + a_2 x^{15} + \dots + a_{15} x^2 + a_{16} x + c$
> result = lm(y1 ~ poly(x,16,raw=TRUE))
> summary(result)
Call:
lm(formula = y1 ~ poly(x, 16, raw = TRUE))
Residuals:
ALL 14 residuals are 0: no residual degrees of freedom!
Coefficients: (3 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.820e-01 NA NA NA
poly(x, 16, raw = TRUE)1 1.280e+02 NA NA NA
poly(x, 16, raw = TRUE)2 -7.656e+02 NA NA NA
poly(x, 16, raw = TRUE)3 1.973e+03 NA NA NA
poly(x, 16, raw = TRUE)4 -2.895e+03 NA NA NA
poly(x, 16, raw = TRUE)5 2.675e+03 NA NA NA
poly(x, 16, raw = TRUE)6 -1.631e+03 NA NA NA
poly(x, 16, raw = TRUE)7 6.712e+02 NA NA NA
poly(x, 16, raw = TRUE)8 -1.869e+02 NA NA NA
poly(x, 16, raw = TRUE)9 3.447e+01 NA NA NA
poly(x, 16, raw = TRUE)10 -3.944e+00 NA NA NA
poly(x, 16, raw = TRUE)11 2.282e-01 NA NA NA
poly(x, 16, raw = TRUE)12 NA NA NA NA
poly(x, 16, raw = TRUE)13 -5.901e-04 NA NA NA
poly(x, 16, raw = TRUE)14 NA NA NA NA
poly(x, 16, raw = TRUE)15 1.468e-06 NA NA NA
poly(x, 16, raw = TRUE)16 NA NA NA NA

Residual standard error: NaN on 0 degrees of freedom

Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 13 and 0 DF, p-value: NA
Under fit or Over fit Model

| # | Degree of Polynomial | Under fit or Over fit |
|---|---|---|
| 1 | 1 | Under fit |
| 2 | 2 | Under fit |
| 3 | 4 | Under fit |
| 4 | 8 | OK |
| 5 | 10 | Over fit |
| 6 | 16 | Over fit |

Lasso Regression

Least Absolute Shrinkage and Selection Operator


Cost Function of OLS, Ridge, and Lasso
OLS

Ridge Regression

Lasso Regression
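For reference, the three cost functions are commonly written as follows. The notation is an assumption made for this sketch and extends the $a_j$ and $c$ of the regression definition: $m$ observations, $n$ predictors, coefficients $a_1, \dots, a_n$, intercept $c$, and a regularization weight $\lambda \ge 0$.

$$J_{\text{OLS}} = \sum_{i=1}^{m} \Big( y_i - c - \sum_{j=1}^{n} a_j x_{ij} \Big)^2$$

$$J_{\text{Ridge}} = \sum_{i=1}^{m} \Big( y_i - c - \sum_{j=1}^{n} a_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{n} a_j^2$$

$$J_{\text{Lasso}} = \sum_{i=1}^{m} \Big( y_i - c - \sum_{j=1}^{n} a_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{n} |a_j|$$

With $\lambda = 0$ both penalized forms reduce to OLS. The absolute-value penalty in Lasso can drive some coefficients exactly to zero, which is why it also acts as a variable selection method.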

NFL Model


Data Source
Seasons 2000 – 2016
Weeks 1 – 21
• Weeks 1 – 17: regular season, 16 games + 1 bye per team
• Week 18: 4 games: Wild Card
• Week 19: 4 games: Divisional Playoff
• Week 20: 2 games: Conference Championship
• Week 21: 1 game: Super Bowl


Data
Tables
Arm Chair Data – 26 Tables

Arm Chair
Game Data
Play Data
Team Data
External Source
City Coordinates (Wikipedia)
City GDP (Wikipedia + US Govt.)


Sonny Moore’s Computer Rating
http://sonnymoorepowerratings.com/archive


USA Today: Jeff Sagarin
http://www.usatoday.com/sports/nfl

NFL Prediction

Model Result


Comparison of Errors
Sonny Moore & Jeff Sagarin


Super Bowl 2016 Prediction
by Nate Silver


Super Bowl 2016 Prediction
by A+


Super Bowl 2016
All Prediction Models Were Wrong

| Model | Panthers | Broncos |
|---|---|---|
| Nate Silver | 59% | 41% |
| A+ | 56.5% | 43.5% |

Both models favored the Panthers; the Broncos won.


2017 NFL Prediction
Wild Card: Playoff
Predictions for all 4 games were correct (100%)


2017 NFL Prediction
Divisional Title: Playoff
www.NFLPrediction.co

Predictions
2 correct
2 incorrect
50% correct


2017 NFL Prediction
Conference Championship: Playoff

Atlanta Falcons vs Green Bay Packers
• Atlanta Falcons 61%
New England Patriots vs Pittsburgh Steelers
• New England Patriots 70%


2017 NFL Prediction
Super Bowl

New England Patriots vs Atlanta Falcons
• New England Patriots 60%


Summary
Sports
Sports Analytics
Applications of Sports Analytics
Sports Analytics Literature
Data Sources
Sports Predictive Models
Regression Model
Multi Variable Regression with Lasso
NFL Prediction Model
Prediction for Super Bowl 2016
Prediction for NFL 2017 Playoffs


UCSD Extension Courses

Winter 2017


Online

Sports Predictive Analytics: January 30 – March 19, 2017 (7 weeks)

UCSD Sports Predictive Analytics: Course Content
L# Date Subject
1 01/30/17 Introduction to Sports Analytics
1.1 What is Sports Analytics?
1.2 Tools for Sports Analytics
1.3 Science of Learning from Data
1.4 Basic Stat : Data Types + Histograms + Std Dev
2 02/06/17 Statistical Methods Applied on Sports Data
2.1 Normal Distribution
2.2 Correlation
2.3 Rank and Partial Correlation
3 02/13/17 Central Limit Theorem + Hypothesis Testing
3.1 CLT + Parameter Estimation
3.2 Hypothesis Testing (*)
4 02/20/17 Inferential Statistics: Chi-Square + ANOVA + Model Selection
4.1 Chi-Square (*)
4.2 ANOVA
4.3 Stat Model Selection (*)
5 02/27/17 Ratings and Rankings + Rank Aggregation
5.1 Ratings and Rankings (Elo System)
5.2 Rank Aggregation (Borda Voting)
6 03/06/17 Prediction Using Regression Model
6.1 Introduction to Regression
6.2 Regression 2 Variables
6.3 Regression Multi Variables

7 03/13/17 Prediction Using Logistic Regression Model

7.1 Logistic Regression Mathematics
7.2 Logistic Regression Mechanics
7.3 Multivariable Logistic Regression