Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Fitting Distributions
João Neto
October 2015
Introduction
Using fitdistrplus
Choosing which distribution to fit
Fitting a distribution
Parameter estimates
Refs:
fitdistrplus: An R Package for Fitting Distributions (http://www.jstatsoft.org/article/view/v064i04/v64i04.pdf)
Introduction
The typical way to fit a distribution is to use function MASS::fitdistr :
library(MASS)
set.seed(101)
my_data <‐ rnorm(250, mean=1, sd=0.45) # unkonwn distribution parameters
fit <‐ fitdistr(my_data, densfun="normal") # we assume my_data ~ Normal(?,?)
fit
## mean sd
## 0.99613836 0.42405694
## (0.02681972) (0.01896440)
hist(my_data, pch=20, breaks=25, prob=TRUE, main="")
curve(dnorm(x, fit$estimate[1], fit$estimate[2]), col="red", lwd=2, add=T)
http://www.di.fc.ul.pt/~jpn/r/distributions/fitting.html 1/12
12/27/2016 Fitting Distributions
log_likelihood <‐ function(params) { ‐sum(dnorm(my_data, params[1], params[2], log=TRUE)) }
fit2 <‐ optim(c(0,1), log_likelihood) # c(0,1) are just initial guesses
fit2
## $par
## [1] 0.9962070 0.4240523
##
## $value
## [1] 140.2628
##
## $counts
## function gradient
## 65 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
http://www.di.fc.ul.pt/~jpn/r/distributions/fitting.html 2/12
12/27/2016 Fitting Distributions
hist(my_data, pch=20, breaks=25, prob=TRUE)
curve(dnorm(x, fit2$par[1], fit2$par[2]), col="blue", lwd=6, add=T) # optim fit
curve(dnorm(x, fit$estimate[1], fit$estimate[2]), col="red", lwd=2, add=T) # fitdistr fit
Using fitdistrplus
This tutorial uses the fitdistrplus package for fitting distributions.
library(fitdistrplus)
Choosing which distribution to fit
Let’s use one of the datasets provided by the packge:
data("groundbeef", package = "fitdistrplus")
my_data <‐ groundbeef$serving
plot(my_data, pch=20)
http://www.di.fc.ul.pt/~jpn/r/distributions/fitting.html 3/12
12/27/2016 Fitting Distributions
We can first plot the empirical density and the histogram to gain insight of the data:
plotdist(my_data, histo = TRUE, demp = TRUE)
http://www.di.fc.ul.pt/~jpn/r/distributions/fitting.html 4/12
12/27/2016 Fitting Distributions
Another tool is to show some descriptive statistics, like the moments, to help us in making a decision:
descdist(my_data, discrete=FALSE, boot=500)
http://www.di.fc.ul.pt/~jpn/r/distributions/fitting.html 5/12
12/27/2016 Fitting Distributions
## summary statistics
## ‐‐‐‐‐‐
## min: 10 max: 200
## median: 79
## mean: 73.64567
## estimated sd: 35.88487
## estimated skewness: 0.7352745
## estimated kurtosis: 3.551384
The plot also includes a nonparametric bootstrap procedure for the values of kurtosis and skewness.
Fitting a distribution
Say, in the previous eg, we chose the weibull, gamma and lognormal to fit:
fit_w <‐ fitdist(my_data, "weibull")
fit_g <‐ fitdist(my_data, "gamma")
fit_ln <‐ fitdist(my_data, "lnorm")
summary(fit_ln)
http://www.di.fc.ul.pt/~jpn/r/distributions/fitting.html 6/12
12/27/2016 Fitting Distributions
## Fitting of the distribution ' lnorm ' by maximum likelihood
## Parameters :
## estimate Std. Error
## meanlog 4.1693701 0.03366988
## sdlog 0.5366095 0.02380783
## Loglikelihood: ‐1261.319 AIC: 2526.639 BIC: 2533.713
## Correlation matrix:
## meanlog sdlog
## meanlog 1 0
## sdlog 0 1
we can plot the results:
par(mfrow=c(2,2))
plot.legend <‐ c("Weibull", "lognormal", "gamma")
denscomp(list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
cdfcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
qqcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
ppcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
http://www.di.fc.ul.pt/~jpn/r/distributions/fitting.html 7/12
12/27/2016 Fitting Distributions
The fitting can work with other nonbase distribution. It only needs that the correspodent, d, p, q functions are
implemented.
data("endosulfan", package = "fitdistrplus")
my_data <‐ endosulfan$ATV
fit_ln <‐ fitdist(my_data, "lnorm")
cdfcomp(fit_ln, xlogscale = TRUE, ylogscale = TRUE)
http://www.di.fc.ul.pt/~jpn/r/distributions/fitting.html 8/12
12/27/2016 Fitting Distributions
To solve this it is used the Burr and Pareto distributions available at package actuar
library(actuar)
fit_ll <‐ fitdist(my_data, "llogis", start = list(shape = 1, scale = 500))
fit_P <‐ fitdist(my_data, "pareto", start = list(shape = 1, scale = 500))
fit_B <‐ fitdist(my_data, "burr", start = list(shape1 = 0.3, shape2 = 1, rate = 1))
cdfcomp(list(fit_ln, fit_ll, fit_P, fit_B), xlogscale = TRUE, ylogscale = TRUE,
legendtext = c("lognormal", "loglogistic", "Pareto", "Burr"), lwd=2)
http://www.di.fc.ul.pt/~jpn/r/distributions/fitting.html 9/12
12/27/2016 Fitting Distributions
The package also provides some goodnessofit statistics:
gofstat(list(fit_ln, fit_ll, fit_P, fit_B), fitnames = c("lnorm", "llogis", "Pareto", "Burr"))
## Goodness‐of‐fit statistics
## lnorm llogis Pareto Burr
## Kolmogorov‐Smirnov statistic 0.1672498 0.1195888 0.08488002 0.06154925
## Cramer‐von Mises statistic 0.6373593 0.3827449 0.13926498 0.06803071
## Anderson‐Darling statistic 3.4721179 2.8315975 0.89206283 0.52393018
##
## Goodness‐of‐fit criteria
## lnorm llogis Pareto Burr
## Aikake's Information Criterion 1068.810 1069.246 1048.112 1045.731
## Bayesian Information Criterion 1074.099 1074.535 1053.400 1053.664
Visually and using the previous statistics, it seems that the Burr distribution seems the preferred one among the
candidates we chose to explore.
Parameter estimates
We can apply a bootstrap to estimate the uncertainty in the parameters:
ests <‐ bootdist(fit_B, niter = 1e3)
summary(ests)
http://www.di.fc.ul.pt/~jpn/r/distributions/fitting.html 10/12
12/27/2016 Fitting Distributions
## Parametric bootstrap medians and 95% percentile CI
## Median 2.5% 97.5%
## shape1 0.2024304 0.09504934 0.377547
## shape2 1.5802731 1.01204175 2.959759
## rate 1.4655764 0.65204235 2.702179
##
## The estimation method converged only for 995 among 1000 iterations
plot(ests)
quantile(ests, probs=.05) # 95% percentile bootstrap confidence interval
http://www.di.fc.ul.pt/~jpn/r/distributions/fitting.html 11/12
12/27/2016 Fitting Distributions
## (original) estimated quantiles for each specified probability (non‐censored data)
## p=0.05
## estimate 0.2939259
## Median of bootstrap estimates
## p=0.05
## estimate 0.3077854
##
## two‐sided 95 % CI of each quantile
## p=0.05
## 2.5 % 0.1707940
## 97.5 % 0.4960287
##
## The estimation method converged only for 995 among 1000 bootstrap iterations.
http://www.di.fc.ul.pt/~jpn/r/distributions/fitting.html 12/12