Fitting Distributions

12/27/2016 Fitting Distributions
Fitting Distributions
João Neto
October 2015
Introduction
Using fitdistrplus
Choosing which distribution to fit
Fitting a distribution
Parameter estimates
Refs:
fitdistrplus: An R Package for Fitting Distributions (http://www.jstatsoft.org/article/view/v064i04/v64i04.pdf)
Introduction
The typical way to fit a distribution is to use function MASS::fitdistr :
library(MASS)
set.seed(101)
my_data <‐ rnorm(250, mean=1, sd=0.45) # unkonwn distribution parameters

fit <‐ fitdistr(my_data, densfun="normal") # we assume my_data ~ Normal(?,?)
fit
## mean sd
## 0.99613836 0.42405694
## (0.02681972) (0.01896440)
hist(my_data, pch=20, breaks=25, prob=TRUE, main="")
curve(dnorm(x, fit$estimate[1], fit$estimate[2]), col="red", lwd=2, add=T)
http://www.di.fc.ul.pt/~jpn/r/distributions/fitting.html 1/12
fitdistr uses optim to estimate the parameter values by maximizing the likelihood function. So it works like

this:
log_likelihood <‐ function(params) { ‐sum(dnorm(my_data, params[1], params[2], log=TRUE)) }
fit2 <‐ optim(c(0,1), log_likelihood) # c(0,1) are just initial guesses
fit2
## $par
## [1] 0.9962070 0.4240523
##
## $value
## [1] 140.2628
##
## $counts
## function gradient
## 65 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
hist(my_data, pch=20, breaks=25, prob=TRUE)
curve(dnorm(x, fit2$par[1], fit2$par[2]), col="blue", lwd=6, add=T) # optim fit
curve(dnorm(x, fit$estimate[1], fit$estimate[2]), col="red", lwd=2, add=T) # fitdistr fit
Using fitdistrplus
This tutorial uses the fitdistrplus package for fitting distributions.
library(fitdistrplus)
Choosing which distribution to fit
Let’s use one of the datasets provided by the packge:
data("groundbeef", package = "fitdistrplus")
my_data <‐ groundbeef$serving
plot(my_data, pch=20)
We can first plot the empirical density and the histogram to gain insight of the data:
plotdist(my_data, histo = TRUE, demp = TRUE)
Another tool is to show some descriptive statistics, like the moments, to help us in making a decision:
descdist(my_data, discrete=FALSE, boot=500)
## summary statistics
## ‐‐‐‐‐‐
## min: 10 max: 200
## median: 79
## mean: 73.64567
## estimated sd: 35.88487
## estimated skewness: 0.7352745
## estimated kurtosis: 3.551384
The plot also includes a nonparametric bootstrap procedure for the values of kurtosis and skewness.
Fitting a distribution
Say, in the previous eg, we chose the weibull, gamma and lognormal to fit:
fit_w <‐ fitdist(my_data, "weibull")
fit_g <‐ fitdist(my_data, "gamma")
fit_ln <‐ fitdist(my_data, "lnorm")
summary(fit_ln)
## Fitting of the distribution ' lnorm ' by maximum likelihood
## Parameters :
## estimate Std. Error
## meanlog 4.1693701 0.03366988
## sdlog 0.5366095 0.02380783
## Loglikelihood: ‐1261.319 AIC: 2526.639 BIC: 2533.713
## Correlation matrix:
## meanlog sdlog
## meanlog 1 0
## sdlog 0 1
we can plot the results:
par(mfrow=c(2,2))
plot.legend <‐ c("Weibull", "lognormal", "gamma")
denscomp(list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
cdfcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
qqcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
ppcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
The fitting can work with other nonbase distribution. It only needs that the correspodent, d, p, q functions are
implemented.
In the next eg, the endosulfan dataset cannot be properly fit by the basic distributions like the lognormal:
data("endosulfan", package = "fitdistrplus")
my_data <‐ endosulfan$ATV

fit_ln <‐ fitdist(my_data, "lnorm")
cdfcomp(fit_ln, xlogscale = TRUE, ylogscale = TRUE)
To solve this it is used the Burr and Pareto distributions available at package actuar
library(actuar)

fit_ll <‐ fitdist(my_data, "llogis", start = list(shape = 1, scale = 500))
fit_P <‐ fitdist(my_data, "pareto", start = list(shape = 1, scale = 500))
fit_B <‐ fitdist(my_data, "burr", start = list(shape1 = 0.3, shape2 = 1, rate = 1))
cdfcomp(list(fit_ln, fit_ll, fit_P, fit_B), xlogscale = TRUE, ylogscale = TRUE,
legendtext = c("lognormal", "loglogistic", "Pareto", "Burr"), lwd=2)
The package also provides some goodnessofit statistics:
gofstat(list(fit_ln, fit_ll, fit_P, fit_B), fitnames = c("lnorm", "llogis", "Pareto", "Burr"))
## Goodness‐of‐fit statistics
## lnorm llogis Pareto Burr
## Kolmogorov‐Smirnov statistic 0.1672498 0.1195888 0.08488002 0.06154925
## Cramer‐von Mises statistic 0.6373593 0.3827449 0.13926498 0.06803071
## Anderson‐Darling statistic 3.4721179 2.8315975 0.89206283 0.52393018
##
## Goodness‐of‐fit criteria
## lnorm llogis Pareto Burr
## Aikake's Information Criterion 1068.810 1069.246 1048.112 1045.731
## Bayesian Information Criterion 1074.099 1074.535 1053.400 1053.664
Visually and using the previous statistics, it seems that the Burr distribution seems the preferred one among the
candidates we chose to explore.
Parameter estimates
We can apply a bootstrap to estimate the uncertainty in the parameters:
ests <‐ bootdist(fit_B, niter = 1e3)
summary(ests)
## Parametric bootstrap medians and 95% percentile CI
## Median 2.5% 97.5%
## shape1 0.2024304 0.09504934 0.377547
## shape2 1.5802731 1.01204175 2.959759
## rate 1.4655764 0.65204235 2.702179
##
## The estimation method converged only for 995 among 1000 iterations
plot(ests)
quantile(ests, probs=.05) # 95% percentile bootstrap confidence interval
## (original) estimated quantiles for each specified probability (non‐censored data)
## p=0.05
## estimate 0.2939259
## Median of bootstrap estimates
## p=0.05
## estimate 0.3077854
##
## two‐sided 95 % CI of each quantile
## p=0.05
## 2.5 % 0.1707940
## 97.5 % 0.4960287
##
## The estimation method converged only for 995 among 1000 bootstrap iterations.

Fitting Distributions

Caricato da

Informazioni sul documento

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

Fitting Distributions

Caricato da

Copyright:

Formati disponibili

12/27/2016 Fitting Distributions

fitdistr uses optim to estimate the parameter values by maximizing the likelihood function. So it works like

In the next eg, the endosulfan dataset cannot be properly fit by the basic distributions like the lognormal:

Potrebbero piacerti anche

Fitting Distributions

Caricato da

Informazioni sul documento

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

Fitting Distributions

Caricato da

Copyright:

Formati disponibili

12/27/2016 Fitting Distributions

fitdistr uses optim to estimate the parameter values by maximizing the likelihood function. So it works like

In the next eg, the endosulfan dataset cannot be properly fit by the basic distributions like the log­normal:

Potrebbero piacerti anche

In the next eg, the endosulfan dataset cannot be properly fit by the basic distributions like the lognormal: