This is a simplified guide for selecting and fitting a probability model to a dataset. For detailed information, a good start point is the Wiki page for probability theory.
Exercise One
Let’s start by creating our own dataset and using the fitdistr function from the MASS package to fit the distribution
First we create 250 observations of sample data from a Weibull Distribution
library(MASS)
set.seed(100)
weibull_data <- rweibull(250, shape = 2, scale = 5)
Next we visualize the data
hist(weibull_data, breaks = 25, main = "Histogram of Weibull Data")

Per the generic steps described above, we can propose a model based on judgment. In this case we know the data can fit a weibull distribution
fit_weibull = fitdistr(weibull_data, densfun = "weibull")
fit_weibull
shape scale
2.0771423 4.9178426
(0.1027845) (0.1577411)
We know that the parameters used to generate the random data were shape = 2, scale = 5. From the results above, the fitdistr function is telling us that the shape is 2.07 and the scale is 4.9. Close enough. Note that fitdistr is using the Method of Maximum Likelihood to estimate the distribution parameters. To learn more, visit the documentation website -> https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html
We can then plot the histogram for the sample data to the fitted probability density function
hist(weibull_data, breaks = 25, probability = TRUE, main = "Histogram vs Fitted Curve Weibull Data")
curve(dweibull(x, fit_weibull$estimate[1], fit_weibull$estimate[2]), col="red", lwd=2, add = T)

Exercise Two
Great, that was a good example. But what if we have no idea of what distribution fits the data? Let’s explore the use of a different library: fitdistrplus
library(fitdistrplus)
fitdistrplus is another method in R for fitting distributions. For this exercise, lets use the data set Murders from the package dslabs. This package is put together by Rafael Irizarry from Harvard University for the edX online HarvardX Data Science program
library(dslabs)
data("murders")
This data set contains murder data in the United States of America.
Let’s do some data exploration
str(murders)
'data.frame': 51 obs. of 5 variables:
$ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
$ abb : chr "AL" "AK" "AZ" "AR" ...
$ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
$ population: num 4779736 710231 6392017 2915918 37253956 ...
$ total : num 135 19 232 93 1257 ...
head(murders)
We will work with the total number of murders per U.S. state and try to fit a distribution for this data. Let’s visualize it.
hist(murders$total,breaks = 20, main = "Histogram of total murders in the U.S. per State")

Instead of creating a histogram, we could also plot the empirical probability density and cumulative distribution.
plotdist(murders$total, histo = TRUE, demp = TRUE)

Another tool is to show descriptive statistics, like the moments, to help us in making a decision
descdist(murders$total, discrete = FALSE, boot = 100)
summary statistics
------
min: 2 max: 1257
median: 97
mean: 184.3725
estimated sd: 236.1261
estimated skewness: 2.507494
estimated kurtosis: 11.1794

Now, before we continue with the interpretation of the results let’s make something clear: Real data doesn’t necessarily follow any particular distribution. Parametric distributions are not a perfect description of reality.
With that being said, the above results are telling us that the skewness and kurtosis are consistent with gamma, exponential, and lognorm distributions. The plot also includes a nonparametric bootstrap procedure for the values of kurtosis and skewness.
Fitting a distribution
From the plot above, we choose lognorm, exponential and gamma distributions to fit the data using fitdist from fitdistrplus.
exponential_fit = fitdist(murders$total, "exp")
gamma_fit = fitdist(murders$total, "gamma")
NaNs producedNaNs producedNaNs producedNaNs producedNaNs produced
lognormal_fit = fitdist(murders$total, "lnorm")
And then we can plot the results
plot.legend <- c("Exponential", "Gamma", "Lognormal")
denscomp(list(exponential_fit, gamma_fit, lognormal_fit), legendtext = plot.legend)

cdfcomp (list(exponential_fit, gamma_fit, lognormal_fit), legendtext = plot.legend)

qqcomp (list(exponential_fit, gamma_fit, lognormal_fit), legendtext = plot.legend)

ppcomp (list(exponential_fit, gamma_fit, lognormal_fit), legendtext = plot.legend)

We know that trying to obtain a solution from a visualization can be subjective, so we use some goodness-of-fit quantitative methods.
gofstat(list(exponential_fit, gamma_fit, lognormal_fit), fitnames = c("Exponential", "Gamma", "LogNorm"))
Goodness-of-fit statistics
Exponential Gamma LogNorm
Kolmogorov-Smirnov statistic 0.1470814 0.07555767 0.1173972
Cramer-von Mises statistic 0.2846769 0.04689658 0.1170886
Anderson-Darling statistic 1.9855232 0.30833778 0.6811503
Goodness-of-fit criteria
Exponential Gamma LogNorm
Akaike's Information Criterion 636.1298 632.9223 635.3197
Bayesian Information Criterion 638.0616 636.7860 639.1833
From all the above statistics, including K-S test, AIC, BIC, etc, we see that the Gamma function is the best fit for the data. Let’s get the parameters of the distribution and plot them against the data.
gamma_fit
Fitting of the distribution ' gamma ' by maximum likelihood
Parameters:
hist(murders$total, breaks = 20, probability = TRUE, main = "Histogram vs Fitted Curve Data")
curve(dgamma(x, shape = gamma_fit$estimate[1], rate = gamma_fit$estimate[2]), col="red", lwd=2, add = T)

Great! we were able to fit a distribution and obtain the paremeters in order to solve any further questions.
LS0tCnRpdGxlOiAiUHJvYmFiaWxpdHkgTW9kZWwgU2VsZWN0aW9uIHVzaW5nIFIiCmF1dGhvcjogIkNocmlzdG9waGVyIE1hcnRpbmV6IFNlcnZpbiIKZGF0ZTogIjgvOS8yMDE4IgpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCgpUaGlzIGlzIGEgc2ltcGxpZmllZCBndWlkZSBmb3Igc2VsZWN0aW5nIGFuZCBmaXR0aW5nIGEgcHJvYmFiaWxpdHkgbW9kZWwgdG8gYSBkYXRhc2V0LiBGb3IgZGV0YWlsZWQgaW5mb3JtYXRpb24sIGEgZ29vZCBzdGFydCBwb2ludCBpcyB0aGUgW1dpa2kgcGFnZV0oaHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvUHJvYmFiaWxpdHlfdGhlb3J5KSBmb3IgKnByb2JhYmlsaXR5IHRoZW9yeSouCgojIyMgR2VuZXJpYyBTdGVwcwo8aHR0cHM6Ly9jcmFuLnItcHJvamVjdC5vcmcvPgoKMSkgQ29tcHV0ZSB0aGUgbW9tZW50cyAobWVhbiwgc3RkLiBkZXYuLCBldGMuKSBmcm9tIGRhdGEgIAoyKSBQcm9wb3NlIGEgbW9kZWwgYmFzZWQgb24ganVkZ21lbnQsIHBhc3QgZXhwZXJpZW5jZSwgYW5kIHBoeXNpY2FsIGNvbnN0cmFpbnRzICAKMykgQ2FsaWJyYXRlIHRoZSBwcm9wb3NlZCBtb2RlbCBhbmQgZml0IGRpc3RyaWJ1dGlvbiBwYXJhbWV0ZXJzIHVzaW5nIHRlY2huaXF1ZXMgc3VjaCBhcyAqKk1ldGhvZCBvZiBNb21lbnRzKiosICoqTWV0aG9kIG9mIE1heGltdW0gTGlrZWxpaG9vZCoqLCAqKkdyYXBoaWNhbCB0ZWNobmlxdWUgJ1Byb2JhYmlsaXR5IFBsb3QnKiosIGV0Yy4KNCkgTWFrZSAqUXVhbGl0YXRpdmVseSBhbmQgUXVhbnRpdGF0aXZlKiBjb21wYXJpc29ucyAgCiAgICAgICBhKSBOb3JtYWxpemVkIFNhbXBsZSBIaXN0b2dyYW0gdnMuIEZpdHRlZCBQREYKICAgICAgIGIpIFNhbXBsZSBFbXBpcmljYWwgQ0RGIHZzLiBGaXR0ZWQgQ0RGCiAgICAgICBjKSBTdGF0aXN0aWNhbCBnb29kbmVzcy1vZi1maXQgdGVzdHMgc3VjaCBhcyAqJFJeMiQgZm9yIHJlZ3Jlc3Npb24qLCAqJFxjaGleMiQtdGVzdCosICpLLVMgdGVzdCosIGV0Yy4KCiMjIyBFeGVyY2lzZSBPbmUKTGV0J3Mgc3RhcnQgYnkgY3JlYXRpbmcgb3VyIG93biBkYXRhc2V0IGFuZCB1c2luZyB0aGUgKipmaXRkaXN0cioqIGZ1bmN0aW9uIGZyb20gdGhlICoqTUFTUyoqIHBhY2thZ2UgdG8gZml0IHRoZSBkaXN0cmlidXRpb24KCkZpcnN0IHdlIGNyZWF0ZSAyNTAgb2JzZXJ2YXRpb25zIG9mIHNhbXBsZSBkYXRhIGZyb20gYSBXZWlidWxsIERpc3RyaWJ1dGlvbiAgCmBgYHtyfQpsaWJyYXJ5KE1BU1MpCnNldC5zZWVkKDEwMCkKd2VpYnVsbF9kYXRhIDwtIHJ3ZWlidWxsKDI1MCwgc2hhcGUgPSAyLCBzY2FsZSA9IDUpCmBgYAoKTmV4dCB3ZSB2aXN1YWxpemUgdGhlIGRhdGEKYGBge3J9Cmhpc3Qod2VpYnVsbF9kYXRhLCBicmVha3MgPSAyNSwgbWFpbiA9ICJIaXN0b2dyYW0gb2YgV2VpYnVsbCBEYXRhIikKYGBgClBlciB0aGUgZ2VuZXJpYyBzdGVwcyBkZXNjcmliZWQgYWJvdmUsIHdlIGNhbiBwcm9wb3NlIGEgbW9kZWwgYmFzZWQgb24ganVkZ21lbnQuIEluIHRoaXMgY2FzZSB3ZSBrbm93IHRoZSBkYXRhIGNhbiBmaXQgYSB3ZWlidWxsIGRpc3RyaWJ1dGlvbgpgYGB7cn0KZml0X3dlaWJ1bGwgPSBmaXRkaXN0cih3ZWlidWxsX2RhdGEsIGRlbnNmdW4gPSAid2VpYnVsbCIpCmZpdF93ZWlidWxsCmBgYAoKV2Uga25vdyB0aGF0IHRoZSBwYXJhbWV0ZXJzIHVzZWQgdG8gZ2VuZXJhdGUgdGhlIHJhbmRvbSBkYXRhIHdlcmUgc2hhcGUgPSAyLCBzY2FsZSA9IDUuIEZyb20gdGhlIHJlc3VsdHMgYWJvdmUsIHRoZSAqKmZpdGRpc3RyKiogZnVuY3Rpb24gaXMgdGVsbGluZyB1cyB0aGF0IHRoZSBzaGFwZSBpcyAyLjA3IGFuZCB0aGUgc2NhbGUgaXMgNC45LiBDbG9zZSBlbm91Z2guIE5vdGUgdGhhdCAqKmZpdGRpc3RyKiogaXMgdXNpbmcgdGhlICpNZXRob2Qgb2YgTWF4aW11bSBMaWtlbGlob29kKiB0byBlc3RpbWF0ZSB0aGUgZGlzdHJpYnV0aW9uIHBhcmFtZXRlcnMuIFRvIGxlYXJuIG1vcmUsIHZpc2l0IHRoZSBkb2N1bWVudGF0aW9uIHdlYnNpdGUgLT4gPGh0dHBzOi8vc3RhdC5ldGh6LmNoL1ItbWFudWFsL1ItZGV2ZWwvbGlicmFyeS9NQVNTL2h0bWwvZml0ZGlzdHIuaHRtbD4KCldlIGNhbiB0aGVuIHBsb3QgdGhlIGhpc3RvZ3JhbSBmb3IgdGhlIHNhbXBsZSBkYXRhIHRvIHRoZSBmaXR0ZWQgcHJvYmFiaWxpdHkgZGVuc2l0eSBmdW5jdGlvbgpgYGB7cn0KaGlzdCh3ZWlidWxsX2RhdGEsIGJyZWFrcyA9IDI1LCBwcm9iYWJpbGl0eSA9IFRSVUUsIG1haW4gPSAiSGlzdG9ncmFtIHZzIEZpdHRlZCBDdXJ2ZSBXZWlidWxsIERhdGEiKQpjdXJ2ZShkd2VpYnVsbCh4LCBmaXRfd2VpYnVsbCRlc3RpbWF0ZVsxXSwgZml0X3dlaWJ1bGwkZXN0aW1hdGVbMl0pLCBjb2w9InJlZCIsIGx3ZD0yLCBhZGQgPSBUKQpgYGAKCiMjIyBFeGVyY2lzZSBUd28KR3JlYXQsIHRoYXQgd2FzIGEgZ29vZCBleGFtcGxlLiBCdXQgd2hhdCBpZiB3ZSBoYXZlIG5vIGlkZWEgb2Ygd2hhdCBkaXN0cmlidXRpb24gZml0cyB0aGUgZGF0YT8gTGV0J3MgZXhwbG9yZSB0aGUgdXNlIG9mIGEgZGlmZmVyZW50IGxpYnJhcnk6ICoqZml0ZGlzdHJwbHVzKioKYGBge3J9CmxpYnJhcnkoZml0ZGlzdHJwbHVzKQpgYGAKKipmaXRkaXN0cnBsdXMqKiBpcyBhbm90aGVyIG1ldGhvZCBpbiBSIGZvciBmaXR0aW5nIGRpc3RyaWJ1dGlvbnMuIEZvciB0aGlzIGV4ZXJjaXNlLCBsZXRzIHVzZSB0aGUgZGF0YSBzZXQgKk11cmRlcnMqIGZyb20gdGhlIHBhY2thZ2UgKmRzbGFicyouIFRoaXMgcGFja2FnZSBpcyBwdXQgdG9nZXRoZXIgYnkgWypSYWZhZWwgSXJpemFycnkqXShodHRwOi8vcmFmYWxhYi5naXRodWIuaW8pIGZyb20gKkhhcnZhcmQgVW5pdmVyc2l0eSogZm9yIHRoZSBlZFggb25saW5lIFsqSGFydmFyZFggRGF0YSBTY2llbmNlIHByb2dyYW0qXShodHRwczovL3d3dy5lZHgub3JnL3Byb2Zlc3Npb25hbC1jZXJ0aWZpY2F0ZS9oYXJ2YXJkeC1kYXRhLXNjaWVuY2UpCmBgYHtyfQpsaWJyYXJ5KGRzbGFicykKZGF0YSgibXVyZGVycyIpCmBgYApUaGlzIGRhdGEgc2V0IGNvbnRhaW5zIG11cmRlciBkYXRhIGluIHRoZSBVbml0ZWQgU3RhdGVzIG9mIEFtZXJpY2EuICAKCkxldCdzIGRvIHNvbWUgZGF0YSBleHBsb3JhdGlvbgpgYGB7cn0Kc3RyKG11cmRlcnMpCmBgYApgYGB7cn0KaGVhZChtdXJkZXJzKQpgYGAKCldlIHdpbGwgd29yayB3aXRoIHRoZSB0b3RhbCBudW1iZXIgb2YgbXVyZGVycyBwZXIgVS5TLiBzdGF0ZSBhbmQgdHJ5IHRvIGZpdCBhIGRpc3RyaWJ1dGlvbiBmb3IgdGhpcyBkYXRhLiBMZXQncyB2aXN1YWxpemUgaXQuCgpgYGB7cn0KaGlzdChtdXJkZXJzJHRvdGFsLGJyZWFrcyA9IDIwLCBtYWluID0gIkhpc3RvZ3JhbSBvZiB0b3RhbCBtdXJkZXJzIGluIHRoZSBVLlMuIHBlciBTdGF0ZSIpCmBgYApJbnN0ZWFkIG9mIGNyZWF0aW5nIGEgaGlzdG9ncmFtLCB3ZSBjb3VsZCBhbHNvIHBsb3QgdGhlIGVtcGlyaWNhbCBwcm9iYWJpbGl0eSBkZW5zaXR5IGFuZCBjdW11bGF0aXZlIGRpc3RyaWJ1dGlvbi4KYGBge3J9CnBsb3RkaXN0KG11cmRlcnMkdG90YWwsIGhpc3RvID0gVFJVRSwgZGVtcCA9IFRSVUUpCmBgYApBbm90aGVyIHRvb2wgaXMgdG8gc2hvdyBkZXNjcmlwdGl2ZSBzdGF0aXN0aWNzLCBsaWtlIHRoZSBtb21lbnRzLCB0byBoZWxwIHVzIGluIG1ha2luZyBhIGRlY2lzaW9uCmBgYHtyfQpkZXNjZGlzdChtdXJkZXJzJHRvdGFsLCBkaXNjcmV0ZSA9IEZBTFNFLCBib290ID0gMTAwKQpgYGAKCk5vdywgYmVmb3JlIHdlIGNvbnRpbnVlIHdpdGggdGhlIGludGVycHJldGF0aW9uIG9mIHRoZSByZXN1bHRzIGxldCdzIG1ha2Ugc29tZXRoaW5nIGNsZWFyOiBSZWFsIGRhdGEgZG9lc24ndCBuZWNlc3NhcmlseSBmb2xsb3cgYW55IHBhcnRpY3VsYXIgZGlzdHJpYnV0aW9uLiAqKlBhcmFtZXRyaWMgZGlzdHJpYnV0aW9ucyBhcmUgbm90IGEgcGVyZmVjdCBkZXNjcmlwdGlvbiBvZiByZWFsaXR5LioqCgpXaXRoIHRoYXQgYmVpbmcgc2FpZCwgdGhlIGFib3ZlIHJlc3VsdHMgYXJlIHRlbGxpbmcgdXMgdGhhdCB0aGUgc2tld25lc3MgYW5kIGt1cnRvc2lzIGFyZSBjb25zaXN0ZW50IHdpdGggZ2FtbWEsIGV4cG9uZW50aWFsLCBhbmQgbG9nbm9ybSBkaXN0cmlidXRpb25zLiBUaGUgcGxvdCBhbHNvIGluY2x1ZGVzIGEgbm9ucGFyYW1ldHJpYyBib290c3RyYXAgcHJvY2VkdXJlIGZvciB0aGUgdmFsdWVzIG9mIGt1cnRvc2lzIGFuZCBza2V3bmVzcy4KCiMjIyMgRml0dGluZyBhIGRpc3RyaWJ1dGlvbgpGcm9tIHRoZSBwbG90IGFib3ZlLCB3ZSBjaG9vc2UgbG9nbm9ybSwgZXhwb25lbnRpYWwgYW5kIGdhbW1hIGRpc3RyaWJ1dGlvbnMgdG8gZml0IHRoZSBkYXRhIHVzaW5nICoqZml0ZGlzdCoqIGZyb20gKipmaXRkaXN0cnBsdXMqKi4KCgpgYGB7cn0KZXhwb25lbnRpYWxfZml0ID0gZml0ZGlzdChtdXJkZXJzJHRvdGFsLCAiZXhwIikKZ2FtbWFfZml0ID0gZml0ZGlzdChtdXJkZXJzJHRvdGFsLCAiZ2FtbWEiKQpsb2dub3JtYWxfZml0ID0gZml0ZGlzdChtdXJkZXJzJHRvdGFsLCAibG5vcm0iKQpgYGAKQW5kIHRoZW4gd2UgY2FuIHBsb3QgdGhlIHJlc3VsdHMKYGBge3J9CnBsb3QubGVnZW5kIDwtIGMoIkV4cG9uZW50aWFsIiwgIkdhbW1hIiwgIkxvZ25vcm1hbCIpCmRlbnNjb21wKGxpc3QoZXhwb25lbnRpYWxfZml0LCBnYW1tYV9maXQsIGxvZ25vcm1hbF9maXQpLCBsZWdlbmR0ZXh0ID0gcGxvdC5sZWdlbmQpCmNkZmNvbXAgKGxpc3QoZXhwb25lbnRpYWxfZml0LCBnYW1tYV9maXQsIGxvZ25vcm1hbF9maXQpLCBsZWdlbmR0ZXh0ID0gcGxvdC5sZWdlbmQpCnFxY29tcCAgKGxpc3QoZXhwb25lbnRpYWxfZml0LCBnYW1tYV9maXQsIGxvZ25vcm1hbF9maXQpLCBsZWdlbmR0ZXh0ID0gcGxvdC5sZWdlbmQpCnBwY29tcCAgKGxpc3QoZXhwb25lbnRpYWxfZml0LCBnYW1tYV9maXQsIGxvZ25vcm1hbF9maXQpLCBsZWdlbmR0ZXh0ID0gcGxvdC5sZWdlbmQpCmBgYApXZSBrbm93IHRoYXQgdHJ5aW5nIHRvIG9idGFpbiBhIHNvbHV0aW9uIGZyb20gYSB2aXN1YWxpemF0aW9uIGNhbiBiZSBzdWJqZWN0aXZlLCBzbyB3ZSB1c2Ugc29tZSBnb29kbmVzcy1vZi1maXQgcXVhbnRpdGF0aXZlIG1ldGhvZHMuCmBgYHtyfQpnb2ZzdGF0KGxpc3QoZXhwb25lbnRpYWxfZml0LCBnYW1tYV9maXQsIGxvZ25vcm1hbF9maXQpLCBmaXRuYW1lcyA9IGMoIkV4cG9uZW50aWFsIiwgIkdhbW1hIiwgIkxvZ05vcm0iKSkKYGBgCkZyb20gYWxsIHRoZSBhYm92ZSBzdGF0aXN0aWNzLCBpbmNsdWRpbmcgSy1TIHRlc3QsIEFJQywgQklDLCBldGMsIHdlIHNlZSB0aGF0IHRoZSBHYW1tYSBmdW5jdGlvbiBpcyB0aGUgYmVzdCBmaXQgZm9yIHRoZSBkYXRhLiBMZXQncyBnZXQgdGhlIHBhcmFtZXRlcnMgb2YgdGhlIGRpc3RyaWJ1dGlvbiBhbmQgcGxvdCB0aGVtIGFnYWluc3QgdGhlIGRhdGEuCmBgYHtyfQpnYW1tYV9maXQKYGBgCmBgYHtyfQpoaXN0KG11cmRlcnMkdG90YWwsIGJyZWFrcyA9IDIwLCBwcm9iYWJpbGl0eSA9IFRSVUUsIG1haW4gPSAiSGlzdG9ncmFtIHZzIEZpdHRlZCBDdXJ2ZSBEYXRhIikKY3VydmUoZGdhbW1hKHgsIHNoYXBlID0gZ2FtbWFfZml0JGVzdGltYXRlWzFdLCByYXRlID0gZ2FtbWFfZml0JGVzdGltYXRlWzJdKSwgY29sPSJyZWQiLCBsd2Q9MiwgYWRkID0gVCkKYGBgCkdyZWF0ISB3ZSB3ZXJlIGFibGUgdG8gZml0IGEgZGlzdHJpYnV0aW9uIGFuZCBvYnRhaW4gdGhlIHBhcmVtZXRlcnMgaW4gb3JkZXIgdG8gc29sdmUgYW55IGZ1cnRoZXIgcXVlc3Rpb25zLgo=