45 Stats: Fitting Distributions
Purpose: We use distributions to model random quantities. However, in order to model physical phenomena, we should fit the distributions using data. In this short exercise you’ll learn some functions for fitting distributions to data.
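The package-loading messages below suggest the exercise starts by loading the tidyverse and MASS; a minimal setup chunk consistent with those messages might look like this (a sketch; broom is an assumption here, included because tidy() is used later in the exercise).
# Setup sketch: load the packages used in this exercise
library(tidyverse)  # dplyr, tidyr, ggplot2, ...
library(MASS)       # provides fitdistr()
library(broom)      # provides tidy(); assumed, since tidy() is called below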
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
45.1 Aside: Masking
Note that when we load the MASS and tidyverse packages, we will find that their functions conflict. To deal with this, we’ll need to learn how to specify a namespace when calling a function. To do this, use the :: notation; i.e. namespace::function. For instance, to call filter from dplyr, we would write dplyr::filter().
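As a quick, hypothetical illustration (not part of the original exercise), both versions of filter() stay reachable through their namespaces even though the bare name filter resolves to dplyr’s version:
# Hypothetical illustration of namespace notation
dplyr::filter(tibble(z = 1:5), z > 3)  # dplyr's filter(): keep rows where z > 3
stats::filter(1:5, rep(1 / 2, 2))      # stats' filter(): a moving-average filter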
One of the specific conflicts between MASS and tidyverse is the select function. Try running the chunk below; it will throw an error:
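The original chunk is not reproduced here; the following minimal example (with a hypothetical tibble df_example) triggers the same error:
# Hypothetical example: with MASS loaded after dplyr, the bare name select()
# resolves to MASS::select(), which does not accept column names
df_example <- tibble(u = 1:3, v = 4:6)
df_example %>% select(u)  # this call errors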
This error occurs because MASS also provides a select function, which masks dplyr::select() (as the package-loading messages above warned).
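The fix is the namespace notation from above; calling dplyr::select() explicitly sidesteps the masking (a sketch reusing the hypothetical df_example):
# Calling select() by its full namespace avoids the conflict
df_example %>% dplyr::select(u)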
45.2 Distribution Parameters and Fitting
The function rnorm() requires values for mean and sd; while rnorm() has defaults for these arguments, if we are trying to model a random event in the real world, we should set mean, sd based on data. The process of estimating parameters such as mean, sd for a distribution is called fitting. Fitting a distribution is often accomplished through maximum likelihood estimation (MLE); rather than discuss the gory details of MLE, we will simply use MLE as a technology to do useful work.
First, let’s look at an example of MLE carried out with the function MASS::fitdistr().
## NOTE: No need to edit this setup
set.seed(101)
df_data_norm <- tibble(x = rnorm(50, mean = 2, sd = 1))
## NOTE: Example use of fitdistr()
df_est_norm <-
  df_data_norm %>%
  pull(x) %>%
  fitdistr(densfun = "normal") %>%
  tidy()
df_est_norm
## # A tibble: 2 × 3
## term estimate std.error
## <chr> <dbl> <dbl>
## 1 mean 1.88 0.131
## 2 sd 0.923 0.0923
Notes:
- fitdistr() takes a vector; I use the function pull(x) to pull the vector x out of the dataframe.
- fitdistr() returns a messy output; the function broom::tidy() automagically cleans up the output and provides a tibble (see the sketch below).
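To see why tidy() helps, compare against the raw return value of fitdistr() (a sketch using df_data_norm from above; fitdistr() returns a "fitdistr" object rather than a dataframe):
# Sketch: the un-tidied output of fitdistr()
fit_raw <-
  df_data_norm %>%
  pull(x) %>%
  fitdistr(densfun = "normal")
fit_raw            # prints estimates with standard errors, but not as a dataframe
fit_raw$estimate   # the estimates live in a named vector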
45.2.1 q1 Compute the sample mean and standard deviation of x in df_data_norm. Compare these values to those you computed with fitdistr().
## TASK: Compute the sample mean and sd of `df_data_norm %>% pull(x)`
mean_est <- df_data_norm %>% pull(x) %>% mean()
sd_est <- df_data_norm %>% pull(x) %>% sd()
mean_est
## [1] 1.876029
sd_est
## [1] 0.9321467
Observations:
- The mean estimates match exactly. The standard deviations are close but not identical (0.932 vs. 0.923): sd() divides by n - 1, while the maximum likelihood estimate from fitdistr() divides by n, as the quick check below shows.
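A quick check of that claim (a sketch using the objects defined above): rescaling the sample standard deviation by sqrt((n - 1) / n) recovers the maximum likelihood estimate.
# Rescale the sample sd to the MLE's denominator; this reproduces
# fitdistr()'s estimate of roughly 0.923
n <- nrow(df_data_norm)
sd_est * sqrt((n - 1) / n)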
Estimating parameters for a normal distribution is easy because it is parameterized in terms of the mean and standard deviation. The advantage of using fitdistr() is that it will allow us to work with a much wider selection of distribution models.
45.2.2 q2 Use the function fitdistr() to fit a "weibull" distribution to the realizations y in df_data_weibull.
Note: The Weibull distribution is used to model many physical phenomena, including the strength of composite materials.
## NOTE: No need to edit this setup
set.seed(101)
df_data_weibull <- tibble(y = rweibull(50, shape = 2, scale = 4))
## TASK: Use the `fitdistr()` function to estimate parameters
df_q2 <-
  df_data_weibull %>%
  pull(y) %>%
  fitdistr(densfun = "weibull") %>%
  tidy()
df_q2
## # A tibble: 2 × 3
## term estimate std.error
## <chr> <dbl> <dbl>
## 1 shape 2.18 0.239
## 2 scale 4.00 0.274
Once we’ve fit a distribution, we can use the estimated parameters to approximate quantities like probabilities. If we were using the distribution for y to model a material strength, we would estimate probabilities to compute the rate of failure for mechanical components; we could then use this information to make design decisions.
45.2.3 q3 Extract the estimates shape_est and scale_est from df_q2, and use them to estimate the probability that Y <= 2.
Hint: pr_true contains the true probability; modify that code to compute the estimated probability.
## NOTE: No need to modify this line
pr_true <- pweibull(q = 2, shape = 2, scale = 4)
set.seed(101)
shape_est <-
  df_q2 %>%
  filter(term == "shape") %>%
  pull(estimate)
scale_est <-
  df_q2 %>%
  filter(term == "scale") %>%
  pull(estimate)
pr_est <- pweibull(q = 2, shape = shape_est, scale = scale_est)
pr_true
## [1] 0.2211992
pr_est
## [1] 0.1988446
You’ll probably find that pr_true != pr_est! As we saw in e-stat06-clt, we should really compute a confidence interval to assess our degree of confidence in this probability estimate. However, it’s not obvious how we can use the ideas of the Central Limit Theorem to put a confidence interval around pr_est. In the next exercise we’ll learn a very general technique for estimating confidence intervals.
45.3 Notes
[1] For another tutorial on fitting distributions in R, see this R-bloggers post.