37 Data: Reading Excel Sheets

Purpose: The Tidyverse is built to work with tidy data. Unfortunately, most data in the wild are not tidy. The good news is that we have a lot of tools for wrangling data into tidy form. The bad news is that “every untidy dataset is untidy in its own way.” I can’t show you you every crazy way people decide to store their data. But I can walk you through a worked example to show some common techniques.

In this case study, I’ll take you through the process of wrangling a messy Excel spreadsheet into machine-readable form. You will both learn some general tools for wrangling data, and you can keep this notebook as a recipe for future messy datasets of similar form.

Reading: (None)

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readxl) # For reading Excel sheets
library(httr) # For downloading files

## Use my tidy-exercises copy of UNDOC data for stability
url_undoc <- "https://github.com/zdelrosario/tidy-exercises/blob/master/2019/2019-12-10-news-plots/GSH2013_Homicide_count_and_rate.xlsx?raw=true"
filename <- "./data/undoc.xlsx"

I keep a copy of the example data in a personal repo; download a local copy.

## NOTE: No need to edit
curl::curl_download(
        url_undoc,
        filename
      )

37.1 Wrangling Basics

37.1.1 q1 Run the following code and pay attention to the column names. Open the downloaded Excel sheet and compare. Why are the column names so weird?

## NOTE: No need to edit; run and inspect
df_q1 <- read_excel(filename)

## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
## • `` -> `...16`
## • `` -> `...17`
## • `` -> `...18`
## • `` -> `...19`

df_q1 %>% glimpse

## Rows: 447
## Columns: 19
## $ `Intentional homicide count and rate per 100,000 population, by country/territory (2000-2012)` <chr> …
## $ ...2                                                                                           <chr> …
## $ ...3                                                                                           <chr> …
## $ ...4                                                                                           <chr> …
## $ ...5                                                                                           <chr> …
## $ ...6                                                                                           <chr> …
## $ ...7                                                                                           <chr> …
## $ ...8                                                                                           <dbl> …
## $ ...9                                                                                           <dbl> …
## $ ...10                                                                                          <dbl> …
## $ ...11                                                                                          <dbl> …
## $ ...12                                                                                          <dbl> …
## $ ...13                                                                                          <dbl> …
## $ ...14                                                                                          <dbl> …
## $ ...15                                                                                          <dbl> …
## $ ...16                                                                                          <chr> …
## $ ...17                                                                                          <chr> …
## $ ...18                                                                                          <chr> …
## $ ...19                                                                                          <chr> …

Observations:

The top row is filled with expository text. The actual column names are several rows down.

Most read_ functions have a skip argument you can use to skip over the first few lines. Use this argument in the next task to deal with the top of the Excel sheet.

37.1.2 q2 Read the Excel sheet.

Open the target Excel sheet (located at ./data/undoc.xlsx) and find which line (row) at which the year column headers are located. Use the skip keyword to start your read at that line.

## TODO:
df_q2 <- read_excel(
  filename,
  skip = 6
)

## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`

df_q2 %>% glimpse

## Rows: 444
## Columns: 19
## $ ...1   <chr> "Africa", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ ...2   <chr> "Eastern Africa", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ ...3   <chr> "Burundi", NA, "Comoros", NA, "Djibouti", NA, "Eritrea", NA, "E…
## $ ...4   <chr> "PH", NA, "PH", NA, "PH", NA, "PH", NA, "PH", NA, "CJ", NA, "PH…
## $ ...5   <chr> "WHO", NA, "WHO", NA, "WHO", NA, "WHO", NA, "WHO", NA, "CTS", N…
## $ ...6   <chr> "Rate", "Count", "Rate", "Count", "Rate", "Count", "Rate", "Cou…
## $ `2000` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 6.2, 70…
## $ `2001` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 7.7, 90…
## $ `2002` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 4.8, 56…
## $ `2003` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2.5, 30…
## $ `2004` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 4.0, 1395.0, NA, NA, 3.…
## $ `2005` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3.5, 1260.0, NA, NA, 1.…
## $ `2006` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3.5, 1286.0, NA, NA, 6.…
## $ `2007` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3.4, 1281.0, NA, NA, 5.…
## $ `2008` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3.6, 1413.0, NA, NA, 5.…
## $ `2009` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "5.6", "2218", NA, NA, …
## $ `2010` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "5.5", "2239", NA, NA, …
## $ `2011` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "6.3", "2641", NA, NA, …
## $ `2012` <chr> "8", "790", "10", "72", "10.1", "87", "7.1", "437", "12", "1104…

Use the following test to check your work.

## NOTE: No need to change this
assertthat::assert_that(setequal(
              (df_q2 %>% names() %>% .[7:19]),
              as.character(seq(2000, 2012))
            ))

## [1] TRUE

print("Nice!")

## [1] "Nice!"

Let’s take stock of where we are:

df_q2 %>% head()

## # A tibble: 6 × 19
##   ...1   ...2  ...3  ...4  ...5  ...6  `2000` `2001` `2002` `2003` `2004` `2005`
##   <chr>  <chr> <chr> <chr> <chr> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Africa East… Buru… PH    WHO   Rate      NA     NA     NA     NA     NA     NA
## 2 <NA>   <NA>  <NA>  <NA>  <NA>  Count     NA     NA     NA     NA     NA     NA
## 3 <NA>   <NA>  Como… PH    WHO   Rate      NA     NA     NA     NA     NA     NA
## 4 <NA>   <NA>  <NA>  <NA>  <NA>  Count     NA     NA     NA     NA     NA     NA
## 5 <NA>   <NA>  Djib… PH    WHO   Rate      NA     NA     NA     NA     NA     NA
## 6 <NA>   <NA>  <NA>  <NA>  <NA>  Count     NA     NA     NA     NA     NA     NA
## # ℹ 7 more variables: `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <chr>,
## #   `2010` <chr>, `2011` <chr>, `2012` <chr>

We still have problems:

The first few columns don’t have sensible names. The col_names argument allows us to set manual names at the read phase.
Some of the columns are of the wrong type; for instance 2012 is a chr vector. We can use the col_types argument to set manual column types.

37.1.3 q3 Change the column names and types.

Use the provided names in col_names_undoc with the col_names argument to set manual column names. Use the col_types argument to set all years to "numeric", and the rest to "text".

Hint 1: Since you’re providing manual col_names, you will have to adjust your skip value!

Hint 2: You can use a named vector for col_types to help keep type of which type is assigned to which variable, for instance c("variable" = "type").

## NOTE: Use these column names
col_names_undoc <-
  c(
    "region",
    "sub_region",
    "territory",
    "source",
    "org",
    "indicator",
    "2000",
    "2001",
    "2002",
    "2003",
    "2004",
    "2005",
    "2006",
    "2007",
    "2008",
    "2009",
    "2010",
    "2011",
    "2012"
  )

## TASK: Use the arguments `skip`, `col_names`, and `col_types`
df_q3 <- read_excel(
  filename,
  sheet = 1,
  skip = 7,
  col_names = col_names_undoc,
  col_types = c(
    "region"     = "text",
    "sub_region" = "text",
    "territory"  = "text",
    "source"     = "text",
    "org"        = "text",
    "indicator"  = "text",
    "2000" = "numeric",
    "2001" = "numeric",
    "2002" = "numeric",
    "2003" = "numeric",
    "2004" = "numeric",
    "2005" = "numeric",
    "2006" = "numeric",
    "2007" = "numeric",
    "2008" = "numeric",
    "2009" = "numeric",
    "2010" = "numeric",
    "2011" = "numeric",
    "2012" = "numeric"
  )
)

## Warning: Expecting numeric in P315 / R315C16: got '2366*'

## Warning: Expecting numeric in Q315 / R315C17: got '1923*'

## Warning: Expecting numeric in R315 / R315C18: got '1866*'

## Warning: Expecting numeric in S381 / R381C19: got 'x'

## Warning: Expecting numeric in S431 / R431C19: got 'x'

## Warning: Expecting numeric in S433 / R433C19: got 'x'

## Warning: Expecting numeric in S435 / R435C19: got 'x'

## Warning: Expecting numeric in S439 / R439C19: got 'x'

## Warning: Expecting numeric in S445 / R445C19: got 'x'

Use the following test to check your work.

## NOTE: No need to change this
assertthat::assert_that(setequal(
              (df_q3 %>% names()),
              col_names_undoc
            ))

## [1] TRUE

assertthat::assert_that((df_q3 %>% slice(1) %>% pull(`2012`)) == 8)

## [1] TRUE

print("Great!")

## [1] "Great!"

37.2 Danger Zone

Now let’s take a look at the head of the data:

df_q3 %>% head()

## # A tibble: 6 × 19
##   region sub_region territory source org   indicator `2000` `2001` `2002` `2003`
##   <chr>  <chr>      <chr>     <chr>  <chr> <chr>      <dbl>  <dbl>  <dbl>  <dbl>
## 1 Africa Eastern A… Burundi   PH     WHO   Rate          NA     NA     NA     NA
## 2 <NA>   <NA>       <NA>      <NA>   <NA>  Count         NA     NA     NA     NA
## 3 <NA>   <NA>       Comoros   PH     WHO   Rate          NA     NA     NA     NA
## 4 <NA>   <NA>       <NA>      <NA>   <NA>  Count         NA     NA     NA     NA
## 5 <NA>   <NA>       Djibouti  PH     WHO   Rate          NA     NA     NA     NA
## 6 <NA>   <NA>       <NA>      <NA>   <NA>  Count         NA     NA     NA     NA
## # ℹ 9 more variables: `2004` <dbl>, `2005` <dbl>, `2006` <dbl>, `2007` <dbl>,
## #   `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>

Irritatingly, many of the cell values are left implicit; as humans reading these data, we can tell that the entries in region under Africa also have the value Africa. However, the computer can’t tell this! We need to make these values explicit by filling them in.

To that end, I’m going to guide you through some slightly advanced Tidyverse code to lag-fill the missing values. To that end, I’ll define and demonstrate two helper functions:

First, the following function counts the number of rows with NA entries in a chosen set of columns:

## Helper function to count num rows w/ NA in vars_lagged
rowAny <- function(x) rowSums(x) > 0
countna <- function(df, vars_lagged) {
  df %>%
    filter(rowAny(across(vars_lagged, is.na))) %>%
    dim %>%
    .[[1]]
}

countna(df_q3, c("region"))

## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `rowAny(across(vars_lagged, is.na))`.
## Caused by warning:
## ! Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(vars_lagged)
## 
##   # Now:
##   data %>% select(all_of(vars_lagged))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.

## [1] 435

Ideally we want this count to be zero. To fill-in values, we will use the following function to do one round of lag-filling:

lagfill <- function(df, vars_lagged) {
  df %>%
    mutate(across(
      vars_lagged,
      function(var) {
        if_else(
          is.na(var) & !is.na(lag(var)),
          lag(var),
          var
        )
      }
    ))
}

df_tmp <-
  df_q3 %>%
  lagfill(c("region"))

countna(df_tmp, c("region"))

## [1] 429

We can see that lagfill has filled the Africa value in row 2, as well as a number of other rows as evidenced by the reduced value returned by countna.

What we’ll do is continually run lagfill until we reduce countna to zero. We could do this by repeatedly running the function manually, but that would be silly. Instead, we’ll run a while loop to automatically run the function until countna reaches zero.

37.2.1 q4 I have already provided the `while` loop below; fill in `vars_lagged` with the names of the columns where cell values are implicit.

Hint: Think about which columns have implicit values, and which truly have missing values.

## Choose variables to lag-fill
vars_lagged <- c("region", "sub_region", "territory", "source", "org")

## NOTE: No need to edit this
## Trim head and notes
df_q4 <-
  df_q3 %>%
  slice(-(n()-5:-n()))

## Repeatedly lag-fill until NA's are gone
while (countna(df_q4, vars_lagged) > 0) {
  df_q4 <-
    df_q4 %>%
    lagfill(vars_lagged)
}

And we’re done! All of the particularly tricky wrangling is now done. You could now use pivoting to tidy the data into long form.

Purpose: The sampling distribution is a key object in statistics—it helps us understand how good our estimates are. However, following the definition is an impractical way to compute a sampling distribution. In this exercise we’ll learn about a practical recipe for estimating a sampling distribution—bootstrap resampling.

Reading: (None, this is the reading)

Topics: Resampling and inference

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(rsample)

37.3 The Core Idea: Resampling with Replacement

In the previous exercise (e-stat04-population) we learned about the difference between a population and a sample. We also talked about sampling distributions—a sampling distribution is a theoretical distribution where we repeatedly gather a sample (of fixed size \(n\)) from the population, and compute the same summary statistic each time. The distribution of the summary statistic is the sampling distribution.

However, this theoretical recipe is not a practical idea: It would be silly to gather many small samples and analyze them separately. If we’ve worked hard to gather a large dataset, we should just analyze it all together. To overcome this practical hurdle, we’ll use the idea of resampling—this is based on a key assumption, that our sample is representative of our population.

Here’s a sample from a normal distribution (population). For ease of comparison, we draw the normal distribution as a solid curve.

## NOTE: No need to edit this setup
set.seed(101) # Set a seed for reproducibility

mu <- 0
sd <- 1
n <- 50

df_data_norm <- tibble(x = rnorm(n, mean = mu, sd = sd))

df_data_norm %>%
  ggplot(aes(x)) +
  geom_histogram(aes(y = after_stat(..density..))) +
  geom_line(
    aes(y = d),
    data = tibble(x = seq(-2.5, +1.5, length.out = 500)) %>% 
      mutate(d = dnorm(x, mean = mu, sd = sd)),
    color = "salmon"
  )

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The sample (histogram) does not look identical to the population (solid curve). But it can’t, because we don’t have all the observations in the population. The key assumption we make in statistical inference is that the sample “looks like” the population. In technical terms, such a sample is representative of a chosen population.

The way we use this assumption is to treat our sample as though it were the population. This is subtle though. If we resample our entire sample, we end up with exactly the same sample. That’s not helpful!

## NOTE: Do not edit this code
# This is the **WRONG** way to do resampling
df_data_norm %>% 
  # Resample *without* replacement
  pull(x) %>% 
  sample(., size = n, replace = FALSE) %>% 
  tibble(x = ., source = "Resampled") %>% 
  # Visualize against the original sample
  bind_rows(., df_data_norm %>% mutate(source = "Original")) %>% 
  ggplot(aes(x)) +
  geom_freqpoly(aes(color = source, linetype = source), bins = 15)

Note that the Resampled set looks exactly like the Original sample. This is not what we want. Instead, we have to resample with replacement.

## NOTE: Do not edit this code
# This is the **RIGHT** way to do resampling
df_data_norm %>% 
  # Resample *with* replacement
  pull(x) %>% 
  sample(., size = n, replace = TRUE) %>% 
  tibble(x = ., source = "Resampled") %>% 
  # Visualize against the original sample
  bind_rows(., df_data_norm %>% mutate(source = "Original")) %>% 
  ggplot(aes(x)) +
  geom_freqpoly(aes(color = source, linetype = source), bins = 15)

Once we have resampled the sample, we can compute any statistics we care about:

## NOTE: Do not edit this code
# This is the **RIGHT** way to do resampling
df_data_norm %>% 
  # Resample *with* replacement
  pull(x) %>% 
  sample(., size = n, replace = TRUE) %>% 
  tibble(x = ., source = "Resampled") %>% 
  # Compute statistics we care about
  bind_rows(., df_data_norm %>% mutate(source = "Original")) %>% 
  group_by(source) %>% 
  summarize(
    sample_mean = mean(x),
    sample_sd = sd(x)
  )

## # A tibble: 2 × 3
##   source    sample_mean sample_sd
##   <chr>           <dbl>     <dbl>
## 1 Original       -0.124     0.932
## 2 Resampled       0.103     0.763

This is the core idea of resampling. But to simulate a sampling distribution, we need to do this many times.

37.3.1 Efficient resampling

The code we just saw is computationally wasteful, since it stores the same data in many copies. So, programmers have written efficient packages to support resampling methods. We will use the rsample package for resampling in this class. The bootstraps() function resamples with replacement a desired number of times.

Note: Bootstrap resampling is just one of many kinds of resampling methods. It’s called “bootstrapping” because we’re “pulling ourselves up by our own bootstraps”, in the sense that we’re only using the sample we have at hand.

df_data_norm %>% 
  bootstraps(., times = 1000)

## # Bootstrap sampling 
## # A tibble: 1,000 × 2
##    splits          id           
##    <list>          <chr>        
##  1 <split [50/16]> Bootstrap0001
##  2 <split [50/19]> Bootstrap0002
##  3 <split [50/16]> Bootstrap0003
##  4 <split [50/21]> Bootstrap0004
##  5 <split [50/18]> Bootstrap0005
##  6 <split [50/16]> Bootstrap0006
##  7 <split [50/15]> Bootstrap0007
##  8 <split [50/17]> Bootstrap0008
##  9 <split [50/22]> Bootstrap0009
## 10 <split [50/23]> Bootstrap0010
## # ℹ 990 more rows

This looks weird; we have 1000 rows, corresponding to times = 1000. The splits column is where the programming magic happens: Rather than storing a copy of the original data, a split is a list of observations to use in a resample. To take advantage of bootstraps(), we have to use the anaysis() function on the splits column. Here’s an example of accessing a single resample:

## NOTE: No need to edit
df_data_norm %>% 
  bootstraps(., times = 1) %>% 
  # Pull out a single `splits` object
  pull(splits) %>% 
  .[[1]] %>% 
  # Use analysis() to turn the split into a usable dataset
  analysis()

## # A tibble: 50 × 1
##         x
##     <dbl>
##  1 -1.46 
##  2 -0.223
##  3  0.758
##  4  1.17 
##  5  0.920
##  6  0.279
##  7  0.569
##  8 -0.223
##  9  0.448
## 10  0.403
## # ℹ 40 more rows

The following code uses bootstraps() and analysis() to show the first 9 resamplings.

## NOTE: No need to edit this setup
df_resample_norm <-
  bootstraps(df_data_norm, times = 1000) %>%
  mutate(df = map(splits, ~ analysis(.x)))

# Visualize the first 9 of the resamples
df_resample_norm %>%
  slice(1:9) %>%
  unnest(df) %>%
  ggplot(aes(x)) +
  geom_histogram() +
  facet_wrap(~ id)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In the example above, I set times = 1000 but show only the first nine. Generally, a larger value of times is better, but a good rule of thumb is to do at least 1000 resamples.

Side Notes:

The bootstraps() function comes from the rsample package, which implements many different resampling strategies (beyond the bootstrap).
The analysis() function also comes from rsample; this is a special function we need to call when working with a resampling of the data [1].
We saw the map() function in e-data10-map; using map() above is necessary in part because we need to call analysis(). Since analysis() is not vectorized, we need the map to use this function on every split in splits.

37.3.2 Using resampling

Working with bootstraps() can be a little finicky. Here’s some example code you can adapt for future use.

Fundamentally, this code is resampling the data with replacement, and carrying out an analysis on each (resampled) dataset using a function we provide.

## NOTE: No need to edit this example
# Start with our dataset
df_demo <- 
  df_data_norm %>% 
  # Create the bootstrap resamplings
  bootstraps(., times = 1000) %>% 
  # Finicky code: Compute an estimate for each resampling (`splits` column)
  mutate(
    # The `map_*` family of functions iterates a provided function over a column
    sample_mean = map_dbl(
      # Chosen column
      splits,
      # Function we provide (defined inline)
      function(split_df) {
        # First, we use `analysis()` to translate the split into usable data
        split_df %>% 
          analysis() %>% 
          # Then, we carry out whatever analysis we want
          summarize(sample_mean = mean(x)) %>% 
          # One last wrangling step
          pull(sample_mean)
      }
    )
  )

df_demo

## # Bootstrap sampling 
## # A tibble: 1,000 × 3
##    splits          id            sample_mean
##    <list>          <chr>               <dbl>
##  1 <split [50/18]> Bootstrap0001     -0.118 
##  2 <split [50/22]> Bootstrap0002     -0.301 
##  3 <split [50/19]> Bootstrap0003     -0.0764
##  4 <split [50/17]> Bootstrap0004     -0.180 
##  5 <split [50/20]> Bootstrap0005      0.0279
##  6 <split [50/19]> Bootstrap0006     -0.195 
##  7 <split [50/17]> Bootstrap0007     -0.222 
##  8 <split [50/17]> Bootstrap0008     -0.188 
##  9 <split [50/22]> Bootstrap0009      0.0463
## 10 <split [50/18]> Bootstrap0010     -0.184 
## # ℹ 990 more rows

37.3.3 q1 Modify the code

Copy and modify the previous code to estimate a sampling distribution for the standard deviation. Provide this as a column sample_sd.

df_q1 <-
  df_data_norm %>% 
  bootstraps(., times = 1000) %>% 
  mutate(
    sample_sd = map_dbl(
      splits,
      function(split_df) {
        split_df %>% 
          analysis() %>% 
          summarize(sample_sd = sd(x)) %>% 
          pull(sample_sd)
      }
    )
  )

df_q1

## # Bootstrap sampling 
## # A tibble: 1,000 × 3
##    splits          id            sample_sd
##    <list>          <chr>             <dbl>
##  1 <split [50/17]> Bootstrap0001     0.762
##  2 <split [50/16]> Bootstrap0002     0.990
##  3 <split [50/15]> Bootstrap0003     0.919
##  4 <split [50/20]> Bootstrap0004     0.985
##  5 <split [50/22]> Bootstrap0005     0.898
##  6 <split [50/17]> Bootstrap0006     0.909
##  7 <split [50/17]> Bootstrap0007     0.916
##  8 <split [50/21]> Bootstrap0008     1.08 
##  9 <split [50/17]> Bootstrap0009     0.968
## 10 <split [50/17]> Bootstrap0010     1.10 
## # ℹ 990 more rows

The following test will verify that your df_q1 is correct:

## NOTE: No need to change this!
assertthat::assert_that(
    df_q1 %>% 
      mutate(
        test_sd = map_dbl(splits, ~analysis(.) %>% summarize(s = sd(x)) %>% pull(s)),
        test = test_sd == sample_sd
      ) %>% 
      pull(test) %>% 
      all()
)

## [1] TRUE

print("Great job!")

## [1] "Great job!"

The following code will visualize your results.

## NOTE: No need to edit
df_q1 %>% 
  ggplot(aes(sample_sd)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This plot gives us an (approximate) sampling distribution for the standard deviation. A reasonable next step would be to compute the standard error.

37.3.4 q2 Compute the standard error

Compute the standard error for the mean by completing the code below.

Hint: We defined the standard error in the previous stat exercise (e-stat04).

## TASK: Compute the standard error
df_q2 <- 
  df_data_norm %>% 
  bootstraps(., times = 1000) %>% 
  mutate(
    sample_mean = map_dbl(
      splits,
      function(split_df) {
        split_df %>% 
          analysis() %>% 
          summarize(sample_mean = mean(x)) %>% 
          pull(sample_mean)
      }
    )
  ) %>% 
  summarize(se = sd(sample_mean))

df_q2

## # A tibble: 1 × 1
##      se
##   <dbl>
## 1 0.134

The following test will verify that your df_q2 is correct:

## NOTE: No need to change this!
assertthat::assert_that(
    (df_q2 %>% pull(se)) > 0.1,
    msg = "Take a closer look at the definition of the standard error. We only divide by sqrt(n) when using the standard deviation of the *population* (not the standard deviation of the *samping distribution*)."
)

## [1] TRUE

print("Great job!")

## [1] "Great job!"

We can use a sampling distribution to build a “reasonable interval” for the population value we’re trying to estimate. We can do this by taking quantiles of the distribution.

37.3.5 q3 Quantiles of the sampling distribution

Compute the lower and upper quantiles of the sampling distribution for the mean, such that 0.99 of the distribution is contained within the endpoints, and an equal amount of the distribution lands on either side outside the endpoints.

Hint: You should use the alpha provided below, but you need to select half of alpha from the left tail, and half from the right tail.

## TASK: Compute the lower and upper quantiles of the sampling distribution
alpha <- 0.01

df_q3 <- 
  df_demo %>% 
  summarize(
# task_begin
    # mean_lo = ?,
    # mean_up = ?,
# task_end
    mean_lo = quantile(sample_mean, alpha / 2),
    mean_up = quantile(sample_mean, 1 - alpha / 2),
  )

The following will visualize your results against the sampling distribution:

## NOTE: Do not edit this code
df_demo %>% 
  ggplot(aes(sample_mean)) +
  geom_histogram() +
  geom_vline(
    data = df_q3,
    mapping = aes(xintercept = mean_lo)
  ) +
  geom_vline(
    data = df_q3,
    mapping = aes(xintercept = mean_up)
  )

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The following test will verify that your df_q2 is correct:

## NOTE: No need to change this!
assertthat::assert_that(
    abs(
      df_q3 %>% pull(mean_lo) %>% .[[1]]
    - quantile(df_demo$sample_mean, 0.005)
    ) < 1e-3,
    msg = "Your value of mean_lo is incorrect"
)

## [1] TRUE

assertthat::assert_that(
    abs(
      df_q3 %>% pull(mean_up) %>% .[[1]]
    - quantile(df_demo$sample_mean, 0.995)
    ) < 1e-3,
    msg = "Your value of mean_up is incorrect"
)

## [1] TRUE

print("Great job!")

## [1] "Great job!"

These upper and lower quantiles construct a confidence interval, an important tool for quantifying uncertainty. We’ll discuss the interpretation of confidence intervals in e-stat07-clt. However, if you’d like a preview, this animation is the best visual explain I’ve found on how confidence intervals are constructed [3].

37.4 Notes

[1] This is because rsample does some fancy stuff under the hood. Basically bootstraps does not make any additional copies of the data; the price we pay for this efficiency is the need to call analysis().

[2] For a slightly more mathematical treatment of the bootstrap, try these MIT course notes

[3] This the best visualization of the confidence interval concept that I have ever found. Click through Frequentist Inference > Confidence Interval to see the animation.

37 Data: Reading Excel Sheets

37.1 Wrangling Basics

37.1.1 q1 Run the following code and pay attention to the column names. Open the downloaded Excel sheet and compare. Why are the column names so weird?

37.1.2 q2 Read the Excel sheet.

37.1.3 q3 Change the column names and types.

37.2 Danger Zone

37.2.1 q4 I have already provided the while loop below; fill in vars_lagged with the names of the columns where cell values are implicit.

37.3 The Core Idea: Resampling with Replacement

37.3.1 Efficient resampling

37.3.2 Using resampling

37.3.3 q1 Modify the code

37.3.4 q2 Compute the standard error

37.3.5 q3 Quantiles of the sampling distribution

37.4 Notes

37.2.1 q4 I have already provided the `while` loop below; fill in `vars_lagged` with the names of the columns where cell values are implicit.