19 Data: Deriving Quantities

Purpose: Often our data will not tell us directly what we want to know; in these cases we need to derive new quantities from our data. In this exercise, we’ll work with mutate() to create new columns by operating on existing variables, and use group_by() with summarize() to compute aggregate statistics (summaries!) of our data.

Reading: (None, this is the reading)

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

19.1 A pipe reminder

Remember the pipe %>%? The pipe takes an input on the left, and “inserts” it into a function on the right. Let’s put this idea to work with a little refresher exercise.

19.1.1 q1 A little refresher

What is the difference between these two versions of code? How are they the same? How are they different?

## Version 1
filter(diamonds, cut == "Ideal")

## # A tibble: 21,551 × 10
##    carat cut   color clarity depth table price     x     y     z
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.23 Ideal J     VS1      62.8    56   340  3.93  3.9   2.46
##  3  0.31 Ideal J     SI2      62.2    54   344  4.35  4.37  2.71
##  4  0.3  Ideal I     SI2      62      54   348  4.31  4.34  2.68
##  5  0.33 Ideal I     SI2      61.8    55   403  4.49  4.51  2.78
##  6  0.33 Ideal I     SI2      61.2    56   403  4.49  4.5   2.75
##  7  0.33 Ideal J     SI1      61.1    56   403  4.49  4.55  2.76
##  8  0.23 Ideal G     VS1      61.9    54   404  3.93  3.95  2.44
##  9  0.32 Ideal I     SI1      60.9    55   404  4.45  4.48  2.72
## 10  0.3  Ideal I     SI2      61      59   405  4.3   4.33  2.63
## # ℹ 21,541 more rows

## Version 2
diamonds %>% filter(cut == "Ideal")

## # A tibble: 21,551 × 10
##    carat cut   color clarity depth table price     x     y     z
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.23 Ideal J     VS1      62.8    56   340  3.93  3.9   2.46
##  3  0.31 Ideal J     SI2      62.2    54   344  4.35  4.37  2.71
##  4  0.3  Ideal I     SI2      62      54   348  4.31  4.34  2.68
##  5  0.33 Ideal I     SI2      61.8    55   403  4.49  4.51  2.78
##  6  0.33 Ideal I     SI2      61.2    56   403  4.49  4.5   2.75
##  7  0.33 Ideal J     SI1      61.1    56   403  4.49  4.55  2.76
##  8  0.23 Ideal G     VS1      61.9    54   404  3.93  3.95  2.44
##  9  0.32 Ideal I     SI1      60.9    55   404  4.45  4.48  2.72
## 10  0.3  Ideal I     SI2      61      59   405  4.3   4.33  2.63
## # ℹ 21,541 more rows

These two lines carry out the same computation. However, Version 2 uses the pipe %>%, which takes its left-hand-side (LHS) and inserts the LHS as the first argument of the right-hand-side (RHS).

19.2 Summarizing

The summarize() function applies a user-provided summary function over all rows, and returns a single row. We can use summarize() to compute things like an average:

## NOTE: No need to edit
diamonds %>% 
  summarize(price_mean = mean(price))

## # A tibble: 1 × 1
##   price_mean
##        <dbl>
## 1      3933.

We can also compute multiple summaries at once:

## NOTE: No need to edit
diamonds %>% 
  summarize(
    price_min = min(price),
    price_mean = mean(price),
    price_max = max(price),
  )

## # A tibble: 1 × 3
##   price_min price_mean price_max
##       <int>      <dbl>     <int>
## 1       326      3933.     18823

19.2.1 Summary Functions

There are a variety of functions that work with summarize():

Type	Functions
Location	`mean(x), median(x), quantile(x, p), min(x), max(x)`
Spread	`sd(x), var(x), IQR(x), mad(x)`
Position	`first(x), nth(x, n), last(x)`
Counts	`n_distinct(x), n()`
Logical	`sum(!is.na(x)), mean(y == 0)`

We can combine computation with a summary function to calculate more complex quantities. For instance, the following code computes the number of one (or more) carat diamonds in the diamonds dataset:

## NOTE: No need to edit
diamonds %>% 
  summarize(n_onecarat = sum(carat >= 1))

## # A tibble: 1 × 1
##   n_onecarat
##        <int>
## 1      19060

19.2.2 q2 Count the ideal diamonds

Using summarize() and a logical summary function, determine the number of rows with Ideal cut. Save this value to a column called n_ideal.

Hint: We learned about checking for equality in e-data01-isolate.

df_q2 <-
  diamonds %>%
  summarize(n_ideal = sum(cut == "Ideal"))
df_q2

## # A tibble: 1 × 1
##   n_ideal
##     <int>
## 1   21551

The following test will verify that your df_q2 is correct:

## NOTE: No need to change this!
assertthat::assert_that(
  assertthat::are_equal(
    df_q2 %>% pull(n_ideal),
    21551
  )
)

## [1] TRUE

print("Great job!")

## [1] "Great job!"

19.3 Grouping

Grouping allows us to modify the behavior of summarize(): It computes the given summaries within the groups defined by columns listed in group_by(). This is perhaps easiest to show with an example:

## NOTE: No need to edit
diamonds %>% 
  group_by(cut) %>% 
  summarize(n = n()) # Count the rows in each group of `cut`

## # A tibble: 5 × 2
##   cut           n
##   <ord>     <int>
## 1 Fair       1610
## 2 Good       4906
## 3 Very Good 12082
## 4 Premium   13791
## 5 Ideal     21551

Aside: Within-group counting is so useful that dplyr provides a shortcut for this: count().

## NOTE: This does the same as the previous code chunk
diamonds %>% 
  count(cut)

## # A tibble: 5 × 2
##   cut           n
##   <ord>     <int>
## 1 Fair       1610
## 2 Good       4906
## 3 Very Good 12082
## 4 Premium   13791
## 5 Ideal     21551

Grouping is a helpful way to get a sense for how a summary changes with another variable (or multiple variabels).

19.3.1 q3 Experiment with `group_by()`

The function group_by() modifies how other dplyr verbs function. Uncomment the group_by() below, and describe how the result changes.

diamonds %>%
  ## group_by(color, clarity) %>%
  summarize(price_mean = mean(price))

## # A tibble: 1 × 1
##   price_mean
##        <dbl>
## 1      3933.

The commented version computes a summary over the entire dataframe
The uncommented version computes summaries over groups of color and clarity

19.4 Row computations with `mutate()`

Rather than compute summaries, we can apply computations to every row in the dataset using mutate(). This is useful for things like unit conversions. For instance, since 1 carat = \(0.2\) grams, we can compute the mass of every diamond:

## NOTE: No need to edit
diamonds %>% 
  mutate(`mass(g)` = carat / 5) %>% 
  select(`mass(g)`, everything())

## # A tibble: 53,940 × 11
##    `mass(g)` carat cut       color clarity depth table price     x     y     z
##        <dbl> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1     0.046  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2     0.042  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3     0.046  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4     0.058  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5     0.062  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6     0.048  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7     0.048  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8     0.052  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9     0.044  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10     0.046  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # ℹ 53,930 more rows

19.4.1 Vectorized Functions

There are many vectorized functions that we can use with mutate(). A vectorized function is one that operates on an entire vector. If we want to use a non-vectorized function with mutate(), we’ll have to use a map() function instead (introduced in e-data10-map).

Type	Functions
Arithmetic ops.	`+, -, *, /, ^`
Modular arith.	`%/%, %%`
Logical comp.	`<, <=, >, >=, !=, ==`
Logarithms	`log(x), log2(x), log10(x)`
Offsets	`lead(x), lag(x)`
Cumulants	`cumsum(x), cumprod(x), cummin(x), cummax(x), cummean(x)`
Ranking	`min_rank(x), row_number(x), dense_rank(x), percent_rank(x), cume_dist(x), ntile(x)`

19.4.2 q4 Comment on why the difference is so large

The depth variable is supposedly computed via depth_computed = 100 * 2 * z / (x + y). Compute diff = depth - depth_computed: This is a measure of discrepancy between the given and computed depth.

Additionally, compute the coefficient of variation cov = sd(x) / mean(x) for both depth and diff: This is a dimensionless measure of variation in a variable.

Assign the resulting tibble to df_q4, and assign the appropriate values to cov_depth and cov_diff. Comment on the relative values of cov_depth and cov_diff; why is cov_diff so large?

Note: Confusingly, the documentation for diamonds leaves out the factor of 100 in the computation of depth.

df_q4 <-
  diamonds %>%
  mutate(
    depth_computed = 100 * 2 * z / (x + y),
    diff = depth - depth_computed
  ) %>%
  summarize(
    depth_mean = mean(depth, na.rm = TRUE),
    depth_sd = sd(depth, na.rm = TRUE),
    cov_depth = depth_sd / depth_mean,

    diff_mean = mean(diff, na.rm = TRUE),
    diff_sd = sd(diff, na.rm = TRUE),
    cov_diff = diff_sd / diff_mean,
    c_diff = IQR(diff, na.rm = TRUE) / median(diff, na.rm = TRUE)
  )
df_q4

## # A tibble: 1 × 7
##   depth_mean depth_sd cov_depth diff_mean diff_sd cov_diff   c_diff
##        <dbl>    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1       61.7     1.43    0.0232   0.00528    2.63     498. -7.50e12

Observations

cov_depth is tiny; there’s not much variation in the depth, compared to its scale.
cov_diff is enormous! This is because the mean difference between depth and depth_computed is small, which causes the cov to blow up.

The following test will verify that your df_q4 is correct:

## NOTE: No need to change this!
assertthat::assert_that(abs(df_q4 %>% pull(cov_depth) - 0.02320057) < 1e-3)

## [1] TRUE

assertthat::assert_that(abs(df_q4 %>% pull(cov_diff) - 497.5585) < 1e-3)

## [1] TRUE

print("Nice!")

## [1] "Nice!"

There is nonzero difference between depth and the computed depth; this raises some questions about how depth was actually computed! It’s often a good idea to try to check derived quantities in your data, if possible. These can sometimes uncover errors in the data!

19.4.3 q5 Compute and observe

Compute the price_mean = mean(price), price_sd = sd(price), and price_cov = price_sd / price_mean for each cut of diamond. What observations can you make about the various cuts? Do those observations match your expectations?

df_q5 <-
  diamonds %>%
  group_by(cut) %>%
  summarize(
    price_mean = mean(price),
    price_sd = sd(price),
    price_cov = price_sd / price_mean
  )
df_q5

## # A tibble: 5 × 4
##   cut       price_mean price_sd price_cov
##   <ord>          <dbl>    <dbl>     <dbl>
## 1 Fair           4359.    3560.     0.817
## 2 Good           3929.    3682.     0.937
## 3 Very Good      3982.    3936.     0.988
## 4 Premium        4584.    4349.     0.949
## 5 Ideal          3458.    3808.     1.10

The following test will verify that your df_q5 is correct:

## NOTE: No need to change this!
assertthat::assert_that(
  assertthat::are_equal(
    df_q5 %>%
      select(cut, price_cov) %>%
      mutate(price_cov = round(price_cov, digits = 3)),
    tibble(
      cut = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
      price_cov = c(0.817, 0.937, 0.988, 0.949, 1.101)
    ) %>%
    mutate(cut = fct_inorder(cut, ordered = TRUE))
  )
)

## [1] TRUE

print("Excellent!")

## [1] "Excellent!"

Observations: - I would expect the Ideal diamonds to have the highest price, but that’s not the case! - The COV for each cut is very large, on the order of 80 to 110 percent! To me, this implies that the other variables clarity, carat, color account for a large portion of the diamond price.