23 Data: Pivoting Data
Purpose: Data is easiest to use when it is tidy. In fact, the tidyverse (including ggplot, dplyr, etc.) is specifically designed to use tidy data. But not all data we’ll encounter is tidy! To that end, in this exercise we’ll learn how to tidy our data by pivoting.
As a result of learning how to quickly tidy data, you’ll vastly expand the set of datasets you can analyze. Rather than fighting with data, you’ll be able to quickly wrangle and extract insights.
Reading: Reshape Data Topics: Welcome, Tidy Data (skip Gathering and Spreading columns) Reading Time: ~10 minutes (this exercise contains more reading material) Optional readings: - selection language
Note: Unfortunately, the RStudio primers have not been updated to use the most
up-to-date dplyr tools. Rather than learning the out-of-date tools gather, spread
, we will instead learn how to use pivot_longer
and pivot_wider
.
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
23.1 Tidy Data
Tidy data is a form of data where:
- Each variable is in its own column
- Each observation is in its own row
- Each value is in its own cell
Not all data are presented in tidy form; in this case it can be difficult to tell what the variables are. Let’s practice distinguishing between the columns and the variables.
23.1.1 q1 What are the variables in the following dataset?
## NOTE: No need to edit; execute
cases <- tribble(
~Country, ~`2011`, ~`2012`, ~`2013`,
"FR", 7000, 6900, 7000,
"DE", 5800, 6000, 6200,
"US", 15000, 14000, 13000
)
cases
## # A tibble: 3 × 4
## Country `2011` `2012` `2013`
## <chr> <dbl> <dbl> <dbl>
## 1 FR 7000 6900 7000
## 2 DE 5800 6000 6200
## 3 US 15000 14000 13000
- Country, 2011, 2012, and 2013
- Country, year, and some unknown quantity (n, count, etc.)
- FR, DE, and US
## TODO: Modify with your multiple choice number answer
q1_answer <- 0
## NOTE: The following will test your answer
if (((q1_answer + 56) %% 3 == 1) & (q1_answer > 0)) {
"Correct!"
} else {
"Incorrect!"
}
## [1] "Incorrect!"
23.1.2 q2 What are the variables in the following dataset?
## NOTE: No need to edit; execute
alloys <- tribble(
~thick, ~E_00, ~mu_00, ~E_45, ~mu_45, ~rep,
0.022, 10600, 0.321, 10700, 0.329, 1,
0.022, 10600, 0.323, 10500, 0.331, 2,
0.032, 10400, 0.329, 10400, 0.318, 1,
0.032, 10300, 0.319, 10500, 0.326, 2
)
alloys
## # A tibble: 4 × 6
## thick E_00 mu_00 E_45 mu_45 rep
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.022 10600 0.321 10700 0.329 1
## 2 0.022 10600 0.323 10500 0.331 2
## 3 0.032 10400 0.329 10400 0.318 1
## 4 0.032 10300 0.319 10500 0.326 2
- thick, E_00, mu_00, E_45, mu_45, rep
- thick, E, mu, rep
- thick, E, mu, rep, angle
## TODO: Modify with your multiple choice number answer
q2_answer <- 0
## NOTE: The following will test your answer
if (((q2_answer + 38) %% 3 == 2) & (q2_answer > 0)) {
"Correct!"
} else {
"Incorrect!"
}
## [1] "Incorrect!"
23.2 Pivoting: Examples
The dplyr package comes with tools to pivot our data into tidy form. There are
two key tools: pivot_longer
and pivot_wider
. The names are suggestive of
their use. When our data are too wide we should pivot_longer
, and when our
data are too long, we should pivot_wider
.
23.2.1 Pivot longer
First, let’s see how pivot_longer
works on the cases
data. Run the following
code chunk:
## NOTE: No need to edit; execute
cases %>%
pivot_longer(
names_to = "Year",
values_to = "n",
cols = c(`2011`, `2012`, `2013`)
)
## # A tibble: 9 × 3
## Country Year n
## <chr> <chr> <dbl>
## 1 FR 2011 7000
## 2 FR 2012 6900
## 3 FR 2013 7000
## 4 DE 2011 5800
## 5 DE 2012 6000
## 6 DE 2013 6200
## 7 US 2011 15000
## 8 US 2012 14000
## 9 US 2013 13000
Now these data are tidy! The variable Year
is now the name of a column, and
its values appear in the cells.
Let’s break down the key inputs to pivot_longer
:
names_to
is what we’re going to call the new column whose values will be the original column namesvalues_to
is what we’re going to call the new column that will hold the values associated with the original columnscols
is the set of columns in the original dataset that we’re going to modify. This takes the same inputs as a call toselect
, so we can use functions likestarts_with, ends_with, contains
, etc., or a list of column names enclosed withc()
- Note that in our case, we had to enclose each column name with ticks so that dplyr does not interpret the integer values as column positions (rather than column names)
- For more details on selecting variables, see the selection language page
However, there’s a problem with the Year
column:
## NOTE: No need to edit; execute
cases %>%
pivot_longer(
names_to = "Year",
values_to = "n",
c(`2011`, `2012`, `2013`)
) %>%
summarize(Year = mean(Year))
## Warning in mean.default(Year): argument is not numeric or logical: returning NA
## # A tibble: 1 × 1
## Year
## <dbl>
## 1 NA
The summary failed! That’s because the Year
column is full of strings, rather
than integers. We can fix this via mutation:
## NOTE: No need to edit; execute
cases %>%
pivot_longer(
names_to = "Year",
values_to = "n",
c(`2011`, `2012`, `2013`)
) %>%
mutate(Year = as.integer(Year))
## # A tibble: 9 × 3
## Country Year n
## <chr> <int> <dbl>
## 1 FR 2011 7000
## 2 FR 2012 6900
## 3 FR 2013 7000
## 4 DE 2011 5800
## 5 DE 2012 6000
## 6 DE 2013 6200
## 7 US 2011 15000
## 8 US 2012 14000
## 9 US 2013 13000
Now the data are tidy and of the proper type.
Let’s look at a built-in dataset:
## NOTE: No need to edit; execute
ansc <-
tribble(
~`x-1`, ~`x-2`, ~`y-1`, ~`y-2`,
10, 10, 8.04, 9.14,
8, 8, 6.95, 8.14,
13, 13, 7.58, 8.74,
9, 9, 8.81, 8.77,
11, 11, 8.33, 9.26,
14, 14, 9.96, 8.10,
6, 6, 7.24, 6.13,
4, 4, 4.26, 3.10,
12, 12, 10.84, 9.13,
7, 7, 4.82, 7.26,
5, 5, 5.68, 4.74
)
ansc
## # A tibble: 11 × 4
## `x-1` `x-2` `y-1` `y-2`
## <dbl> <dbl> <dbl> <dbl>
## 1 10 10 8.04 9.14
## 2 8 8 6.95 8.14
## 3 13 13 7.58 8.74
## 4 9 9 8.81 8.77
## 5 11 11 8.33 9.26
## 6 14 14 9.96 8.1
## 7 6 6 7.24 6.13
## 8 4 4 4.26 3.1
## 9 12 12 10.8 9.13
## 10 7 7 4.82 7.26
## 11 5 5 5.68 4.74
This dataset is too wide; the digit after each x
or y
denotes a different
dataset. The case is tricky to pivot though: We need to separate the trailing
digits while preserving the x, y
column names. We can use the special “.value”
entry in names_to
in order to handle this:
## NOTE: No need to edit; execute
ansc %>%
pivot_longer(
names_to = c(".value", "set"),
names_sep = "-",
cols = everything()
)
## # A tibble: 22 × 3
## set x y
## <chr> <dbl> <dbl>
## 1 1 10 8.04
## 2 2 10 9.14
## 3 1 8 6.95
## 4 2 8 8.14
## 5 1 13 7.58
## 6 2 13 8.74
## 7 1 9 8.81
## 8 2 9 8.77
## 9 1 11 8.33
## 10 2 11 9.26
## # … with 12 more rows
Note that:
- With .value
in names_to
, we do not provide the values_to
column names. We are instead signaling that the value names come from the column names
- everything()
is a convenient way to select all columns
Let’s look at one more use of pivot_longer
on the alloys
dataset.
## NOTE: No need to edit; execute
alloys %>%
pivot_longer(
names_to = c("var", "angle"),
names_sep = "_",
values_to = "val",
cols = c(-thick, -rep)
)
## # A tibble: 16 × 5
## thick rep var angle val
## <dbl> <dbl> <chr> <chr> <dbl>
## 1 0.022 1 E 00 10600
## 2 0.022 1 mu 00 0.321
## 3 0.022 1 E 45 10700
## 4 0.022 1 mu 45 0.329
## 5 0.022 2 E 00 10600
## 6 0.022 2 mu 00 0.323
## 7 0.022 2 E 45 10500
## 8 0.022 2 mu 45 0.331
## 9 0.032 1 E 00 10400
## 10 0.032 1 mu 00 0.329
## 11 0.032 1 E 45 10400
## 12 0.032 1 mu 45 0.318
## 13 0.032 2 E 00 10300
## 14 0.032 2 mu 00 0.319
## 15 0.032 2 E 45 10500
## 16 0.032 2 mu 45 0.326
Note a few differences from the call of pivot_longer
on the cases
data:
- here
names_to
contains two names; this is to deal with the two components of the merged column namesE_00, mu_00, E_45,
etc. names_sep
allows us to specify a character that separates the components of the merged column names. In our case, the column names are merged with an underscore_
- We use the
-column
syntax withcols
to signal that we don’t want the specified columns. This allows us to excludethick, rep
- As an alternative, we could have used the more verbose
cols = starts_with("E") | starts_with("mu")
, which means “starts with”E” OR starts with “mu””
- As an alternative, we could have used the more verbose
This looks closer to tidy—we’ve taken care of the merged column names—but
now we have a different problem: The variables E, mu
are now in cells, rather
than column names! This is an example of a dataset that is too long. For this,
we’ll need to use pivot_wider
.
23.2.2 Pivot wider
We’ll continue tidying the alloys
dataset with pivot_wider
.
## NOTE: No need to edit; execute
alloys %>%
pivot_longer(
names_to = c("var", "angle"),
names_sep = "_",
values_to = "val",
starts_with("E") | starts_with("mu")
) %>%
pivot_wider(
names_from = var, # Cell entries to turn into new column names
values_from = val # Values to associate with the new column(s)
)
## # A tibble: 8 × 5
## thick rep angle E mu
## <dbl> <dbl> <chr> <dbl> <dbl>
## 1 0.022 1 00 10600 0.321
## 2 0.022 1 45 10700 0.329
## 3 0.022 2 00 10600 0.323
## 4 0.022 2 45 10500 0.331
## 5 0.032 1 00 10400 0.329
## 6 0.032 1 45 10400 0.318
## 7 0.032 2 00 10300 0.319
## 8 0.032 2 45 10500 0.326
Note the differences between pivot_longer
and pivot_wider
:
- Rather than
names_to
, we specifynames_from
; this takes a tidyselect specification. We specify the column(s) of values to turn into new column names - Rather than
values_to
, we specifyvalues_from
; this takes a tidyselect specification. We specify the column(s) of values to turn into new values
What we just saw above is a general strategy: If you see merged column names, you can:
- First,
pivot_longer
withnames_sep
ornames_pattern
to unmerge the column names. - Next,
pivot_wider
to tidy the data.
Both pivot_longer
and pivot_wider
have a lot of features; see their
documentation for more info.
23.3 Pivoting: Exercises
To practice using pivot_longer
and pivot_wider
, we’re going to work with the
following small dataset:
## NOTE: No need to edit; this is setup for the exercises
df_base <-
tribble(
~`X-0`, ~`X-1`, ~key,
1, 9, "A",
2, 8, "B",
3, 7, "C"
)
We’re going to play a game: I’m going to modify the data, and your job is to
pivot it back to equal df_base
.
23.3.1 q3 Recover df_base
from df_q3
by using a single pivot and no other functions.
## NOTE: No need to edit; this is setup for the exercise
df_q3 <-
df_base %>%
pivot_longer(
names_to = "id",
names_pattern = "(\\d)",
names_transform = list(id = as.integer),
values_to = "value",
cols = -key
)
df_q3
## # A tibble: 6 × 3
## key id value
## <chr> <int> <dbl>
## 1 A 0 1
## 2 A 1 9
## 3 B 0 2
## 4 B 1 8
## 5 C 0 3
## 6 C 1 7
Undo the modification using a single pivot. Don’t worry about column order.
df_q3_res <-
df_q3 %>%
pivot_wider(
names_from = id,
names_prefix = "X-",
values_from = value
)
df_q3_res
## # A tibble: 3 × 3
## key `X-0` `X-1`
## <chr> <dbl> <dbl>
## 1 A 1 9
## 2 B 2 8
## 3 C 3 7
## New names:
## New names:
## • `X-0` -> `X.0`
## • `X-1` -> `X.1`
## [1] TRUE
23.3.2 q4 Recover df_base
from df_q4
by using a single pivot and no other functions.
## NOTE: No need to edit; this is setup for the exercise
df_q4 <-
df_base %>%
pivot_wider(
names_from = key,
values_from = `X-0`
)
df_q4
## # A tibble: 3 × 4
## `X-1` A B C
## <dbl> <dbl> <dbl> <dbl>
## 1 9 1 NA NA
## 2 8 NA 2 NA
## 3 7 NA NA 3
Undo the modification using a single pivot. Don’t worry about column order.
Hint: You’ll need a way to drop NA
values in the pivot (without filtering).
Check the documentation for pivot_longer
.
df_q4_res <-
df_q4 %>%
pivot_longer(
names_to = "key",
values_to = "X-0",
values_drop_na = TRUE,
cols = c(A, B, C)
)
df_q4_res
## # A tibble: 3 × 3
## `X-1` key `X-0`
## <dbl> <chr> <dbl>
## 1 9 A 1
## 2 8 B 2
## 3 7 C 3
## New names:
## New names:
## • `X-0` -> `X.0`
## • `X-1` -> `X.1`
## [1] TRUE
23.3.3 q5 Recover df_base
from df_q5
by using a single pivot and no other functions.
## NOTE: No need to edit; this is setup for the exercise
df_q5 <-
df_base %>%
pivot_wider(
names_from = key,
values_from = -key
)
df_q5
## # A tibble: 1 × 6
## `X-0_A` `X-0_B` `X-0_C` `X-1_A` `X-1_B` `X-1_C`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2 3 9 8 7
Undo the modification using a single pivot. Don’t worry about column order.
Hint: For this one, you’ll need to use the special .value
entry in names_to
.
df_q5_res <-
df_q5 %>%
pivot_longer(
names_to = c(".value", "key"),
names_sep = "_",
cols = everything()
)
df_q5_res
## # A tibble: 3 × 3
## key `X-0` `X-1`
## <chr> <dbl> <dbl>
## 1 A 1 9
## 2 B 2 8
## 3 C 3 7
## New names:
## New names:
## • `X-0` -> `X.0`
## • `X-1` -> `X.1`
## [1] TRUE
23.3.4 q6 Make your own!
Using a single pivot on df_base
create your own challenge dataframe. You will share this with the rest of the class as a puzzle, so make sure to solve your own challenge so you have a solution!
## Rows: 3
## Columns: 3
## $ `X-0` <dbl> 1, 2, 3
## $ `X-1` <dbl> 9, 8, 7
## $ key <chr> "A", "B", "C"
## # A tibble: 3 × 3
## `X-0` `X-1` key
## <dbl> <dbl> <chr>
## 1 1 9 A
## 2 2 8 B
## 3 3 7 C
Don’t forget to create a solution!
## Rows: 3
## Columns: 3
## $ `X-0` <dbl> 1, 2, 3
## $ `X-1` <dbl> 9, 8, 7
## $ key <chr> "A", "B", "C"
## # A tibble: 3 × 3
## `X-0` `X-1` key
## <dbl> <dbl> <chr>
## 1 1 9 A
## 2 2 8 B
## 3 3 7 C
## New names:
## New names:
## • `X-0` -> `X.0`
## • `X-1` -> `X.1`
## [1] TRUE