Data: Pivoting Data

Data: Pivoting Data#

Purpose: Data is easiest to use when it is tidy. In fact, grama is specifically designed to use tidy data. But not all data we’ll encounter is tidy! To that end, in this exercise we’ll learn how to tidy our data by pivoting.

As a result of learning how to quickly tidy data, you’ll vastly expand the set of datasets you can analyze. Rather than fighting with data, you’ll be able to quickly wrangle and extract insights.

Setup#

import grama as gr
DF = gr.Intention()
%matplotlib inline

# For assertion
from pandas.api.types import is_integer_dtype

Tidy Data#

Tidy Data is a very simple—but very powerful—concept for structuring a dataset.

Artwork by @allison_horst

Tidy Data: Definition#

A tidy dataset has three properties:

each variable forms a column
each observation forms a row
each cell is a single measurement

As an example, the following dataset is tidy:

from grama.data import df_stang
df_stang.head()

	thick	alloy	E	mu
0	0.022	al_24st	10600	0.321
1	0.022	al_24st	10600	0.323
2	0.032	al_24st	10400	0.329
3	0.032	al_24st	10300	0.319
4	0.064	al_24st	10500	0.323

The observations are all measured material properties taken at a particular angle (with respect to the direction in which the specimens were rolled). Each column reports values for just one variable, each row corresponds to a single observation, and every cell reports just one measurement.

However, the following form of the same dataset is not tidy:

from grama.data import df_stang_wide
df_stang_wide

	thick	E_00	mu_00	E_45	mu_45	E_90	mu_90	alloy
0	0.022	10600	0.321	10700	0.329	10500	0.310	al_24st
1	0.022	10600	0.323	10500	0.331	10700	0.323	al_24st
2	0.032	10400	0.329	10400	0.318	10300	0.322	al_24st
3	0.032	10300	0.319	10500	0.326	10400	0.330	al_24st
4	0.064	10500	0.323	10400	0.331	10400	0.327	al_24st
5	0.064	10700	0.328	10500	0.328	10500	0.320	al_24st
6	0.081	10000	0.315	10000	0.320	9900	0.314	al_24st
7	0.081	10100	0.312	9900	0.312	10000	0.316	al_24st
8	0.081	10000	0.311	-1	-1.000	9900	0.314	al_24st

This dataset is not tidy: The angle of each measurement 00, 45, 90 is a variable, but these numerical values are expressed as column names. Put differently, some of the values are not in cells, but rather in the column names.

Why tidy data?#

Tidy data makes analysis easier. Putting our data in tidy form means we can use a consistent set of tools to work with any dataset.

Artwork by @allison_horst

Note that untidy data is not bad data; untidy data are simply harder to work with when doing data analysis. Data often come in untidy form when they are reported, say in a paper or a presentation. For instance, the wide form of the Stang et al. dataset can easily fit on one page:

df_stang_wide

	thick	E_00	mu_00	E_45	mu_45	E_90	mu_90	alloy
0	0.022	10600	0.321	10700	0.329	10500	0.310	al_24st
1	0.022	10600	0.323	10500	0.331	10700	0.323	al_24st
2	0.032	10400	0.329	10400	0.318	10300	0.322	al_24st
3	0.032	10300	0.319	10500	0.326	10400	0.330	al_24st
4	0.064	10500	0.323	10400	0.331	10400	0.327	al_24st
5	0.064	10700	0.328	10500	0.328	10500	0.320	al_24st
6	0.081	10000	0.315	10000	0.320	9900	0.314	al_24st
7	0.081	10100	0.312	9900	0.312	10000	0.316	al_24st
8	0.081	10000	0.311	-1	-1.000	9900	0.314	al_24st

However, the tidy form of the same dataset is far less compact:

df_stang

	thick	alloy	E	mu	ang
0	0.022	al_24st	10600	0.321	0
1	0.022	al_24st	10600	0.323	0
2	0.032	al_24st	10400	0.329	0
3	0.032	al_24st	10300	0.319	0
4	0.064	al_24st	10500	0.323	0
...	...	...	...	...	...
71	0.064	al_24st	10400	0.327	90
72	0.064	al_24st	10500	0.320	90
73	0.081	al_24st	9900	0.314	90
74	0.081	al_24st	10000	0.316	90
75	0.081	al_24st	9900	0.314	90

76 rows × 5 columns

Exercises#

Let’s practice identifying tidy and untidy data.

q1 Identify#

Inspect the following dataset; answer the questions under observations below.

## TASK: No need to edit; run and inspect
df_cases = gr.df_make(
    country=["FR", "DE", "US"],
    year2011=[7000, 5800, 15000],
    year2012=[6900, 6000, 14000],
    year2013=[7000, 6200, 13000],
)
df_cases

	country	year2011	year2012	year2013
0	FR	7000	6900	7000
1	DE	5800	6000	6200
2	US	15000	14000	13000

Observations

What are the variables in this dataset?
- Country, year, and some unknown quantity (n, count, etc.)
Is this dataset tidy? Why or why not?
- No; the year values are in the column names.

q2 Identify#

Inspect the following dataset; answer the questions under observations below.

## TASK: No need to edit; run and inspect
df_alloys1 = gr.df_make(
    thick=[0.022, 0.022, 0.032, 0.032],
    E_00=[10600, 10600, 10400, 10300],
    mu_00=[0.321, 0.323, 0.329, 0.319],
    E_45=[10700, 10500, 10400, 10500],
    mu_45=[0.329, 0.331, 0.318, 0.326],
    rep=[1, 2, 1, 2],
)
df_alloys1

	thick	E_00	mu_00	E_45	mu_45	rep
0	0.022	10600	0.321	10700	0.329	1
1	0.022	10600	0.323	10500	0.331	2
2	0.032	10400	0.329	10400	0.318	1
3	0.032	10300	0.319	10500	0.326	2

Observations

What are the variables in this dataset?
- Thickness thick, elasticity E, poisson’s ratio mu, angle (in column names), replication rep
Is this dataset tidy? Why or why not?
- No; the angle values are in the column names.

q3 Identify#

Inspect the following dataset; answer the questions under observations below.

## TASK: No need to edit; run and inspect
df_alloys2 = gr.df_make(
    thick=[0.022, 0.022, 0.032, 0.032],
    var=["E", "mu", "E", "mu"],
    value=[10700, 0.321, 10500, 0.323],
    rep=[1, 1, 2, 2],
    angle=[0, 0, 0, 0],
)
df_alloys2

	thick	var	value	rep
0	0.022	E	10700.000	1
1	0.022	mu	0.321	1
2	0.032	E	10500.000	2
3	0.032	mu	0.323	2

Observations

What are the variables in this dataset?
- Thickness thick, elasticity E, poisson’s ratio mu, angle, replication rep
Is this dataset tidy? Why or why not?
- No; the column value contains values of two different variables E and mu.

Pivoting Data#

The untidy datasets above fail to be tidy because they have the wrong shape; we can tidy these datasets by pivoting the data. There are just two pivots we need to learn about: tf_pivot_longer() and tf_pivot_wider().

Pivot Longer#

Let’s take another look at the df_cases example dataset:

## NOTE: No need to edit
df_cases

	country	year2011	year2012	year2013
0	FR	7000	6900	7000
1	DE	5800	6000	6200
2	US	15000	14000	13000

This dataset is too wide; the column names year2011,year2012,year2013 should really be numbers inside a single year column. We can pivot longer to move from a wide structure to a longer one:

## NOTE: No need to edit
(
    df_cases
    >> gr.tf_pivot_longer(
        columns=["year2011", "year2012", "year2013"],
        names_to="year",
        values_to="count",
    )
)

	country	year	count
0	FR	year2011	7000
1	DE	year2011	5800
2	US	year2011	15000
3	FR	year2012	6900
4	DE	year2012	6000
5	US	year2012	14000
6	FR	year2013	7000
7	DE	year2013	6200
8	US	year2013	13000

Note what we had to provide as arguments to gr.tf_pivot_longer():

columns specifies the columns to involve in the pivoting
names_to specifies what new column will contain the previous column names
value_to specifies what new column will contain the previous cell values

Often, a pivot alone is not enough to fully clean a dataset; for instance, we would still need to remove the year string from each cell entry. However, we can do this much more easily once the data are pivoted into a longer format:

## NOTE: No need to edit
(
    df_cases
    >> gr.tf_pivot_longer(
        columns=["year2011", "year2012", "year2013"],
        names_to="year",
        values_to="count",
    )
    >> gr.tf_mutate(year=gr.str_replace(DF.year, "year", ""))
)

	country	year	count
0	FR	2011	7000
1	DE	2011	5800
2	US	2011	15000
3	FR	2012	6900
4	DE	2012	6000
5	US	2012	14000
6	FR	2013	7000
7	DE	2013	6200
8	US	2013	13000

q4 Pivot the data#

Pivot the dataset df_alloys1 to place each of E_00, mu_00, E_45, mu_45 as values in a new variable column, and move the original values to a new value column. Answer the questions under observations below.

## TASK: Tidy the data, make sure to provide `variable` and `value` columns
df_res1 = (
    df_alloys1

    >> gr.tf_pivot_longer(
        columns=["E_00", "mu_00", "E_45", "mu_45"],
        names_to="variable",
        values_to="value",
    )
)

## NOTE: Use this to check your work
print(df_res1)

assert \
    {"thick", "rep", "variable", "value"} == set(df_res1.columns), \
    'columns of df_res1 are not ["thick", "rep", "variable", "value"]'

assert \
    {"E_00", "mu_00", "E_45", "mu_45"} == set(df_res1.variable), \
    'Entries in df_res1.variable are incorrect'

    thick  rep variable      value
 0.022    1     E_00  10600.000
 0.022    2     E_00  10600.000
 0.032    1     E_00  10400.000
 0.032    2     E_00  10300.000
 0.022    1    mu_00      0.321
 0.022    2    mu_00      0.323
 0.032    1    mu_00      0.329
 0.032    2    mu_00      0.319
 0.022    1     E_45  10700.000
 0.022    2     E_45  10500.000
0.032    1     E_45  10400.000
0.032    2     E_45  10500.000
0.022    1    mu_45      0.329
0.022    2    mu_45      0.331
0.032    1    mu_45      0.318
0.032    2    mu_45      0.326

Observations

Is this dataset tidy? Why or why not?
- This dataset is not tidy; the entries in the value column include values of E and mu (two different variables). Also, the variable column contains both variable names and angle values.

Using selection helpers#

One useful feature of gr.tf_pivot_longer() is that we can use selection helpers to choose which columns to use in a pivot. This can save us a lot of typing, and allows us to write code that can react to multiple different datasets. Let’s practice using selection helpers by re-doing the previous task.

q5 Simplify with selection helpers#

Use a single selection helper to simplify the columns argument to gr.tf_pivot_longer().

Hint: There are multiple ways to do this: The string "E|mu" will match the strings "E" or "mu". The string "\\d+" will match any string of consecutive digits.

## TASK: Replace the `columns` argument with a selection helper
df_simplified = (
    df_alloys1
    >> gr.tf_pivot_longer(

        columns=gr.matches("E|mu"), 
        names_to="variable",
        values_to="value",
    )
)

## NOTE: Use this to check your work
print(df_simplified)

assert \
    {"thick", "rep", "variable", "value"} == set(df_simplified.columns), \
    'columns of df_res1 are not ["thick", "rep", "variable", "value"]'

assert \
    {"E_00", "mu_00", "E_45", "mu_45"} == set(df_simplified.variable), \
    'Entries in df_res1.variable are incorrect'

    thick  rep variable      value
 0.022    1     E_00  10600.000
 0.022    2     E_00  10600.000
 0.032    1     E_00  10400.000
 0.032    2     E_00  10300.000
 0.022    1    mu_00      0.321
 0.022    2    mu_00      0.323
 0.032    1    mu_00      0.329
 0.032    2    mu_00      0.319
 0.022    1     E_45  10700.000
 0.022    2     E_45  10500.000
0.032    1     E_45  10400.000
0.032    2     E_45  10500.000
0.022    1    mu_45      0.329
0.022    2    mu_45      0.331
0.032    1    mu_45      0.318
0.032    2    mu_45      0.326

Pivot Wider#

Just as a dataset can be “too wide”, it can also be “too long.” In this case, we can use gr.tf_pivot_wider(). Let’s look at an illustrative example:

## NOTE: No need to edit
df_long = gr.df_make(
    variable=["x", "y", "f", "x", "y", "f"],
    value=[0, 0, 1, 0, 1, 0],
    observation=[0, 0, 0, 1, 1, 1],
)
df_long

	variable	value	observation
0	x	0	0
1	y	0	0
2	f	1	0
3	x	0	1
4	y	1	1
5	f	0	1

Here, the value column has mixed variables; values of x, y, and f comingle in the value column. Really we should have the columns x, y, and f: In this sense the data is too long. Let’s pivot wider to tidy the data.

## NOTE: No need to edit
(
    df_long
    >> gr.tf_pivot_wider(
        names_from="variable",
        values_from="value",
    )
)

	observation	f	x	y
0	0	1	0	0
1	1	0	0	1

Note that we had to provide just two inputs to gr.tf_pivot_wider():

names_from specifies which column will provide the new column names
value_from specifies which column will provide the new column values

Let’s get some practice!

q6 Pivot the data#

Pivot the dataset df_alloys2 to provide E and mu as columns in df_res2. Answer the questions under observations below.

## TASK: Pivot the dataset wider
df_res2 = (
    df_alloys2
    >> gr.tf_pivot_wider(
        names_from="var",
        values_from="value",
    )
)

## NOTE: Use the following to check your work
print(df_res2)

assert \
    {"thick", "rep", "angle", "E", "mu"} == set(df_res2.columns), \
    'df_res2 does not have the columns ["thick", "rep", "angle", "E", "mu"]'

   thick  rep  angle        E     mu
0  0.022    1      0  10700.0  0.321
1  0.032    2      0  10500.0  0.323

Observations

Is this dataset tidy? Why or why not?
- This dataset is tidy! Every column is one variable, every row is one observation, and every cell contains one value.

Warning: No observation identifier#

While the arguments to gr.tf_pivot_wider() only target two columns, we do need to think carefully about what other columns are in the dataset. It is important that the other columns help identify which values are associated with the same observation—the easiest way to do this is with some sort of observation identifier column.

Let’s take a look at what happens when we do not have a way to associate values with the same observation:

## NOTE: No need to edit
(
    gr.df_make(
        variable=["x", "y", "f", "x", "y", "f"],
        value=[0, 0, 1, 0, 1, 0],
        # observation=[0, 0, 0, 1, 1, 1], # Remove the observation identifier
    )
    >> gr.tf_pivot_wider(
        names_from="variable",
        values_from="value",
    )
)

	f	x	y
0	NaN	0.0	NaN
1	NaN	NaN	0.0
2	1.0	NaN	NaN
3	NaN	0.0	NaN
4	NaN	NaN	1.0
5	0.0	NaN	NaN

Note that gr.tf_pivot_wider() placed all of the values in the “correct” column, but filled all over column values with NaNs. If you get strange behavior with gr.tf_pivot_wider(), you should think about whether you are providing the information necessary to associate common values with the same observation. We’ll learn more about this in the next exercise.

	f	x	y
0	NaN	0.0	NaN
1	NaN	NaN	0.0
2	1.0	NaN	NaN
3	NaN	0.0	NaN
4	NaN	NaN	1.0
5	0.0	NaN	NaN

	f	x	y
0	NaN	0.0	NaN
1	NaN	NaN	0.0
2	1.0	NaN	NaN
3	NaN	0.0	NaN
4	NaN	NaN	1.0
5	0.0	NaN	NaN

Data: Pivoting Data

Contents

Data: Pivoting Data#

Setup#

Tidy Data#

Tidy Data: Definition#

Why tidy data?#

Exercises#

q1 Identify#

q2 Identify#

q3 Identify#

Pivoting Data#

Pivot Longer#

q4 Pivot the data#

Using selection helpers#

q5 Simplify with selection helpers#

Pivot Wider#

q6 Pivot the data#

Warning: No observation identifier#

	f	x	y
0	NaN	0.0	NaN
1	NaN	NaN	0.0
2	1.0	NaN	NaN
3	NaN	0.0	NaN
4	NaN	NaN	1.0
5	0.0	NaN	NaN