Data: Separating and Uniting Columns#

Purpose: Data is easiest to use when it is tidy. In fact, grama is specifically designed to use tidy data. Last time we learned how to pivot data, but data can be untidy in other ways. Pivoting helped us when data were locked up in the column headers: This time, we’ll learn how to use separate and unite to deal with cell values that are untidy.

Setup#

import grama as gr
DF = gr.Intention()
%matplotlib inline

# For assertion
from pandas.api.types import is_integer_dtype

Separate and Unite#

Sometimes data are presented in a wide format in order to fit many values on a page. You’ll often see this form of data when looking at large datasets that have been prepared for a report (say, a UN population survey). For example, the following dataset is clearly not tidy:

## NOTE: No need to edit
gr.df_make(
    country=["Afghanistan", "Albania", "Algeria"],
    population_1995=[17586073, 3357858, 29315463],
    population_1996=[18415307, 3341043, 29845208],
)

	country	population_1995	population_1996
0	Afghanistan	17586073	18415307
1	Albania	3357858	3341043
2	Algeria	29315463	29845208

The year of each population measurement is “locked up” in the column names. However, pivoting this dataset only gets us so far:

## NOTE: No need to edit
(
    gr.df_make(
        country=["Afghanistan", "Albania", "Algeria"],
        population_1995=[17586073, 3357858, 29315463],
        population_1996=[18415307, 3341043, 29845208],
    )
    >> gr.tf_pivot_longer(
        columns=["population_1995", "population_1996"],
        names_to="name",
        values_to="value",
    )
)

	country	name	value
0	Afghanistan	population_1995	17586073
1	Albania	population_1995	3357858
2	Algeria	population_1995	29315463
3	Afghanistan	population_1996	18415307
4	Albania	population_1996	3341043
5	Algeria	population_1996	29845208

The column names are not simply cell values; they’re a combination of both column names and cell values. To fully tidy this datset, we need to separate the names from the data.

The `gr.tf_separate()` verb#

The gr.tf_separate() verb allows us to separate string-valued columns on a target character. As a simple example, consider the following (silly) dataset:

## NOTE: No need to edit
(
    gr.df_make(s=["A-123", "B-456", "C-789"])
)

	s
0	A-123
1	B-456
2	C-789

If we wanted to separate the column s into two columns on the - character, we could do that with gr.tf_separate().

## NOTE: No need to edit
(
    gr.df_make(s=["A-123", "B-456", "C-789"])
    >> gr.tf_separate(
        column="s",
        sep="-",
        into=["letter", "numbers"],
    )
)

	letter	numbers
0	A	123
1	B	456
2	C	789

Note that we have two new columns with the separated values. We can use this approach to help tidy datasets.

q1 Separate the `population` and `year`#

Use gr.tf_separate() to separate the population and year values into their own columns.

## TASK: Separate the population and year columns
df_pop = (
    gr.df_make(
        country=["Afghanistan", "Albania", "Algeria"],
        population_1995=[17586073, 3357858, 29315463],
        population_1996=[18415307, 3341043, 29845208],
    )
    >> gr.tf_pivot_longer(
        columns=["population_1995", "population_1996"],
        names_to="pop_year",
        values_to="value",
    )
    >> gr.tf_separate(
        column="pop_year",
        sep="_",
        into=["population", "year"],
    )
)

## NOTE: No need to edit; use this to check your work
print(df_pop)

assert \
    "population" in df_pop.columns, \
    "df_pop does not have a `population` column"
assert \
    "year" in df_pop.columns, \
    "df_pop does not have a `year` column"
assert \
    df_pop.shape == (6, 4), \
    "df_pop is the wrong shape"

       country     value  population  year
Afghanistan  17586073  population  1995
    Albania   3357858  population  1995
    Algeria  29315463  population  1995
Afghanistan  18415307  population  1996
    Albania   3341043  population  1996
    Algeria  29845208  population  1996

The `gr.tf_unite()` verb#

The gr.tf_unite() verb performs the inverse of gr.tf_separate(); gr.tf_unite() allows us to combine multiple string columns into a single column. For example, we can undo the separation you performed above:

## NOTE: No need to edit
(
    df_pop
    >> gr.tf_unite(
        "population-year",
        DF.population,
        DF.year,
        sep="-",
    )
)

	country	value	population-year
0	Afghanistan	17586073	population-1995
1	Albania	3357858	population-1995
2	Algeria	29315463	population-1995
3	Afghanistan	18415307	population-1996
4	Albania	3341043	population-1996
5	Algeria	29845208	population-1996

Note that we have to choose what to name the new united column, the set of columns to combine, and a separator character with the sep argument. While the example above is not terribly practical, there are many cases where we’ll have many strings that we might like to join.

q2 Unite the digits#

Add the area code to the front of each localnumber to complete the following phone numbers. Provide the full number as the column number. Make sure to use the separator character -.

## TASK:
df_numbers = (
    gr.df_make(
        localnumber=["255-2112", "255-4200", "867-5309"],
        area=["212", "650", "814"],
    )
    >> gr.tf_unite(
        "number",
        DF.area,
        DF.localnumber,
        sep="-"
    )
)

## NOTE: Use this to check your work
print(df_numbers)

assert \
    "number" in df_numbers.columns, \
    "df_numbers does not have a `number` column"
assert \
    (df_numbers >> gr.tf_filter(gr.str_detect(DF.number, "_"))).shape[0] == 0, \
    "Separator `_` detected; make sure to use `-` as your separator"
assert \
    (df_numbers >> gr.tf_filter(gr.str_detect(DF.number, "^212|^650|^814"))).shape[0] == 3, \
    "The area codes are not at the *front* of the numbers"

         number
212-255-2112
650-255-4200
814-867-5309

Case Study: Tidying an Alloy Dataset#

Now let’s make use of gr.tf_separate() in tidying a more engineering-focused dataset. The following is a small dataset of measured aluminum alloy material properties.

## NOTE: No need to edit
df_alloys = gr.df_make(
    thick=[0.022]*2 + [0.032]*2,
    E_00=[10600, 10600, 10400, 10300],
    mu_00=[0.321, 0.323, 0.329, 0.319],
    E_45=[10700, 10500, 10400, 10500],
    mu_45=[0.329, 0.331, 0.318, 0.326],
)

df_alloys

	thick	E_00	mu_00	E_45	mu_45
0	0.022	10600	0.321	10700	0.329
1	0.022	10600	0.323	10500	0.331
2	0.032	10400	0.329	10400	0.318
3	0.032	10300	0.319	10500	0.326

Note that this dataset is not tidy: The digits in the column names are the angles at which the alloys were tested. In order to tidy this dataset, we’ll have to go through a few steps:

q3 Add a unique observation identifier#

Later, when we attempt a gr.tf_pivot_wider() down the line, we’ll need a unique identifier for each row in the original dataset. Create a new column id_obs that uniquely identifies each row.

Hint: You could just use the DataFrame’s index column to do this….

## TASK: Add a unique identifier (UID) to each row
df_alloys_id = (
    df_alloys

    >> gr.tf_mutate(id_obs=DF.index)
)

## NOTE: Use this to check your work
print(df_alloys_id)

assert \
    "id_obs" in df_alloys_id.columns, \
    "df_alloys_id has no `id_obs` column"

assert \
    len(set(df_alloys_id.id_obs)) == len(df_alloys_id.id_obs), \
    "The values in df_alloys_id.id_obs are not unique"

   thick   E_00  mu_00   E_45  mu_45  id_obs
0.022  10600  0.321  10700  0.329       0
0.022  10600  0.323  10500  0.331       1
0.032  10400  0.329  10400  0.318       2
0.032  10300  0.319  10500  0.326       3

q4 Separate the `variable` and `angle` data#

The following code pivots the data longer; use gr.tf_separate() to separate the variable and angle data.

## TASK: Separate the `variable` and `angle` data
df_alloys_separated = (
    df_alloys_id
    >> gr.tf_pivot_longer(
        columns=["E_00", "E_45", "mu_00", "mu_45"],
        names_to="variable_angle",
        values_to="value"
    )
    >> gr.tf_separate(
        column="variable_angle",
        sep="_",
        into=["variable", "angle"],
    )
)

## NOTE: use this to check your work
print(df_alloys_separated)

assert \
    "variable" in df_alloys_separated.columns, \
    "df_alloys_separated has no `variable` column"

assert \
    "angle" in df_alloys_separated.columns, \
    "df_alloys_separated has no `angle` column"

    thick  id_obs      value variable angle
 0.022       0  10600.000        E    00
 0.022       1  10600.000        E    00
 0.032       2  10400.000        E    00
 0.032       3  10300.000        E    00
 0.022       0  10700.000        E    45
 0.022       1  10500.000        E    45
 0.032       2  10400.000        E    45
 0.032       3  10500.000        E    45
 0.022       0      0.321       mu    00
 0.022       1      0.323       mu    00
0.032       2      0.329       mu    00
0.032       3      0.319       mu    00
0.022       0      0.329       mu    45
0.022       1      0.331       mu    45
0.032       2      0.318       mu    45
0.032       3      0.326       mu    45

q5 Finish tidying the data#

Finish tidying the data: Use a pivot to move the variable names back to being column names, and make sure the angle values are integers.

## TASK: Finish tidying the data
df_alloys_tidy = (
    df_alloys_separated
    >> gr.tf_pivot_wider(
        names_from="variable",
        values_from="value",
    )
    >> gr.tf_mutate(angle=gr.as_int(DF.angle))
)

## NOTE: Use this to check your work
print(df_alloys_tidy)

assert \
    "E" in df_alloys_tidy.columns, \
    "df_alloys_tidy does not have an `E` column"
assert \
    "mu" in df_alloys_tidy.columns, \
    "df_alloys_tidy does not have an `mu` column"
assert \
    is_integer_dtype(df_alloys_tidy.angle), \
    "df_alloys_tidy.angle is not an integer column"

   thick  id_obs  angle        E     mu
0.022       0      0  10600.0  0.321
0.022       0     45  10700.0  0.329
0.022       1      0  10600.0  0.323
0.022       1     45  10500.0  0.331
0.032       2      0  10400.0  0.329
0.032       2     45  10400.0  0.318
0.032       3      0  10300.0  0.319
0.032       3     45  10500.0  0.326

Aside: What happens without the unique id?#

Above, I made a fuss about adding a unique id column to the dataset before we start pivoting. What happens if we don’t do that?

# NOTE: Uncomment the code below and run this cell to see the error
# (
#     df_alloys
#     # NOTE: I didn't add a UID here; let's see what happens....
    
#     >> gr.tf_pivot_longer(
#         columns=["E_00", "E_45", "mu_00", "mu_45"],
#         names_to="varang",
#         values_to="value"
#     )
#     >> gr.tf_separate(
#         column="varang",
#         sep="_",
#         into=["var", "ang"],
#     )
#     >> gr.tf_pivot_wider(
#         names_from="var",
#         values_from="value",
#     )
# )

We had to do a lot of work to tidy this dataset. Perhaps there’s a better way?

A magic option: `".value"`#

To help minimize some of the work necessary for tidying, gr.tf_pivot_longer() provides some special functionality. To illustrate this, let’s take a look at a full tidying effort on a dataset similar to the alloy one:

## NOTE: No need to edit, this demonstrates the manual approach we've already seen
(
    gr.df_make(
        A_01=[1,2,3,4],
        A_02=[5,6,7,8],
        B_01=[1,3,5,7],
    )
    # Generate a unique observation id
    >> gr.tf_mutate(uid=DF.index)
    # Pivot longer to turn column names into cell values
    >> gr.tf_pivot_longer(
        columns=["A_01", "A_02", "B_01"],
        names_to="varidx",
        values_to="value",
    )
    # Separate the "true" column names from the cell entries
    >> gr.tf_separate(
        column="varidx",
        sep="_",
        into=["var", "idx"],
    )
    # Pivot wider to place correct column names in header
    >> gr.tf_pivot_wider(
        names_from="var",
        values_from="value",
    )
)

	uid	idx	A	B
0	0	01	1.0	1.0
1	0	02	5.0	NaN
2	1	01	2.0	3.0
3	1	02	6.0	NaN
4	2	01	3.0	5.0
5	2	02	7.0	NaN
6	3	01	4.0	7.0
7	3	02	8.0	NaN

This gets the job done. However, there is a much more succinct way to tidy this dataset using just one call to gr.tf_pivot_longer() with the ".value" argument.

## NOTE: No need to edit, this demonstrates the ".value" approach
(
    gr.df_make(
        A_01=[1,2,3,4],
        A_02=[5,6,7,8],
        B_01=[1,3,5,7],
    )
    
>> gr.tf_pivot_longer(
        columns=["A_01", "A_02", "B_01"],
        names_to=[".value", "idx"],
        names_sep="_",
    )
)

	idx	A	B
0	01	1.0	1.0
1	02	5.0	NaN
2	01	2.0	3.0
3	02	6.0	NaN
4	01	3.0	5.0
5	02	7.0	NaN
6	01	4.0	7.0
7	02	8.0	NaN

This style of calling gr.tf_pivot_longer() with the names_sep argument effectively embeds a call to gr.tf_separate() within the pivot. The verb then automates the tedious steps of creating a unique ID and pivoting wider.

In order to use this special functionality, we have to target the final column names using the special ".value" argument. Note that this shows up in the names_to set of arguments; where we would have specified the name of the column that we would eventually pivot wider:

    names_to=["var", "idx"],
    names_sep="_",

we instead flag this column as the target to pivot wider with the special argument

    names_to=[".value", "idx"],
    names_sep="_",

q6 Use the `".value"` approach#

Use the ".value" approach to tidy the alloy dataset in one pivot.

## TASK: Use the ".value" approach to tidy the alloy dataset in one pivot
(
    gr.df_make(
        thick=[0.022]*2 + [0.032]*2,
        E_00=[10600, 10600, 10400, 10300],
        mu_00=[0.321, 0.323, 0.329, 0.319],
        E_45=[10700, 10500, 10400, 10500],
        mu_45=[0.329, 0.331, 0.318, 0.326],
    )
    >> gr.tf_pivot_longer(
        columns=["E_00", "mu_00", "E_45", "mu_45"],
        names_to=[".value", "angle"],
        names_sep="_",
    )
)

/Users/zach/Git/py_grama/grama/tran_pivot.py:471: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

	thick	angle	E	mu
0	0.022	00	10600.0	0.321
1	0.022	45	10700.0	0.329
2	0.022	00	10600.0	0.323
3	0.022	45	10500.0	0.331
4	0.032	00	10400.0	0.329
5	0.032	45	10400.0	0.318
6	0.032	00	10300.0	0.319
7	0.032	45	10500.0	0.326

Data Science for Engineers

Data: Separating and Uniting Columns

Contents

Data: Separating and Uniting Columns#

Setup#

Separate and Unite#

The `gr.tf_separate()` verb#

q1 Separate the `population` and `year`#

The `gr.tf_unite()` verb#

q2 Unite the digits#

Case Study: Tidying an Alloy Dataset#

q3 Add a unique observation identifier#

q4 Separate the `variable` and `angle` data#

q5 Finish tidying the data#

Aside: What happens without the unique id?#

A magic option: `".value"`#

q6 Use the `".value"` approach#

	uid	idx	A	B
0	0	01	1.0	1.0
1	0	02	5.0	NaN
2	1	01	2.0	3.0
3	1	02	6.0	NaN
4	2	01	3.0	5.0
5	2	02	7.0	NaN
6	3	01	4.0	7.0
7	3	02	8.0	NaN

	idx	A	B
0	01	1.0	1.0
1	02	5.0	NaN
2	01	2.0	3.0
3	02	6.0	NaN
4	01	3.0	5.0
5	02	7.0	NaN
6	01	4.0	7.0
7	02	8.0	NaN

	uid	idx	A	B
0	0	01	1.0	1.0
1	0	02	5.0	NaN
2	1	01	2.0	3.0
3	1	02	6.0	NaN
4	2	01	3.0	5.0
5	2	02	7.0	NaN
6	3	01	4.0	7.0
7	3	02	8.0	NaN

	idx	A	B
0	01	1.0	1.0
1	02	5.0	NaN
2	01	2.0	3.0
3	02	6.0	NaN
4	01	3.0	5.0
5	02	7.0	NaN
6	01	4.0	7.0
7	02	8.0	NaN

Data Science for Engineers

Data: Separating and Uniting Columns

Contents

Data: Separating and Uniting Columns#

Setup#

Separate and Unite#

The gr.tf_separate() verb#

q1 Separate the population and year#

The gr.tf_unite() verb#

q2 Unite the digits#

Case Study: Tidying an Alloy Dataset#

q3 Add a unique observation identifier#

q4 Separate the variable and angle data#

q5 Finish tidying the data#

Aside: What happens without the unique id?#

A magic option: ".value"#

q6 Use the ".value" approach#

The `gr.tf_separate()` verb#

q1 Separate the `population` and `year`#

The `gr.tf_unite()` verb#

q4 Separate the `variable` and `angle` data#

A magic option: `".value"`#

q6 Use the `".value"` approach#

	uid	idx	A	B
0	0	01	1.0	1.0
1	0	02	5.0	NaN
2	1	01	2.0	3.0
3	1	02	6.0	NaN
4	2	01	3.0	5.0
5	2	02	7.0	NaN
6	3	01	4.0	7.0
7	3	02	8.0	NaN

	idx	A	B
0	01	1.0	1.0
1	02	5.0	NaN
2	01	2.0	3.0
3	02	6.0	NaN
4	01	3.0	5.0
5	02	7.0	NaN
6	01	4.0	7.0
7	02	8.0	NaN