Data: Separating and Uniting Columns#

Purpose: Data is easiest to use when it is tidy. In fact, grama is specifically designed to use tidy data. Last time we learned how to pivot data, but data can be untidy in other ways. Pivoting helped us when data were locked up in the column headers: This time, we’ll learn how to use separate and unite to deal with cell values that are untidy.

Setup#

import grama as gr
DF = gr.Intention()
%matplotlib inline

# For assertion
from pandas.api.types import is_integer_dtype

Separate and Unite#

Sometimes data are presented in a wide format in order to fit many values on a page. You’ll often see this form of data when looking at large datasets that have been prepared for a report (say, a UN population survey). For example, the following dataset is clearly not tidy:

## NOTE: No need to edit
gr.df_make(
    country=["Afghanistan", "Albania", "Algeria"],
    population_1995=[17586073, 3357858, 29315463],
    population_1996=[18415307, 3341043, 29845208],
)
country population_1995 population_1996
0 Afghanistan 17586073 18415307
1 Albania 3357858 3341043
2 Algeria 29315463 29845208

The year of each population measurement is “locked up” in the column names. However, pivoting this dataset only gets us so far:

## NOTE: No need to edit
(
    gr.df_make(
        country=["Afghanistan", "Albania", "Algeria"],
        population_1995=[17586073, 3357858, 29315463],
        population_1996=[18415307, 3341043, 29845208],
    )
    >> gr.tf_pivot_longer(
        columns=["population_1995", "population_1996"],
        names_to="name",
        values_to="value",
    )
)
country name value
0 Afghanistan population_1995 17586073
1 Albania population_1995 3357858
2 Algeria population_1995 29315463
3 Afghanistan population_1996 18415307
4 Albania population_1996 3341043
5 Algeria population_1996 29845208

The column names are not simply cell values; they’re a combination of both column names and cell values. To fully tidy this datset, we need to separate the names from the data.

The gr.tf_separate() verb#

The gr.tf_separate() verb allows us to separate string-valued columns on a target character. As a simple example, consider the following (silly) dataset:

## NOTE: No need to edit
(
    gr.df_make(s=["A-123", "B-456", "C-789"])
)
s
0 A-123
1 B-456
2 C-789

If we wanted to separate the column s into two columns on the - character, we could do that with gr.tf_separate().

## NOTE: No need to edit
(
    gr.df_make(s=["A-123", "B-456", "C-789"])
    >> gr.tf_separate(
        column="s",
        sep="-",
        into=["letter", "numbers"],
    )
)
letter numbers
0 A 123
1 B 456
2 C 789

Note that we have two new columns with the separated values. We can use this approach to help tidy datasets.

q1 Separate the population and year#

Use gr.tf_separate() to separate the population and year values into their own columns.

## TASK: Separate the population and year columns
df_pop = (
    gr.df_make(
        country=["Afghanistan", "Albania", "Algeria"],
        population_1995=[17586073, 3357858, 29315463],
        population_1996=[18415307, 3341043, 29845208],
    )
    >> gr.tf_pivot_longer(
        columns=["population_1995", "population_1996"],
        names_to="pop_year",
        values_to="value",
    )
    >> gr.tf_separate(
        column="pop_year",
        sep="_",
        into=["population", "year"],
    )
)

## NOTE: No need to edit; use this to check your work
print(df_pop)

assert \
    "population" in df_pop.columns, \
    "df_pop does not have a `population` column"
assert \
    "year" in df_pop.columns, \
    "df_pop does not have a `year` column"
assert \
    df_pop.shape == (6, 4), \
    "df_pop is the wrong shape"
       country     value  population  year
0  Afghanistan  17586073  population  1995
1      Albania   3357858  population  1995
2      Algeria  29315463  population  1995
3  Afghanistan  18415307  population  1996
4      Albania   3341043  population  1996
5      Algeria  29845208  population  1996

The gr.tf_unite() verb#

The gr.tf_unite() verb performs the inverse of gr.tf_separate(); gr.tf_unite() allows us to combine multiple string columns into a single column. For example, we can undo the separation you performed above:

## NOTE: No need to edit
(
    df_pop
    >> gr.tf_unite(
        "population-year",
        DF.population,
        DF.year,
        sep="-",
    )
)
country value population-year
0 Afghanistan 17586073 population-1995
1 Albania 3357858 population-1995
2 Algeria 29315463 population-1995
3 Afghanistan 18415307 population-1996
4 Albania 3341043 population-1996
5 Algeria 29845208 population-1996

Note that we have to choose what to name the new united column, the set of columns to combine, and a separator character with the sep argument. While the example above is not terribly practical, there are many cases where we’ll have many strings that we might like to join.

q2 Unite the digits#

Add the area code to the front of each localnumber to complete the following phone numbers. Provide the full number as the column number. Make sure to use the separator character -.

## TASK:
df_numbers = (
    gr.df_make(
        localnumber=["255-2112", "255-4200", "867-5309"],
        area=["212", "650", "814"],
    )
    >> gr.tf_unite(
        "number",
        DF.area,
        DF.localnumber,
        sep="-"
    )
)

## NOTE: Use this to check your work
print(df_numbers)

assert \
    "number" in df_numbers.columns, \
    "df_numbers does not have a `number` column"
assert \
    (df_numbers >> gr.tf_filter(gr.str_detect(DF.number, "_"))).shape[0] == 0, \
    "Separator `_` detected; make sure to use `-` as your separator"
assert \
    (df_numbers >> gr.tf_filter(gr.str_detect(DF.number, "^212|^650|^814"))).shape[0] == 3, \
    "The area codes are not at the *front* of the numbers"
         number
0  212-255-2112
1  650-255-4200
2  814-867-5309

Case Study: Tidying an Alloy Dataset#

Now let’s make use of gr.tf_separate() in tidying a more engineering-focused dataset. The following is a small dataset of measured aluminum alloy material properties.

## NOTE: No need to edit
df_alloys = gr.df_make(
    thick=[0.022]*2 + [0.032]*2,
    E_00=[10600, 10600, 10400, 10300],
    mu_00=[0.321, 0.323, 0.329, 0.319],
    E_45=[10700, 10500, 10400, 10500],
    mu_45=[0.329, 0.331, 0.318, 0.326],
)

df_alloys
thick E_00 mu_00 E_45 mu_45
0 0.022 10600 0.321 10700 0.329
1 0.022 10600 0.323 10500 0.331
2 0.032 10400 0.329 10400 0.318
3 0.032 10300 0.319 10500 0.326

Note that this dataset is not tidy: The digits in the column names are the angles at which the alloys were tested. In order to tidy this dataset, we’ll have to go through a few steps:

q3 Add a unique observation identifier#

Later, when we attempt a gr.tf_pivot_wider() down the line, we’ll need a unique identifier for each row in the original dataset. Create a new column id_obs that uniquely identifies each row.

Hint: You could just use the DataFrame’s index column to do this….

## TASK: Add a unique identifier (UID) to each row
df_alloys_id = (
    df_alloys

    >> gr.tf_mutate(id_obs=DF.index)
)

## NOTE: Use this to check your work
print(df_alloys_id)

assert \
    "id_obs" in df_alloys_id.columns, \
    "df_alloys_id has no `id_obs` column"

assert \
    len(set(df_alloys_id.id_obs)) == len(df_alloys_id.id_obs), \
    "The values in df_alloys_id.id_obs are not unique"
   thick   E_00  mu_00   E_45  mu_45  id_obs
0  0.022  10600  0.321  10700  0.329       0
1  0.022  10600  0.323  10500  0.331       1
2  0.032  10400  0.329  10400  0.318       2
3  0.032  10300  0.319  10500  0.326       3

q4 Separate the variable and angle data#

The following code pivots the data longer; use gr.tf_separate() to separate the variable and angle data.

## TASK: Separate the `variable` and `angle` data
df_alloys_separated = (
    df_alloys_id
    >> gr.tf_pivot_longer(
        columns=["E_00", "E_45", "mu_00", "mu_45"],
        names_to="variable_angle",
        values_to="value"
    )
    >> gr.tf_separate(
        column="variable_angle",
        sep="_",
        into=["variable", "angle"],
    )
)

## NOTE: use this to check your work
print(df_alloys_separated)

assert \
    "variable" in df_alloys_separated.columns, \
    "df_alloys_separated has no `variable` column"

assert \
    "angle" in df_alloys_separated.columns, \
    "df_alloys_separated has no `angle` column"
    thick  id_obs      value variable angle
0   0.022       0  10600.000        E    00
1   0.022       1  10600.000        E    00
2   0.032       2  10400.000        E    00
3   0.032       3  10300.000        E    00
4   0.022       0  10700.000        E    45
5   0.022       1  10500.000        E    45
6   0.032       2  10400.000        E    45
7   0.032       3  10500.000        E    45
8   0.022       0      0.321       mu    00
9   0.022       1      0.323       mu    00
10  0.032       2      0.329       mu    00
11  0.032       3      0.319       mu    00
12  0.022       0      0.329       mu    45
13  0.022       1      0.331       mu    45
14  0.032       2      0.318       mu    45
15  0.032       3      0.326       mu    45

q5 Finish tidying the data#

Finish tidying the data: Use a pivot to move the variable names back to being column names, and make sure the angle values are integers.

## TASK: Finish tidying the data
df_alloys_tidy = (
    df_alloys_separated
    >> gr.tf_pivot_wider(
        names_from="variable",
        values_from="value",
    )
    >> gr.tf_mutate(angle=gr.as_int(DF.angle))
)

## NOTE: Use this to check your work
print(df_alloys_tidy)

assert \
    "E" in df_alloys_tidy.columns, \
    "df_alloys_tidy does not have an `E` column"
assert \
    "mu" in df_alloys_tidy.columns, \
    "df_alloys_tidy does not have an `mu` column"
assert \
    is_integer_dtype(df_alloys_tidy.angle), \
    "df_alloys_tidy.angle is not an integer column"
   thick  id_obs  angle        E     mu
0  0.022       0      0  10600.0  0.321
1  0.022       0     45  10700.0  0.329
2  0.022       1      0  10600.0  0.323
3  0.022       1     45  10500.0  0.331
4  0.032       2      0  10400.0  0.329
5  0.032       2     45  10400.0  0.318
6  0.032       3      0  10300.0  0.319
7  0.032       3     45  10500.0  0.326

Aside: What happens without the unique id?#

Above, I made a fuss about adding a unique id column to the dataset before we start pivoting. What happens if we don’t do that?

# NOTE: Uncomment the code below and run this cell to see the error
# (
#     df_alloys
#     # NOTE: I didn't add a UID here; let's see what happens....
    
#     >> gr.tf_pivot_longer(
#         columns=["E_00", "E_45", "mu_00", "mu_45"],
#         names_to="varang",
#         values_to="value"
#     )
#     >> gr.tf_separate(
#         column="varang",
#         sep="_",
#         into=["var", "ang"],
#     )
#     >> gr.tf_pivot_wider(
#         names_from="var",
#         values_from="value",
#     )
# )

We had to do a lot of work to tidy this dataset. Perhaps there’s a better way?

A magic option: ".value"#

To help minimize some of the work necessary for tidying, gr.tf_pivot_longer() provides some special functionality. To illustrate this, let’s take a look at a full tidying effort on a dataset similar to the alloy one:

## NOTE: No need to edit, this demonstrates the manual approach we've already seen
(
    gr.df_make(
        A_01=[1,2,3,4],
        A_02=[5,6,7,8],
        B_01=[1,3,5,7],
    )
    # Generate a unique observation id
    >> gr.tf_mutate(uid=DF.index)
    # Pivot longer to turn column names into cell values
    >> gr.tf_pivot_longer(
        columns=["A_01", "A_02", "B_01"],
        names_to="varidx",
        values_to="value",
    )
    # Separate the "true" column names from the cell entries
    >> gr.tf_separate(
        column="varidx",
        sep="_",
        into=["var", "idx"],
    )
    # Pivot wider to place correct column names in header
    >> gr.tf_pivot_wider(
        names_from="var",
        values_from="value",
    )
)
uid idx A B
0 0 01 1.0 1.0
1 0 02 5.0 NaN
2 1 01 2.0 3.0
3 1 02 6.0 NaN
4 2 01 3.0 5.0
5 2 02 7.0 NaN
6 3 01 4.0 7.0
7 3 02 8.0 NaN

This gets the job done. However, there is a much more succinct way to tidy this dataset using just one call to gr.tf_pivot_longer() with the ".value" argument.

## NOTE: No need to edit, this demonstrates the ".value" approach
(
    gr.df_make(
        A_01=[1,2,3,4],
        A_02=[5,6,7,8],
        B_01=[1,3,5,7],
    )
    
>> gr.tf_pivot_longer(
        columns=["A_01", "A_02", "B_01"],
        names_to=[".value", "idx"],
        names_sep="_",
    )
)
idx A B
0 01 1.0 1.0
1 02 5.0 NaN
2 01 2.0 3.0
3 02 6.0 NaN
4 01 3.0 5.0
5 02 7.0 NaN
6 01 4.0 7.0
7 02 8.0 NaN

This style of calling gr.tf_pivot_longer() with the names_sep argument effectively embeds a call to gr.tf_separate() within the pivot. The verb then automates the tedious steps of creating a unique ID and pivoting wider.

In order to use this special functionality, we have to target the final column names using the special ".value" argument. Note that this shows up in the names_to set of arguments; where we would have specified the name of the column that we would eventually pivot wider:

    names_to=["var", "idx"],
    names_sep="_",

we instead flag this column as the target to pivot wider with the special argument

    names_to=[".value", "idx"],
    names_sep="_",

q6 Use the ".value" approach#

Use the ".value" approach to tidy the alloy dataset in one pivot.

## TASK: Use the ".value" approach to tidy the alloy dataset in one pivot
(
    gr.df_make(
        thick=[0.022]*2 + [0.032]*2,
        E_00=[10600, 10600, 10400, 10300],
        mu_00=[0.321, 0.323, 0.329, 0.319],
        E_45=[10700, 10500, 10400, 10500],
        mu_45=[0.329, 0.331, 0.318, 0.326],
    )
    >> gr.tf_pivot_longer(
        columns=["E_00", "mu_00", "E_45", "mu_45"],
        names_to=[".value", "angle"],
        names_sep="_",
    )
)
/Users/zach/Git/py_grama/grama/tran_pivot.py:471: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
thick angle E mu
0 0.022 00 10600.0 0.321
1 0.022 45 10700.0 0.329
2 0.022 00 10600.0 0.323
3 0.022 45 10500.0 0.331
4 0.032 00 10400.0 0.329
5 0.032 45 10400.0 0.318
6 0.032 00 10300.0 0.319
7 0.032 45 10500.0 0.326