Data: Separating and Uniting Columns#
Purpose: Data is easiest to use when it is tidy. In fact, grama is specifically designed to use tidy data. Last time we learned how to pivot data, but data can be untidy in other ways. Pivoting helped us when data were locked up in the column headers: This time, we’ll learn how to use separate and unite to deal with cell values that are untidy.
Setup#
import grama as gr
DF = gr.Intention()
%matplotlib inline
# For assertion
from pandas.api.types import is_integer_dtype
Separate and Unite#
Sometimes data are presented in a wide format in order to fit many values on a page. You’ll often see this form of data when looking at large datasets that have been prepared for a report (say, a UN population survey). For example, the following dataset is clearly not tidy:
## NOTE: No need to edit
gr.df_make(
    country=["Afghanistan", "Albania", "Algeria"],
    population_1995=[17586073, 3357858, 29315463],
    population_1996=[18415307, 3341043, 29845208],
)
| country | population_1995 | population_1996 | |
|---|---|---|---|
| 0 | Afghanistan | 17586073 | 18415307 | 
| 1 | Albania | 3357858 | 3341043 | 
| 2 | Algeria | 29315463 | 29845208 | 
The year of each population measurement is “locked up” in the column names. However, pivoting this dataset only gets us so far:
## NOTE: No need to edit
(
    gr.df_make(
        country=["Afghanistan", "Albania", "Algeria"],
        population_1995=[17586073, 3357858, 29315463],
        population_1996=[18415307, 3341043, 29845208],
    )
    >> gr.tf_pivot_longer(
        columns=["population_1995", "population_1996"],
        names_to="name",
        values_to="value",
    )
)
| country | name | value | |
|---|---|---|---|
| 0 | Afghanistan | population_1995 | 17586073 | 
| 1 | Albania | population_1995 | 3357858 | 
| 2 | Algeria | population_1995 | 29315463 | 
| 3 | Afghanistan | population_1996 | 18415307 | 
| 4 | Albania | population_1996 | 3341043 | 
| 5 | Algeria | population_1996 | 29845208 | 
The column names are not simply cell values; they’re a combination of both column names and cell values. To fully tidy this datset, we need to separate the names from the data.
The gr.tf_separate() verb#
The gr.tf_separate() verb allows us to separate string-valued columns on a target character. As a simple example, consider the following (silly) dataset:
## NOTE: No need to edit
(
    gr.df_make(s=["A-123", "B-456", "C-789"])
)
| s | |
|---|---|
| 0 | A-123 | 
| 1 | B-456 | 
| 2 | C-789 | 
If we wanted to separate the column s into two columns on the - character, we could do that with gr.tf_separate().
## NOTE: No need to edit
(
    gr.df_make(s=["A-123", "B-456", "C-789"])
    >> gr.tf_separate(
        column="s",
        sep="-",
        into=["letter", "numbers"],
    )
)
| letter | numbers | |
|---|---|---|
| 0 | A | 123 | 
| 1 | B | 456 | 
| 2 | C | 789 | 
Note that we have two new columns with the separated values. We can use this approach to help tidy datasets.
q1 Separate the population and year#
Use gr.tf_separate() to separate the population and year values into their own columns.
## TASK: Separate the population and year columns
df_pop = (
    gr.df_make(
        country=["Afghanistan", "Albania", "Algeria"],
        population_1995=[17586073, 3357858, 29315463],
        population_1996=[18415307, 3341043, 29845208],
    )
    >> gr.tf_pivot_longer(
        columns=["population_1995", "population_1996"],
        names_to="pop_year",
        values_to="value",
    )
    >> gr.tf_separate(
        column="pop_year",
        sep="_",
        into=["population", "year"],
    )
)
## NOTE: No need to edit; use this to check your work
print(df_pop)
assert \
    "population" in df_pop.columns, \
    "df_pop does not have a `population` column"
assert \
    "year" in df_pop.columns, \
    "df_pop does not have a `year` column"
assert \
    df_pop.shape == (6, 4), \
    "df_pop is the wrong shape"
       country     value  population  year
0  Afghanistan  17586073  population  1995
1      Albania   3357858  population  1995
2      Algeria  29315463  population  1995
3  Afghanistan  18415307  population  1996
4      Albania   3341043  population  1996
5      Algeria  29845208  population  1996
The gr.tf_unite() verb#
The gr.tf_unite() verb performs the inverse of gr.tf_separate(); gr.tf_unite() allows us to combine multiple string columns into a single column. For example, we can undo the separation you performed above:
## NOTE: No need to edit
(
    df_pop
    >> gr.tf_unite(
        "population-year",
        DF.population,
        DF.year,
        sep="-",
    )
)
| country | value | population-year | |
|---|---|---|---|
| 0 | Afghanistan | 17586073 | population-1995 | 
| 1 | Albania | 3357858 | population-1995 | 
| 2 | Algeria | 29315463 | population-1995 | 
| 3 | Afghanistan | 18415307 | population-1996 | 
| 4 | Albania | 3341043 | population-1996 | 
| 5 | Algeria | 29845208 | population-1996 | 
Note that we have to choose what to name the new united column, the set of columns to combine, and a separator character with the sep argument. While the example above is not terribly practical, there are many cases where we’ll have many strings that we might like to join.
q2 Unite the digits#
Add the area code to the front of each localnumber to complete the following phone numbers. Provide the full number as the column number. Make sure to use the separator character -.
## TASK:
df_numbers = (
    gr.df_make(
        localnumber=["255-2112", "255-4200", "867-5309"],
        area=["212", "650", "814"],
    )
    >> gr.tf_unite(
        "number",
        DF.area,
        DF.localnumber,
        sep="-"
    )
)
## NOTE: Use this to check your work
print(df_numbers)
assert \
    "number" in df_numbers.columns, \
    "df_numbers does not have a `number` column"
assert \
    (df_numbers >> gr.tf_filter(gr.str_detect(DF.number, "_"))).shape[0] == 0, \
    "Separator `_` detected; make sure to use `-` as your separator"
assert \
    (df_numbers >> gr.tf_filter(gr.str_detect(DF.number, "^212|^650|^814"))).shape[0] == 3, \
    "The area codes are not at the *front* of the numbers"
         number
0  212-255-2112
1  650-255-4200
2  814-867-5309
Case Study: Tidying an Alloy Dataset#
Now let’s make use of gr.tf_separate() in tidying a more engineering-focused dataset. The following is a small dataset of measured aluminum alloy material properties.
## NOTE: No need to edit
df_alloys = gr.df_make(
    thick=[0.022]*2 + [0.032]*2,
    E_00=[10600, 10600, 10400, 10300],
    mu_00=[0.321, 0.323, 0.329, 0.319],
    E_45=[10700, 10500, 10400, 10500],
    mu_45=[0.329, 0.331, 0.318, 0.326],
)
df_alloys
| thick | E_00 | mu_00 | E_45 | mu_45 | |
|---|---|---|---|---|---|
| 0 | 0.022 | 10600 | 0.321 | 10700 | 0.329 | 
| 1 | 0.022 | 10600 | 0.323 | 10500 | 0.331 | 
| 2 | 0.032 | 10400 | 0.329 | 10400 | 0.318 | 
| 3 | 0.032 | 10300 | 0.319 | 10500 | 0.326 | 
Note that this dataset is not tidy: The digits in the column names are the angles at which the alloys were tested. In order to tidy this dataset, we’ll have to go through a few steps:
q3 Add a unique observation identifier#
Later, when we attempt a gr.tf_pivot_wider() down the line, we’ll need a unique identifier for each row in the original dataset. Create a new column id_obs that uniquely identifies each row.
Hint: You could just use the DataFrame’s index column to do this….
## TASK: Add a unique identifier (UID) to each row
df_alloys_id = (
    df_alloys
    >> gr.tf_mutate(id_obs=DF.index)
)
## NOTE: Use this to check your work
print(df_alloys_id)
assert \
    "id_obs" in df_alloys_id.columns, \
    "df_alloys_id has no `id_obs` column"
assert \
    len(set(df_alloys_id.id_obs)) == len(df_alloys_id.id_obs), \
    "The values in df_alloys_id.id_obs are not unique"
   thick   E_00  mu_00   E_45  mu_45  id_obs
0  0.022  10600  0.321  10700  0.329       0
1  0.022  10600  0.323  10500  0.331       1
2  0.032  10400  0.329  10400  0.318       2
3  0.032  10300  0.319  10500  0.326       3
q4 Separate the variable and angle data#
The following code pivots the data longer; use gr.tf_separate() to separate the variable and angle data.
## TASK: Separate the `variable` and `angle` data
df_alloys_separated = (
    df_alloys_id
    >> gr.tf_pivot_longer(
        columns=["E_00", "E_45", "mu_00", "mu_45"],
        names_to="variable_angle",
        values_to="value"
    )
    >> gr.tf_separate(
        column="variable_angle",
        sep="_",
        into=["variable", "angle"],
    )
)
## NOTE: use this to check your work
print(df_alloys_separated)
assert \
    "variable" in df_alloys_separated.columns, \
    "df_alloys_separated has no `variable` column"
assert \
    "angle" in df_alloys_separated.columns, \
    "df_alloys_separated has no `angle` column"
    thick  id_obs      value variable angle
0   0.022       0  10600.000        E    00
1   0.022       1  10600.000        E    00
2   0.032       2  10400.000        E    00
3   0.032       3  10300.000        E    00
4   0.022       0  10700.000        E    45
5   0.022       1  10500.000        E    45
6   0.032       2  10400.000        E    45
7   0.032       3  10500.000        E    45
8   0.022       0      0.321       mu    00
9   0.022       1      0.323       mu    00
10  0.032       2      0.329       mu    00
11  0.032       3      0.319       mu    00
12  0.022       0      0.329       mu    45
13  0.022       1      0.331       mu    45
14  0.032       2      0.318       mu    45
15  0.032       3      0.326       mu    45
q5 Finish tidying the data#
Finish tidying the data: Use a pivot to move the variable names back to being column names, and make sure the angle values are integers.
## TASK: Finish tidying the data
df_alloys_tidy = (
    df_alloys_separated
    >> gr.tf_pivot_wider(
        names_from="variable",
        values_from="value",
    )
    >> gr.tf_mutate(angle=gr.as_int(DF.angle))
)
## NOTE: Use this to check your work
print(df_alloys_tidy)
assert \
    "E" in df_alloys_tidy.columns, \
    "df_alloys_tidy does not have an `E` column"
assert \
    "mu" in df_alloys_tidy.columns, \
    "df_alloys_tidy does not have an `mu` column"
assert \
    is_integer_dtype(df_alloys_tidy.angle), \
    "df_alloys_tidy.angle is not an integer column"
   thick  id_obs  angle        E     mu
0  0.022       0      0  10600.0  0.321
1  0.022       0     45  10700.0  0.329
2  0.022       1      0  10600.0  0.323
3  0.022       1     45  10500.0  0.331
4  0.032       2      0  10400.0  0.329
5  0.032       2     45  10400.0  0.318
6  0.032       3      0  10300.0  0.319
7  0.032       3     45  10500.0  0.326
Aside: What happens without the unique id?#
Above, I made a fuss about adding a unique id column to the dataset before we start pivoting. What happens if we don’t do that?
# NOTE: Uncomment the code below and run this cell to see the error
# (
#     df_alloys
#     # NOTE: I didn't add a UID here; let's see what happens....
    
#     >> gr.tf_pivot_longer(
#         columns=["E_00", "E_45", "mu_00", "mu_45"],
#         names_to="varang",
#         values_to="value"
#     )
#     >> gr.tf_separate(
#         column="varang",
#         sep="_",
#         into=["var", "ang"],
#     )
#     >> gr.tf_pivot_wider(
#         names_from="var",
#         values_from="value",
#     )
# )
We had to do a lot of work to tidy this dataset. Perhaps there’s a better way?
A magic option: ".value"#
To help minimize some of the work necessary for tidying, gr.tf_pivot_longer() provides some special functionality. To illustrate this, let’s take a look at a full tidying effort on a dataset similar to the alloy one:
## NOTE: No need to edit, this demonstrates the manual approach we've already seen
(
    gr.df_make(
        A_01=[1,2,3,4],
        A_02=[5,6,7,8],
        B_01=[1,3,5,7],
    )
    # Generate a unique observation id
    >> gr.tf_mutate(uid=DF.index)
    # Pivot longer to turn column names into cell values
    >> gr.tf_pivot_longer(
        columns=["A_01", "A_02", "B_01"],
        names_to="varidx",
        values_to="value",
    )
    # Separate the "true" column names from the cell entries
    >> gr.tf_separate(
        column="varidx",
        sep="_",
        into=["var", "idx"],
    )
    # Pivot wider to place correct column names in header
    >> gr.tf_pivot_wider(
        names_from="var",
        values_from="value",
    )
)
| uid | idx | A | B | |
|---|---|---|---|---|
| 0 | 0 | 01 | 1.0 | 1.0 | 
| 1 | 0 | 02 | 5.0 | NaN | 
| 2 | 1 | 01 | 2.0 | 3.0 | 
| 3 | 1 | 02 | 6.0 | NaN | 
| 4 | 2 | 01 | 3.0 | 5.0 | 
| 5 | 2 | 02 | 7.0 | NaN | 
| 6 | 3 | 01 | 4.0 | 7.0 | 
| 7 | 3 | 02 | 8.0 | NaN | 
This gets the job done. However, there is a much more succinct way to tidy this dataset using just one call to gr.tf_pivot_longer() with the ".value" argument.
## NOTE: No need to edit, this demonstrates the ".value" approach
(
    gr.df_make(
        A_01=[1,2,3,4],
        A_02=[5,6,7,8],
        B_01=[1,3,5,7],
    )
    
>> gr.tf_pivot_longer(
        columns=["A_01", "A_02", "B_01"],
        names_to=[".value", "idx"],
        names_sep="_",
    )
)
| idx | A | B | |
|---|---|---|---|
| 0 | 01 | 1.0 | 1.0 | 
| 1 | 02 | 5.0 | NaN | 
| 2 | 01 | 2.0 | 3.0 | 
| 3 | 02 | 6.0 | NaN | 
| 4 | 01 | 3.0 | 5.0 | 
| 5 | 02 | 7.0 | NaN | 
| 6 | 01 | 4.0 | 7.0 | 
| 7 | 02 | 8.0 | NaN | 
This style of calling gr.tf_pivot_longer() with the names_sep argument effectively embeds a call to gr.tf_separate() within the pivot. The verb then automates the tedious steps of creating a unique ID and pivoting wider.
In order to use this special functionality, we have to target the final column names using the special ".value" argument. Note that this shows up in the names_to set of arguments; where we would have specified the name of the column that we would eventually pivot wider:
    names_to=["var", "idx"],
    names_sep="_",
we instead flag this column as the target to pivot wider with the special argument
    names_to=[".value", "idx"],
    names_sep="_",
q6 Use the ".value" approach#
Use the ".value" approach to tidy the alloy dataset in one pivot.
## TASK: Use the ".value" approach to tidy the alloy dataset in one pivot
(
    gr.df_make(
        thick=[0.022]*2 + [0.032]*2,
        E_00=[10600, 10600, 10400, 10300],
        mu_00=[0.321, 0.323, 0.329, 0.319],
        E_45=[10700, 10500, 10400, 10500],
        mu_45=[0.329, 0.331, 0.318, 0.326],
    )
    >> gr.tf_pivot_longer(
        columns=["E_00", "mu_00", "E_45", "mu_45"],
        names_to=[".value", "angle"],
        names_sep="_",
    )
)
/Users/zach/Git/py_grama/grama/tran_pivot.py:471: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
| thick | angle | E | mu | |
|---|---|---|---|---|
| 0 | 0.022 | 00 | 10600.0 | 0.321 | 
| 1 | 0.022 | 45 | 10700.0 | 0.329 | 
| 2 | 0.022 | 00 | 10600.0 | 0.323 | 
| 3 | 0.022 | 45 | 10500.0 | 0.331 | 
| 4 | 0.032 | 00 | 10400.0 | 0.329 | 
| 5 | 0.032 | 45 | 10400.0 | 0.318 | 
| 6 | 0.032 | 00 | 10300.0 | 0.319 | 
| 7 | 0.032 | 45 | 10500.0 | 0.326 |