Data: Separating and Uniting Columns
Contents
Data: Separating and Uniting Columns#
Purpose: Data is easiest to use when it is tidy. In fact, grama is specifically designed to use tidy data. Last time we learned how to pivot data, but data can be untidy in other ways. Pivoting helped us when data were locked up in the column headers: This time, we’ll learn how to use separate and unite to deal with cell values that are untidy.
Setup#
import grama as gr
DF = gr.Intention()
%matplotlib inline
# For assertion
from pandas.api.types import is_integer_dtype
Separate and Unite#
Sometimes data are presented in a wide format in order to fit many values on a page. You’ll often see this form of data when looking at large datasets that have been prepared for a report (say, a UN population survey). For example, the following dataset is clearly not tidy:
## NOTE: No need to edit
gr.df_make(
country=["Afghanistan", "Albania", "Algeria"],
population_1995=[17586073, 3357858, 29315463],
population_1996=[18415307, 3341043, 29845208],
)
country | population_1995 | population_1996 | |
---|---|---|---|
0 | Afghanistan | 17586073 | 18415307 |
1 | Albania | 3357858 | 3341043 |
2 | Algeria | 29315463 | 29845208 |
The year of each population measurement is “locked up” in the column names. However, pivoting this dataset only gets us so far:
## NOTE: No need to edit
(
gr.df_make(
country=["Afghanistan", "Albania", "Algeria"],
population_1995=[17586073, 3357858, 29315463],
population_1996=[18415307, 3341043, 29845208],
)
>> gr.tf_pivot_longer(
columns=["population_1995", "population_1996"],
names_to="name",
values_to="value",
)
)
country | name | value | |
---|---|---|---|
0 | Afghanistan | population_1995 | 17586073 |
1 | Albania | population_1995 | 3357858 |
2 | Algeria | population_1995 | 29315463 |
3 | Afghanistan | population_1996 | 18415307 |
4 | Albania | population_1996 | 3341043 |
5 | Algeria | population_1996 | 29845208 |
The column names are not simply cell values; they’re a combination of both column names and cell values. To fully tidy this datset, we need to separate the names from the data.
The gr.tf_separate()
verb#
The gr.tf_separate()
verb allows us to separate string-valued columns on a target character. As a simple example, consider the following (silly) dataset:
## NOTE: No need to edit
(
gr.df_make(s=["A-123", "B-456", "C-789"])
)
s | |
---|---|
0 | A-123 |
1 | B-456 |
2 | C-789 |
If we wanted to separate the column s
into two columns on the -
character, we could do that with gr.tf_separate()
.
## NOTE: No need to edit
(
gr.df_make(s=["A-123", "B-456", "C-789"])
>> gr.tf_separate(
column="s",
sep="-",
into=["letter", "numbers"],
)
)
letter | numbers | |
---|---|---|
0 | A | 123 |
1 | B | 456 |
2 | C | 789 |
Note that we have two new columns with the separated values. We can use this approach to help tidy datasets.
q1 Separate the population
and year
#
Use gr.tf_separate()
to separate the population
and year
values into their own columns.
## TASK: Separate the population and year columns
df_pop = (
gr.df_make(
country=["Afghanistan", "Albania", "Algeria"],
population_1995=[17586073, 3357858, 29315463],
population_1996=[18415307, 3341043, 29845208],
)
>> gr.tf_pivot_longer(
columns=["population_1995", "population_1996"],
names_to="pop_year",
values_to="value",
)
>> gr.tf_separate(
column="pop_year",
sep="_",
into=["population", "year"],
)
)
## NOTE: No need to edit; use this to check your work
print(df_pop)
assert \
"population" in df_pop.columns, \
"df_pop does not have a `population` column"
assert \
"year" in df_pop.columns, \
"df_pop does not have a `year` column"
assert \
df_pop.shape == (6, 4), \
"df_pop is the wrong shape"
country value population year
0 Afghanistan 17586073 population 1995
1 Albania 3357858 population 1995
2 Algeria 29315463 population 1995
3 Afghanistan 18415307 population 1996
4 Albania 3341043 population 1996
5 Algeria 29845208 population 1996
The gr.tf_unite()
verb#
The gr.tf_unite()
verb performs the inverse of gr.tf_separate()
; gr.tf_unite()
allows us to combine multiple string columns into a single column. For example, we can undo the separation you performed above:
## NOTE: No need to edit
(
df_pop
>> gr.tf_unite(
"population-year",
DF.population,
DF.year,
sep="-",
)
)
country | value | population-year | |
---|---|---|---|
0 | Afghanistan | 17586073 | population-1995 |
1 | Albania | 3357858 | population-1995 |
2 | Algeria | 29315463 | population-1995 |
3 | Afghanistan | 18415307 | population-1996 |
4 | Albania | 3341043 | population-1996 |
5 | Algeria | 29845208 | population-1996 |
Note that we have to choose what to name the new united column, the set of columns to combine, and a separator character with the sep
argument. While the example above is not terribly practical, there are many cases where we’ll have many strings that we might like to join.
q2 Unite the digits#
Add the area
code to the front of each localnumber
to complete the following phone numbers. Provide the full number as the column number
. Make sure to use the separator character -
.
## TASK:
df_numbers = (
gr.df_make(
localnumber=["255-2112", "255-4200", "867-5309"],
area=["212", "650", "814"],
)
>> gr.tf_unite(
"number",
DF.area,
DF.localnumber,
sep="-"
)
)
## NOTE: Use this to check your work
print(df_numbers)
assert \
"number" in df_numbers.columns, \
"df_numbers does not have a `number` column"
assert \
(df_numbers >> gr.tf_filter(gr.str_detect(DF.number, "_"))).shape[0] == 0, \
"Separator `_` detected; make sure to use `-` as your separator"
assert \
(df_numbers >> gr.tf_filter(gr.str_detect(DF.number, "^212|^650|^814"))).shape[0] == 3, \
"The area codes are not at the *front* of the numbers"
number
0 212-255-2112
1 650-255-4200
2 814-867-5309
Case Study: Tidying an Alloy Dataset#
Now let’s make use of gr.tf_separate()
in tidying a more engineering-focused dataset. The following is a small dataset of measured aluminum alloy material properties.
## NOTE: No need to edit
df_alloys = gr.df_make(
thick=[0.022]*2 + [0.032]*2,
E_00=[10600, 10600, 10400, 10300],
mu_00=[0.321, 0.323, 0.329, 0.319],
E_45=[10700, 10500, 10400, 10500],
mu_45=[0.329, 0.331, 0.318, 0.326],
)
df_alloys
thick | E_00 | mu_00 | E_45 | mu_45 | |
---|---|---|---|---|---|
0 | 0.022 | 10600 | 0.321 | 10700 | 0.329 |
1 | 0.022 | 10600 | 0.323 | 10500 | 0.331 |
2 | 0.032 | 10400 | 0.329 | 10400 | 0.318 |
3 | 0.032 | 10300 | 0.319 | 10500 | 0.326 |
Note that this dataset is not tidy: The digits in the column names are the angles at which the alloys were tested. In order to tidy this dataset, we’ll have to go through a few steps:
q3 Add a unique observation identifier#
Later, when we attempt a gr.tf_pivot_wider()
down the line, we’ll need a unique identifier for each row in the original dataset. Create a new column id_obs
that uniquely identifies each row.
Hint: You could just use the DataFrame’s index
column to do this….
## TASK: Add a unique identifier (UID) to each row
df_alloys_id = (
df_alloys
>> gr.tf_mutate(id_obs=DF.index)
)
## NOTE: Use this to check your work
print(df_alloys_id)
assert \
"id_obs" in df_alloys_id.columns, \
"df_alloys_id has no `id_obs` column"
assert \
len(set(df_alloys_id.id_obs)) == len(df_alloys_id.id_obs), \
"The values in df_alloys_id.id_obs are not unique"
thick E_00 mu_00 E_45 mu_45 id_obs
0 0.022 10600 0.321 10700 0.329 0
1 0.022 10600 0.323 10500 0.331 1
2 0.032 10400 0.329 10400 0.318 2
3 0.032 10300 0.319 10500 0.326 3
q4 Separate the variable
and angle
data#
The following code pivots the data longer; use gr.tf_separate()
to separate the variable
and angle
data.
## TASK: Separate the `variable` and `angle` data
df_alloys_separated = (
df_alloys_id
>> gr.tf_pivot_longer(
columns=["E_00", "E_45", "mu_00", "mu_45"],
names_to="variable_angle",
values_to="value"
)
>> gr.tf_separate(
column="variable_angle",
sep="_",
into=["variable", "angle"],
)
)
## NOTE: use this to check your work
print(df_alloys_separated)
assert \
"variable" in df_alloys_separated.columns, \
"df_alloys_separated has no `variable` column"
assert \
"angle" in df_alloys_separated.columns, \
"df_alloys_separated has no `angle` column"
thick id_obs value variable angle
0 0.022 0 10600.000 E 00
1 0.022 1 10600.000 E 00
2 0.032 2 10400.000 E 00
3 0.032 3 10300.000 E 00
4 0.022 0 10700.000 E 45
5 0.022 1 10500.000 E 45
6 0.032 2 10400.000 E 45
7 0.032 3 10500.000 E 45
8 0.022 0 0.321 mu 00
9 0.022 1 0.323 mu 00
10 0.032 2 0.329 mu 00
11 0.032 3 0.319 mu 00
12 0.022 0 0.329 mu 45
13 0.022 1 0.331 mu 45
14 0.032 2 0.318 mu 45
15 0.032 3 0.326 mu 45
q5 Finish tidying the data#
Finish tidying the data: Use a pivot to move the variable
names back to being column names, and make sure the angle
values are integers.
## TASK: Finish tidying the data
df_alloys_tidy = (
df_alloys_separated
>> gr.tf_pivot_wider(
names_from="variable",
values_from="value",
)
>> gr.tf_mutate(angle=gr.as_int(DF.angle))
)
## NOTE: Use this to check your work
print(df_alloys_tidy)
assert \
"E" in df_alloys_tidy.columns, \
"df_alloys_tidy does not have an `E` column"
assert \
"mu" in df_alloys_tidy.columns, \
"df_alloys_tidy does not have an `mu` column"
assert \
is_integer_dtype(df_alloys_tidy.angle), \
"df_alloys_tidy.angle is not an integer column"
thick id_obs angle E mu
0 0.022 0 0 10600.0 0.321
1 0.022 0 45 10700.0 0.329
2 0.022 1 0 10600.0 0.323
3 0.022 1 45 10500.0 0.331
4 0.032 2 0 10400.0 0.329
5 0.032 2 45 10400.0 0.318
6 0.032 3 0 10300.0 0.319
7 0.032 3 45 10500.0 0.326
Aside: What happens without the unique id?#
Above, I made a fuss about adding a unique id column to the dataset before we start pivoting. What happens if we don’t do that?
# NOTE: Uncomment the code below and run this cell to see the error
# (
# df_alloys
# # NOTE: I didn't add a UID here; let's see what happens....
# >> gr.tf_pivot_longer(
# columns=["E_00", "E_45", "mu_00", "mu_45"],
# names_to="varang",
# values_to="value"
# )
# >> gr.tf_separate(
# column="varang",
# sep="_",
# into=["var", "ang"],
# )
# >> gr.tf_pivot_wider(
# names_from="var",
# values_from="value",
# )
# )
We had to do a lot of work to tidy this dataset. Perhaps there’s a better way?
A magic option: ".value"
#
To help minimize some of the work necessary for tidying, gr.tf_pivot_longer()
provides some special functionality. To illustrate this, let’s take a look at a full tidying effort on a dataset similar to the alloy one:
## NOTE: No need to edit, this demonstrates the manual approach we've already seen
(
gr.df_make(
A_01=[1,2,3,4],
A_02=[5,6,7,8],
B_01=[1,3,5,7],
)
# Generate a unique observation id
>> gr.tf_mutate(uid=DF.index)
# Pivot longer to turn column names into cell values
>> gr.tf_pivot_longer(
columns=["A_01", "A_02", "B_01"],
names_to="varidx",
values_to="value",
)
# Separate the "true" column names from the cell entries
>> gr.tf_separate(
column="varidx",
sep="_",
into=["var", "idx"],
)
# Pivot wider to place correct column names in header
>> gr.tf_pivot_wider(
names_from="var",
values_from="value",
)
)
uid | idx | A | B | |
---|---|---|---|---|
0 | 0 | 01 | 1.0 | 1.0 |
1 | 0 | 02 | 5.0 | NaN |
2 | 1 | 01 | 2.0 | 3.0 |
3 | 1 | 02 | 6.0 | NaN |
4 | 2 | 01 | 3.0 | 5.0 |
5 | 2 | 02 | 7.0 | NaN |
6 | 3 | 01 | 4.0 | 7.0 |
7 | 3 | 02 | 8.0 | NaN |
This gets the job done. However, there is a much more succinct way to tidy this dataset using just one call to gr.tf_pivot_longer()
with the ".value"
argument.
## NOTE: No need to edit, this demonstrates the ".value" approach
(
gr.df_make(
A_01=[1,2,3,4],
A_02=[5,6,7,8],
B_01=[1,3,5,7],
)
>> gr.tf_pivot_longer(
columns=["A_01", "A_02", "B_01"],
names_to=[".value", "idx"],
names_sep="_",
)
)
idx | A | B | |
---|---|---|---|
0 | 01 | 1.0 | 1.0 |
1 | 02 | 5.0 | NaN |
2 | 01 | 2.0 | 3.0 |
3 | 02 | 6.0 | NaN |
4 | 01 | 3.0 | 5.0 |
5 | 02 | 7.0 | NaN |
6 | 01 | 4.0 | 7.0 |
7 | 02 | 8.0 | NaN |
This style of calling gr.tf_pivot_longer()
with the names_sep
argument effectively embeds a call to gr.tf_separate()
within the pivot. The verb then automates the tedious steps of creating a unique ID and pivoting wider.
In order to use this special functionality, we have to target the final column names using the special ".value"
argument. Note that this shows up in the names_to
set of arguments; where we would have specified the name of the column that we would eventually pivot wider:
names_to=["var", "idx"],
names_sep="_",
we instead flag this column as the target to pivot wider with the special argument
names_to=[".value", "idx"],
names_sep="_",
q6 Use the ".value"
approach#
Use the ".value"
approach to tidy the alloy dataset in one pivot.
## TASK: Use the ".value" approach to tidy the alloy dataset in one pivot
(
gr.df_make(
thick=[0.022]*2 + [0.032]*2,
E_00=[10600, 10600, 10400, 10300],
mu_00=[0.321, 0.323, 0.329, 0.319],
E_45=[10700, 10500, 10400, 10500],
mu_45=[0.329, 0.331, 0.318, 0.326],
)
>> gr.tf_pivot_longer(
columns=["E_00", "mu_00", "E_45", "mu_45"],
names_to=[".value", "angle"],
names_sep="_",
)
)
/Users/zach/Git/py_grama/grama/tran_pivot.py:471: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
thick | angle | E | mu | |
---|---|---|---|---|
0 | 0.022 | 00 | 10600.0 | 0.321 |
1 | 0.022 | 45 | 10700.0 | 0.329 |
2 | 0.022 | 00 | 10600.0 | 0.323 |
3 | 0.022 | 45 | 10500.0 | 0.331 |
4 | 0.032 | 00 | 10400.0 | 0.329 |
5 | 0.032 | 45 | 10400.0 | 0.318 |
6 | 0.032 | 00 | 10300.0 | 0.319 |
7 | 0.032 | 45 | 10500.0 | 0.326 |