Day 2 (Take-Home) Programmatic Data Operations

Author: Zach del Rosario

The purpose of this exercise is to give you some tools to work with data programmatically; that is, using a programming language. While you can carry out many data operations by hand or with spreadsheet programs, you will see that doing things programmatically is extremely powerful.

Learning Outcomes

By working through this notebook, you will be able to:

  • Use Pandas’ DataFrame object to represent data

  • Use DataFrame operations from the package py-grama to operate on data

  • Use basic data checks: shapes, data types, head and tail

  • Perform fundamental data wrangling: type conversions, pivoting, filtering, mutating

import numpy as np
import pandas as pd
import grama as gr

DF = gr.Intention()

# For downloading data
import os
import requests

DataFrames


A DataFrame is a data structure provided by Pandas. In contrast with lists (which we saw in a previous exercise), DataFrames are explicitly designed to facilitate data analysis. Accordingly, they provide a number of helpful features that aid in data analysis and operations.

A DataFrame is a rectangular representation of data – it consists of rows and columns. If the data are tidy, then each row represents an observation, each column represents a variable, and each cell represents a single measurement.

For instance, the following code chunk downloads a alloy dataset into the DataFrame df_mpea—here each row is an alloy, and each column is some physical property of that alloy. These data come from the report Borg et al. 2020.

# Filename for local data
filename_data = "./data/mpea.csv"

# The following code downloads the data, or (after downloaded)
# loads the data from a cached CSV on your machine
if not os.path.exists(filename_data):
    # Make request for data
    url_data = "https://docs.google.com/spreadsheets/u/1/d/1MsF4_jhWtEuZSvWfXLDHWEqLMScGCVXYWtqHW9Y7Yt0/export?format=csv"
    r = requests.get(url_data, allow_redirects=True)
    open(filename_data, 'wb').write(r.content)
    print("   MPEA data downloaded from public Google sheet")
else:
    # Note data already exists
    print("    MPEA data loaded locally")
    
# Read the data into memory
df_mpea = pd.read_csv(filename_data)

Let’s use some of the basic attributes of the data to get some basic facts:

# Check the shape
df_mpea.shape

We have 1653 rows (also called observations) on 20 columns (also called variables or features).

df_mpea.head()

The head method shows just the top of the DataFrame; this is useful for “getting a sense” for what’s in the data.

df_mpea.dtypes

The dtypes attribute gives us the data type for each column. Depending on the dataset, you might find that your data loads in with strange datatypes. This can happen, for instance, if your numeric values are contained within string characters (e.g. "1.23"). If this happens, you can catch the fact with a call to df_data.dtypes.

Q1: Inspect a DataFrame

Consult the pandas documentation (it might be useful to use a page search) and use some basic calls on df_mpea to answer the following questions:

  • What are the last five observations in the DataFrame?

  • How many rows are in df_mpea? How many columns?

  • How would you access the column PROPERTY: Microstructure?

###
# TASK: Inspect df_data
# TODO: Show the last five observations of df_mpea
###

# -- WRITE YOUR CODE BELOW -----
###
# TASK: Inspect df_mpea
# TODO: Determine the number of rows and columns in df_mpea
###

# -- WRITE YOUR CODE BELOW -----
###
# TASK: Inspect df_data
# TODO: Grab the column `PROPERTY: Microstructure` alone
###

# -- WRITE YOUR CODE BELOW -----

These manipulations are simple, but they are bread-and-butter for studying new datasets.

Grama


The py-grama package builds on top of Pandas to provide a pipeline-based data (and model) infrastructure. Grama provides many of the same functions as Pandas (really, just different ways to use the same Pandas functions):

(
   df_mpea
   >> gr.tf_head()
)

One of the advantages of using py-grama is that we can write data pipelines to organize our data operations. For instance, the following code filters the MPEA dataset to only those cases that have a valid Yield Strength (YS) and Ultimate Tensile Strength (UTS), and computes a correlation coefficient between those two quantities.

## NOTE: No need to edit; run and see the result
(
    df_mpea
    >> gr.tf_filter(
        gr.not_nan(DF["PROPERTY: YS (MPa)"]),
        gr.not_nan(DF["PROPERTY: UTS (MPa)"]),
    )
    >> gr.tf_summarize(
        rho_YS_UTS=gr.corr(
            DF["PROPERTY: YS (MPa)"],
            DF["PROPERTY: UTS (MPa)"],
        )
    )
)

As we might expect, these two properties are strongly correlated.

This code shows off a few concepts, which we’ll explore below: The pipe operator >>, Grama verbs (such as tf_filter), and the data pronoun DF.

The pipe operator >>

It’s helpful to think of the pipe operator >> as the words “and then”. That means code like this:

(
    df_mpea
    >> gr.tf_filter( ... )
    >> gr.tf_mutate( ... )
    >> gr.tf_pivot_longer( ... )
)

Can be read something like an English sentence, where we are using various verbs to operate on the data:

(
    Start with df_mpea
    and then filter the data
    and then mutate the data
    and then pivot the data in to a longer format
)

We don’t yet know what these verbs do; we’ll learn more in the exercises below!

Selecting

The tf_select verb allows us to select one or more columns; this is helpful when we want to focus on just a handful of properties, such as the chemical formulas.

(
    df_mpea
    >> gr.tf_select("FORMULA")
)

We can select multiple columns by providing multiple arguments.

Q2: Use tf_select to select the formula and microstructure columns only.

###
# TASK: Select the formula and microstructure columns only
###

# -- WRITE YOUR CODE BELOW -----

We can also use some selection helpers to make tf_select even more convenient. For instance, the gr.everything() function just selects all the columns, which at first seems silly:

(
    df_mpea
    >> gr.tf_select(gr.everything())
    >> gr.tf_head()
)

However, when we use gr.everything() along with specific columns, we can re-arrange the columns to make quick comparisons easier. For instance, let’s move the reference information to the left. We could then easily copy the DOI’s to find the original reference for each observation.

(
    df_mpea
    >> gr.tf_select("REFERENCE: doi", gr.everything())
    >> gr.tf_head()
)

There are a variety of other selection helpers, including:

  • gr.starts_with(...) will select all columns that start with a given string

  • gr.ends_with(...) will select all columns that end with a given string

  • gr.contains(...) will select all columns that contain a given string

  • gr.matches(...) will select all columns that match a given regular expression

You’ll practice using selection helpers in the next task.

Q3 Use a selection helper to find all of the columns with the string "REFERENCE"

###
# TASK: Select the formula and microstructure columns only
###

# -- WRITE YOUR CODE BELOW -----

Renaming

Aside from selecting columns, we can also make convenience modifications to the data. The verb tf_rename allows us to rename columns, usually to create a more compact, convenient name:

## NOTE: No need to edit
(
    df_mpea
    >> gr.tf_rename(
        microstructure="PROPERTY: Microstructure",
    )
    >> gr.tf_head()
)

This is a good time to step aside from verbs to talk about the data pronoun.

Interlude: Pipelines and the “Data Pronoun”


Imagine we wanted to search through the dataset to find only those materials with a FCC microstructure. Above, we gave the microstructure column a new, convenient name. We might like to use that new, convenient name when searching for FCC materials. However, we’re going to run into an issue:

## NOTE: Try uncommenting and running the following code; it WILL break!
# (
#     df_mpea
#     >> gr.tf_rename(
#         microstructure="PROPERTY: Microstructure",
#     )
#     >> gr.tf_filter(
#         df_mpea["microstructure"] == "FCC"
#     )
# )

If we want to refer to the data now—as it is currently in the pipeline—we need a name to refer to that DataFrame. This is where the data pronoun comes in; remember when we ran this line way up above in the setup chunk?

DF = gr.Intention()

This assigns the data pronoun to the name DF. The data pronoun represents a DataFrame, so we can use things like column access DF["column name"]. We can use this to take advantage of the new (shorter) name we gave to the microstructure column:

(
    df_mpea
    >> gr.tf_rename(
        microstructure="PROPERTY: Microstructure",
    )
    >> gr.tf_filter(
        DF["microstructure"] == "FCC"
    )
)

Together, the pipe operator >> and the data pronoun DF form a powerful team that helps us do sophisticated data operations.

Q4 Re-write the following code using the pipe operator and data pronoun

###
# TASK: Eliminate the intermediate variables by using the data pronoun
###

# -- NO NEED TO EDIT; REWRITE THIS CODE -----
# Set up a simple dataset
df_initial = gr.df_make(
    A=[1, 2, 3],
    longcolumnname=[4, 5, 6],
)
print(df_initial)

df_new = (
    df_initial
    >> gr.tf_rename(B="longcolumnname")
)

(
    df_new
    >> gr.tf_filter(df_new.B == 5)
)

# -- WRITE YOUR CODE BELOW -----
(
    df_initial
    # Use pipes >> and the data pronoun DF;
    # you should only need two lines of code

)

Back to Verbs


Filtering

We saw tf_filter above; this allows us to filter a dataset to only those rows satisfying some logical criterion. This makes answering basic questions about the data very easy. For instance, we might be interested in a particular processing method; we could find only those rows matching a specified method:

## NOTE: No need to edit
(
    df_mpea
    >> gr.tf_filter(DF["PROPERTY: Processing method"] == "POWDER")
)

Notice that not all of the cells have useful values; some have NaN as their value (which means Not a Number). These could be due to any of a number of potential issues; perhaps the original reference did not report that value, meaning that information exists but is missing.

There are some helper functions to help deal with NaN values in filters: gr.not_nan(df.column) will return True when its input is not NaN, while gr.is_nan(df.column) will do the reverse.

Q5 Filter the MPEA dataset to only those rows with a valid Yield Strength. Compare the original number of rows with the number of valid rows.

###
# TASK: Filter the data to find the non-NaN Yield Strength values
###

# -- NO NEED TO EDIT; USE FOR COMPARISON -----
print("Original shape: {}".format(df_mpea.shape))

# -- WRITE YOUR CODE BELOW -----

Mutating

The tf_mutate verb allows us to create / modify columns based on existing column values. For instance, we could use a mutation to convert the units in a column:

## NOTE: No need to edit
(
    df_mpea
    >> gr.tf_mutate(
        E_MPa = DF["PROPERTY: Young modulus (GPa)"] * 1000
    )
    >> gr.tf_select("E_MPa", gr.everything())
)

This might be useful if we aimed to compare two quantities; elasticity and ultimate tensile strength are somewhat related properties, so we might want to compare them.

## NOTE: No need to edit
(
    df_mpea
    >> gr.tf_mutate(
        E_MPa = DF["PROPERTY: Young modulus (GPa)"] * 1000
    )
    >> gr.tf_rename(
        UTS_MPa = "PROPERTY: UTS (MPa)"
    )
    >> gr.tf_filter(
        gr.not_nan(DF.UTS_MPa),
        gr.not_nan(DF.E_MPa),
    )
    >> gr.tf_select(
        "UTS_MPa",
        "E_MPa",
        gr.everything(),
    )
)

Q6 Convert the weight parts per million (wppm) of Oxygen to a (weight) percent.

###
# TASK: Convert wppm to a weight percentage
# NOTE: There is some scaffolding code; you need only
#       write the call to tf_mutate
###

(
    df_mpea
    >> gr.tf_filter(
        gr.not_nan(DF["PROPERTY: O content (wppm)"]),
    )
# -- WRITE YOUR CODE HERE -----
    ## TODO: Use gr.tf_mutate to do the conversion

    >> gr.tf_select("O_percent", gr.everything())
    >> gr.tf_head()
)

Wrangling Data


With these basic tools—data pipelines and verbs—we have many of the tools we need to do data wrangling. Very frequently, data are messy and unusable; we need to do some wrangling to get our data into-shape for analysis. The last few tasks will focus on key steps in data wrangling.

Data Converting

There’s something wrong with the Nitrogen content column:

(
    df_mpea
    >> gr.tf_select(gr.contains("content"))
).dtypes

The Oxygen and Carbon content columns are fine—they’re float64, which is a numeric type as we’d expect. But the Nitrogen content is an object. Let’s see what specific values this column takes:

set(
    df_mpea["PROPERTY: N content (wppm)"]
)

Use of set

The set datatype only allows one of each unique value; calling set() on a column is a simple way to find all the unique values in a column.

It seems that the original data are mixed; some values are a numeric ppm value, while others are the qualitative statement "undetectable" (and yet others are simply missing). We can use the Grama helper gr.as_numeric() to help convert the data to a numeric type.

## NOTE: No need to edit
(
    df_mpea
    >> gr.tf_rename(N_wppm="PROPERTY: N content (wppm)")
    >> gr.tf_mutate(N_converted = gr.as_numeric(DF.N_wppm))
    >> gr.tf_select("N_wppm", "N_converted")
).dtypes

With this type conversion, we can express all three element columns as percentages.

Q7 Fix the conversion from wppm to a weight percentage

###
# TASK: Fix the conversion of N wppm
# NOTE: There is some scaffolding code; you need only
#       edit one line
###

(
    df_mpea
    >> gr.tf_mutate(
        O_percent=DF["PROPERTY: O content (wppm)"] * 100 / 1e6,
        C_percent=DF["PROPERTY: C content (wppm)"] * 100 / 1e6,

## --- WRITE YOUR CODE HERE -----
        ## TODO: Fix the following line of code
#         N_percent=DF["PROPERTY: N content (wppm)"] * 100 / 1e6
    )
    >> gr.tf_select(gr.contains("percent"), gr.everything())
)

Pivoting Data

Another common data issue is when our data are in the wrong shape. To illustrate, let’s look at another dataset:

from grama.data import df_stang_wide
df_stang_wide

These are observations on different samples of the same rolled aluminum alloy, with measurements taken at different angles relative to the direction of rolling. Note that the relative angle 0, 45, 90 is in the column names, rather than in cells. This means we would need to write special-purpose code in order to analyze these data.

Rather than re-invent our analysis code for every new dataset, we can instead reshape our data into a single, consistent format for a single set of analysis tools. To that end, we are going to reshape the data into a tidy format.

Our goal will be to wrangle this messy, wide dataset into tidy, long format, shown below:

from grama.data import df_stang
df_stang

To carry out this reshaping, we will use a set of pivoting tools. As a simple example, gr.tf_pivot_longer() takes a wide dataset and makes it longer.

df_tmp = (
    gr.df_make(
        A=[1, 2, 3],
        B=[4, 5, 6],
        C=[7, 8, 9],
    )
)
print(df_tmp)

(
    df_tmp
    >> gr.tf_pivot_longer(
        columns=["A", "B", "C"],
        names_to="name",
        values_to="value",
    )
)

Q8 Pivot df_stang_wide longer to put all the angle values in cells

Hint 1: Make sure to add an observation column with the index_to argument.

###
# TASK: Pivot the data longer
###

# -- WRITE YOUR CODE HERE -----

df_q8

Execute the following to check your work.

try:
    assert(df_q8.shape[0] == 54)
except AssertionError:
    raise AssertionError("The DataFrame is not sufficiently long; did you pivot?")
    
try:
    assert(df_q8.shape[1] == 5)
except AssertionError:
    raise AssertionError("The DataFrame should have five columns")
    
try:
    assert("observation" in df_q8.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have have an `'observation'` column")
    
print("Success!")

Next, we need to separate the measurement identifiers "E", "mu" from the angle measurements. For that, we can use the gr.tf_separate() verb. This allows us to take string values and split them into separate columns, based on a separator character:

(
    gr.df_make(
        combined=["a-1", "b-2", "c-3"],
    )
    >> gr.tf_separate(
        column="combined",
        into=["letter", "number"],
        sep="-",
    )
)

Q9 Use gr.tf_separate() to separate the measurement identifiers "E", "mu" from the measurement angles. Make sure to call the angle column "angle".

###
# TASK: Pivot the data longer
###

df_q9 = (
    df_q8
# -- WRITE YOUR CODE HERE -----
    ## Use gr.tf_separate()

)
df_q9

Use the following code to check your work.

try:
    assert(df_q9.shape[0] == df_q8.shape[0])
except AssertionError:
    raise AssertionError("The DataFrame is not the right length; how did that happen?")
    
try:
    assert(df_q9.shape[1] == 6)
except AssertionError:
    raise AssertionError("The DataFrame should have six columns")
    
try:
    assert("angle" in df_q9.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have an 'angle' column")
    
print("Success!")

We’re nearly there! Finally, we need to turn the measurement identifiers "E", "mu" back into column names. We can do this by pivoting wider.

Q10 Pivot the data wider to turn the measurement identifiers "E", "mu" back into column names.

Hint: You should only need to set the names_from and values_from arguments with this function.

###
# TASK: Pivot the data wider
###

df_q10 = (
    df_q9
# -- WRITE YOUR CODE HERE -----
    ## Use gr.tf_pivot_wider()

)
df_q10

Use the following code to check your work.

try:
    assert(df_q10.shape[0] == 27)
except AssertionError:
    raise AssertionError("The DataFrame is not the right length; how did that happen?")
    
try:
    assert("angle" in df_q10.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have an 'angle' column")
    
try:
    assert("E" in df_q10.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have an 'E' column")
    
try:
    assert("mu" in df_q10.columns)
except AssertionError:
    raise AssertionError("The DataFrame should have an 'mu' column")
    
print("Success!")

Bonus: One-step pivot

As a closing, bonus demonstration, we illustrate some advanced options in gr.tf_pivot_longer() that would allow you to reshape the data in single call. This uses the ".value" special argument to signal that the values that would be placed in that names_to column should really be column names themselves; put differently, the ".value" keyword is like adding a pivot wider after the pivot longer.

## NOTE: No need to edit
(
    df_stang_wide
    >> gr.tf_pivot_longer(
        columns=["E_00", "mu_00", "E_45", "mu_45", "E_90", "mu_90"],
        names_to=[".value", "angle"],
        names_sep="_",
    )
)

Survey


Once you complete this activity, please fill out the following 30-second survey:

Endnotes

  • The data portions of Grama make heavy use of ideas from the Tidyverse; specifically the dplyr package. However, those packages are for the R programming language.

  • The HEA dataset is from:

Borg, C.K.H., Frey, C., Moh, J. et al. Expanded dataset of mechanical properties and observed phases of multi-principal element alloys. Sci Data 7, 430 (2020). https://doi.org/10.1038/s41597-020-00768-9