Intro to Data Wrangling and Tidy Data

Author: Zach del Rosario

Learning outcomes

By working through this notebook, you will be able to:

  • Understand basic principles of working with data in software

  • State the basic ideas of tidy data

  • State the basic ideas of data wrangling

Setup

In this exercise, we’ll make use of the Pandas and Grama packages to work with data. Pandas is a package for data analysis, and it supplies the DataFrame object for representing tabular data. Grama builds on top of Pandas to provide pipeline-based tools for data analysis and machine learning.

This notebook focuses on introducing the ideas; the evening notebook will teach you the mechanics of how to use these tools.

import numpy as np
import pandas as pd
import grama as gr
%matplotlib inline

DF = gr.Intention()

# For downloading data
import os
import requests
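
Throughout this notebook you’ll see Grama’s “pipeline” style: a DataFrame is piped through transformation functions with the >> operator, and DF refers to columns of the incoming data. Below is a minimal sketch of that style on a made-up DataFrame (not our dataset); it relies on the imports and the DF = gr.Intention() defined above.

## NOTE: Illustrative sketch only; toy data, not the Weibull dataset
df_tiny = pd.DataFrame({"x": [1, 2, 3]})
df_piped = (
    df_tiny
    >> gr.tf_mutate(y=DF.x * 2)  # add a column y computed from column x
)
print(df_piped)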

The following code downloads the same data you extracted in the previous day’s Tabula exercise.

# Filename for local data
filename_data = "./data/tabula-weibull.csv"

# The following code downloads the data if it is not already present;
# either way, the data are then read from the local CSV below
if not os.path.exists(filename_data):
    # Make request for data
    url_data = "https://raw.githubusercontent.com/zdelrosario/mi101/main/mi101/data/tabula-weibull1939-table4.csv"
    r = requests.get(url_data, allow_redirects=True)
    # Ensure the local data directory exists, then cache the file
    os.makedirs(os.path.dirname(filename_data), exist_ok=True)
    with open(filename_data, 'wb') as f:
        f.write(r.content)
    print("   Tabula-extracted data downloaded from GitHub")
else:
    # Note data already exists
    print("   Tabula-extracted data loaded locally")
    
# Read the data into memory
df_tabula = pd.read_csv(filename_data)

These are data on the tensile strength of specimens of stearic acid and plaster-of-paris.

df_tabula

Note that Pandas renamed some columns to avoid giving us duplicate column names. The original (long) names are useful for holding metadata, but shorter column names are far easier to work with in a computational environment.
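
For instance, when a CSV has two columns with the same header, Pandas de-duplicates them by appending a suffix such as ".1". Here is a minimal illustration using a toy CSV string (not the Weibull data):

## NOTE: Illustrative sketch only
import io
csv_text = "x,x\n1,2\n3,4\n"
df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo.columns.tolist())  # ['x', 'x.1']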

Q1: Complete the code below to rename the columns

Hint: You can click-and-drag on the DataFrame printout above for a less error-prone way of giving the original column names.

###
# TASK: Copy the original column names into the double-quotes below
#       to complete the code and rename the columns with shorter
#       names.
###

df_q1 = (
    df_tabula
    >> gr.tf_rename(
## TASK: Uncomment these lines and fill-in the new (short) column names
#         obs_1="",   # Observation Number, Block 1
#         area_1="",  # Specimen area, Block 1
#         sigma_1="", # Stress, Block 1
#         obs_2="",   # Observation Number, Block 2
#         area_2="",  # Specimen area, Block 2
#         sigma_2="", # Stress, Block 2

    )
)

## NOTE: No need to edit, this will show your renamed data
df_q1

Use the following to check your work:

## NO NEED TO EDIT; use this to check your work
assert(set(df_q1.columns) == {"obs_1", "area_1", "sigma_1", "obs_2", "area_2", "sigma_2"})
print("Success!")

Now the column names are much shorter, but they are far less interpretable. To help keep track of what each column means, we can construct a data dictionary.

Q2: Complete the data dictionary below to document the units associated with the short column names.

Note: In his 1939 paper, Weibull reports these stress values in units of kg / mm^2. This is strange; we’ll return to this later!

df_tabula.head()

TASK: Complete the text below. (Note: You can double-click on this cell to start editing it!)

Column  | Units
obs_1   | (Unitless)
area_1  | ???
sigma_1 | ???
obs_2   | ???
area_2  | ???
sigma_2 | ???

Pivoting

Note that the data are in a rather strange layout; the first three columns report the specimen number (obs_1), cross-sectional area (area_1), and measured stress (sigma_1) for all the observations up to 13. Then the last three columns report the same quantities starting at observation 14. This sort of layout with “two blocks” is convenient for reporting values in a compact form, but it is highly inconvenient for doing analysis, as we’ll see next.

df_q1.head()

Imagine we wanted to compute some simple statistics on these data, say the mean of the stress values. Since the data come in two blocks, we have to access them separately:

sigma_mean_1 = df_q1.sigma_1.mean()
sigma_mean_2 = df_q1.sigma_2.mean()

print("Mean 1: {0:4.3f}".format(sigma_mean_1))
print("Mean 2: {0:4.3f}".format(sigma_mean_2))

We could do something hacky to combine the two:

sigma_mean_both = (12 * sigma_mean_1 + 11 * sigma_mean_2) / (12 + 11)
print("Mean both: {0:4.3f}".format(sigma_mean_both))

But it would be far easier if we could just combine all the relevant columns so they’re not in “blocks.” This is what pivoting a dataset allows us to do:

## NOTE: No need to edit; you'll learn this in the evening's notebook
df_long = (
    df_q1
    >> gr.tf_pivot_longer(
        columns=["obs_1", "area_1", "sigma_1", "obs_2", "area_2", "sigma_2"],
        names_to=[".value", "block"],
        names_sep="_",
    )
    >> gr.tf_arrange(DF.obs)
)
df_long
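
With the data in long form, the pooled mean we assembled by hand above becomes a one-liner (assuming the pivot produced a single sigma column, as shown in the printout); note that .mean() skips NaN entries automatically, so this should agree with the hand-weighted value up to how the NaN rows are counted.

## NOTE: Illustrative check on the long-form data
sigma_mean_long = df_long.sigma.mean()
print("Mean (long form): {0:4.3f}".format(sigma_mean_long))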

Pivoting is one of the key tools we need to tidy our data.

Tidy Data

Tidy data is a very simple—but very powerful—idea. The image below gives the definition of tidy data.

In tidy data: each variable forms a column, each observation forms a row, each cell is a single measurement.

Artwork by Allison Horst, generously released under an open-source license!

The big payoff of making your data tidy is that you don’t have to build special-purpose tools to work with your data. Since every untidy dataset is untidy in its own unique (hideous) way, you would have to build special-purpose tools for every untidy dataset. By having one standard form of data, you can use the same tools for every dataset.

Comical cartoon comparing tools for tidy data and tools for untidy data.

Let’s apply these ideas to the data we extracted using Tabula.

Q3: Inspect the original form of the data. Why are these data not tidy? How would you make the data tidy?

## NO NEED TO EDIT; run and inspect
df_tabula.head()
  • (Your response here)

Data Wrangling

Even once our data are tidy, they still might not be as usable as we’d like. Another, less well-defined aspect of making data usable is wrangling the data into a more useful form. We’ll cover a few issues we often need to wrangle our data to fix: unit conversion, filtering invalid values, and converting data types.

Unit conversion

Weibull reports the \(\sigma_d\) values in kg / mm^2; if we interpret kg as a kilogram (mass), then these can’t be stress values! (Stress has units of force per area.) However, suppose for a moment that he were using kg to denote a kilogram-force, where \(1\,\text{kgf} = 1\,\text{kg} \times 9.8\,\text{m/s}^2 = 9.8\,\text{N}\).

Weibull gives a summary value for the same stress in the more interpretable units g / (cm s^2). Let’s check this hypothesis by comparing the proposed unit conversion with our data:

# (540 x 10^5 g/(cm s^2)) / (980 cm/s^2) * (kg / 1000 g) * (cm^2 / 100 mm^2)
print("Calculated sigma: {0:4.3f}".format(540e5 / 980 / 1000 / 100))
df_long.head()

This is very near the sigma values we have in our dataset, which lends a great deal of credibility to our interpretation of kg as kgf. With this determined, we can make a unit conversion to more standard units.

\[1\,\text{kgf}/\text{mm}^2 = \frac{9.8\,\text{N}}{10^{-6}\,\text{m}^2} = 9.8\,\text{MPa}\]

Q4: Convert the units to MPa.

Replace the factor of 1.0 below to convert the units to MPa.

###
# TASK: Replace the 1.0 factor with the correct conversion factor
###


df_q4 = (
    df_long
    >> gr.tf_mutate(sigma_MPa=DF.sigma * 1.0) # Replace 1.0 with correct factor

)

df_q4.head()

Filtering invalid values

If we take a look at the end of the dataset, we see some strange NaN entries:

## NOTE: No need to edit
df_q4.tail()

The value NaN is a special value that denotes Not a Number. Essentially, this is the way we represent something invalid. NaN can represent a missing value, a failed type conversion, or any of a variety of ways something can fail to be a number.
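
As a small illustration (a toy example, separate from our dataset), a failed numeric conversion is one common way a NaN appears:

## NOTE: Illustrative sketch only
s = pd.to_numeric(pd.Series(["1.5", "2.0", "oops"]), errors="coerce")
print(s)  # the last entry becomes NaN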

Often, the way we deal with NaN values is simply to remove the offending row from our dataset. We can do that with a filter operation:

## NOTE: No need to edit
df_filtered = (
    df_q4
    >> gr.tf_filter(gr.not_nan(DF.sigma_MPa))
)
df_filtered

Converting types

Lastly, it’s worth noting that all of our columns have a data type associated with them. We can inspect those types with the dtypes attribute.

## NOTE: No need to edit
df_q4.dtypes

The float64 type represents a decimal (floating-point) value. We’d expect the area and stress values to be decimals, but what’s up with block and obs? We’d expect those to be integers. If the types of your columns are not what you’d expect, you can do a type conversion to attempt converting from one data type to the desired type.

## NOTE: No need to edit
df_converted = (
    df_filtered
    >> gr.tf_mutate(
        block=gr.as_int(DF.block),
        obs=gr.as_int(DF.obs),
    )
)
df_converted.dtypes
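
For reference, the same cast can be done in plain Pandas with astype; the sketch below shows the equivalent operation and is not part of Grama’s pipeline. Note that astype(int) would fail if any NaN values remained, which is another reason we filtered first.

## NOTE: Illustrative sketch only; the grama version above is what we'll use
df_pandas_cast = df_filtered.assign(
    block=df_filtered["block"].astype(int),
    obs=df_filtered["obs"].astype(int),
)
print(df_pandas_cast.dtypes)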

We’ll see more about casting in the evening notebook. For now, we have a tidy and wrangled dataset!

## NOTE: No need to edit
df_converted

Bonus: The Payoff of Tidying and Wrangling

As a preview of why we want to learn how to tidy and wrangle data, let’s see what some of these tidy-data tools allow us to do. Using ggplot (provided through Grama by the plotnine package), we can very quickly construct simple graphs:

## NOTE: No need to edit
(
    df_converted
    >> gr.ggplot(gr.aes("area", "sigma_MPa"))
    + gr.geom_point()
)

Part of the power of plotnine (over other graphing software) is that we can very easily tweak our graphs to show additional information. For instance, if we also wanted to see if the samples in the two blocks were at all different, we could easily color the points.

## NOTE: No need to edit
(
    df_converted
    >> gr.ggplot(gr.aes("area", "sigma_MPa"))
#     + gr.geom_point() # Original
    + gr.geom_point(gr.aes(color="block")) # Modified for color
)

We’ll learn more about ggplot and visualization in tomorrow’s workshop exercises.

Survey


Once you complete this activity, please fill out the following 30-second survey: