{
"cells": [
{
"cell_type": "markdown",
"id": "e83d474c",
"metadata": {
"tags": []
},
"source": [
"# Data: Pivoting Data\n",
"\n",
"*Purpose*: Data is easiest to use when it is *tidy*. In fact, grama is specifically designed to use tidy data. But not all data we'll encounter is tidy! To that end, in this exercise we'll learn how to tidy our data by *pivoting*.\n",
"\n",
"As a result of learning how to quickly *tidy* data, you'll vastly expand the set of datasets you can analyze. Rather than fighting with data, you'll be able to quickly wrangle and extract insights.\n"
]
},
{
"cell_type": "markdown",
"id": "4f895c9a",
"metadata": {},
"source": [
"## Setup\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "a2de1b4e",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import grama as gr\n",
"DF = gr.Intention()\n",
"%matplotlib inline\n",
"\n",
"# For assertion\n",
"from pandas.api.types import is_integer_dtype"
]
},
{
"cell_type": "markdown",
"id": "6e3e55d6",
"metadata": {},
"source": [
"# Tidy Data\n",
"\n",
"*Tidy Data* is a very simple---but *very powerful*---concept for structuring a dataset. "
]
},
{
"cell_type": "markdown",
"id": "327c425e",
"metadata": {},
"source": [
"\n",
"Artwork by @allison_horst\n"
]
},
{
"cell_type": "markdown",
"id": "1e9dcb2f",
"metadata": {},
"source": [
"## Tidy Data: Definition\n",
"\n",
"A tidy dataset has three properties:\n",
"\n",
"- each variable forms a column\n",
"- each observation forms a row\n",
"- each cell is a single measurement\n"
]
},
{
"cell_type": "markdown",
"id": "2096d57f",
"metadata": {},
"source": [
"As an example, the following dataset is tidy:\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "083e991a",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" thick | \n",
" alloy | \n",
" E | \n",
" mu | \n",
" ang | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.022 | \n",
" al_24st | \n",
" 10600 | \n",
" 0.321 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.022 | \n",
" al_24st | \n",
" 10600 | \n",
" 0.323 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.032 | \n",
" al_24st | \n",
" 10400 | \n",
" 0.329 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.032 | \n",
" al_24st | \n",
" 10300 | \n",
" 0.319 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.064 | \n",
" al_24st | \n",
" 10500 | \n",
" 0.323 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" thick alloy E mu ang\n",
"0 0.022 al_24st 10600 0.321 0\n",
"1 0.022 al_24st 10600 0.323 0\n",
"2 0.032 al_24st 10400 0.329 0\n",
"3 0.032 al_24st 10300 0.319 0\n",
"4 0.064 al_24st 10500 0.323 0"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from grama.data import df_stang\n",
"df_stang.head()\n"
]
},
{
"cell_type": "markdown",
"id": "ad920c9e",
"metadata": {},
"source": [
"The observations are all measured material properties taken at a particular angle (with respect to the direction in which the specimens were rolled). Each column reports values for just one variable, each row corresponds to a single observation, and every cell reports just one measurement.\n",
"\n",
"However, the following form of the same dataset is *not* tidy:\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "8d9430ea",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" thick | \n",
" E_00 | \n",
" mu_00 | \n",
" E_45 | \n",
" mu_45 | \n",
" E_90 | \n",
" mu_90 | \n",
" alloy | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.022 | \n",
" 10600 | \n",
" 0.321 | \n",
" 10700 | \n",
" 0.329 | \n",
" 10500 | \n",
" 0.310 | \n",
" al_24st | \n",
"
\n",
" \n",
" 1 | \n",
" 0.022 | \n",
" 10600 | \n",
" 0.323 | \n",
" 10500 | \n",
" 0.331 | \n",
" 10700 | \n",
" 0.323 | \n",
" al_24st | \n",
"
\n",
" \n",
" 2 | \n",
" 0.032 | \n",
" 10400 | \n",
" 0.329 | \n",
" 10400 | \n",
" 0.318 | \n",
" 10300 | \n",
" 0.322 | \n",
" al_24st | \n",
"
\n",
" \n",
" 3 | \n",
" 0.032 | \n",
" 10300 | \n",
" 0.319 | \n",
" 10500 | \n",
" 0.326 | \n",
" 10400 | \n",
" 0.330 | \n",
" al_24st | \n",
"
\n",
" \n",
" 4 | \n",
" 0.064 | \n",
" 10500 | \n",
" 0.323 | \n",
" 10400 | \n",
" 0.331 | \n",
" 10400 | \n",
" 0.327 | \n",
" al_24st | \n",
"
\n",
" \n",
" 5 | \n",
" 0.064 | \n",
" 10700 | \n",
" 0.328 | \n",
" 10500 | \n",
" 0.328 | \n",
" 10500 | \n",
" 0.320 | \n",
" al_24st | \n",
"
\n",
" \n",
" 6 | \n",
" 0.081 | \n",
" 10000 | \n",
" 0.315 | \n",
" 10000 | \n",
" 0.320 | \n",
" 9900 | \n",
" 0.314 | \n",
" al_24st | \n",
"
\n",
" \n",
" 7 | \n",
" 0.081 | \n",
" 10100 | \n",
" 0.312 | \n",
" 9900 | \n",
" 0.312 | \n",
" 10000 | \n",
" 0.316 | \n",
" al_24st | \n",
"
\n",
" \n",
" 8 | \n",
" 0.081 | \n",
" 10000 | \n",
" 0.311 | \n",
" -1 | \n",
" -1.000 | \n",
" 9900 | \n",
" 0.314 | \n",
" al_24st | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" thick E_00 mu_00 E_45 mu_45 E_90 mu_90 alloy\n",
"0 0.022 10600 0.321 10700 0.329 10500 0.310 al_24st\n",
"1 0.022 10600 0.323 10500 0.331 10700 0.323 al_24st\n",
"2 0.032 10400 0.329 10400 0.318 10300 0.322 al_24st\n",
"3 0.032 10300 0.319 10500 0.326 10400 0.330 al_24st\n",
"4 0.064 10500 0.323 10400 0.331 10400 0.327 al_24st\n",
"5 0.064 10700 0.328 10500 0.328 10500 0.320 al_24st\n",
"6 0.081 10000 0.315 10000 0.320 9900 0.314 al_24st\n",
"7 0.081 10100 0.312 9900 0.312 10000 0.316 al_24st\n",
"8 0.081 10000 0.311 -1 -1.000 9900 0.314 al_24st"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from grama.data import df_stang_wide\n",
"df_stang_wide"
]
},
{
"cell_type": "markdown",
"id": "a827b07d",
"metadata": {},
"source": [
"This dataset is *not* tidy: The angle of each measurement `00, 45, 90` is a variable, but these numerical values are expressed as column names. Put differently, some of the values are not in cells, but rather in the column names.\n"
]
},
{
"cell_type": "markdown",
"id": "6224b58e",
"metadata": {},
"source": [
"## Why tidy data?\n",
"\n",
"Tidy data makes analysis *easier*. Putting our data in tidy form means we can use a *consistent* set of tools to work with *any* dataset.\n"
]
},
{
"cell_type": "markdown",
"id": "db8edfa0",
"metadata": {},
"source": [
"\n",
"Artwork by @allison_horst\n"
]
},
{
"cell_type": "markdown",
"id": "75529e7f",
"metadata": {},
"source": [
"Note that untidy data is not *bad* data; untidy data are simply harder to work with when doing data analysis. Data often come in untidy form when they are reported, say in a paper or a presentation. For instance, the wide form of the Stang et al. dataset can easily fit on one page:\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "d7954242",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" thick | \n",
" E_00 | \n",
" mu_00 | \n",
" E_45 | \n",
" mu_45 | \n",
" E_90 | \n",
" mu_90 | \n",
" alloy | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.022 | \n",
" 10600 | \n",
" 0.321 | \n",
" 10700 | \n",
" 0.329 | \n",
" 10500 | \n",
" 0.310 | \n",
" al_24st | \n",
"
\n",
" \n",
" 1 | \n",
" 0.022 | \n",
" 10600 | \n",
" 0.323 | \n",
" 10500 | \n",
" 0.331 | \n",
" 10700 | \n",
" 0.323 | \n",
" al_24st | \n",
"
\n",
" \n",
" 2 | \n",
" 0.032 | \n",
" 10400 | \n",
" 0.329 | \n",
" 10400 | \n",
" 0.318 | \n",
" 10300 | \n",
" 0.322 | \n",
" al_24st | \n",
"
\n",
" \n",
" 3 | \n",
" 0.032 | \n",
" 10300 | \n",
" 0.319 | \n",
" 10500 | \n",
" 0.326 | \n",
" 10400 | \n",
" 0.330 | \n",
" al_24st | \n",
"
\n",
" \n",
" 4 | \n",
" 0.064 | \n",
" 10500 | \n",
" 0.323 | \n",
" 10400 | \n",
" 0.331 | \n",
" 10400 | \n",
" 0.327 | \n",
" al_24st | \n",
"
\n",
" \n",
" 5 | \n",
" 0.064 | \n",
" 10700 | \n",
" 0.328 | \n",
" 10500 | \n",
" 0.328 | \n",
" 10500 | \n",
" 0.320 | \n",
" al_24st | \n",
"
\n",
" \n",
" 6 | \n",
" 0.081 | \n",
" 10000 | \n",
" 0.315 | \n",
" 10000 | \n",
" 0.320 | \n",
" 9900 | \n",
" 0.314 | \n",
" al_24st | \n",
"
\n",
" \n",
" 7 | \n",
" 0.081 | \n",
" 10100 | \n",
" 0.312 | \n",
" 9900 | \n",
" 0.312 | \n",
" 10000 | \n",
" 0.316 | \n",
" al_24st | \n",
"
\n",
" \n",
" 8 | \n",
" 0.081 | \n",
" 10000 | \n",
" 0.311 | \n",
" -1 | \n",
" -1.000 | \n",
" 9900 | \n",
" 0.314 | \n",
" al_24st | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" thick E_00 mu_00 E_45 mu_45 E_90 mu_90 alloy\n",
"0 0.022 10600 0.321 10700 0.329 10500 0.310 al_24st\n",
"1 0.022 10600 0.323 10500 0.331 10700 0.323 al_24st\n",
"2 0.032 10400 0.329 10400 0.318 10300 0.322 al_24st\n",
"3 0.032 10300 0.319 10500 0.326 10400 0.330 al_24st\n",
"4 0.064 10500 0.323 10400 0.331 10400 0.327 al_24st\n",
"5 0.064 10700 0.328 10500 0.328 10500 0.320 al_24st\n",
"6 0.081 10000 0.315 10000 0.320 9900 0.314 al_24st\n",
"7 0.081 10100 0.312 9900 0.312 10000 0.316 al_24st\n",
"8 0.081 10000 0.311 -1 -1.000 9900 0.314 al_24st"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_stang_wide"
]
},
{
"cell_type": "markdown",
"id": "a8b7ebd0",
"metadata": {},
"source": [
"However, the tidy form of the same dataset is far less compact:\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "dc13f694",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" thick | \n",
" alloy | \n",
" E | \n",
" mu | \n",
" ang | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.022 | \n",
" al_24st | \n",
" 10600 | \n",
" 0.321 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.022 | \n",
" al_24st | \n",
" 10600 | \n",
" 0.323 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.032 | \n",
" al_24st | \n",
" 10400 | \n",
" 0.329 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.032 | \n",
" al_24st | \n",
" 10300 | \n",
" 0.319 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.064 | \n",
" al_24st | \n",
" 10500 | \n",
" 0.323 | \n",
" 0 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 71 | \n",
" 0.064 | \n",
" al_24st | \n",
" 10400 | \n",
" 0.327 | \n",
" 90 | \n",
"
\n",
" \n",
" 72 | \n",
" 0.064 | \n",
" al_24st | \n",
" 10500 | \n",
" 0.320 | \n",
" 90 | \n",
"
\n",
" \n",
" 73 | \n",
" 0.081 | \n",
" al_24st | \n",
" 9900 | \n",
" 0.314 | \n",
" 90 | \n",
"
\n",
" \n",
" 74 | \n",
" 0.081 | \n",
" al_24st | \n",
" 10000 | \n",
" 0.316 | \n",
" 90 | \n",
"
\n",
" \n",
" 75 | \n",
" 0.081 | \n",
" al_24st | \n",
" 9900 | \n",
" 0.314 | \n",
" 90 | \n",
"
\n",
" \n",
"
\n",
"
76 rows × 5 columns
\n",
"
"
],
"text/plain": [
" thick alloy E mu ang\n",
"0 0.022 al_24st 10600 0.321 0\n",
"1 0.022 al_24st 10600 0.323 0\n",
"2 0.032 al_24st 10400 0.329 0\n",
"3 0.032 al_24st 10300 0.319 0\n",
"4 0.064 al_24st 10500 0.323 0\n",
".. ... ... ... ... ...\n",
"71 0.064 al_24st 10400 0.327 90\n",
"72 0.064 al_24st 10500 0.320 90\n",
"73 0.081 al_24st 9900 0.314 90\n",
"74 0.081 al_24st 10000 0.316 90\n",
"75 0.081 al_24st 9900 0.314 90\n",
"\n",
"[76 rows x 5 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_stang"
]
},
{
"cell_type": "markdown",
"id": "fac124db",
"metadata": {},
"source": [
"## Exercises\n",
"\n",
"Let's practice identifying tidy and untidy data.\n"
]
},
{
"cell_type": "markdown",
"id": "a6cc04c9",
"metadata": {},
"source": [
"### __q1__ Identify\n",
"\n",
"Inspect the following dataset; answer the questions under *observations* below.\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f9a3fe5b",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" year2011 | \n",
" year2012 | \n",
" year2013 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" FR | \n",
" 7000 | \n",
" 6900 | \n",
" 7000 | \n",
"
\n",
" \n",
" 1 | \n",
" DE | \n",
" 5800 | \n",
" 6000 | \n",
" 6200 | \n",
"
\n",
" \n",
" 2 | \n",
" US | \n",
" 15000 | \n",
" 14000 | \n",
" 13000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country year2011 year2012 year2013\n",
"0 FR 7000 6900 7000\n",
"1 DE 5800 6000 6200\n",
"2 US 15000 14000 13000"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## TASK: No need to edit; run and inspect\n",
"df_cases = gr.df_make(\n",
" country=[\"FR\", \"DE\", \"US\"],\n",
" year2011=[7000, 5800, 15000],\n",
" year2012=[6900, 6000, 14000],\n",
" year2013=[7000, 6200, 13000],\n",
")\n",
"df_cases\n"
]
},
{
"cell_type": "markdown",
"id": "e83ee3f0",
"metadata": {},
"source": [
"*Observations*\n",
"\n",
"- What are the *variables* in this dataset?\n",
" - Country, year, and some unknown quantity (n, count, etc.)\n",
"- Is this dataset *tidy*? Why or why not?\n",
" - No; the year values are in the column names.\n",
""
]
},
{
"cell_type": "markdown",
"id": "f75cd1f4",
"metadata": {},
"source": [
"### __q2__ Identify\n",
"\n",
"Inspect the following dataset; answer the questions under *observations* below.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "e8145fbc",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" thick | \n",
" E_00 | \n",
" mu_00 | \n",
" E_45 | \n",
" mu_45 | \n",
" rep | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.022 | \n",
" 10600 | \n",
" 0.321 | \n",
" 10700 | \n",
" 0.329 | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.022 | \n",
" 10600 | \n",
" 0.323 | \n",
" 10500 | \n",
" 0.331 | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.032 | \n",
" 10400 | \n",
" 0.329 | \n",
" 10400 | \n",
" 0.318 | \n",
" 1 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.032 | \n",
" 10300 | \n",
" 0.319 | \n",
" 10500 | \n",
" 0.326 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" thick E_00 mu_00 E_45 mu_45 rep\n",
"0 0.022 10600 0.321 10700 0.329 1\n",
"1 0.022 10600 0.323 10500 0.331 2\n",
"2 0.032 10400 0.329 10400 0.318 1\n",
"3 0.032 10300 0.319 10500 0.326 2"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## TASK: No need to edit; run and inspect\n",
"df_alloys1 = gr.df_make(\n",
" thick=[0.022, 0.022, 0.032, 0.032],\n",
" E_00=[10600, 10600, 10400, 10300],\n",
" mu_00=[0.321, 0.323, 0.329, 0.319],\n",
" E_45=[10700, 10500, 10400, 10500],\n",
" mu_45=[0.329, 0.331, 0.318, 0.326],\n",
" rep=[1, 2, 1, 2],\n",
")\n",
"df_alloys1"
]
},
{
"cell_type": "markdown",
"id": "56013aad",
"metadata": {},
"source": [
"*Observations*\n",
"\n",
"- What are the *variables* in this dataset?\n",
" - Thickness `thick`, elasticity `E`, poisson's ratio `mu`, angle (in column names), replication `rep`\n",
"- Is this dataset *tidy*? Why or why not?\n",
" - No; the angle values are in the column names.\n",
""
]
},
{
"cell_type": "markdown",
"id": "dc834b12",
"metadata": {},
"source": [
"### __q3__ Identify\n",
"\n",
"Inspect the following dataset; answer the questions under *observations* below.\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "10700eb4",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" thick | \n",
" var | \n",
" value | \n",
" rep | \n",
" angle | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.022 | \n",
" E | \n",
" 10700.000 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.022 | \n",
" mu | \n",
" 0.321 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.032 | \n",
" E | \n",
" 10500.000 | \n",
" 2 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.032 | \n",
" mu | \n",
" 0.323 | \n",
" 2 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" thick var value rep angle\n",
"0 0.022 E 10700.000 1 0\n",
"1 0.022 mu 0.321 1 0\n",
"2 0.032 E 10500.000 2 0\n",
"3 0.032 mu 0.323 2 0"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## TASK: No need to edit; run and inspect\n",
"df_alloys2 = gr.df_make(\n",
" thick=[0.022, 0.022, 0.032, 0.032],\n",
" var=[\"E\", \"mu\", \"E\", \"mu\"],\n",
" value=[10700, 0.321, 10500, 0.323],\n",
" rep=[1, 1, 2, 2],\n",
" angle=[0, 0, 0, 0],\n",
")\n",
"df_alloys2"
]
},
{
"cell_type": "markdown",
"id": "551295cc",
"metadata": {},
"source": [
"*Observations*\n",
"\n",
"- What are the *variables* in this dataset?\n",
" - Thickness `thick`, elasticity `E`, poisson's ratio `mu`, `angle`, replication `rep`\n",
"- Is this dataset *tidy*? Why or why not?\n",
" - No; the column `value` contains values of two *different* variables `E` and `mu`.\n",
""
]
},
{
"cell_type": "markdown",
"id": "2be971b3",
"metadata": {},
"source": [
"# Pivoting Data\n",
"\n",
"The untidy datasets above fail to be tidy because they have the wrong *shape*; we can tidy these datasets by *pivoting* the data. There are just two pivots we need to learn about: `tf_pivot_longer()` and `tf_pivot_wider()`.\n"
]
},
{
"cell_type": "markdown",
"id": "09b93698",
"metadata": {},
"source": [
"## Pivot Longer\n",
"\n",
"Let's take another look at the `df_cases` example dataset:\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "12df95e8",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" year2011 | \n",
" year2012 | \n",
" year2013 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" FR | \n",
" 7000 | \n",
" 6900 | \n",
" 7000 | \n",
"
\n",
" \n",
" 1 | \n",
" DE | \n",
" 5800 | \n",
" 6000 | \n",
" 6200 | \n",
"
\n",
" \n",
" 2 | \n",
" US | \n",
" 15000 | \n",
" 14000 | \n",
" 13000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country year2011 year2012 year2013\n",
"0 FR 7000 6900 7000\n",
"1 DE 5800 6000 6200\n",
"2 US 15000 14000 13000"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## NOTE: No need to edit\n",
"df_cases"
]
},
{
"cell_type": "markdown",
"id": "b8cd68e5",
"metadata": {},
"source": [
"This dataset is *too wide*; the column names `year2011,year2012,year2013` should really be numbers inside a single `year` column. We can *pivot longer* to move from a wide structure to a longer one:\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "c0b87da0",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" year | \n",
" count | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" FR | \n",
" year2011 | \n",
" 7000 | \n",
"
\n",
" \n",
" 1 | \n",
" DE | \n",
" year2011 | \n",
" 5800 | \n",
"
\n",
" \n",
" 2 | \n",
" US | \n",
" year2011 | \n",
" 15000 | \n",
"
\n",
" \n",
" 3 | \n",
" FR | \n",
" year2012 | \n",
" 6900 | \n",
"
\n",
" \n",
" 4 | \n",
" DE | \n",
" year2012 | \n",
" 6000 | \n",
"
\n",
" \n",
" 5 | \n",
" US | \n",
" year2012 | \n",
" 14000 | \n",
"
\n",
" \n",
" 6 | \n",
" FR | \n",
" year2013 | \n",
" 7000 | \n",
"
\n",
" \n",
" 7 | \n",
" DE | \n",
" year2013 | \n",
" 6200 | \n",
"
\n",
" \n",
" 8 | \n",
" US | \n",
" year2013 | \n",
" 13000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country year count\n",
"0 FR year2011 7000\n",
"1 DE year2011 5800\n",
"2 US year2011 15000\n",
"3 FR year2012 6900\n",
"4 DE year2012 6000\n",
"5 US year2012 14000\n",
"6 FR year2013 7000\n",
"7 DE year2013 6200\n",
"8 US year2013 13000"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## NOTE: No need to edit\n",
"(\n",
" df_cases\n",
" >> gr.tf_pivot_longer(\n",
" columns=[\"year2011\", \"year2012\", \"year2013\"],\n",
" names_to=\"year\",\n",
" values_to=\"count\",\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"id": "9dcfffe3",
"metadata": {},
"source": [
"Note what we had to provide as arguments to `gr.tf_pivot_longer()`:\n",
"\n",
"- `columns` specifies the columns to involve in the pivoting\n",
"- `names_to` specifies what new column will contain the previous column names\n",
"- `value_to` specifies what new column will contain the previous cell values\n",
"\n",
"Often, a pivot alone is not enough to fully clean a dataset; for instance, we would still need to remove the `year` string from each cell entry. However, we can do this much more easily once the data are pivoted into a longer format:\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "08768fbb",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" country | \n",
" year | \n",
" count | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" FR | \n",
" 2011 | \n",
" 7000 | \n",
"
\n",
" \n",
" 1 | \n",
" DE | \n",
" 2011 | \n",
" 5800 | \n",
"
\n",
" \n",
" 2 | \n",
" US | \n",
" 2011 | \n",
" 15000 | \n",
"
\n",
" \n",
" 3 | \n",
" FR | \n",
" 2012 | \n",
" 6900 | \n",
"
\n",
" \n",
" 4 | \n",
" DE | \n",
" 2012 | \n",
" 6000 | \n",
"
\n",
" \n",
" 5 | \n",
" US | \n",
" 2012 | \n",
" 14000 | \n",
"
\n",
" \n",
" 6 | \n",
" FR | \n",
" 2013 | \n",
" 7000 | \n",
"
\n",
" \n",
" 7 | \n",
" DE | \n",
" 2013 | \n",
" 6200 | \n",
"
\n",
" \n",
" 8 | \n",
" US | \n",
" 2013 | \n",
" 13000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" country year count\n",
"0 FR 2011 7000\n",
"1 DE 2011 5800\n",
"2 US 2011 15000\n",
"3 FR 2012 6900\n",
"4 DE 2012 6000\n",
"5 US 2012 14000\n",
"6 FR 2013 7000\n",
"7 DE 2013 6200\n",
"8 US 2013 13000"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## NOTE: No need to edit\n",
"(\n",
" df_cases\n",
" >> gr.tf_pivot_longer(\n",
" columns=[\"year2011\", \"year2012\", \"year2013\"],\n",
" names_to=\"year\",\n",
" values_to=\"count\",\n",
" )\n",
" >> gr.tf_mutate(year=gr.str_replace(DF.year, \"year\", \"\"))\n",
")"
]
},
{
"cell_type": "markdown",
"id": "4e1c0d74",
"metadata": {},
"source": [
"### __q4__ Pivot the data\n",
"\n",
"Pivot the dataset `df_alloys1` to place each of `E_00, mu_00, E_45, mu_45` as values in a new `variable` column, and move the original values to a new `value` column. Answer the questions under *observations* below.\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "8eeffa51",
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" thick rep variable value\n",
"0 0.022 1 E_00 10600.000\n",
"1 0.022 2 E_00 10600.000\n",
"2 0.032 1 E_00 10400.000\n",
"3 0.032 2 E_00 10300.000\n",
"4 0.022 1 mu_00 0.321\n",
"5 0.022 2 mu_00 0.323\n",
"6 0.032 1 mu_00 0.329\n",
"7 0.032 2 mu_00 0.319\n",
"8 0.022 1 E_45 10700.000\n",
"9 0.022 2 E_45 10500.000\n",
"10 0.032 1 E_45 10400.000\n",
"11 0.032 2 E_45 10500.000\n",
"12 0.022 1 mu_45 0.329\n",
"13 0.022 2 mu_45 0.331\n",
"14 0.032 1 mu_45 0.318\n",
"15 0.032 2 mu_45 0.326\n"
]
}
],
"source": [
"## TASK: Tidy the data, make sure to provide `variable` and `value` columns\n",
"df_res1 = (\n",
" df_alloys1\n",
"\n",
" >> gr.tf_pivot_longer(\n",
" columns=[\"E_00\", \"mu_00\", \"E_45\", \"mu_45\"],\n",
" names_to=\"variable\",\n",
" values_to=\"value\",\n",
" )\n",
")\n",
"\n",
"## NOTE: Use this to check your work\n",
"print(df_res1)\n",
"\n",
"assert \\\n",
" {\"thick\", \"rep\", \"variable\", \"value\"} == set(df_res1.columns), \\\n",
" 'columns of df_res1 are not [\"thick\", \"rep\", \"variable\", \"value\"]'\n",
"\n",
"assert \\\n",
" {\"E_00\", \"mu_00\", \"E_45\", \"mu_45\"} == set(df_res1.variable), \\\n",
" 'Entries in df_res1.variable are incorrect'\n"
]
},
{
"cell_type": "markdown",
"id": "55fd504c",
"metadata": {},
"source": [
"*Observations*\n",
"\n",
"- Is this dataset tidy? Why or why not?\n",
" - This dataset is not tidy; the entries in the `value` column include values of `E` and `mu` (two different variables). Also, the `variable` column contains both variable names and angle values.\n",
""
]
},
{
"cell_type": "markdown",
"id": "325d594c",
"metadata": {},
"source": [
"### Using selection helpers\n",
"\n",
"One useful feature of `gr.tf_pivot_longer()` is that we can use [selection helpers](https://zdelrosario.github.io/evc-course/exercises_solution/d02-e-data01-isolate-solution.html#selection-helpers) to choose which columns to use in a pivot. This can save us a lot of typing, and allows us to write code that can react to multiple different datasets. Let's practice using selection helpers by re-doing the previous task.\n"
]
},
{
"cell_type": "markdown",
"id": "4a61a47b",
"metadata": {},
"source": [
"### __q5__ Simplify with selection helpers\n",
"\n",
"Use a single [selection helper](https://zdelrosario.github.io/evc-course/exercises_solution/d02-e-data01-isolate-solution.html#selection-helpers) to simplify the `columns` argument to `gr.tf_pivot_longer()`.\n",
"\n",
"*Hint*: There are multiple ways to do this: The string `\"E|mu\"` will `match` the strings `\"E\"` or `\"mu\"`. The string `\"\\\\d+\"` will `match` *any string of consecutive digits*.\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "598055cf",
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" thick rep variable value\n",
"0 0.022 1 E_00 10600.000\n",
"1 0.022 2 E_00 10600.000\n",
"2 0.032 1 E_00 10400.000\n",
"3 0.032 2 E_00 10300.000\n",
"4 0.022 1 mu_00 0.321\n",
"5 0.022 2 mu_00 0.323\n",
"6 0.032 1 mu_00 0.329\n",
"7 0.032 2 mu_00 0.319\n",
"8 0.022 1 E_45 10700.000\n",
"9 0.022 2 E_45 10500.000\n",
"10 0.032 1 E_45 10400.000\n",
"11 0.032 2 E_45 10500.000\n",
"12 0.022 1 mu_45 0.329\n",
"13 0.022 2 mu_45 0.331\n",
"14 0.032 1 mu_45 0.318\n",
"15 0.032 2 mu_45 0.326\n"
]
}
],
"source": [
"## TASK: Replace the `columns` argument with a selection helper\n",
"df_simplified = (\n",
" df_alloys1\n",
" >> gr.tf_pivot_longer(\n",
"\n",
" columns=gr.matches(\"E|mu\"), \n",
" names_to=\"variable\",\n",
" values_to=\"value\",\n",
" )\n",
")\n",
"\n",
"## NOTE: Use this to check your work\n",
"print(df_simplified)\n",
"\n",
"assert \\\n",
" {\"thick\", \"rep\", \"variable\", \"value\"} == set(df_simplified.columns), \\\n",
" 'columns of df_res1 are not [\"thick\", \"rep\", \"variable\", \"value\"]'\n",
"\n",
"assert \\\n",
" {\"E_00\", \"mu_00\", \"E_45\", \"mu_45\"} == set(df_simplified.variable), \\\n",
" 'Entries in df_res1.variable are incorrect'\n"
]
},
{
"cell_type": "markdown",
"id": "9e20c94e",
"metadata": {},
"source": [
"## Pivot Wider\n",
"\n",
"Just as a dataset can be \"too wide\", it can also be \"too long.\" In this case, we can use `gr.tf_pivot_wider()`. Let's look at an illustrative example:\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "3a9dc621",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" variable | \n",
" value | \n",
" observation | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" x | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" y | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" f | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" x | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" 4 | \n",
" y | \n",
" 1 | \n",
" 1 | \n",
"
\n",
" \n",
" 5 | \n",
" f | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" variable value observation\n",
"0 x 0 0\n",
"1 y 0 0\n",
"2 f 1 0\n",
"3 x 0 1\n",
"4 y 1 1\n",
"5 f 0 1"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## NOTE: No need to edit\n",
"df_long = gr.df_make(\n",
" variable=[\"x\", \"y\", \"f\", \"x\", \"y\", \"f\"],\n",
" value=[0, 0, 1, 0, 1, 0],\n",
" observation=[0, 0, 0, 1, 1, 1],\n",
")\n",
"df_long\n"
]
},
{
"cell_type": "markdown",
"id": "55be108f",
"metadata": {},
"source": [
"Here, the `value` column has mixed variables; values of `x`, `y`, and `f` comingle in the `value` column. Really we should have the *columns* `x`, `y`, and `f`: In this sense the data is *too long*. Let's pivot wider to tidy the data.\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "b919286c",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" observation | \n",
" f | \n",
" x | \n",
" y | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" observation f x y\n",
"0 0 1 0 0\n",
"1 1 0 0 1"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## NOTE: No need to edit\n",
"(\n",
" df_long\n",
" >> gr.tf_pivot_wider(\n",
" names_from=\"variable\",\n",
" values_from=\"value\",\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"id": "e747affc",
"metadata": {},
"source": [
"Note that we had to provide just two inputs to `gr.tf_pivot_wider()`:\n",
"\n",
"- `names_from` specifies which column will provide the new column names\n",
"- `value_from` specifies which column will provide the new column values\n",
"\n",
"Let's get some practice!\n"
]
},
{
"cell_type": "markdown",
"id": "ad0074c9",
"metadata": {},
"source": [
"### __q6__ Pivot the data\n",
"\n",
"Pivot the dataset `df_alloys2` to provide `E` and `mu` as columns in `df_res2`. Answer the questions under *observations* below.\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "54359a2f",
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" thick rep angle E mu\n",
"0 0.022 1 0 10700.0 0.321\n",
"1 0.032 2 0 10500.0 0.323\n"
]
}
],
"source": [
"## TASK: Pivot the dataset wider\n",
"df_res2 = (\n",
" df_alloys2\n",
" >> gr.tf_pivot_wider(\n",
" names_from=\"var\",\n",
" values_from=\"value\",\n",
" )\n",
")\n",
"\n",
"## NOTE: Use the following to check your work\n",
"print(df_res2)\n",
"\n",
"assert \\\n",
" {\"thick\", \"rep\", \"angle\", \"E\", \"mu\"} == set(df_res2.columns), \\\n",
" 'df_res2 does not have the columns [\"thick\", \"rep\", \"angle\", \"E\", \"mu\"]'\n"
]
},
{
"cell_type": "markdown",
"id": "30685938",
"metadata": {},
"source": [
"*Observations*\n",
"\n",
"- Is this dataset tidy? Why or why not?\n",
" - This dataset is tidy! Every column is one variable, every row is one observation, and every cell contains one value.\n",
""
]
},
{
"cell_type": "markdown",
"id": "7edea2c5",
"metadata": {},
"source": [
"### Warning: No observation identifier\n",
"\n",
"While the arguments to `gr.tf_pivot_wider()` only target two columns, we *do* need to think carefully about what other columns are in the dataset. It is important that the other columns help identify which values are associated with the same observation---the easiest way to do this is with some sort of observation identifier column.\n",
"\n",
"Let's take a look at what happens when we *do not* have a way to associate values with the same observation:\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "dc37c5f0",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" f | \n",
" x | \n",
" y | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" NaN | \n",
" 0.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" NaN | \n",
" NaN | \n",
" 0.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.0 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" NaN | \n",
" 0.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" NaN | \n",
" NaN | \n",
" 1.0 | \n",
"
\n",
" \n",
" 5 | \n",
" 0.0 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" f x y\n",
"0 NaN 0.0 NaN\n",
"1 NaN NaN 0.0\n",
"2 1.0 NaN NaN\n",
"3 NaN 0.0 NaN\n",
"4 NaN NaN 1.0\n",
"5 0.0 NaN NaN"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## NOTE: No need to edit\n",
"(\n",
" gr.df_make(\n",
" variable=[\"x\", \"y\", \"f\", \"x\", \"y\", \"f\"],\n",
" value=[0, 0, 1, 0, 1, 0],\n",
" # observation=[0, 0, 0, 1, 1, 1], # Remove the observation identifier\n",
" )\n",
" >> gr.tf_pivot_wider(\n",
" names_from=\"variable\",\n",
" values_from=\"value\",\n",
" )\n",
")\n"
]
},
{
"cell_type": "markdown",
"id": "750cd93f",
"metadata": {},
"source": [
"Note that `gr.tf_pivot_wider()` placed all of the values in the \"correct\" column, but filled all over column values with `NaN`s. If you get strange behavior with `gr.tf_pivot_wider()`, you should think about whether you are providing the information necessary to associate common values with the same observation. We'll learn more about this in the next exercise.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}