{ "cells": [ { "cell_type": "markdown", "id": "e83d474c", "metadata": { "tags": [] }, "source": [ "# Data: Pivoting Data\n", "\n", "*Purpose*: Data is easiest to use when it is *tidy*. In fact, grama is specifically designed to use tidy data. But not all data we'll encounter is tidy! To that end, in this exercise we'll learn how to tidy our data by *pivoting*.\n", "\n", "As a result of learning how to quickly *tidy* data, you'll vastly expand the set of datasets you can analyze. Rather than fighting with data, you'll be able to quickly wrangle and extract insights.\n" ] }, { "cell_type": "markdown", "id": "4f895c9a", "metadata": {}, "source": [ "## Setup\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "a2de1b4e", "metadata": { "collapsed": false }, "outputs": [], "source": [ "import grama as gr\n", "DF = gr.Intention()\n", "%matplotlib inline\n", "\n", "# For assertion\n", "from pandas.api.types import is_integer_dtype" ] }, { "cell_type": "markdown", "id": "6e3e55d6", "metadata": {}, "source": [ "# Tidy Data\n", "\n", "*Tidy Data* is a very simple---but *very powerful*---concept for structuring a dataset. " ] }, { "cell_type": "markdown", "id": "327c425e", "metadata": {}, "source": [ "![Stylized text providing an overview of Tidy Data. The top reads “Tidy data is a standard way of mapping the meaning of a dataset to its structure. 
- Hadley Wickham.” On the left reads “In tidy data: each variable forms a column; each observation forms a row; each cell is a single measurement.” There is an example table on the lower right with columns ‘id’, ‘name’ and ‘color’ with observations for different cats, illustrating tidy data structure.](./images/tidydata_1.jpg)\n", "Artwork by @allison_horst\n" ] }, { "cell_type": "markdown", "id": "1e9dcb2f", "metadata": {}, "source": [ "## Tidy Data: Definition\n", "\n", "A tidy dataset has three properties:\n", "\n", "- each variable forms a column\n", "- each observation forms a row\n", "- each cell is a single measurement\n" ] }, { "cell_type": "markdown", "id": "2096d57f", "metadata": {}, "source": [ "As an example, the following dataset is tidy:\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "083e991a", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
thickalloyEmuang
00.022al_24st106000.3210
10.022al_24st106000.3230
20.032al_24st104000.3290
30.032al_24st103000.3190
40.064al_24st105000.3230
\n", "
" ], "text/plain": [ " thick alloy E mu ang\n", "0 0.022 al_24st 10600 0.321 0\n", "1 0.022 al_24st 10600 0.323 0\n", "2 0.032 al_24st 10400 0.329 0\n", "3 0.032 al_24st 10300 0.319 0\n", "4 0.064 al_24st 10500 0.323 0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from grama.data import df_stang\n", "df_stang.head()\n" ] }, { "cell_type": "markdown", "id": "ad920c9e", "metadata": {}, "source": [ "The observations are all measured material properties taken at a particular angle (with respect to the direction in which the specimens were rolled). Each column reports values for just one variable, each row corresponds to a single observation, and every cell reports just one measurement.\n", "\n", "However, the following form of the same dataset is *not* tidy:\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "8d9430ea", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
thickE_00mu_00E_45mu_45E_90mu_90alloy
00.022106000.321107000.329105000.310al_24st
10.022106000.323105000.331107000.323al_24st
20.032104000.329104000.318103000.322al_24st
30.032103000.319105000.326104000.330al_24st
40.064105000.323104000.331104000.327al_24st
50.064107000.328105000.328105000.320al_24st
60.081100000.315100000.32099000.314al_24st
70.081101000.31299000.312100000.316al_24st
80.081100000.311-1-1.00099000.314al_24st
\n", "
" ], "text/plain": [ " thick E_00 mu_00 E_45 mu_45 E_90 mu_90 alloy\n", "0 0.022 10600 0.321 10700 0.329 10500 0.310 al_24st\n", "1 0.022 10600 0.323 10500 0.331 10700 0.323 al_24st\n", "2 0.032 10400 0.329 10400 0.318 10300 0.322 al_24st\n", "3 0.032 10300 0.319 10500 0.326 10400 0.330 al_24st\n", "4 0.064 10500 0.323 10400 0.331 10400 0.327 al_24st\n", "5 0.064 10700 0.328 10500 0.328 10500 0.320 al_24st\n", "6 0.081 10000 0.315 10000 0.320 9900 0.314 al_24st\n", "7 0.081 10100 0.312 9900 0.312 10000 0.316 al_24st\n", "8 0.081 10000 0.311 -1 -1.000 9900 0.314 al_24st" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from grama.data import df_stang_wide\n", "df_stang_wide" ] }, { "cell_type": "markdown", "id": "a827b07d", "metadata": {}, "source": [ "This dataset is *not* tidy: The angle of each measurement `00, 45, 90` is a variable, but these numerical values are expressed as column names. Put differently, some of the values are not in cells, but rather in the column names.\n" ] }, { "cell_type": "markdown", "id": "6224b58e", "metadata": {}, "source": [ "## Why tidy data?\n", "\n", "Tidy data makes analysis *easier*. Putting our data in tidy form means we can use a *consistent* set of tools to work with *any* dataset.\n" ] }, { "cell_type": "markdown", "id": "db8edfa0", "metadata": {}, "source": [ "![On the left is a happy cute fuzzy monster holding a rectangular data frame with a tool that fits the data frame shape. On the workbench behind the monster are other data frames of similar rectangular shape, and neatly arranged tools that also look like they would fit those data frames. The workbench looks uncluttered and tidy. The text above the tidy workbench reads “When working with tidy data, we can use the same tools in similar ways for different datasets…” On the right is a cute monster looking very frustrated, using duct tape and other tools to haphazardly tie data tables together, each in a different way. 
The monster is in front of a messy, cluttered workbench. The text above the frustrated monster reads “...but working with untidy data often means reinventing the wheel with one-time approaches that are hard to iterate or reuse.”](./images/tidydata_3.jpg)\n", "Artwork by @allison_horst\n" ] }, { "cell_type": "markdown", "id": "75529e7f", "metadata": {}, "source": [ "Note that untidy data is not *bad* data; untidy data are simply harder to work with when doing data analysis. Data often come in untidy form when they are reported, say in a paper or a presentation. For instance, the wide form of the Stang et al. dataset can easily fit on one page:\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "d7954242", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
thickE_00mu_00E_45mu_45E_90mu_90alloy
00.022106000.321107000.329105000.310al_24st
10.022106000.323105000.331107000.323al_24st
20.032104000.329104000.318103000.322al_24st
30.032103000.319105000.326104000.330al_24st
40.064105000.323104000.331104000.327al_24st
50.064107000.328105000.328105000.320al_24st
60.081100000.315100000.32099000.314al_24st
70.081101000.31299000.312100000.316al_24st
80.081100000.311-1-1.00099000.314al_24st
\n", "
" ], "text/plain": [ " thick E_00 mu_00 E_45 mu_45 E_90 mu_90 alloy\n", "0 0.022 10600 0.321 10700 0.329 10500 0.310 al_24st\n", "1 0.022 10600 0.323 10500 0.331 10700 0.323 al_24st\n", "2 0.032 10400 0.329 10400 0.318 10300 0.322 al_24st\n", "3 0.032 10300 0.319 10500 0.326 10400 0.330 al_24st\n", "4 0.064 10500 0.323 10400 0.331 10400 0.327 al_24st\n", "5 0.064 10700 0.328 10500 0.328 10500 0.320 al_24st\n", "6 0.081 10000 0.315 10000 0.320 9900 0.314 al_24st\n", "7 0.081 10100 0.312 9900 0.312 10000 0.316 al_24st\n", "8 0.081 10000 0.311 -1 -1.000 9900 0.314 al_24st" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_stang_wide" ] }, { "cell_type": "markdown", "id": "a8b7ebd0", "metadata": {}, "source": [ "However, the tidy form of the same dataset is far less compact:\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "dc13f694", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
thickalloyEmuang
00.022al_24st106000.3210
10.022al_24st106000.3230
20.032al_24st104000.3290
30.032al_24st103000.3190
40.064al_24st105000.3230
..................
710.064al_24st104000.32790
720.064al_24st105000.32090
730.081al_24st99000.31490
740.081al_24st100000.31690
750.081al_24st99000.31490
\n", "

76 rows × 5 columns

\n", "
" ], "text/plain": [ " thick alloy E mu ang\n", "0 0.022 al_24st 10600 0.321 0\n", "1 0.022 al_24st 10600 0.323 0\n", "2 0.032 al_24st 10400 0.329 0\n", "3 0.032 al_24st 10300 0.319 0\n", "4 0.064 al_24st 10500 0.323 0\n", ".. ... ... ... ... ...\n", "71 0.064 al_24st 10400 0.327 90\n", "72 0.064 al_24st 10500 0.320 90\n", "73 0.081 al_24st 9900 0.314 90\n", "74 0.081 al_24st 10000 0.316 90\n", "75 0.081 al_24st 9900 0.314 90\n", "\n", "[76 rows x 5 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_stang" ] }, { "cell_type": "markdown", "id": "fac124db", "metadata": {}, "source": [ "## Exercises\n", "\n", "Let's practice identifying tidy and untidy data.\n" ] }, { "cell_type": "markdown", "id": "a6cc04c9", "metadata": {}, "source": [ "### __q1__ Identify\n", "\n", "Inspect the following dataset; answer the questions under *observations* below.\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "f9a3fe5b", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyear2011year2012year2013
0FR700069007000
1DE580060006200
2US150001400013000
\n", "
" ], "text/plain": [ " country year2011 year2012 year2013\n", "0 FR 7000 6900 7000\n", "1 DE 5800 6000 6200\n", "2 US 15000 14000 13000" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## TASK: No need to edit; run and inspect\n", "df_cases = gr.df_make(\n", " country=[\"FR\", \"DE\", \"US\"],\n", " year2011=[7000, 5800, 15000],\n", " year2012=[6900, 6000, 14000],\n", " year2013=[7000, 6200, 13000],\n", ")\n", "df_cases\n" ] }, { "cell_type": "markdown", "id": "e83ee3f0", "metadata": {}, "source": [ "*Observations*\n", "\n", "- What are the *variables* in this dataset?\n", " - Country, year, and some unknown quantity (n, count, etc.)\n", "- Is this dataset *tidy*? Why or why not?\n", " - No; the year values are in the column names.\n", "" ] }, { "cell_type": "markdown", "id": "f75cd1f4", "metadata": {}, "source": [ "### __q2__ Identify\n", "\n", "Inspect the following dataset; answer the questions under *observations* below.\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "e8145fbc", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
thickE_00mu_00E_45mu_45rep
00.022106000.321107000.3291
10.022106000.323105000.3312
20.032104000.329104000.3181
30.032103000.319105000.3262
\n", "
" ], "text/plain": [ " thick E_00 mu_00 E_45 mu_45 rep\n", "0 0.022 10600 0.321 10700 0.329 1\n", "1 0.022 10600 0.323 10500 0.331 2\n", "2 0.032 10400 0.329 10400 0.318 1\n", "3 0.032 10300 0.319 10500 0.326 2" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## TASK: No need to edit; run and inspect\n", "df_alloys1 = gr.df_make(\n", " thick=[0.022, 0.022, 0.032, 0.032],\n", " E_00=[10600, 10600, 10400, 10300],\n", " mu_00=[0.321, 0.323, 0.329, 0.319],\n", " E_45=[10700, 10500, 10400, 10500],\n", " mu_45=[0.329, 0.331, 0.318, 0.326],\n", " rep=[1, 2, 1, 2],\n", ")\n", "df_alloys1" ] }, { "cell_type": "markdown", "id": "56013aad", "metadata": {}, "source": [ "*Observations*\n", "\n", "- What are the *variables* in this dataset?\n", " - Thickness `thick`, elasticity `E`, poisson's ratio `mu`, angle (in column names), replication `rep`\n", "- Is this dataset *tidy*? Why or why not?\n", " - No; the angle values are in the column names.\n", "" ] }, { "cell_type": "markdown", "id": "dc834b12", "metadata": {}, "source": [ "### __q3__ Identify\n", "\n", "Inspect the following dataset; answer the questions under *observations* below.\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "10700eb4", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
thickvarvaluerepangle
00.022E10700.00010
10.022mu0.32110
20.032E10500.00020
30.032mu0.32320
\n", "
" ], "text/plain": [ " thick var value rep angle\n", "0 0.022 E 10700.000 1 0\n", "1 0.022 mu 0.321 1 0\n", "2 0.032 E 10500.000 2 0\n", "3 0.032 mu 0.323 2 0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## TASK: No need to edit; run and inspect\n", "df_alloys2 = gr.df_make(\n", " thick=[0.022, 0.022, 0.032, 0.032],\n", " var=[\"E\", \"mu\", \"E\", \"mu\"],\n", " value=[10700, 0.321, 10500, 0.323],\n", " rep=[1, 1, 2, 2],\n", " angle=[0, 0, 0, 0],\n", ")\n", "df_alloys2" ] }, { "cell_type": "markdown", "id": "551295cc", "metadata": {}, "source": [ "*Observations*\n", "\n", "- What are the *variables* in this dataset?\n", " - Thickness `thick`, elasticity `E`, poisson's ratio `mu`, `angle`, replication `rep`\n", "- Is this dataset *tidy*? Why or why not?\n", " - No; the column `value` contains values of two *different* variables `E` and `mu`.\n", "" ] }, { "cell_type": "markdown", "id": "2be971b3", "metadata": {}, "source": [ "# Pivoting Data\n", "\n", "The untidy datasets above fail to be tidy because they have the wrong *shape*; we can tidy these datasets by *pivoting* the data. There are just two pivots we need to learn about: `tf_pivot_longer()` and `tf_pivot_wider()`.\n" ] }, { "cell_type": "markdown", "id": "09b93698", "metadata": {}, "source": [ "## Pivot Longer\n", "\n", "Let's take another look at the `df_cases` example dataset:\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "12df95e8", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyear2011year2012year2013
0FR700069007000
1DE580060006200
2US150001400013000
\n", "
" ], "text/plain": [ " country year2011 year2012 year2013\n", "0 FR 7000 6900 7000\n", "1 DE 5800 6000 6200\n", "2 US 15000 14000 13000" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## NOTE: No need to edit\n", "df_cases" ] }, { "cell_type": "markdown", "id": "b8cd68e5", "metadata": {}, "source": [ "This dataset is *too wide*; the column names `year2011,year2012,year2013` should really be numbers inside a single `year` column. We can *pivot longer* to move from a wide structure to a longer one:\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "c0b87da0", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyearcount
0FRyear20117000
1DEyear20115800
2USyear201115000
3FRyear20126900
4DEyear20126000
5USyear201214000
6FRyear20137000
7DEyear20136200
8USyear201313000
\n", "
" ], "text/plain": [ " country year count\n", "0 FR year2011 7000\n", "1 DE year2011 5800\n", "2 US year2011 15000\n", "3 FR year2012 6900\n", "4 DE year2012 6000\n", "5 US year2012 14000\n", "6 FR year2013 7000\n", "7 DE year2013 6200\n", "8 US year2013 13000" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## NOTE: No need to edit\n", "(\n", " df_cases\n", " >> gr.tf_pivot_longer(\n", " columns=[\"year2011\", \"year2012\", \"year2013\"],\n", " names_to=\"year\",\n", " values_to=\"count\",\n", " )\n", ")" ] }, { "cell_type": "markdown", "id": "9dcfffe3", "metadata": {}, "source": [ "Note what we had to provide as arguments to `gr.tf_pivot_longer()`:\n", "\n", "- `columns` specifies the columns to involve in the pivoting\n", "- `names_to` specifies what new column will contain the previous column names\n", "- `value_to` specifies what new column will contain the previous cell values\n", "\n", "Often, a pivot alone is not enough to fully clean a dataset; for instance, we would still need to remove the `year` string from each cell entry. However, we can do this much more easily once the data are pivoted into a longer format:\n" ] }, { "cell_type": "code", "execution_count": 11, "id": "08768fbb", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
countryyearcount
0FR20117000
1DE20115800
2US201115000
3FR20126900
4DE20126000
5US201214000
6FR20137000
7DE20136200
8US201313000
\n", "
" ], "text/plain": [ " country year count\n", "0 FR 2011 7000\n", "1 DE 2011 5800\n", "2 US 2011 15000\n", "3 FR 2012 6900\n", "4 DE 2012 6000\n", "5 US 2012 14000\n", "6 FR 2013 7000\n", "7 DE 2013 6200\n", "8 US 2013 13000" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## NOTE: No need to edit\n", "(\n", " df_cases\n", " >> gr.tf_pivot_longer(\n", " columns=[\"year2011\", \"year2012\", \"year2013\"],\n", " names_to=\"year\",\n", " values_to=\"count\",\n", " )\n", " >> gr.tf_mutate(year=gr.str_replace(DF.year, \"year\", \"\"))\n", ")" ] }, { "cell_type": "markdown", "id": "4e1c0d74", "metadata": {}, "source": [ "### __q4__ Pivot the data\n", "\n", "Pivot the dataset `df_alloys1` to place each of `E_00, mu_00, E_45, mu_45` as values in a new `variable` column, and move the original values to a new `value` column. Answer the questions under *observations* below.\n" ] }, { "cell_type": "code", "execution_count": 12, "id": "8eeffa51", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " thick rep variable value\n", "0 0.022 1 E_00 10600.000\n", "1 0.022 2 E_00 10600.000\n", "2 0.032 1 E_00 10400.000\n", "3 0.032 2 E_00 10300.000\n", "4 0.022 1 mu_00 0.321\n", "5 0.022 2 mu_00 0.323\n", "6 0.032 1 mu_00 0.329\n", "7 0.032 2 mu_00 0.319\n", "8 0.022 1 E_45 10700.000\n", "9 0.022 2 E_45 10500.000\n", "10 0.032 1 E_45 10400.000\n", "11 0.032 2 E_45 10500.000\n", "12 0.022 1 mu_45 0.329\n", "13 0.022 2 mu_45 0.331\n", "14 0.032 1 mu_45 0.318\n", "15 0.032 2 mu_45 0.326\n" ] } ], "source": [ "## TASK: Tidy the data, make sure to provide `variable` and `value` columns\n", "df_res1 = (\n", " df_alloys1\n", "\n", " >> gr.tf_pivot_longer(\n", " columns=[\"E_00\", \"mu_00\", \"E_45\", \"mu_45\"],\n", " names_to=\"variable\",\n", " values_to=\"value\",\n", " )\n", ")\n", "\n", "## NOTE: Use this to check your work\n", "print(df_res1)\n", "\n", "assert \\\n", " {\"thick\", \"rep\", 
\"variable\", \"value\"} == set(df_res1.columns), \\\n", " 'columns of df_res1 are not [\"thick\", \"rep\", \"variable\", \"value\"]'\n", "\n", "assert \\\n", " {\"E_00\", \"mu_00\", \"E_45\", \"mu_45\"} == set(df_res1.variable), \\\n", " 'Entries in df_res1.variable are incorrect'\n" ] }, { "cell_type": "markdown", "id": "55fd504c", "metadata": {}, "source": [ "*Observations*\n", "\n", "- Is this dataset tidy? Why or why not?\n", " - This dataset is not tidy; the entries in the `value` column include values of `E` and `mu` (two different variables). Also, the `variable` column contains both variable names and angle values.\n", "" ] }, { "cell_type": "markdown", "id": "325d594c", "metadata": {}, "source": [ "### Using selection helpers\n", "\n", "One useful feature of `gr.tf_pivot_longer()` is that we can use [selection helpers](https://zdelrosario.github.io/evc-course/exercises_solution/d02-e-data01-isolate-solution.html#selection-helpers) to choose which columns to use in a pivot. This can save us a lot of typing, and allows us to write code that can react to multiple different datasets. Let's practice using selection helpers by re-doing the previous task.\n" ] }, { "cell_type": "markdown", "id": "4a61a47b", "metadata": {}, "source": [ "### __q5__ Simplify with selection helpers\n", "\n", "Use a single [selection helper](https://zdelrosario.github.io/evc-course/exercises_solution/d02-e-data01-isolate-solution.html#selection-helpers) to simplify the `columns` argument to `gr.tf_pivot_longer()`.\n", "\n", "*Hint*: There are multiple ways to do this: The string `\"E|mu\"` will `match` the strings `\"E\"` or `\"mu\"`. 
The string `\"\\\\d+\"` will `match` *any string of consecutive digits*.\n" ] }, { "cell_type": "code", "execution_count": 13, "id": "598055cf", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " thick rep variable value\n", "0 0.022 1 E_00 10600.000\n", "1 0.022 2 E_00 10600.000\n", "2 0.032 1 E_00 10400.000\n", "3 0.032 2 E_00 10300.000\n", "4 0.022 1 mu_00 0.321\n", "5 0.022 2 mu_00 0.323\n", "6 0.032 1 mu_00 0.329\n", "7 0.032 2 mu_00 0.319\n", "8 0.022 1 E_45 10700.000\n", "9 0.022 2 E_45 10500.000\n", "10 0.032 1 E_45 10400.000\n", "11 0.032 2 E_45 10500.000\n", "12 0.022 1 mu_45 0.329\n", "13 0.022 2 mu_45 0.331\n", "14 0.032 1 mu_45 0.318\n", "15 0.032 2 mu_45 0.326\n" ] } ], "source": [ "## TASK: Replace the `columns` argument with a selection helper\n", "df_simplified = (\n", " df_alloys1\n", " >> gr.tf_pivot_longer(\n", "\n", " columns=gr.matches(\"E|mu\"), \n", " names_to=\"variable\",\n", " values_to=\"value\",\n", " )\n", ")\n", "\n", "## NOTE: Use this to check your work\n", "print(df_simplified)\n", "\n", "assert \\\n", " {\"thick\", \"rep\", \"variable\", \"value\"} == set(df_simplified.columns), \\\n", " 'columns of df_simplified are not [\"thick\", \"rep\", \"variable\", \"value\"]'\n", "\n", "assert \\\n", " {\"E_00\", \"mu_00\", \"E_45\", \"mu_45\"} == set(df_simplified.variable), \\\n", " 'Entries in df_simplified.variable are incorrect'\n" ] }, { "cell_type": "markdown", "id": "9e20c94e", "metadata": {}, "source": [ "## Pivot Wider\n", "\n", "Just as a dataset can be \"too wide\", it can also be \"too long.\" In this case, we can use `gr.tf_pivot_wider()`. Let's look at an illustrative example:\n" ] }, { "cell_type": "code", "execution_count": 14, "id": "3a9dc621", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
variablevalueobservation
0x00
1y00
2f10
3x01
4y11
5f01
\n", "
" ], "text/plain": [ " variable value observation\n", "0 x 0 0\n", "1 y 0 0\n", "2 f 1 0\n", "3 x 0 1\n", "4 y 1 1\n", "5 f 0 1" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## NOTE: No need to edit\n", "df_long = gr.df_make(\n", " variable=[\"x\", \"y\", \"f\", \"x\", \"y\", \"f\"],\n", " value=[0, 0, 1, 0, 1, 0],\n", " observation=[0, 0, 0, 1, 1, 1],\n", ")\n", "df_long\n" ] }, { "cell_type": "markdown", "id": "55be108f", "metadata": {}, "source": [ "Here, the `value` column has mixed variables; values of `x`, `y`, and `f` comingle in the `value` column. Really we should have the *columns* `x`, `y`, and `f`: In this sense the data is *too long*. Let's pivot wider to tidy the data.\n" ] }, { "cell_type": "code", "execution_count": 15, "id": "b919286c", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
observationfxy
00100
11001
\n", "
" ], "text/plain": [ " observation f x y\n", "0 0 1 0 0\n", "1 1 0 0 1" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## NOTE: No need to edit\n", "(\n", " df_long\n", " >> gr.tf_pivot_wider(\n", " names_from=\"variable\",\n", " values_from=\"value\",\n", " )\n", ")" ] }, { "cell_type": "markdown", "id": "e747affc", "metadata": {}, "source": [ "Note that we had to provide just two inputs to `gr.tf_pivot_wider()`:\n", "\n", "- `names_from` specifies which column will provide the new column names\n", "- `value_from` specifies which column will provide the new column values\n", "\n", "Let's get some practice!\n" ] }, { "cell_type": "markdown", "id": "ad0074c9", "metadata": {}, "source": [ "### __q6__ Pivot the data\n", "\n", "Pivot the dataset `df_alloys2` to provide `E` and `mu` as columns in `df_res2`. Answer the questions under *observations* below.\n" ] }, { "cell_type": "code", "execution_count": 16, "id": "54359a2f", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " thick rep angle E mu\n", "0 0.022 1 0 10700.0 0.321\n", "1 0.032 2 0 10500.0 0.323\n" ] } ], "source": [ "## TASK: Pivot the dataset wider\n", "df_res2 = (\n", " df_alloys2\n", " >> gr.tf_pivot_wider(\n", " names_from=\"var\",\n", " values_from=\"value\",\n", " )\n", ")\n", "\n", "## NOTE: Use the following to check your work\n", "print(df_res2)\n", "\n", "assert \\\n", " {\"thick\", \"rep\", \"angle\", \"E\", \"mu\"} == set(df_res2.columns), \\\n", " 'df_res2 does not have the columns [\"thick\", \"rep\", \"angle\", \"E\", \"mu\"]'\n" ] }, { "cell_type": "markdown", "id": "30685938", "metadata": {}, "source": [ "*Observations*\n", "\n", "- Is this dataset tidy? Why or why not?\n", " - This dataset is tidy! 
Every column is one variable, every row is one observation, and every cell contains one value.\n", "" ] }, { "cell_type": "markdown", "id": "7edea2c5", "metadata": {}, "source": [ "### Warning: No observation identifier\n", "\n", "While the arguments to `gr.tf_pivot_wider()` only target two columns, we *do* need to think carefully about what other columns are in the dataset. It is important that the other columns help identify which values are associated with the same observation---the easiest way to do this is with some sort of observation identifier column.\n", "\n", "Let's take a look at what happens when we *do not* have a way to associate values with the same observation:\n" ] }, { "cell_type": "code", "execution_count": 17, "id": "dc37c5f0", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
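The role of the observation identifier can be sketched in plain pandas as well. This is a pandas sketch, not grama's API: `DataFrame.pivot` plays the role of `tf_pivot_wider()`, and when no `index` is supplied it falls back on the row index, so each row carries only one of the variables.

```python
import pandas as pd

# Re-create the df_long data (hypothetical stand-in for gr.df_make)
df_long_pd = pd.DataFrame({
    "variable": ["x", "y", "f", "x", "y", "f"],
    "value": [0, 0, 1, 0, 1, 0],
    "observation": [0, 0, 0, 1, 1, 1],
})

# With an observation identifier, each (observation, variable) pair
# maps to exactly one cell, so the wide table is well-defined
df_wide = df_long_pd.pivot(index="observation", columns="variable", values="value")

# Without an identifier, pivot falls back on the row index (0..5);
# each row then holds just one of the three variables, leaving NaNs
df_sparse = df_long_pd.pivot(columns="variable", values="value")
```

Here `df_wide` has one row per observation, while `df_sparse` keeps six rows and fills the unmatched cells with `NaN`, mirroring the grama behavior shown below.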
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fxy
0NaN0.0NaN
1NaNNaN0.0
21.0NaNNaN
3NaN0.0NaN
4NaNNaN1.0
50.0NaNNaN
\n", "
" ], "text/plain": [ " f x y\n", "0 NaN 0.0 NaN\n", "1 NaN NaN 0.0\n", "2 1.0 NaN NaN\n", "3 NaN 0.0 NaN\n", "4 NaN NaN 1.0\n", "5 0.0 NaN NaN" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## NOTE: No need to edit\n", "(\n", " gr.df_make(\n", " variable=[\"x\", \"y\", \"f\", \"x\", \"y\", \"f\"],\n", " value=[0, 0, 1, 0, 1, 0],\n", " # observation=[0, 0, 0, 1, 1, 1], # Remove the observation identifier\n", " )\n", " >> gr.tf_pivot_wider(\n", " names_from=\"variable\",\n", " values_from=\"value\",\n", " )\n", ")\n" ] }, { "cell_type": "markdown", "id": "750cd93f", "metadata": {}, "source": [ "Note that `gr.tf_pivot_wider()` placed all of the values in the \"correct\" column, but filled all over column values with `NaN`s. If you get strange behavior with `gr.tf_pivot_wider()`, you should think about whether you are providing the information necessary to associate common values with the same observation. We'll learn more about this in the next exercise.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }