{ "cells": [ { "cell_type": "markdown", "id": "e56ce1d1", "metadata": {}, "source": [ "# Data: Deriving Quantities\n", "\n", "*Purpose*: Often our data will not tell us *directly* what we want to know; in\n", "these cases we need to *derive* new quantities from our data. In this exercise,\n", "we'll work with `tf_mutate()` to create new columns by operating on existing\n", "variables, and use `tf_group_by()` with `tf_summarize()` to compute aggregate\n", "statistics (summaries!) of our data.\n", "\n", "Aside: The data-summary verbs in grama are heavily inspired by the [dplyr](https://dplyr.tidyverse.org/) package in the R programming langauge. " ] }, { "cell_type": "markdown", "id": "a5b9d5e1", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 1, "id": "34b54e42", "metadata": { "collapsed": false }, "outputs": [], "source": [ "import grama as gr\n", "DF = gr.Intention()\n", "%matplotlib inline\n", "\n", "# For assertion\n", "from pandas.api.types import is_numeric_dtype" ] }, { "cell_type": "markdown", "id": "dbd356f3", "metadata": {}, "source": [ "We'll be using the `diamonds` as seen in `e-vis00-basics` earlier. " ] }, { "cell_type": "code", "execution_count": 2, "id": "800429ed", "metadata": { "collapsed": false }, "outputs": [], "source": [ "from grama.data import df_diamonds" ] }, { "cell_type": "markdown", "id": "134f3740", "metadata": {}, "source": [ "# Mutating DataFrames\n", "\n", "A *mutation* transformation allows us to make new columns (or edit old ones) by *combining* column values. \n", "\n", "For example, let's remember the diamonds dataset:\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "0032eee7", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
caratcutcolorclaritydepthtablepricexyz
00.23IdealESI261.555.03263.953.982.43
10.21PremiumESI159.861.03263.893.842.31
20.23GoodEVS156.965.03274.054.072.31
30.29PremiumIVS262.458.03344.204.232.63
40.31GoodJSI263.358.03354.344.352.75
\n", "
" ], "text/plain": [ " carat cut color clarity depth table price x y z\n", "0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43\n", "1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31\n", "2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31\n", "3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63\n", "4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## NOTE: No need to edit\n", "(\n", " df_diamonds\n", " >> gr.tf_head(5)\n", ")" ] }, { "cell_type": "markdown", "id": "1cff6ce5", "metadata": {}, "source": [ "If we wanted to approximate the volume of every diamond, we could do this by multiplying their `x`, `y`, and `z` dimensions. We could create a new column with this value via `gr.tf_mutate()`:\n" ] }, { "cell_type": "code", "execution_count": 4, "id": "9ebd167f", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
caratcutcolorclaritydepthtablepricexyzvolume
00.23IdealESI261.555.03263.953.982.4338.202030
10.21PremiumESI159.861.03263.893.842.3134.505856
20.23GoodEVS156.965.03274.054.072.3138.076885
30.29PremiumIVS262.458.03344.204.232.6346.724580
40.31GoodJSI263.358.03354.344.352.7551.917250
\n", "
" ], "text/plain": [ " carat cut color clarity depth table price x y z \\\n", "0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 \n", "1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 \n", "2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 \n", "3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 \n", "4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 \n", "\n", " volume \n", "0 38.202030 \n", "1 34.505856 \n", "2 38.076885 \n", "3 46.724580 \n", "4 51.917250 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## NOTE: No need to edit\n", "(\n", " df_diamonds\n", " >> gr.tf_mutate(volume=DF.x * DF.y * DF.z)\n", " >> gr.tf_head(5)\n", ")" ] }, { "cell_type": "markdown", "id": "825af8ad", "metadata": {}, "source": [ "*Aside*: Note that we can use dot `.` notation to access columns, rather than bracket `[]` notation. This works only when the column names are valid variable names (i.e. do not contain spaces, do not contain special characters, etc.).\n" ] }, { "cell_type": "markdown", "id": "1a815918", "metadata": {}, "source": [ "### __q1__ Compute a mass density\n", "\n", "Approximate the density of the diamonds using the expression\n", "\n", "$$\\rho = \\frac{\\text{carat}}{xyz}.$$\n", "\n", "Name the new column `rho`. Answer the questions under *observations* below.\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "5206567d", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " rho\n", "count 53920.000000\n", "mean 0.006127\n", "std 0.000178\n", "min 0.000521\n", "25% 0.006048\n", "50% 0.006117\n", "75% 0.006190\n", "max 0.022647\n" ] } ], "source": [ "## TASK: Provide the approximate density as the column `rho`\n", "df_rho = (\n", " df_diamonds\n", " >> gr.tf_mutate(rho=DF[\"carat\"] / DF[\"x\"] / DF[\"y\"] / DF[\"z\"])\n", ")\n", "\n", "## NOTE: Use this to check your work\n", "assert \\\n", " \"rho\" in df_rho.columns, \\\n", " \"df_rho does not have a `rho` column\"\n", "\n", "assert \\\n", " abs(df_rho.rho.median() - 6.117055e-03) < 1e-6, \\\n", " \"The values of df_rho.rho are incorrect; check your calculation\"\n", "\n", "print(\n", " df_rho\n", " >> gr.tf_filter(DF.rho < 1e6)\n", " >> gr.tf_select(\"rho\")\n", " >> gr.tf_describe()\n", ")\n" ] }, { "cell_type": "markdown", "id": "57fc5465", "metadata": {}, "source": [ "*Observations*\n", "\n", "- What is the mean density `rho`?\n", " - `0.006127`\n", "- What is the lower-quarter density `rho` (25% value)?\n", " - `0.006048`\n", "- What is the upper-quarter density `rho` (75% value)?\n", " - `0.006190`\n", "" ] }, { "cell_type": "markdown", "id": "e321d628", "metadata": {}, "source": [ "## Vectorized Functions\n", "\n", "The `gr.tf_mutate()` verb carries out *vector* operations; these operations take in one or more columns, and return an entire column. For operations such as addition, this results in *elementwise* addition, as illustrated below:\n", "\n", "| `x` | `y` | `x + y` |\n", "|-----|-----|---------|\n", "| 1 | 0 | 1 |\n", "| 1 | 1 | 2 |\n", "| 1 | -1 | 0 |\n" ] }, { "cell_type": "markdown", "id": "5f98f3e3", "metadata": {}, "source": [ "The `gr.tf_mutate()` verb accepts a wide variety of *vectorized* functions; the following table organizes these into various types of operations.\n", "\n", "| Type | Functions |\n", "| ---- | --------- |\n", "| Arithmetic ops. | `DF.x + 1, DF.y - DF.x, DF.x * DF.y, DF.x / DF.y, DF.x^DF.a` |\n", "| Modular arith. | `DF.x // DF.y` (Floor division) `DF.x % DF.y` (Remainder) |\n", "| Logical comp. | `<, <=, >, >=, !=, ==` |\n", "| Logarithms | `gr.log(DF.x)` |\n", "| Offsets | `gr.lead(DF.x), gr.lag(DF.x)` |\n", "| Cumulants | `gr.cumsum(DF.x), gr.cumprod(DF.x), gr.cummin(DF.x), gr.cummaDF.x(DF.x), gr.cummean(DF.x)` |\n", "| Ranking | `gr.min_rank(DF.x), gr.row_number(DF.x), gr.dense_rank(DF.x), gr.percent_rank(DF.x)` |\n", "| Data conversion | `gr.as_numeric(DF.x)`, `gr.as_str(DF.x)`, `gr.as_factor(DF.x)` |\n", "| Control | `gr.if_else()`, `gr.case_when()` |" ] }, { "cell_type": "markdown", "id": "2ddd4fdd", "metadata": {}, "source": [ "### __q2__ Do a type conversion\n", "\n", "Convert the column `y` below into a numeric type.\n", "\n", "*Hint*: Use the table above to find an appropriate data conversion function.\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "6f0b8926", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Success!\n" ] } ], "source": [ "## TASK: Convert the `y` column to a numeric type\n", "df_converted = (\n", " gr.df_make(\n", " x=[ 1, 2, 3],\n", " y=[\"4\", \"5\", \"6\"],\n", " )\n", "\n", " >> gr.tf_mutate(y=gr.as_numeric(DF.y))\n", ")\n", "\n", "## NOTE: Use this to check your work\n", "assert \\\n", " is_numeric_dtype(df_converted.y), \\\n", " \"df_converted.y is not numeric; make sure to do a conversion\"\n", "\n", "print(\"Success!\")" ] }, { "cell_type": "markdown", "id": "20bb4124", "metadata": {}, "source": [ "### __q3__ Sanity-check the data\n", "\n", "The `depth` variable in `df_diamonds` is *supposedly* computed via `depth_computed = 100 * 2 * DF[\"z\"] / (DF[\"x\"] + DF[\"y\"])`. Compute `diff = DF[\"depth\"] - DF[\"depth_computed\"]`: This is a measure of discrepancy between the given and computed depth. Answer the questions under *observations* below.\n", "\n", "*Hint*: If you want to compute `depth_computed` and use it to compute `diff`, you'll need to use two calls of `gr.tf_mutate()`.\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "9e7bf919", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
diff
count5.393300e+04
mean5.284249e-03
std2.629223e+00
min-5.574795e+02
25%-2.660550e-02
50%-7.105427e-15
75%2.665245e-02
max6.400000e+01
\n", "
" ], "text/plain": [ " diff\n", "count 5.393300e+04\n", "mean 5.284249e-03\n", "std 2.629223e+00\n", "min -5.574795e+02\n", "25% -2.660550e-02\n", "50% -7.105427e-15\n", "75% 2.665245e-02\n", "max 6.400000e+01" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## TASK: Compute `depth_computed` and `diff`, as described above\n", "df_depth = (\n", " df_diamonds \n", "\n", "# solution-begin \n", " >> gr.tf_mutate(depth_computed=100 * 2 * DF[\"z\"] / (DF[\"x\"] + DF[\"y\"]))\n", " >> gr.tf_mutate(diff = DF[\"depth\"]- DF[\"depth_computed\"])\n", ")\n", "\n", "## NOTE: Use this to check your work\n", "assert \\\n", " abs(df_depth[\"diff\"].median()) < 1e-14, \\\n", " \"df_depth.diff has the wrong values; check your calculation\"\n", "assert \\\n", " abs(df_depth[\"diff\"].mean() - 5.284249e-03) < 1e-6, \\\n", " \"df_depth.diff has the wrong values; check your calculation\"\n", "\n", "# Show the data\n", "(\n", " df_depth\n", " >> gr.tf_select(\"diff\")\n", " >> gr.tf_describe()\n", ")\n" ] }, { "cell_type": "markdown", "id": "1e67cf7f", "metadata": {}, "source": [ "**Observation**\n", "\n", "- What is the mean of `diff`? What is the median of `diff`?\n", " - `mean(diff) == 5.284249e-03`; `median(diff) ~= 0.0`\n", "- What are the min and max of `diff`?\n", " - `min(diff) == -5.574795e+02`, `max(diff) == 64`\n", "- Remember that `diff` measures the agreement between `depth` and `depth_computed`; how well do these two quantities match?\n", " - Generally, the agreement between `depth` and `depth_computed` is very good (the mean is close to zero, the 25% and 75% quantiles are quite small). However, there are some extreme cases of disagreement; perhaps there are data errors in these particular values?\n" ] }, { "cell_type": "markdown", "id": "b4fd84ff", "metadata": {}, "source": [ "## Control Functions\n", "\n", "The `gr.if_else()` and `gr.case_when()` functions are *special* vectorized functions---they allow us to use more programming-like constructs to control the output. The `gr.if_else()` helper is a simple \"yes-no\" function, while `gr.case_when()` is a vectorized [switch statement](https://en.wikipedia.org/wiki/Switch_statement). If you are trying to do more complicated data operations, you might want to give these functions a look.\n" ] }, { "cell_type": "markdown", "id": "e5cf4ba1", "metadata": {}, "source": [ "### __q4__ Use an if/else statement\n", "\n", "Use `gr.if_else()` to flag diamonds as `\"Big\"` if they have `carat >= 1` and `\"Small\"` otherwise.\n", "\n", "*Hint*: Remember to read the documentation for an unfamiliar function to learn how to call it; in particular, try looking at the *Examples* section of the documentation.\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "eeff6986", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
caratcutcolorclaritydepthtablepricexyzsize
00.23IdealESI261.555.03263.953.982.43Small
10.21PremiumESI159.861.03263.893.842.31Small
20.23GoodEVS156.965.03274.054.072.31Small
30.29PremiumIVS262.458.03344.204.232.63Small
40.31GoodJSI263.358.03354.344.352.75Small
....................................
539350.72IdealDSI160.857.027575.755.763.50Small
539360.72GoodDSI163.155.027575.695.753.61Small
539370.70Very GoodDSI162.860.027575.665.683.56Small
539380.86PremiumHSI261.058.027576.156.123.74Small
539390.75IdealDSI262.255.027575.835.873.64Small
\n", "

53940 rows × 11 columns

\n", "
" ], "text/plain": [ " carat cut color clarity depth table price x y z \\\n", "0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 \n", "1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 \n", "2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 \n", "3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 \n", "4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 \n", "... ... ... ... ... ... ... ... ... ... ... \n", "53935 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50 \n", "53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61 \n", "53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56 \n", "53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74 \n", "53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64 \n", "\n", " size \n", "0 Small \n", "1 Small \n", "2 Small \n", "3 Small \n", "4 Small \n", "... ... \n", "53935 Small \n", "53936 Small \n", "53937 Small \n", "53938 Small \n", "53939 Small \n", "\n", "[53940 rows x 11 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## TODO: Compute the `size` using `gr.if_else()`\n", "df_size = (\n", " df_diamonds\n", " >> gr.tf_mutate(\n", " \n", " size=gr.if_else(DF.carat >= 1, \"Big\", \"Small\")\n", " )\n", ")\n", "\n", "## NOTE: Use this to check you work\n", "assert \\\n", " \"size\" in df_size.columns, \\\n", " \"df_size does not have a `size` column\"\n", "\n", "assert \\\n", " all(df_size[df_size[\"size\"] == \"Big\"].carat >= 1), \\\n", " \"Size calculation is incorrect\"\n", "\n", "df_size" ] }, { "cell_type": "markdown", "id": "8283a434", "metadata": {}, "source": [ "# Summarizing Data Frames\n", "\n", "The `gr.tf_summarize()` verb is used to *reduce* a DataFrame down to fewer rows using summary functions. For instance, we can compute the mean `carat` and `price` across the whole dataset of diamonds:\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "b2264c8a", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
carat_meanprice_mean
00.797943932.799722
\n", "
" ], "text/plain": [ " carat_mean price_mean\n", "0 0.79794 3932.799722" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## NOTE: No need to edit\n", "(\n", " df_diamonds\n", " >> gr.tf_summarize(\n", " carat_mean=gr.mean(DF.carat),\n", " price_mean=gr.mean(DF.price),\n", " )\n", ")" ] }, { "cell_type": "markdown", "id": "807a445a", "metadata": {}, "source": [ "Notice that we went from having around `54,000` rows down to just one row; in this sense we have *summarized* the dataset.\n" ] }, { "cell_type": "markdown", "id": "896f5bc3", "metadata": {}, "source": [ "## Summary functions\n", "\n", "Summarize accepts a large number of *summary functions*; the table below lists and categorizes some of the most common summary functions.\n", "\n", "| Type | Functions |\n", "| ---- | --------- |\n", "| Location | `gr.mean(DF.x), gr.median(DF.x), gr.min(DF.x), gr.max(DF.x)` |\n", "| Spread | `gr.sd(DF.x), gr.var(DF.x), gr.IQR(DF.x)` |\n", "| Order | `gr.first(DF.x), gr.nth(DF.x, n), gr.last(DF.x)` |\n", "| Counts | `gr.n_distinct(DF.x)`, `gr.n()` (Total count) |\n", "| Logical | `gr.sum(DF.x == 0)` (Count cases where `x == 0`) `gr.pr(DF.g > 0)` (Probability helper) |\n" ] }, { "cell_type": "markdown", "id": "b996b9fa", "metadata": {}, "source": [ "### __q5__ Count the Ideal rows\n", "\n", "Use `gr.tf_summarize()` to count the number of rows where `cut == \"Ideal\"`. Provide this as the column `n_ideal`.\n", "\n", "*Hint*: The `Summary functions` table above provides a recipe for counting the number of cases where some condition is met.\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "84494779", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
n_ideal
021551
\n", "
" ], "text/plain": [ " n_ideal\n", "0 21551" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# TASK: Find the number of rows with cut == \"Ideal\"\n", "df_nideal = (\n", " df_diamonds\n", " >> gr.tf_summarize(\n", "\n", " n_ideal=gr.sum(DF.cut == \"Ideal\")\n", " )\n", ")\n", "\n", "## NOTE: Use this to check your work\n", "assert \\\n", " \"n_ideal\" in df_nideal.columns, \\\n", " \"df_nideal does not have a `n_ideal` column\"\n", "\n", "assert \\\n", " df_nideal[\"n_ideal\"][0]/23==937,\\\n", " \"The count is incorrect\"\n", "\n", "df_nideal" ] }, { "cell_type": "markdown", "id": "c012012e", "metadata": {}, "source": [ "## Grouping by variables\n", "\n", "The `gr.tf_summarize()` verb is helpful for getting \"basic facts\" about a dataset, but it is even more powerful when combined with `gr.tf_group_by()`.\n", "\n", "The `gr.tf_group_by()` verb takes a DataFrame and \"groups\" it by one (or more) columns. This causes other verbs (such as `gr.tf_summarize()`) to treat the data as though it were *multiple* DataFrames; these sub-DataFrames exactly correspond to the groups determined by `gr.tf_group_by()`." ] }, { "cell_type": "markdown", "id": "30dac286", "metadata": {}, "source": [ "![The `gr.tf_group_by()` verb modifies the behavior of a DataFrame to treat it as a collection of multiple datasets, defined by the grouping variable(s).](images/group-by.png)\n", "\n", "Image: Zimmerman et al." ] }, { "cell_type": "markdown", "id": "d3c35d31", "metadata": {}, "source": [ "When the input DataFrame is grouped, the `gr.tf_summarize()` verb will calculate the summary functions in a group-wise manner." ] }, { "cell_type": "markdown", "id": "2fb9887f", "metadata": {}, "source": [ "![Grouped summarize](images/grouped-summarize.png)\n", "\n", "Image: Zimmerman et al." ] }, { "cell_type": "markdown", "id": "8b7d83fb", "metadata": {}, "source": [ "Returning to the diamonds example: Let's see what adding a `gr.tf_group_by(DF.cut)` before the price and carat mean calculation does to our results." ] }, { "cell_type": "code", "execution_count": 11, "id": "9b4083ce", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cutcarat_meanprice_mean
0Fair1.0461374358.757764
1Good0.8491853928.864452
2Ideal0.7028373457.541970
3Premium0.8919554584.257704
4Very Good0.8063813981.759891
\n", "
" ], "text/plain": [ " cut carat_mean price_mean\n", "0 Fair 1.046137 4358.757764\n", "1 Good 0.849185 3928.864452\n", "2 Ideal 0.702837 3457.541970\n", "3 Premium 0.891955 4584.257704\n", "4 Very Good 0.806381 3981.759891" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## NOTE: No need to edit\n", "(\n", " df_diamonds\n", " >> gr.tf_group_by(DF.cut)\n", " >> gr.tf_summarize(\n", " carat_mean=gr.mean(DF.carat),\n", " price_mean=gr.mean(DF.price),\n", " )\n", ")" ] }, { "cell_type": "markdown", "id": "da8ea0ce", "metadata": {}, "source": [ "Now we can compare the mean `carat` and mean `price` across different `cut` levels. Personally, I was surprised to see that the `Fair` cut diamonds have a higher mean `price` than the `Ideal` diamonds!" ] }, { "cell_type": "markdown", "id": "f0d79338", "metadata": {}, "source": [ "### __q6__ Try out a two-variable grouping\n", "\n", "The code below groups by two variables. Uncomment the `tf_group_by()` line below, and describe how the result changes.\n" ] }, { "cell_type": "code", "execution_count": 12, "id": "42853239", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
claritycolordiamonds_mean
0I1D3863.023810
1IFD8307.369863
2SI1D2976.146423
3SI2D3931.101460
4VS1D3030.158865
5VS2D2587.225692
6VVS1D2947.912698
7VVS2D3351.128391
8I1E3488.421569
9IFE3668.506329
10SI1E3161.838005
11SI2E4173.826036
12VS1E2856.294301
13VS2E2750.941700
14VVS1E2219.820122
15VVS2E2499.674067
16I1F3342.181818
17IFF2750.836364
18SI1F3714.225716
19SI2F4472.625233
20VS1F3796.717742
21VS2F3756.795093
22VVS1F2804.276567
23VVS2F3475.512821
24I1G3545.693333
25IFG2558.033774
26SI1G3774.787449
27SI2G5021.684109
28VS1G4131.362197
29VS2G4416.256498
30VVS1G2866.820821
31VVS2G3845.283437
32I1H4453.413580
33IFH2287.869565
34SI1H5032.414945
35SI2H6099.895074
36VS1H3780.688623
37VS2H4722.414486
38VVS1H1845.658120
39VVS2H2649.067434
40I1I4302.184783
41IFI1994.937063
42SI1I5355.019663
43SI2I7002.649123
44VS1I4633.183992
45VS2I5690.505560
46VVS1I2034.861972
47VVS2I2968.232877
48I1J5254.060000
49IFJ3363.882353
50SI1J5186.048000
51SI2J6520.958246
52VS1J4884.461255
53VS2J5311.058824
54VVS1J4034.175676
55VVS2J5142.396947
\n", "
" ], "text/plain": [ " clarity color diamonds_mean\n", "0 I1 D 3863.023810\n", "1 IF D 8307.369863\n", "2 SI1 D 2976.146423\n", "3 SI2 D 3931.101460\n", "4 VS1 D 3030.158865\n", "5 VS2 D 2587.225692\n", "6 VVS1 D 2947.912698\n", "7 VVS2 D 3351.128391\n", "8 I1 E 3488.421569\n", "9 IF E 3668.506329\n", "10 SI1 E 3161.838005\n", "11 SI2 E 4173.826036\n", "12 VS1 E 2856.294301\n", "13 VS2 E 2750.941700\n", "14 VVS1 E 2219.820122\n", "15 VVS2 E 2499.674067\n", "16 I1 F 3342.181818\n", "17 IF F 2750.836364\n", "18 SI1 F 3714.225716\n", "19 SI2 F 4472.625233\n", "20 VS1 F 3796.717742\n", "21 VS2 F 3756.795093\n", "22 VVS1 F 2804.276567\n", "23 VVS2 F 3475.512821\n", "24 I1 G 3545.693333\n", "25 IF G 2558.033774\n", "26 SI1 G 3774.787449\n", "27 SI2 G 5021.684109\n", "28 VS1 G 4131.362197\n", "29 VS2 G 4416.256498\n", "30 VVS1 G 2866.820821\n", "31 VVS2 G 3845.283437\n", "32 I1 H 4453.413580\n", "33 IF H 2287.869565\n", "34 SI1 H 5032.414945\n", "35 SI2 H 6099.895074\n", "36 VS1 H 3780.688623\n", "37 VS2 H 4722.414486\n", "38 VVS1 H 1845.658120\n", "39 VVS2 H 2649.067434\n", "40 I1 I 4302.184783\n", "41 IF I 1994.937063\n", "42 SI1 I 5355.019663\n", "43 SI2 I 7002.649123\n", "44 VS1 I 4633.183992\n", "45 VS2 I 5690.505560\n", "46 VVS1 I 2034.861972\n", "47 VVS2 I 2968.232877\n", "48 I1 J 5254.060000\n", "49 IF J 3363.882353\n", "50 SI1 J 5186.048000\n", "51 SI2 J 6520.958246\n", "52 VS1 J 4884.461255\n", "53 VS2 J 5311.058824\n", "54 VVS1 J 4034.175676\n", "55 VVS2 J 5142.396947" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#TASK: Uncomment the tf_group_by() below, and describe how the result changes\n", "df_q2 = (\n", " df_diamonds\n", "\n", " >> gr.tf_group_by(DF[\"color\"], DF[\"clarity\"])\n", " >> gr.tf_summarize(diamonds_mean=gr.mean(DF[\"price\"]))\n", ")\n", "\n", "df_q2" ] }, { "cell_type": "markdown", "id": "4c9f6b3a", "metadata": {}, "source": [ "*Observations*\n", "- How many rows does the commented version (no grouping) return?\n", " - Just one row\n", "- How many rows (roughly) does the uncommented version (with grouping) return?\n", " - Many rows! One row for every unique combination of `clarity` and `color`.\n", "" ] }, { "cell_type": "markdown", "id": "06ea856f", "metadata": {}, "source": [ "### __q7__ Study the color\n", "\n", "Compute the mean `price` grouped by `color`. Answer the questions under *observations* below.\n" ] }, { "cell_type": "code", "execution_count": 13, "id": "c2530c62", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colorprice_mean
0E3076.752475
1D3169.954096
2F3724.886397
3G3999.135671
4H4486.669196
5I5091.874954
6J5323.818020
\n", "
" ], "text/plain": [ " color price_mean\n", "0 E 3076.752475\n", "1 D 3169.954096\n", "2 F 3724.886397\n", "3 G 3999.135671\n", "4 H 4486.669196\n", "5 I 5091.874954\n", "6 J 5323.818020" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# TASK: Compute the mean price, grouped by color\n", "(\n", " df_diamonds\n", "\n", "# solution-begin \n", " >> gr.tf_group_by(DF[\"color\"])\n", " >> gr.tf_summarize(\n", " price_mean=gr.mean(DF[\"price\"]),\n", " )\n", " \n", " ## NOTE: We need to \"ungroup\" in order to re-arrange the data\n", " >> gr.tf_ungroup()\n", " >> gr.tf_arrange(\"price_mean\")\n", ")" ] }, { "cell_type": "markdown", "id": "4099ff6e", "metadata": {}, "source": [ "*Observations*\n", "- Which tends to be the most valuable color? Which is the least valuable?\n", " - `color == \"J\"` tends to be the most valuable, while `color == \"E\"` tends to be least valuable.\n", "- The code above orders the colors according to their average price; are they in alphabetical order?\n", " - Surprisingly, no! `\"D\"` would be the first letter, but `color == \"E\"` has the lowest mean price.\n", "" ] }, { "cell_type": "markdown", "id": "3e071016", "metadata": {}, "source": [ "# References\n", "\n", "- Naupaka Zimmerman, Greg Wilson, Raniere Silva, Scott Ritchie, François Michonneau, Jeffrey Oliver, … Yuka Takemon. (2019, July). swcarpentry/r-novice-gapminder: Software Carpentry: R for Reproducible Scientific Analysis, June 2019 (Version v2019.06.1). Zenodo. http://doi.org/10.5281/zenodo.3265164" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }