{
"cells": [
{
"cell_type": "markdown",
"id": "e56ce1d1",
"metadata": {},
"source": [
"# Data: Deriving Quantities\n",
"\n",
"*Purpose*: Often our data will not tell us *directly* what we want to know; in\n",
"these cases we need to *derive* new quantities from our data. In this exercise,\n",
"we'll work with `tf_mutate()` to create new columns by operating on existing\n",
"variables, and use `tf_group_by()` with `tf_summarize()` to compute aggregate\n",
"statistics (summaries!) of our data.\n",
"\n",
"Aside: The data-summary verbs in grama are heavily inspired by the [dplyr](https://dplyr.tidyverse.org/) package in the R programming langauge. "
]
},
{
"cell_type": "markdown",
"id": "a5b9d5e1",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "34b54e42",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import grama as gr\n",
"DF = gr.Intention()\n",
"%matplotlib inline\n",
"\n",
"# For assertion\n",
"from pandas.api.types import is_numeric_dtype"
]
},
{
"cell_type": "markdown",
"id": "dbd356f3",
"metadata": {},
"source": [
"We'll be using the `diamonds` as seen in `e-vis00-basics` earlier. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "800429ed",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from grama.data import df_diamonds"
]
},
{
"cell_type": "markdown",
"id": "134f3740",
"metadata": {},
"source": [
"# Mutating DataFrames\n",
"\n",
"A *mutation* transformation allows us to make new columns (or edit old ones) by *combining* column values. \n",
"\n",
"For example, let's remember the diamonds dataset:\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "0032eee7",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" carat | \n",
" cut | \n",
" color | \n",
" clarity | \n",
" depth | \n",
" table | \n",
" price | \n",
" x | \n",
" y | \n",
" z | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.23 | \n",
" Ideal | \n",
" E | \n",
" SI2 | \n",
" 61.5 | \n",
" 55.0 | \n",
" 326 | \n",
" 3.95 | \n",
" 3.98 | \n",
" 2.43 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.21 | \n",
" Premium | \n",
" E | \n",
" SI1 | \n",
" 59.8 | \n",
" 61.0 | \n",
" 326 | \n",
" 3.89 | \n",
" 3.84 | \n",
" 2.31 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.23 | \n",
" Good | \n",
" E | \n",
" VS1 | \n",
" 56.9 | \n",
" 65.0 | \n",
" 327 | \n",
" 4.05 | \n",
" 4.07 | \n",
" 2.31 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.29 | \n",
" Premium | \n",
" I | \n",
" VS2 | \n",
" 62.4 | \n",
" 58.0 | \n",
" 334 | \n",
" 4.20 | \n",
" 4.23 | \n",
" 2.63 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.31 | \n",
" Good | \n",
" J | \n",
" SI2 | \n",
" 63.3 | \n",
" 58.0 | \n",
" 335 | \n",
" 4.34 | \n",
" 4.35 | \n",
" 2.75 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" carat cut color clarity depth table price x y z\n",
"0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43\n",
"1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31\n",
"2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31\n",
"3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63\n",
"4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## NOTE: No need to edit\n",
"(\n",
" df_diamonds\n",
" >> gr.tf_head(5)\n",
")"
]
},
{
"cell_type": "markdown",
"id": "1cff6ce5",
"metadata": {},
"source": [
"If we wanted to approximate the volume of every diamond, we could do this by multiplying their `x`, `y`, and `z` dimensions. We could create a new column with this value via `gr.tf_mutate()`:\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "9ebd167f",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" carat | \n",
" cut | \n",
" color | \n",
" clarity | \n",
" depth | \n",
" table | \n",
" price | \n",
" x | \n",
" y | \n",
" z | \n",
" volume | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.23 | \n",
" Ideal | \n",
" E | \n",
" SI2 | \n",
" 61.5 | \n",
" 55.0 | \n",
" 326 | \n",
" 3.95 | \n",
" 3.98 | \n",
" 2.43 | \n",
" 38.202030 | \n",
"
\n",
" \n",
" 1 | \n",
" 0.21 | \n",
" Premium | \n",
" E | \n",
" SI1 | \n",
" 59.8 | \n",
" 61.0 | \n",
" 326 | \n",
" 3.89 | \n",
" 3.84 | \n",
" 2.31 | \n",
" 34.505856 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.23 | \n",
" Good | \n",
" E | \n",
" VS1 | \n",
" 56.9 | \n",
" 65.0 | \n",
" 327 | \n",
" 4.05 | \n",
" 4.07 | \n",
" 2.31 | \n",
" 38.076885 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.29 | \n",
" Premium | \n",
" I | \n",
" VS2 | \n",
" 62.4 | \n",
" 58.0 | \n",
" 334 | \n",
" 4.20 | \n",
" 4.23 | \n",
" 2.63 | \n",
" 46.724580 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.31 | \n",
" Good | \n",
" J | \n",
" SI2 | \n",
" 63.3 | \n",
" 58.0 | \n",
" 335 | \n",
" 4.34 | \n",
" 4.35 | \n",
" 2.75 | \n",
" 51.917250 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" carat cut color clarity depth table price x y z \\\n",
"0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 \n",
"1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 \n",
"2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 \n",
"3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 \n",
"4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 \n",
"\n",
" volume \n",
"0 38.202030 \n",
"1 34.505856 \n",
"2 38.076885 \n",
"3 46.724580 \n",
"4 51.917250 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## NOTE: No need to edit\n",
"(\n",
" df_diamonds\n",
" >> gr.tf_mutate(volume=DF.x * DF.y * DF.z)\n",
" >> gr.tf_head(5)\n",
")"
]
},
{
"cell_type": "markdown",
"id": "825af8ad",
"metadata": {},
"source": [
"*Aside*: Note that we can use dot `.` notation to access columns, rather than bracket `[]` notation. This works only when the column names are valid variable names (i.e. do not contain spaces, do not contain special characters, etc.).\n"
]
},
{
"cell_type": "markdown",
"id": "1a815918",
"metadata": {},
"source": [
"### __q1__ Compute a mass density\n",
"\n",
"Approximate the density of the diamonds using the expression\n",
"\n",
"$$\\rho = \\frac{\\text{carat}}{xyz}.$$\n",
"\n",
"Name the new column `rho`. Answer the questions under *observations* below.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "5206567d",
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" rho\n",
"count 53920.000000\n",
"mean 0.006127\n",
"std 0.000178\n",
"min 0.000521\n",
"25% 0.006048\n",
"50% 0.006117\n",
"75% 0.006190\n",
"max 0.022647\n"
]
}
],
"source": [
"## TASK: Provide the approximate density as the column `rho`\n",
"df_rho = (\n",
" df_diamonds\n",
" >> gr.tf_mutate(rho=DF[\"carat\"] / DF[\"x\"] / DF[\"y\"] / DF[\"z\"])\n",
")\n",
"\n",
"## NOTE: Use this to check your work\n",
"assert \\\n",
" \"rho\" in df_rho.columns, \\\n",
" \"df_rho does not have a `rho` column\"\n",
"\n",
"assert \\\n",
" abs(df_rho.rho.median() - 6.117055e-03) < 1e-6, \\\n",
" \"The values of df_rho.rho are incorrect; check your calculation\"\n",
"\n",
"print(\n",
" df_rho\n",
" >> gr.tf_filter(DF.rho < 1e6)\n",
" >> gr.tf_select(\"rho\")\n",
" >> gr.tf_describe()\n",
")\n"
]
},
{
"cell_type": "markdown",
"id": "57fc5465",
"metadata": {},
"source": [
"*Observations*\n",
"\n",
"- What is the mean density `rho`?\n",
" - `0.006127`\n",
"- What is the lower-quarter density `rho` (25% value)?\n",
" - `0.006048`\n",
"- What is the upper-quarter density `rho` (75% value)?\n",
" - `0.006190`\n",
""
]
},
{
"cell_type": "markdown",
"id": "e321d628",
"metadata": {},
"source": [
"## Vectorized Functions\n",
"\n",
"The `gr.tf_mutate()` verb carries out *vector* operations; these operations take in one or more columns, and return an entire column. For operations such as addition, this results in *elementwise* addition, as illustrated below:\n",
"\n",
"| `x` | `y` | `x + y` |\n",
"|-----|-----|---------|\n",
"| 1 | 0 | 1 |\n",
"| 1 | 1 | 2 |\n",
"| 1 | -1 | 0 |\n"
]
},
{
"cell_type": "markdown",
"id": "5f98f3e3",
"metadata": {},
"source": [
"The `gr.tf_mutate()` verb accepts a wide variety of *vectorized* functions; the following table organizes these into various types of operations.\n",
"\n",
"| Type | Functions |\n",
"| ---- | --------- |\n",
"| Arithmetic ops. | `DF.x + 1, DF.y - DF.x, DF.x * DF.y, DF.x / DF.y, DF.x^DF.a` |\n",
"| Modular arith. | `DF.x // DF.y` (Floor division) `DF.x % DF.y` (Remainder) |\n",
"| Logical comp. | `<, <=, >, >=, !=, ==` |\n",
"| Logarithms | `gr.log(DF.x)` |\n",
"| Offsets | `gr.lead(DF.x), gr.lag(DF.x)` |\n",
"| Cumulants | `gr.cumsum(DF.x), gr.cumprod(DF.x), gr.cummin(DF.x), gr.cummaDF.x(DF.x), gr.cummean(DF.x)` |\n",
"| Ranking | `gr.min_rank(DF.x), gr.row_number(DF.x), gr.dense_rank(DF.x), gr.percent_rank(DF.x)` |\n",
"| Data conversion | `gr.as_numeric(DF.x)`, `gr.as_str(DF.x)`, `gr.as_factor(DF.x)` |\n",
"| Control | `gr.if_else()`, `gr.case_when()` |"
]
},
{
"cell_type": "markdown",
"id": "2ddd4fdd",
"metadata": {},
"source": [
"### __q2__ Do a type conversion\n",
"\n",
"Convert the column `y` below into a numeric type.\n",
"\n",
"*Hint*: Use the table above to find an appropriate data conversion function.\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "6f0b8926",
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Success!\n"
]
}
],
"source": [
"## TASK: Convert the `y` column to a numeric type\n",
"df_converted = (\n",
" gr.df_make(\n",
" x=[ 1, 2, 3],\n",
" y=[\"4\", \"5\", \"6\"],\n",
" )\n",
"\n",
" >> gr.tf_mutate(y=gr.as_numeric(DF.y))\n",
")\n",
"\n",
"## NOTE: Use this to check your work\n",
"assert \\\n",
" is_numeric_dtype(df_converted.y), \\\n",
" \"df_converted.y is not numeric; make sure to do a conversion\"\n",
"\n",
"print(\"Success!\")"
]
},
{
"cell_type": "markdown",
"id": "20bb4124",
"metadata": {},
"source": [
"### __q3__ Sanity-check the data\n",
"\n",
"The `depth` variable in `df_diamonds` is *supposedly* computed via `depth_computed = 100 * 2 * DF[\"z\"] / (DF[\"x\"] + DF[\"y\"])`. Compute `diff = DF[\"depth\"] - DF[\"depth_computed\"]`: This is a measure of discrepancy between the given and computed depth. Answer the questions under *observations* below.\n",
"\n",
"*Hint*: If you want to compute `depth_computed` and use it to compute `diff`, you'll need to use two calls of `gr.tf_mutate()`.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "9e7bf919",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" diff | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 5.393300e+04 | \n",
"
\n",
" \n",
" mean | \n",
" 5.284249e-03 | \n",
"
\n",
" \n",
" std | \n",
" 2.629223e+00 | \n",
"
\n",
" \n",
" min | \n",
" -5.574795e+02 | \n",
"
\n",
" \n",
" 25% | \n",
" -2.660550e-02 | \n",
"
\n",
" \n",
" 50% | \n",
" -7.105427e-15 | \n",
"
\n",
" \n",
" 75% | \n",
" 2.665245e-02 | \n",
"
\n",
" \n",
" max | \n",
" 6.400000e+01 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" diff\n",
"count 5.393300e+04\n",
"mean 5.284249e-03\n",
"std 2.629223e+00\n",
"min -5.574795e+02\n",
"25% -2.660550e-02\n",
"50% -7.105427e-15\n",
"75% 2.665245e-02\n",
"max 6.400000e+01"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## TASK: Compute `depth_computed` and `diff`, as described above\n",
"df_depth = (\n",
" df_diamonds \n",
"\n",
"# solution-begin \n",
" >> gr.tf_mutate(depth_computed=100 * 2 * DF[\"z\"] / (DF[\"x\"] + DF[\"y\"]))\n",
" >> gr.tf_mutate(diff = DF[\"depth\"]- DF[\"depth_computed\"])\n",
")\n",
"\n",
"## NOTE: Use this to check your work\n",
"assert \\\n",
" abs(df_depth[\"diff\"].median()) < 1e-14, \\\n",
" \"df_depth.diff has the wrong values; check your calculation\"\n",
"assert \\\n",
" abs(df_depth[\"diff\"].mean() - 5.284249e-03) < 1e-6, \\\n",
" \"df_depth.diff has the wrong values; check your calculation\"\n",
"\n",
"# Show the data\n",
"(\n",
" df_depth\n",
" >> gr.tf_select(\"diff\")\n",
" >> gr.tf_describe()\n",
")\n"
]
},
{
"cell_type": "markdown",
"id": "1e67cf7f",
"metadata": {},
"source": [
"**Observation**\n",
"\n",
"- What is the mean of `diff`? What is the median of `diff`?\n",
" - `mean(diff) == 5.284249e-03`; `median(diff) ~= 0.0`\n",
"- What are the min and max of `diff`?\n",
" - `min(diff) == -5.574795e+02`, `max(diff) == 64`\n",
"- Remember that `diff` measures the agreement between `depth` and `depth_computed`; how well do these two quantities match?\n",
" - Generally, the agreement between `depth` and `depth_computed` is very good (the mean is close to zero, the 25% and 75% quantiles are quite small). However, there are some extreme cases of disagreement; perhaps there are data errors in these particular values?\n"
]
},
{
"cell_type": "markdown",
"id": "b4fd84ff",
"metadata": {},
"source": [
"## Control Functions\n",
"\n",
"The `gr.if_else()` and `gr.case_when()` functions are *special* vectorized functions---they allow us to use more programming-like constructs to control the output. The `gr.if_else()` helper is a simple \"yes-no\" function, while `gr.case_when()` is a vectorized [switch statement](https://en.wikipedia.org/wiki/Switch_statement). If you are trying to do more complicated data operations, you might want to give these functions a look.\n"
]
},
{
"cell_type": "markdown",
"id": "e5cf4ba1",
"metadata": {},
"source": [
"### __q4__ Use an if/else statement\n",
"\n",
"Use `gr.if_else()` to flag diamonds as `\"Big\"` if they have `carat >= 1` and `\"Small\"` otherwise.\n",
"\n",
"*Hint*: Remember to read the documentation for an unfamiliar function to learn how to call it; in particular, try looking at the *Examples* section of the documentation.\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "eeff6986",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" carat | \n",
" cut | \n",
" color | \n",
" clarity | \n",
" depth | \n",
" table | \n",
" price | \n",
" x | \n",
" y | \n",
" z | \n",
" size | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.23 | \n",
" Ideal | \n",
" E | \n",
" SI2 | \n",
" 61.5 | \n",
" 55.0 | \n",
" 326 | \n",
" 3.95 | \n",
" 3.98 | \n",
" 2.43 | \n",
" Small | \n",
"
\n",
" \n",
" 1 | \n",
" 0.21 | \n",
" Premium | \n",
" E | \n",
" SI1 | \n",
" 59.8 | \n",
" 61.0 | \n",
" 326 | \n",
" 3.89 | \n",
" 3.84 | \n",
" 2.31 | \n",
" Small | \n",
"
\n",
" \n",
" 2 | \n",
" 0.23 | \n",
" Good | \n",
" E | \n",
" VS1 | \n",
" 56.9 | \n",
" 65.0 | \n",
" 327 | \n",
" 4.05 | \n",
" 4.07 | \n",
" 2.31 | \n",
" Small | \n",
"
\n",
" \n",
" 3 | \n",
" 0.29 | \n",
" Premium | \n",
" I | \n",
" VS2 | \n",
" 62.4 | \n",
" 58.0 | \n",
" 334 | \n",
" 4.20 | \n",
" 4.23 | \n",
" 2.63 | \n",
" Small | \n",
"
\n",
" \n",
" 4 | \n",
" 0.31 | \n",
" Good | \n",
" J | \n",
" SI2 | \n",
" 63.3 | \n",
" 58.0 | \n",
" 335 | \n",
" 4.34 | \n",
" 4.35 | \n",
" 2.75 | \n",
" Small | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 53935 | \n",
" 0.72 | \n",
" Ideal | \n",
" D | \n",
" SI1 | \n",
" 60.8 | \n",
" 57.0 | \n",
" 2757 | \n",
" 5.75 | \n",
" 5.76 | \n",
" 3.50 | \n",
" Small | \n",
"
\n",
" \n",
" 53936 | \n",
" 0.72 | \n",
" Good | \n",
" D | \n",
" SI1 | \n",
" 63.1 | \n",
" 55.0 | \n",
" 2757 | \n",
" 5.69 | \n",
" 5.75 | \n",
" 3.61 | \n",
" Small | \n",
"
\n",
" \n",
" 53937 | \n",
" 0.70 | \n",
" Very Good | \n",
" D | \n",
" SI1 | \n",
" 62.8 | \n",
" 60.0 | \n",
" 2757 | \n",
" 5.66 | \n",
" 5.68 | \n",
" 3.56 | \n",
" Small | \n",
"
\n",
" \n",
" 53938 | \n",
" 0.86 | \n",
" Premium | \n",
" H | \n",
" SI2 | \n",
" 61.0 | \n",
" 58.0 | \n",
" 2757 | \n",
" 6.15 | \n",
" 6.12 | \n",
" 3.74 | \n",
" Small | \n",
"
\n",
" \n",
" 53939 | \n",
" 0.75 | \n",
" Ideal | \n",
" D | \n",
" SI2 | \n",
" 62.2 | \n",
" 55.0 | \n",
" 2757 | \n",
" 5.83 | \n",
" 5.87 | \n",
" 3.64 | \n",
" Small | \n",
"
\n",
" \n",
"
\n",
"
53940 rows × 11 columns
\n",
"
"
],
"text/plain": [
" carat cut color clarity depth table price x y z \\\n",
"0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 \n",
"1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 \n",
"2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 \n",
"3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 \n",
"4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 \n",
"... ... ... ... ... ... ... ... ... ... ... \n",
"53935 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50 \n",
"53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61 \n",
"53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56 \n",
"53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74 \n",
"53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64 \n",
"\n",
" size \n",
"0 Small \n",
"1 Small \n",
"2 Small \n",
"3 Small \n",
"4 Small \n",
"... ... \n",
"53935 Small \n",
"53936 Small \n",
"53937 Small \n",
"53938 Small \n",
"53939 Small \n",
"\n",
"[53940 rows x 11 columns]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## TODO: Compute the `size` using `gr.if_else()`\n",
"df_size = (\n",
" df_diamonds\n",
" >> gr.tf_mutate(\n",
" \n",
" size=gr.if_else(DF.carat >= 1, \"Big\", \"Small\")\n",
" )\n",
")\n",
"\n",
"## NOTE: Use this to check you work\n",
"assert \\\n",
" \"size\" in df_size.columns, \\\n",
" \"df_size does not have a `size` column\"\n",
"\n",
"assert \\\n",
" all(df_size[df_size[\"size\"] == \"Big\"].carat >= 1), \\\n",
" \"Size calculation is incorrect\"\n",
"\n",
"df_size"
]
},
{
"cell_type": "markdown",
"id": "8283a434",
"metadata": {},
"source": [
"# Summarizing Data Frames\n",
"\n",
"The `gr.tf_summarize()` verb is used to *reduce* a DataFrame down to fewer rows using summary functions. For instance, we can compute the mean `carat` and `price` across the whole dataset of diamonds:\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "b2264c8a",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" carat_mean | \n",
" price_mean | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0.79794 | \n",
" 3932.799722 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" carat_mean price_mean\n",
"0 0.79794 3932.799722"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## NOTE: No need to edit\n",
"(\n",
" df_diamonds\n",
" >> gr.tf_summarize(\n",
" carat_mean=gr.mean(DF.carat),\n",
" price_mean=gr.mean(DF.price),\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"id": "807a445a",
"metadata": {},
"source": [
"Notice that we went from having around `54,000` rows down to just one row; in this sense we have *summarized* the dataset.\n"
]
},
{
"cell_type": "markdown",
"id": "896f5bc3",
"metadata": {},
"source": [
"## Summary functions\n",
"\n",
"Summarize accepts a large number of *summary functions*; the table below lists and categorizes some of the most common summary functions.\n",
"\n",
"| Type | Functions |\n",
"| ---- | --------- |\n",
"| Location | `gr.mean(DF.x), gr.median(DF.x), gr.min(DF.x), gr.max(DF.x)` |\n",
"| Spread | `gr.sd(DF.x), gr.var(DF.x), gr.IQR(DF.x)` |\n",
"| Order | `gr.first(DF.x), gr.nth(DF.x, n), gr.last(DF.x)` |\n",
"| Counts | `gr.n_distinct(DF.x)`, `gr.n()` (Total count) |\n",
"| Logical | `gr.sum(DF.x == 0)` (Count cases where `x == 0`) `gr.pr(DF.g > 0)` (Probability helper) |\n"
]
},
{
"cell_type": "markdown",
"id": "b996b9fa",
"metadata": {},
"source": [
"### __q5__ Count the Ideal rows\n",
"\n",
"Use `gr.tf_summarize()` to count the number of rows where `cut == \"Ideal\"`. Provide this as the column `n_ideal`.\n",
"\n",
"*Hint*: The `Summary functions` table above provides a recipe for counting the number of cases where some condition is met.\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "84494779",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" n_ideal | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 21551 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" n_ideal\n",
"0 21551"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# TASK: Find the number of rows with cut == \"Ideal\"\n",
"df_nideal = (\n",
" df_diamonds\n",
" >> gr.tf_summarize(\n",
"\n",
" n_ideal=gr.sum(DF.cut == \"Ideal\")\n",
" )\n",
")\n",
"\n",
"## NOTE: Use this to check your work\n",
"assert \\\n",
" \"n_ideal\" in df_nideal.columns, \\\n",
" \"df_nideal does not have a `n_ideal` column\"\n",
"\n",
"assert \\\n",
" df_nideal[\"n_ideal\"][0]/23==937,\\\n",
" \"The count is incorrect\"\n",
"\n",
"df_nideal"
]
},
{
"cell_type": "markdown",
"id": "c012012e",
"metadata": {},
"source": [
"## Grouping by variables\n",
"\n",
"The `gr.tf_summarize()` verb is helpful for getting \"basic facts\" about a dataset, but it is even more powerful when combined with `gr.tf_group_by()`.\n",
"\n",
"The `gr.tf_group_by()` verb takes a DataFrame and \"groups\" it by one (or more) columns. This causes other verbs (such as `gr.tf_summarize()`) to treat the data as though it were *multiple* DataFrames; these sub-DataFrames exactly correspond to the groups determined by `gr.tf_group_by()`."
]
},
{
"cell_type": "markdown",
"id": "30dac286",
"metadata": {},
"source": [
"\n",
"\n",
"Image: Zimmerman et al."
]
},
{
"cell_type": "markdown",
"id": "d3c35d31",
"metadata": {},
"source": [
"When the input DataFrame is grouped, the `gr.tf_summarize()` verb will calculate the summary functions in a group-wise manner."
]
},
{
"cell_type": "markdown",
"id": "2fb9887f",
"metadata": {},
"source": [
"\n",
"\n",
"Image: Zimmerman et al."
]
},
{
"cell_type": "markdown",
"id": "8b7d83fb",
"metadata": {},
"source": [
"Returning to the diamonds example: Let's see what adding a `gr.tf_group_by(DF.cut)` before the price and carat mean calculation does to our results."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "9b4083ce",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" cut | \n",
" carat_mean | \n",
" price_mean | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Fair | \n",
" 1.046137 | \n",
" 4358.757764 | \n",
"
\n",
" \n",
" 1 | \n",
" Good | \n",
" 0.849185 | \n",
" 3928.864452 | \n",
"
\n",
" \n",
" 2 | \n",
" Ideal | \n",
" 0.702837 | \n",
" 3457.541970 | \n",
"
\n",
" \n",
" 3 | \n",
" Premium | \n",
" 0.891955 | \n",
" 4584.257704 | \n",
"
\n",
" \n",
" 4 | \n",
" Very Good | \n",
" 0.806381 | \n",
" 3981.759891 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" cut carat_mean price_mean\n",
"0 Fair 1.046137 4358.757764\n",
"1 Good 0.849185 3928.864452\n",
"2 Ideal 0.702837 3457.541970\n",
"3 Premium 0.891955 4584.257704\n",
"4 Very Good 0.806381 3981.759891"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## NOTE: No need to edit\n",
"(\n",
" df_diamonds\n",
" >> gr.tf_group_by(DF.cut)\n",
" >> gr.tf_summarize(\n",
" carat_mean=gr.mean(DF.carat),\n",
" price_mean=gr.mean(DF.price),\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"id": "da8ea0ce",
"metadata": {},
"source": [
"Now we can compare the mean `carat` and mean `price` across different `cut` levels. Personally, I was surprised to see that the `Fair` cut diamonds have a higher mean `price` than the `Ideal` diamonds!"
]
},
{
"cell_type": "markdown",
"id": "f0d79338",
"metadata": {},
"source": [
"### __q6__ Try out a two-variable grouping\n",
"\n",
"The code below groups by two variables. Uncomment the `tf_group_by()` line below, and describe how the result changes.\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "42853239",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" clarity | \n",
" color | \n",
" diamonds_mean | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" I1 | \n",
" D | \n",
" 3863.023810 | \n",
"
\n",
" \n",
" 1 | \n",
" IF | \n",
" D | \n",
" 8307.369863 | \n",
"
\n",
" \n",
" 2 | \n",
" SI1 | \n",
" D | \n",
" 2976.146423 | \n",
"
\n",
" \n",
" 3 | \n",
" SI2 | \n",
" D | \n",
" 3931.101460 | \n",
"
\n",
" \n",
" 4 | \n",
" VS1 | \n",
" D | \n",
" 3030.158865 | \n",
"
\n",
" \n",
" 5 | \n",
" VS2 | \n",
" D | \n",
" 2587.225692 | \n",
"
\n",
" \n",
" 6 | \n",
" VVS1 | \n",
" D | \n",
" 2947.912698 | \n",
"
\n",
" \n",
" 7 | \n",
" VVS2 | \n",
" D | \n",
" 3351.128391 | \n",
"
\n",
" \n",
" 8 | \n",
" I1 | \n",
" E | \n",
" 3488.421569 | \n",
"
\n",
" \n",
" 9 | \n",
" IF | \n",
" E | \n",
" 3668.506329 | \n",
"
\n",
" \n",
" 10 | \n",
" SI1 | \n",
" E | \n",
" 3161.838005 | \n",
"
\n",
" \n",
" 11 | \n",
" SI2 | \n",
" E | \n",
" 4173.826036 | \n",
"
\n",
" \n",
" 12 | \n",
" VS1 | \n",
" E | \n",
" 2856.294301 | \n",
"
\n",
" \n",
" 13 | \n",
" VS2 | \n",
" E | \n",
" 2750.941700 | \n",
"
\n",
" \n",
" 14 | \n",
" VVS1 | \n",
" E | \n",
" 2219.820122 | \n",
"
\n",
" \n",
" 15 | \n",
" VVS2 | \n",
" E | \n",
" 2499.674067 | \n",
"
\n",
" \n",
" 16 | \n",
" I1 | \n",
" F | \n",
" 3342.181818 | \n",
"
\n",
" \n",
" 17 | \n",
" IF | \n",
" F | \n",
" 2750.836364 | \n",
"
\n",
" \n",
" 18 | \n",
" SI1 | \n",
" F | \n",
" 3714.225716 | \n",
"
\n",
" \n",
" 19 | \n",
" SI2 | \n",
" F | \n",
" 4472.625233 | \n",
"
\n",
" \n",
" 20 | \n",
" VS1 | \n",
" F | \n",
" 3796.717742 | \n",
"
\n",
" \n",
" 21 | \n",
" VS2 | \n",
" F | \n",
" 3756.795093 | \n",
"
\n",
" \n",
" 22 | \n",
" VVS1 | \n",
" F | \n",
" 2804.276567 | \n",
"
\n",
" \n",
" 23 | \n",
" VVS2 | \n",
" F | \n",
" 3475.512821 | \n",
"
\n",
" \n",
" 24 | \n",
" I1 | \n",
" G | \n",
" 3545.693333 | \n",
"
\n",
" \n",
" 25 | \n",
" IF | \n",
" G | \n",
" 2558.033774 | \n",
"
\n",
" \n",
" 26 | \n",
" SI1 | \n",
" G | \n",
" 3774.787449 | \n",
"
\n",
" \n",
" 27 | \n",
" SI2 | \n",
" G | \n",
" 5021.684109 | \n",
"
\n",
" \n",
" 28 | \n",
" VS1 | \n",
" G | \n",
" 4131.362197 | \n",
"
\n",
" \n",
" 29 | \n",
" VS2 | \n",
" G | \n",
" 4416.256498 | \n",
"
\n",
" \n",
" 30 | \n",
" VVS1 | \n",
" G | \n",
" 2866.820821 | \n",
"
\n",
" \n",
" 31 | \n",
" VVS2 | \n",
" G | \n",
" 3845.283437 | \n",
"
\n",
" \n",
" 32 | \n",
" I1 | \n",
" H | \n",
" 4453.413580 | \n",
"
\n",
" \n",
" 33 | \n",
" IF | \n",
" H | \n",
" 2287.869565 | \n",
"
\n",
" \n",
" 34 | \n",
" SI1 | \n",
" H | \n",
" 5032.414945 | \n",
"
\n",
" \n",
" 35 | \n",
" SI2 | \n",
" H | \n",
" 6099.895074 | \n",
"
\n",
" \n",
" 36 | \n",
" VS1 | \n",
" H | \n",
" 3780.688623 | \n",
"
\n",
" \n",
" 37 | \n",
" VS2 | \n",
" H | \n",
" 4722.414486 | \n",
"
\n",
" \n",
" 38 | \n",
" VVS1 | \n",
" H | \n",
" 1845.658120 | \n",
"
\n",
" \n",
" 39 | \n",
" VVS2 | \n",
" H | \n",
" 2649.067434 | \n",
"
\n",
" \n",
" 40 | \n",
" I1 | \n",
" I | \n",
" 4302.184783 | \n",
"
\n",
" \n",
" 41 | \n",
" IF | \n",
" I | \n",
" 1994.937063 | \n",
"
\n",
" \n",
" 42 | \n",
" SI1 | \n",
" I | \n",
" 5355.019663 | \n",
"
\n",
" \n",
" 43 | \n",
" SI2 | \n",
" I | \n",
" 7002.649123 | \n",
"
\n",
" \n",
" 44 | \n",
" VS1 | \n",
" I | \n",
" 4633.183992 | \n",
"
\n",
" \n",
" 45 | \n",
" VS2 | \n",
" I | \n",
" 5690.505560 | \n",
"
\n",
" \n",
" 46 | \n",
" VVS1 | \n",
" I | \n",
" 2034.861972 | \n",
"
\n",
" \n",
" 47 | \n",
" VVS2 | \n",
" I | \n",
" 2968.232877 | \n",
"
\n",
" \n",
" 48 | \n",
" I1 | \n",
" J | \n",
" 5254.060000 | \n",
"
\n",
" \n",
" 49 | \n",
" IF | \n",
" J | \n",
" 3363.882353 | \n",
"
\n",
" \n",
" 50 | \n",
" SI1 | \n",
" J | \n",
" 5186.048000 | \n",
"
\n",
" \n",
" 51 | \n",
" SI2 | \n",
" J | \n",
" 6520.958246 | \n",
"
\n",
" \n",
" 52 | \n",
" VS1 | \n",
" J | \n",
" 4884.461255 | \n",
"
\n",
" \n",
" 53 | \n",
" VS2 | \n",
" J | \n",
" 5311.058824 | \n",
"
\n",
" \n",
" 54 | \n",
" VVS1 | \n",
" J | \n",
" 4034.175676 | \n",
"
\n",
" \n",
" 55 | \n",
" VVS2 | \n",
" J | \n",
" 5142.396947 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" clarity color diamonds_mean\n",
"0 I1 D 3863.023810\n",
"1 IF D 8307.369863\n",
"2 SI1 D 2976.146423\n",
"3 SI2 D 3931.101460\n",
"4 VS1 D 3030.158865\n",
"5 VS2 D 2587.225692\n",
"6 VVS1 D 2947.912698\n",
"7 VVS2 D 3351.128391\n",
"8 I1 E 3488.421569\n",
"9 IF E 3668.506329\n",
"10 SI1 E 3161.838005\n",
"11 SI2 E 4173.826036\n",
"12 VS1 E 2856.294301\n",
"13 VS2 E 2750.941700\n",
"14 VVS1 E 2219.820122\n",
"15 VVS2 E 2499.674067\n",
"16 I1 F 3342.181818\n",
"17 IF F 2750.836364\n",
"18 SI1 F 3714.225716\n",
"19 SI2 F 4472.625233\n",
"20 VS1 F 3796.717742\n",
"21 VS2 F 3756.795093\n",
"22 VVS1 F 2804.276567\n",
"23 VVS2 F 3475.512821\n",
"24 I1 G 3545.693333\n",
"25 IF G 2558.033774\n",
"26 SI1 G 3774.787449\n",
"27 SI2 G 5021.684109\n",
"28 VS1 G 4131.362197\n",
"29 VS2 G 4416.256498\n",
"30 VVS1 G 2866.820821\n",
"31 VVS2 G 3845.283437\n",
"32 I1 H 4453.413580\n",
"33 IF H 2287.869565\n",
"34 SI1 H 5032.414945\n",
"35 SI2 H 6099.895074\n",
"36 VS1 H 3780.688623\n",
"37 VS2 H 4722.414486\n",
"38 VVS1 H 1845.658120\n",
"39 VVS2 H 2649.067434\n",
"40 I1 I 4302.184783\n",
"41 IF I 1994.937063\n",
"42 SI1 I 5355.019663\n",
"43 SI2 I 7002.649123\n",
"44 VS1 I 4633.183992\n",
"45 VS2 I 5690.505560\n",
"46 VVS1 I 2034.861972\n",
"47 VVS2 I 2968.232877\n",
"48 I1 J 5254.060000\n",
"49 IF J 3363.882353\n",
"50 SI1 J 5186.048000\n",
"51 SI2 J 6520.958246\n",
"52 VS1 J 4884.461255\n",
"53 VS2 J 5311.058824\n",
"54 VVS1 J 4034.175676\n",
"55 VVS2 J 5142.396947"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#TASK: Uncomment the tf_group_by() below, and describe how the result changes\n",
"df_q2 = (\n",
" df_diamonds\n",
"\n",
" >> gr.tf_group_by(DF[\"color\"], DF[\"clarity\"])\n",
" >> gr.tf_summarize(diamonds_mean=gr.mean(DF[\"price\"]))\n",
")\n",
"\n",
"df_q2"
]
},
{
"cell_type": "markdown",
"id": "4c9f6b3a",
"metadata": {},
"source": [
"*Observations*\n",
"- How many rows does the commented version (no grouping) return?\n",
" - Just one row\n",
"- How many rows (roughly) does the uncommented version (with grouping) return?\n",
" - Many rows! One row for every unique combination of `clarity` and `color`.\n",
""
]
},
{
"cell_type": "markdown",
"id": "06ea856f",
"metadata": {},
"source": [
"### __q7__ Study the color\n",
"\n",
"Compute the mean `price` grouped by `color`. Answer the questions under *observations* below.\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "c2530c62",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" color | \n",
" price_mean | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" E | \n",
" 3076.752475 | \n",
"
\n",
" \n",
" 1 | \n",
" D | \n",
" 3169.954096 | \n",
"
\n",
" \n",
" 2 | \n",
" F | \n",
" 3724.886397 | \n",
"
\n",
" \n",
" 3 | \n",
" G | \n",
" 3999.135671 | \n",
"
\n",
" \n",
" 4 | \n",
" H | \n",
" 4486.669196 | \n",
"
\n",
" \n",
" 5 | \n",
" I | \n",
" 5091.874954 | \n",
"
\n",
" \n",
" 6 | \n",
" J | \n",
" 5323.818020 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" color price_mean\n",
"0 E 3076.752475\n",
"1 D 3169.954096\n",
"2 F 3724.886397\n",
"3 G 3999.135671\n",
"4 H 4486.669196\n",
"5 I 5091.874954\n",
"6 J 5323.818020"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# TASK: Compute the mean price, grouped by color\n",
"(\n",
" df_diamonds\n",
"\n",
"# solution-begin \n",
" >> gr.tf_group_by(DF[\"color\"])\n",
" >> gr.tf_summarize(\n",
" price_mean=gr.mean(DF[\"price\"]),\n",
" )\n",
" \n",
" ## NOTE: We need to \"ungroup\" in order to re-arrange the data\n",
" >> gr.tf_ungroup()\n",
" >> gr.tf_arrange(\"price_mean\")\n",
")"
]
},
{
"cell_type": "markdown",
"id": "4099ff6e",
"metadata": {},
"source": [
"*Observations*\n",
"- Which tends to be the most valuable color? Which is the least valuable?\n",
" - `color == \"J\"` tends to be the most valuable, while `color == \"E\"` tends to be least valuable.\n",
"- The code above orders the colors according to their average price; are they in alphabetical order?\n",
" - Surprisingly, no! `\"D\"` would be the first letter, but `color == \"E\"` has the lowest mean price.\n",
""
]
},
{
"cell_type": "markdown",
"id": "3e071016",
"metadata": {},
"source": [
"# References\n",
"\n",
"- Naupaka Zimmerman, Greg Wilson, Raniere Silva, Scott Ritchie, François Michonneau, Jeffrey Oliver, … Yuka Takemon. (2019, July). swcarpentry/r-novice-gapminder: Software Carpentry: R for Reproducible Scientific Analysis, June 2019 (Version v2019.06.1). Zenodo. http://doi.org/10.5281/zenodo.3265164"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}