{
"cells": [
{
"cell_type": "markdown",
"id": "d840814c",
"metadata": {},
"source": [
"# Data: Joining Datasets\n",
"\n",
"*Purpose*: Often our data are scattered across multiple sets. In this case, we need to be able to *join* data.\n"
]
},
{
"cell_type": "markdown",
"id": "f37431c0",
"metadata": {},
"source": [
"## Setup\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "969f3e41",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import grama as gr\n",
"DF = gr.Intention()\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"id": "b406824a",
"metadata": {},
"source": [
"# Danger! Naive \"binding\" of data\n"
]
},
{
"cell_type": "markdown",
"id": "a573aef6",
"metadata": {},
"source": [
"The simplest means we have to combine two datasets is to *bind* them together. The verb `gr.tf_bind_rows()` binds together two datasets vertically (adds rows to rows), while `gr.tf_bind_cols()` binds two datasets horizontally (adds new columns to the dataset). These are very simple ways to combine datasets; for example:\n"
]
},
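{
"cell_type": "markdown",
"id": "3f2a9c01",
"metadata": {},
"source": [
"To make the row-binding verb concrete as well, here is a small sketch (illustrative values only, not part of the original exercises) of `gr.tf_bind_rows()` stacking two datasets vertically:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8c1d4e72",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"## NOTE: Illustrative sketch of tf_bind_rows(); values are made up\n",
"df_stacked = (\n",
"    gr.df_make(numbers=[1, 2])\n",
"    >> gr.tf_bind_rows(gr.df_make(numbers=[3, 4]))\n",
")\n",
"# The result should contain all four rows\n",
"assert df_stacked.shape[0] == 4\n",
"df_stacked"
]
},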
{
"cell_type": "code",
"execution_count": 2,
"id": "ee30a716",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" numbers letters\n",
"0 1 A\n",
"1 2 B\n",
"2 3 C\n",
"3 4 D"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## NOTE: No need to edit\n",
"(\n",
" gr.df_make(numbers=[1,2,3,4])\n",
" >> gr.tf_bind_cols(\n",
" gr.df_make(letters=[\"A\", \"B\", \"C\", \"D\"])\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"id": "445d5bfe",
"metadata": {},
"source": [
"Binding is appropriate when we have *strong knowledge* of how our data are structured; we blindly smash together rows or columns to make a new dataframe, so we'd better be sure those rows/columns are in the right order! Otherwise, we may get surprising (and wrong) results....\n"
]
},
{
"cell_type": "markdown",
"id": "a4920e06",
"metadata": {},
"source": [
"### __q1__ What went wrong with this bind?\n",
"\n",
"Run the following code and answer the questions under *observations* below.\n",
"\n",
"*Hint*: If you're not a Beatles fan, it may be helpful to consult the relevant [personnel](https://en.wikipedia.org/wiki/The_Beatles#Personnel) page on Wikipedia.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "323984f6",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" band name surname instrument\n",
"0 Beatles John McCartney bass\n",
"1 Beatles Paul Harrison guitar\n",
"2 Beatles George Starr drums\n",
"3 Beatles Ringo Lennon guitar"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## NOTE: No need to edit; run and inspect\n",
"# Setup\n",
"df_beatles = gr.df_make(\n",
" band=[\"Beatles\"] * 4,\n",
" name=[\"John\", \"Paul\", \"George\", \"Ringo\"],\n",
")\n",
"\n",
"df_beatles_instruments = gr.df_make(\n",
" surname=[\"McCartney\", \"Harrison\", \"Starr\", \"Lennon\"],\n",
" instrument=[\"bass\", \"guitar\", \"drums\", \"guitar\"]\n",
")\n",
"\n",
"# Attempt to combine the datasets... to disastrous results!\n",
"(\n",
" df_beatles\n",
" >> gr.tf_bind_cols(df_beatles_instruments)\n",
")"
]
},
{
"cell_type": "markdown",
"id": "7540d724",
"metadata": {},
"source": [
"*Observations*\n",
"\n",
"- What went wrong in binding `df_beatles` and `df_beatles_instruments`?\n",
" - The rows of `df_beatles` were not in the same order as the rows of `df_beatles_instruments`! Now we have absurd combinations such as `\"Ringo Lennon\"` and `\"George Starr\"`.\n",
""
]
},
{
"cell_type": "markdown",
"id": "d1b567cb",
"metadata": {},
"source": [
"# A safer way: \"Joining\" datasets\n",
"\n",
"A safer way to combine two datasets is to *not assume they are ordered correctly*. Instead, we can use *common* information to *join* two datasets. In order to do a join, we must have a set of \"keys\" by which to combine data from the two datasets. For instance, if we had a DataFrame with both `name` and `surname`, we could join it to `df_beatles` by the `name` column.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "9200e83c",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" band name surname\n",
"0 Beatles John Lennon\n",
"1 Beatles Paul McCartney\n",
"2 Beatles George Harrison\n",
"3 Beatles Ringo Starr"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## NOTE: No need to edit\n",
"df_beatles_surnames = gr.df_make(\n",
" name=[\"John\", \"Paul\", \"George\", \"Ringo\"],\n",
" surname=[\"Lennon\", \"McCartney\", \"Harrison\", \"Starr\"],\n",
")\n",
"\n",
"df_beatles_names = (\n",
" df_beatles\n",
" >> gr.tf_left_join(df_beatles_surnames, by=\"name\")\n",
")\n",
"df_beatles_names"
]
},
{
"cell_type": "markdown",
"id": "4938390c",
"metadata": {},
"source": [
"Note that this correctly associates names with surnames.\n"
]
},
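{
"cell_type": "markdown",
"id": "b7e1a2f9",
"metadata": {},
"source": [
"Because the join matches on keys rather than on row position, the row order of the lookup table does not matter. As a quick check (an illustrative sketch, not part of the original exercises), we can scramble the rows of the surname table and still recover the correct pairing:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d3c8a10",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"## NOTE: Illustrative sketch; the join is robust to row order\n",
"df_shuffled = gr.df_make(\n",
"    name=[\"Ringo\", \"John\", \"George\", \"Paul\"],\n",
"    surname=[\"Starr\", \"Lennon\", \"Harrison\", \"McCartney\"],\n",
")\n",
"df_check = (\n",
"    df_beatles\n",
"    >> gr.tf_left_join(df_shuffled, by=\"name\")\n",
")\n",
"# John should still be matched with Lennon\n",
"assert df_check[df_check.name == \"John\"].surname.values[0] == \"Lennon\"\n",
"df_check"
]
},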
{
"cell_type": "markdown",
"id": "32e80f4a",
"metadata": {},
"source": [
"### __q2__ Do a join\n",
"\n",
"Use `gr.tf_left_join()` to associate each `instrument` with the correct band member.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b37c4754",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" band name surname instrument\n",
"0 Beatles John Lennon guitar\n",
"1 Beatles Paul McCartney bass\n",
"2 Beatles George Harrison guitar\n",
"3 Beatles Ringo Starr drums"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## TASK: Join df_beatles_instruments correctly to add the `instrument` column\n",
"df_beatles_full = (\n",
" df_beatles_names\n",
" >> gr.tf_left_join(\n",
" df_beatles_instruments,\n",
" by=\"surname\",\n",
" )\n",
")\n",
"\n",
"## NOTE: Use this to check your work\n",
"assert \\\n",
" \"instrument\" in df_beatles_full.columns, \\\n",
" \"df_beatles_full does not have an `instrument` column\"\n",
"\n",
"assert \\\n",
" df_beatles_full[df_beatles_full.name == \"Ringo\"].instrument.values[0] == \"drums\", \\\n",
" \"Ringo Starr played drums!\"\n",
"\n",
"df_beatles_full\n"
]
},
{
"cell_type": "markdown",
"id": "8ff7346f",
"metadata": {},
"source": [
"There's a **very important lesson** here: In general, don't trust `gr.tf_bind_cols`. It's easy in the example above to tell there's a problem because the data are *small*; when working with larger datasets, the software will happily give you the wrong answer if you give it the wrong instructions. Whenever possible, use some form of join to combine datasets.\n"
]
},
{
"cell_type": "markdown",
"id": "7af86805",
"metadata": {},
"source": [
"# Types of joins\n",
"\n",
"There are several types of joins:\n",
"\n",
"| Name | Type | Description |\n",
"|------|------|-------------|\n",
"| `gr.tf_left_join()` | Mutating join | Preserves rows in the left DataFrame being joined |\n",
"| `gr.tf_right_join()` | Mutating join | Preserves rows in the right DataFrame being joined |\n",
"| `gr.tf_inner_join()` | Mutating join | Preserves rows common to both DataFrames being joined |\n",
"| `gr.tf_outer_join()` | Mutating join | Preserves all rows in both DataFrames being joined |\n",
"| `gr.tf_semi_join()` | Filtering join | Returns all rows in left DataFrame that have a match in the right DataFrame |\n",
"| `gr.tf_anti_join()` | Filtering join | Returns all rows in left DataFrame that have *no* match in the right DataFrame |\n",
"\n",
"We'll discuss (and use!) each of these below.\n"
]
},
{
"cell_type": "markdown",
"id": "b96767c2",
"metadata": {},
"source": [
"## Mutating joins\n",
"\n",
"A *mutating join* is a join that also performs a mutation---it adds columns to the DataFrame. As we saw above, we can use a mutating join to add information to a dataset. However, there are four different types of mutating join, each of which preserves a different set of rows. We'll compare them using the following DataFrames:\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "1cdcec1d",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"## NOTE: No need to edit\n",
"df_beatles_names = gr.df_make(\n",
" name=[\"John\", \"Paul\", \"George\", \"Ringo\", \"George\"],\n",
" surname=[\"Lennon\", \"McCartney\", \"Harrison\", \"Starr\", \"Martin\"],\n",
")\n",
"\n",
"df_beatles_roles = gr.df_make(\n",
" surname=[\"Lennon\", \"McCartney\", \"Harrison\", \"Starr\", \"Epstein\"],\n",
" role=[\"Bandmate\", \"Bandmate\", \"Bandmate\", \"Bandmate\", \"Manager\"],\n",
")\n"
]
},
{
"cell_type": "markdown",
"id": "5a5f5c6e",
"metadata": {},
"source": [
"You'll investigate how the various join types function in the next task:\n"
]
},
{
"cell_type": "markdown",
"id": "3d91cba8",
"metadata": {},
"source": [
"### __q3__ Test the joins\n",
"\n",
"Uncomment one line at a time and run the code below. Answer the questions under *observations* below.\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "e4d46c16",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" name surname role\n",
"0 John Lennon Bandmate\n",
"1 Paul McCartney Bandmate\n",
"2 George Harrison Bandmate\n",
"3 Ringo Starr Bandmate\n",
"4 George Martin NaN"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## TASK: Uncomment one line at a time and run; document your findings\n",
"(\n",
" df_beatles_names\n",
" >> gr.tf_left_join(df_beatles_roles, by=\"surname\")\n",
" # >> gr.tf_right_join(df_beatles_roles, by=\"surname\")\n",
" # >> gr.tf_inner_join(df_beatles_roles, by=\"surname\")\n",
" # >> gr.tf_outer_join(df_beatles_roles, by=\"surname\")\n",
")"
]
},
{
"cell_type": "markdown",
"id": "9bfad25b",
"metadata": {},
"source": [
"*Observations*\n",
"\n",
"- Which rows does `tf_left_join()` preserve?\n",
" - This preserves the rows in the left (first) DataFrame.\n",
"- Which rows does `tf_right_join()` preserve?\n",
" - This preserves the rows in the right (second) DataFrame.\n",
"- Which rows does `tf_inner_join()` preserve?\n",
" - This preserves the rows common to both DataFrames.\n",
"- Which rows does `tf_outer_join()` preserve?\n",
" - This preserves all rows in both DataFrames.\n",
""
]
},
{
"cell_type": "markdown",
"id": "f5be2d47",
"metadata": {
"tags": []
},
"source": [
"### Visual Aid: Types of Joins\n",
"\n",
"The following visual may help you make sense of the four mutating joins; it depicts the four verbs as Venn diagrams for the left (`L`) and right (`R`) DataFrames in the join.\n",
"\n",
"![Venn diagrams for four types of joins: left, right, inner, outer](images/joins.jpg)\n",
"\n",
"You may also find [this image](https://github.com/amartinson193/SQL_Checkered_Flag_Join_Diagrams/blob/main/checkered_flag_diagram_pg1.png) helpful for visualizing the join types.\n"
]
},
{
"cell_type": "markdown",
"id": "605dd2fa",
"metadata": {},
"source": [
"## Danger! Non-unique keys\n",
"\n",
"Note that when we do any sort of join, we need *unique* keys. We'll run into trouble if the provided keys do not uniquely identify each row. For example, Harrison and Martin share a given name:\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "2a28d05a",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" name surname\n",
"0 John Lennon\n",
"1 Paul McCartney\n",
"2 George Harrison\n",
"3 Ringo Starr\n",
"4 George Martin"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## NOTE: No need to edit\n",
"df_beatles_names\n"
]
},
{
"cell_type": "markdown",
"id": "cf41c490",
"metadata": {},
"source": [
"Look at what happens when we join on the first name (the `name` column) only:\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "730a98d9",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" name surname instrument\n",
"0 John Lennon guitar\n",
"1 Paul McCartney bass\n",
"2 George Harrison guitar\n",
"3 George Martin guitar\n",
"4 Ringo Starr drums"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## NOTE: No need to edit; this gives incorrect results due to non-unique keys\n",
"(\n",
" df_beatles_names\n",
" >> gr.tf_full_join(\n",
" gr.df_make(\n",
" name=[\"Paul\", \"George\", \"Ringo\", \"John\"],\n",
" instrument=[\"bass\", \"guitar\", \"drums\", \"guitar\"]\n",
" ),\n",
" by=\"name\"\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"id": "b6861404",
"metadata": {},
"source": [
"George Martin didn't play the guitar in the Beatles! He was their producer.\n",
"\n",
"If a single key is not unique, we can use *multiple keys* for the join:\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "d26b96cb",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" name surname instrument\n",
"0 John Lennon guitar\n",
"1 Paul McCartney bass\n",
"2 George Harrison guitar\n",
"3 Ringo Starr drums\n",
"4 George Martin NaN"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## NOTE: No need to edit; using multiple keys corrects the issue\n",
"(\n",
" df_beatles_names\n",
" >> gr.tf_full_join(\n",
" gr.df_make(\n",
" name=[\"Paul\", \"George\", \"Ringo\", \"John\"],\n",
" surname=[\"McCartney\", \"Harrison\", \"Starr\", \"Lennon\"],\n",
" instrument=[\"bass\", \"guitar\", \"drums\", \"guitar\"]\n",
" ),\n",
" by=[\"name\", \"surname\"],\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"id": "aa1b71ed",
"metadata": {},
"source": [
"## Filtering joins\n",
"\n",
"Mutating joins add new columns, but *filtering joins* simply filter the DataFrame. A filtering join is particularly helpful when a filter is difficult to express in `gr.tf_filter()`, but easy to express as a set of keys (perhaps with multiple key columns).\n",
"\n",
"As a first example, we can filter to all of the rows where `name` is `\"George\"` with a `gr.tf_semi_join()`.\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "c2bd6192",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" name surname\n",
"0 George Harrison\n",
"1 George Martin"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## NOTE: No need to edit\n",
"(\n",
" df_beatles_names\n",
" >> gr.tf_semi_join(\n",
" gr.df_make(name=[\"George\"]),\n",
" by=\"name\",\n",
" )\n",
")"
]
},
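{
"cell_type": "markdown",
"id": "2e9f6b47",
"metadata": {},
"source": [
"The complement is `gr.tf_anti_join()`, which keeps only the rows *without* a match in the right DataFrame. The following sketch (illustrative only, not part of the original exercises) filters to everyone *not* named George:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a0c4d913",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"## NOTE: Illustrative sketch of the filtering join tf_anti_join()\n",
"df_not_george = (\n",
"    df_beatles_names\n",
"    >> gr.tf_anti_join(\n",
"        gr.df_make(name=[\"George\"]),\n",
"        by=\"name\",\n",
"    )\n",
")\n",
"# No rows named George should remain\n",
"assert not (df_not_george.name == \"George\").any()\n",
"df_not_george"
]
},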
{
"cell_type": "markdown",
"id": "b7e9a452",
"metadata": {},
"source": [
"As we saw before, a single key will often not be enough to uniquely identify a row. For instance, the following will filter down to a number of non-Beatle players:\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "0b76c5f1",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" surname instrument\n",
"0 Clapton guitar\n",
"1 Harrison guitar\n",
"2 Wooten bass\n",
"3 McCartney bass"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## NOTE: No need to edit\n",
"(\n",
" gr.df_make(\n",
" surname=[\"Clapton\", \"Harrison\", \"Shankar\", \"Wooten\", \"McCartney\"],\n",
" instrument=[\"guitar\", \"guitar\", \"sitar\", \"bass\", \"bass\"],\n",
" )\n",
" >> gr.tf_semi_join(\n",
" df_beatles_instruments,\n",
" by=\"instrument\"\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"id": "77ab7e11",
"metadata": {},
"source": [
"### __q4__ Semi-join with multiple keys\n",
"\n",
"Construct the `by` argument for `gr.tf_semi_join()` below to filter to *only* persons who were in the Beatles.\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "62f26ba8",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" surname instrument\n",
"0 Harrison guitar\n",
"1 McCartney bass"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## TASK: Construct the `by` argument below to filter to *only* persons who were in the Beatles\n",
"(\n",
" gr.df_make(\n",
" surname=[\"Clapton\", \"Harrison\", \"Shankar\", \"Wooten\", \"McCartney\"],\n",
" instrument=[\"guitar\", \"guitar\", \"sitar\", \"bass\", \"bass\"],\n",
" )\n",
" >> gr.tf_semi_join(\n",
" df_beatles_instruments,\n",
"\n",
" by=[\"instrument\", \"surname\"]\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"id": "0eee3bbc",
"metadata": {},
"source": [
"# Going Further: Airports dataset\n",
"\n",
"We'll use the `nycflights13` package to demonstrate joins in a more realistic situation. This is a dataset of flights involving the New York City area during 2013.\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "eeae725f",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" year month day dep_time sched_dep_time dep_delay arr_time \\\n",
"0 2013 1 1 517.0 515 2.0 830.0 \n",
"1 2013 1 1 533.0 529 4.0 850.0 \n",
"2 2013 1 1 542.0 540 2.0 923.0 \n",
"3 2013 1 1 544.0 545 -1.0 1004.0 \n",
"4 2013 1 1 554.0 600 -6.0 812.0 \n",
"... ... ... ... ... ... ... ... \n",
"336771 2013 9 30 NaN 1455 NaN NaN \n",
"336772 2013 9 30 NaN 2200 NaN NaN \n",
"336773 2013 9 30 NaN 1210 NaN NaN \n",
"336774 2013 9 30 NaN 1159 NaN NaN \n",
"336775 2013 9 30 NaN 840 NaN NaN \n",
"\n",
" sched_arr_time arr_delay carrier flight tailnum origin dest \\\n",
"0 819 11.0 UA 1545 N14228 EWR IAH \n",
"1 830 20.0 UA 1714 N24211 LGA IAH \n",
"2 850 33.0 AA 1141 N619AA JFK MIA \n",
"3 1022 -18.0 B6 725 N804JB JFK BQN \n",
"4 837 -25.0 DL 461 N668DN LGA ATL \n",
"... ... ... ... ... ... ... ... \n",
"336771 1634 NaN 9E 3393 NaN JFK DCA \n",
"336772 2312 NaN 9E 3525 NaN LGA SYR \n",
"336773 1330 NaN MQ 3461 N535MQ LGA BNA \n",
"336774 1344 NaN MQ 3572 N511MQ LGA CLE \n",
"336775 1020 NaN MQ 3531 N839MQ LGA RDU \n",
"\n",
" air_time distance hour minute time_hour \n",
"0 227.0 1400 5 15 2013-01-01T10:00:00Z \n",
"1 227.0 1416 5 29 2013-01-01T10:00:00Z \n",
"2 160.0 1089 5 40 2013-01-01T10:00:00Z \n",
"3 183.0 1576 5 45 2013-01-01T10:00:00Z \n",
"4 116.0 762 6 0 2013-01-01T11:00:00Z \n",
"... ... ... ... ... ... \n",
"336771 NaN 213 14 55 2013-09-30T18:00:00Z \n",
"336772 NaN 198 22 0 2013-10-01T02:00:00Z \n",
"336773 NaN 764 12 10 2013-09-30T16:00:00Z \n",
"336774 NaN 419 11 59 2013-09-30T15:00:00Z \n",
"336775 NaN 431 8 40 2013-09-30T12:00:00Z \n",
"\n",
"[336776 rows x 19 columns]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from nycflights13 import flights as df_flights\n",
"df_flights"
]
},
{
"cell_type": "markdown",
"id": "a7d35c08",
"metadata": {},
"source": [
"### __q5__ Make a \"grid\" of filter criteria\n",
"\n",
"Use `gr.df_grid()` to make a DataFrame with the following rows:\n",
"\n",
"| `month` | `dest` |\n",
"|---------|--------|\n",
"| 8 | \"SJC\" |\n",
"| 8 | \"SFO\" |\n",
"| 8 | \"OAK\" |\n",
"| 9 | \"SJC\" |\n",
"| 9 | \"SFO\" |\n",
"| 9 | \"OAK\" |\n",
"\n",
"*Note*: We'll use this grid soon in a filtering join.\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "3ef8b98f",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" month dest\n",
"0 8 SJC\n",
"1 9 SJC\n",
"2 8 SFO\n",
"3 9 SFO\n",
"4 8 OAK\n",
"5 9 OAK"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## TASK: Use gr.df_grid() to make the DataFrame described above\n",
"\n",
"df_criteria = gr.df_grid(\n",
" month=[8, 9],\n",
" dest=[\"SJC\", \"SFO\", \"OAK\"]\n",
")\n",
"## NOTE: Use this to check your work\n",
"assert \\\n",
" \"month\" in df_criteria.columns, \\\n",
" \"df_criteria does not have a 'month' column\"\n",
"\n",
"assert \\\n",
" \"dest\" in df_criteria.columns, \\\n",
" \"df_criteria does not have a 'dest' column\"\n",
"\n",
"assert \\\n",
" df_criteria.shape[0] == 6, \\\n",
" \"df_criteria has the wrong number of rows\"\n",
"\n",
"df_criteria"
]
},
{
"cell_type": "markdown",
"id": "c110ae4e",
"metadata": {},
"source": [
"There are *many* carriers in this dataset; let's figure out which carriers are associated with flights in August (`month==8`) and September (`month==9`) to the San Francisco Bay Area (SJC, SFO, OAK).\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "6a551f65",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" carrier n\n",
"0 9E 18460\n",
"1 AA 32729\n",
"2 AS 714\n",
"3 B6 54635\n",
"4 DL 48110\n",
"5 EV 54173\n",
"6 F9 685\n",
"7 FL 3260\n",
"8 HA 342\n",
"9 MQ 26397\n",
"10 OO 32\n",
"11 UA 58665\n",
"12 US 20536\n",
"13 VX 5162\n",
"14 WN 12275\n",
"15 YV 601"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(\n",
" df_flights\n",
" >> gr.tf_count(DF.carrier)\n",
")"
]
},
{
"cell_type": "markdown",
"id": "3c43a003",
"metadata": {},
"source": [
"### __q6__ Count the Bay area carriers\n",
"\n",
"Perform a semi-join to filter `df_flights` using all the columns of `df_criteria`. Answer the questions under *observations* below.\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "5a1bc066",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" carrier n\n",
"0 AA 239\n",
"1 B6 306\n",
"2 DL 324\n",
"3 UA 1305\n",
"4 VX 410"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## TASK: Semi-join with df_criteria\n",
"(\n",
" df_flights\n",
"\n",
" >> gr.tf_semi_join(df_criteria, by=[\"month\", \"dest\"])\n",
" >> gr.tf_count(DF.carrier)\n",
")"
]
},
{
"cell_type": "markdown",
"id": "0dcecc9c",
"metadata": {},
"source": [
"*Observations*\n",
"\n",
"- How many carriers provided flights to the SF Bay Area in the months considered?\n",
" - There are `5` such carriers.\n",
"- How does this number of carriers compare with the total number of carriers?\n",
" - This is considerably fewer carriers!\n",
""
]
},
{
"cell_type": "markdown",
"id": "38368554",
"metadata": {},
"source": [
"Unless you work in the airline industry, it's probably difficult to make sense of these carrier codes. Thankfully the `nycflights13` package comes with a dataset that helps disambiguate these codes:\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "15a62ce4",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" carrier name\n",
"0 9E Endeavor Air Inc.\n",
"1 AA American Airlines Inc.\n",
"2 AS Alaska Airlines Inc.\n",
"3 B6 JetBlue Airways\n",
"4 DL Delta Air Lines Inc.\n",
"5 EV ExpressJet Airlines Inc.\n",
"6 F9 Frontier Airlines Inc.\n",
"7 FL AirTran Airways Corporation\n",
"8 HA Hawaiian Airlines Inc.\n",
"9 MQ Envoy Air\n",
"10 OO SkyWest Airlines Inc.\n",
"11 UA United Air Lines Inc.\n",
"12 US US Airways Inc.\n",
"13 VX Virgin America\n",
"14 WN Southwest Airlines Co.\n",
"15 YV Mesa Airlines Inc."
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from nycflights13 import airlines as df_airlines\n",
"df_airlines"
]
},
{
"cell_type": "markdown",
"id": "8df56ec7",
"metadata": {},
"source": [
"### __q7__ Make the data more interpretable\n",
"\n",
"Use the appropriate kind of join to add the `name` column to the results below. Answer the questions under *observations* below.\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "cd63d04a",
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" carrier n name\n",
"0 AA 239 American Airlines Inc.\n",
"1 B6 306 JetBlue Airways\n",
"2 DL 324 Delta Air Lines Inc.\n",
"3 UA 1305 United Air Lines Inc.\n",
"4 VX 410 Virgin America"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## TASK: Add the `name` column from `df_airlines`\n",
"(\n",
" df_flights\n",
" >> gr.tf_semi_join(df_criteria, by=[\"month\", \"dest\"])\n",
" >> gr.tf_count(DF.carrier)\n",
" >> gr.tf_left_join(\n",
" df_airlines,\n",
" by=\"carrier\",\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"id": "65aa70ac",
"metadata": {},
"source": [
"*Observations*\n",
"\n",
"- Which carrier had the most flights in the subset considered?\n",
" - United Air Lines, by a sizeable fraction.\n",
""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}