Data: Isolating Data#

Purpose: One of the keys to a successful analysis is the ability to focus on particular topics. When analyzing a dataset, our ability to focus is tied to our facility at isolating data. In this exercise, you will practice isolating columns with tf_select(), picking specific rows with tf_filter(), and sorting your data with tf_arrange() to see what rises to the top.

Aside: The data-management verbs in grama are heavily inspired by the dplyr package in the R programming langauge.

Setup#

import grama as gr
DF = gr.Intention()
%matplotlib inline

We’ll use the nycflights13 package in this exercise: This is a dataset of flights involving the New York City area during 2013.

from nycflights13 import flights as df_flights
df_flights
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
0 2013 1 1 517.0 515 2.0 830.0 819 11.0 UA 1545 N14228 EWR IAH 227.0 1400 5 15 2013-01-01T10:00:00Z
1 2013 1 1 533.0 529 4.0 850.0 830 20.0 UA 1714 N24211 LGA IAH 227.0 1416 5 29 2013-01-01T10:00:00Z
2 2013 1 1 542.0 540 2.0 923.0 850 33.0 AA 1141 N619AA JFK MIA 160.0 1089 5 40 2013-01-01T10:00:00Z
3 2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 B6 725 N804JB JFK BQN 183.0 1576 5 45 2013-01-01T10:00:00Z
4 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL 461 N668DN LGA ATL 116.0 762 6 0 2013-01-01T11:00:00Z
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
336771 2013 9 30 NaN 1455 NaN NaN 1634 NaN 9E 3393 NaN JFK DCA NaN 213 14 55 2013-09-30T18:00:00Z
336772 2013 9 30 NaN 2200 NaN NaN 2312 NaN 9E 3525 NaN LGA SYR NaN 198 22 0 2013-10-01T02:00:00Z
336773 2013 9 30 NaN 1210 NaN NaN 1330 NaN MQ 3461 N535MQ LGA BNA NaN 764 12 10 2013-09-30T16:00:00Z
336774 2013 9 30 NaN 1159 NaN NaN 1344 NaN MQ 3572 N511MQ LGA CLE NaN 419 11 59 2013-09-30T15:00:00Z
336775 2013 9 30 NaN 840 NaN NaN 1020 NaN MQ 3531 N839MQ LGA RDU NaN 431 8 40 2013-09-30T12:00:00Z

336776 rows × 19 columns

The DataFrame Object#

The variable df_flights above is a DataFrame; a way of storing data in Python.

Aside: The DataFrame class is provided by the Pandas package, but we will use Grama to work with DataFrames.

Head and Tail#

Looking at an entire DataFrame is usually overwhelming; it is useful to be able to focus on a small subset of the data. One of our most basic tools is to get the first or last few rows of a DataFrame, which we can do with DataFrame.head(n) and DataFrame.tail(n). These functions are called with the following syntax:

df_flights.head(10)

q1 Get the tail of a DataFrame#

Get the last 10 rows of df_flights.

# TASK: Get the last 10 rows of df_flights

df_flights.tail(10)
# solution-end
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
336766 2013 9 30 2240.0 2250 -10.0 2347.0 7 -20.0 B6 2002 N281JB JFK BUF 52.0 301 22 50 2013-10-01T02:00:00Z
336767 2013 9 30 2241.0 2246 -5.0 2345.0 1 -16.0 B6 486 N346JB JFK ROC 47.0 264 22 46 2013-10-01T02:00:00Z
336768 2013 9 30 2307.0 2255 12.0 2359.0 2358 1.0 B6 718 N565JB JFK BOS 33.0 187 22 55 2013-10-01T02:00:00Z
336769 2013 9 30 2349.0 2359 -10.0 325.0 350 -25.0 B6 745 N516JB JFK PSE 196.0 1617 23 59 2013-10-01T03:00:00Z
336770 2013 9 30 NaN 1842 NaN NaN 2019 NaN EV 5274 N740EV LGA BNA NaN 764 18 42 2013-09-30T22:00:00Z
336771 2013 9 30 NaN 1455 NaN NaN 1634 NaN 9E 3393 NaN JFK DCA NaN 213 14 55 2013-09-30T18:00:00Z
336772 2013 9 30 NaN 2200 NaN NaN 2312 NaN 9E 3525 NaN LGA SYR NaN 198 22 0 2013-10-01T02:00:00Z
336773 2013 9 30 NaN 1210 NaN NaN 1330 NaN MQ 3461 N535MQ LGA BNA NaN 764 12 10 2013-09-30T16:00:00Z
336774 2013 9 30 NaN 1159 NaN NaN 1344 NaN MQ 3572 N511MQ LGA CLE NaN 419 11 59 2013-09-30T15:00:00Z
336775 2013 9 30 NaN 840 NaN NaN 1020 NaN MQ 3531 N839MQ LGA RDU NaN 431 8 40 2013-09-30T12:00:00Z

Piping#

Often, when carrying out studies with data, we want to perform multiple operations on the same dataset. We could do this by assigning intermediate variables, such as a temporary DataFrame:

# NOTE: No need to edit
df_tmp = df_flights.head(10) # Temporary DataFrame
df_tmp.tail(5)
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
5 2013 1 1 554.0 558 -4.0 740.0 728 12.0 UA 1696 N39463 EWR ORD 150.0 719 5 58 2013-01-01T10:00:00Z
6 2013 1 1 555.0 600 -5.0 913.0 854 19.0 B6 507 N516JB EWR FLL 158.0 1065 6 0 2013-01-01T11:00:00Z
7 2013 1 1 557.0 600 -3.0 709.0 723 -14.0 EV 5708 N829AS LGA IAD 53.0 229 6 0 2013-01-01T11:00:00Z
8 2013 1 1 557.0 600 -3.0 838.0 846 -8.0 B6 79 N593JB JFK MCO 140.0 944 6 0 2013-01-01T11:00:00Z
9 2013 1 1 558.0 600 -2.0 753.0 745 8.0 AA 301 N3ALAA LGA ORD 138.0 733 6 0 2013-01-01T11:00:00Z

Alternatively, we could chain together calls:

# NOTE: No need to edit
df_flights.head(10).tail(5)
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
5 2013 1 1 554.0 558 -4.0 740.0 728 12.0 UA 1696 N39463 EWR ORD 150.0 719 5 58 2013-01-01T10:00:00Z
6 2013 1 1 555.0 600 -5.0 913.0 854 19.0 B6 507 N516JB EWR FLL 158.0 1065 6 0 2013-01-01T11:00:00Z
7 2013 1 1 557.0 600 -3.0 709.0 723 -14.0 EV 5708 N829AS LGA IAD 53.0 229 6 0 2013-01-01T11:00:00Z
8 2013 1 1 557.0 600 -3.0 838.0 846 -8.0 B6 79 N593JB JFK MCO 140.0 944 6 0 2013-01-01T11:00:00Z
9 2013 1 1 558.0 600 -2.0 753.0 745 8.0 AA 301 N3ALAA LGA ORD 138.0 733 6 0 2013-01-01T11:00:00Z

Rather than these two approaches, using grama we can form a data pipeline using the pipe operator >>. The following code demonstrates the use of the pipe operator:

# NOTE: No need to edit
(
    df_flights
    >> gr.tf_head(10)
    >> gr.tf_tail(5)
)
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
5 2013 1 1 554.0 558 -4.0 740.0 728 12.0 UA 1696 N39463 EWR ORD 150.0 719 5 58 2013-01-01T10:00:00Z
6 2013 1 1 555.0 600 -5.0 913.0 854 19.0 B6 507 N516JB EWR FLL 158.0 1065 6 0 2013-01-01T11:00:00Z
7 2013 1 1 557.0 600 -3.0 709.0 723 -14.0 EV 5708 N829AS LGA IAD 53.0 229 6 0 2013-01-01T11:00:00Z
8 2013 1 1 557.0 600 -3.0 838.0 846 -8.0 B6 79 N593JB JFK MCO 140.0 944 6 0 2013-01-01T11:00:00Z
9 2013 1 1 558.0 600 -2.0 753.0 745 8.0 AA 301 N3ALAA LGA ORD 138.0 733 6 0 2013-01-01T11:00:00Z

It’s useful to think of the pipe operator as the English word “then”. This allows us to translate code:

(
    df_flights
    >> gr.tf_head(10)
    >> gr.tf_tail(5)
)

into something looking like a plain-language sentence

(
    Start with df_flights
    "then" take the first 10
    "then" take the last 5.
)

Aside: Most of the grama functions we’ll use in this exercise start with the tf_ prefix; this means they take in a DataFrame and return a DataFrame (they transform data).

q2 Convert to piped code#

Convert the following code to a pipe-enabled version.

Note: The pipe-enabled version of DataFrame.head() is tf_head().

# TASK: Convert the following code to pipe-enabled form
df_flights.head(10)

(
    df_flights

    >> gr.tf_head(10)
)
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
0 2013 1 1 517.0 515 2.0 830.0 819 11.0 UA 1545 N14228 EWR IAH 227.0 1400 5 15 2013-01-01T10:00:00Z
1 2013 1 1 533.0 529 4.0 850.0 830 20.0 UA 1714 N24211 LGA IAH 227.0 1416 5 29 2013-01-01T10:00:00Z
2 2013 1 1 542.0 540 2.0 923.0 850 33.0 AA 1141 N619AA JFK MIA 160.0 1089 5 40 2013-01-01T10:00:00Z
3 2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 B6 725 N804JB JFK BQN 183.0 1576 5 45 2013-01-01T10:00:00Z
4 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL 461 N668DN LGA ATL 116.0 762 6 0 2013-01-01T11:00:00Z
5 2013 1 1 554.0 558 -4.0 740.0 728 12.0 UA 1696 N39463 EWR ORD 150.0 719 5 58 2013-01-01T10:00:00Z
6 2013 1 1 555.0 600 -5.0 913.0 854 19.0 B6 507 N516JB EWR FLL 158.0 1065 6 0 2013-01-01T11:00:00Z
7 2013 1 1 557.0 600 -3.0 709.0 723 -14.0 EV 5708 N829AS LGA IAD 53.0 229 6 0 2013-01-01T11:00:00Z
8 2013 1 1 557.0 600 -3.0 838.0 846 -8.0 B6 79 N593JB JFK MCO 140.0 944 6 0 2013-01-01T11:00:00Z
9 2013 1 1 558.0 600 -2.0 753.0 745 8.0 AA 301 N3ALAA LGA ORD 138.0 733 6 0 2013-01-01T11:00:00Z

Column Selection#

The grama function gr.tf_select() allows us to select particular columns, which is helpful for getting a more “focused” view of a dataset.

q3 Select particular columns#

Select the columns “month”, “day”, “origin”, and “dest”.

## TASK: Select the columns "month", "day", "origin", and "dest"
(
    df_flights
    >> gr.tf_select("month", "day", "origin", "dest")
)
month day origin dest
0 1 1 EWR IAH
1 1 1 LGA IAH
2 1 1 JFK MIA
3 1 1 JFK BQN
4 1 1 LGA ATL
... ... ... ... ...
336771 9 30 JFK DCA
336772 9 30 LGA SYR
336773 9 30 LGA BNA
336774 9 30 LGA CLE
336775 9 30 LGA RDU

336776 rows × 4 columns

Selection Helpers#

The gr.tf_select() function is helpful, but it is made extremely powerful with a few selection helpers. Rather than specify specific column names, we can use helpers to match names that satisfy different criteria.

Here are a few of the most important selection helpers:

Helper

Selects

gr.starts_with(s)

Columns that start with string s

gr.ends_with(s)

Columns that end with string s

gr.contains(s)

Columns that contain the string s

gr.everything()

All columns not already selected

For instance, the following code will select all the columns whose name starts with "dep_".

# NOTE: No need to edit
(
    df_flights
    >> gr.tf_select(gr.starts_with("dep_"))
)
dep_time dep_delay
0 517.0 2.0
1 533.0 4.0
2 542.0 2.0
3 544.0 -1.0
4 554.0 -6.0
... ... ...
336771 NaN NaN
336772 NaN NaN
336773 NaN NaN
336774 NaN NaN
336775 NaN NaN

336776 rows × 2 columns

q4 Match columns#

Select only those columns whose name ends with "_time".

## TASK: Select only those columns whose name ends with "_time"
(
    df_flights
    >> gr.tf_select(gr.ends_with("_time"))
)
dep_time sched_dep_time arr_time sched_arr_time air_time
0 517.0 515 830.0 819 227.0
1 533.0 529 850.0 830 227.0
2 542.0 540 923.0 850 160.0
3 544.0 545 1004.0 1022 183.0
4 554.0 600 812.0 837 116.0
... ... ... ... ... ...
336771 NaN 1455 NaN 1634 NaN
336772 NaN 2200 NaN 2312 NaN
336773 NaN 1210 NaN 1330 NaN
336774 NaN 1159 NaN 1344 NaN
336775 NaN 840 NaN 1020 NaN

336776 rows × 5 columns

Re-arranging columns with gr.everything()#

The gr.everything() helper may seem silly, but it is actually extremely useful; since the everything helper selects all columns not already selected, we can use it to re-arrange the columns in a DataFrame for a more convenient view. For instance, we can bring a few columns closer together to aid in column comparisons, as the following code demonstrates:

## NOTE: No need to edit
(
    df_flights
    >> gr.tf_select("origin", "dest", "distance", gr.everything())
)
origin dest distance year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum air_time hour minute time_hour
0 EWR IAH 1400 2013 1 1 517.0 515 2.0 830.0 819 11.0 UA 1545 N14228 227.0 5 15 2013-01-01T10:00:00Z
1 LGA IAH 1416 2013 1 1 533.0 529 4.0 850.0 830 20.0 UA 1714 N24211 227.0 5 29 2013-01-01T10:00:00Z
2 JFK MIA 1089 2013 1 1 542.0 540 2.0 923.0 850 33.0 AA 1141 N619AA 160.0 5 40 2013-01-01T10:00:00Z
3 JFK BQN 1576 2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 B6 725 N804JB 183.0 5 45 2013-01-01T10:00:00Z
4 LGA ATL 762 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL 461 N668DN 116.0 6 0 2013-01-01T11:00:00Z
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
336771 JFK DCA 213 2013 9 30 NaN 1455 NaN NaN 1634 NaN 9E 3393 NaN NaN 14 55 2013-09-30T18:00:00Z
336772 LGA SYR 198 2013 9 30 NaN 2200 NaN NaN 2312 NaN 9E 3525 NaN NaN 22 0 2013-10-01T02:00:00Z
336773 LGA BNA 764 2013 9 30 NaN 1210 NaN NaN 1330 NaN MQ 3461 N535MQ NaN 12 10 2013-09-30T16:00:00Z
336774 LGA CLE 419 2013 9 30 NaN 1159 NaN NaN 1344 NaN MQ 3572 N511MQ NaN 11 59 2013-09-30T15:00:00Z
336775 LGA RDU 431 2013 9 30 NaN 840 NaN NaN 1020 NaN MQ 3531 N839MQ NaN 8 40 2013-09-30T12:00:00Z

336776 rows × 19 columns

q5 Re-arrange columns#

Re-arrange the columns to place dest, origin, carrier at the left, but retain all other columns.

# TASK: Bring "dest", "origin", and "carrier" to the left,
# but keep all other columns
(
    df_flights
    >> gr.tf_select("dest", "origin", "carrier", gr.everything())
)
dest origin carrier year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay flight tailnum air_time distance hour minute time_hour
0 IAH EWR UA 2013 1 1 517.0 515 2.0 830.0 819 11.0 1545 N14228 227.0 1400 5 15 2013-01-01T10:00:00Z
1 IAH LGA UA 2013 1 1 533.0 529 4.0 850.0 830 20.0 1714 N24211 227.0 1416 5 29 2013-01-01T10:00:00Z
2 MIA JFK AA 2013 1 1 542.0 540 2.0 923.0 850 33.0 1141 N619AA 160.0 1089 5 40 2013-01-01T10:00:00Z
3 BQN JFK B6 2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 725 N804JB 183.0 1576 5 45 2013-01-01T10:00:00Z
4 ATL LGA DL 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 461 N668DN 116.0 762 6 0 2013-01-01T11:00:00Z
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
336771 DCA JFK 9E 2013 9 30 NaN 1455 NaN NaN 1634 NaN 3393 NaN NaN 213 14 55 2013-09-30T18:00:00Z
336772 SYR LGA 9E 2013 9 30 NaN 2200 NaN NaN 2312 NaN 3525 NaN NaN 198 22 0 2013-10-01T02:00:00Z
336773 BNA LGA MQ 2013 9 30 NaN 1210 NaN NaN 1330 NaN 3461 N535MQ NaN 764 12 10 2013-09-30T16:00:00Z
336774 CLE LGA MQ 2013 9 30 NaN 1159 NaN NaN 1344 NaN 3572 N511MQ NaN 419 11 59 2013-09-30T15:00:00Z
336775 RDU LGA MQ 2013 9 30 NaN 840 NaN NaN 1020 NaN 3531 N839MQ NaN 431 8 40 2013-09-30T12:00:00Z

336776 rows × 19 columns

Row Filtering#

Just as we can select particular columns, we can filter to obtain particular rows.

Accessing column values#

To access a single column of a DataFrame, we can use bracket [] notation:

# NOTE: No need to edit
(
    df_flights["origin"]
)
0         EWR
1         LGA
2         JFK
3         JFK
4         LGA
         ... 
336771    JFK
336772    LGA
336773    LGA
336774    LGA
336775    LGA
Name: origin, Length: 336776, dtype: object

Note that this returns a different datatype: a Pandas series.

Making a comparison#

Remember that we have the following comparison operators:

Symbol

Compares

x < y

x less than y

x <= y

x less than or equal to y

x > y

x greater than y

x >= y

x greater than or equal to y

x == y

x (exactly) equals y

Note that == works for strings too!

x = y

Error!

With a single column, we can make a comparison against a desired value:

# NOTE: No need to edit
(
    df_flights["origin"] == "JFK"
)
0         False
1         False
2          True
3          True
4         False
          ...  
336771     True
336772    False
336773    False
336774    False
336775    False
Name: origin, Length: 336776, dtype: bool

The code above gives us True when the "origin" is "JFK", and False when it is not.

Filtering using comparisons#

These True/False values are useful, because we can use them to filter to only those rows where the comparison yields True:

# NOTE: No need to edit
(
    df_flights
    >> gr.tf_filter(df_flights["origin"] == "JFK")
    # Show that the "origin" is indeed "JFK"
    >> gr.tf_select("origin", gr.everything())
)
origin year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum dest air_time distance hour minute time_hour
0 JFK 2013 1 1 542.0 540 2.0 923.0 850 33.0 AA 1141 N619AA MIA 160.0 1089 5 40 2013-01-01T10:00:00Z
1 JFK 2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 B6 725 N804JB BQN 183.0 1576 5 45 2013-01-01T10:00:00Z
2 JFK 2013 1 1 557.0 600 -3.0 838.0 846 -8.0 B6 79 N593JB MCO 140.0 944 6 0 2013-01-01T11:00:00Z
3 JFK 2013 1 1 558.0 600 -2.0 849.0 851 -2.0 B6 49 N793JB PBI 149.0 1028 6 0 2013-01-01T11:00:00Z
4 JFK 2013 1 1 558.0 600 -2.0 853.0 856 -3.0 B6 71 N657JB TPA 158.0 1005 6 0 2013-01-01T11:00:00Z
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
111274 JFK 2013 9 30 2240.0 2250 -10.0 2347.0 7 -20.0 B6 2002 N281JB BUF 52.0 301 22 50 2013-10-01T02:00:00Z
111275 JFK 2013 9 30 2241.0 2246 -5.0 2345.0 1 -16.0 B6 486 N346JB ROC 47.0 264 22 46 2013-10-01T02:00:00Z
111276 JFK 2013 9 30 2307.0 2255 12.0 2359.0 2358 1.0 B6 718 N565JB BOS 33.0 187 22 55 2013-10-01T02:00:00Z
111277 JFK 2013 9 30 2349.0 2359 -10.0 325.0 350 -25.0 B6 745 N516JB PSE 196.0 1617 23 59 2013-10-01T03:00:00Z
111278 JFK 2013 9 30 NaN 1455 NaN NaN 1634 NaN 9E 3393 NaN DCA NaN 213 14 55 2013-09-30T18:00:00Z

111279 rows × 19 columns

q6 Find the early departures#

Use gr.tf_filter() to find all the flights that left early.

# TASK: Filter for only those rows with a negative "dep_delay"
(
    df_flights
    >> gr.tf_filter(df_flights["dep_delay"] < 0)
)
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
0 2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 B6 725 N804JB JFK BQN 183.0 1576 5 45 2013-01-01T10:00:00Z
1 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL 461 N668DN LGA ATL 116.0 762 6 0 2013-01-01T11:00:00Z
2 2013 1 1 554.0 558 -4.0 740.0 728 12.0 UA 1696 N39463 EWR ORD 150.0 719 5 58 2013-01-01T10:00:00Z
3 2013 1 1 555.0 600 -5.0 913.0 854 19.0 B6 507 N516JB EWR FLL 158.0 1065 6 0 2013-01-01T11:00:00Z
4 2013 1 1 557.0 600 -3.0 709.0 723 -14.0 EV 5708 N829AS LGA IAD 53.0 229 6 0 2013-01-01T11:00:00Z
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
183570 2013 9 30 2237.0 2245 -8.0 2345.0 2353 -8.0 B6 234 N318JB JFK BTV 43.0 266 22 45 2013-10-01T02:00:00Z
183571 2013 9 30 2240.0 2245 -5.0 2334.0 2351 -17.0 B6 1816 N354JB JFK SYR 41.0 209 22 45 2013-10-01T02:00:00Z
183572 2013 9 30 2240.0 2250 -10.0 2347.0 7 -20.0 B6 2002 N281JB JFK BUF 52.0 301 22 50 2013-10-01T02:00:00Z
183573 2013 9 30 2241.0 2246 -5.0 2345.0 1 -16.0 B6 486 N346JB JFK ROC 47.0 264 22 46 2013-10-01T02:00:00Z
183574 2013 9 30 2349.0 2359 -10.0 325.0 350 -25.0 B6 745 N516JB JFK PSE 196.0 1617 23 59 2013-10-01T03:00:00Z

183575 rows × 19 columns

The data pronoun DF#

Way back at the beginning of this notebook, you may have noticed this line of code:

DF = gr.intention()

This assigned the “data pronoun” to the variable DF. This pronoun can be used inside grama functions to refer to the data as it is at any stage in the pipeline. Rather than:

(
    df_flights
    >> gr.tf_filter(df_flights["dep_delay"] < 0)
)

we instead write

(
    df_flights
    >> gr.tf_filter(DF["dep_delay"] < 0)
)

The data pronoun DF is really just an alias for the DataFrame, so we can still use bracket notation [].

The data pronoun DF is not just convenient; it is necessary to make some operations work! Imagine we wanted to find all cases that satisfy both dep_delay < 0 and arr_delay < 0. Consider the following code:

(
    df_flights
    # This filter works properly
    >> gr.tf_filter(df_flights["dep_delay"] < 0)
    # We now have fewer rows! The next filter will fail
    >> gr.tf_filter(df_flights["arr_delay"] < 0)
)

The data pronoun DF allows us to refer to the data as it is in the pipeline; this resolves the issue with having fewer rows in the second filter. You’ll get a chance to fix this code in the next task.

q7 Use DF to fix this code#

Use the data pronoun DF to fix the following code:

# TASK: Fix this code using the data pronoun DF

(
    df_flights
    >> gr.tf_filter(DF["dep_delay"] < 0)
    >> gr.tf_filter(DF["arr_delay"] < 0)
)
# solution-end
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
0 2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 B6 725 N804JB JFK BQN 183.0 1576 5 45 2013-01-01T10:00:00Z
1 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL 461 N668DN LGA ATL 116.0 762 6 0 2013-01-01T11:00:00Z
2 2013 1 1 557.0 600 -3.0 709.0 723 -14.0 EV 5708 N829AS LGA IAD 53.0 229 6 0 2013-01-01T11:00:00Z
3 2013 1 1 557.0 600 -3.0 838.0 846 -8.0 B6 79 N593JB JFK MCO 140.0 944 6 0 2013-01-01T11:00:00Z
4 2013 1 1 558.0 600 -2.0 849.0 851 -2.0 B6 49 N793JB JFK PBI 149.0 1028 6 0 2013-01-01T11:00:00Z
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
144341 2013 9 30 2237.0 2245 -8.0 2345.0 2353 -8.0 B6 234 N318JB JFK BTV 43.0 266 22 45 2013-10-01T02:00:00Z
144342 2013 9 30 2240.0 2245 -5.0 2334.0 2351 -17.0 B6 1816 N354JB JFK SYR 41.0 209 22 45 2013-10-01T02:00:00Z
144343 2013 9 30 2240.0 2250 -10.0 2347.0 7 -20.0 B6 2002 N281JB JFK BUF 52.0 301 22 50 2013-10-01T02:00:00Z
144344 2013 9 30 2241.0 2246 -5.0 2345.0 1 -16.0 B6 486 N346JB JFK ROC 47.0 264 22 46 2013-10-01T02:00:00Z
144345 2013 9 30 2349.0 2359 -10.0 325.0 350 -25.0 B6 745 N516JB JFK PSE 196.0 1617 23 59 2013-10-01T03:00:00Z

144346 rows × 19 columns

Arranging#

Filtering is particularly helpful when combined with sorting; we can sort on any column using the function gr.tf_arrange(). For instance, the following code sorts by the "distance", pulling the shortest flights to the top of the DataFrame:

# NOTE: No need to edit
(
    df_flights
    # Sorts from smallest to largest
    >> gr.tf_arrange(DF["distance"])
    # Inspect the route
    >> gr.tf_select("distance", "origin", "dest", gr.everything())
)
distance origin dest year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum air_time hour minute time_hour
0 17 EWR LGA 2013 7 27 NaN 106 NaN NaN 245 NaN US 1632 NaN NaN 1 6 2013-07-27T05:00:00Z
1 80 EWR PHL 2013 1 4 1240.0 1200 40.0 1333.0 1306 27.0 EV 4193 N14972 30.0 12 0 2013-01-04T17:00:00Z
2 80 EWR PHL 2013 1 19 1617.0 1617 0.0 1722.0 1722 0.0 EV 4616 N12540 34.0 16 17 2013-01-19T21:00:00Z
3 80 EWR PHL 2013 2 1 2128.0 2129 -1.0 2216.0 2224 -8.0 EV 4619 N13969 24.0 21 29 2013-02-02T02:00:00Z
4 80 EWR PHL 2013 1 23 2128.0 2129 -1.0 2221.0 2224 -3.0 EV 4619 N12135 23.0 21 29 2013-01-24T02:00:00Z
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
336771 4983 JFK HNL 2013 12 18 928.0 930 -2.0 1543.0 1535 8.0 HA 51 N395HA 641.0 9 30 2013-12-18T14:00:00Z
336772 4983 JFK HNL 2013 6 2 956.0 1000 -4.0 1442.0 1435 7.0 HA 51 N383HA 617.0 10 0 2013-06-02T14:00:00Z
336773 4983 JFK HNL 2013 3 17 1006.0 1000 6.0 1607.0 1530 37.0 HA 51 N380HA 686.0 10 0 2013-03-17T14:00:00Z
336774 4983 JFK HNL 2013 5 28 953.0 1000 -7.0 1447.0 1500 -13.0 HA 51 N385HA 631.0 10 0 2013-05-28T14:00:00Z
336775 4983 JFK HNL 2013 8 11 950.0 1000 -10.0 1438.0 1440 -2.0 HA 51 N391HA 628.0 10 0 2013-08-11T14:00:00Z

336776 rows × 19 columns

We can also reverse the order of the sort with the gr.desc() helper, as shown below:

# NOTE: No need to edit
(
    df_flights
    # Sorts from largest to smallest (*desc*ending)
    >> gr.tf_arrange(gr.desc(DF["distance"]))
    # Inspect the route
    >> gr.tf_select("distance", "origin", "dest", gr.everything())
)
distance origin dest year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum air_time hour minute time_hour
0 4983 JFK HNL 2013 5 14 959.0 1000 -1.0 1433.0 1500 -27.0 HA 51 N388HA 608.0 10 0 2013-05-14T14:00:00Z
1 4983 JFK HNL 2013 1 11 855.0 900 -5.0 1442.0 1530 -48.0 HA 51 N383HA 613.0 9 0 2013-01-11T14:00:00Z
2 4983 JFK HNL 2013 4 6 957.0 1000 -3.0 1510.0 1510 0.0 HA 51 N382HA 638.0 10 0 2013-04-06T14:00:00Z
3 4983 JFK HNL 2013 2 15 858.0 900 -2.0 1436.0 1540 -64.0 HA 51 N382HA 611.0 9 0 2013-02-15T14:00:00Z
4 4983 JFK HNL 2013 7 24 958.0 1000 -2.0 1435.0 1430 5.0 HA 51 N384HA 609.0 10 0 2013-07-24T14:00:00Z
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
336771 80 EWR PHL 2013 3 2 1926.0 1929 -3.0 2011.0 2023 -12.0 EV 4457 N21144 30.0 19 29 2013-03-03T00:00:00Z
336772 80 EWR PHL 2013 1 27 2128.0 2129 -1.0 2213.0 2224 -11.0 EV 4619 N13969 29.0 21 29 2013-01-28T02:00:00Z
336773 80 EWR PHL 2013 2 3 2153.0 2129 24.0 2247.0 2224 23.0 EV 4619 N13913 21.0 21 29 2013-02-04T02:00:00Z
336774 80 EWR PHL 2013 1 9 2126.0 2129 -3.0 2217.0 2224 -7.0 EV 4619 N17560 27.0 21 29 2013-01-10T02:00:00Z
336775 17 EWR LGA 2013 7 27 NaN 106 NaN NaN 245 NaN US 1632 NaN NaN 1 6 2013-07-27T05:00:00Z

336776 rows × 19 columns

q8 Find the earliest departures#

Find the top 10 earliest departures. How early did these depart? Do these early departures have anything in common?

Hint: You will need to combine functions to accomplish this.

# TASK: Find the top 10 earliest departures
(
    df_flights
    >> gr.tf_arrange(DF["dep_delay"])
    >> gr.tf_head(10)
    >> gr.tf_select("dep_delay", "origin", "dest", gr.everything())
)
dep_delay origin dest year month day dep_time sched_dep_time arr_time sched_arr_time arr_delay carrier flight tailnum air_time distance hour minute time_hour
0 -43.0 JFK DEN 2013 12 7 2040.0 2123 40.0 2352 48.0 B6 97 N592JB 265.0 1626 21 23 2013-12-08T02:00:00Z
1 -33.0 LGA MSY 2013 2 3 2022.0 2055 2240.0 2338 -58.0 DL 1715 N612DL 162.0 1183 20 55 2013-02-04T01:00:00Z
2 -32.0 LGA IAD 2013 11 10 1408.0 1440 1549.0 1559 -10.0 EV 5713 N825AS 52.0 229 14 40 2013-11-10T19:00:00Z
3 -30.0 LGA TPA 2013 1 11 1900.0 1930 2233.0 2243 -10.0 DL 1435 N934DL 139.0 1010 19 30 2013-01-12T00:00:00Z
4 -27.0 LGA DEN 2013 1 29 1703.0 1730 1947.0 1957 -10.0 F9 837 N208FR 250.0 1620 17 30 2013-01-29T22:00:00Z
5 -26.0 LGA DTW 2013 8 9 729.0 755 1002.0 955 7.0 MQ 3478 N711MQ 88.0 502 7 55 2013-08-09T11:00:00Z
6 -25.0 LGA DTW 2013 3 30 2030.0 2055 2213.0 2250 -37.0 MQ 4573 N725MQ 87.0 502 20 55 2013-03-31T00:00:00Z
7 -25.0 EWR TYS 2013 10 23 1907.0 1932 2143.0 2143 0.0 EV 4361 N13994 111.0 631 19 32 2013-10-23T23:00:00Z
8 -24.0 LGA STL 2013 9 18 1631.0 1655 1812.0 1845 -33.0 AA 2223 N4XXAA 125.0 888 16 55 2013-09-18T20:00:00Z
9 -24.0 JFK BUF 2013 3 2 1431.0 1455 1601.0 1631 -30.0 9E 3318 N929XJ 55.0 301 14 55 2013-03-02T19:00:00Z

Isolating to answer questions#

Before we close this exercise, let’s use these data isolation tools to answer a real question about the dataset:

q9 What are these data for?#

What are these data for? In particular, in what way are they “focused on the NYC area”? Complete the following tasks, and answer the questions under observations below.

Hint: You might find it helpful to use the or operator using the symbol |. The or keyword does not work inside a gr.tf_filter().

# TASK: Filter to only those cases where the destination airport
# is one of "JFK", "LGA", or "EWR"
df_q9dest = (
    df_flights
    >> gr.tf_filter(
        (DF["dest"] == "JFK") |
        (DF["dest"] == "LGA") |
        (DF["dest"] == "EWR")
    )
)

# NOTE: No need to edit; use this to check your work
assert \
    df_q9dest.shape[0] == 1, \
    "Incorrect filter"

df_q9dest 
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
0 2013 7 27 NaN 106 NaN NaN 245 NaN US 1632 NaN EWR LGA NaN 17 1 6 2013-07-27T05:00:00Z
# TASK: Filter to only those cases where the origin airport
# is one of "JFK", "LGA", or "EWR"
df_q9origin = (
    df_flights
    >> gr.tf_filter(
        (DF["origin"] == "JFK") |
        (DF["origin"] == "LGA") |
        (DF["origin"] == "EWR")
    )
)

# NOTE: No need to edit; use this to check your work
assert \
    df_q9origin.shape[0] == df_flights.shape[0], \
    "Incorrect filter"

df_q9origin 
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
0 2013 1 1 517.0 515 2.0 830.0 819 11.0 UA 1545 N14228 EWR IAH 227.0 1400 5 15 2013-01-01T10:00:00Z
1 2013 1 1 533.0 529 4.0 850.0 830 20.0 UA 1714 N24211 LGA IAH 227.0 1416 5 29 2013-01-01T10:00:00Z
2 2013 1 1 542.0 540 2.0 923.0 850 33.0 AA 1141 N619AA JFK MIA 160.0 1089 5 40 2013-01-01T10:00:00Z
3 2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 B6 725 N804JB JFK BQN 183.0 1576 5 45 2013-01-01T10:00:00Z
4 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL 461 N668DN LGA ATL 116.0 762 6 0 2013-01-01T11:00:00Z
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
336771 2013 9 30 NaN 1455 NaN NaN 1634 NaN 9E 3393 NaN JFK DCA NaN 213 14 55 2013-09-30T18:00:00Z
336772 2013 9 30 NaN 2200 NaN NaN 2312 NaN 9E 3525 NaN LGA SYR NaN 198 22 0 2013-10-01T02:00:00Z
336773 2013 9 30 NaN 1210 NaN NaN 1330 NaN MQ 3461 N535MQ LGA BNA NaN 764 12 10 2013-09-30T16:00:00Z
336774 2013 9 30 NaN 1159 NaN NaN 1344 NaN MQ 3572 N511MQ LGA CLE NaN 419 11 59 2013-09-30T15:00:00Z
336775 2013 9 30 NaN 840 NaN NaN 1020 NaN MQ 3531 N839MQ LGA RDU NaN 431 8 40 2013-09-30T12:00:00Z

336776 rows × 19 columns

Observations

  • How many rows have either “JFK”, “LGA”, or “EWR” as their destination?

    • Just one!

  • How many rows have either “JFK”, “LGA”, or “EWR” as their origin?

    • All of the rows have “JFK”, “LGA”, or “EWR” as their origin

  • Would we be answer questions related to flights entering the NYC area using this dataset?

    • Nope! We have virtually no data on those flights.

  • In what sense is this dataset “focused on NYC”?

    • This dataset is focused on flights departing from NYC.

Aside: Data are not just numbers. Data are numbers with context. Every dataset is put together for some reason. This reason will inform what observations (rows) and variables (columns) are in the data, and which are not in the data. Conversely, thinking carefully about what data a person or organization bothered to collect—and what they ignored—can tell you something about the perspective of those who collected the data. Thinking about these issues is partly what separates data science from programming or machine learning. (end-rant)