Data: Isolating Data#
Purpose: One of the keys to a successful analysis is the ability to focus on particular topics. When analyzing a dataset, our ability to focus is tied to our facility at isolating data. In this exercise, you will practice isolating columns with tf_select(), picking specific rows with tf_filter(), and sorting your data with tf_arrange() to see what rises to the top.
Aside: The data-management verbs in grama are heavily inspired by the dplyr package in the R programming language.
Setup#
import grama as gr
DF = gr.Intention()
%matplotlib inline
We’ll use the nycflights13 package in this exercise: this is a dataset of flights involving the New York City area during 2013.
from nycflights13 import flights as df_flights
df_flights
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
336771 | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | NaN | JFK | DCA | NaN | 213 | 14 | 55 | 2013-09-30T18:00:00Z |
336772 | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 9E | 3525 | NaN | LGA | SYR | NaN | 198 | 22 | 0 | 2013-10-01T02:00:00Z |
336773 | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | MQ | 3461 | N535MQ | LGA | BNA | NaN | 764 | 12 | 10 | 2013-09-30T16:00:00Z |
336774 | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | MQ | 3572 | N511MQ | LGA | CLE | NaN | 419 | 11 | 59 | 2013-09-30T15:00:00Z |
336775 | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | MQ | 3531 | N839MQ | LGA | RDU | NaN | 431 | 8 | 40 | 2013-09-30T12:00:00Z |
336776 rows × 19 columns
The DataFrame Object#
The variable df_flights above is a DataFrame, a structure for storing tabular data in Python.
Aside: The DataFrame class is provided by the Pandas package, but we will use Grama to work with DataFrames.
Head and Tail#
Looking at an entire DataFrame is usually overwhelming; it is useful to be able to focus on a small subset of the data. One of our most basic tools is to get the first or last few rows of a DataFrame, which we can do with the methods DataFrame.head(n) and DataFrame.tail(n). These methods are called with the following syntax:
df_flights.head(10)
q1 Get the tail of a DataFrame#
Get the last 10 rows of df_flights.
# TASK: Get the last 10 rows of df_flights
df_flights.tail(10)
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
336766 | 2013 | 9 | 30 | 2240.0 | 2250 | -10.0 | 2347.0 | 7 | -20.0 | B6 | 2002 | N281JB | JFK | BUF | 52.0 | 301 | 22 | 50 | 2013-10-01T02:00:00Z |
336767 | 2013 | 9 | 30 | 2241.0 | 2246 | -5.0 | 2345.0 | 1 | -16.0 | B6 | 486 | N346JB | JFK | ROC | 47.0 | 264 | 22 | 46 | 2013-10-01T02:00:00Z |
336768 | 2013 | 9 | 30 | 2307.0 | 2255 | 12.0 | 2359.0 | 2358 | 1.0 | B6 | 718 | N565JB | JFK | BOS | 33.0 | 187 | 22 | 55 | 2013-10-01T02:00:00Z |
336769 | 2013 | 9 | 30 | 2349.0 | 2359 | -10.0 | 325.0 | 350 | -25.0 | B6 | 745 | N516JB | JFK | PSE | 196.0 | 1617 | 23 | 59 | 2013-10-01T03:00:00Z |
336770 | 2013 | 9 | 30 | NaN | 1842 | NaN | NaN | 2019 | NaN | EV | 5274 | N740EV | LGA | BNA | NaN | 764 | 18 | 42 | 2013-09-30T22:00:00Z |
336771 | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | NaN | JFK | DCA | NaN | 213 | 14 | 55 | 2013-09-30T18:00:00Z |
336772 | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 9E | 3525 | NaN | LGA | SYR | NaN | 198 | 22 | 0 | 2013-10-01T02:00:00Z |
336773 | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | MQ | 3461 | N535MQ | LGA | BNA | NaN | 764 | 12 | 10 | 2013-09-30T16:00:00Z |
336774 | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | MQ | 3572 | N511MQ | LGA | CLE | NaN | 419 | 11 | 59 | 2013-09-30T15:00:00Z |
336775 | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | MQ | 3531 | N839MQ | LGA | RDU | NaN | 431 | 8 | 40 | 2013-09-30T12:00:00Z |
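The same idea can be seen on a small, self-contained example. This sketch uses plain pandas with a made-up toy DataFrame, so it does not require grama or the flights data:

```python
import pandas as pd

# A small toy DataFrame to illustrate head() and tail()
df_toy = pd.DataFrame({"x": range(10)})

df_first = df_toy.head(3)  # first 3 rows: x = 0, 1, 2
df_last = df_toy.tail(3)   # last 3 rows:  x = 7, 8, 9
```
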
Piping#
Often, when carrying out studies with data, we want to perform multiple operations on the same dataset. We could do this by assigning intermediate variables, such as a temporary DataFrame:
# NOTE: No need to edit
df_tmp = df_flights.head(10) # Temporary DataFrame
df_tmp.tail(5)
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 2013 | 1 | 1 | 554.0 | 558 | -4.0 | 740.0 | 728 | 12.0 | UA | 1696 | N39463 | EWR | ORD | 150.0 | 719 | 5 | 58 | 2013-01-01T10:00:00Z |
6 | 2013 | 1 | 1 | 555.0 | 600 | -5.0 | 913.0 | 854 | 19.0 | B6 | 507 | N516JB | EWR | FLL | 158.0 | 1065 | 6 | 0 | 2013-01-01T11:00:00Z |
7 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 709.0 | 723 | -14.0 | EV | 5708 | N829AS | LGA | IAD | 53.0 | 229 | 6 | 0 | 2013-01-01T11:00:00Z |
8 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 838.0 | 846 | -8.0 | B6 | 79 | N593JB | JFK | MCO | 140.0 | 944 | 6 | 0 | 2013-01-01T11:00:00Z |
9 | 2013 | 1 | 1 | 558.0 | 600 | -2.0 | 753.0 | 745 | 8.0 | AA | 301 | N3ALAA | LGA | ORD | 138.0 | 733 | 6 | 0 | 2013-01-01T11:00:00Z |
Alternatively, we could chain together calls:
# NOTE: No need to edit
df_flights.head(10).tail(5)
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 2013 | 1 | 1 | 554.0 | 558 | -4.0 | 740.0 | 728 | 12.0 | UA | 1696 | N39463 | EWR | ORD | 150.0 | 719 | 5 | 58 | 2013-01-01T10:00:00Z |
6 | 2013 | 1 | 1 | 555.0 | 600 | -5.0 | 913.0 | 854 | 19.0 | B6 | 507 | N516JB | EWR | FLL | 158.0 | 1065 | 6 | 0 | 2013-01-01T11:00:00Z |
7 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 709.0 | 723 | -14.0 | EV | 5708 | N829AS | LGA | IAD | 53.0 | 229 | 6 | 0 | 2013-01-01T11:00:00Z |
8 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 838.0 | 846 | -8.0 | B6 | 79 | N593JB | JFK | MCO | 140.0 | 944 | 6 | 0 | 2013-01-01T11:00:00Z |
9 | 2013 | 1 | 1 | 558.0 | 600 | -2.0 | 753.0 | 745 | 8.0 | AA | 301 | N3ALAA | LGA | ORD | 138.0 | 733 | 6 | 0 | 2013-01-01T11:00:00Z |
Rather than either of these approaches, grama lets us form a data pipeline using the pipe operator >>. The following code demonstrates the pipe operator in action:
# NOTE: No need to edit
(
df_flights
>> gr.tf_head(10)
>> gr.tf_tail(5)
)
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 2013 | 1 | 1 | 554.0 | 558 | -4.0 | 740.0 | 728 | 12.0 | UA | 1696 | N39463 | EWR | ORD | 150.0 | 719 | 5 | 58 | 2013-01-01T10:00:00Z |
6 | 2013 | 1 | 1 | 555.0 | 600 | -5.0 | 913.0 | 854 | 19.0 | B6 | 507 | N516JB | EWR | FLL | 158.0 | 1065 | 6 | 0 | 2013-01-01T11:00:00Z |
7 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 709.0 | 723 | -14.0 | EV | 5708 | N829AS | LGA | IAD | 53.0 | 229 | 6 | 0 | 2013-01-01T11:00:00Z |
8 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 838.0 | 846 | -8.0 | B6 | 79 | N593JB | JFK | MCO | 140.0 | 944 | 6 | 0 | 2013-01-01T11:00:00Z |
9 | 2013 | 1 | 1 | 558.0 | 600 | -2.0 | 753.0 | 745 | 8.0 | AA | 301 | N3ALAA | LGA | ORD | 138.0 | 733 | 6 | 0 | 2013-01-01T11:00:00Z |
It’s useful to think of the pipe operator as the English word “then”. This allows us to translate code:
(
df_flights
>> gr.tf_head(10)
>> gr.tf_tail(5)
)
into something looking like a plain-language sentence
(
Start with df_flights
"then" take the first 10
"then" take the last 5.
)
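Plain pandas supports a similar "then" style of chaining through its .pipe() method. This is a pandas-only sketch on a made-up toy DataFrame, not grama's pipe operator:

```python
import pandas as pd

df_toy = pd.DataFrame({"x": range(20)})

# Reads much like the grama pipeline:
# "start with df_toy, then take the first 10, then take the last 5"
df_result = (
    df_toy
    .pipe(pd.DataFrame.head, 10)
    .pipe(pd.DataFrame.tail, 5)
)
```
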
Aside: Most of the grama functions we’ll use in this exercise start with the tf_ prefix; this means they take in a DataFrame and return a DataFrame (they transform data).
q2 Convert to piped code#
Convert the following code to a pipe-enabled version.
Note: The pipe-enabled version of DataFrame.head() is tf_head().
# TASK: Convert the following code to pipe-enabled form
df_flights.head(10)
(
df_flights
>> gr.tf_head(10)
)
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
5 | 2013 | 1 | 1 | 554.0 | 558 | -4.0 | 740.0 | 728 | 12.0 | UA | 1696 | N39463 | EWR | ORD | 150.0 | 719 | 5 | 58 | 2013-01-01T10:00:00Z |
6 | 2013 | 1 | 1 | 555.0 | 600 | -5.0 | 913.0 | 854 | 19.0 | B6 | 507 | N516JB | EWR | FLL | 158.0 | 1065 | 6 | 0 | 2013-01-01T11:00:00Z |
7 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 709.0 | 723 | -14.0 | EV | 5708 | N829AS | LGA | IAD | 53.0 | 229 | 6 | 0 | 2013-01-01T11:00:00Z |
8 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 838.0 | 846 | -8.0 | B6 | 79 | N593JB | JFK | MCO | 140.0 | 944 | 6 | 0 | 2013-01-01T11:00:00Z |
9 | 2013 | 1 | 1 | 558.0 | 600 | -2.0 | 753.0 | 745 | 8.0 | AA | 301 | N3ALAA | LGA | ORD | 138.0 | 733 | 6 | 0 | 2013-01-01T11:00:00Z |
Column Selection#
The grama function gr.tf_select() allows us to select particular columns, which is helpful for getting a more “focused” view of a dataset.
q3 Select particular columns#
Select the columns “month”, “day”, “origin”, and “dest”.
## TASK: Select the columns "month", "day", "origin", and "dest"
(
df_flights
>> gr.tf_select("month", "day", "origin", "dest")
)
month | day | origin | dest | |
---|---|---|---|---|
0 | 1 | 1 | EWR | IAH |
1 | 1 | 1 | LGA | IAH |
2 | 1 | 1 | JFK | MIA |
3 | 1 | 1 | JFK | BQN |
4 | 1 | 1 | LGA | ATL |
... | ... | ... | ... | ... |
336771 | 9 | 30 | JFK | DCA |
336772 | 9 | 30 | LGA | SYR |
336773 | 9 | 30 | LGA | BNA |
336774 | 9 | 30 | LGA | CLE |
336775 | 9 | 30 | LGA | RDU |
336776 rows × 4 columns
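The equivalent operation in plain pandas is indexing with a list of column names. A minimal sketch with a made-up toy DataFrame:

```python
import pandas as pd

df_toy = pd.DataFrame({
    "month": [1, 1], "day": [1, 2],
    "origin": ["EWR", "JFK"], "dest": ["IAH", "MIA"],
    "extra": [0, 0],
})

# Select a subset of columns by name; the result is a new DataFrame
df_sel = df_toy[["month", "day", "origin", "dest"]]
```
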
Selection Helpers#
The gr.tf_select() function is helpful on its own, but it becomes extremely powerful with a few selection helpers. Rather than specifying exact column names, we can use helpers to match names that satisfy different criteria.
Here are a few of the most important selection helpers:
Helper | Selects |
---|---|
gr.starts_with("s") | Columns that start with the string "s" |
gr.ends_with("s") | Columns that end with the string "s" |
gr.contains("s") | Columns that contain the string "s" |
gr.everything() | All columns not already selected |
For instance, the following code will select all the columns whose name starts with "dep_".
# NOTE: No need to edit
(
df_flights
>> gr.tf_select(gr.starts_with("dep_"))
)
dep_time | dep_delay | |
---|---|---|
0 | 517.0 | 2.0 |
1 | 533.0 | 4.0 |
2 | 542.0 | 2.0 |
3 | 544.0 | -1.0 |
4 | 554.0 | -6.0 |
... | ... | ... |
336771 | NaN | NaN |
336772 | NaN | NaN |
336773 | NaN | NaN |
336774 | NaN | NaN |
336775 | NaN | NaN |
336776 rows × 2 columns
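Plain pandas can express the same "starts with" selection by testing column names directly. A minimal sketch on a made-up toy DataFrame:

```python
import pandas as pd

df_toy = pd.DataFrame({
    "dep_time": [517.0], "dep_delay": [2.0],
    "arr_time": [830.0], "carrier": ["UA"],
})

# Keep only the columns whose names start with "dep_"
df_dep = df_toy.loc[:, df_toy.columns.str.startswith("dep_")]
```
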
q4 Match columns#
Select only those columns whose name ends with "_time".
## TASK: Select only those columns whose name ends with "_time"
(
df_flights
>> gr.tf_select(gr.ends_with("_time"))
)
dep_time | sched_dep_time | arr_time | sched_arr_time | air_time | |
---|---|---|---|---|---|
0 | 517.0 | 515 | 830.0 | 819 | 227.0 |
1 | 533.0 | 529 | 850.0 | 830 | 227.0 |
2 | 542.0 | 540 | 923.0 | 850 | 160.0 |
3 | 544.0 | 545 | 1004.0 | 1022 | 183.0 |
4 | 554.0 | 600 | 812.0 | 837 | 116.0 |
... | ... | ... | ... | ... | ... |
336771 | NaN | 1455 | NaN | 1634 | NaN |
336772 | NaN | 2200 | NaN | 2312 | NaN |
336773 | NaN | 1210 | NaN | 1330 | NaN |
336774 | NaN | 1159 | NaN | 1344 | NaN |
336775 | NaN | 840 | NaN | 1020 | NaN |
336776 rows × 5 columns
Re-arranging columns with gr.everything()#
The gr.everything() helper may seem silly, but it is actually extremely useful: since it selects all columns not already selected, we can use it to re-arrange the columns in a DataFrame for a more convenient view. For instance, we can bring a few columns closer together to aid in column comparisons, as the following code demonstrates:
## NOTE: No need to edit
(
df_flights
>> gr.tf_select("origin", "dest", "distance", gr.everything())
)
origin | dest | distance | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | air_time | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | EWR | IAH | 1400 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | 227.0 | 5 | 15 | 2013-01-01T10:00:00Z |
1 | LGA | IAH | 1416 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | 227.0 | 5 | 29 | 2013-01-01T10:00:00Z |
2 | JFK | MIA | 1089 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | 160.0 | 5 | 40 | 2013-01-01T10:00:00Z |
3 | JFK | BQN | 1576 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | 183.0 | 5 | 45 | 2013-01-01T10:00:00Z |
4 | LGA | ATL | 762 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | 116.0 | 6 | 0 | 2013-01-01T11:00:00Z |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
336771 | JFK | DCA | 213 | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | NaN | NaN | 14 | 55 | 2013-09-30T18:00:00Z |
336772 | LGA | SYR | 198 | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 9E | 3525 | NaN | NaN | 22 | 0 | 2013-10-01T02:00:00Z |
336773 | LGA | BNA | 764 | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | MQ | 3461 | N535MQ | NaN | 12 | 10 | 2013-09-30T16:00:00Z |
336774 | LGA | CLE | 419 | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | MQ | 3572 | N511MQ | NaN | 11 | 59 | 2013-09-30T15:00:00Z |
336775 | LGA | RDU | 431 | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | MQ | 3531 | N839MQ | NaN | 8 | 40 | 2013-09-30T12:00:00Z |
336776 rows × 19 columns
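The "bring these to the front, keep the rest" pattern can also be spelled out by hand in plain pandas, which makes clear what gr.everything() is doing. A sketch with a made-up toy DataFrame:

```python
import pandas as pd

df_toy = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4]})

# Bring "c" and "a" to the front, then append every remaining column,
# preserving the original relative order of the rest
front = ["c", "a"]
rest = [col for col in df_toy.columns if col not in front]
df_reordered = df_toy[front + rest]
```
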
q5 Re-arrange columns#
Re-arrange the columns to place dest, origin, carrier at the left, but retain all other columns.
# TASK: Bring "dest", "origin", and "carrier" to the left,
# but keep all other columns
(
df_flights
>> gr.tf_select("dest", "origin", "carrier", gr.everything())
)
dest | origin | carrier | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | flight | tailnum | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | IAH | EWR | UA | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | 1545 | N14228 | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
1 | IAH | LGA | UA | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | 1714 | N24211 | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
2 | MIA | JFK | AA | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | 1141 | N619AA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
3 | BQN | JFK | B6 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | 725 | N804JB | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
4 | ATL | LGA | DL | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | 461 | N668DN | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
336771 | DCA | JFK | 9E | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 3393 | NaN | NaN | 213 | 14 | 55 | 2013-09-30T18:00:00Z |
336772 | SYR | LGA | 9E | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 3525 | NaN | NaN | 198 | 22 | 0 | 2013-10-01T02:00:00Z |
336773 | BNA | LGA | MQ | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | 3461 | N535MQ | NaN | 764 | 12 | 10 | 2013-09-30T16:00:00Z |
336774 | CLE | LGA | MQ | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | 3572 | N511MQ | NaN | 419 | 11 | 59 | 2013-09-30T15:00:00Z |
336775 | RDU | LGA | MQ | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | 3531 | N839MQ | NaN | 431 | 8 | 40 | 2013-09-30T12:00:00Z |
336776 rows × 19 columns
Row Filtering#
Just as we can select particular columns, we can filter to obtain particular rows.
Accessing column values#
To access a single column of a DataFrame, we can use bracket [] notation:
# NOTE: No need to edit
(
df_flights["origin"]
)
0 EWR
1 LGA
2 JFK
3 JFK
4 LGA
...
336771 JFK
336772 LGA
336773 LGA
336774 LGA
336775 LGA
Name: origin, Length: 336776, dtype: object
Note that this returns a different datatype: a Pandas Series.
Making a comparison#
Remember that we have the following comparison operators:
Symbol | Compares |
---|---|
== | Equal to |
!= | Not equal to |
< | Less than |
<= | Less than or equal to |
> | Greater than |
>= | Greater than or equal to |
= | Error! Note that a single = is for assignment, not comparison |
With a single column, we can make a comparison against a desired value:
# NOTE: No need to edit
(
df_flights["origin"] == "JFK"
)
0 False
1 False
2 True
3 True
4 False
...
336771 True
336772 False
336773 False
336774 False
336775 False
Name: origin, Length: 336776, dtype: bool
The code above gives us True when the "origin" is "JFK", and False when it is not.
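A handy property of these boolean Series: True counts as 1 under arithmetic, so summing a comparison counts the matches. A minimal pandas sketch with made-up values:

```python
import pandas as pd

s = pd.Series(["EWR", "LGA", "JFK", "JFK", "LGA"])

mask = s == "JFK"            # elementwise comparison -> boolean Series
n_matches = int(mask.sum())  # True counts as 1, so sum() counts matches
```
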
Filtering using comparisons#
These True/False values are useful, because we can use them to filter to only those rows where the comparison yields True:
# NOTE: No need to edit
(
df_flights
>> gr.tf_filter(df_flights["origin"] == "JFK")
# Show that the "origin" is indeed "JFK"
>> gr.tf_select("origin", gr.everything())
)
origin | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | JFK | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
1 | JFK | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
2 | JFK | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 838.0 | 846 | -8.0 | B6 | 79 | N593JB | MCO | 140.0 | 944 | 6 | 0 | 2013-01-01T11:00:00Z |
3 | JFK | 2013 | 1 | 1 | 558.0 | 600 | -2.0 | 849.0 | 851 | -2.0 | B6 | 49 | N793JB | PBI | 149.0 | 1028 | 6 | 0 | 2013-01-01T11:00:00Z |
4 | JFK | 2013 | 1 | 1 | 558.0 | 600 | -2.0 | 853.0 | 856 | -3.0 | B6 | 71 | N657JB | TPA | 158.0 | 1005 | 6 | 0 | 2013-01-01T11:00:00Z |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
111274 | JFK | 2013 | 9 | 30 | 2240.0 | 2250 | -10.0 | 2347.0 | 7 | -20.0 | B6 | 2002 | N281JB | BUF | 52.0 | 301 | 22 | 50 | 2013-10-01T02:00:00Z |
111275 | JFK | 2013 | 9 | 30 | 2241.0 | 2246 | -5.0 | 2345.0 | 1 | -16.0 | B6 | 486 | N346JB | ROC | 47.0 | 264 | 22 | 46 | 2013-10-01T02:00:00Z |
111276 | JFK | 2013 | 9 | 30 | 2307.0 | 2255 | 12.0 | 2359.0 | 2358 | 1.0 | B6 | 718 | N565JB | BOS | 33.0 | 187 | 22 | 55 | 2013-10-01T02:00:00Z |
111277 | JFK | 2013 | 9 | 30 | 2349.0 | 2359 | -10.0 | 325.0 | 350 | -25.0 | B6 | 745 | N516JB | PSE | 196.0 | 1617 | 23 | 59 | 2013-10-01T03:00:00Z |
111278 | JFK | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | NaN | DCA | NaN | 213 | 14 | 55 | 2013-09-30T18:00:00Z |
111279 rows × 19 columns
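The same filtering works in plain pandas by indexing a DataFrame with a boolean Series. A minimal sketch with a made-up toy DataFrame:

```python
import pandas as pd

df_toy = pd.DataFrame({
    "origin": ["EWR", "LGA", "JFK", "JFK"],
    "dest": ["IAH", "IAH", "MIA", "BQN"],
})

# Indexing with a boolean Series keeps only the rows where the mask is True
df_jfk = df_toy[df_toy["origin"] == "JFK"]
```
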
q6 Find the early departures#
Use gr.tf_filter() to find all the flights that left early.
# TASK: Filter for only those rows with a negative "dep_delay"
(
df_flights
>> gr.tf_filter(df_flights["dep_delay"] < 0)
)
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
1 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
2 | 2013 | 1 | 1 | 554.0 | 558 | -4.0 | 740.0 | 728 | 12.0 | UA | 1696 | N39463 | EWR | ORD | 150.0 | 719 | 5 | 58 | 2013-01-01T10:00:00Z |
3 | 2013 | 1 | 1 | 555.0 | 600 | -5.0 | 913.0 | 854 | 19.0 | B6 | 507 | N516JB | EWR | FLL | 158.0 | 1065 | 6 | 0 | 2013-01-01T11:00:00Z |
4 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 709.0 | 723 | -14.0 | EV | 5708 | N829AS | LGA | IAD | 53.0 | 229 | 6 | 0 | 2013-01-01T11:00:00Z |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
183570 | 2013 | 9 | 30 | 2237.0 | 2245 | -8.0 | 2345.0 | 2353 | -8.0 | B6 | 234 | N318JB | JFK | BTV | 43.0 | 266 | 22 | 45 | 2013-10-01T02:00:00Z |
183571 | 2013 | 9 | 30 | 2240.0 | 2245 | -5.0 | 2334.0 | 2351 | -17.0 | B6 | 1816 | N354JB | JFK | SYR | 41.0 | 209 | 22 | 45 | 2013-10-01T02:00:00Z |
183572 | 2013 | 9 | 30 | 2240.0 | 2250 | -10.0 | 2347.0 | 7 | -20.0 | B6 | 2002 | N281JB | JFK | BUF | 52.0 | 301 | 22 | 50 | 2013-10-01T02:00:00Z |
183573 | 2013 | 9 | 30 | 2241.0 | 2246 | -5.0 | 2345.0 | 1 | -16.0 | B6 | 486 | N346JB | JFK | ROC | 47.0 | 264 | 22 | 46 | 2013-10-01T02:00:00Z |
183574 | 2013 | 9 | 30 | 2349.0 | 2359 | -10.0 | 325.0 | 350 | -25.0 | B6 | 745 | N516JB | JFK | PSE | 196.0 | 1617 | 23 | 59 | 2013-10-01T03:00:00Z |
183575 rows × 19 columns
The data pronoun DF#
Way back at the beginning of this notebook, you may have noticed this line of code:
DF = gr.Intention()
This assigned the “data pronoun” to the variable DF. This pronoun can be used inside grama functions to refer to the data as it is at any stage in the pipeline. Rather than:
(
df_flights
>> gr.tf_filter(df_flights["dep_delay"] < 0)
)
we instead write
(
df_flights
>> gr.tf_filter(DF["dep_delay"] < 0)
)
The data pronoun DF is really just an alias for the DataFrame, so we can still use bracket notation [].
The data pronoun DF is not just convenient; it is necessary to make some operations work! Imagine we wanted to find all cases that satisfy both dep_delay < 0 and arr_delay < 0. Consider the following code:
(
df_flights
# This filter works properly
>> gr.tf_filter(df_flights["dep_delay"] < 0)
# We now have fewer rows! The next filter will fail
>> gr.tf_filter(df_flights["arr_delay"] < 0)
)
The data pronoun DF allows us to refer to the data as it is in the pipeline; this resolves the issue of the second filter receiving fewer rows than its mask expects. You’ll get a chance to fix this code in the next task.
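To build intuition for how a data pronoun can defer evaluation, here is a toy class that records an expression now and evaluates it against a DataFrame later. This is an illustrative sketch only, not grama's actual implementation:

```python
import pandas as pd

# A minimal "data pronoun": each operation returns a new deferred
# expression (a function of a future DataFrame) instead of a value
class Pronoun:
    def __init__(self, fn=lambda df: df):
        self.fn = fn

    def __getitem__(self, col):
        return Pronoun(lambda df: self.fn(df)[col])

    def __lt__(self, value):
        return Pronoun(lambda df: self.fn(df) < value)

    def evaluate(self, df):
        return self.fn(df)

DF_toy = Pronoun()
expr = DF_toy["delay"] < 0  # nothing is computed yet

df_a = pd.DataFrame({"delay": [-5, 3, -1]})
mask = expr.evaluate(df_a)  # evaluated only now, against df_a
```

Because the expression is evaluated against whatever DataFrame reaches it, the same expression works at any stage of a pipeline, no matter how many rows remain.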
q7 Use DF to fix this code#
Use the data pronoun DF to fix the following code:
# TASK: Fix this code using the data pronoun DF
(
df_flights
>> gr.tf_filter(DF["dep_delay"] < 0)
>> gr.tf_filter(DF["arr_delay"] < 0)
)
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
1 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
2 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 709.0 | 723 | -14.0 | EV | 5708 | N829AS | LGA | IAD | 53.0 | 229 | 6 | 0 | 2013-01-01T11:00:00Z |
3 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 838.0 | 846 | -8.0 | B6 | 79 | N593JB | JFK | MCO | 140.0 | 944 | 6 | 0 | 2013-01-01T11:00:00Z |
4 | 2013 | 1 | 1 | 558.0 | 600 | -2.0 | 849.0 | 851 | -2.0 | B6 | 49 | N793JB | JFK | PBI | 149.0 | 1028 | 6 | 0 | 2013-01-01T11:00:00Z |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
144341 | 2013 | 9 | 30 | 2237.0 | 2245 | -8.0 | 2345.0 | 2353 | -8.0 | B6 | 234 | N318JB | JFK | BTV | 43.0 | 266 | 22 | 45 | 2013-10-01T02:00:00Z |
144342 | 2013 | 9 | 30 | 2240.0 | 2245 | -5.0 | 2334.0 | 2351 | -17.0 | B6 | 1816 | N354JB | JFK | SYR | 41.0 | 209 | 22 | 45 | 2013-10-01T02:00:00Z |
144343 | 2013 | 9 | 30 | 2240.0 | 2250 | -10.0 | 2347.0 | 7 | -20.0 | B6 | 2002 | N281JB | JFK | BUF | 52.0 | 301 | 22 | 50 | 2013-10-01T02:00:00Z |
144344 | 2013 | 9 | 30 | 2241.0 | 2246 | -5.0 | 2345.0 | 1 | -16.0 | B6 | 486 | N346JB | JFK | ROC | 47.0 | 264 | 22 | 46 | 2013-10-01T02:00:00Z |
144345 | 2013 | 9 | 30 | 2349.0 | 2359 | -10.0 | 325.0 | 350 | -25.0 | B6 | 745 | N516JB | JFK | PSE | 196.0 | 1617 | 23 | 59 | 2013-10-01T03:00:00Z |
144346 rows × 19 columns
Arranging#
Filtering is particularly helpful when combined with sorting; we can sort on any column using the function gr.tf_arrange(). For instance, the following code sorts by "distance", pulling the shortest flights to the top of the DataFrame:
# NOTE: No need to edit
(
df_flights
# Sorts from smallest to largest
>> gr.tf_arrange(DF["distance"])
# Inspect the route
>> gr.tf_select("distance", "origin", "dest", gr.everything())
)
distance | origin | dest | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | air_time | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 17 | EWR | LGA | 2013 | 7 | 27 | NaN | 106 | NaN | NaN | 245 | NaN | US | 1632 | NaN | NaN | 1 | 6 | 2013-07-27T05:00:00Z |
1 | 80 | EWR | PHL | 2013 | 1 | 4 | 1240.0 | 1200 | 40.0 | 1333.0 | 1306 | 27.0 | EV | 4193 | N14972 | 30.0 | 12 | 0 | 2013-01-04T17:00:00Z |
2 | 80 | EWR | PHL | 2013 | 1 | 19 | 1617.0 | 1617 | 0.0 | 1722.0 | 1722 | 0.0 | EV | 4616 | N12540 | 34.0 | 16 | 17 | 2013-01-19T21:00:00Z |
3 | 80 | EWR | PHL | 2013 | 2 | 1 | 2128.0 | 2129 | -1.0 | 2216.0 | 2224 | -8.0 | EV | 4619 | N13969 | 24.0 | 21 | 29 | 2013-02-02T02:00:00Z |
4 | 80 | EWR | PHL | 2013 | 1 | 23 | 2128.0 | 2129 | -1.0 | 2221.0 | 2224 | -3.0 | EV | 4619 | N12135 | 23.0 | 21 | 29 | 2013-01-24T02:00:00Z |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
336771 | 4983 | JFK | HNL | 2013 | 12 | 18 | 928.0 | 930 | -2.0 | 1543.0 | 1535 | 8.0 | HA | 51 | N395HA | 641.0 | 9 | 30 | 2013-12-18T14:00:00Z |
336772 | 4983 | JFK | HNL | 2013 | 6 | 2 | 956.0 | 1000 | -4.0 | 1442.0 | 1435 | 7.0 | HA | 51 | N383HA | 617.0 | 10 | 0 | 2013-06-02T14:00:00Z |
336773 | 4983 | JFK | HNL | 2013 | 3 | 17 | 1006.0 | 1000 | 6.0 | 1607.0 | 1530 | 37.0 | HA | 51 | N380HA | 686.0 | 10 | 0 | 2013-03-17T14:00:00Z |
336774 | 4983 | JFK | HNL | 2013 | 5 | 28 | 953.0 | 1000 | -7.0 | 1447.0 | 1500 | -13.0 | HA | 51 | N385HA | 631.0 | 10 | 0 | 2013-05-28T14:00:00Z |
336775 | 4983 | JFK | HNL | 2013 | 8 | 11 | 950.0 | 1000 | -10.0 | 1438.0 | 1440 | -2.0 | HA | 51 | N391HA | 628.0 | 10 | 0 | 2013-08-11T14:00:00Z |
336776 rows × 19 columns
We can also reverse the order of the sort with the gr.desc() helper, as shown below:
# NOTE: No need to edit
(
df_flights
# Sorts from largest to smallest (*desc*ending)
>> gr.tf_arrange(gr.desc(DF["distance"]))
# Inspect the route
>> gr.tf_select("distance", "origin", "dest", gr.everything())
)
distance | origin | dest | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | air_time | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4983 | JFK | HNL | 2013 | 5 | 14 | 959.0 | 1000 | -1.0 | 1433.0 | 1500 | -27.0 | HA | 51 | N388HA | 608.0 | 10 | 0 | 2013-05-14T14:00:00Z |
1 | 4983 | JFK | HNL | 2013 | 1 | 11 | 855.0 | 900 | -5.0 | 1442.0 | 1530 | -48.0 | HA | 51 | N383HA | 613.0 | 9 | 0 | 2013-01-11T14:00:00Z |
2 | 4983 | JFK | HNL | 2013 | 4 | 6 | 957.0 | 1000 | -3.0 | 1510.0 | 1510 | 0.0 | HA | 51 | N382HA | 638.0 | 10 | 0 | 2013-04-06T14:00:00Z |
3 | 4983 | JFK | HNL | 2013 | 2 | 15 | 858.0 | 900 | -2.0 | 1436.0 | 1540 | -64.0 | HA | 51 | N382HA | 611.0 | 9 | 0 | 2013-02-15T14:00:00Z |
4 | 4983 | JFK | HNL | 2013 | 7 | 24 | 958.0 | 1000 | -2.0 | 1435.0 | 1430 | 5.0 | HA | 51 | N384HA | 609.0 | 10 | 0 | 2013-07-24T14:00:00Z |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
336771 | 80 | EWR | PHL | 2013 | 3 | 2 | 1926.0 | 1929 | -3.0 | 2011.0 | 2023 | -12.0 | EV | 4457 | N21144 | 30.0 | 19 | 29 | 2013-03-03T00:00:00Z |
336772 | 80 | EWR | PHL | 2013 | 1 | 27 | 2128.0 | 2129 | -1.0 | 2213.0 | 2224 | -11.0 | EV | 4619 | N13969 | 29.0 | 21 | 29 | 2013-01-28T02:00:00Z |
336773 | 80 | EWR | PHL | 2013 | 2 | 3 | 2153.0 | 2129 | 24.0 | 2247.0 | 2224 | 23.0 | EV | 4619 | N13913 | 21.0 | 21 | 29 | 2013-02-04T02:00:00Z |
336774 | 80 | EWR | PHL | 2013 | 1 | 9 | 2126.0 | 2129 | -3.0 | 2217.0 | 2224 | -7.0 | EV | 4619 | N17560 | 27.0 | 21 | 29 | 2013-01-10T02:00:00Z |
336775 | 17 | EWR | LGA | 2013 | 7 | 27 | NaN | 106 | NaN | NaN | 245 | NaN | US | 1632 | NaN | NaN | 1 | 6 | 2013-07-27T05:00:00Z |
336776 rows × 19 columns
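The pandas counterpart of both sorts is sort_values(), where the ascending flag plays the role of gr.desc(). A minimal sketch with a made-up toy DataFrame:

```python
import pandas as pd

df_toy = pd.DataFrame({
    "distance": [1400, 80, 4983],
    "dest": ["IAH", "PHL", "HNL"],
})

# Ascending sort (shortest first)
df_asc = df_toy.sort_values("distance")

# Descending sort (longest first)
df_desc = df_toy.sort_values("distance", ascending=False)
```
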
q8 Find the earliest departures#
Find the top 10 earliest departures. How early did these depart? Do these early departures have anything in common?
Hint: You will need to combine functions to accomplish this.
# TASK: Find the top 10 earliest departures
(
df_flights
>> gr.tf_arrange(DF["dep_delay"])
>> gr.tf_head(10)
>> gr.tf_select("dep_delay", "origin", "dest", gr.everything())
)
dep_delay | origin | dest | year | month | day | dep_time | sched_dep_time | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -43.0 | JFK | DEN | 2013 | 12 | 7 | 2040.0 | 2123 | 40.0 | 2352 | 48.0 | B6 | 97 | N592JB | 265.0 | 1626 | 21 | 23 | 2013-12-08T02:00:00Z |
1 | -33.0 | LGA | MSY | 2013 | 2 | 3 | 2022.0 | 2055 | 2240.0 | 2338 | -58.0 | DL | 1715 | N612DL | 162.0 | 1183 | 20 | 55 | 2013-02-04T01:00:00Z |
2 | -32.0 | LGA | IAD | 2013 | 11 | 10 | 1408.0 | 1440 | 1549.0 | 1559 | -10.0 | EV | 5713 | N825AS | 52.0 | 229 | 14 | 40 | 2013-11-10T19:00:00Z |
3 | -30.0 | LGA | TPA | 2013 | 1 | 11 | 1900.0 | 1930 | 2233.0 | 2243 | -10.0 | DL | 1435 | N934DL | 139.0 | 1010 | 19 | 30 | 2013-01-12T00:00:00Z |
4 | -27.0 | LGA | DEN | 2013 | 1 | 29 | 1703.0 | 1730 | 1947.0 | 1957 | -10.0 | F9 | 837 | N208FR | 250.0 | 1620 | 17 | 30 | 2013-01-29T22:00:00Z |
5 | -26.0 | LGA | DTW | 2013 | 8 | 9 | 729.0 | 755 | 1002.0 | 955 | 7.0 | MQ | 3478 | N711MQ | 88.0 | 502 | 7 | 55 | 2013-08-09T11:00:00Z |
6 | -25.0 | LGA | DTW | 2013 | 3 | 30 | 2030.0 | 2055 | 2213.0 | 2250 | -37.0 | MQ | 4573 | N725MQ | 87.0 | 502 | 20 | 55 | 2013-03-31T00:00:00Z |
7 | -25.0 | EWR | TYS | 2013 | 10 | 23 | 1907.0 | 1932 | 2143.0 | 2143 | 0.0 | EV | 4361 | N13994 | 111.0 | 631 | 19 | 32 | 2013-10-23T23:00:00Z |
8 | -24.0 | LGA | STL | 2013 | 9 | 18 | 1631.0 | 1655 | 1812.0 | 1845 | -33.0 | AA | 2223 | N4XXAA | 125.0 | 888 | 16 | 55 | 2013-09-18T20:00:00Z |
9 | -24.0 | JFK | BUF | 2013 | 3 | 2 | 1431.0 | 1455 | 1601.0 | 1631 | -30.0 | 9E | 3318 | N929XJ | 55.0 | 301 | 14 | 55 | 2013-03-02T19:00:00Z |
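For reference, the same arrange-then-head pattern can be sketched in plain pandas, since grama pipelines operate on ordinary DataFrames. The toy frame below is hypothetical, just a stand-in for df_flights:

```python
import pandas as pd

# Hypothetical stand-in for df_flights (illustration only)
df = pd.DataFrame({
    "dep_delay": [2.0, -43.0, -33.0, 11.0],
    "origin": ["EWR", "JFK", "LGA", "JFK"],
})

# Equivalent of tf_arrange + tf_head: sort ascending, keep the first rows
top2 = df.sort_values("dep_delay").head(2)
```

The most negative dep_delay values (the earliest departures) rise to the top because sort_values() defaults to ascending order.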
Isolating to answer questions#
Before we close this exercise, let’s use these data isolation tools to answer a real question about the dataset:
q9 What are these data for?#
What are these data for? In particular, in what way are they “focused on the NYC area”? Complete the following tasks, and answer the questions under observations below.
Hint: You might find it helpful to use the or operator, written with the symbol |. The Python or keyword does not work inside gr.tf_filter().
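The reason for this hint is a pandas behavior: comparisons on a column produce a boolean Series, and only the | operator combines such Series elementwise. A minimal sketch with a toy Series (not the flights data):

```python
import pandas as pd

s = pd.Series(["JFK", "ATL", "EWR"])

# Elementwise "or" on boolean Series uses the `|` operator
mask = (s == "JFK") | (s == "EWR")

# By contrast, `(s == "JFK") or (s == "EWR")` raises a ValueError,
# because Python's `or` asks for a single truth value of the whole Series.
```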
# TASK: Filter to only those cases where the destination airport
# is one of "JFK", "LGA", or "EWR"
df_q9dest = (
df_flights
>> gr.tf_filter(
(DF["dest"] == "JFK") |
(DF["dest"] == "LGA") |
(DF["dest"] == "EWR")
)
)
# NOTE: No need to edit; use this to check your work
assert \
df_q9dest.shape[0] == 1, \
"Incorrect filter"
df_q9dest
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013 | 7 | 27 | NaN | 106 | NaN | NaN | 245 | NaN | US | 1632 | NaN | EWR | LGA | NaN | 17 | 1 | 6 | 2013-07-27T05:00:00Z |
# TASK: Filter to only those cases where the origin airport
# is one of "JFK", "LGA", or "EWR"
df_q9origin = (
df_flights
>> gr.tf_filter(
(DF["origin"] == "JFK") |
(DF["origin"] == "LGA") |
(DF["origin"] == "EWR")
)
)
# NOTE: No need to edit; use this to check your work
assert \
df_q9origin.shape[0] == df_flights.shape[0], \
"Incorrect filter"
df_q9origin
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
336771 | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | NaN | JFK | DCA | NaN | 213 | 14 | 55 | 2013-09-30T18:00:00Z |
336772 | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 9E | 3525 | NaN | LGA | SYR | NaN | 198 | 22 | 0 | 2013-10-01T02:00:00Z |
336773 | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | MQ | 3461 | N535MQ | LGA | BNA | NaN | 764 | 12 | 10 | 2013-09-30T16:00:00Z |
336774 | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | MQ | 3572 | N511MQ | LGA | CLE | NaN | 419 | 11 | 59 | 2013-09-30T15:00:00Z |
336775 | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | MQ | 3531 | N839MQ | LGA | RDU | NaN | 431 | 8 | 40 | 2013-09-30T12:00:00Z |
336776 rows × 19 columns
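As a quick cross-check, Series.isin() collapses the three chained | comparisons into one membership test. A plain-pandas sketch on a toy frame (hypothetical rows, not the real dataset):

```python
import pandas as pd

# Toy stand-in for df_flights (illustration only)
df = pd.DataFrame({
    "origin": ["JFK", "LGA", "EWR", "JFK"],
    "dest":   ["IAH", "JFK", "MIA", "BQN"],
})
nyc = ["JFK", "LGA", "EWR"]

n_origin = int(df["origin"].isin(nyc).sum())  # rows departing an NYC airport
n_dest = int(df["dest"].isin(nyc).sum())      # rows arriving at an NYC airport
```

In the real dataset the same counts are 336,776 (all rows) for origin and just 1 for destination, which is exactly what the two filters above reveal.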
Observations
How many rows have either “JFK”, “LGA”, or “EWR” as their destination?
Just one!
How many rows have either “JFK”, “LGA”, or “EWR” as their origin?
All of the rows have “JFK”, “LGA”, or “EWR” as their origin
Would we be able to answer questions related to flights entering the NYC area using this dataset?
Nope! We have virtually no data on those flights.
In what sense is this dataset “focused on NYC”?
This dataset is focused on flights departing from NYC.
Aside: Data are not just numbers. Data are numbers with context. Every dataset is put together for some reason. This reason will inform what observations (rows) and variables (columns) are in the data, and which are not in the data. Conversely, thinking carefully about what data a person or organization bothered to collect—and what they ignored—can tell you something about the perspective of those who collected the data. Thinking about these issues is partly what separates data science from programming or machine learning. (end-rant)