Data: Isolating Data#
Purpose: One of the keys to a successful analysis is the ability to focus on particular topics. When analyzing a dataset, our ability to focus is tied to our facility at isolating data. In this exercise, you will practice isolating columns with tf_select(), picking specific rows with tf_filter(), and sorting your data with tf_arrange() to see what rises to the top.
Aside: The data-management verbs in grama are heavily inspired by the dplyr package in the R programming language.
Setup#
import grama as gr
DF = gr.Intention()
%matplotlib inline
We’ll use the nycflights13 package in this exercise: this is a dataset of flights involving the New York City area during 2013.
from nycflights13 import flights as df_flights
df_flights
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
336771 | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | NaN | JFK | DCA | NaN | 213 | 14 | 55 | 2013-09-30T18:00:00Z |
336772 | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 9E | 3525 | NaN | LGA | SYR | NaN | 198 | 22 | 0 | 2013-10-01T02:00:00Z |
336773 | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | MQ | 3461 | N535MQ | LGA | BNA | NaN | 764 | 12 | 10 | 2013-09-30T16:00:00Z |
336774 | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | MQ | 3572 | N511MQ | LGA | CLE | NaN | 419 | 11 | 59 | 2013-09-30T15:00:00Z |
336775 | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | MQ | 3531 | N839MQ | LGA | RDU | NaN | 431 | 8 | 40 | 2013-09-30T12:00:00Z |
336776 rows × 19 columns
The DataFrame Object#
The variable df_flights above is a DataFrame, a structure for storing tabular data in Python.
Aside: The DataFrame class is provided by the Pandas package, but we will use Grama to work with DataFrames.
Head and Tail#
Looking at an entire DataFrame is usually overwhelming; it is useful to be able to focus on a small subset of the data. One of our most basic tools is to get the first or last few rows of a DataFrame, which we can do with the methods DataFrame.head(n) and DataFrame.tail(n). These methods are called with the following syntax:
df_flights.head(10)
q1 Get the tail of a DataFrame#
Get the last 10 rows of df_flights.
# TASK: Get the last 10 rows of df_flights
df_flights.tail(10)
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
336766 | 2013 | 9 | 30 | 2240.0 | 2250 | -10.0 | 2347.0 | 7 | -20.0 | B6 | 2002 | N281JB | JFK | BUF | 52.0 | 301 | 22 | 50 | 2013-10-01T02:00:00Z |
336767 | 2013 | 9 | 30 | 2241.0 | 2246 | -5.0 | 2345.0 | 1 | -16.0 | B6 | 486 | N346JB | JFK | ROC | 47.0 | 264 | 22 | 46 | 2013-10-01T02:00:00Z |
336768 | 2013 | 9 | 30 | 2307.0 | 2255 | 12.0 | 2359.0 | 2358 | 1.0 | B6 | 718 | N565JB | JFK | BOS | 33.0 | 187 | 22 | 55 | 2013-10-01T02:00:00Z |
336769 | 2013 | 9 | 30 | 2349.0 | 2359 | -10.0 | 325.0 | 350 | -25.0 | B6 | 745 | N516JB | JFK | PSE | 196.0 | 1617 | 23 | 59 | 2013-10-01T03:00:00Z |
336770 | 2013 | 9 | 30 | NaN | 1842 | NaN | NaN | 2019 | NaN | EV | 5274 | N740EV | LGA | BNA | NaN | 764 | 18 | 42 | 2013-09-30T22:00:00Z |
336771 | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | NaN | JFK | DCA | NaN | 213 | 14 | 55 | 2013-09-30T18:00:00Z |
336772 | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 9E | 3525 | NaN | LGA | SYR | NaN | 198 | 22 | 0 | 2013-10-01T02:00:00Z |
336773 | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | MQ | 3461 | N535MQ | LGA | BNA | NaN | 764 | 12 | 10 | 2013-09-30T16:00:00Z |
336774 | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | MQ | 3572 | N511MQ | LGA | CLE | NaN | 419 | 11 | 59 | 2013-09-30T15:00:00Z |
336775 | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | MQ | 3531 | N839MQ | LGA | RDU | NaN | 431 | 8 | 40 | 2013-09-30T12:00:00Z |
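The same idea can be seen on a small, self-contained example. This sketch uses plain pandas with a made-up toy DataFrame, so it does not require grama or the flights data:

```python
import pandas as pd

# A small toy DataFrame to illustrate head() and tail()
df_toy = pd.DataFrame({"x": range(10)})

df_first = df_toy.head(3)  # first 3 rows: x = 0, 1, 2
df_last = df_toy.tail(3)   # last 3 rows:  x = 7, 8, 9
```
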
Piping#
Often, when carrying out studies with data, we want to perform multiple operations on the same dataset. We could do this by assigning intermediate variables, such as a temporary DataFrame:
# NOTE: No need to edit
df_tmp = df_flights.head(10) # Temporary DataFrame
df_tmp.tail(5)
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 2013 | 1 | 1 | 554.0 | 558 | -4.0 | 740.0 | 728 | 12.0 | UA | 1696 | N39463 | EWR | ORD | 150.0 | 719 | 5 | 58 | 2013-01-01T10:00:00Z |
6 | 2013 | 1 | 1 | 555.0 | 600 | -5.0 | 913.0 | 854 | 19.0 | B6 | 507 | N516JB | EWR | FLL | 158.0 | 1065 | 6 | 0 | 2013-01-01T11:00:00Z |
7 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 709.0 | 723 | -14.0 | EV | 5708 | N829AS | LGA | IAD | 53.0 | 229 | 6 | 0 | 2013-01-01T11:00:00Z |
8 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 838.0 | 846 | -8.0 | B6 | 79 | N593JB | JFK | MCO | 140.0 | 944 | 6 | 0 | 2013-01-01T11:00:00Z |
9 | 2013 | 1 | 1 | 558.0 | 600 | -2.0 | 753.0 | 745 | 8.0 | AA | 301 | N3ALAA | LGA | ORD | 138.0 | 733 | 6 | 0 | 2013-01-01T11:00:00Z |
Alternatively, we could chain together calls:
# NOTE: No need to edit
df_flights.head(10).tail(5)
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 2013 | 1 | 1 | 554.0 | 558 | -4.0 | 740.0 | 728 | 12.0 | UA | 1696 | N39463 | EWR | ORD | 150.0 | 719 | 5 | 58 | 2013-01-01T10:00:00Z |
6 | 2013 | 1 | 1 | 555.0 | 600 | -5.0 | 913.0 | 854 | 19.0 | B6 | 507 | N516JB | EWR | FLL | 158.0 | 1065 | 6 | 0 | 2013-01-01T11:00:00Z |
7 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 709.0 | 723 | -14.0 | EV | 5708 | N829AS | LGA | IAD | 53.0 | 229 | 6 | 0 | 2013-01-01T11:00:00Z |
8 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 838.0 | 846 | -8.0 | B6 | 79 | N593JB | JFK | MCO | 140.0 | 944 | 6 | 0 | 2013-01-01T11:00:00Z |
9 | 2013 | 1 | 1 | 558.0 | 600 | -2.0 | 753.0 | 745 | 8.0 | AA | 301 | N3ALAA | LGA | ORD | 138.0 | 733 | 6 | 0 | 2013-01-01T11:00:00Z |
Rather than either of these approaches, grama lets us form a data pipeline using the pipe operator >>. The following code demonstrates the pipe operator in action:
# NOTE: No need to edit
(
df_flights
>> gr.tf_head(10)
>> gr.tf_tail(5)
)
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 2013 | 1 | 1 | 554.0 | 558 | -4.0 | 740.0 | 728 | 12.0 | UA | 1696 | N39463 | EWR | ORD | 150.0 | 719 | 5 | 58 | 2013-01-01T10:00:00Z |
6 | 2013 | 1 | 1 | 555.0 | 600 | -5.0 | 913.0 | 854 | 19.0 | B6 | 507 | N516JB | EWR | FLL | 158.0 | 1065 | 6 | 0 | 2013-01-01T11:00:00Z |
7 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 709.0 | 723 | -14.0 | EV | 5708 | N829AS | LGA | IAD | 53.0 | 229 | 6 | 0 | 2013-01-01T11:00:00Z |
8 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 838.0 | 846 | -8.0 | B6 | 79 | N593JB | JFK | MCO | 140.0 | 944 | 6 | 0 | 2013-01-01T11:00:00Z |
9 | 2013 | 1 | 1 | 558.0 | 600 | -2.0 | 753.0 | 745 | 8.0 | AA | 301 | N3ALAA | LGA | ORD | 138.0 | 733 | 6 | 0 | 2013-01-01T11:00:00Z |
It’s useful to think of the pipe operator as the English word “then”. This allows us to translate code:
(
df_flights
>> gr.tf_head(10)
>> gr.tf_tail(5)
)
into something looking like a plain-language sentence
(
Start with df_flights
"then" take the first 10
"then" take the last 5.
)
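Plain pandas supports a similar "then" style of chaining through its .pipe() method. This is a pandas-only sketch on a made-up toy DataFrame, not grama's pipe operator:

```python
import pandas as pd

df_toy = pd.DataFrame({"x": range(20)})

# Reads much like the grama pipeline:
# "start with df_toy, then take the first 10, then take the last 5"
df_result = (
    df_toy
    .pipe(pd.DataFrame.head, 10)
    .pipe(pd.DataFrame.tail, 5)
)
```
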
Aside: Most of the grama functions we’ll use in this exercise start with the tf_ prefix; this means they take in a DataFrame and return a DataFrame (they transform data).
q2 Convert to piped code#
Convert the following code to a pipe-enabled version.
Note: The pipe-enabled version of DataFrame.head() is tf_head().
# TASK: Convert the following code to pipe-enabled form
df_flights.head(10)
(
df_flights
>> gr.tf_head(10)
)
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
5 | 2013 | 1 | 1 | 554.0 | 558 | -4.0 | 740.0 | 728 | 12.0 | UA | 1696 | N39463 | EWR | ORD | 150.0 | 719 | 5 | 58 | 2013-01-01T10:00:00Z |
6 | 2013 | 1 | 1 | 555.0 | 600 | -5.0 | 913.0 | 854 | 19.0 | B6 | 507 | N516JB | EWR | FLL | 158.0 | 1065 | 6 | 0 | 2013-01-01T11:00:00Z |
7 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 709.0 | 723 | -14.0 | EV | 5708 | N829AS | LGA | IAD | 53.0 | 229 | 6 | 0 | 2013-01-01T11:00:00Z |
8 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 838.0 | 846 | -8.0 | B6 | 79 | N593JB | JFK | MCO | 140.0 | 944 | 6 | 0 | 2013-01-01T11:00:00Z |
9 | 2013 | 1 | 1 | 558.0 | 600 | -2.0 | 753.0 | 745 | 8.0 | AA | 301 | N3ALAA | LGA | ORD | 138.0 | 733 | 6 | 0 | 2013-01-01T11:00:00Z |
Column Selection#
The grama function gr.tf_select() allows us to select particular columns, which is helpful for getting a more “focused” view of a dataset.
q3 Select particular columns#
Select the columns “month”, “day”, “origin”, and “dest”.
## TASK: Select the columns "month", "day", "origin", and "dest"
(
df_flights
>> gr.tf_select("month", "day", "origin", "dest")
)
month | day | origin | dest | |
---|---|---|---|---|
0 | 1 | 1 | EWR | IAH |
1 | 1 | 1 | LGA | IAH |
2 | 1 | 1 | JFK | MIA |
3 | 1 | 1 | JFK | BQN |
4 | 1 | 1 | LGA | ATL |
... | ... | ... | ... | ... |
336771 | 9 | 30 | JFK | DCA |
336772 | 9 | 30 | LGA | SYR |
336773 | 9 | 30 | LGA | BNA |
336774 | 9 | 30 | LGA | CLE |
336775 | 9 | 30 | LGA | RDU |
336776 rows × 4 columns
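The equivalent operation in plain pandas is indexing with a list of column names. A minimal sketch with a made-up toy DataFrame:

```python
import pandas as pd

df_toy = pd.DataFrame({
    "month": [1, 1], "day": [1, 2],
    "origin": ["EWR", "JFK"], "dest": ["IAH", "MIA"],
    "extra": [0, 0],
})

# Select a subset of columns by name; the result is a new DataFrame
df_sel = df_toy[["month", "day", "origin", "dest"]]
```
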
Selection Helpers#
The gr.tf_select() function is helpful on its own, but it becomes extremely powerful with a few selection helpers. Rather than specifying exact column names, we can use helpers to match names that satisfy different criteria.
Here are a few of the most important selection helpers:
Helper | Selects |
---|---|
gr.starts_with("s") | Columns that start with the string "s" |
gr.ends_with("s") | Columns that end with the string "s" |
gr.contains("s") | Columns that contain the string "s" |
gr.everything() | All columns not already selected |
For instance, the following code will select all the columns whose name starts with "dep_".
# NOTE: No need to edit
(
df_flights
>> gr.tf_select(gr.starts_with("dep_"))
)
dep_time | dep_delay | |
---|---|---|
0 | 517.0 | 2.0 |
1 | 533.0 | 4.0 |
2 | 542.0 | 2.0 |
3 | 544.0 | -1.0 |
4 | 554.0 | -6.0 |
... | ... | ... |
336771 | NaN | NaN |
336772 | NaN | NaN |
336773 | NaN | NaN |
336774 | NaN | NaN |
336775 | NaN | NaN |
336776 rows × 2 columns
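Plain pandas can express the same "starts with" selection by testing column names directly. A minimal sketch on a made-up toy DataFrame:

```python
import pandas as pd

df_toy = pd.DataFrame({
    "dep_time": [517.0], "dep_delay": [2.0],
    "arr_time": [830.0], "carrier": ["UA"],
})

# Keep only the columns whose names start with "dep_"
df_dep = df_toy.loc[:, df_toy.columns.str.startswith("dep_")]
```
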
q4 Match columns#
Select only those columns whose name ends with "_time".
## TASK: Select only those columns whose name ends with "_time"
(
df_flights
>> gr.tf_select(gr.ends_with("_time"))
)
dep_time | sched_dep_time | arr_time | sched_arr_time | air_time | |
---|---|---|---|---|---|
0 | 517.0 | 515 | 830.0 | 819 | 227.0 |
1 | 533.0 | 529 | 850.0 | 830 | 227.0 |
2 | 542.0 | 540 | 923.0 | 850 | 160.0 |
3 | 544.0 | 545 | 1004.0 | 1022 | 183.0 |
4 | 554.0 | 600 | 812.0 | 837 | 116.0 |
... | ... | ... | ... | ... | ... |
336771 | NaN | 1455 | NaN | 1634 | NaN |
336772 | NaN | 2200 | NaN | 2312 | NaN |
336773 | NaN | 1210 | NaN | 1330 | NaN |
336774 | NaN | 1159 | NaN | 1344 | NaN |
336775 | NaN | 840 | NaN | 1020 | NaN |
336776 rows × 5 columns
Re-arranging columns with gr.everything()#
The gr.everything() helper may seem silly, but it is actually extremely useful: since it selects all columns not already selected, we can use it to re-arrange the columns in a DataFrame for a more convenient view. For instance, we can bring a few columns closer together to aid in column comparisons, as the following code demonstrates:
## NOTE: No need to edit
(
df_flights
>> gr.tf_select("origin", "dest", "distance", gr.everything())
)
origin | dest | distance | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | air_time | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | EWR | IAH | 1400 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | 227.0 | 5 | 15 | 2013-01-01T10:00:00Z |
1 | LGA | IAH | 1416 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | 227.0 | 5 | 29 | 2013-01-01T10:00:00Z |
2 | JFK | MIA | 1089 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | 160.0 | 5 | 40 | 2013-01-01T10:00:00Z |
3 | JFK | BQN | 1576 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | 183.0 | 5 | 45 | 2013-01-01T10:00:00Z |
4 | LGA | ATL | 762 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | 116.0 | 6 | 0 | 2013-01-01T11:00:00Z |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
336771 | JFK | DCA | 213 | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | NaN | NaN | 14 | 55 | 2013-09-30T18:00:00Z |
336772 | LGA | SYR | 198 | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 9E | 3525 | NaN | NaN | 22 | 0 | 2013-10-01T02:00:00Z |
336773 | LGA | BNA | 764 | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | MQ | 3461 | N535MQ | NaN | 12 | 10 | 2013-09-30T16:00:00Z |
336774 | LGA | CLE | 419 | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | MQ | 3572 | N511MQ | NaN | 11 | 59 | 2013-09-30T15:00:00Z |
336775 | LGA | RDU | 431 | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | MQ | 3531 | N839MQ | NaN | 8 | 40 | 2013-09-30T12:00:00Z |
336776 rows × 19 columns
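The "bring these to the front, keep the rest" pattern can also be spelled out by hand in plain pandas, which makes clear what gr.everything() is doing. A sketch with a made-up toy DataFrame:

```python
import pandas as pd

df_toy = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4]})

# Bring "c" and "a" to the front, then append every remaining column,
# preserving the original relative order of the rest
front = ["c", "a"]
rest = [col for col in df_toy.columns if col not in front]
df_reordered = df_toy[front + rest]
```
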
q5 Re-arrange columns#
Re-arrange the columns to place dest, origin, carrier at the left, but retain all other columns.
# TASK: Bring "dest", "origin", and "carrier" to the left,
# but keep all other columns
(
df_flights
>> gr.tf_select("dest", "origin", "carrier", gr.everything())
)
dest | origin | carrier | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | flight | tailnum | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | IAH | EWR | UA | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | 1545 | N14228 | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
1 | IAH | LGA | UA | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | 1714 | N24211 | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
2 | MIA | JFK | AA | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | 1141 | N619AA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
3 | BQN | JFK | B6 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | 725 | N804JB | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
4 | ATL | LGA | DL | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | 461 | N668DN | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
336771 | DCA | JFK | 9E | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 3393 | NaN | NaN | 213 | 14 | 55 | 2013-09-30T18:00:00Z |
336772 | SYR | LGA | 9E | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 3525 | NaN | NaN | 198 | 22 | 0 | 2013-10-01T02:00:00Z |
336773 | BNA | LGA | MQ | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | 3461 | N535MQ | NaN | 764 | 12 | 10 | 2013-09-30T16:00:00Z |
336774 | CLE | LGA | MQ | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | 3572 | N511MQ | NaN | 419 | 11 | 59 | 2013-09-30T15:00:00Z |
336775 | RDU | LGA | MQ | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | 3531 | N839MQ | NaN | 431 | 8 | 40 | 2013-09-30T12:00:00Z |
336776 rows × 19 columns
Row Filtering#
Just as we can select particular columns, we can filter to obtain particular rows.
Accessing column values#
To access a single column of a DataFrame, we can use bracket [] notation:
# NOTE: No need to edit
(
df_flights["origin"]
)
0 EWR
1 LGA
2 JFK
3 JFK
4 LGA
...
336771 JFK
336772 LGA
336773 LGA
336774 LGA
336775 LGA
Name: origin, Length: 336776, dtype: object
Note that this returns a different datatype: a Pandas Series.
Making a comparison#
Remember that we have the following comparison operators:
Symbol | Compares |
---|---|
== | Equal to |
!= | Not equal to |
< | Less than |
<= | Less than or equal to |
> | Greater than |
>= | Greater than or equal to |
= | Error! Note that a single = is for assignment, not comparison |
With a single column, we can make a comparison against a desired value:
# NOTE: No need to edit
(
df_flights["origin"] == "JFK"
)
0 False
1 False
2 True
3 True
4 False
...
336771 True
336772 False
336773 False
336774 False
336775 False
Name: origin, Length: 336776, dtype: bool
The code above gives us True when the "origin" is "JFK", and False when it is not.
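A handy property of these boolean Series: True counts as 1 under arithmetic, so summing a comparison counts the matches. A minimal pandas sketch with made-up values:

```python
import pandas as pd

s = pd.Series(["EWR", "LGA", "JFK", "JFK", "LGA"])

mask = s == "JFK"            # elementwise comparison -> boolean Series
n_matches = int(mask.sum())  # True counts as 1, so sum() counts matches
```
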
Filtering using comparisons#
These True/False values are useful, because we can use them to filter to only those rows where the comparison yields True:
# NOTE: No need to edit
(
df_flights
>> gr.tf_filter(df_flights["origin"] == "JFK")
# Show that the "origin" is indeed "JFK"
>> gr.tf_select("origin", gr.everything())
)
origin | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | JFK | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
1 | JFK | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
2 | JFK | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 838.0 | 846 | -8.0 | B6 | 79 | N593JB | MCO | 140.0 | 944 | 6 | 0 | 2013-01-01T11:00:00Z |
3 | JFK | 2013 | 1 | 1 | 558.0 | 600 | -2.0 | 849.0 | 851 | -2.0 | B6 | 49 | N793JB | PBI | 149.0 | 1028 | 6 | 0 | 2013-01-01T11:00:00Z |
4 | JFK | 2013 | 1 | 1 | 558.0 | 600 | -2.0 | 853.0 | 856 | -3.0 | B6 | 71 | N657JB | TPA | 158.0 | 1005 | 6 | 0 | 2013-01-01T11:00:00Z |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
111274 | JFK | 2013 | 9 | 30 | 2240.0 | 2250 | -10.0 | 2347.0 | 7 | -20.0 | B6 | 2002 | N281JB | BUF | 52.0 | 301 | 22 | 50 | 2013-10-01T02:00:00Z |
111275 | JFK | 2013 | 9 | 30 | 2241.0 | 2246 | -5.0 | 2345.0 | 1 | -16.0 | B6 | 486 | N346JB | ROC | 47.0 | 264 | 22 | 46 | 2013-10-01T02:00:00Z |
111276 | JFK | 2013 | 9 | 30 | 2307.0 | 2255 | 12.0 | 2359.0 | 2358 | 1.0 | B6 | 718 | N565JB | BOS | 33.0 | 187 | 22 | 55 | 2013-10-01T02:00:00Z |
111277 | JFK | 2013 | 9 | 30 | 2349.0 | 2359 | -10.0 | 325.0 | 350 | -25.0 | B6 | 745 | N516JB | PSE | 196.0 | 1617 | 23 | 59 | 2013-10-01T03:00:00Z |
111278 | JFK | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | NaN | DCA | NaN | 213 | 14 | 55 | 2013-09-30T18:00:00Z |
111279 rows × 19 columns
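The same filtering works in plain pandas by indexing a DataFrame with a boolean Series. A minimal sketch with a made-up toy DataFrame:

```python
import pandas as pd

df_toy = pd.DataFrame({
    "origin": ["EWR", "LGA", "JFK", "JFK"],
    "dest": ["IAH", "IAH", "MIA", "BQN"],
})

# Indexing with a boolean Series keeps only the rows where the mask is True
df_jfk = df_toy[df_toy["origin"] == "JFK"]
```
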
q6 Find the early departures#
Use gr.tf_filter() to find all the flights that left early.
# TASK: Filter for only those rows with a negative "dep_delay"
(
df_flights
>> gr.tf_filter(df_flights["dep_delay"] < 0)
)
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
1 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
2 | 2013 | 1 | 1 | 554.0 | 558 | -4.0 | 740.0 | 728 | 12.0 | UA | 1696 | N39463 | EWR | ORD | 150.0 | 719 | 5 | 58 | 2013-01-01T10:00:00Z |
3 | 2013 | 1 | 1 | 555.0 | 600 | -5.0 | 913.0 | 854 | 19.0 | B6 | 507 | N516JB | EWR | FLL | 158.0 | 1065 | 6 | 0 | 2013-01-01T11:00:00Z |
4 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 709.0 | 723 | -14.0 | EV | 5708 | N829AS | LGA | IAD | 53.0 | 229 | 6 | 0 | 2013-01-01T11:00:00Z |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
183570 | 2013 | 9 | 30 | 2237.0 | 2245 | -8.0 | 2345.0 | 2353 | -8.0 | B6 | 234 | N318JB | JFK | BTV | 43.0 | 266 | 22 | 45 | 2013-10-01T02:00:00Z |
183571 | 2013 | 9 | 30 | 2240.0 | 2245 | -5.0 | 2334.0 | 2351 | -17.0 | B6 | 1816 | N354JB | JFK | SYR | 41.0 | 209 | 22 | 45 | 2013-10-01T02:00:00Z |
183572 | 2013 | 9 | 30 | 2240.0 | 2250 | -10.0 | 2347.0 | 7 | -20.0 | B6 | 2002 | N281JB | JFK | BUF | 52.0 | 301 | 22 | 50 | 2013-10-01T02:00:00Z |
183573 | 2013 | 9 | 30 | 2241.0 | 2246 | -5.0 | 2345.0 | 1 | -16.0 | B6 | 486 | N346JB | JFK | ROC | 47.0 | 264 | 22 | 46 | 2013-10-01T02:00:00Z |
183574 | 2013 | 9 | 30 | 2349.0 | 2359 | -10.0 | 325.0 | 350 | -25.0 | B6 | 745 | N516JB | JFK | PSE | 196.0 | 1617 | 23 | 59 | 2013-10-01T03:00:00Z |
183575 rows × 19 columns
The data pronoun DF#
Way back at the beginning of this notebook, you may have noticed this line of code:
DF = gr.Intention()
This assigned the “data pronoun” to the variable DF. This pronoun can be used inside grama functions to refer to the data as it is at any stage in the pipeline. Rather than:
(
df_flights
>> gr.tf_filter(df_flights["dep_delay"] < 0)
)
we instead write
(
df_flights
>> gr.tf_filter(DF["dep_delay"] < 0)
)
The data pronoun DF is really just an alias for the DataFrame, so we can still use bracket notation [].
The data pronoun DF is not just convenient; it is necessary to make some operations work! Imagine we wanted to find all cases that satisfy both dep_delay < 0 and arr_delay < 0. Consider the following code:
(
df_flights
# This filter works properly
>> gr.tf_filter(df_flights["dep_delay"] < 0)
# We now have fewer rows! The next filter will fail
>> gr.tf_filter(df_flights["arr_delay"] < 0)
)
The data pronoun DF allows us to refer to the data as it is in the pipeline; this resolves the issue of the second filter receiving fewer rows than its mask expects. You’ll get a chance to fix this code in the next task.
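To build intuition for how a data pronoun can defer evaluation, here is a toy class that records an expression now and evaluates it against a DataFrame later. This is an illustrative sketch only, not grama's actual implementation:

```python
import pandas as pd

# A minimal "data pronoun": each operation returns a new deferred
# expression (a function of a future DataFrame) instead of a value
class Pronoun:
    def __init__(self, fn=lambda df: df):
        self.fn = fn

    def __getitem__(self, col):
        return Pronoun(lambda df: self.fn(df)[col])

    def __lt__(self, value):
        return Pronoun(lambda df: self.fn(df) < value)

    def evaluate(self, df):
        return self.fn(df)

DF_toy = Pronoun()
expr = DF_toy["delay"] < 0  # nothing is computed yet

df_a = pd.DataFrame({"delay": [-5, 3, -1]})
mask = expr.evaluate(df_a)  # evaluated only now, against df_a
```

Because the expression is evaluated against whatever DataFrame reaches it, the same expression works at any stage of a pipeline, no matter how many rows remain.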
q7 Use DF to fix this code#
Use the data pronoun DF to fix the following code:
# TASK: Fix this code using the data pronoun DF
(
df_flights
>> gr.tf_filter(DF["dep_delay"] < 0)
>> gr.tf_filter(DF["arr_delay"] < 0)
)
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
1 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
2 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 709.0 | 723 | -14.0 | EV | 5708 | N829AS | LGA | IAD | 53.0 | 229 | 6 | 0 | 2013-01-01T11:00:00Z |
3 | 2013 | 1 | 1 | 557.0 | 600 | -3.0 | 838.0 | 846 | -8.0 | B6 | 79 | N593JB | JFK | MCO | 140.0 | 944 | 6 | 0 | 2013-01-01T11:00:00Z |
4 | 2013 | 1 | 1 | 558.0 | 600 | -2.0 | 849.0 | 851 | -2.0 | B6 | 49 | N793JB | JFK | PBI | 149.0 | 1028 | 6 | 0 | 2013-01-01T11:00:00Z |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
144341 | 2013 | 9 | 30 | 2237.0 | 2245 | -8.0 | 2345.0 | 2353 | -8.0 | B6 | 234 | N318JB | JFK | BTV | 43.0 | 266 | 22 | 45 | 2013-10-01T02:00:00Z |
144342 | 2013 | 9 | 30 | 2240.0 | 2245 | -5.0 | 2334.0 | 2351 | -17.0 | B6 | 1816 | N354JB | JFK | SYR | 41.0 | 209 | 22 | 45 | 2013-10-01T02:00:00Z |
144343 | 2013 | 9 | 30 | 2240.0 | 2250 | -10.0 | 2347.0 | 7 | -20.0 | B6 | 2002 | N281JB | JFK | BUF | 52.0 | 301 | 22 | 50 | 2013-10-01T02:00:00Z |
144344 | 2013 | 9 | 30 | 2241.0 | 2246 | -5.0 | 2345.0 | 1 | -16.0 | B6 | 486 | N346JB | JFK | ROC | 47.0 | 264 | 22 | 46 | 2013-10-01T02:00:00Z |
144345 | 2013 | 9 | 30 | 2349.0 | 2359 | -10.0 | 325.0 | 350 | -25.0 | B6 | 745 | N516JB | JFK | PSE | 196.0 | 1617 | 23 | 59 | 2013-10-01T03:00:00Z |
144346 rows × 19 columns
Arranging#
Filtering is particularly helpful when combined with sorting; we can sort on any column using the function gr.tf_arrange(). For instance, the following code sorts by "distance", pulling the shortest flights to the top of the DataFrame:
# NOTE: No need to edit
(
df_flights
# Sorts from smallest to largest
>> gr.tf_arrange(DF["distance"])
# Inspect the route
>> gr.tf_select("distance", "origin", "dest", gr.everything())
)
distance | origin | dest | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | air_time | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 17 | EWR | LGA | 2013 | 7 | 27 | NaN | 106 | NaN | NaN | 245 | NaN | US | 1632 | NaN | NaN | 1 | 6 | 2013-07-27T05:00:00Z |
1 | 80 | EWR | PHL | 2013 | 1 | 4 | 1240.0 | 1200 | 40.0 | 1333.0 | 1306 | 27.0 | EV | 4193 | N14972 | 30.0 | 12 | 0 | 2013-01-04T17:00:00Z |
2 | 80 | EWR | PHL | 2013 | 1 | 19 | 1617.0 | 1617 | 0.0 | 1722.0 | 1722 | 0.0 | EV | 4616 | N12540 | 34.0 | 16 | 17 | 2013-01-19T21:00:00Z |
3 | 80 | EWR | PHL | 2013 | 2 | 1 | 2128.0 | 2129 | -1.0 | 2216.0 | 2224 | -8.0 | EV | 4619 | N13969 | 24.0 | 21 | 29 | 2013-02-02T02:00:00Z |
4 | 80 | EWR | PHL | 2013 | 1 | 23 | 2128.0 | 2129 | -1.0 | 2221.0 | 2224 | -3.0 | EV | 4619 | N12135 | 23.0 | 21 | 29 | 2013-01-24T02:00:00Z |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
336771 | 4983 | JFK | HNL | 2013 | 12 | 18 | 928.0 | 930 | -2.0 | 1543.0 | 1535 | 8.0 | HA | 51 | N395HA | 641.0 | 9 | 30 | 2013-12-18T14:00:00Z |
336772 | 4983 | JFK | HNL | 2013 | 6 | 2 | 956.0 | 1000 | -4.0 | 1442.0 | 1435 | 7.0 | HA | 51 | N383HA | 617.0 | 10 | 0 | 2013-06-02T14:00:00Z |
336773 | 4983 | JFK | HNL | 2013 | 3 | 17 | 1006.0 | 1000 | 6.0 | 1607.0 | 1530 | 37.0 | HA | 51 | N380HA | 686.0 | 10 | 0 | 2013-03-17T14:00:00Z |
336774 | 4983 | JFK | HNL | 2013 | 5 | 28 | 953.0 | 1000 | -7.0 | 1447.0 | 1500 | -13.0 | HA | 51 | N385HA | 631.0 | 10 | 0 | 2013-05-28T14:00:00Z |
336775 | 4983 | JFK | HNL | 2013 | 8 | 11 | 950.0 | 1000 | -10.0 | 1438.0 | 1440 | -2.0 | HA | 51 | N391HA | 628.0 | 10 | 0 | 2013-08-11T14:00:00Z |
336776 rows × 19 columns
We can also reverse the order of the sort with the gr.desc() helper, as shown below:
# NOTE: No need to edit
(
df_flights
# Sorts from largest to smallest (*desc*ending)
>> gr.tf_arrange(gr.desc(DF["distance"]))
# Inspect the route
>> gr.tf_select("distance", "origin", "dest", gr.everything())
)
distance | origin | dest | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | air_time | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4983 | JFK | HNL | 2013 | 5 | 14 | 959.0 | 1000 | -1.0 | 1433.0 | 1500 | -27.0 | HA | 51 | N388HA | 608.0 | 10 | 0 | 2013-05-14T14:00:00Z |
1 | 4983 | JFK | HNL | 2013 | 1 | 11 | 855.0 | 900 | -5.0 | 1442.0 | 1530 | -48.0 | HA | 51 | N383HA | 613.0 | 9 | 0 | 2013-01-11T14:00:00Z |
2 | 4983 | JFK | HNL | 2013 | 4 | 6 | 957.0 | 1000 | -3.0 | 1510.0 | 1510 | 0.0 | HA | 51 | N382HA | 638.0 | 10 | 0 | 2013-04-06T14:00:00Z |
3 | 4983 | JFK | HNL | 2013 | 2 | 15 | 858.0 | 900 | -2.0 | 1436.0 | 1540 | -64.0 | HA | 51 | N382HA | 611.0 | 9 | 0 | 2013-02-15T14:00:00Z |
4 | 4983 | JFK | HNL | 2013 | 7 | 24 | 958.0 | 1000 | -2.0 | 1435.0 | 1430 | 5.0 | HA | 51 | N384HA | 609.0 | 10 | 0 | 2013-07-24T14:00:00Z |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
336771 | 80 | EWR | PHL | 2013 | 3 | 2 | 1926.0 | 1929 | -3.0 | 2011.0 | 2023 | -12.0 | EV | 4457 | N21144 | 30.0 | 19 | 29 | 2013-03-03T00:00:00Z |
336772 | 80 | EWR | PHL | 2013 | 1 | 27 | 2128.0 | 2129 | -1.0 | 2213.0 | 2224 | -11.0 | EV | 4619 | N13969 | 29.0 | 21 | 29 | 2013-01-28T02:00:00Z |
336773 | 80 | EWR | PHL | 2013 | 2 | 3 | 2153.0 | 2129 | 24.0 | 2247.0 | 2224 | 23.0 | EV | 4619 | N13913 | 21.0 | 21 | 29 | 2013-02-04T02:00:00Z |
336774 | 80 | EWR | PHL | 2013 | 1 | 9 | 2126.0 | 2129 | -3.0 | 2217.0 | 2224 | -7.0 | EV | 4619 | N17560 | 27.0 | 21 | 29 | 2013-01-10T02:00:00Z |
336775 | 17 | EWR | LGA | 2013 | 7 | 27 | NaN | 106 | NaN | NaN | 245 | NaN | US | 1632 | NaN | NaN | 1 | 6 | 2013-07-27T05:00:00Z |
336776 rows × 19 columns
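The pandas counterpart of both sorts is sort_values(), where the ascending flag plays the role of gr.desc(). A minimal sketch with a made-up toy DataFrame:

```python
import pandas as pd

df_toy = pd.DataFrame({
    "distance": [1400, 80, 4983],
    "dest": ["IAH", "PHL", "HNL"],
})

# Ascending sort (shortest first)
df_asc = df_toy.sort_values("distance")

# Descending sort (longest first)
df_desc = df_toy.sort_values("distance", ascending=False)
```
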
q8 Find the earliest departures#
Find the top 10 earliest departures. How early did these depart? Do these early departures have anything in common?
Hint: You will need to combine functions to accomplish this.
# TASK: Find the top 10 earliest departures
(
df_flights
>> gr.tf_arrange(DF["dep_delay"])
>> gr.tf_head(10)
>> gr.tf_select("dep_delay", "origin", "dest", gr.everything())
)
dep_delay | origin | dest | year | month | day | dep_time | sched_dep_time | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -43.0 | JFK | DEN | 2013 | 12 | 7 | 2040.0 | 2123 | 40.0 | 2352 | 48.0 | B6 | 97 | N592JB | 265.0 | 1626 | 21 | 23 | 2013-12-08T02:00:00Z |
1 | -33.0 | LGA | MSY | 2013 | 2 | 3 | 2022.0 | 2055 | 2240.0 | 2338 | -58.0 | DL | 1715 | N612DL | 162.0 | 1183 | 20 | 55 | 2013-02-04T01:00:00Z |
2 | -32.0 | LGA | IAD | 2013 | 11 | 10 | 1408.0 | 1440 | 1549.0 | 1559 | -10.0 | EV | 5713 | N825AS | 52.0 | 229 | 14 | 40 | 2013-11-10T19:00:00Z |
3 | -30.0 | LGA | TPA | 2013 | 1 | 11 | 1900.0 | 1930 | 2233.0 | 2243 | -10.0 | DL | 1435 | N934DL | 139.0 | 1010 | 19 | 30 | 2013-01-12T00:00:00Z |
4 | -27.0 | LGA | DEN | 2013 | 1 | 29 | 1703.0 | 1730 | 1947.0 | 1957 | -10.0 | F9 | 837 | N208FR | 250.0 | 1620 | 17 | 30 | 2013-01-29T22:00:00Z |
5 | -26.0 | LGA | DTW | 2013 | 8 | 9 | 729.0 | 755 | 1002.0 | 955 | 7.0 | MQ | 3478 | N711MQ | 88.0 | 502 | 7 | 55 | 2013-08-09T11:00:00Z |
6 | -25.0 | LGA | DTW | 2013 | 3 | 30 | 2030.0 | 2055 | 2213.0 | 2250 | -37.0 | MQ | 4573 | N725MQ | 87.0 | 502 | 20 | 55 | 2013-03-31T00:00:00Z |
7 | -25.0 | EWR | TYS | 2013 | 10 | 23 | 1907.0 | 1932 | 2143.0 | 2143 | 0.0 | EV | 4361 | N13994 | 111.0 | 631 | 19 | 32 | 2013-10-23T23:00:00Z |
8 | -24.0 | LGA | STL | 2013 | 9 | 18 | 1631.0 | 1655 | 1812.0 | 1845 | -33.0 | AA | 2223 | N4XXAA | 125.0 | 888 | 16 | 55 | 2013-09-18T20:00:00Z |
9 | -24.0 | JFK | BUF | 2013 | 3 | 2 | 1431.0 | 1455 | 1601.0 | 1631 | -30.0 | 9E | 3318 | N929XJ | 55.0 | 301 | 14 | 55 | 2013-03-02T19:00:00Z |
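For reference, the same arrange-then-head pattern can be sketched in plain pandas, since grama pipelines operate on ordinary DataFrames. The toy frame below is hypothetical, just a stand-in for df_flights:

```python
import pandas as pd

# Hypothetical stand-in for df_flights (illustration only)
df = pd.DataFrame({
    "dep_delay": [2.0, -43.0, -33.0, 11.0],
    "origin": ["EWR", "JFK", "LGA", "JFK"],
})

# Equivalent of tf_arrange + tf_head: sort ascending, keep the first rows
top2 = df.sort_values("dep_delay").head(2)
```

The most negative dep_delay values (the earliest departures) rise to the top because sort_values() defaults to ascending order.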
Isolating to answer questions#
Before we close this exercise, let’s use these data isolation tools to answer a real question about the dataset:
q9 What are these data for?#
What are these data for? In particular, in what way are they “focused on the NYC area”? Complete the following tasks, and answer the questions under observations below.
Hint: You might find it helpful to use the or operator, written with the symbol |. The Python or keyword does not work inside gr.tf_filter().
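The reason for this hint is a pandas behavior: comparisons on a column produce a boolean Series, and only the | operator combines such Series elementwise. A minimal sketch with a toy Series (not the flights data):

```python
import pandas as pd

s = pd.Series(["JFK", "ATL", "EWR"])

# Elementwise "or" on boolean Series uses the `|` operator
mask = (s == "JFK") | (s == "EWR")

# By contrast, `(s == "JFK") or (s == "EWR")` raises a ValueError,
# because Python's `or` asks for a single truth value of the whole Series.
```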
# TASK: Filter to only those cases where the destination airport
# is one of "JFK", "LGA", or "EWR"
df_q9dest = (
df_flights
>> gr.tf_filter(
(DF["dest"] == "JFK") |
(DF["dest"] == "LGA") |
(DF["dest"] == "EWR")
)
)
# NOTE: No need to edit; use this to check your work
assert \
df_q9dest.shape[0] == 1, \
"Incorrect filter"
df_q9dest
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013 | 7 | 27 | NaN | 106 | NaN | NaN | 245 | NaN | US | 1632 | NaN | EWR | LGA | NaN | 17 | 1 | 6 | 2013-07-27T05:00:00Z |
# TASK: Filter to only those cases where the origin airport
# is one of "JFK", "LGA", or "EWR"
df_q9origin = (
df_flights
>> gr.tf_filter(
(DF["origin"] == "JFK") |
(DF["origin"] == "LGA") |
(DF["origin"] == "EWR")
)
)
# NOTE: No need to edit; use this to check your work
assert \
df_q9origin.shape[0] == df_flights.shape[0], \
"Incorrect filter"
df_q9origin
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01T10:00:00Z |
1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01T10:00:00Z |
2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01T10:00:00Z |
3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01T10:00:00Z |
4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01T11:00:00Z |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
336771 | 2013 | 9 | 30 | NaN | 1455 | NaN | NaN | 1634 | NaN | 9E | 3393 | NaN | JFK | DCA | NaN | 213 | 14 | 55 | 2013-09-30T18:00:00Z |
336772 | 2013 | 9 | 30 | NaN | 2200 | NaN | NaN | 2312 | NaN | 9E | 3525 | NaN | LGA | SYR | NaN | 198 | 22 | 0 | 2013-10-01T02:00:00Z |
336773 | 2013 | 9 | 30 | NaN | 1210 | NaN | NaN | 1330 | NaN | MQ | 3461 | N535MQ | LGA | BNA | NaN | 764 | 12 | 10 | 2013-09-30T16:00:00Z |
336774 | 2013 | 9 | 30 | NaN | 1159 | NaN | NaN | 1344 | NaN | MQ | 3572 | N511MQ | LGA | CLE | NaN | 419 | 11 | 59 | 2013-09-30T15:00:00Z |
336775 | 2013 | 9 | 30 | NaN | 840 | NaN | NaN | 1020 | NaN | MQ | 3531 | N839MQ | LGA | RDU | NaN | 431 | 8 | 40 | 2013-09-30T12:00:00Z |
336776 rows × 19 columns
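As a quick cross-check, Series.isin() collapses the three chained | comparisons into one membership test. A plain-pandas sketch on a toy frame (hypothetical rows, not the real dataset):

```python
import pandas as pd

# Toy stand-in for df_flights (illustration only)
df = pd.DataFrame({
    "origin": ["JFK", "LGA", "EWR", "JFK"],
    "dest":   ["IAH", "JFK", "MIA", "BQN"],
})
nyc = ["JFK", "LGA", "EWR"]

n_origin = int(df["origin"].isin(nyc).sum())  # rows departing an NYC airport
n_dest = int(df["dest"].isin(nyc).sum())      # rows arriving at an NYC airport
```

In the real dataset the same counts are 336,776 (all rows) for origin and just 1 for destination, which is exactly what the two filters above reveal.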
Observations
How many rows have either “JFK”, “LGA”, or “EWR” as their destination?
Just one!
How many rows have either “JFK”, “LGA”, or “EWR” as their origin?
All of the rows have “JFK”, “LGA”, or “EWR” as their origin
Would we be able to answer questions related to flights entering the NYC area using this dataset?
Nope! We have virtually no data on those flights.
In what sense is this dataset “focused on NYC”?
This dataset is focused on flights departing from NYC.
Aside: Data are not just numbers. Data are numbers with context. Every dataset is put together for some reason. This reason will inform what observations (rows) and variables (columns) are in the data, and which are not in the data. Conversely, thinking carefully about what data a person or organization bothered to collect—and what they ignored—can tell you something about the perspective of those who collected the data. Thinking about these issues is partly what separates data science from programming or machine learning. (end-rant)