17 Data: Isolating Data
Purpose: One of the keys to a successful analysis is the ability to focus on particular topics. When analyzing a dataset, our ability to focus is tied to our facility at isolating data. In this exercise, you will practice isolating columns with select()
, picking specific rows with filter()
, and sorting your data with arrange()
to see what rises to the top.
Reading: (None, this is the reading)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We’ll use the nycflights13
dataset for this exercise; upon loading the package, the data are stored in the variable name flights
. For instance:
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
17.1 select()
columns
Sometimes, our data has so many columns that it’s just overwhelming. The flights
dataset only has 19 columns, but that’s already quite a bit to deal with. Thankfully, we can select()
a subset of columns to focus on a few at a time. To select a column, we simply provide its name:
## # A tibble: 336,776 × 3
## flight carrier dep_time
## <int> <chr> <int>
## 1 1545 UA 517
## 2 1714 UA 533
## 3 1141 AA 542
## 4 725 B6 544
## 5 461 DL 554
## 6 1696 UA 554
## 7 507 B6 555
## 8 5708 EV 557
## 9 79 B6 557
## 10 301 AA 558
## # ℹ 336,766 more rows
This is fine when we know exactly what we’re looking for. But select()
is even more powerful when combined with selection helpers.
17.1.1 Matching selection helpers
The following helpers perform different kinds of column matching:
starts_with(str)
: all columns that start with the prefixstr
ends_with(str)
: all columns that end with the suffixstr
contains(str)
: all columns that contain the substringstr
matches(expr)
: matches a regular expressionexpr
*numrange()
: builds ranges of variables, like"x0", "x1", "x2", ...
*We’ll learn more about regular expressions in e-data06-strings
We use selection helpers inside select()
, just like we’d provide a column name.
## # A tibble: 336,776 × 2
## dep_time dep_delay
## <int> <dbl>
## 1 517 2
## 2 533 4
## 3 542 2
## 4 544 -1
## 5 554 -6
## 6 554 -4
## 7 555 -5
## 8 557 -3
## 9 557 -3
## 10 558 -2
## # ℹ 336,766 more rows
17.1.2 q1 Select matches
Select all the variables whose name ends with _time
.
## # A tibble: 336,776 × 5
## dep_time sched_dep_time arr_time sched_arr_time air_time
## <int> <int> <int> <int> <dbl>
## 1 517 515 830 819 227
## 2 533 529 850 830 227
## 3 542 540 923 850 160
## 4 544 545 1004 1022 183
## 5 554 600 812 837 116
## 6 554 558 740 728 150
## 7 555 600 913 854 158
## 8 557 600 709 723 53
## 9 557 600 838 846 140
## 10 558 600 753 745 138
## # ℹ 336,766 more rows
The following is a unit test of your code; if you managed to solve task q1 correctly, the following code will execute without error.
## NOTE: No need to change this
assertthat::assert_that(
all(names(df_q1) %>% str_detect(., "_time$"))
)
## [1] TRUE
## [1] "Nice!"
17.1.3 The everything()
helper
There’s another strange helper: everything()
. This seems useless at first, until we realize that we can first re-arrange the other columns, then put everything()
else at the end!
## # A tibble: 336,776 × 19
## year month day sched_dep_time dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 515 517 2 830 819
## 2 2013 1 1 529 533 4 850 830
## 3 2013 1 1 540 542 2 923 850
## 4 2013 1 1 545 544 -1 1004 1022
## 5 2013 1 1 600 554 -6 812 837
## 6 2013 1 1 558 554 -4 740 728
## 7 2013 1 1 600 555 -5 913 854
## 8 2013 1 1 600 557 -3 709 723
## 9 2013 1 1 600 557 -3 838 846
## 10 2013 1 1 600 558 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
17.1.4 q2 Re-arrange the columns
Re-arrange the columns to place dest, origin, carrier
at the front, but retain all other columns.
Hint: The function everything()
will be useful!
## # A tibble: 336,776 × 19
## dest origin carrier year month day dep_time sched_dep_time dep_delay
## <chr> <chr> <chr> <int> <int> <int> <int> <int> <dbl>
## 1 IAH EWR UA 2013 1 1 517 515 2
## 2 IAH LGA UA 2013 1 1 533 529 4
## 3 MIA JFK AA 2013 1 1 542 540 2
## 4 BQN JFK B6 2013 1 1 544 545 -1
## 5 ATL LGA DL 2013 1 1 554 600 -6
## 6 ORD EWR UA 2013 1 1 554 558 -4
## 7 FLL EWR B6 2013 1 1 555 600 -5
## 8 IAD LGA EV 2013 1 1 557 600 -3
## 9 MCO JFK B6 2013 1 1 557 600 -3
## 10 ORD LGA AA 2013 1 1 558 600 -2
## # ℹ 336,766 more rows
## # ℹ 10 more variables: arr_time <int>, sched_arr_time <int>, arr_delay <dbl>,
## # flight <int>, tailnum <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Use the following to check your code.
## NOTE: No need to change this
assertthat::assert_that(
assertthat::are_equal(names(df_q2)[1:5], c("dest", "origin", "carrier", "year", "month"))
)
## [1] TRUE
## [1] "Well done!"
Since R will only show the first few columns of a tibble, using select()
in this fashion will help us see the values of particular columns.
17.2 filter()
rows
With analyzing data, we’re often looking for rows that match particular criteria. To find rows, we can use filter()
along with logical statements. For instance, we could select only those rows where the flight was in February (month == 2
).
## # A tibble: 24,951 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 2 1 456 500 -4 652 648
## 2 2013 2 1 520 525 -5 816 820
## 3 2013 2 1 527 530 -3 837 829
## 4 2013 2 1 532 540 -8 1007 1017
## 5 2013 2 1 540 540 0 859 850
## 6 2013 2 1 552 600 -8 714 715
## 7 2013 2 1 552 600 -8 919 910
## 8 2013 2 1 552 600 -8 655 709
## 9 2013 2 1 553 600 -7 833 815
## 10 2013 2 1 553 600 -7 821 825
## # ℹ 24,941 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Important note: Keep in mind that you have to use ==
to check equality. Using =
inside a filter()
will raise a helpful error message telling you to use ==
instead.
17.2.1 Filter conditions
We can use a variety of conditions with filter()
:
==
: strict equality!=
: not equal>
,>=
: greater than (or equal)<
, “<=`: less than (or equal)- Logical operations:
&
: and|
: or!
: notxor
: and/or
17.2.2 Filter helpers
There are also some useful helper functions we can use with filter()
:
is.na(x)
: Returns rows that containNaN
values- Note that we can use
!is.na(x)
to return non-NaN
values
- Note that we can use
between(x, l, r)
: Returns rows withl < x < r
near(x, y)
: Returns rows wherex
is “close” toy
; we can optionally set the tolerance withtol = ???
17.2.3 q3 Find near-arrivals
Find all the flights where the plane arrived within 10 minutes of its planned arrival time, but was not exactly on time.
Hint: You can use ?flights
to look up the units for each column.
## TASK:
df_q3 <-
flights %>%
filter(
arr_time != sched_arr_time,
near(arr_time, sched_arr_time, tol = 10),
)
df_q3
## # A tibble: 86,700 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 557 600 -3 838 846
## 2 2013 1 1 558 600 -2 753 745
## 3 2013 1 1 558 600 -2 849 851
## 4 2013 1 1 558 600 -2 853 856
## 5 2013 1 1 558 600 -2 924 917
## 6 2013 1 1 559 559 0 702 706
## 7 2013 1 1 600 600 0 851 858
## 8 2013 1 1 601 600 1 844 850
## 9 2013 1 1 602 610 -8 812 820
## 10 2013 1 1 606 610 -4 837 845
## # ℹ 86,690 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
## NOTE: No need to change this!
assertthat::assert_that(
(
df_q3 %>%
summarize(diff = max(arr_time - sched_arr_time)) %>%
pull(diff) %>%
.[[1]] <= 10
) |
(
df_q3 %>%
summarize(diff = max(abs(arr_delay))) %>%
pull(diff) %>%
.[[1]] <= 10
)
)
## [1] TRUE
assertthat::assert_that(
df_q3 %>%
summarize(diff = max(arr_time - sched_arr_time)) %>%
pull(diff) %>%
.[[1]] > 0
)
## [1] TRUE
## [1] "Nice!"
17.3 Tidyverse quirks
Note that in Tidyverse functions like select()
and filter()
, we can refer to column names directly—we don’t have to quote them. However, the same doesn’t hold if we’re trying to compare string values. Make sure to quote a string if you’re using it in a filter()
comparison!
17.3.1 q4 Fix a bug
Fix the following code. What is the mistake here? What is the code trying to accomplish?
## # A tibble: 16,174 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 558 600 -2 924 917
## 2 2013 1 1 628 630 -2 1016 947
## 3 2013 1 1 658 700 -2 1027 1025
## 4 2013 1 1 702 700 2 1058 1014
## 5 2013 1 1 743 730 13 1107 1100
## 6 2013 1 1 828 823 5 1150 1143
## 7 2013 1 1 829 830 -1 1152 1200
## 8 2013 1 1 856 900 -4 1226 1220
## 9 2013 1 1 859 900 -1 1223 1225
## 10 2013 1 1 921 900 21 1237 1227
## # ℹ 16,164 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
The next error is far more insidious….
17.3.2 q5 Fix a bug
This code doesn’t quite what the user intended—they were trying to filter for only those flights where the desination was Boston airport. What went wrong?
## It's more likely the user is trying to find flights to Boston;
## they just did something a bit misguided by storing LaGuardia
## in a variable named BOS
flights %>% filter(dest == "BOS")
## # A tibble: 15,508 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 559 559 0 702 706
## 2 2013 1 1 639 640 -1 739 749
## 3 2013 1 1 801 805 -4 900 919
## 4 2013 1 1 803 810 -7 903 925
## 5 2013 1 1 820 830 -10 940 954
## 6 2013 1 1 923 919 4 1026 1030
## 7 2013 1 1 957 733 144 1056 853
## 8 2013 1 1 1033 1017 16 1130 1136
## 9 2013 1 1 1154 1200 -6 1253 1306
## 10 2013 1 1 1237 1245 -8 1340 1350
## # ℹ 15,498 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
It will take practice to get used to when and when not to use quotations. Don’t worry—we’ll get lots of practice!
The filter()
tool is very simple, but already allows us to do a lot: This dataset is called nycflights
; in what sense is it focused on New York city? Let’s do a quick check to get an idea:
17.3.3 q6 Understand the data
Perform two filters; first filter for flights where the destination was a New York airport (JFK, LGA, or EWR
), then for flights where the origin was a New York airport (the same three). Answer the questions below.
df_q6a <-
flights %>%
filter(
dest == "JFK" | dest == "LGA" | dest == "EWR"
)
df_q6b <-
flights %>%
filter(
origin == "JFK" | origin == "LGA" | origin == "EWR"
)
Use the following code to check your answer.
## NOTE: No need to change this!
assertthat::assert_that(
df_q6a %>%
mutate(flag = dest %in% c("JFK", "LGA", "EWR")) %>%
summarize(flag = all(flag)) %>%
pull(flag)
)
## [1] TRUE
assertthat::assert_that(
df_q6b %>%
mutate(flag = origin %in% c("JFK", "LGA", "EWR")) %>%
summarize(flag = all(flag)) %>%
pull(flag)
)
## [1] TRUE
## [1] "Nice!"
Observations - Was this dataset assembled to study flights out of NYC, or into NYC? How do you know? - The fact that all observations have their origin in NYC highlights (but does not itself prove!) that these data were collected to study flights departing from the NYC area.
Aside: Data are not just numbers. Data are numbers with context. Every dataset is put together for some reason. This reason will inform what observations (rows) and variables (columns) are in the data, and which are not in the data. Conversely, thinking carefully about what data a person or organization bothered to collect—and what they ignored—can tell you something about the perspective of those who collected the data. Thinking about these issues is partly what separates data science from programming or machine learning.
17.4 arrange()
rows
One more simple tool; rather than remove rows, we can re-arrange rows with arrange()
to see what comes to the top. arrange()
takes a set of columns by which to sort the data. For instance, we can find the earliest-departing flights with arrange(dep_delay)
.
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 7 2040 2123 -43 40 2352
## 2 2013 2 3 2022 2055 -33 2240 2338
## 3 2013 11 10 1408 1440 -32 1549 1559
## 4 2013 1 11 1900 1930 -30 2233 2243
## 5 2013 1 29 1703 1730 -27 1947 1957
## 6 2013 8 9 729 755 -26 1002 955
## 7 2013 10 23 1907 1932 -25 2143 2143
## 8 2013 3 30 2030 2055 -25 2213 2250
## 9 2013 3 2 1431 1455 -24 1601 1631
## 10 2013 5 5 934 958 -24 1225 1309
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Aside: What flight leaves 43
minutes early? That’s crazy!!
We can also use arrange(desc(x))
to reverse the sort.
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 6 15 1432 1935 1137 1607 2120
## 3 2013 1 10 1121 1635 1126 1239 1810
## 4 2013 9 20 1139 1845 1014 1457 2210
## 5 2013 7 22 845 1600 1005 1044 1815
## 6 2013 4 10 1100 1900 960 1342 2211
## 7 2013 3 17 2321 810 911 135 1020
## 8 2013 6 27 959 1900 899 1236 2226
## 9 2013 7 22 2257 759 898 121 1026
## 10 2013 12 5 756 1700 896 1058 2020
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Aside: A 21.6 hour delay sounds like hell.
Now I know what you’re thinking: “desc()
seems silly; why not just arrange(-x)
?”
Ah, but what about using arrange(s)
with string data? How will you take a negative string? You can’t! But you can use arrange(desc(s))
to sort in reverse alphabetical order:
## # A tibble: 336,776 × 19
## carrier year month day dep_time sched_dep_time dep_delay arr_time
## <chr> <int> <int> <int> <int> <int> <dbl> <int>
## 1 YV 2013 1 3 1428 1435 -7 1539
## 2 YV 2013 1 3 1551 1602 -11 1659
## 3 YV 2013 1 4 1430 1435 -5 1546
## 4 YV 2013 1 4 1731 1602 89 1837
## 5 YV 2013 1 6 1557 1605 -8 1714
## 6 YV 2013 1 7 1430 1435 -5 1541
## 7 YV 2013 1 7 1556 1602 -6 1721
## 8 YV 2013 1 8 1432 1435 -3 1537
## 9 YV 2013 1 8 1555 1602 -7 1727
## 10 YV 2013 1 9 1432 1435 -3 1543
## # ℹ 336,766 more rows
## # ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
17.4.1 q7 Sort to find
Sort the flights in descending order by their air_time
. Bring air_time, dest
to the front. What can you tell about the longest flights?
## # A tibble: 336,776 × 19
## air_time dest year month day dep_time sched_dep_time dep_delay arr_time
## <dbl> <chr> <int> <int> <int> <int> <int> <dbl> <int>
## 1 695 HNL 2013 3 17 1337 1335 2 1937
## 2 691 HNL 2013 2 6 853 900 -7 1542
## 3 686 HNL 2013 3 15 1001 1000 1 1551
## 4 686 HNL 2013 3 17 1006 1000 6 1607
## 5 683 HNL 2013 3 16 1001 1000 1 1544
## 6 679 HNL 2013 2 5 900 900 0 1555
## 7 676 HNL 2013 11 12 936 930 6 1630
## 8 676 HNL 2013 3 14 958 1000 -2 1542
## 9 675 HNL 2013 11 20 1006 1000 6 1639
## 10 671 HNL 2013 3 15 1342 1335 7 1924
## # ℹ 336,766 more rows
## # ℹ 10 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Use the following to check your work
## NOTE: No need to change this!
assertthat::assert_that(
assertthat::are_equal(
df_q7 %>% head(1) %>% pull(air_time),
flights %>% pull(air_time) %>% max(na.rm = TRUE)
)
)
## [1] TRUE
assertthat::assert_that(
assertthat::are_equal(
df_q7 %>% filter(!is.na(air_time)) %>% tail(1) %>% pull(air_time),
flights %>% pull(air_time) %>% min(na.rm = TRUE)
)
)
## [1] TRUE
## [1] TRUE
## [1] "Great job!"
The longest flights are to “HNL”; this makes sense—Honolulu is quite far from NYC!