28 Vis: Scatterplots and Layers
Purpose: Scatterplots are a key tool for EDA. Scatterplots help us inspect the relationship between two variables. To enhance our scatterplots, we’ll learn how to use layers in ggplot to add multiple pieces of information to our plots.
Reading: Scatterplots. Topics: (All topics). Reading Time: ~40 minutes.
library(tidyverse)
# ggrepel provides geom_label_repel(), used in q2
library(ggrepel)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
28.1 A Note on Layers
In the reading we learned about layers in ggplot. Formally, ggplot is a “layered grammar of graphics”; each layer has the option to use built-in or inherited defaults, or override those defaults. There are two major settings we might want to change: the source of data or the mapping which defines the aesthetics. If we’re being verbose, we write a ggplot call like:
## NOTE: No need to modify! Just example code
ggplot(
  data = mpg,
  mapping = aes(x = displ, y = hwy)
) +
  geom_point()
However, ggplot provides a number of sensible defaults to save us typing. ggplot assumes the argument order data, mapping, so we can drop the keywords.
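A minimal sketch of the positional form, assuming the same mpg plot as the verbose example above. The object names p_keywords and p_positional are illustrative and do not come from the original text.

```r
library(ggplot2)

# Keyword form, as in the verbose example
p_keywords <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Positional form: data comes first, mapping second
p_positional <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Printing the object draws the plot
p_positional
```

Both calls should draw the same scatterplot; only the amount of typing differs.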
Similarly, the aesthetic function aes() assumes the first two arguments will be x, y, so we can drop those argument names as well.
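As a sketch, again assuming the mpg plot used above, dropping the x and y names inside aes() looks like this. The name p_short is illustrative only.

```r
library(ggplot2)

# The first two unnamed arguments of aes() are matched to x and y
p_short <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# Printing the object draws the plot
p_short
```

Other aesthetics, such as color, still need to be named.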
In the examples above, geom_point() inherits the mapping from the base ggplot() call; however, we can override this within a layer. This can be helpful, for example, to plot two variables against a shared x-axis:
## NOTE: No need to modify! Just example code
ggplot(mpg, aes(x = displ)) +
  geom_point(aes(y = hwy, color = "hwy")) +
  geom_point(aes(y = cty, color = "cty"))
Later, we’ll learn more concise ways to construct graphs like the one above. But for now, we’ll practice using layers to add more information to scatterplots.
28.2 Exercises
28.2.1 q1 Add two geom_smooth trends to the following plot. Use “gam” for one trend and “lm” for the other. Comment on how linear or nonlinear the “gam” trend looks.
diamonds %>%
  ggplot(aes(carat, price)) +
  geom_point() +
  geom_smooth(aes(color = "gam"), method = "gam") +
  geom_smooth(aes(color = "lm"), method = "lm")
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using formula = 'y ~ x'
Observations:
- The “gam” trend is not linear; it curves below, then above, the linear trend
28.2.2 q2 Add non-overlapping labels to the following scatterplot using the provided df_annotate.
Hint 1: geom_label_repel comes from the ggrepel package. Make sure to load it, and adhere to best practices!
Hint 2: You’ll have to use the data keyword to override the data layer!
## TODO: Use df_annotate below to add text labels to the scatterplot
df_annotate <-
  mpg %>%
  group_by(class) %>%
  summarize(
    displ = mean(displ),
    hwy = mean(hwy)
  )

mpg %>%
  ggplot(aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_label_repel(
    data = df_annotate,
    aes(label = class, fill = class)
  )
28.2.3 q3 Study the following scatterplot: Note whether city (cty) or highway (hwy) mileage tends to be greater. Describe the trend (visualized by geom_smooth) in mileage with engine displacement (a measure of engine size).
Note: The grey region around the smooth trend is a confidence bound; we’ll discuss these further as we get deeper into statistical literacy.
## NOTE: No need to modify! Just analyze the scatterplot
mpg %>%
  pivot_longer(cols = c(hwy, cty), names_to = "source", values_to = "mpg") %>%
  ggplot(aes(displ, mpg, color = source)) +
  geom_point() +
  geom_smooth() +
  scale_color_discrete(name = "Mileage Type") +
  labs(
    x = "Engine displacement (liters)",
    y = "Mileage (mpg)"
  )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Observations:
- hwy mileage tends to be larger; driving on the highway is more efficient
- Mileage tends to decrease with engine size; cars with larger engines tend to be less efficient
28.3 Aside: Scatterplot vs bar chart
Why use a scatterplot vs a bar chart? A bar chart is useful for emphasizing some threshold. Let’s look at a few examples:
28.4 Raw populations
Two visuals of the same data:
Here we’re emphasizing zero, so we don’t see much of a change.
Here we’re not emphasizing zero; the scale is adjusted to emphasize the trend in the data.
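The code behind these two visuals is not shown here. A rough sketch of how such a pair could be built, assuming they use the economics dataset and the same date window as the population-change plots in the next section, with pop on the y-axis (the names df_recent, p_bar, and p_dot are illustrative):

```r
library(dplyr)
library(ggplot2)

# Assumption: same dataset and date window as the population-change plots
df_recent <- economics %>%
  filter(date > lubridate::ymd("2005-01-01"))

# Bars are anchored at zero, so month-to-month variation in pop looks small
p_bar <- df_recent %>%
  ggplot(aes(date, pop)) +
  geom_col()

# Points let the y-axis fit the data range, so the trend stands out
p_dot <- df_recent %>%
  ggplot(aes(date, pop)) +
  geom_point()

p_bar
p_dot
```

The only difference between the two plots is the geom layer; the data and mapping are shared.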
28.5 Population changes
Two visuals of the same data:
economics %>%
  mutate(pop_delta = pop - lag(pop)) %>%
  filter(date > lubridate::ymd("2005-01-01")) %>%
  ggplot(aes(date, pop_delta)) +
  geom_col()
Here we’re emphasizing zero, so we can easily see the month of negative change.
economics %>%
  mutate(pop_delta = pop - lag(pop)) %>%
  filter(date > lubridate::ymd("2005-01-01")) %>%
  ggplot(aes(date, pop_delta)) +
  geom_point()
Here we’re not emphasizing zero; we can easily see the outlier month, but we have to read the axis to see that this is a case of negative growth.
For more, see Bars vs Dots.