28 Vis: Scatterplots and Layers

Purpose: Scatterplots are a key tool for exploratory data analysis (EDA). Scatterplots help us inspect the relationship between two variables. To enhance our scatterplots, we’ll learn how to use layers in ggplot to add multiple pieces of information to our plots.

Reading: Scatterplots
Topics: (All topics)
Reading Time: ~40 minutes

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggrepel)

28.1 A Note on Layers

In the reading we learned about layers in ggplot. Formally, ggplot is a “layered grammar of graphics”; each layer has the option to use built-in or inherited defaults, or override those defaults. There are two major settings we might want to change: the source of data or the mapping which defines the aesthetics. If we’re being verbose, we write a ggplot call like:

## NOTE: No need to modify! Just example code
ggplot(
  data = mpg,
  mapping = aes(x = displ, y = hwy)
) +
  geom_point()

However, ggplot provides a number of sensible defaults to save us typing. ggplot() expects its first two arguments to be data and mapping, in that order, so we can drop the keywords:

## NOTE: No need to modify! Just example code
ggplot(
  mpg,
  aes(x = displ, y = hwy)
) +
  geom_point()

Similarly, the aesthetic function aes() assumes its first two arguments are x and y, so we can drop those keywords as well:

## NOTE: No need to modify! Just example code
ggplot(
  mpg,
  aes(displ, hwy)
) +
  geom_point()
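
As a quick check of the shortcuts above, the sketch below builds the verbose and terse versions and compares the aesthetic names each one stores. The names p_verbose and p_terse are illustrative, and reading the mapping element of a plot object is an assumption about how ggplot2 stores its settings, not something the reading covers.

```r
library(tidyverse)

# Verbose form: every keyword spelled out
p_verbose <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Terse form: relies on argument order in ggplot() and aes()
p_terse <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# Both plots should report the same aesthetic names
# (assumption: the mapping is stored in the `mapping` element)
names(p_verbose$mapping)
names(p_terse$mapping)
```

If the two calls print the same names, the positional form was matched to x and y as described.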

Above, geom_point() inherits the mapping from the base ggplot() call; however, we can override or add to this mapping within an individual layer. This can be helpful for a number of different purposes:

## NOTE: No need to modify! Just example code
ggplot(mpg, aes(x = displ)) +
  geom_point(aes(y = hwy, color = "hwy")) +
  geom_point(aes(y = cty, color = "cty"))

Later, we’ll learn more concise ways to construct graphs like the one above. But for now, we’ll practice using layers to add more information to scatterplots.
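
A layer can override the data as well as the mapping. The sketch below is one possible illustration: the second geom_point() layer is given its own data, so it draws only a subset on top of the full scatterplot. The subset df_2seater and the name p_highlight are hypothetical choices made for this example.

```r
library(tidyverse)

# Hypothetical subset, chosen only for illustration
df_2seater <-
  mpg %>%
  filter(class == "2seater")

p_highlight <-
  ggplot(mpg, aes(displ, hwy)) +
  # First layer inherits both data and mapping from ggplot()
  geom_point() +
  # Second layer overrides the data but still inherits the mapping
  geom_point(data = df_2seater, color = "red", size = 3)

p_highlight
```

Because the second layer still inherits aes(displ, hwy), the override data must contain columns named displ and hwy.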

28.2 Exercises

28.2.2 q2 Add non-overlapping labels to the following scatterplot using the provided df_annotate.

Hint 1: geom_label_repel() comes from the ggrepel package. Make sure to load it, and adhere to best practices!

Hint 2: You’ll have to use the data keyword to override the data layer!

## TODO: Use df_annotate below to add text labels to the scatterplot
df_annotate <-
  mpg %>%
  group_by(class) %>%
  summarize(
    displ = mean(displ),
    hwy = mean(hwy)
  )

mpg %>%
  ggplot(aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_label_repel(
    data = df_annotate,
    aes(label = class, fill = class)
  )

28.2.3 q3 Study the following scatterplot

Note whether city (cty) or highway (hwy) mileage tends to be greater. Describe the trend (visualized by geom_smooth) in mileage with engine displacement (a measure of engine size).

Note: The grey region around the smooth trend is a confidence bound; we’ll discuss these further as we get deeper into statistical literacy.

## NOTE: No need to modify! Just analyze the scatterplot
mpg %>%
  pivot_longer(cols = c(hwy, cty), names_to = "source", values_to = "mpg") %>%
  ggplot(aes(displ, mpg, color = source)) +
  geom_point() +
  geom_smooth() +
  scale_color_discrete(name = "Mileage Type") +
  labs(
    x = "Engine displacement (liters)",
    y = "Mileage (mpg)"
  )
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Observations:
- hwy mileage tends to be larger; driving on the highway is more efficient
- Mileage tends to decrease with engine size; cars with larger engines tend to be less efficient
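
These visual impressions can be cross-checked with a few summary numbers. The sketch below is a rough check, not part of the original exercise: it compares mean mileage and computes the correlation of each mileage type with displacement. Correlation only measures linear association, so it is a coarser summary than the smooth trend. The name df_check is illustrative.

```r
library(tidyverse)

df_check <-
  mpg %>%
  summarize(
    mean_hwy = mean(hwy),
    mean_cty = mean(cty),
    # Negative values indicate mileage falling as displacement rises
    cor_displ_hwy = cor(displ, hwy),
    cor_displ_cty = cor(displ, cty)
  )

df_check
```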

28.3 Aside: Scatterplot vs bar chart

Why use a scatterplot vs a bar chart? A bar chart is useful for emphasizing some threshold, typically zero, because each bar is drawn from that baseline. Let’s look at a few examples:

28.4 Raw populations

Two visuals of the same data:

economics %>%
  filter(date > lubridate::ymd("2010-01-01")) %>%
  ggplot(aes(date, pop)) +
  geom_col()

Here we’re emphasizing zero: every bar starts at zero, so the change over time is small relative to the bar heights and we don’t see much of a change.

economics %>%
  filter(date > lubridate::ymd("2010-01-01")) %>%
  ggplot(aes(date, pop)) +
  geom_point()

Here we’re not emphasizing zero; the scale is adjusted to emphasize the trend in the data.
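
To see that the difference comes from the axis baseline rather than from the shape of the marks, we can force the scatterplot to include zero. The sketch below uses expand_limits(), which is one possible way to do this; the name p_zero is illustrative.

```r
library(tidyverse)

p_zero <-
  economics %>%
  filter(date > lubridate::ymd("2010-01-01")) %>%
  ggplot(aes(date, pop)) +
  geom_point() +
  # Stretch the y axis so that it includes zero, as the bar chart does
  expand_limits(y = 0)

p_zero
```

With zero included, the points should look nearly as flat as the tops of the bars in the first plot.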

28.5 Population changes

Two visuals of the same data:

economics %>%
  mutate(pop_delta = pop - lag(pop)) %>%
  filter(date > lubridate::ymd("2005-01-01")) %>%
  ggplot(aes(date, pop_delta)) +
  geom_col()

Here we’re emphasizing zero, so we can easily see the month of negative change.

economics %>%
  mutate(pop_delta = pop - lag(pop)) %>%
  filter(date > lubridate::ymd("2005-01-01")) %>%
  ggplot(aes(date, pop_delta)) +
  geom_point()

Here we’re not emphasizing zero; we can easily see the outlier month, but we have to read the axis to see that this is a case of negative growth.
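
If we prefer points but still want the zero threshold to be visible, a reference line is one option. The sketch below adds geom_hline() as an extra layer; the dashed line style and the name p_ref are arbitrary choices for illustration.

```r
library(tidyverse)

p_ref <-
  economics %>%
  mutate(pop_delta = pop - lag(pop)) %>%
  filter(date > lubridate::ymd("2005-01-01")) %>%
  ggplot(aes(date, pop_delta)) +
  # Reference line at zero, drawn first so the points sit on top of it
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_point()

p_ref
```

Any point below the dashed line is a month of negative change, which we can now read without consulting the axis labels.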

For more, see Bars vs Dots.