20 Vis: Histograms

Purpose: Histograms are a key tool for EDA. In this exercise we’ll get a little more practice constructing and interpreting histograms and densities.

Reading: Histograms
Topics: (All topics)
Reading Time: ~20 minutes

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

20.0.1 q1 Using the graphs generated in the chunks q1-vis1 and q1-vis2 below, answer:

  • Which class has the most vehicles?
  • Which class has the broadest distribution of cty values?
  • Which graph—vis1 or vis2—best helps you answer each question?
## NOTE: No need to modify
mpg %>%
  ggplot(aes(cty, color = class)) +
  geom_freqpoly(bins = 10)

  • From this graph, it’s easy to see that suv is the most numerous class
## NOTE: No need to modify
mpg %>%
  ggplot(aes(cty, color = class)) +
  geom_density()

  • From this graph, it’s easy to see that subcompact has the broadest distribution

In my opinion, the breadth of the subcompact distribution is easier to see in the density plot (q1-vis2), while the class counts are easier to read from the frequency polygon (q1-vis1).
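These visual answers can be cross-checked numerically. The sketch below is not part of the original exercise: it counts the vehicles in each class and summarizes the spread of cty within each class. The standard deviation and range are assumed here as measures of "broadness" (other measures would also be reasonable), and the names class_counts and class_spread are purely illustrative.

```r
library(tidyverse)

# Number of vehicles in each class, largest first
class_counts <- mpg %>%
  count(class, sort = TRUE)

# Spread of cty within each class.
# ASSUMPTION: sd and range stand in for "broadness"; neither is prescribed by the exercise.
class_spread <- mpg %>%
  group_by(class) %>%
  summarize(
    cty_sd = sd(cty),
    cty_range = max(cty) - min(cty)
  ) %>%
  arrange(desc(cty_range))

class_counts
class_spread
```

This also suggests why each graph suits a different question: a frequency polygon plots raw counts, so class size is directly visible, whereas a density rescales every curve to unit area, which hides class size but makes the widths of the distributions comparable.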

In the previous exercise, we learned how to facet a graph. Let’s use that part of the grammar of graphics to clean up the graph above.

20.0.2 q2 Modify q1-vis2 to use a facet_wrap() on the class. “Free” the vertical axis with the scales keyword to allow for a different y scale in each facet.

mpg %>%
  ggplot(aes(cty)) +
  geom_density() +
  facet_wrap(~class, scales = "free_y")

In the reading, we learned that the “most important thing” to keep in mind with geom_histogram() and geom_freqpoly() is to explore different binwidths. We’ll explore this idea in the next question.
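One way to explore binwidths systematically is to draw the same histogram several times with different settings. The following is a minimal sketch using the mpg data from above; the particular binwidth values are arbitrary choices made for illustration, and the names binwidths and plots are hypothetical.

```r
library(tidyverse)

# ASSUMPTION: candidate binwidths (in mpg units) chosen arbitrarily for comparison
binwidths <- c(0.5, 1, 2, 4)

# Build one histogram of cty per candidate binwidth
plots <- map(
  binwidths,
  ~ ggplot(mpg, aes(cty)) +
    geom_histogram(binwidth = .x) +
    labs(title = str_c("binwidth = ", .x))
)

# Draw each histogram in turn
walk(plots, print)
```

Features that persist across several binwidths are more likely to reflect the data than the binning.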

20.0.3 q3 Analyze the following graph; make sure to test different binwidths. What patterns do you see? Which patterns remain as you change the binwidth?

## TODO: Run this chunk; play with different bin widths
diamonds %>%
  filter(carat < 1.1) %>%
  ggplot(aes(carat)) +
  geom_histogram(binwidth = 0.01, boundary = 0.005) +
  scale_x_continuous(
    breaks = seq(0, 1, by = 0.1)
  )

Observations:

  • The largest numbers of diamonds tend to fall at or just above even tenths of a carat.
  • The peak near 0.5 is very broad compared to the others.
  • The peak at 0.3 is the most numerous.
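Because the carat values in diamonds appear to be recorded to two decimal places, these observations can also be checked without any binning by tallying each distinct value. This is only a sketch for cross-checking the histogram, and carat_counts is an illustrative name rather than anything from the exercise.

```r
library(tidyverse)

# Tally diamonds at each recorded carat value below 1.1, most common first
carat_counts <- diamonds %>%
  filter(carat < 1.1) %>%
  count(carat, sort = TRUE)

# Show the ten most frequent carat values
carat_counts %>%
  head(10)
```

Unlike the histogram, this table does not depend on a choice of binwidth or boundary, so it is a useful sanity check on which peaks are real.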