
Vis: Bar Charts#

Purpose: Bar charts are a key tool for EDA. In this exercise, we’ll learn how to construct a variety of different bar charts, as well as when—and when not—to use various charts.

Setup#

import grama as gr
DF = gr.Intention()
%matplotlib inline

We’ll use the mpg dataset from plotnine: This is a dataset describing different automobiles, including their mileage (hence mpg).

from plotnine.data import mpg as df_mpg

Bars and Cols#

A bar chart visualizes data using bars. A bar chart is most effective at showing a continuous variable against a discrete one.

With ggplot we have two ways to make a bar chart: The first is geom_bar(), which takes just one aesthetic x. The geometry geom_bar() visualizes the number of observations (count) in the dataset associated with each unique value of the given variable. For instance, the following plot shows the number of vehicles according to each class.

# NOTE: No need to edit
(
    df_mpg
    >> gr.ggplot(mapping=gr.aes(x="class"))
    + gr.geom_bar()
)

../_images/d3b24bcd3258244ed1f10d8b206e47bd35bbf5b10a13d53953594ead11c60a64.png


Clearly, SUVs, compacts, and midsize vehicles are the most common classes in the dataset.
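To make the counting behind geom_bar() concrete, here is a minimal plain-Python sketch of the idea: one count per unique value, which then becomes one bar. The list classes is made-up data and count_by_value() is a hypothetical helper for illustration; neither is part of grama or plotnine, and this is not how the library is implemented.

```python
from collections import Counter


def count_by_value(values):
    """Count how many times each unique value occurs (one bar per key)."""
    return dict(Counter(values))


# Made-up observations, one entry per vehicle (assumption for illustration)
classes = ["suv", "compact", "suv", "midsize", "suv", "compact"]

counts = count_by_value(classes)
print(counts)
```

Each key would correspond to a position on the x axis, and each count to a bar height.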

The other bar geometry is geom_col(), which takes two aesthetics: x and y. For each x value, geom_col() draws a bar extending from zero to the given y value. The following gives a simple demo with made-up data.

# NOTE: No need to edit
(
    gr.df_make(
        category=["A", "B"],
        value=[3, 5],
    )
    >> gr.ggplot(gr.aes(x="category", y="value"))
    + gr.geom_col()
)

../_images/f91059543a7e0c985f8ce58915840b286092083aadba67c1d6d7bd7b16e6410a.png


We can actually recreate a geom_bar() plot by using tf_count() and geom_col(), which you’ll do in the next task.

q1 Convert bars to cols#

Recreate the following plot using geom_col().

# TASK: Convert this plot to use geom_col()

(
    df_mpg
    >> gr.tf_count(DF.trans)
    >> gr.ggplot(gr.aes(x="trans", y="n"))
    + gr.geom_col()
)

../_images/d46769455a93582382c39f1ba96fc79b5ea6a3cdcfe54ab178723831bc33c8db.png


Note that the labels for trans overlap; we’ll fix that in the next section.

Challenges with bar charts#

There are a few “gotchas” when visualizing with bar charts; we’ll go over two:

Overlapping Labels#

We saw in the previous plot that when our x variable has a lot of levels, the labels can overlap. One simple way to fix this is to flip the coordinates. We can’t simply swap the aesthetics x and y, as this will not give us what we want:

# NOTE: No need to edit; run and inspect
(
    df_mpg
    >> gr.tf_count(DF.trans)
    >> gr.ggplot(gr.aes(y="trans", x="n"))
    + gr.geom_col()
)

../_images/3b85c6ccd1eea21b849e90d707fb8b8dcadc211cfdbcb0bf7bed88d8fa7f1506.png


Instead, we can flip the entire plot using coord_flip(). We use this by adding it to the ggplot object:

# NOTE: Template only; df_data, "x", and "y" are placeholders for your own data
(
    df_data
    >> gr.ggplot(gr.aes(x="x", y="y"))
    + gr.geom_col()
    + gr.coord_flip()
)

q2 Flip coordinates to fix overlap#

Flip the coordinates to fix the overlapping labels in the following plot.

# TASK: Flip the coordinates to fix the overlapping labels
(
    df_mpg
    >> gr.ggplot(gr.aes(x="trans"))
    + gr.geom_bar()
    + gr.coord_flip()
)

../_images/5a2c89d6a343398b2ef3541a7a70fbcbc5c812854b554a62db560530a45aea3d.png


1-to-1 Data#

A bar chart draws a bar for every observation; this means that the data need to be “1-to-1”. This is an important limitation of bar charts, which is best understood through an example:

q3 Inspect the plot#

Inspect the following plot, and answer the questions under observations below.

# TASK: No need to edit; run and inspect
(
    df_mpg
    >> gr.ggplot(gr.aes(x="cty", y="hwy"))
    + gr.geom_col()
)

../_images/b4862767ff2752da4880cd195eaf0414824003410c57e5b277685d555087fd23.png


Observations

What is the largest hwy value shown in the plot above? Does this seem like a realistic value for the highway mileage?
- The largest hwy value is over 600; this is totally unreasonable!

The following plot helps us understand the issue: With outlines around each bar, we can see that there are multiple stacked bars at each x level.

# TASK: No need to edit; run and inspect
(
    df_mpg
    >> gr.ggplot(gr.aes(x="cty", y="hwy"))
    + gr.geom_col(color="black")
)

../_images/916f2fbd7f63c84772b45032811cc3e032fee58b5886363c9114441ad864d8ec.png

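The stacking effect can be sketched in plain Python: when several observations share an x value, the bars pile up, so the apparent height is the sum of their y values. The rows below are made-up (cty, hwy) pairs and stacked_heights() is a hypothetical helper for illustration only, not a grama function.

```python
from collections import defaultdict


def stacked_heights(rows):
    """Sum the y values that share an x value, mimicking stacked columns."""
    totals = defaultdict(int)
    for x, y in rows:
        totals[x] += y
    return dict(totals)


# Made-up (cty, hwy) pairs; two of the x levels have repeated observations
rows = [(11, 15), (11, 17), (18, 26), (18, 27), (18, 29), (9, 12)]

heights = stacked_heights(rows)
print(heights)
```

The summed heights are larger than any single hwy value in the rows, which is the same effect that produces the unrealistic totals in the plot.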

To avoid this stacking, the data need to have just one observation for each level of the horizontal variable. Put differently, the data must be 1-to-1. We can check this with some simple counting.

q4 Check if data are 1-to-1#

If the data were 1-to-1 in the cty to hwy values, then there would be only one hwy value for each unique cty value. Check whether this is the case in df_mpg.

# TASK: Check if the data are 1-to-1 (in cty and hwy)
(
    df_mpg
    >> gr.tf_count(DF.cty, DF.hwy)
    >> gr.tf_head()
)

   cty  hwy   n
0    9   12   5
1   11   14   2
2   11   15  10
3   11   16   3
4   11   17   5

Observations

Are the data 1-to-1? Why or why not?
- No, the data are not 1-to-1: For instance, for the value cty==11, hwy takes multiple different values.
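The same check can be sketched without grama. The idea is to collect the distinct y values seen for each x value and confirm that every x has exactly one. The rows are a few illustrative (cty, hwy) pairs, and the helper names y_values_per_x() and is_one_to_one() are hypothetical, chosen for this sketch.

```python
from collections import defaultdict


def y_values_per_x(rows):
    """Map each x value to the set of distinct y values observed with it."""
    seen = defaultdict(set)
    for x, y in rows:
        seen[x].add(y)
    return dict(seen)


def is_one_to_one(rows):
    """Return True if every x value is paired with exactly one y value."""
    return all(len(ys) == 1 for ys in y_values_per_x(rows).values())


# Illustrative (cty, hwy) pairs; cty == 11 appears with several hwy values
rows = [(9, 12), (11, 14), (11, 15), (11, 16), (11, 17)]

print(y_values_per_x(rows))
print(is_one_to_one(rows))
```

Note that this sketch only checks for distinct y values; repeated identical observations would also stack in a column chart, so counting rows per x value is the stricter test.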

Design Considerations#

To close this exercise, we’ll cover some design considerations when making (bar) charts.

Picking aesthetics#

A major part of designing any plot is making choices about assigning variables to aesthetics.

One option we have is to “double-assign” a variable to multiple aesthetics. In the next task you’ll compare the efficacy of double-assigning aesthetics.

q5 Compare two plots#

Compare the following two plots, and answer the questions under observations below.

# TASK: No need to edit; run and inspect
(
    df_mpg
    >> gr.ggplot(gr.aes(x="class", fill="class"))
    + gr.geom_bar()
)

../_images/24b536c8d03abcd6c9c779ddafc2db351414914f720b98468afe4bc5af8bc4ca.png


Observations

What observations can you make?
- The suv observations are most numerous
- There are fewest 2seater observations

# TASK: No need to edit; run and inspect
# NOTE: the "drv" variable represents the "drivetrain" for each vehicle entry,
# where r is rear, f is front, and 4 is 4-wheel/all-wheel drive
(
    df_mpg
    >> gr.ggplot(gr.aes(x="class", fill="drv"))
    + gr.geom_bar()
)

../_images/f4b5d0202edd7044d277b91aa6194ed4f2b86a147e6d055352572f78acd2dbb4.png


Observations

What additional observations can you make on this version of the plot?
- Rear-wheel drive vehicles in the dataset are only 2seater, subcompact, and suv.
- All the 2seater vehicles are rear-wheel drive.
- All the minivan vehicles are front-wheel drive.

What is different in the design of this graph, as compared with the previous one?
- This version of the graph uses fill for an additional variable drv, rather than repeating the x aesthetic class.
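One way to think about the filled version of the plot: each colored segment corresponds to a count of observations for one (class, drv) pair. The plain-Python sketch below illustrates that idea with made-up rows; count_pairs() is a hypothetical helper, not part of grama.

```python
from collections import Counter


def count_pairs(rows):
    """Count observations for each (x, fill) combination (one segment per key)."""
    return dict(Counter(rows))


# Made-up (class, drv) pairs for illustration
rows = [
    ("2seater", "r"),
    ("suv", "4"),
    ("suv", "r"),
    ("suv", "4"),
    ("minivan", "f"),
]

segments = count_pairs(rows)
print(segments)
```

Pairs that never occur simply have no segment, which is how the plot reveals patterns such as a class appearing with only one drivetrain.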

q6 Pros and cons of double-assignment#

Answer the questions below:

What are some pros of double-assigning a single variable to multiple aesthetics?
- Double-assigning a variable emphasizes that variable more strongly.

What are some pros of single-assigning aesthetics, in order to show more variables?
- Showing more variables opens up the possibility of seeing more patterns.
