# Vis: Bar Charts

## Contents

# Vis: Bar Charts#

*Purpose*: *Bar charts* are a key tool for EDA. In this exercise, we’ll learn how to construct a variety of different bar charts, as well as when—and when *not*—to use various charts.

## Setup#

```
import grama as gr
DF = gr.Intention()
%matplotlib inline
```

We’ll use the `mpg`

dataset from `plotnine`

: This is a dataset describing different automobiles, including their mileage (hence mpg).

```
from plotnine.data import mpg as df_mpg
```

# Bars and Cols#

A bar chart visualizes data using bars. A bar chart is most effective at showing a continuous variable against a discrete one.

With ggplot we have two ways to make a bar chart: The first is `geom_bar()`

, which takes just one aesthetic `x`

. The geometry `geom_bar()`

visualizes the number of observations (count) in the dataset associated with each unique value of the given variable. For instance, the following plot shows the number of vehicles according to each `class`

.

```
# NOTE: No need to edit
(
df_mpg
>> gr.ggplot(mapping=gr.aes(x="class"))
+ gr.geom_bar()
)
```

```
<ggplot: (8773447564129)>
```

Clearly, there are far more SUVs, compacts, and midsize vehicles in the dataset than other classes.

The other bar geometry is `geom_col()`

, which takes two aesthetics. The geometry `geom_col()`

extends from zero to a desired value `y`

, within each `x`

value. The following gives a simple demo with made-up data.

```
# NOTE: No need to edit
(
gr.df_make(
category=["A", "B"],
value=[3, 5],
)
>> gr.ggplot(gr.aes(x="category", y="value"))
+ gr.geom_col()
)
```

```
<ggplot: (8773447855737)>
```

We can actually recreate a `geom_bar()`

plot by using `tf_count()`

and `geom_col()`

, which you’ll do in the next task.

**q1** Convert bars to cols#

Recreate the following plot using `geom_col()`

.

```
# TASK: Convert this plot to use geom_col()
(
df_mpg
>> gr.tf_count(DF.trans)
>> gr.ggplot(gr.aes(x="trans", y="n"))
+ gr.geom_col()
)
# solution-end
```

```
<ggplot: (8773460739292)>
```

Note that the labels for `trans`

overlap; we’ll fix that in the next section.

# Challenges with bar charts#

There are a few “gotchas” when visualizing with bar charts; we’ll go over two:

## Overlapping Labels#

We saw in the previous plot that when our `x`

variable has a lot of levels, the labels can overlap. One simple way to fix this is to flip the coordinates. We can’t simply swap the aesthetics `x`

and `y`

, as this will not give us what we want:

```
# NOTE: No need to edit; run and inspect
(
df_mpg
>> gr.tf_count(DF.trans)
>> gr.ggplot(gr.aes(y="trans", x="n"))
+ gr.geom_col()
)
```

```
<ggplot: (8773447758800)>
```

Instead, we can *flip* the entire plot using `coord_flip()`

. We use this by adding it to the `ggplot`

object:

```
(
df_data
>> gr.ggplot(gr.aes(x="x", y="y"))
+ gr.geom_col()
+ gr.coord_flip()
)
```

**q2** Flip coordinates to fix overlap#

Flip the coordinates to fix the overlapping labels in the following plot.

```
# TASK: Flip the coordinates to fix the overlapping labels
(
df_mpg
>> gr.ggplot(gr.aes(x="trans"))
+ gr.geom_bar()
+ gr.coord_flip()
)
```

```
<ggplot: (8773412858534)>
```

## 1-to-1 Data#

A bar chart draws a bar for every observation, this means that the data need to be “1-to-1”. This is an important limitation of bar charts, which is best understood through an example:

**q3** Inspect the plot#

Inspect the following plot, and answer the questions under *observations* below.

```
# TASK: No need to edit; run and inspect
(
df_mpg
>> gr.ggplot(gr.aes(x="cty", y="hwy"))
+ gr.geom_col()
)
```

```
<ggplot: (8773447808994)>
```

*Observations*

What is the largest

`hwy`

value shown in the plot above? Does this seem like a realistic value for the highway mileage?The largest

`hwy`

value is over 600; this is totally unreasonable!

The following plot helps us understand the issue: With outlines around each bar, we can see that there are *multiple stacked bars* at each `x`

level.

```
# TASK: No need to edit; run and inspect
(
df_mpg
>> gr.ggplot(gr.aes(x="cty", y="hwy"))
+ gr.geom_col(color="black")
)
```

```
<ggplot: (8773460820312)>
```

In order to avoid overlap, the data need to have just one observation for each level of the horizontal factor. Put differently, the data must be 1-to-1. We can check this with some simple counting.

**q4** Check if data are 1-to-1#

If the data were 1-to-1 in the `cty`

to `hwy`

values, then there would be only one `hwy`

value for each unique `cty`

value. Check whether this is the case in `df_mpg`

.

```
# TASK: Check if the data are 1-to-1 (in cty and hwy)
(
df_mpg
>> gr.tf_count(DF.cty, DF.hwy)
>> gr.tf_head()
)
```

cty | hwy | n | |
---|---|---|---|

0 | 9 | 12 | 5 |

1 | 11 | 14 | 2 |

2 | 11 | 15 | 10 |

3 | 11 | 16 | 3 |

4 | 11 | 17 | 5 |

*Observations*

Is the data 1-to-1? Why or why not?

No, the data are not 1-to-1: For instance, for the value

`cty==11`

,`hwy`

takes multiple different values.

# Design Considerations#

To close this exercise, we’ll cover some design considerations when making (bar) charts.

## Picking aesthetics#

A major part of designing any plot is making choices about assigning variables to aesthetics.

One option we have is to “double-assign” a variable to multiple aesthetics. In the next task you’ll compare the efficacy of double-assigning aesthetics.

**q5** Compare two plots#

Compare the following two plots, and answer the questions under *observations* below.

```
# TASK: No need to edit; run and inspect
(
df_mpg
>> gr.ggplot(gr.aes(x="class", fill="class"))
+ gr.geom_bar()
)
```

```
<ggplot: (8773427264856)>
```

*Observations*

What observations can you make?

The

`suv`

observations are most numerousThere are fewest

`2seater`

observations

```
# TASK: No need to edit; run and inspect
# NOTE: the "drv" variable represent the "drivetrain" for each vehicle entry
# where r is rear, f is front, and 4 is 4-wheel/all wheel drive
(
df_mpg
>> gr.ggplot(gr.aes(x="class", fill="drv"))
+ gr.geom_bar()
)
```

```
<ggplot: (8773461126304)>
```

*Observations*

What

**additional**observations can you make on this version of the plot?Rear-wheel drive vehicles in the dataset are only

`2seater`

,`subcompact`

, and`suv`

.All the

`2seater`

vehicles are rear-wheel drive.All the

`minivan`

vehicles are forward-wheel drive.

What is different in the design of this graph, as compared with the previous one?

This version of the graph uses

`fill`

for an additional variable`drv`

, rather than repeating the`x`

aesthetic`class`

.

**q6** Pros and cons of double-assignment#

Answer the questions below:

What are some pros of double-assigning a single variable to multiple aesthetics?

Double-assigning a variable can more highly-emphasize a variable.

What are some pros of single-assigning aesthetics, in order to show more variables?

Showing more variables opens up the possibility of seeing more patterns.