# Stats: Introduction to Exploratory Data Analysis

*Purpose*: *Exploratory Data Analysis* (EDA) is a **crucial** skill for a practicing data scientist. Unfortunately, much like human-centered design, EDA is hard to teach. This is because EDA is **not** a strict procedure so much as a **mindset**. Also like human-centered design, EDA is an *iterative, nonlinear process*. There are two key principles to keep in mind when doing EDA:

1. **Curiosity**: Generate lots of ideas and hypotheses about your data.
2. **Skepticism**: Remain unconvinced of those ideas unless you can find credible patterns to support them.

Since EDA is both *crucial* and *difficult*, we will practice doing EDA *a lot* in this course!

## Reading

*Reading*: Exploratory Data Analysis

*Topics*: (All topics)

*Reading Time*: ~45 minutes

## Setup

```
import grama as gr
DF = gr.Intention()
%matplotlib inline
```

We’ll study the diamonds dataset for this exercise.

```
from grama.data import df_diamonds

df_diamonds = (
    df_diamonds
    >> gr.tf_mutate(
        # Order the cut to aid in plotting
        cut=gr.as_factor(
            DF.cut,
            categories=[
                "Fair",
                "Good",
                "Very Good",
                "Premium",
                "Ideal",
            ],
        )
    )
)
```
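*Aside*: Why bother ordering `cut`? Left alone, text values sort alphabetically, which would scramble the quality ranking along a plot axis. The sketch below uses plain pandas rather than grama, on a few made-up values, and assumes that `gr.as_factor()` behaves roughly like a pandas categorical with a fixed category order. The names `raw`, `cut_levels`, and `quality_order` are hypothetical.

```python
import pandas as pd

# Assumed quality ranking, copied from the setup chunk above
cut_levels = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

# A few made-up cut values, stored as plain strings
raw = pd.Series(["Ideal", "Premium", "Good", "Fair", "Very Good"])

# Plain strings sort alphabetically, which ignores quality
alphabetical = sorted(raw.unique())

# An ordered categorical sorts by the category order we supplied
cut = pd.Series(pd.Categorical(raw, categories=cut_levels, ordered=True))
quality_order = list(cut.sort_values())

print(alphabetical)
print(quality_order)
```

With the ordered version, bars and boxplots can line up from worst cut to best, which makes trends across `cut` much easier to read.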

# Basic EDA Tools

There are a few simple tools we can use to investigate a dataset. We should use these tools even before making visuals of the data.

### **q1** Take the head

Use the appropriate function to get the first 5 observations in `df_diamonds`. Answer the questions under *observations* below.

```
# TASK: Get the first 5 observations
(
    df_diamonds
    >> gr.tf_head(5)
)
```

| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |

*Observations*

- What variables does this dataset have?
  - `carat`, `cut`, `color`, `clarity`, `depth`, `table`, `price`, `x`, `y`, `z`

### **q2** Use descriptive statistics

The `gr.tf_describe()` function gives useful descriptive statistics on a dataset. Use these values to answer the questions under *observations* below.

```
# NOTE: No need to edit; run and inspect
(
    df_diamonds
    >> gr.tf_describe()
)
```

| | carat | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|
| count | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 |
| mean | 0.797940 | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
| std | 0.474011 | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
| min | 0.200000 | 43.000000 | 43.000000 | 326.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.400000 | 61.000000 | 56.000000 | 950.000000 | 4.710000 | 4.720000 | 2.910000 |
| 50% | 0.700000 | 61.800000 | 57.000000 | 2401.000000 | 5.700000 | 5.710000 | 3.530000 |
| 75% | 1.040000 | 62.500000 | 59.000000 | 5324.250000 | 6.540000 | 6.540000 | 4.040000 |
| max | 5.010000 | 79.000000 | 95.000000 | 18823.000000 | 10.740000 | 58.900000 | 31.800000 |

*Observations*

- How many observations are in the dataset?
  - There are `53940` observations.
- What is a typical value for the `price` of a diamond, according to this dataset?
  - A typical price is around `3000`. The mean price is `3932.80` and the median price is `2401`.
- What is the largest diamond in the dataset (according to `carat`)? What is the smallest?
  - The smallest carat is `0.2` and the largest carat is `5.01`.
- You identified all the variables in the dataset in **q1** above. Do the results from `gr.tf_describe()` provide information on **all** of these variables?
  - No: we do not see results for `cut`, `color`, or `clarity`.
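*Aside*: The gap between the mean price (`3932.80`) and the median price (`2401`) is itself a clue: it suggests that a relatively small number of very expensive diamonds pull the mean upward while barely moving the median. The sketch below shows the same effect on a short list of made-up prices (not taken from `df_diamonds`), using only Python's standard `statistics` module.

```python
import statistics

# Made-up, right-skewed prices: mostly modest, with a couple of very large values
prices = [400, 700, 900, 1200, 2400, 3500, 5300, 9000, 18000]

mean_price = statistics.mean(prices)      # sensitive to the large values
median_price = statistics.median(prices)  # the middle value; robust to them

print("mean:", mean_price)
print("median:", median_price)
```

When the two summaries disagree this much, it is worth reporting both rather than picking a single "typical" value.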

## Distinct Values (levels)

Variables that do not take numerical values are sometimes called *categorical variables*; there are other tools that are useful for investigating categorical variables.

The verb `gr.tf_distinct()` is like `gr.tf_filter()`, but it filters for rows that are *distinct* according to the given variables. For instance, if we wanted to know what distinct values of `x` exist in `df_data`, we would call:

```
(
    df_data
    >> gr.tf_distinct(DF.x)
)
```

*Aside*: A categorical variable is sometimes called a *factor*. The unique values of a categorical variable are called *levels*.

We can use `gr.tf_distinct()` to figure out what values show up for a categorical variable.
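To build intuition for what "filtering for distinct rows" means, here is a rough pandas analog on a small made-up `df_data`. It assumes the verb keeps the first full row seen for each distinct value, which is how `drop_duplicates()` behaves in pandas; treat it as an illustration of the idea, not a description of grama's internals.

```python
import pandas as pd

# Hypothetical stand-in for df_data, with a repeated categorical column x
df_data = pd.DataFrame({
    "x": ["Ideal", "Premium", "Ideal", "Good", "Premium"],
    "y": [0.23, 0.21, 0.30, 0.23, 0.29],
})

# Keep the first row for each distinct value of x; other columns come along
df_first = df_data.drop_duplicates(subset=["x"]).reset_index(drop=True)

# If only the levels themselves are of interest, unique() is enough
levels = df_data["x"].unique().tolist()

print(df_first)
print(levels)
```

Note that the result still carries every column, so the distinct values are read off the `x` column alone; the other columns just show whichever row happened to come first.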

### **q3** Find the distinct `cut` values

Use `gr.tf_distinct()` to find the unique values of `cut` in `df_diamonds`.

```
# TASK: Find the distinct `cut` values in the dataset
(
    df_diamonds
    >> gr.tf_distinct(DF.cut)
)
```

| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.24 | Very Good | J | VVS2 | 62.8 | 57.0 | 336 | 3.94 | 3.96 | 2.48 |
| 4 | 0.22 | Fair | E | VS2 | 65.1 | 61.0 | 337 | 3.87 | 3.78 | 2.49 |

## Counts

Another approach to assessing a categorical variable is to simply count the number of rows that correspond to each distinct value. We can do this with the `gr.tf_count()` verb. For instance, if we wanted to know how many rows there are for each value of `x` in `df_data`, we would call:

```
(
    df_data
    >> gr.tf_count(DF.x)
)
```
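Counting rows per level is a group-and-tally operation. As a rough pandas analog (an assumption about equivalent behavior, not grama's actual implementation), the sketch below tallies a made-up `df_data` and names the count column `n`, matching the column name that appears in the `gr.tf_count()` output in this exercise.

```python
import pandas as pd

# Hypothetical stand-in for df_data
df_data = pd.DataFrame({"x": ["a", "b", "a", "c", "a", "b"]})

# Group rows by x, count the rows in each group, and store the tally in column n
df_counts = (
    df_data
    .groupby("x")
    .size()
    .reset_index(name="n")
)

print(df_counts)
```

Unlike the distinct-values approach, counts also reveal how *balanced* the levels are, which matters when comparing groups later on.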

### **q4** Find the count of `cut` values

Use `gr.tf_count()` to find the number of rows for each distinct `cut` value in `df_diamonds`.

```
# TASK: Count the rows for each distinct `cut` value in the dataset
(
    df_diamonds
    >> gr.tf_count(DF.cut)
)
```

| | cut | n |
|---|---|---|
| 0 | Fair | 1610 |
| 1 | Good | 4906 |
| 2 | Very Good | 12082 |
| 3 | Premium | 13791 |
| 4 | Ideal | 21551 |

# Guided EDA

I’m going to walk you through a train of thought I had when studying the diamonds dataset.

There are four standard “C’s” of judging a diamond. These are `carat`, `cut`, `color`, and `clarity`, all of which are in the `df_diamonds` dataset.

*Note*: The remainder of this exercise will consist of interpreting pre-made graphs. You can run the whole notebook to generate all the figures at once. Just make sure to do all the exercises and write your observations!

## Hypothesis 1

Here’s a hypothesis:

> `Ideal` is the “best” value of `cut` for a diamond. Since an `Ideal` cut seems more labor-intensive, I hypothesize that `Ideal` cut diamonds are less numerous than other cuts.

### **q5** Assess hypothesis 1

Run the chunk below, and study the plot. Was hypothesis 1 correct? Why or why not?

```
# NOTE: No need to edit; run and inspect
(
    df_diamonds
    >> gr.ggplot(gr.aes("cut"))
    + gr.geom_bar()
)
```

*Figure: bar chart of the number of diamonds at each `cut` level.*

*Observations*

- Is hypothesis 1 true or not?
  - The hypothesis was wrong: `Ideal` cut diamonds are *more* numerous than all other cuts! Perhaps because cutting a diamond is easier than mining a new one, gemcutters add value to a diamond by striving for an ideal cut.

## Hypothesis 2

Another hypothesis:

> The `Ideal` cut diamonds should be the most pricey.

### **q6** Assess hypothesis 2

Study the following graph; does it support, contradict, or not relate to hypothesis 2?

```
# NOTE: No need to edit; run and inspect
(
    df_diamonds
    >> gr.ggplot(gr.aes("cut", "price"))
    + gr.geom_point()
)
```

*Figure: scatterplot of `price` against `cut`.*

*Observations*

- Does this plot support, contradict, or not relate to hypothesis 2?
  - This graph is virtually useless! There is severe overplotting, so we cannot address hypothesis 2 with this graph.

The following is a set of *boxplots*: the middle bar denotes the median, the edges of each box denote the lower and upper *quartiles* (so the box holds the middle half of the data), and the lines and dots denote unusually large or small values and outliers.
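To make the pieces of a boxplot concrete, here is a small sketch that computes them by hand for a handful of made-up values. It assumes the common convention of flagging points more than 1.5 times the interquartile range (IQR) beyond the box as outliers; plotting libraries can use other rules, so check the documentation for the one you use. The variable names are hypothetical.

```python
import statistics

# Made-up values with one unusually large observation
values = [1, 2, 3, 4, 5, 6, 7, 8, 30]

# Lower quartile, median, and upper quartile (linear interpolation between points)
q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")

# The box spans q1 to q3; its height is the interquartile range
iqr = q3 - q1

# Assumed 1.5 * IQR rule: anything beyond these fences is drawn as a dot
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print("box:", q1, med, q3)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

Because a boxplot reduces each group to a few numbers, it stays readable even when tens of thousands of points would overplot in a scatterplot.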

### **q7** Assess hypothesis 2, take 2

Study the following graph; does it support or contradict hypothesis 2?

```
# NOTE: No need to edit; run and inspect
(
    df_diamonds
    >> gr.ggplot(gr.aes("cut", "price"))
    + gr.geom_boxplot()
)
```

*Figure: boxplots of `price` for each `cut` level.*

*Observations*

- Does this plot support or contradict hypothesis 2?
  - It contradicts it: `Ideal` diamonds tend to be the *least* pricey! This was very surprising to me.

Upon making the graph in **q7**, I was *very* surprised, so I did some reading on diamond cuts. It turns out that some gemcutters sacrifice cut for carat. Could this effect explain the surprising pattern above?

### **q8** Unravel hypothesis 2

Study the following graph; does it support a “sacrifice cut for carat” hypothesis? How might this relate to price?

*Hint*: The article linked above will help you answer these questions!

```
# NOTE: No need to edit; run and inspect
(
    df_diamonds
    >> gr.ggplot(gr.aes("cut", "carat"))
    + gr.geom_boxplot()
)
```

*Figure: boxplots of `carat` for each `cut` level.*

*Observations*

- The median `carat` of `Ideal` diamonds is a fair bit lower than that of other cuts. This provides some evidence that gemcutters trade `cut` for `carat`.
- The very largest `carat` diamonds tend to be of `Fair` cut; this makes sense, as cutting the gemstone will only reduce weight.
- It seems that many diamond purchasers are more interested in carat than fine cut. This provides some rationale for why `Ideal` diamonds are cheaper: they tend to be lower-carat.
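In the spirit of skepticism, one way to probe this explanation further is to compare prices *after* accounting for size, for example by looking at price per carat within each cut. The sketch below shows the mechanics in pandas on invented numbers; the values are chosen only to illustrate how a group can be cheaper overall yet pricier per carat, and they are **not** results from `df_diamonds`. The column name `price_per_carat` is hypothetical.

```python
import pandas as pd

# Invented toy data: the "Ideal" stones are smaller than the "Fair" stones
df_toy = pd.DataFrame({
    "cut": ["Fair", "Fair", "Fair", "Ideal", "Ideal", "Ideal"],
    "carat": [1.0, 1.2, 1.5, 0.4, 0.5, 0.6],
    "price": [3000, 3600, 4500, 1400, 1750, 2100],
})

# Normalize price by size so stones of different carat can be compared
df_toy["price_per_carat"] = df_toy["price"] / df_toy["carat"]

# Median raw price and median price per carat within each cut
summary = df_toy.groupby("cut")[["price", "price_per_carat"]].median()

print(summary)
```

In this toy example the `Ideal` group has the lower median price but the higher median price per carat. Whether the real diamonds data behaves the same way is exactly the kind of hypothesis worth checking with a new plot.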