8 Data: Basics
Purpose: When we first study a new dataset, there are a few very simple checks we should always perform. This exercise introduces those checks.
Additionally, we’ll have our first look at the pipe operator, which will be super useful for writing code that’s readable.
Reading: (None)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
8.1 First Checks
8.1.1 q0 Run the following chunk:
Hint: You can do this either by clicking the green arrow at the top-right of
the chunk, or by using the keyboard shortcut Shift + Cmd/Ctrl + Enter.
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
This is a dataset, the fundamental object we’ll study throughout this course. Some nomenclature:
- The 1, 2, 3, ... on the left enumerate the rows of the dataset
- The names Sepal.Length, Sepal.Width, ... name the columns of the dataset
- The column Sepal.Length takes numeric values
- The column Species takes string values
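To check this nomenclature against the data itself, the following minimal sketch queries the built-in iris dataset using base R only. The particular functions chosen here are illustrative and not part of the original exercise. Note that R stores Species as a factor, a type for categorical labels that print like strings.

```r
# Illustrative checks on the built-in iris dataset (base R only)
n_rows <- nrow(iris)        # number of rows
col_names <- names(iris)    # column names
n_rows
col_names

class(iris$Sepal.Length)    # "numeric"
class(iris$Species)         # "factor": categorical labels that print like strings
```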
8.1.2 q1 Load the tidyverse and inspect the diamonds dataset. What do the
cut, color, and clarity variables mean?
Hint: You can run ?diamonds to get information on a built-in dataset.
8.1.3 q2 Run glimpse(diamonds); what variables does diamonds have?
## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
The diamonds dataset has variables carat, cut, color, clarity, depth, table, price, x, y, z.
8.1.4 q3 Run summary(diamonds); what are the common values for each of the
variables? How widely does each of the variables vary?
Hint: The Median and Mean are common values, while Min and Max give us
a sense of variation.
Observations:
- carat seems to be bounded between 0 and 5
- The highest-priced diamond in this set is $18,823!
- Some of the variables do not have min, max, etc. values. These are factors: variables that take one of a finite set of possible values.
You should always analyze your dataset in the simplest way possible, build
hypotheses, and devise more specific analyses to probe those hypotheses. The
glimpse() and summary() functions are two of the simplest tools we have.
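As one example of moving from a simple check to a more specific one, the sketch below follows glimpse() and summary() with a closer look at a single factor. The choice of cut, and the use of levels() and count(), are assumptions made for illustration rather than steps from the exercise.

```r
library(tidyverse)

# Simplest checks first: structure, then per-column summaries
glimpse(diamonds)
summary(diamonds)

# A more specific probe: which values can the factor cut take,
# and how many diamonds fall in each?
cut_levels <- levels(diamonds$cut)
cut_counts <- count(diamonds, cut)
cut_levels
cut_counts
```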
8.2 The Pipe Operator
Throughout this class we’re going to make heavy use of the pipe operator
%>%. This handy little function will help us make our code more readable.
Whenever you see %>%, you can translate that into the word “then”. For
instance
## # A tibble: 5 × 2
##   cut       carat_mean
##   <ord>          <dbl>
## 1 Fair           1.05 
## 2 Good           0.849
## 3 Very Good      0.806
## 4 Premium        0.892
## 5 Ideal          0.703
The code that produced this table would translate into the tiny “story”:
- Take the diamonds dataset, then
- Group it by the variable cut, then
- Summarize it by computing the mean of carat
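The chunk that produced the table above is not shown in this text. A pipeline consistent with that output and with the story is sketched below; the result name df_carat is hypothetical, and carat_mean is taken from the printed column header.

```r
library(tidyverse)

df_carat <-
  diamonds %>%                          # Take the diamonds dataset, then
  group_by(cut) %>%                     # group it by the variable cut, then
  summarize(carat_mean = mean(carat))   # summarize by computing the mean of carat

df_carat
```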
What the pipe actually does. The pipe operator LHS %>% RHS takes its
left-hand side (LHS) and inserts it as the first argument to the function on
its right-hand side (RHS). So the pipe will let us take glimpse(diamonds) and
turn it into diamonds %>% glimpse().
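To make the insertion rule concrete, the sketch below compares ordinary calls with their piped forms. The use of head() with an extra argument is an illustration chosen here, not taken from the exercise, and df_plain and df_piped are hypothetical names.

```r
library(tidyverse)

# These two calls are equivalent: the pipe supplies diamonds as the first argument
glimpse(diamonds)
diamonds %>% glimpse()

# Any further arguments stay inside the parentheses on the right-hand side
df_plain <- head(diamonds, 3)
df_piped <- diamonds %>% head(3)
identical(df_plain, df_piped)
```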
8.2.1 q4 Use the pipe operator to re-write summary(diamonds).
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 
8.3 Reading Data
So far we’ve only been looking at built-in datasets. Ultimately, we’ll want to read in our own data. We’ll get to the art of loading and wrangling data later, but for now, know that the readr package provides tools for reading data. Let’s quickly practice loading data below.
8.3.1 q5 Use the function read_csv() to load the file "./data/tiny.csv".
## Rows: 2 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): x, y
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 2 × 2
##       x     y
##   <dbl> <dbl>
## 1     1     2
## 2     3     4
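If ./data/tiny.csv is not at hand, the following self-contained sketch writes a stand-in file to a temporary location and reads it back. The file contents are an assumption reconstructed from the printed tibble above, and path_tmp and df_tiny are hypothetical names.

```r
library(readr)

# Write a stand-in CSV; contents assumed from the printed output above
path_tmp <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2", "3,4"), path_tmp)

# show_col_types = FALSE quiets the column-specification message
df_tiny <- read_csv(path_tmp, show_col_types = FALSE)
df_tiny
```

To load the course file instead, the same call should work with the path "./data/tiny.csv", provided the file exists relative to your working directory.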