55 Data: Cleaning
Purpose: Most of the data you’ll find in the wild are messy; you’ll need to clean those data before you can do useful work. In this case study, you’ll learn some more tricks for cleaning data. We’ll use these data for a future exercise on modeling, so we’ll build on the work you do here today.
Reading: (None, this exercise is the reading.)
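Note: the setup chunk is not visible in this rendering; it loads the tidyverse, which produces the startup message shown below. A minimal reconstruction:
## NOTE: Reconstructed setup chunk; loading the tidyverse prints the
## attach/conflict messages that follow
library(tidyverse)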
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Background: This exercise’s data come from the UCI Machine Learning Repository; specifically, their Heart Disease Data Set. These data consist of clinical measurements on patients, and are intended to help predict heart disease.
## NOTE: No need to edit; run and inspect
url_disease <- "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
filename_disease <- "./data/uci_heart_disease.csv"
## Download the data locally (assumes the ./data/ directory already exists)
curl::curl_download(
url_disease,
destfile = filename_disease
)
This is a messy dataset; one we’ll have to clean if we want to make sense of it. Let’s load the data and document the ways in which it’s messy:
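The loading chunk itself is not shown in this rendering; a minimal sketch of the kind of “naive” load that produces the messages and glimpse() output below (df_raw is a placeholder name, not from the original):
## NOTE: Sketch of a naive load; read_csv() is left to guess names and types
df_raw <- read_csv(filename_disease)
glimpse(df_raw)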
## New names:
## • `1.0` -> `1.0...2`
## • `1.0` -> `1.0...3`
## • `1.0` -> `1.0...6`
## • `0.0` -> `0.0...9`
## • `0.0` -> `0.0...12`
## Rows: 302 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): 0.0...12, 6.0
## dbl (12): 63.0, 1.0...2, 1.0...3, 145.0, 233.0, 1.0...6, 2.0, 150.0, 0.0...9...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 302
## Columns: 14
## $ `63.0` <dbl> 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 57,…
## $ `1.0...2` <dbl> 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ `1.0...3` <dbl> 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3, 2, 4, 3, 2, 1,…
## $ `145.0` <dbl> 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 130,…
## $ `233.0` <dbl> 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 256,…
## $ `1.0...6` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ `2.0` <dbl> 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2,…
## $ `150.0` <dbl> 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 142,…
## $ `0.0...9` <dbl> 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ `2.3` <dbl> 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0.6,…
## $ `3.0` <dbl> 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1, 3, 1, 1, 1, 2,…
## $ `0.0...12` <chr> "3.0", "2.0", "0.0", "0.0", "0.0", "2.0", "0.0", "1.0", "0.…
## $ `6.0` <chr> "3.0", "7.0", "3.0", "3.0", "3.0", "3.0", "3.0", "7.0", "7.…
## $ `0` <dbl> 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0,…
Observations:
- The CSV comes without column names! read_csv() got confused and assigned the first row of data as names.
- Some of the numerical columns were incorrectly assigned character type.
- Some of the columns are coded as binary values 0, 1, but they really represent variables like sex %in% c("male", "female").
Let’s tackle these problems one at a time:
55.1 Problem 1: No column names
We’ll have a hard time making sense of these data without column names. Let’s fix that.
55.1.1 q1 Transcribe the column names.
Following the dataset documentation, transcribe the correct column names and assign them as a character vector. You will use this to give the dataset sensible column names when you load it in q2.
Hint 1: The relevant section from the dataset documentation is quoted here:
Only 14 attributes used:
1. #3 (age)
2. #4 (sex)
3. #9 (cp)
4. #10 (trestbps)
5. #12 (chol)
6. #16 (fbs)
7. #19 (restecg)
8. #32 (thalach)
9. #38 (exang)
10. #40 (oldpeak)
11. #41 (slope)
12. #44 (ca)
13. #51 (thal)
14. #58 (num) (the predicted attribute)
Hint 2: A “copy-paste-edit” is probably the most effective approach here!
## TODO: Assign the column names to col_names; make sure they are strings
col_names <- c(
"age",
"sex",
"cp",
"trestbps",
"chol",
"fbs",
"restecg",
"thalach",
"exang",
"oldpeak",
"slope",
"ca",
"thal",
"num"
)
Use the following to check your code.
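The check chunk itself is hidden in this rendering; a hypothetical sketch of the kind of assertions it runs (one per expected column name), assuming assertthat:
## NOTE: Hypothetical reconstruction of the hidden check chunk
for (name in c(
  "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
  "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"
)) {
  print(assertthat::assert_that(name %in% col_names))
}
print("Well done!")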
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] "Well done!"
55.2 Problem 2: Incorrect types
We saw above that read_csv() incorrectly guessed some of the column types. Let’s fix that by manually specifying each column’s type.
55.2.1 q2 Call read_csv() with the col_names and col_types arguments. Use the column names you assigned above, and set all column types to col_number().
Hint: Remember that you can always read the documentation to learn how to use a new argument!
## TODO: Use the col_names and col_types arguments to give the data the
## correct column names, and to set their types to col_number()
df_q2 <-
read_csv(
filename_disease,
col_names = col_names,
col_types = cols(
"age" = col_number(),
"sex" = col_number(),
"cp" = col_number(),
"trestbps" = col_number(),
"chol" = col_number(),
"fbs" = col_number(),
"restecg" = col_number(),
"thalach" = col_number(),
"exang" = col_number(),
"oldpeak" = col_number(),
"slope" = col_number(),
"ca" = col_number(),
"thal" = col_number(),
"num" = col_number()
)
)
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 303
## Columns: 14
## $ age <dbl> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 5…
## $ sex <dbl> 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1…
## $ cp <dbl> 1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3, 2, 4, 3, 2, 1…
## $ trestbps <dbl> 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 1…
## $ chol <dbl> 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 2…
## $ fbs <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0…
## $ restecg <dbl> 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2…
## $ thalach <dbl> 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 1…
## $ exang <dbl> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1…
## $ oldpeak <dbl> 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0…
## $ slope <dbl> 3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1, 3, 1, 1, 1, 2…
## $ ca <dbl> 0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ thal <dbl> 6, 3, 7, 3, 3, 3, 3, 3, 7, 7, 6, 3, 6, 7, 7, 3, 7, 3, 3, 3, 3…
## $ num <dbl> 0, 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0…
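As a side note (not part of the original solution), readr’s cols() also accepts a .default argument, so the same all-numeric specification can be written more compactly:
## NOTE: Optional, equivalent alternative; df_q2_alt is an illustrative name
df_q2_alt <-
  read_csv(
    filename_disease,
    col_names = col_names,
    col_types = cols(.default = col_number())
  )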
Use the following to check your code.
## NOTE: No need to change this
assertthat::assert_that(assertthat::are_equal(names(df_q2), col_names))
## [1] TRUE
## [1] TRUE
## [1] "Nice!"
55.3 Problem 3: Uninformative values
The numeric codes given for some of the variables are uninformative; let’s replace those with more human-readable values.
Rather than go and modify our raw data, we will instead recode the variables in our loaded dataset. It is bad practice to modify your raw data! Modifying your data in code provides traceable documentation for the edits you made; this is a key part of doing reproducible science. It takes more work, but your results will be more trustworthy if you do things the right way!
55.3.1 q3 Create conversion functions to recode factor values as human-readable strings. I have provided one function (convert_sex) as an example.
Note: “In the wild” you would be responsible for devising your own sensible level names. However, I’m going to provide specific codes so that I can write unit tests to check your answers:
Variable | Levels
---------|-------
sex      | 1 = "male", 0 = "female"
cp       | 1 = "typical angina", 2 = "atypical angina", 3 = "non-anginal pain", 4 = "asymptomatic"
fbs      | 1 = TRUE, 0 = FALSE
restecg  | 0 = "normal", 1 = "ST-T wave abnormality", 2 = "Estes' criteria"
exang    | 1 = TRUE, 0 = FALSE
slope    | 1 = "upsloping", 2 = "flat", 3 = "downsloping"
thal     | 3 = "normal", 6 = "fixed defect", 7 = "reversible defect"
## NOTE: This is an example conversion
convert_sex <- function(x) {
case_when(
x == 1 ~ "male",
x == 0 ~ "female",
TRUE ~ NA_character_
)
}
convert_cp <- function(x) {
case_when(
x == 1 ~ "typical angina",
x == 2 ~ "atypical angina",
x == 3 ~ "non-anginal pain",
x == 4 ~ "asymptomatic",
TRUE ~ NA_character_
)
}
convert_fbs <- function(x) {
if_else(x == 1, TRUE, FALSE)
}
convert_restecv <- function(x) {
case_when(
x == 0 ~ "normal",
x == 1 ~ "ST-T wave abnormality",
x == 2 ~ "Estes' criteria",
TRUE ~ NA_character_
)
}
convert_exang <- function(x) {
if_else(x == 1, TRUE, FALSE)
}
convert_slope <- function(x) {
case_when(
x == 1 ~ "upsloping",
x == 2 ~ "flat",
x == 3 ~ "downsloping",
TRUE ~ NA_character_
)
}
convert_thal <- function(x) {
case_when(
x == 3 ~ "normal",
x == 6 ~ "fixed defect",
x == 7 ~ "reversible defect",
TRUE ~ NA_character_
)
}
Use the following to check your code.
## NOTE: No need to change this
assertthat::assert_that(assertthat::are_equal(
convert_cp(c(1, 2, 3, 4)),
c("typical angina", "atypical angina", "non-anginal pain", "asymptomatic")
))
## [1] TRUE
## [1] TRUE
assertthat::assert_that(assertthat::are_equal(
convert_restecv(c(0, 1, 2)),
c("normal", "ST-T wave abnormality", "Estes' criteria")
))
## [1] TRUE
## [1] TRUE
assertthat::assert_that(assertthat::are_equal(
convert_slope(c(1, 2, 3)),
c("upsloping", "flat", "downsloping")
))
## [1] TRUE
assertthat::assert_that(assertthat::are_equal(
convert_thal(c(3, 6, 7)),
c("normal", "fixed defect", "reversible defect")
))
## [1] TRUE
## [1] "Excellent!"
55.3.2 q4 Use your convert_ functions from q3 to mutate the columns and recode the variables.
df_q4 <-
df_q2 %>%
mutate(
sex = convert_sex(sex),
cp = convert_cp(cp),
fbs = convert_fbs(fbs),
restecg = convert_restecv(restecg),
exang = convert_exang(exang),
slope = convert_slope(slope),
thal = convert_thal(thal)
)
df_q4
## # A tibble: 303 × 14
## age sex cp trest…¹ chol fbs restecg thalach exang oldpeak slope
## <dbl> <chr> <chr> <dbl> <dbl> <lgl> <chr> <dbl> <lgl> <dbl> <chr>
## 1 63 male typical… 145 233 TRUE Estes'… 150 FALSE 2.3 down…
## 2 67 male asympto… 160 286 FALSE Estes'… 108 TRUE 1.5 flat
## 3 67 male asympto… 120 229 FALSE Estes'… 129 TRUE 2.6 flat
## 4 37 male non-ang… 130 250 FALSE normal 187 FALSE 3.5 down…
## 5 41 female atypica… 130 204 FALSE Estes'… 172 FALSE 1.4 upsl…
## 6 56 male atypica… 120 236 FALSE normal 178 FALSE 0.8 upsl…
## 7 62 female asympto… 140 268 FALSE Estes'… 160 FALSE 3.6 down…
## 8 57 female asympto… 120 354 FALSE normal 163 TRUE 0.6 upsl…
## 9 63 male asympto… 130 254 FALSE Estes'… 147 FALSE 1.4 flat
## 10 53 male asympto… 140 203 TRUE Estes'… 155 TRUE 3.1 down…
## # … with 293 more rows, 3 more variables: ca <dbl>, thal <chr>, num <dbl>, and
## # abbreviated variable name ¹trestbps
55.4 Prepare the Data for Modeling
Now we have a clean dataset we can use for EDA and modeling—great! Before we finish this exercise, let’s do some standard checks to understand these data:
55.4.1 q5 Perform your first checks on df_q4. Answer the questions below.
Hint: You may need to do some “deeper checks” to answer some of the questions below.
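The chunk producing the summary below is not displayed; it is most likely just a call to summary():
## NOTE: Sketch of the (hidden) first-checks chunk
df_q4 %>% summary()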
## age sex cp trestbps
## Min. :29.00 Length:303 Length:303 Min. : 94.0
## 1st Qu.:48.00 Class :character Class :character 1st Qu.:120.0
## Median :56.00 Mode :character Mode :character Median :130.0
## Mean :54.44 Mean :131.7
## 3rd Qu.:61.00 3rd Qu.:140.0
## Max. :77.00 Max. :200.0
##
## chol fbs restecg thalach
## Min. :126.0 Mode :logical Length:303 Min. : 71.0
## 1st Qu.:211.0 FALSE:258 Class :character 1st Qu.:133.5
## Median :241.0 TRUE :45 Mode :character Median :153.0
## Mean :246.7 Mean :149.6
## 3rd Qu.:275.0 3rd Qu.:166.0
## Max. :564.0 Max. :202.0
##
## exang oldpeak slope ca
## Mode :logical Min. :0.00 Length:303 Min. :0.0000
## FALSE:204 1st Qu.:0.00 Class :character 1st Qu.:0.0000
## TRUE :99 Median :0.80 Mode :character Median :0.0000
## Mean :1.04 Mean :0.6722
## 3rd Qu.:1.60 3rd Qu.:1.0000
## Max. :6.20 Max. :3.0000
## NA's :4
## thal num
## Length:303 Min. :0.0000
## Class :character 1st Qu.:0.0000
## Mode :character Median :0.0000
## Mean :0.9373
## 3rd Qu.:2.0000
## Max. :4.0000
##
Observations:
Variables:
- Numerical: age, trestbps, chol, thalach, oldpeak, ca, num
- Factors: sex, cp, restecg, slope, thal
- Logical: fbs, exang
Missingness:
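The chunk that counts missing values is not shown; a sketch (assuming a purrr map over the columns) that yields the same per-column NA counts:
## NOTE: Sketch of a per-column missingness count
df_q4 %>%
  map(~ sum(is.na(.)))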
## $age
## [1] 0
##
## $sex
## [1] 0
##
## $cp
## [1] 0
##
## $trestbps
## [1] 0
##
## $chol
## [1] 0
##
## $fbs
## [1] 0
##
## $restecg
## [1] 0
##
## $thalach
## [1] 0
##
## $exang
## [1] 0
##
## $oldpeak
## [1] 0
##
## $slope
## [1] 0
##
## $ca
## [1] 4
##
## $thal
## [1] 2
##
## $num
## [1] 0
From this, we can see that most variables have no missing values, but ca has 4 and thal has 2.
Missingness pattern:
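The chunk behind this display is not shown; a sketch that recovers the same six rows by filtering on the two columns we just found to have NA’s (the select() only brings ca and thal to the front, as in the printout):
## NOTE: Sketch of a missingness-pattern check
df_q4 %>%
  filter(is.na(ca) | is.na(thal)) %>%
  select(ca, thal, everything())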
## # A tibble: 6 × 14
## ca thal age sex cp trest…¹ chol fbs restecg thalach exang
## <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <lgl> <chr> <dbl> <lgl>
## 1 0 <NA> 53 fema… non-… 128 216 FALSE Estes'… 115 FALSE
## 2 NA normal 52 male non-… 138 223 FALSE normal 169 FALSE
## 3 NA reversible … 43 male asym… 132 247 TRUE Estes'… 143 TRUE
## 4 0 <NA> 52 male asym… 128 204 TRUE normal 156 TRUE
## 5 NA reversible … 58 male atyp… 125 220 FALSE normal 144 FALSE
## 6 NA normal 38 male non-… 138 175 FALSE normal 173 FALSE
## # … with 3 more variables: oldpeak <dbl>, slope <chr>, num <dbl>, and
## # abbreviated variable name ¹trestbps
There are six rows with missing values.
If we were just doing EDA, we could stop here. However, we’re going to use these data for modeling in a future exercise. Most models can’t deal with NA values, so we must choose how to handle rows with NA’s. In cases where only a few observations are missing values, we can simply filter out those rows.
55.4.2 q6 Filter out the rows with missing values.
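The solution chunk is not shown in this rendering; one way to produce the 297-row result below is to drop every row that contains an NA, for instance with tidyr’s drop_na():
## NOTE: One possible solution sketch for q6
df_q6 <-
  df_q4 %>%
  drop_na()
df_q6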
## # A tibble: 297 × 14
## age sex cp trest…¹ chol fbs restecg thalach exang oldpeak slope
## <dbl> <chr> <chr> <dbl> <dbl> <lgl> <chr> <dbl> <lgl> <dbl> <chr>
## 1 63 male typical… 145 233 TRUE Estes'… 150 FALSE 2.3 down…
## 2 67 male asympto… 160 286 FALSE Estes'… 108 TRUE 1.5 flat
## 3 67 male asympto… 120 229 FALSE Estes'… 129 TRUE 2.6 flat
## 4 37 male non-ang… 130 250 FALSE normal 187 FALSE 3.5 down…
## 5 41 female atypica… 130 204 FALSE Estes'… 172 FALSE 1.4 upsl…
## 6 56 male atypica… 120 236 FALSE normal 178 FALSE 0.8 upsl…
## 7 62 female asympto… 140 268 FALSE Estes'… 160 FALSE 3.6 down…
## 8 57 female asympto… 120 354 FALSE normal 163 TRUE 0.6 upsl…
## 9 63 male asympto… 130 254 FALSE Estes'… 147 FALSE 1.4 flat
## 10 53 male asympto… 140 203 TRUE Estes'… 155 TRUE 3.1 down…
## # … with 287 more rows, 3 more variables: ca <dbl>, thal <chr>, num <dbl>, and
## # abbreviated variable name ¹trestbps
Use the following to check your code.
## NOTE: No need to change this
assertthat::assert_that(
dim(
df_q6 %>%
filter(rowSums(across(everything(), is.na)) > 0)
)[1] == 0
)
## [1] TRUE
## [1] "Well done!"