21 Reproducibility: Structuring Directories

Purpose: A key part of being successful in data science is getting organized. In this exercise, you’ll set yourself up for future success by learning how to structure a project in terms of its folder hierarchy: its directories. You’ll learn how to isolate your projects, structure your subdirectories, and protect your raw data.

Reading: Project Structure (Skim: Provided so you can see an example of a well-structured project directory)

21.1 Reading: Structuring a Project

In this exercise, we’ll assume that you are working on a project.

21.2 Isolate your project

While there are many ways to organize a given project, you should definitely organize your work into individual projects. Each project should have its own directory, and should generally have all its related files under that directory.

For instance, data-science-curriculum is in my mind an entire project, containing both exercises and challenges. Thus for me, it makes sense to organize all of these subdirectories under data-science-curriculum.

21.3 Structure your subdirectories

Subdirectories are folders (directories!) under the main (root) project folder. For instance, the data-science-curriculum has subdirectories exercises, challenges, and more.

Each subdirectory should have some purpose. For instance, there are both exercises and exercises_sequenced directories. I build the exercises in exercises out-of-sequence, because I don’t want to have to decide on which specific day each exercise should occur before I start building it. However, it’s useful for you the student to be able to see all the exercises in daily order, so I have an additional directory where I can place the sequenced exercises.

knitr will automatically create some sensible subdirectories for report files; for instance, when you knit a document, knitr will create a filename_files directory that holds the image files. This helps prevent clutter in your directory, and keeps those files adjacent to the source file in your (sub)directory.

Your project can have deeper levels of hierarchy: subdirectories under subdirectories. For instance, the exercises folder has a subdirectory exercises/data; this is where exercise datasets live. In your own projects, you should keep your data unaltered in some kind of data directory.
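
As a rough illustration of the ideas above, here is a minimal sketch of creating a project folder with purpose-specific subdirectories from the command line. The project name my-project and the subdirectory names data, analysis, and outputs are assumptions chosen for illustration; pick names that fit your own work.

```shell
# Hypothetical project name; replace with your own
project_root="my-project"

# Create the project folder and its subdirectories in one step;
# -p creates missing parent directories and does not fail if they exist
mkdir -p "$project_root/data" "$project_root/analysis" "$project_root/outputs"

# List the resulting directories to check the layout
find "$project_root" -type d | sort
```

Because mkdir -p is safe to re-run, you can add more subdirectories to the same command later without disturbing what is already there.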

21.4 Protect your data!

Never make alterations to your raw data! Even if your data have errors, it’s far better to document those errors somewhere, so that you have a paper trail for what changed and why. Keep your raw data unedited, and write any processed data to an additional file.

If your data needs preprocessing (say, to fix errors or simply to wrangle data), write a preprocessing R Markdown file that takes in the raw data and writes out a processed version of the data. You can then load the processed data in later scripts, and all of your processing steps will be permanently documented.
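
The raw-in, processed-out pattern can be sketched at the command line as well. This is only a stand-in for the R Markdown preprocessing file recommended above, not a replacement for it: the file names raw.csv and processed.csv, the sample contents, and the temporary demo directory are all assumptions made for illustration. Marking the raw file read-only with chmod is one optional extra safeguard against accidental edits.

```shell
# Work in a throwaway directory (hypothetical demo location)
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/data"

# Hypothetical raw data with a formatting error: stray spaces after commas
printf 'id,value\n1, 10\n2, 20\n' > "$demo_dir/data/raw.csv"

# Optional safeguard: remove write permission from the raw file
chmod a-w "$demo_dir/data/raw.csv"

# Read the raw file, fix the error, and write the result to a NEW file;
# the raw file itself is never modified
sed 's/, */,/g' "$demo_dir/data/raw.csv" > "$demo_dir/data/processed.csv"
```

The fix lives in a command you can re-run and inspect, so the record of what changed is the script itself rather than a hand edit nobody can trace.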

21.5 Practice: Set up a project

Let’s set you up for future success by creating a directory for your Project 1.

21.5.1 q1 Create a project directory in your personal data science repo called p01-name.

If you already know what you’re going to work on, feel free to replace name with something sensible. If you don’t know what to call it yet, don’t worry; you can change it later!

21.5.2 q2 Create a data directory under p01-name; you should then have p01-name/data.

This is where you will put any data files you read or write in the project.

There’s a trick to committing an empty folder with Git. We’ll need to introduce some kind of file to preserve an empty directory structure.

21.5.3 q3 Create an empty file called .gitignore under your data folder.

You can do this from the root of your data science repo by using the following from your command line:

# Move to your data science directory
$ cd ~/path/to/your/data-science-work # Replace with your real directory!
# Add the special .gitignore file
$ touch p01-name/data/.gitignore

A .gitignore file is a useful tool: it tells Git which files to ignore when tracking changes. We’re using it here for a different purpose. Git does not track empty directories, so committing that .gitignore is what lets us commit our directory structure.

$ git add p01-name/data/.gitignore
$ git commit -m "add p01 directory structure"

Commit and push your directory structure. You can fill in the data directory once you start Project 1.
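
If you’d like to see why the placeholder file is needed, the following minimal sketch reproduces the situation in a throwaway repository. The temporary directory is an assumption made so that nothing in your real repo is touched; only the p01-name/data path comes from the exercise.

```shell
# Create a throwaway repository (hypothetical demo location)
repo_dir="$(mktemp -d)"
git init -q "$repo_dir"

# An empty directory is invisible to Git
mkdir -p "$repo_dir/p01-name/data"
git -C "$repo_dir" status --porcelain   # prints nothing

# A placeholder file gives Git something to track
touch "$repo_dir/p01-name/data/.gitignore"
git -C "$repo_dir" add p01-name/data/.gitignore
git -C "$repo_dir" ls-files             # lists p01-name/data/.gitignore
```

The -C option tells git to run as if it had been started in the given directory, so the sketch never changes your current working directory.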

Key Takeaways:

  • Isolate your project by making a folder for that specific project.
  • Structure your project by creating directories for different filetypes: data, analysis code, and outputs.
  • Protect your raw data by keeping an unaltered copy; never make irreversible edits to the raw data.