21 Reproducibility: Structuring Directories
Purpose: A key part of being successful in data science is getting organized. In this exercise, you’ll set yourself up for future success by learning how to structure a project in terms of its folder hierarchy: its directories. You’ll learn how to isolate your projects, structure your subdirectories, and protect your raw data.
Reading: Project Structure (Skim: Provided so you can see an example of a well-structured project directory)
21.1 Reading: Structuring a Project
In this exercise, we’ll assume that you are working on a project.
21.2 Isolate your project
While there are many ways to organize a given project, you should definitely organize your work into individual projects. Each project should have its own directory, and should generally have all its related files under that directory.
For instance, data-science-curriculum is in my mind an entire project, containing both exercises and challenges. Thus for me, it makes sense to organize all of these subdirectories under data-science-curriculum.
21.3 Structure your subdirectories
Subdirectories are folders (directories!) under the main (root) project folder. For instance, the data-science-curriculum has subdirectories exercises, challenges, and more.
Each subdirectory should have some purpose. For instance, there are both exercises and exercises_sequenced directories. I build the exercises in exercises out of sequence, because I don’t want to have to decide on which specific day each exercise should occur before I start building it. However, it’s useful for you, the student, to be able to see all the exercises in daily order, so I have an additional directory where I can place the sequenced exercises.
Knitr will automatically create some sensible subdirectories for report files; for instance, when you knit a document, knitr will automatically create a filename_files directory with image files. This helps prevent clutter in your directory, and places the image files adjacent to the source file in your (sub)directory.
Your project can have deeper levels of hierarchy: subdirectories under subdirectories. For instance, the exercises folder has a subdirectory exercises/data; this is where exercise datasets live. In your own projects, you should keep your data unaltered in some kind of data directory.
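As a minimal sketch of the layout described above, the following shell commands create a project root with a few purpose-specific subdirectories. The names my-project, analysis, and outputs are hypothetical placeholders, not directories from data-science-curriculum; substitute whatever fits your own project.

```shell
# Hypothetical project name; replace with your own
project="my-project"

# -p creates parent directories as needed and does not fail if they exist
mkdir -p "$project/data"      # raw data lives here, unaltered
mkdir -p "$project/analysis"  # analysis code and reports (assumed name)
mkdir -p "$project/outputs"   # generated files (assumed name)

# List the subdirectories that were created
ls "$project"
```

Because mkdir -p is safe to rerun, you can keep these commands in a setup script and add new subdirectories as the project grows.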
21.4 Protect your data!
Never make alterations to your raw data! Even if your data have errors, it’s far better to document those errors somewhere, so that you have a paper trail for what changed and why. Keep your raw data unedited, and write processed data to a separate file.
If your data need preprocessing (say, to fix errors or simply to wrangle the data), write a preprocessing R Markdown file that reads in the raw data and writes out a processed version. You can then load the processed data in later scripts, and all of your processing steps will be permanently documented.
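One way to back up this advice is to make raw files read-only at the filesystem level and to send every processing step to a new file. The sketch below uses shell tools rather than the R Markdown preprocessing file recommended above, purely to illustrate the read-raw, write-processed pattern. The paths, the toy CSV contents, and the rule of dropping rows whose second column is NA are all assumptions made for illustration.

```shell
# Hypothetical layout: raw and processed data in separate subdirectories
raw_dir="my-project/data/raw"
processed_dir="my-project/data/processed"
mkdir -p "$raw_dir" "$processed_dir"

# Toy raw file, invented for this sketch (only written if it is missing)
raw_file="$raw_dir/measurements.csv"
[ -f "$raw_file" ] || printf 'id,value\n1,10\n2,NA\n' > "$raw_file"

# Remove write permission so accidental edits to the raw file fail
chmod a-w "$raw_file"

# Read the raw file and write a cleaned copy; the raw file is never modified.
# Here "cleaning" just means dropping rows whose second column is NA.
awk -F, '$2 != "NA"' "$raw_file" > "$processed_dir/measurements_clean.csv"
```

Later scripts would then load the file in the processed directory rather than the raw one. Read-only permissions guard against accidents only; they can be reverted with chmod, so they complement the documented preprocessing step and do not replace it.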
21.5 Practice: Set up a project
Let’s set you up for future success by creating a directory for your Project 1.
21.5.1 q1 Create a project directory in your personal data science repo called p01-name. If you already know what you’re going to work on, feel free to replace name with something sensible. If you don’t know what to call it yet, don’t worry; you can change it later!
21.5.2 q2 Create a data directory under p01-name; you should then have p01-name/data. This is where you will put any data files you read or write in the project.
There’s a trick to committing an empty folder with Git. We’ll need to introduce some kind of file to preserve an empty directory structure.
21.5.3 q3 Create an empty file called .gitignore under your data folder. You can do this from the root of your data science repo by using the following from your command line:
# Move to your data science directory
$ cd ~/path/to/your/data-science-work # Replace with your real directory!
# Add the special .gitignore file
$ touch p01-name/data/.gitignore
A .gitignore file is a useful tool; it tells Git what kinds of files to ignore when it comes to tracking files. We’re using it here for a different purpose: Git tracks files rather than directories, so committing that .gitignore is what lets us commit our otherwise-empty directory structure.
Commit and push your directory structure. You can fill this in once you start Project 1.
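The commit step might look like the sketch below. So that it can run anywhere, it first builds a throwaway repository in a temporary directory, which is an assumption of this example; in your own repo you would skip that setup and run only the add, commit, and push commands. The commit message and the user.name and user.email values are placeholders.

```shell
# Throwaway repository for illustration only; skip this in your real repo
scratch="$(mktemp -d)"
cd "$scratch"
git init -q

# An empty directory alone is invisible to Git: nothing is listed here
mkdir -p p01-name/data
git status --porcelain

# Add the placeholder file, then stage and commit it
touch p01-name/data/.gitignore
git add p01-name/data/.gitignore
git -c user.name="Your Name" -c user.email="you@example.com" \
  commit -q -m "Add Project 1 directory structure"

# Show the files Git now tracks
git ls-files

# In your real repo, which has a remote configured, finish with:
# git push
```

The first git status call prints nothing because the directory holds no files yet; once the placeholder file exists, Git has something to track and the directory path travels with it.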
Key Takeaways:
- Isolate your project by making a folder for that specific project.
- Structure your project by creating directories for different filetypes: data, analysis code, and outputs.
- Protect your raw data: do not make irreversible edits to it, and write processed data to a separate file instead.