Introduction to Jupyter Notebooks#

Original Authors: Enze Chen, Malcolm Davidson, Zach del Rosario

Purpose: We’re going to use programming tools throughout this workshop. There are many benefits to using programming to do data science, but to reap these benefits you will need to know some basics about coding in Python, and working in Jupyter notebooks.

Highest-level View#

Throughout this day of the workshop, we will be writing computer software. Before diving into the details, it is beneficial to understand what tools we’ll be using, and why.

Python#

Python is a programming language, and an extremely popular one at that. Python is praised for its readability, is consistently ranked as a top programming language, and is used by both top software companies and scientists doing computation. All of this is to say that knowing how to write Python code is a valuable skill.

Jupyter#

Jupyter is a programming environment, often used for writing and sharing Python code. Just as one could choose to write English on stone tablets or on paper, one could choose to write Python code in simple files or in jupyter notebooks. These notebooks are useful for a number of reasons:

  1. Multimedia: This document you’re reading now is a jupyter notebook – it is a combination of (executable) Python code and (formatted) plain text. By allowing the juxtaposition of different media (code and natural language, to name a couple options), one can provide human-readable context next to executable software.

  2. Saves state: Jupyter notebooks save state, allowing you to run some code, save the results, close the document and come back to it later. This is certainly possible in other coding environments, but jupyter has built-in facilities to make this easy.

  3. Reproducible: Reproducibility is a critical issue in science – results cannot be considered scientifically valid unless they can be reproduced under similar conditions. The same principles apply to computational experiments; all too often software is written under time pressure, resulting in scattered documentation and analysis. Consolidating documentation and analysis within a single notebook facilitates scientific reproducibility.

Jupyter notebooks are not always the best media for writing code; for instance, python packages are not written in jupyter notebooks. However, for our purposes, jupyter is a good solution.


How to use this Notebook#

This jupyter notebook is an interactive exercise – it contains some reading, but the focus is on fill-in-the-blank sections that are meant to test your understanding and spark further learning. Pedagogy research shows that so-called active learning – learning by doing – is far more effective than lecture alone. Thus, you will get the most out of this workshop if you do the work! Try – to the best of your ability – to complete the exercises in these notebooks, and please feel free to ask the instructors questions if you get stuck. (It’s what we’re here for.)

Run this notebook locally!

If you’re reading this on the workshop website, and not in Jupyter on your own machine, you need to download this notebook and open it in Jupyter. Look for the download bar in the top-right of this page, as pictured below.

The download button is in the top-right of the page

Exercise portions of the notebook will be marked with a “q” subheading, like the following:

q0: An example question#

Here’s an example of an exercise prompt. Follow the directions here to start the exercise.

Jupyter Introduction#

Cells#

A Jupyter notebook is organized as a sequence of different cells. Cells contain chunks of information and allow you to switch between text and code.

Current cell: The currently-selected cell will have a color bar appear on the far left. Use your cursor or arrow keys to select different cells.

Create new cell: To create a new cell, click the “+” icon at the top. This will create a cell below the currently selected cell.

Delete a cell: To delete the currently-selected cell, click the “scissors” icon at the top.

q1 Create a new cell#

Create a new cell below this one. Note that it has a different background color! This is because cells have different types.

Types of cells#

There are two types of cells we will discuss here: Markdown and Code. A single cell can only have one type.

Markdown: These are text cells that use the Markdown language for formatting. You can put instructions, headings, links, images, and much more; even \(\LaTeX\) (a language for typesetting mathematics). Here is a cheatsheet for Markdown syntax.

Code: These are cells where you can type Python code, just as you would in a .py file. They will have a label In [ ]: to their left, or just [] if you are using “jupyter lab”.

To switch the type of cell, you can use the dropdown menu at the top of the page. (Note: While you have this cell selected, the dropdown menu should read Markdown.)

Edit and format / execute cells#

To edit a Markdown cell, double-click within the cell if you currently see a blue bar. Then type and edit as normal.

To format (“execute”) a Markdown cell, press Ctrl+Enter when in “edit” mode. The cursor will disappear. Your formatting will also take effect.


To edit a Code cell, click within the cell. Then type and edit as normal.

To execute a Code cell, press Ctrl+Enter while it is selected. The cursor will disappear.

Cells can also be executed using the Run symbol (play button) at the top of the page.

Code cells#

So far we’ve only seen Markdown cells. Let’s see what we can do with Code cells! Try selecting the code cell below this one and executing it.

# This is a python code cell
6 * 7
42

Note that code cells are visually distinct from markdown cells – they have a grey background and the [ ]: prefix.

Python Introduction#

Simple operations#

We can use python like a calculator to carry out simple operations. Some of these include:

  • Simple math operators (+, -, *, /)

  • Other math operators; exponentiation 10 ** 2 == 100; modulus 3 % 2 == 1

  • Parentheses (, ) for grouping operations

  • Python functions; console output print("Hello")
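The operations listed above can be tried together in a single cell. The following is a minimal sketch; the variable names are illustrative and are not used elsewhere in the workshop.

```python
# Basic arithmetic with the operators listed above
total = 2 + 3 * 4        # multiplication is applied before addition: 14
grouped = (2 + 3) * 4    # parentheses change the order of operations: 20
quotient = 7 / 2         # division returns a float in Python 3: 3.5
squared = 10 ** 2        # exponentiation: 100
remainder = 3 % 2        # modulus (remainder after division): 1

# print() is a function; it writes its arguments to the console
print(total, grouped, quotient, squared, remainder)
```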

q2 Average some values#

Use basic python operations to average the following numbers: [2, 4, 6, 8].

###
# TASK: Average the numbers [2, 4, 6, 8]
# TODO: Use basic python operations to compute the arithmetic mean
###


(2 + 4 + 6 + 8) / 4
5.0

Variables#

Jupyter notebooks will save variables across different cells. We can assign values with the = operator. Furthermore, typing just the variable on the last line of a code cell will typically display the variable’s values/attributes.

a = 1
b = 2
c = a * b

q3 Display a variable#

Display the value of c in the cell below.

###
# TASK: Display a variable
# TODO: Type the variable name below and execute
###
c
2

A couple things happened here. First, the variable was saved. Second, Jupyter notebooks will display the value of any lone variables in the last line of each Code cell. This is helpful for debugging.

If you want to display multiple quantities from a single cell, you can always call the print() function, as in print(variable). For instance, we could write

print(a)
print(b)
print(c)

You can also see that Code cells with output will have an associated Out [#] right below. Notice what happens to the cell below when the code is running.

Lists#

A list is a Python data structure. We’ll use lists (and other iterable data structures) quite a bit when doing data analysis – generally we have many observations (elements) to consider.

Lists (similar to arrays in other languages) are created in Python with square brackets. The entries of a list are called elements.

l = [1, 2, 3]

The append() method adds a single element to the end of the list. Note how we can add elements of different types to the same list (unlike Java or C++).

l = [1, 2, 3]
l.append("a")
l
[1, 2, 3, 'a']

Note that I needed to place l at the end of the cell to display the result. append() modified the list in place – it makes the list longer.

We can also combine lists with the + operator.

[1, 2, 3] + [4, 5, 6]
[1, 2, 3, 4, 5, 6]

Note that I did not need to write an additional line to print the result. This is because the + operator returns a list as a result, which is printed by jupyter.
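To make the difference between the two approaches concrete, here is a small sketch. The list names first, second, and combined are hypothetical and chosen only for illustration.

```python
# Hypothetical lists for illustration
first = [1, 2, 3]
second = [4, 5, 6]

# The + operator builds a new list; first and second are left unchanged
combined = first + second
print(combined)   # [1, 2, 3, 4, 5, 6]
print(first)      # [1, 2, 3]

# append() changes the existing list in place
first.append(4)
print(first)      # [1, 2, 3, 4]
print(combined)   # still [1, 2, 3, 4, 5, 6]
```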

q4 Working with lists#

Test your understanding of lists; make a prediction about what the following code will display:

list_a = [1, 2, 3]
list_b = [4, 5, 6]
print(list_a + list_b)
print(list_a.append(list_b))

After you’ve made a prediction, make a new cell and test your prediction.

Indexing and Slicing#

We can access specific elements of a list (and other iterables) by indexing. Python uses zero-based indexing, meaning the first element of an iterable has index 0, the second has index 1, and so on.

Index    0    1    2    3    4
List  [ "a", "b", "c", "d", "e" ]

For instance:

l = ["a", "b", "c"]
print("l[0] = {}".format(l[0]))
print("l[1] = {}".format(l[1]))
print("l[2] = {}".format(l[2]))
l[0] = a
l[1] = b
l[2] = c

We can also use negative indices to conveniently access from the “right”.

print("l[-1] = {}".format(l[-1]))
print("l[-2] = {}".format(l[-2]))
l[-1] = c
l[-2] = b
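One way to think about negative indices is that index -k refers to the same element as index len(list) - k. The sketch below checks this for a hypothetical list named letters.

```python
# Hypothetical list for illustration
letters = ["a", "b", "c", "d", "e"]
count = len(letters)   # 5 elements: valid indices are 0..4 and -5..-1

for k in range(1, count + 1):
    # letters[-k] is the k-th element counted from the right
    assert letters[-k] == letters[count - k]

print(letters[0], letters[-count])      # both are the first element: a a
print(letters[count - 1], letters[-1])  # both are the last element: e e
```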

In Python, it’s easy to take subsets of lists and strings (i.e. substrings) using slicing. Given a list (string) named var, we can take a subset of elements (characters) ranging from low to high-1 with the syntax var[low:high]. It can be helpful to remember that the high value represents the first value that is not in the selected slice. Run the following code and observe what happens.

var = [0, 1, 2, 3, 4, 5]
# var = '012345'
print('The second element:', var[1])
print('The second and third elements:', var[1:3])
print('The second element onwards:', var[1:])
print('Everything before the second element:', var[:1])
print('Everything before the second TO LAST element:', var[:-2])
The second element: 1
The second and third elements: [1, 2]
The second element onwards: [1, 2, 3, 4, 5]
Everything before the second element: [0]
Everything before the second TO LAST element: [0, 1, 2, 3]
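The same slicing syntax applies to strings, where each slice is a substring. A minimal sketch, using a hypothetical variable named text:

```python
# Slicing a string returns a substring
text = "012345"

print(text[1])     # a single character: 1
print(text[1:3])   # characters at indices 1 and 2: 12
print(text[1:])    # index 1 onwards: 12345
print(text[:1])    # everything before index 1: 0
print(text[:-2])   # everything except the last two characters: 0123

# The length of text[low:high] is high - low (when both are in range)
print(len(text[1:3]))   # 2
```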

q5 Take the middle#

Using your understanding of python indexing & slicing, select the middle three elements of the following list.

###
# TASK: Take the middle values
# TODO: Use python slicing to take the middle three values
###
test = [-2, -1, 0, +1, +2]

# -- WRITE YOUR CODE BELOW -----
test[1:-1]
[-1, 0, 1]

Modules#

By default, python provides a relatively small set of functionality. To carry out more specialized operations, we can import a module. A module is a set of tools that help us carry out some tasks. For instance, the following module will help us control the execution of python code.

# Import a module first
from time import sleep

# Pause code execution for 5 seconds
sleep(5)

# Print some output to signal code has reached this point
print('Cell is finished running!')
Cell is finished running!

In the first line of the above Code cell, we imported a function sleep() from a module. A function takes some number of inputs (possibly zero), and returns some number of outputs (possibly zero, like sleep(), which returns nothing useful). The inputs to a function are called arguments, and are put inside the parentheses; sleep() takes one argument, the number of seconds to pause.

Python modules, even when installed on your machine, must be explicitly imported before they can be used. Some modules, like time, are bundled with Python (we might call these “standard modules”), while others must be installed separately from an external source (“third-party modules”) before importing.

In later parts of the workshop, we will use some highly-specialized modules to carry out materials informatics tasks.

If we don’t import a specific object from a module, then we need to explicitly reference the module in order to use its contents. For instance, we could write the above as the following.

import time
time.sleep(5)
print("Cell is finished running!")
Cell is finished running!

We’ll see this a lot later when we start working with the numpy and pandas modules, where we tend to use canonical aliases:

import numpy as np
import pandas as pd

# A one-dimensional array becomes a single column named "X"
X = np.array([1, 2, 3])
df = pd.DataFrame(
    data = X,
    columns = ["X"]
)
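The import styles described in this section can be compared side by side. The sketch below uses the statistics module, which is bundled with Python; the variable names and the alias stats are illustrative choices, not a convention from the workshop.

```python
# Style 1: import the module, then reference its contents through the module name
import statistics
mean_1 = statistics.mean([2, 4, 6, 8])

# Style 2: import one object directly from the module
from statistics import mean
mean_2 = mean([2, 4, 6, 8])

# Style 3: import the module under a shorter alias
import statistics as stats
mean_3 = stats.mean([2, 4, 6, 8])

# All three calls run the same function
print(mean_1, mean_2, mean_3)
```

Which style to use is largely a matter of readability: referencing the module name makes it clear where a function comes from, while importing an object directly saves typing.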

q6 Do the math#

Import the functions sin, cos and the constant pi from the math module, and use them to complete the following code.

###
# TASK: Compute some trigonometric operations
# TODO: Import sin, cos, pi from the math module
# TODO: Convert degrees to radians
# TODO: Use sin() and cos() to complete the code below
###

from math import sin, cos, pi
angle_degrees = 15

angle_radians = angle_degrees / 180 * pi
my_sine = sin(angle_radians)
my_cosine = cos(angle_radians)

result = my_sine**2 + my_cosine**2
# -- PRINT THE ANSWER ----
result
1.0

Conditionals and logic#

Conditionals and logic allow us to adapt a program to the data. Python conditional statements are if, elif, and else, and they follow a particular syntax:

if True:
    print("True")
else:
    print("Not printed....")
True

Note that we must indent all lines falling under a conditional; the convention is to use four spaces per level.

if True:
pass # This will throw an error!
###
if True:
    print("This will work!")

This is because python uses whitespace in the same way C/C++ uses braces, or MATLAB uses end statements. Proponents of python argue that the structure of the language enforces readability.

  • An if statement will execute its indented block only if its condition evaluates to True; we can use this to select particular actions.

  • An else statement must follow an if or elif statement; its block will execute if none of the other conditionals are triggered.

  • An elif statement is python's spelling of “else if”, and must follow an if or another elif statement. An elif is like an if, but is only checked if the preceding conditionals are not met.

Comparison operators are similar to those in other programming languages (<, <=, ==, >=, >).

Logic can be done using keywords not, and, and or.

# Uncomment and pick a number!
# n = ???
n = 0
if n > 0:
    print('Positive.')
elif not n >= 0:
    print('Negative.')
else:
    print('Zero.')
Zero.
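The keywords and, or, and not can also be combined within a single condition. The following sketch uses hypothetical variables temperature and raining to show one way this might look.

```python
# Hypothetical inputs for illustration
temperature = 25   # degrees Celsius
raining = False

if temperature > 20 and not raining:
    # Both parts must hold: warm, and not raining
    advice = "Go outside."
elif temperature < 0 or raining:
    # Either part is enough: freezing, or raining
    advice = "Stay in."
else:
    advice = "Bring a jacket."

print(advice)   # Go outside.
```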

Functions#

Functions are defined in Python using the def keyword. We specify the arguments by putting variable names in the parentheses (). We can specify a value to provide as output by using the return keyword, as demonstrated below:

def fcn_example(x, y=1):
    return x + y

As with conditional statements, we must end a function definition line with a colon :, and start the following lines indented:

# This will throw an error: Missing colon
def fcn_wrong1(x)
    return x + 3
# This will throw an error: Missing indentation
def fcn_wrong2(x):
return x + 4

We can then “call” (or “execute”) that function by typing the function name and including the function inputs in parentheses.

fcn_example(1)
fcn_example(1, y=2)

The inputs x and y are called arguments. Arguments without default values must be provided (such as x), while arguments with default values do not need to be specified when calling the function (such as y). An argument passed by name is called a keyword argument; providing one (fcn_example(1, y=2)) allows us to override the default value.
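As a rough illustration of required arguments versus arguments with default values, here is a sketch using a hypothetical function named scale (it is not part of the workshop code).

```python
def scale(x, factor=2):
    """Multiply x by factor (hypothetical helper for illustration)."""
    return x * factor

print(scale(3))                # factor falls back to its default: 6
print(scale(3, factor=10))     # default overridden by name: 30
print(scale(3, 10))            # the same override, passed by position: 30
print(scale(x=3, factor=10))   # every argument can be passed by name: 30
```

Calling scale() with no arguments at all would raise a TypeError, because x has no default value.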

q7 Define a function#

Define a function named fcn_q7 that takes a single input, adds 2, and returns the result.

## TASK: Define a function `fcn_q7` that adds `2` to the input


def fcn_q7(x):
    return x + 2
# No need to edit this; use this to check your work
assert \
    fcn_q7(2) == 4, \
    "Incorrect value"
"Correct value"
'Correct value'

Getting help#

If there is a python function that is mysterious to you, you can always consult the documentation. You can call the python built-in help() function on an object to display its documentation. Lists have a number of methods that we did not cover; for instance, the list.sort() method, whose documentation is shown below, is extremely useful for data science.

help(list.sort)
Help on method_descriptor:

sort(self, /, *, key=None, reverse=False)
    Sort the list in ascending order and return None.
    
    The sort is in-place (i.e. the list itself is modified) and stable (i.e. the
    order of two equal elements is maintained).
    
    If a key function is given, apply it once to each list item and sort them,
    ascending or descending, according to their function values.
    
    The reverse flag can be set to sort in descending order.

There is a caveat here, which is that help() will generally provide very technical details on the object in question, which is useful for reference, but often unhelpful when first learning. If you find a particular python object very mysterious, try to formulate your question and type it into Google. Learning how to find useful information about programming concepts is one of the key skills of learning to program well.
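The text that help() displays comes from a docstring, a string placed on the first line of a function body. The sketch below defines a hypothetical function add_two with a docstring, to show where that documentation comes from.

```python
def add_two(x):
    """Return x plus 2."""
    return x + 2

# help() prints the function signature followed by the docstring
help(add_two)

# The docstring is also stored on the function as the __doc__ attribute
print(add_two.__doc__)
```

Writing a short docstring for your own functions means that help() and the Jupyter documentation panel work for them as well.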

Jupyter help shortcut

Note that in Jupyter, there’s another way to access documentation on a function. If you move your cursor over a function, click there, and press Shift + Tab, this will open a documentation panel for easy reference!

q8 Consult the documentation#

Consult the documentation for list.sort() to reverse-sort the following list.

Hint: You will need to set a certain keyword argument to complete this task.

## TASK: Reverse-sort the following list
l_rev = [5, 3, 1, 4, 2]

l_rev.sort(reverse=True)
l_rev
[5, 4, 3, 2, 1]

Pressing Tab (without holding Shift) will perform tab completion. This is useful to both save yourself some typing, and to help you remember variable and function names while you’re working. If there are multiple variables or functions with similar names, you’ll see a menu of selections. You can use the arrow keys to select, and press Enter to make your selection.

If there are multiple options, tab completion will show you all possibilities.

Practice using tab completion below.

q9 Use tab completion#

Uncomment the lines below and use tab completion to complete the variable names.

Hint: Make sure your cursor is at the end of the line when you press Tab; otherwise you’ll indent the line!

## TASK: Uncomment the following code, 
#  use `Tab` (without Shift!) to complete
#  the variable name

l_rev

my_cosine
0.9659258262890683

End Notes#

The following notes are not required, but you may find these additional Jupyter tips helpful.

Save, close, quit, and re-open this notebook#

You can save by clicking the floppy-disk icon in the top-left of this notebook pane, and can close the notebook by clicking the “X” of its tab. Re-open by double-clicking on e-data00-notebooks-assignment.ipynb in the left-hand navigation pane.

You should find that everything is exactly as you had left it! You will have to re-run the cells to load variables, but having the output saved can be helpful, particularly since Jupyter saves figures.

Additional notebook tips#

  • Save your notebooks frequently! Use the “Save” icon at the top or the Ctrl+S/Cmd+S shortcut.

  • Use Shift+Enter to run a cell and move on to the next cell. This allows you to sequentially execute all the cells in your notebook.

  • Restart & Clear Output. This option is found under the Kernel menu. It is extremely helpful to reset all code cell executions, stale package imports, and variable values.

  • Take the time to explore other menu options and tinker around. There’s a ton of keyboard shortcuts and nifty customizations you can do with notebooks.

Jupyter magic functions#

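Magic functions are a set of commands specific to Jupyter notebooks (they are provided by IPython, the Python kernel behind the notebook) that are part of what makes notebooks so powerful. Magic commands that act on a single line start with a % sign, while those that act on a whole cell start with %%. We give examples of some line magics below.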

import numpy as np

# This command will output how long it takes the following inline command to run
%time np.mean(np.random.randint(low=1, high=7, size=1000))
CPU times: user 85 µs, sys: 35 µs, total: 120 µs
Wall time: 124 µs
3.487
# You'll see this command later; it allows Jupyter to render plots
# as output in the notebook instead of opening a new window.
%matplotlib inline
# This shows which variables are in your environment.
# Can be packages, functions, variables, and more!
%who
a	 angle_degrees	 angle_radians	 b	 c	 cos	 fcn_q7	 l	 l_rev	 
my_cosine	 my_sine	 n	 np	 pi	 result	 sin	 sleep	 test	 
time	 var	 
# You'll see this command later; it loads the autoreload extension, which
# (once enabled, e.g. with `%autoreload 2`) lets Jupyter pick up changes in
# referenced code without restarting the kernel.
%load_ext autoreload
# This command will list your system environment variables.
%env
{'LANG': 'en_US.UTF-8',
 'SHELL': '/usr/bin/zsh',
 'TERM': 'xterm-color',
 'CONDA_DEFAULT_ENV': 'evc',
 ...
 'MPLBACKEND': 'module://matplotlib_inline.backend_inline'}