Introduction to Python (optional)

Overview

This module introduces you to Python, making heavy use of this Software Carpentry lesson.

Preparation

Our primary computing context will be the UC Berkeley DataHub. This provides us with a common platform on which we are assured to get the same behavior as each other, allowing us to use the command line/shell/terminal and Python, including Jupyter notebooks. In particular getting a common command line environment would be hard otherwise.

If you haven’t already used the shell/terminal and Python on your laptop, we recommend you use the DataHub so that we don’t have to troubleshoot problems on multiple laptops.

If you do use your laptop, you’ll need Python installed, for which we recommend using Miniforge. Another option is the Anaconda distribution of Python](https://www.anaconda.com/distribution)

For the shell/terminal parts of this material, you can use the Terminal on your Mac, but not on Windows.

If you do use your own laptop, you’ll need to clone this GitHub repository. If you don’t know how to do that, just download this zip file and unzip it into your home directory on your computer.

Ways of running Python

There are various options for how to save our code and in what context to run it.

For our purposes here, we’ll focus on having the code either in a Jupyter notebook or in a script (a code file in a simple text format).

Suppose we want to run this code (already seen in the previous session):

Code
import numpy as np
import time

def run_linalg(n):
    z = np.random.normal(0, 1, size=(n, n))
    print(time.time())
    x = z.T.dot(z)   # x = z'z
    print(time.time())
    U = np.linalg.cholesky(x)  # factorize as x = U'U
    print(time.time())

The file calc.py contains a variation on the code shown above and it can be run in the shell as

python calc.py

Or we can run the code in a notebook using calc.py as a module that provides the function of interest:

Code
import calc
calc.run_linalg(4000)
1724285648.2511733
1724285648.784521
1724285649.2074902

Having code in a script file will be convenient when creating packages, using Git/version control (but see nbdime if using notebooks), and running code in a batch mode (i.e., in the background rather than interactively).

Ways to use Python

One can also run Python or (IPython for some enhanced capabilities, including easier copy/pasting) from the terminal.

Finally running Python within VS Code or an IDE such as PyCharm is also common.

For various reasons, for most of the demos, I will run code in an IPython session in the terminal.

Basic use of Python

We’ll use this Software Carpentry lesson for most of this material.

1. Python Fundamentals

We’ll work through the Python Fundamentals section of the Software Carpentry workshop. The primary ideas covered are variables, data types, and basic function usage.

Exercise 1

Using the section on “Built-in Types” from the official “The Python Standard Library” reference, figure out how to compute:

  • \(\sqrt{-1}\)
  • \(\exp(1.5)\)
  • \((\lceil \frac{3}{4} \rceil \times 4)^3\).
Exercise 2
  • Is 1.0 of the integer type? What about 1? Is the result of 5/3 - 2/3 of the integer type? Is the mathematical value seen in Python an integer?

  • Here’s a numerical puzzle. Why does the last computation not work, when the others do? And, for those of you coming from R, which of these computations don’t work in R?

    Code
    100000**10
    100000.0**10
    100000**100
    100000.0**100

2. Analyzing Patient Data

We’ll work through the Analyzing Patient Data section of the Software Carpentry workshop. The primary ideas covered are libraries (aka packages), numpy arrays, and slicing.

Exercise 1
Exercise 2

Note that ? and ?? only work in IPython (or a Jupyter notebook). For help in plain Python, use help(np.ndim).

  • What happens if you type np.ndim?? (i.e., use two question marks)? What additional do you see compared to np.ndim??

  • What does np.ndim() do? How does it execute under the hood? Consider why the following uses of ndim both work.

    Code
    a = np.array([0, 1, 2])
    a.ndim
    np.ndim(a)

    Now explain why only one of these works.

    Code
    a = [0, 1, 2]
    a.ndim
    np.ndim(a)
  • Type np.array? in a Notebook or at the IPython prompt. Briefly skim the docstring. nparray allows you to construct numpy arrays.

  • Type np. followed by the <Tab> key in a Notebook or at the IPython prompt. Choose two or three of the completions and use ? to view their docstrings. In particular, pay attention to the examples provided near the end of the docstring and see whether you can figure out how you might use this functionality.

3. Visualizing Tabular Data

We’ll work through the Visualizing Tabular Data section of the Software Carpentry workshop. The primary ideas covered are making basic plots using the matplotlib package.

Sidenote: For the plots to show in a Jupyter Notebook, it seems that we need all the code for creating a given plot to be in a single cell.

Exercise

Using the following code, read in the GapMinder data using Pandas (to be discussed later) and run the following code to make the variables easily available (you don’t need to know anything about Pandas).

Do this in a Jupyter notebook so that it’s easy to see the plots.

Code
import numpy as np
import pandas as pd
dat = pd.read_csv('gapminder.csv')
lifeExp = np.array(dat.lifeExp)
gdpPercap = np.array(dat.gdpPercap)
year = np.array(dat.year)

## Hint: slicing using an array of booleans
## gdpPercap[year > 2010]
  • Make a scatterplot of lifeExp vs gdpPerCap for 2007.
  • Consider whether plotting income on a logarithmic axis is a better way to display the data.
  • Using at least two years, make an array of plots (in one figure) where each subplot is a different year.
  • Add nice axis labels and titles.

4. Storing Multiple Values in Lists

We’ll work through the Storing Multiple Values in Lists section of the Software Carpentry workshop. The primary ideas covered are creating, extracting from, and manipulating lists.

Exercise 1

Create a list of numbers, called x1. Reverse the order of the items in the list using slicing. Now reverse the order of the items using a list method. How does using the method differ from slicing? Hint: you can type x. followed by the <Tab> key in a Notebook or at the IPython prompt to find the various methods that can be applied to a list.

Exercise 2
  • Figure out some different ways of combining your list of numbers with a list of strings to create a single list of mixed type elements. (Hint: you can type x. followed by the <Tab> key in a Notebook or at the IPython prompt to find the various methods that can be applied to a list.)
  • Now try to sort the resulting list. What happens?
Exercise 3

Answer the question related to Overloading in the Software Carpentry section.

Exercise 4

What does the following tell you about copying and use of memory in lists in Python?

Code
a = [1, 3, 5]
b = a
id(a)
id(b)
# this should confirm what you might suspect
a[1] = 5

5. Repeating Actions with Loops

We’ll work through the Repeating Actions with Loops section of the Software Carpentry workshop. The primary ideas covered relate to using for loops to automate operations.

In addition to the explicit for looping shown in the unit, Python also has a construct called list comprehension to do something similar. Note that some people really dislike the list comprehension syntax as being hard to read.

Code
vals = [0, 5, 7]
vals_times2 = [val*2 for val in vals]
vals_times2
[0, 10, 14]
Exercises
  • See what [1, 2, 3] + 3 returns. Try to explain what happened and why.
  • How would you do the same task using a for loop? The range function may be helpful as might the enumerate function.
  • Use list comprehension to perform the same element-wise addition of the scalar to the list of scalars.
  • Change [1, 2, 3] to be a numpy array and then add three using + 3.

7. Making Choices

We’ll work through the Making Choices section of the Software Carpentry workshop. The primary ideas covered are using conditionals (if-then-else statements) and boolean (logical) operations.

Exercise 1

Answer this question from the Software Carpentry section.

Exercise 2

Answer this question from the Software Carpentry section.

8. Creating Functions

We’ll work through the Creating Functions section of the Software Carpentry workshop. The primary ideas covered are defining, using, testing, and debugging functions.

We’ll skip over testing and documentation as we’ll cover those in depth later this week (and definitely NOT because they are unimportant).

Exercises

Exercise 1

Define a function called sqrt that will take the square root of a number and will (if requested by the user) set the square root of a negative number to 0.

Exercise 2
  • What happens if you modify a list within a function in Python; why do you think this is?
  • What happens if you modify a single number (scalar) within a function in Python; why do you think this is?
Exercise 3

Answer the question related to local vs. global variables in the Software Carpentry section.

9. Errors and Exceptions

We’ll work through the Errors and Exceptions section of the Software Carpentry workshop. The primary ideas covered relate to understanding error messages and tracebacks. In the Computational Tools and Practices session, we’ll talk about writing your own code that handles errors.

Exercise 1

Answer this question from the Software Carpentry section.

Exercise 2

Answer these questions from Sofware Carpentry:

EXTRA Part 1. Some other useful data structures

Tuples

Tuples are immutable sequences of (zero or more) objects. Functions in Python often return tuples.

Code
x = 1; y = 'foo'

xy = (x, y)
type(xy)
xy = x,y
type(xy)

xy
xy[1]

xy[1] = 3   immutable!
  Cell In[3], line 11
    xy[1] = 3   immutable!
                ^
SyntaxError: invalid syntax
Exercise 1
  • What’s weird about this? What are the types involved?

    Code
    z = x,y
    a,b = x,y
  • Create the following: x=5 and y=99. Now swap their values using a single line of code. (For R users, how would you do this in R?)

Exercise 2

What happens when you multiply a tuple by a number?

Exercise 3
  • Why do you think there is no reverse method for tuples?
  • What’s nice about using immutable objects in your code?

Dictionaries

Dictionaries are mutable, unordered collections of key-value pairs. They’re a very important data structure in Python so I’m surprised they weren’t discussed in the Software Carpentry sections.

Code
students = {"Francis Bacon": ['A', 'B+', 'A-'], 
            "Marie Curie": ['A-', 'A-'], 
            "Aristotle": 'and now for something completely different'}
students
students.keys()
students.values()
students["Marie Curie"]
students["Marie Curie"][1]
'A-'
Code
students.
students.clear(       students.items(       students.setdefault(
students.copy(        students.keys(        students.update(
students.fromkeys(    students.pop(         students.values(
students.get(         students.popitem()    
Exercise 4

How can you add an item to a dictionary?

Exercise 5

How do you combine two dictionaries into a single dictionary?

Exercise 6 (for R users)
  • What are some analogs to dictionaries in R?
  • How are dictionaries different from such analogous structures in R?

EXTRA Part 2: A bit on numpy, scipy and pandas

Numpy and scipy

Standard lists in Python are not amenable to mathematical manipulation, unlike standard vectors in R. Instead we generally work with numpy arrays. These arrays can be of various dimensions (i.e., vectors, matrices, multi-dimensional arrays).

Code
import numpy as np
z = [0, 1, 2] 

y = np.array(z)
y*3

y.dtype


x = np.array([[1, 2], [3, 4]], dtype=np.float64)
x*x
x.dot(x)
x.T

np.linalg.svd(x)

e = np.linalg.eig(x)

e[0]  # first eigenvalue (not the largest in this case...)
e[1][:, 0] # corresponding eigenvector
array([-0.82456484,  0.56576746])

All of the elements of the array must be of the same type.

There are a variety of numpy functions that allow us to do standard mathematical/statistical manipulations.

Here we’ll use some of those functions in addition to some syntax for subsetting and vectorized calculations.

Code
np.linspace(0, 1, 5)

np.random.seed(0)
x = np.random.normal(size=10)

pos = x > 0

y = x[pos]

x[[1, 3, 4]]

x[pos] = 0

np.cos(x)
array([1.        , 1.        , 1.        , 1.        , 1.        ,
       0.55928119, 1.        , 0.98856735, 0.99467766, 1.        ])

scipy has even more numerical routines, including working with distributions and additional linear algebra.

Code
import scipy.stats as st
st.norm.cdf(1.96, 0, 1)
st.norm.cdf(1.96, 0.5, 2)
st.norm(0.5, 2).cdf(1.96)

Pandas

Pandas provides a Python implementation of R’s dataframe capabilities. Let’s see some example code.

Code
import pandas as pd
dat = pd.read_csv('gapminder.csv')
dat.head()

dat.columns
dat['year']
dat.year
dat[0:5]

dat.sort_values(['year', 'country'])

dat.loc[0:5, ['year', 'country']]  # R-style indexing

dat[dat.year == 1952]

ndat = dat[['pop','lifeExp','gdpPercap']]
ndat.apply(lambda col: col.max() - col.min())
pop          1.318623e+09
lifeExp      5.900400e+01
gdpPercap    1.132820e+05
dtype: float64

Now let’s see the sort of split-apply-combine functionality that is popular in dplyr and related R packages.

Code
dat2007 = dat[dat.year == 2007].copy()  

dat2007.groupby('continent', as_index=False)[['lifeExp','gdpPercap']].mean()

def stdize(vals):
    return((vals - vals.mean()) / vals.std())

dat2007['lifeExpZ'] = dat2007.groupby('continent')['lifeExp'].transform(stdize)