This tutorial assumes you have a bit of familiarity with some high-level language such as R or MATLAB. Some of the exercises ask you to compare the Python functionality to functionality in R. Feel free to ignore these if you’re not familiar with R.

You will also need Python installed on your computer (IPython will also be helpful), as well as a few core additional packages, including numpy, scipy, matplotlib, and pandas. (The re package, which we also use, is part of the standard library and needs no installation.)

We recommend using the Anaconda distribution of Python, but you can see various options in the short course overview.

Packages can generally be installed via pip, or via conda if you have the Anaconda or Miniconda installation of Python. The following shows how to do this from the command line (this works on macOS and Linux; we have not verified it on Windows).

conda list
conda install numpy

## using mamba (a fast drop-in replacement for `conda`)
mamba list
mamba install numpy

## using pip
pip install numpy
# or to install within your home directory if you do not have admin control of the computer
pip install --user numpy

For additional help with installation, please see this SCF documentation or this Statistics 243 documentation.


Useful written references and tutorials:

While working through this tutorial, you should type the example code snippets at an interactive Python terminal. You may wish to use either the IPython shell (which has some additional functionality relative to a plain Python session) or a Jupyter notebook. To start an IPython shell, type the following at a bash prompt:

ipython


To start a Jupyter notebook locally on your computer, type this at the command line and a notebook should open in your browser.

jupyter notebook

Alternatively you can access Jupyter notebooks through a service called JupyterHub, in particular the campus DataHub or SCF JupyterHub.

Side note: by default, a Jupyter notebook cell displays only the value of the last expression in the cell. To display the value of every object whose name you type on its own line (not just the last result), you can run this in a cell in your notebook.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Python 2 vs. 3

Until a few years ago, many people still used Python 2 even though Python 3 was available. It’s possible you’ll run across Python 2 code or the occasional person still using Python 2.

Python 3 is the current version of Python (Python 3.11 at the time of writing), and it is in some ways incompatible with Python 2. You should be using Python 3, though most of the code here will also work in Python 2.
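Two of the most visible differences are that print is a function in Python 3 and that / always performs true division. The following is a small sketch for checking which behavior you have; the variable names are ours, chosen only for illustration.

```python
import sys

# Which major version of Python is running this code?
major_version = sys.version_info.major

# In Python 3, / is true division and // is floor division.
true_quotient = 7 / 2
floor_quotient = 7 // 2

# print is a function in Python 3 (it was a statement in Python 2).
print(major_version, true_quotient, floor_quotient)
```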


Formatting Python code

Unlike most languages, in Python indentation determines code blocks, including functions, loops, and if-else statements.

The standard is 4 spaces (some people use a tab instead), but you can use other spacing if it’s consistent within a block of code.

a = 3
 a = 3  # this will cause an IndentationError in Python itself, but not IPython/Jupyter

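if a >= 4:
    print('a is big')
    if a == 4:
        print('a is 4')
else:
    print('a is small')

# indentation may differ between blocks, as long as it is consistent within each block
if a >= 4:
  print('a is big')
  if a == 4:
        print('a is 4')
  else:
        print('a is not 4')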


Everything is an object in Python. Roughly, this means that it can be tagged with a variable (i.e., given a name) and passed as an argument to a function. Objects have attributes, which include fields and methods.

Certain objects in Python are mutable (e.g., lists, dictionaries, sets), while other objects are immutable (e.g., tuples, strings, frozensets). Mutable means one can change parts of the object.

Many objects can be composite (e.g., a list of dictionaries or a dictionary of lists, tuples, and strings).

myList = [1, 2, 'foo']
myList[1] = 2.5

myTuple = (1, 2, 'foo')
try:
    myTuple[1] = 2.5   # tuples are immutable, so this raises an error
except Exception as error:
    print(error)
'tuple' object does not support item assignment
myTuple
(1, 2, 'foo')


As in R and other interpreted languages, variables are not their values in Python (think “I am not my name, I am the person named XXX”). You can think of variables as tags on objects. In particular, variables can be bound to an object of one type and then reassigned to an object of another type without error.

a = 'foobar'
a * 4
len(a)

a = 3
a * 4
len(a)   # the name a is now bound to an int, which has no length
TypeError: object of type 'int' has no len()

Modules, files, packages, import

While you will often explore things from an interactive Python prompt, you will save your code in files for reuse as well as to document what you’ve done. You can use Python code saved in a plain text file (i.e., a module) from a Python prompt or other files by importing it. Typically, this is done at the top of a file (if you are working at a prompt, you just need to import it before you want to use the functionality).

cat  # special IPython functionality to call the operating system
del(a); del(hello)

import mymod



Importing everything from a module with a wildcard, as in the following line, is convenient, but often not seen as good practice, as it reduces modularity and can interfere with already-existing objects.

from mymod import *


As in R, you can also load in additional supporting packages for extra functionality. Here are some examples of importing Python packages:

from math import cos


import math

import numpy as np
import scipy as sp
import matplotlib.pyplot as plt

When we import packages, they are in different namespaces, which helps to avoid problems with different packages using the same names for different functions or objects.
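As a small illustration of namespaces, the built-in pow and the pow function in the math package share a name but do not collide, because the latter is only reachable through the math namespace. The variable names here are ours.

```python
import math

# The built-in pow and math.pow share a name but live in different namespaces.
builtin_result = pow(2, 3)      # built-in: integer arguments give an int
math_result = math.pow(2, 3)    # math package: always returns a float

print(builtin_result, type(builtin_result))
print(math_result, type(math_result))
```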


Adopting standard coding conventions is good practice.

The official “Style Guide for Python Code” is usually referred to as PEP8 (PEP is an acronym for Python Enhancement Proposal). A couple of tools can help you conform to the standard. The pep8 package (since renamed pycodestyle) provides a command-line tool to check your code against some of the PEP8 standard conventions. Similarly, autopep8 provides a tool to automatically format your code so that it conforms to the PEP8 standards.

Documentation and getting help

Getting help pulls up the relevant docstring (see here for some guidance on writing docstrings, in particular for NumPy). Let’s briefly see how you might benefit from docstrings in practice.

In [1]: import numpy as np

In [2]: np.ndim?
Signature: np.ndim(a)
Docstring:
Return the number of dimensions of an array.

Parameters
----------
a : array_like
    Input array.  If it is not already an ndarray, a conversion is
    attempted.

Returns
-------
number_of_dimensions : int
    The number of dimensions in `a`.  Scalars are zero-dimensional.

See Also
--------
ndarray.ndim : equivalent method
shape : dimensions of array
ndarray.shape : dimensions of array

Examples
--------
>>> np.ndim([[1,2,3],[4,5,6]])
2
>>> np.ndim(np.array([[1,2,3],[4,5,6]]))
2
>>> np.ndim(1)
0
File:      /usr/local/linux/mambaforge-3.11/lib/python3.11/site-packages/numpy/core/
Type:      function

Docstrings are an important part of Python. A docstring is a character string that occurs as the first statement in a module, function, class, or method definition. Such a docstring becomes the __doc__ special attribute of that object. All modules should normally have docstrings, and all functions and classes exported by a module should also have docstrings.
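Here is a minimal sketch of writing a docstring and reading it back. The function rescale is hypothetical and exists only to illustrate the mechanism.

```python
def rescale(values, factor=2):
    """Return a new list with each element of `values` multiplied by `factor`."""
    return [value * factor for value in values]

# The docstring is stored in the function's __doc__ attribute,
# which is also the text that help(rescale) displays.
print(rescale.__doc__)
```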

We can see the docstring, as well as the actual code of the function, directly in the file indicated above.


Note that ? and ?? only work in IPython (or a Jupyter notebook). For help in plain Python, use help(np.ndim).

  • What happens if you type np.ndim?? (i.e., use two question marks)?

  • What does np.ndim() do? How does it execute under the hood? Consider why the following uses of ndim both work.

    a = np.array([0, 1, 2])
    a.ndim
    np.ndim(a)

    Now explain why only one of these works.

    a = [0, 1, 2]
    a.ndim
    np.ndim(a)

  • Type np.array? at an IPython prompt. Briefly skim the docstring. np.array allows you to construct numpy arrays.

  • Type np. followed by the <Tab> key at an IPython prompt. Choose two or three of the completions and use ? to view their docstrings. In particular, pay attention to the examples provided near the end of the docstring and see whether you can figure out how you might use this functionality.

Decoding error messages

Let’s run the following code and try to tease out where the error is. The tricky part is that the error occurs within a function where the function comes from a module (separate code file).

import days


We’ll run that first in a plain Python session and then in an IPython session that shows more information about what happened.

The list of function calls that led to the error is called a traceback. (Note that in R you can get similar output using traceback() after an error or setting options(error = recover) before an error.)
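To experiment with tracebacks without the days module, here is a small self-contained sketch. The functions parse_day and load_days are hypothetical stand-ins, not the code in days; the traceback package from the standard library lets us capture the traceback as a string rather than letting the error stop the session.

```python
import traceback

def parse_day(text):
    # int() raises a ValueError for non-numeric text
    return int(text)

def load_days(entries):
    return [parse_day(entry) for entry in entries]

try:
    load_days(["1", "2", "three"])
except ValueError:
    # format_exc() returns the traceback text; read it from the bottom up:
    # the last line names the error, and the lines above show the chain of calls
    trace_text = traceback.format_exc()
    print(trace_text)
```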

Data Structures

Python has a number of basic data structure types that are widely used. There are both similarities and differences from basic data structures in R.


Python has integers, floats, and complex numbers with the usual operations.


x = 1.1

x * 2
x ** 2

(type(1), type(1.1), type(1+2j))
y = 1+2j

We can apply various functions to numbers, as expected.

# cos(0)  # Why would this fail?

import math

The math package in the standard library includes many additional numerical operations.
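A few examples of what the math package offers, as a brief sketch (the variable names are ours):

```python
import math

# After `import math`, its functions and constants are accessed through the module name.
cos_zero = math.cos(0)
hypotenuse = math.hypot(3, 4)   # Euclidean length of the vector (3, 4)
log_e = math.log(math.e)        # natural logarithm

print(cos_zero, hypotenuse, log_e, math.pi)
```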

  • Using the section on “Built-in Types” from the official “The Python Standard Library” reference, figure out how to compute:

    1. 3 modulo 4,
    2. \((\lceil \frac{3}{4} \rceil \times 4)^3\), and
    3. \(\sqrt{-1}\).
  • Is the result of 5/3 - 2/3 of the integer type? Is the mathematical value seen in Python an integer? What about 7/3-4/3?

  • Here’s a numerical puzzle. Why does the last computation not work, when the others do? And, for those of you coming from R, which of these computations don’t work in R?


Objects and object-oriented programming

We’ll talk about this in more detail later, but it’s worth mentioning here that Python is an object-oriented language. What this means is that the values that variables refer to are objects, each of which is an instance of a class.

Objects have methods that can be used on them and fields (member data) that are part of the object. All objects in a class have the same methods and same member data ‘slots’, but different objects will have different values in those slots.

Note that even the basic numeric structures behave like objects. We can use tab completion to see what methods are available for an object and what member data are part of an object.

x = 3.0
# x.as_integer_ratio  x.hex               x.real
# x.conjugate         x.imag              
# x.fromhex           x.is_integer        

Which of those are attributes/metadata (‘member data’) and which are methods (‘member functions’)? If it’s a method, say foo, you can run the method as x.foo(). If it’s member data, you can see its value with x.foo.
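One way to check your answer programmatically is to ask whether each attribute is callable. This is a sketch using the built-ins dir(), getattr(), and callable(); the list names are ours.

```python
x = 3.0

# Public attribute names (those not starting with an underscore).
public_names = [name for name in dir(x) if not name.startswith("_")]

# Methods are callable; member data are not.
method_names = [name for name in public_names if callable(getattr(x, name))]
data_names = [name for name in public_names if not callable(getattr(x, name))]

print(method_names)
print(data_names)
print(x.is_integer(), x.real)   # call a method; read member data
```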


Strings are immutable sequences of (zero or more) characters.


Unlike numbers, Python strings are container objects. Specifically, a string is a sequence. Python has several sequence types including strings, tuples, and lists. Sequence types share some common functionality, which we can demonstrate with strings.


To see how indexing works in Python let’s use the string containing the digits 0 through 9.

import string

Note that indexing starts at 0 (unlike R and Fortran, but like C). Also negative integers index starting from the end of the sequence. You can find the length of a sequence using the len function.


Slicing allows you to select a subset of a string (or any sequence) by specifying start and stop indices as well as a step, which you specify using the start:stop:step notation inside of square brackets.

As we work through these, try to guess what they will do.
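Here are a few slices of string.digits to try, offered as a sketch; the variable names are ours. Guess each result before printing it.

```python
import string

digits = string.digits   # the string containing the digits 0 through 9

first_three = digits[:3]    # start defaults to the beginning
last_two = digits[-2:]      # negative indices count from the end
every_other = digits[::2]   # step of 2 over the whole string
stepped = digits[2:8:3]     # start at index 2, stop before index 8, step 3

print(first_three, last_two, every_other, stepped)
```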


Subsequence testing

'23' in string.digits 
'25' not in string.digits

String methods

string1 = "my string"


string1 + " is your string."


string1[3:5] = 'ts'
TypeError: 'str' object does not support item assignment

string1 > "ab"
string1 > "zz"

What do you think is invoked when one does string1 > 'ab'?


  • Using this string: x = 'The ant wants what all ants want.', solve the following string manipulation problems using string indexing, slicing, methods, and subsequence testing:
    1. Convert the string to all lower case letters (don’t change x).
    2. Count the number of occurrences of the substring ant.
    3. Create a list of the words occurring in x. Make sure to remove punctuation and convert all words to lowercase.
    4. Using only string methods on x, create the following string: The chicken wants what all chickens want.
    5. Using indexing and the + operator, create the following string: The tna wants what all ants want.
    6. Do the same thing except using a string method instead.
  • What can you do with the in and not in operators? For those coming from R, what R operator is this like and how is it different?
  • Figure out what code you could run to determine whether Python explicitly counts the number of characters when it evaluates len(x).
  • Compare the time for computing the length of a (long) string in Python and R. What can you infer about what is happening behind the scenes?


Tuples are immutable sequences of (zero or more) objects. Functions in Python often return tuples.

x = 1; y = 'foo'

xy = (x, y)
xy = x,y


xy[1] = 3  # immutable!
TypeError: 'tuple' object does not support item assignment
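As an example of a function that returns a tuple, the built-in divmod returns a quotient and a remainder together. A brief sketch (the variable names are ours):

```python
# divmod is a built-in that returns a (quotient, remainder) tuple.
result = divmod(17, 5)

# Tuple unpacking assigns each element to its own name.
quotient, remainder = result

print(type(result), quotient, remainder)
```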


  • What’s weird about this? What are the types involved?

    z = x,y
    a,b = x,y
  • Create the following: x=5 and y=6. Now swap their values using a single line of code. (For R users, how would you do this in R?)

  • What happens when you multiply a tuple by a number? For R users, how is this different than similar syntax in R?

  • What’s nice about using immutable objects in your code?


Lists are mutable sequences of (zero or more) objects.

dice = [1, 2, 3, 4, 5, 6]


dice[1::2] = dice[::2]




1 in dice


dice2 = dice.copy()
id(dice)   # the 'id' gives the location in memory where the object is stored

# if use dice.append(dice) it embeds a pointer to itself 
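The following sketch shows a few list methods that modify a list in place, and what the comment above about appending a list to itself means. The names rolls and nested are ours.

```python
rolls = [1, 2, 3]

rolls.append(4)           # add a single item to the end
rolls.extend([5, 6])      # add each item of another sequence
last_roll = rolls.pop()   # remove and return the final item

# Appending a list to itself stores a reference to the same object,
# not a copy of its contents.
nested = [0]
nested.append(nested)

print(rolls, last_roll, nested[1] is nested)
```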


  • Create a list of numbers. Reverse the order of the items in the list using slicing. Now reverse the order of the items using a list method. How does using the method differ from slicing?

  • Do you think tuples have a method to reverse the order of their items? Why or why not? Check to see if you are correct or not.

  • Using a list method, sort your numbers. Create a list of strings and sort it.

  • Figure out some different ways of combining your list of numbers with your list of strings to create a single list of mixed type elements.

  • Now try to sort the resulting list. What happens?

  • What does the following tell you about copying and use of memory in Python?

    a = [1, 3, 5]
    b = a
    # this should confirm what you might suspect
    a[1] = 5


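Dictionaries are mutable collections of key-value pairs. Items are accessed by key rather than by position (since Python 3.7, dictionaries do preserve the order in which items were inserted).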

students = {"Jarrod Millman": ['A', 'B+', 'A-'], 
            "Thomas Kluyver": ['A-', 'A-'], 
            "Stefan van der Walt": 'and now for something completely different'}
students["Jarrod Millman"]
students["Jarrod Millman"][1]
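A few other common ways to query a dictionary, as a sketch on a small made-up dictionary (grades and its contents are hypothetical):

```python
grades = {"Ada": ["A", "B+"], "Grace": ["A-"]}

has_ada = "Ada" in grades          # membership tests look at the keys
missing = grades.get("Alan", [])   # .get returns a default instead of raising KeyError
names = list(grades.keys())
n_marks = [len(marks) for marks in grades.values()]

# .items() yields (key, value) pairs, which can be unpacked in a loop
for name, marks in grades.items():
    print(name, marks)
```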

  • How can you add an item to a dictionary?
  • How do you combine two dictionaries into a single dictionary?
  • What are some analogs to dictionaries in R?
  • How are dictionaries different from such analogous structures in R?


Sets are mutable, unordered collections of unique elements (frozenset is the immutable variant).

x =  {1, 2, 4, 1, 4}
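A short sketch of the standard set operations follows; the second set y and the result names are ours.

```python
x = {1, 2, 4, 1, 4}   # duplicates are dropped when the set is created
y = {2, 3}

x.add(7)              # a set can be modified in place

union = x | y
intersection = x & y
difference = x - y

print(x, union, intersection, difference)
```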

Built-in functions

Python has several built-in functions (the full list is in the “Built-in Functions” section of the official standard library reference). We’ve already used a few (e.g., len(), type(), print()). Here are a few more that you will find useful.

zip and enumerate create objects that you can iterate through, but you can’t view the elements directly except by looping over them. They’re useful either to convert to lists or to loop through element by element.

x = zip([1, 2], ["a", "b"])
list(x)   # convert to a list to view the elements
[(1, 'a'), (2, 'b')]

That’s sort of like cbind in R.

enumerate(["a", "b"])
list(enumerate(["a", "b"]))
[(0, 'a'), (1, 'b')]

We’ll soon see an example usage of enumerate within the context of looping.

Some other useful built-in functions are abs(), all(), any(), dict(), dir(), id(), list(), and set().
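Here is a brief sketch of several of these in use (the variable names are ours):

```python
values = [3, -1, 4]

all_positive = all(value > 0 for value in values)   # True only if every test passes
any_negative = any(value < 0 for value in values)   # True if at least one test passes
magnitudes = [abs(value) for value in values]

pairs = dict(zip(["a", "b"], [1, 2]))   # build a dictionary from paired sequences
unique = set([1, 1, 2])                 # build a set from a list
as_list = list("abc")                   # build a list from any sequence

print(all_positive, any_negative, magnitudes, pairs, unique, as_list)
```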

Control flow


This is as expected based on your experience with other languages. As previously noted, the indentation is important.

a = 2

if a >= 4:
    print('a is big')
    if a == 4:
        print('a is 4')
else:
    print('a is small')

if a >= 4:
    print('a is big')
    if a == 4:
        print('a is 4')
    else:
        print('a is not 4')
a is small

For-loops (and list comprehension)

Here’s basic use of a for loop. Once again indentation is critical, in this case for indicating where the loop ends.

for x in [1,2,3,4]:
    y = x*2
    print(y, end=" ")

# the print call below is not indented, so it runs once, after the loop has finished
for x in range(30):
    y = x
print(y, end=" ")
2 4 6 8 29 


Building up a list piece-by-piece is a common task, which can easily be done in a for-loop. List comprehension provides a compact syntax to handle this task.

[x for x in range(4)]

vals = [-4, 3, -1, 2.5, 7]
[x for x in vals if x > 0]  # list comprehension

import string
letters = string.ascii_lowercase

# concise but terse:
[l[1] for l in enumerate(letters) if l[0] > 13]
# better style:
[letter for index, letter in enumerate(letters) if index > 13]

x = zip(['clinton', 'bush', 'obama', 'trump'], ['Dem', 'Rep', 'Dem', 'Rep'])
for pres,party in x:
    print(pres, party)

# note if we try to do the loop again, we get nothing, because
# we have emptied the iterator when we loop through it
for pres,party in x:
    print(pres, party)
clinton Dem
bush Rep
obama Dem
trump Rep
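If you need to loop over the pairs more than once, one option is to convert the zip object to a list first. A minimal sketch (the names pairs, first_pass, second_pass, exhausted, and leftover are ours):

```python
names = ['clinton', 'bush', 'obama', 'trump']
parties = ['Dem', 'Rep', 'Dem', 'Rep']

# Materialize the pairs once; a list can be traversed repeatedly.
pairs = list(zip(names, parties))
first_pass = [pres for pres, party in pairs]
second_pass = [party for pres, party in pairs]

# By contrast, a zip object is consumed by the first traversal.
exhausted = zip(names, parties)
list(exhausted)              # uses up the iterator
leftover = list(exhausted)   # nothing remains

print(first_pass, second_pass, leftover)
```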


  • See what [1, 2, 3] + 3 returns. Try to explain what happened and why.
  • Use list comprehension to perform element-wise addition of a scalar to a list of scalars.
  • How would you do the same task using a for loop? The range function may be helpful as might the enumerate function.
  • Use a for loop to iterate through the elements of a zip object and determine the type of the individual elements.


Here’s an example that illustrates both positional arguments (always first) and named arguments.

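def add(x, y=1, absol=False):
    # return x + y, or its absolute value if absol is True
    if absol:
        return abs(x + y)
    else:
        return x + y

add(3, 5)

add(3, absol=True, y=-5)

add(y=-5, x=3)
add(y=-5, 3)   # invalid: positional arguments must come before named arguments
SyntaxError: positional argument follows keyword argument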


  • Define a function that will take the sqrt of a number and will (if requested by the user) set the square root of a negative number to 0.

  • Now use the function in a list comprehension to operate on a list of numbers.

  • What happens if you modify a list within a function in Python; why do you think this is?

  • What happens if you modify a single number (scalar) within a function in Python; why do you think this is?

  • Recall how scoping works in R in terms of where objects are looked for if not found locally in a function (i.e., lexical scoping). Empirically assess how scoping works in Python. In other words, consider this function and how it will behave depending on where f is defined.

    def f(x):


We’ve already seen a bunch of object-oriented behavior. Here we’ll see how to make our own classes and objects that are instances (realizations) of a class.

class Rectangle(object):
    dim = 2  # class variable
    counter = 0
    def __init__(self, height, width):
        self.height = height  # instance variable
        self.width = width    # instance variable
        Rectangle.counter += 1
    def __repr__(self):
        return("{0} by {1} rectangle".format(self.height, self.width))        
    def area(self, verbose = False):
        if verbose:
            print('Computing the area... ')
        return self.height * self.width
    def set_diagonal(self):
        self.diagonal = pow(self.height**2 + self.width**2, 0.5)

x = Rectangle(10, 5)

x.dim = 'foo'
x.dim # hmmm


y = Rectangle(4, 8)
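The behavior of x.dim above comes from the difference between class variables and instance variables. Here is a self-contained sketch of the same mechanism using a smaller, hypothetical class (Counter is ours, not part of the tutorial code):

```python
class Counter:
    total = 0   # class variable, shared by all instances

    def __init__(self, label):
        self.label = label   # instance variable
        Counter.total += 1

first = Counter("a")
second = Counter("b")

# Assigning through an instance creates a new instance attribute that
# shadows the class variable for that instance only.
first.total = "foo"

print(first.total, second.total, Counter.total)
```

The class variable itself is unchanged, so other instances (and the class) still see the shared value.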

Data formats


The Python standard library provides a package for reading and writing CSV files. This is a somewhat low-level library, so in practice you will often use Pandas (or perhaps NumPy/SciPy) CSV functionality.
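For reference, here is a minimal sketch of the standard library csv package. To keep the example self-contained it writes to an in-memory text buffer (io.StringIO) rather than to a file on disk, and the rows are made up.

```python
import csv
import io

# An in-memory text buffer stands in for a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["country", "year"])
writer.writerow(["Chile", 2007])

# Rewind and read the rows back; every field comes back as a string.
buffer.seek(0)
rows = list(csv.reader(buffer))

print(rows)
```

Note that the year is read back as the string '2007'; converting types is left to you, which is one reason the Pandas functionality is usually more convenient.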


However, the json package in the standard library is much more useful.

import json

x = {"name": "Jarrod", "department": "Biostatistics"}

with open("tmp.json", "w") as outfile: 
    json.dump(x, outfile)

# cat tmp.json  # special IPython functionality

with open("tmp.json") as infile:
    y = json.load(infile)

{'name': 'Jarrod', 'department': 'Biostatistics'}

Note that cat is not a Python statement. IPython is clever enough to guess that you want it to call out to the underlying operating system.


  • One of the nice things about the JSON format is that it is so well structured that it is easy for a machine to parse, but simple enough that it is easy for humans to read. By default json.dump writes everything out to disk without line breaks. For readability purposes, use json.dump? to figure out how to pretty-print the text as well as sort it alphabetically by key.
  • Read in one of the JSON files in the project directory and experiment with pretty-printing it when you dump it back out to a file.

Standard library

Python provides a wealth of functionality in its huge standard library. We’ve already seen several packages in the standard library (e.g., math, csv, json). If you need some functionality the standard library is one of the first places to look.

Here are a couple of packages that you may find useful.


import os

pwd           # IPython operating system call
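The os package offers Python equivalents of such operating system calls. A brief sketch follows; the path components "data" and "tmp.json" are hypothetical, and nothing is created on disk.

```python
import os

current_dir = os.getcwd()   # the Python analog of pwd

# Build a path in a way that works across operating systems.
data_path = os.path.join(current_dir, "data", "tmp.json")
file_name = os.path.basename(data_path)
stem, extension = os.path.splitext(file_name)

print(current_dir, file_name, stem, extension, os.path.exists(data_path))
```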



  • Use os? and dir(os) to explore the os package.


The re package provides support for regular expressions.
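As a small sketch of what that looks like, the following finds, extracts, and replaces patterns in a made-up string (the text and variable names are ours):

```python
import re

text = "Measured on 2007-03-15 and again on 2008-11-02."

# findall returns every non-overlapping match of the pattern.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# search returns the first match; parentheses define groups that can be extracted.
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
year = match.group(1)

# sub replaces every match of the pattern.
masked = re.sub(r"\d", "#", text)

print(dates, year, masked)
```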

Math, statistics, and plotting

Numpy and scipy

Standard lists in Python are not amenable to mathematical manipulation, unlike standard vectors in R. Instead we generally work with numpy arrays. These arrays can be of various dimensions (i.e., vectors, matrices, multi-dimensional arrays).

import numpy as np
z = [0, 1, 2] 

y = np.array(z)


x = np.array([[1, 2], [3, 4]], dtype=np.float64)


e = np.linalg.eig(x)

e[0]  # eigenvalues (the first is not the largest in this case...)
e[1][:, 0] # corresponding eigenvector
array([-0.82456484,  0.56576746])

All of the elements of the array must be of the same type.

There are a variety of numpy functions that allow us to do standard mathematical/statistical manipulations.

Here we’ll use some of those functions in addition to some syntax for subsetting and vectorized calculations.

np.linspace(0, 1, 5)

x = np.random.normal(size=10)

pos = x > 0

y = x[pos]

x[[1, 3, 4]]

x[pos] = 0

array([1.        , 1.        , 1.        , 1.        , 1.        ,
       0.55928119, 1.        , 0.98856735, 0.99467766, 1.        ])

scipy has even more numerical routines, including working with distributions and additional linear algebra.

import scipy.stats as st
st.norm.cdf(1.96, 0, 1)
st.norm.cdf(1.96, 0.5, 2)
st.norm(0.5, 2).cdf(1.96)


  • See what happens if you try to create a 2-d numpy array with one column of numbers and one column of characters/strings.
  • Try to add a vector to a matrix; how does this compare to R?
  • How do you compute the variance of each column of a matrix (akin to apply(x, 2, var) in R)? Try this first without ChatGPT, and then (if you have access to ChatGPT) see what ChatGPT tells you.


Pandas provides a Python implementation of R’s dataframe capabilities. Let’s see some example code.

import pandas as pd
dat = pd.read_csv('gapminder.csv')


dat.sort_values(['year', 'country'])

dat.loc[0:5, ['year', 'country']]  # R-style indexing

dat[dat.year == 1952]

ndat = dat[['pop','lifeExp','gdpPercap']]
ndat.apply(lambda col: col.max() - col.min())
pop          1.318623e+09
lifeExp      5.900400e+01
gdpPercap    1.132820e+05
dtype: float64

Now let’s see the sort of split-apply-combine functionality that is popular in dplyr and related R packages.

dat2007 = dat[dat.year == 2007].copy()  

dat2007.groupby('continent', as_index=False)[['lifeExp','gdpPercap']].mean()

def stdize(vals):
    return((vals - vals.mean()) / vals.std())

dat2007['lifeExpZ'] = dat2007.groupby('continent')['lifeExp'].transform(stdize)


  • Use pd.merge() to merge the continent means for life expectancy for 2007 back into the original dat2007 DataFrame.
  • Explore the pandas documentation to see if there is a way to add the continent means as a column without an explicit merge (i.e., mimicking capability built into R’s dplyr mutate function).


Work through the material in the Matplotlib tutorial and then solve the following problems.


  • Make a scatterplot of lifeExp vs gdpPercap for 2007; make sure you have nice axis labels and title.
  • Consider whether plotting income on a logarithmic axis is a better way to display the data.
  • Now make an array of plots (in one figure) where each subplot is a different year.
  • Now modify the plot so that the color of the point indicates the continent and add a legend that explains this (hint: it’s probably easier to do this with plt.scatter than plt.plot). Hint: This snippet of code can help with relating continents to particular colors: dict(zip(conts, colors)).
  • Find an outlier whose lifeExp is unexpectedly high or low and add text in the plot that associates the point with the name of the country (e.g., I suspect Cuba will be unexpectedly high and an oil-producing country may be unexpectedly low).


Please open project.html and work through the analysis of the Senatorial tweets with your group.

