Introduction to Python

Author

Chris Paciorek

Published

August 11, 2023

Background

Preparation

This tutorial assumes you are have a bit of familiarity with some high-level language such as R or MATLAB. Some of the exercises as you to compare the Python functionality to functionality in R. Feel free to ignore these if you’re not familiar with R.

You will also need Python (IPython will also be helpful) installed on your computer, as well as a few core additional packages, including re, numpy, scipy, matplotlib, and pandas.

We recommend using the Anaconda distribution of Python, but you can see various options in the short course overview.

Packages can generally be installed via pip, or via conda if you have the Anaconda or Miniconda installation of Python. The following shows how to do this from the command line (this will work on MacOS or Linux, not sure about Windows).

conda list
conda install numpy

## using mamba (a fast drop-in replacement for `conda`)
mamba list
mamba install numpy

## using pip
pip install numpy
# or to install within your home directory if you do not have admin control of the computer
pip install --user numpy

For additional help with installation, please see the this SCF documentation or this Statistics 243 documentation.

Resources

Useful written references and tutorials:

While working through this tutorial, you should type the example code snippets at an interactive Python terminal. You may wish to use either the IPython shell (which has some additional functionality relative to a plain Python session) or a Jupyter IPython notebook. To start an IPython shell, type the following at a bash prompt:

ipython

To start an Jupyter IPython notebook locally on your computer, type this at the command line and a notebook should open in your browser.

jupyter notebook

Alternatively you can access Jupyter notebooks through a service called JupyterHub, in particular the campus DataHub or SCF JupyterHub.

Side note: to have all output of printing objects to the screen by typing the object name (not just the last result) printed in the Jupyter notebook, you can run this in a cell in your notebook.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Python 2 vs. 3

Until a few years ago, many people still used Python 2 even though Python 3 was available. It’s possible you’ll run across Python 2 code or the occasional person still using Python 2.

Python 3 is the current version of Python (more specifically Python 3.11), which is in some ways incompatible with Python 2. You should be using Python 3, though most of the code here will also work in Python 2.

Introduction

Formatting Python code

Unlike most languages, in Python indentation determines code blocks, including functions, loops, and if-else statements.

The standard is 4 spaces (some people use a tab instead), but you can use other spacing if it’s consistent within a block of code.

a = 3
 a = 3  # this will cause an IndentationError in Python itself, but not IPython/Jupytr

if a>=4:    
    print('a is big')
    if(a == 4):
        print('a is 4')
else:
    print('a is small')

if a>=4:    
  print('a is big')
  if(a == 4):
        print('a is 4')
  else:
        print('a is not 4')

Objects

Everything is an object in Python. Roughly, this means that it can be tagged with a variable (i.e., given a name) and passed as an argument to a function. Objects have attributes, which include fields and methods.

Certain objects in Python are mutable (e.g., lists, dictionaries), while other objects are immutable (e.g., tuples, strings, sets). Mutable means one can change parts of the object.

Many objects can be composite (e.g., a list of dictionaries or a dictionary of lists, tuples, and strings).

myList = [1, 2, 'foo']
myList[1] = 2.5
myList

myTuple = (1, 2, 'foo')
try:
    myTuple[1] = 2.5
except Exception as error:
    print(error)
myTuple

'tuple' object does not support item assignment

(1, 2, 'foo')

Variables

As in R and other interpreted languages, variables are not their values in Python (think “I am not my name, I am the person named XXX”). You can think of variables as tags on objects. In particular, variables can be bound to an object of one type and then reassigned to an object of another type without error.

a = 'foobar'
a
a * 4
len(a)

a = 3
a
a*4
len(a)

TypeError: object of type 'int' has no len()

Modules, files, packages, import

While you will often explore things from an interactive Python prompt, you will save your code in files for reuse as well as to document what you’ve done. You can use Python code saved in a plain text file (i.e., a module) from a Python prompt or other files by importing it. Typically, this is done at the top of a file (if you are working at a prompt, you just need to import it before you want to use the functionality).

cat mymod.py  # special IPython functionality to call the operating system

del(a); del(hello)

import mymod

mymod.hello()
mymod.a

hello()
a

This is convenient, but often not seen as good practice, as is reduces modularity and can interfere with already-existing objects.

from mymod import *

hello()
a

As in R, you can also load in additional supporting packages for extra functionality. Here are some examples of importing Python packages:

from math import cos
cos(0)
sin(0)
math.cos(0)

dir()

import math
dir()
math.cos(0)
math.sin(0)
dir(math)

import numpy as np
numpy.arctan(1) 
np.arctan(1)
import scipy as sp
import matplotlib.pyplot as plt

When we import packages, they are in different namespaces, which helps to avoid problems with different packages using the same names for different functions or objects.

Style

Adopting standard coding conventions is good practice.

The first link above is the official “Style Guide for Python Code”, usually referred to as PEP8 (PEP is an acronym for Python Enhancement Proposal). There are a couple of potentially helpful tools for helping you conform to the standard. The pep8 package that provides a commandline tool to check your code against some of the PEP8 standard conventions. Similarly, autopep8 provides a tool to automatically format your code so that it conforms to the PEP8 standards.

Documentation and getting help

Getting help pulls up the relevant docstring (see here for some guidance on writing docstrings (in particular for NumPy). Let’s briefly see how you might benefit from docstrings in practice.

In [1]: import numpy as np

In [2]: np.ndim?
Signature: np.ndim(a)
Docstring:
Return the number of dimensions of an array.

Parameters
----------
a : array_like
    Input array.  If it is not already an ndarray, a conversion is
    attempted.

Returns
-------
number_of_dimensions : int
    The number of dimensions in `a`.  Scalars are zero-dimensional.

See Also
--------
ndarray.ndim : equivalent method
shape : dimensions of array
ndarray.shape : dimensions of array

Examples
--------
>>> np.ndim([[1,2,3],[4,5,6]])
2
>>> np.ndim(np.array([[1,2,3],[4,5,6]]))
2
>>> np.ndim(1)
0
File:      /usr/local/linux/mambaforge-3.11/lib/python3.11/site-packages/numpy/core/fromnumeric.py
Type:      function

Docstrings are an important part of Python. A docstring is a character string that occurs as the first statement in a module, function, class, or method definition. Such a docstring becomes the doc special attribute of that object. All modules should normally have docstrings, and all functions and classes exported by a module should also have docstrings.

We can see the docstring directly in the file indicated above (fromnumeric.py) as well as the actual code of the function.

Exercises

Note that ? and ?? only work in IPython (or a Jupyter notebook). For help in plain Python, use help(np.ndim).

What happens if you type np.ndim?? (i.e., use two question marks)?
What does np.ndim() do? How does it execute under the hood? Consider why the following uses of ndim both work.
```
a = np.array([0, 1, 2])
a.ndim
np.ndim(a)
```
Now explain why only one of these works.
```
a = [0, 1, 2]
a.ndim
np.ndim(a)
```
Type np.array? at an IPython prompt. Briefly skim the docstring. nparray allows you to construct numpy arrays.
Type np. followed by the <Tab> key at an IPython prompt. Choose two or three of the completions and use ? to view their docstrings. In particular, pay attention to the examples provided near the end of the docstring and see whether you can figure out how you might use this functionality.

Decoding error messages

Let’s run the following code and try to tease out where the error is. The tricky part is that the error occurs within a function where the function comes from a module (separate code file).

import days

days.print_friday_message()

We’ll run that first in a plain Python session and then in an IPython session that shows more information about what happened.

The list of function calls that led to the error is called a traceback. (Note that in R you can get similar output using traceback() after an error or setting options(error = recover) before an error.)

Data Structures

Python has a number of basic data structure types that are widely used. There are both similarities and differences from basic data structures in R.

Numbers

Python has integers, floats, and complex numbers with the usual operations.

2*3
2/3

x = 1.1
type(x)

x * 2
x ** 2

(type(1), type(1.1), type(1+2j))
y = 1+2j

We can apply various functions to numbers, as expected.

# cos(0)  # Why would this fail?

import math
math.cos(0)
math.cos(math.pi)

-1.0

The math package in the standard library includes many additional numerical operations.

math.

math.acos       math.degrees    math.fsum       math.pi
math.acosh      math.e          math.gamma      math.pow
math.asin       math.erf        math.hypot      math.radians
math.asinh      math.erfc       math.isinf      math.sin
math.atan       math.exp        math.isnan      math.sinh
math.atan2      math.expm1      math.ldexp      math.sqrt
math.atanh      math.fabs       math.lgamma     math.tan
math.ceil       math.factorial  math.log        math.tanh
math.copysign   math.floor      math.log10      math.trunc
math.cos        math.fmod       math.log1p      
math.cosh       math.frexp      math.modf

Exercises

Using the section on “Built-in Types” from the official “The Python Standard Library” reference, figure out how to compute:
1. 3 modulo 4,
2. \((\lceil \frac{3}{4} \rceil \times 4)^3\), and
3. \(\sqrt{-1}\).
Is the result of 5/3 - 2/3 of the integer type? Is the mathematical value seen in Python an integer? What about 7/3-4/3?
Here’s a numerical puzzle. Why does the last computation not work, when the others do? And, for those of you coming from R, which of these computations don’t work in R?
```
100000**10
100000.0**10
100000**100
100000.0**100
```

Objects and object-oriented programming

We’ll talk about this in more detail later, but it’s worth mentioning here that Python is an object-oriented language. What this means is that variables in Python are objects that are instances of a class.

Objects have methods that can be used on them and fields (member data) that are part of the object. All objects in a class have the same methods and same member data ‘slots’, but different objects will have different values in those slots.

Note that even the basic numeric structures behave like objects. We can use tab completion to see what methods are available for an object and what member data are part of an object.

x = 3.0
type(x)
x.
# x.as_integer_ratio  x.hex               x.real
# x.conjugate         x.imag              
# x.fromhex           x.is_integer

Which of those are attributes/metadata (‘member data’) and which are methods (‘member functions’)? If it’s a method, say foo, you can run the method as x.foo(). If it’s member data, you can see its value with x.foo.

Strings

Strings are immutable sequences of (zero or more) characters.

Sequences

Unlike numbers, Python strings are container objects. Specifically, it is a sequence. Python has several sequence types including strings, tuples, and lists. Sequence types share some common functionality, which we can demonstrate with strings.

Indexing

To see how indexing works in Python let’s use the string containing the digits 0 through 9.

import string
string.digits 
string.digits[1]
string.digits[-1]

Note that indexing starts at 0 (unlike R and Fortran, but like C). Also negative integers index starting from the end of the sequence. You can find the length of a sequence using the len function.

Slicing

Slicing allows you to select a subset of a string (or any sequence) by specifying start and stop indices as well as a step, which you specify using the start:stop:step notation inside of square braces.

As we work through these, try to guess what they will do.

string.digits[1:5]
string.digits[1:5:2]
string.digits[1::2]
string.digits[:5:-1]
string.digits[1:5:-1]
string.digits[-3:-7:-1]

Subsequence testing

'23' in string.digits 
'25' not in string.digits

String methods

string1 = "my string"
string1.

string1.capitalize  string1.islower     string1.rpartition
string1.center      string1.isspace     string1.rsplit
string1.count       string1.istitle     string1.rstrip
string1.decode      string1.isupper     string1.split
string1.encode      string1.join        string1.splitlines
string1.endswith    string1.ljust       string1.startswith
string1.expandtabs  string1.lower       string1.strip
string1.find        string1.lstrip      string1.swapcase
string1.format      string1.partition   string1.title
string1.index       string1.replace     string1.translate
string1.isalnum     string1.rfind       string1.upper
string1.isalpha     string1.rindex      string1.zfill
string1.isdigit     string1.rjust

string1 = "my string"
string1.upper()

string1.upper?

string1 + " is your string."
"*"*10

string1[3:]
string1[3:4] 
string1[4::2]

string1[3:5] = 'ts'

TypeError: 'str' object does not support item assignment

string1 > "ab"
string1 > "zz"
string1.__

string1.__add__           string1.__len__
string1.__class__         string1.__lt__
string1.__contains__      string1.__mod__
string1.__delattr__       string1.__mul__
string1.__doc__           string1.__ne__
string1.__eq__            string1.__new__
string1.__format__        string1.__reduce__
string1.__ge__            string1.__reduce_ex__
string1.__getattribute__  string1.__repr__
string1.__getitem__       string1.__rmod__
string1.__getnewargs__    string1.__rmul__
string1.__getslice__      string1.__setattr__
string1.__gt__            string1.__sizeof__
string1.__hash__          string1.__str__
string1.__init__          string1.__subclasshook__

What do you think is invoked when one does string1 > 'ab'?

Exercises

Using this string: x = 'The ant wants what all ants want.', solve the following string manipulation problems using string indexing, slicing, methods, and subsequence testing:
1. Convert the string to all lower case letters (don’t change x).
2. Count the number of occurrences of the substring ant.
3. Create a list of the words occurring in x. Make sure to remove punctuation and convert all words to lowercase.
4. Using only string methods on x, create the following string: The chicken wants what all chickens want.
5. Using indexing and the + operator, create the following string: The tna wants what all ants want.
6. Do the same thing except using a string method instead.
What can you do with the in and not in operators? For those coming from R, what R operator is this like and how is it different?
Figure out what code you could run to figure out if Python is explicitly counting the number of characters when it does len(x)?
Compare the time for computing the length of a (long) string in Python and R. What can you infer about what is happening behind the scenes?

Tuples

Tuples are immutable sequences of (zero or more) objects. Functions in Python often return tuples.

x = 1; y = 'foo'

xy = (x, y)
type(xy)
xy = x,y
type(xy)

xy
xy[1]

xy[1] = 3  # immutable!

TypeError: 'tuple' object does not support item assignment

Exercises

What’s weird about this? What are the types involved?
```
z = x,y
a,b = x,y
```
Create the following: x=5 and y=6. Now swap their values using a single line of code. (For R users, how would you do this in R?)
What happens when you multiply a tuple by a number? For R users, how is this different than similar syntax in R?
What’s nice about using immutable objects in your code?

Lists

Lists are mutable sequences of (zero or more) objects.

dice = [1, 2, 3, 4, 5, 6]

dice[1::2]

dice[1::2] = dice[::2]

dice

dice*2

dice+dice[::-1]

1 in dice

dice.extend(dice)

dice2 = dice.copy()
id(dice)   # the 'id' gives the location in memory where the object is stored
id(dice2)

dice.append(dice2)
# if use dice.append(dice) it embeds a pointer to itself

Exercises

Create a list of numbers. Reverse the order of the items in the list using slicing. Now reverse the order of the items using a list method. How does using the method differ from slicing?
Do you think tuples have a method to reverse the order of its items? Why or why not? Check to see if you are correct or not.
Using a list method sort your numbers. Create a list of strings and sort it.
Figure out some different ways of combining your list of numbers with your list of strings to create a single list of mixed type elements.
Now try to sort the resulting list. What happens?

What does the following tell you about copying and use of memory in Python?

a = [1, 3, 5]
b = a
id(a)
id(b)
# this should confirm what you might suspect
a[1] = 5

Dictionaries

Dictionaries are mutable, unordered collections of key-value pairs.

students = {"Jarrod Millman": ['A', 'B+', 'A-'], 
            "Thomas Kluyver": ['A-', 'A-'], 
            "Stefan van der Walt": 'and now for something completely different'}
students
students.keys()
students.values()
students["Jarrod Millman"]
students["Jarrod Millman"][1]

'B+'

students.

students.clear(       students.items(       students.setdefault(
students.copy(        students.keys(        students.update(
students.fromkeys(    students.pop(         students.values(
students.get(         students.popitem()

Exercises

How can you add an item to a dictionary.
How do you combine two dictionaries into a single dictionary?
What are some analogs to dictionaries in R?
How are dictionaries different from such analogous structures in R?

Sets

Sets are immutable, unordered collections of unique elements.

x =  {1, 2, 4, 1, 4}
x
x.

x.add                          x.issubset
x.clear                        x.issuperset
x.copy                         x.pop
x.difference                   x.remove
x.difference_update            x.symmetric_difference
x.discard                      x.symmetric_difference_update
x.intersection                 x.union
x.intersection_update          x.update
x.isdisjoint

Built-in functions

https://docs.python.org/3/library/functions.html

Python has several built-in functions (you can find a full list using the link above). We’ve already used a few (e.g., len(), type(), print()). Here are a few more that we you will find useful.

zip and enumerate create objects that you can iterate through, but you can’t view the elements directly except by looping over them. They’re useful either to convert to lists or to loop through element by element.

x = zip([1, 2], ["a", "b"])
list(x)

[(1, 'a'), (2, 'b')]

That’s sort of like cbind in R.

enumerate(["a", "b"])
list(enumerate(["a", "b"]))

[(0, 'a'), (1, 'b')]

We’ll soon see an example usage of enumerate within the context of looping.

Some other useful built-in functions are abs(), all(), any(), dict(), dir(), id(), list(), and set().

Control flow

https://docs.python.org/3/tutorial/controlflow.html

If-then-else

https://docs.python.org/3/tutorial/controlflow.html#if-statements

This is as expected based on your experience with other languages. As previously noted, the indentation is important.

x = 2

if a >=4 :    
    print('a is big')
    if(a == 4):
        print('a is 4')
else:
    print('a is small')

if a >=4 :    
    print('a is big')
    if(a == 4):
        print('a is 4')
    else:
        print('a is not 4')

a is small

For-loops (and list comprehension)

Here’s basic use of a for loop. Once again indentation is critical, in this case for indicating where the loop ends.

for x in [1,2,3,4]:
    print(x)

for x in [1,2,3,4]:
    y = x*2
    print(y, end=" ")

print("\n")
for x in range(30):
    y = x
print(y, end=" ")

Building up a list piece-by-piece is a common task, which can easily be done in a for-loop. List comprehension provides a compact syntax to handle this task.

[x for x in range(4)]

vals = [-4, 3, -1, 2.5, 7]
[x for x in vals if x > 0]  # list comprehension

import string
letters = string.ascii_lowercase

# concise but terse:
[l[1] for l in enumerate(letters) if l[0] > 13]
# better style:
[letter for index, letter in enumerate(letters) if index > 13]

x = zip(['clinton', 'bush', 'obama', 'trump'], ['Dem', 'Rep', 'Dem', 'Rep'])
for pres,party in x:
    print(pres, party)

# note if we try to do the loop again, we get nothing, because
# we have emptied the iterator when we loop through it
for pres,party in x:
    print(pres, party)

clinton Dem
bush Rep
obama Dem
trump Rep

Exercises

See what [1, 2, 3] + 3 returns. Try to explain what happened and why.
Use list comprehension to perform element-wise addition of a scalar to a list of scalars.
How would you do the same task using a for loop? The range function may be helpful as might the enumerate function.
Use a for loop to iterate through the elements of a zip object and determine the type of the individual elements.

Functions

https://docs.python.org/3/tutorial/controlflow.html#defining-functions

Here’s an example that illustrates both positional arguments (always first) and named arguments.

def add(x, y=1, absol=False):
    if absol:
        return(abs(x+y))
    else:
        return(x+y)

add(3)
add(3, 5)

add(3, absol=True, y=-5)

add(y=-5, x=3)
add(y=-5, 3)

SyntaxError: positional argument follows keyword argument (1024892447.py, line 13)

Exercises

Define a function that will take the sqrt of a number and will (if requested by the user) set the square root of a negative number to 0.
Now use the function in a list comprehension to operate on a list of numbers.
What happens if you modify a list within a function in Python; why do you think this is?
What happens if you modify a single number (scalar) within a function in Python; why do you think this is?
Recall how scoping works in R in terms of where objects are looked for if not found locally in a function (i.e., lexical scoping). Empirically assess how scoping works in Python. In other words, consider this function and how it will behave depending on where f is defined.
```
def f(x):
    return(x+y)
```

Classes

https://docs.python.org/3/tutorial/classes.html

We’ve already seen a bunch of object-oriented behavior. Here we’ll see how to make our own classes and objects that are instances (realizations) of a class.

class Rectangle(object):
    dim = 2  # class variable
    counter = 0
    def __init__(self, height, width):
        self.height = height  # instance variable
        self.width = width    # instance variable
        self.set_diagonal()
        Rectangle.counter += 1
    def __repr__(self):
        return("{0} by {1} rectangle".format(self.height, self.width))        
    def area(self, verbose = False):
        if verbose:
            print('Computing the area... ')
        return(self.height*self.width)
    def set_diagonal(self):
        self.diagonal = pow(self.height**2 + self.width**2, 0.5)

x = Rectangle(10, 5)

x.dim
x.dim = 'foo'
x.dim # hmmm

x.area()
Rectangle.area(x)

y = Rectangle(4, 8)
y.counter
x.counter

Data formats

CSV

https://docs.python.org/3/library/csv.html

The Python standard library provides a package for reading and writing CSV files. This is a somewhat low-level library, so in practice you will often use Pandas (or perhaps NumPy/SciPy) CSV functionality.

JSON

https://docs.python.org/3/library/json.html

However the JSON package in the standard library is much more useful.

import json

x = {"name": "Jarrod", "department": "Biostatistics"}

with open("tmp.json", "w") as outfile: 
    json.dump(x, outfile)

# cat tmp.json  # special IPython functionality

with open("tmp.json") as infile:
    y = json.load(infile)

y

{'name': 'Jarrod', 'department': 'Biostatistics'}

Note that cat is not a Python statement. IPython is clever enough to guess that you want it to call out to the underlying operating system.

Exercise

One of the nice things above the JSON format is that it so well structured that it easy for a machine to parse, but simple enough that it easy for humans to read. By default json.dump writes everything out to disk without line breaks. For readability purposes, use json.dump? to figure out how to pretty-print the text as well as sort it alphabetically by key.
Read in one of the JSON files in the project directory and experiment with pretty-printing it when you dump it back out to a file.

Standard library

https://docs.python.org/3/tutorial/stdlib.html

Python provides a wealth of functionality in its huge standard library. We’ve already seen several packages in the standard library (e.g., math, csv, json). If you need some functionality the standard library is one of the first places to look.

Here are a couple packages that you may find useful.

os

https://docs.python.org/3/tutorial/stdlib.html#operating-system-interface

import os

os.getcwd()
pwd           # IPython operating system call

os.chdir('..')
os.getcwd()

Exercise

Use os? and dir(os) to explore the os package.

re

https://docs.python.org/3/howto/regex.html

The re package provides support for regular expressions.

Math, statistics, and plotting

Numpy and scipy

Standard lists in Python are not amenable to mathematical manipulation, unlike standard vectors in R. Instead we generally work with numpy arrays. These arrays can be of various dimensions (i.e., vectors, matrices, multi-dimensional arrays).

import numpy as np
z = [0, 1, 2] 

y = np.array(z)
y*3

y.dtype


x = np.array([[1, 2], [3, 4]], dtype=np.float64)
x*x
x.dot(x)
x.T

np.linalg.svd(x)

e = np.linalg.eig(x)

e[0]  # first eigenvalue (not the largest in this case...)
e[1][:, 0] # corresponding eigenvector

array([-0.82456484,  0.56576746])

All of the elements of the array must be of the same type.

There are a variety of numpy functions that allow us to do standard mathematical/statistical manipulations.

Here we’ll use some of those functions in addition to some syntax for subsetting and vectorized calculations.

np.linspace(0, 1, 5)

np.random.seed(0)
x = np.random.normal(size=10)

pos = x > 0

y = x[pos]

x[[1, 3, 4]]

x[pos] = 0

np.cos(x)

array([1.        , 1.        , 1.        , 1.        , 1.        ,
       0.55928119, 1.        , 0.98856735, 0.99467766, 1.        ])

scipy has even more numerical routines, including working with distributions and additional linear algebra.

import scipy.stats as st
st.norm.cdf(1.96, 0, 1)
st.norm.cdf(1.96, 0.5, 2)
st.norm(0.5, 2).cdf(1.96)

0.7673049076991025

Exercise

See what happens if you try to create a 2-d numpy array with one column of numbers and one column of characters/strings.
Try to add a vector to a matrix; how does this compare to R?
How do you compute the variance of each column of a matrix (akin to apply(x, 2, var) in R)? Try this first without ChatGPT, and then (if you have access to ChatGPT) see what ChatGPT tells you.

Pandas

Pandas provides a Python implementation of R’s dataframe capabilities. Let’s see some example code.

import pandas as pd
dat = pd.read_csv('gapminder.csv')
dat.head()

dat.columns
dat['year']
dat.year
dat[0:5]

dat.sort_values(['year', 'country'])

dat.loc[0:5, ['year', 'country']]  # R-style indexing

dat[dat.year == 1952]

ndat = dat[['pop','lifeExp','gdpPercap']]
ndat.apply(lambda col: col.max() - col.min())

pop          1.318623e+09
lifeExp      5.900400e+01
gdpPercap    1.132820e+05
dtype: float64

Now let’s see the sort of split-apply-combine functionality that is popular in dplyr and related R packages.

dat2007 = dat[dat.year == 2007].copy()  

dat2007.groupby('continent', as_index=False)[['lifeExp','gdpPercap']].mean()

def stdize(vals):
    return((vals - vals.mean()) / vals.std())

dat2007['lifeExpZ'] = dat2007.groupby('continent')['lifeExp'].transform(stdize)

Exercise

Use pd.merge() to merge the continent means for life expectancy for 2007 back into the original dat2007 dataFrame.
Explore the pandas documentation to see if there is a way to add the continent means as a column without an explicit merge (i.e., mimicing capability built into R’s dplyr mutate function)?

Matplotlib

Work through the material in the Matplotlib tutorial and then solve the following problems.

Exercise

Make a scatterplot of lifeExp vs gdpPerCap for 2007; make sure you have nice axis labels and title.
Consider whether plotting income on a logarithmic axis is a better way to display the data.
Now make an array of plots (in one figure) where each subplot is a different year.
Now modify the plot so that the color of the point indicates the continent and add a legend that explains this (hint: it’s probably easier to do this with plt.scatter than plt.plot). Hint: This snippet of code can help with relating continents to particular colors: dict(zip(conts, colors)).
Find an outlier whose lifeExp is unexpectedly high or low and add text in the plot that associates the point with the name of the country (e.g., I suspect Cuba will be unexpectedly high and an oil-producing country may be unexpectedly low).

Mini-project

Please open project.html and work through the analysis of the Senatorial tweets with your group.

A Note on the Contents

This content was adapted by Chris Paciorek from material prepared by K. Jarrod Millman and is licensed under the CC BY-NC-SA 4.0 license.