from IPython.core.interactiveshell import InteractiveShell
= "all" InteractiveShell.ast_node_interactivity
Introduction to Python
Background
Preparation
This tutorial assumes you are have a bit of familiarity with some high-level language such as R or MATLAB. Some of the exercises as you to compare the Python functionality to functionality in R. Feel free to ignore these if you’re not familiar with R.
You will also need Python (IPython will also be helpful) installed on your computer, as well as a few core additional packages, including re, numpy, scipy, matplotlib, and pandas.
We recommend using the Anaconda distribution of Python, but you can see various options in the short course overview.
Packages can generally be installed via pip, or via conda if you have the Anaconda or Miniconda installation of Python. The following shows how to do this from the command line (this will work on MacOS or Linux, not sure about Windows).
conda list
conda install numpy
## using mamba (a fast drop-in replacement for `conda`)
mamba list
mamba install numpy
## using pip
pip install numpy
# or to install within your home directory if you do not have admin control of the computer
pip install --user numpy
For additional help with installation, please see the this SCF documentation or this Statistics 243 documentation.
Resources
Useful written references and tutorials:
- https://docs.python.org/3/index.html
- https://docs.python.org/3/library/index.html
- https://scipy-lectures.github.io/
While working through this tutorial, you should type the example code snippets at an interactive Python terminal. You may wish to use either the IPython shell (which has some additional functionality relative to a plain Python session) or a Jupyter IPython notebook. To start an IPython shell, type the following at a bash prompt:
ipython
To start an Jupyter IPython notebook locally on your computer, type this at the command line and a notebook should open in your browser.
jupyter notebook
Alternatively you can access Jupyter notebooks through a service called JupyterHub, in particular the campus DataHub or SCF JupyterHub.
Side note: to have all output of printing objects to the screen by typing the object name (not just the last result) printed in the Jupyter notebook, you can run this in a cell in your notebook.
Python 2 vs. 3
Until a few years ago, many people still used Python 2 even though Python 3 was available. It’s possible you’ll run across Python 2 code or the occasional person still using Python 2.
Python 3 is the current version of Python (more specifically Python 3.11), which is in some ways incompatible with Python 2. You should be using Python 3, though most of the code here will also work in Python 2.
Introduction
Formatting Python code
Unlike most languages, in Python indentation determines code blocks, including functions, loops, and if-else statements.
The standard is 4 spaces (some people use a tab instead), but you can use other spacing if it’s consistent within a block of code.
= 3
a = 3 # this will cause an IndentationError in Python itself, but not IPython/Jupytr
a
if a>=4:
print('a is big')
if(a == 4):
print('a is 4')
else:
print('a is small')
if a>=4:
print('a is big')
if(a == 4):
print('a is 4')
else:
print('a is not 4')
Objects
Everything is an object in Python. Roughly, this means that it can be tagged with a variable (i.e., given a name) and passed as an argument to a function. Objects have attributes, which include fields and methods.
Certain objects in Python are mutable (e.g., lists, dictionaries), while other objects are immutable (e.g., tuples, strings, sets). Mutable means one can change parts of the object.
Many objects can be composite (e.g., a list of dictionaries or a dictionary of lists, tuples, and strings).
= [1, 2, 'foo']
myList 1] = 2.5
myList[
myList
= (1, 2, 'foo')
myTuple try:
1] = 2.5
myTuple[except Exception as error:
print(error)
myTuple
'tuple' object does not support item assignment
(1, 2, 'foo')
Variables
As in R and other interpreted languages, variables are not their values in Python (think “I am not my name, I am the person named XXX”). You can think of variables as tags on objects. In particular, variables can be bound to an object of one type and then reassigned to an object of another type without error.
= 'foobar'
a
a* 4
a len(a)
= 3
a
a*4
alen(a)
TypeError: object of type 'int' has no len()
Modules, files, packages, import
While you will often explore things from an interactive Python prompt, you will save your code in files for reuse as well as to document what you’ve done. You can use Python code saved in a plain text file (i.e., a module) from a Python prompt or other files by importing it. Typically, this is done at the top of a file (if you are working at a prompt, you just need to import it before you want to use the functionality).
cat mymod.py # special IPython functionality to call the operating system
del(a); del(hello)
import mymod
mymod.hello()
mymod.a
hello() a
This is convenient, but often not seen as good practice, as is reduces modularity and can interfere with already-existing objects.
from mymod import *
hello() a
As in R, you can also load in additional supporting packages for extra functionality. Here are some examples of importing Python packages:
from math import cos
0)
cos(0)
sin(0)
math.cos(
dir()
import math
dir()
0)
math.cos(0)
math.sin(dir(math)
import numpy as np
1)
numpy.arctan(1)
np.arctan(import scipy as sp
import matplotlib.pyplot as plt
When we import packages, they are in different namespaces, which helps to avoid problems with different packages using the same names for different functions or objects.
Style
Adopting standard coding conventions is good practice.
- https://www.python.org/dev/peps/pep-0008/
- https://docs.python.org/3/tutorial/controlflow.html#intermezzo-coding-style
The first link above is the official “Style Guide for Python Code”, usually referred to as PEP8 (PEP is an acronym for Python Enhancement Proposal). There are a couple of potentially helpful tools for helping you conform to the standard. The pep8 package that provides a commandline tool to check your code against some of the PEP8 standard conventions. Similarly, autopep8 provides a tool to automatically format your code so that it conforms to the PEP8 standards.
Documentation and getting help
Getting help pulls up the relevant docstring (see here for some guidance on writing docstrings (in particular for NumPy). Let’s briefly see how you might benefit from docstrings in practice.
In [1]: import numpy as np
In [2]: np.ndim?
Signature: np.ndim(a)
Docstring:
Return the number of dimensions of an array.
Parameters
----------
a : array_like
Input array. If it is not already an ndarray, a conversion is
attempted.
Returns
-------
number_of_dimensions : int
The number of dimensions in `a`. Scalars are zero-dimensional.
See Also
--------
ndarray.ndim : equivalent method
shape : dimensions of array
ndarray.shape : dimensions of array
Examples
--------
>>> np.ndim([[1,2,3],[4,5,6]])
2
>>> np.ndim(np.array([[1,2,3],[4,5,6]]))
2
>>> np.ndim(1)
0
File: /usr/local/linux/mambaforge-3.11/lib/python3.11/site-packages/numpy/core/fromnumeric.py
Type: function
Docstrings are an important part of Python. A docstring is a character string that occurs as the first statement in a module, function, class, or method definition. Such a docstring becomes the doc special attribute of that object. All modules should normally have docstrings, and all functions and classes exported by a module should also have docstrings.
We can see the docstring directly in the file indicated above (fromnumeric.py
) as well as the actual code of the function.
Exercises
Note that ?
and ??
only work in IPython (or a Jupyter notebook). For help in plain Python, use help(np.ndim)
.
What happens if you type
np.ndim??
(i.e., use two question marks)?What does
np.ndim()
do? How does it execute under the hood? Consider why the following uses ofndim
both work.= np.array([0, 1, 2]) a a.ndim np.ndim(a)
Now explain why only one of these works.
= [0, 1, 2] a a.ndim np.ndim(a)
Type
np.array?
at an IPython prompt. Briefly skim the docstring.nparray
allows you to construct numpy arrays.Type
np.
followed by the<Tab>
key at an IPython prompt. Choose two or three of the completions and use?
to view their docstrings. In particular, pay attention to the examples provided near the end of the docstring and see whether you can figure out how you might use this functionality.
Decoding error messages
Let’s run the following code and try to tease out where the error is. The tricky part is that the error occurs within a function where the function comes from a module (separate code file).
import days
days.print_friday_message()
We’ll run that first in a plain Python session and then in an IPython session that shows more information about what happened.
The list of function calls that led to the error is called a traceback. (Note that in R you can get similar output using traceback()
after an error or setting options(error = recover)
before an error.)
Data Structures
Python has a number of basic data structure types that are widely used. There are both similarities and differences from basic data structures in R.
- https://docs.python.org/3/library/stdtypes.html
- https://docs.python.org/3/tutorial/datastructures.html
- https://docs.python.org/3/reference/datamodel.html
Numbers
Python has integers, floats, and complex numbers with the usual operations.
2*3
2/3
= 1.1
x type(x)
* 2
x ** 2
x
type(1), type(1.1), type(1+2j))
(= 1+2j y
We can apply various functions to numbers, as expected.
# cos(0) # Why would this fail?
import math
0)
math.cos( math.cos(math.pi)
-1.0
The math
package in the standard library includes many additional numerical operations.
math.
math.acos math.degrees math.fsum math.pi
math.acosh math.e math.gamma math.pow
math.asin math.erf math.hypot math.radians
math.asinh math.erfc math.isinf math.sin
math.atan math.exp math.isnan math.sinh
math.atan2 math.expm1 math.ldexp math.sqrt
math.atanh math.fabs math.lgamma math.tan
math.ceil math.factorial math.log math.tanh
math.copysign math.floor math.log10 math.trunc
math.cos math.fmod math.log1p
math.cosh math.frexp math.modf
Exercises
Using the section on “Built-in Types” from the official “The Python Standard Library” reference, figure out how to compute:
- 3 modulo 4,
- \((\lceil \frac{3}{4} \rceil \times 4)^3\), and
- \(\sqrt{-1}\).
Is the result of
5/3 - 2/3
of the integer type? Is the mathematical value seen in Python an integer? What about7/3-4/3
?Here’s a numerical puzzle. Why does the last computation not work, when the others do? And, for those of you coming from R, which of these computations don’t work in R?
100000**10 100000.0**10 100000**100 100000.0**100
Objects and object-oriented programming
We’ll talk about this in more detail later, but it’s worth mentioning here that Python is an object-oriented language. What this means is that variables in Python are objects that are instances of a class.
Objects have methods that can be used on them and fields (member data) that are part of the object. All objects in a class have the same methods and same member data ‘slots’, but different objects will have different values in those slots.
Note that even the basic numeric structures behave like objects. We can use tab completion to see what methods are available for an object and what member data are part of an object.
= 3.0
x type(x)
x.# x.as_integer_ratio x.hex x.real
# x.conjugate x.imag
# x.fromhex x.is_integer
Which of those are attributes/metadata (‘member data’) and which are methods (‘member functions’)? If it’s a method, say foo
, you can run the method as x.foo()
. If it’s member data, you can see its value with x.foo
.
Strings
Strings are immutable sequences of (zero or more) characters.
Sequences
Unlike numbers, Python strings are container objects. Specifically, it is a sequence. Python has several sequence types including strings, tuples, and lists. Sequence types share some common functionality, which we can demonstrate with strings.
Indexing
To see how indexing works in Python let’s use the string containing the digits 0 through 9.
import string
string.digits 1]
string.digits[-1] string.digits[
Note that indexing starts at 0 (unlike R and Fortran, but like C). Also negative integers index starting from the end of the sequence. You can find the length of a sequence using the len
function.
Slicing
Slicing allows you to select a subset of a string (or any sequence) by specifying start and stop indices as well as a step, which you specify using the start:stop:step
notation inside of square braces.
As we work through these, try to guess what they will do.
1:5]
string.digits[1:5:2]
string.digits[1::2]
string.digits[5:-1]
string.digits[:1:5:-1]
string.digits[-3:-7:-1] string.digits[
Subsequence testing
'23' in string.digits
'25' not in string.digits
String methods
= "my string"
string1 string1.
string1.capitalize string1.islower string1.rpartition
string1.center string1.isspace string1.rsplit
string1.count string1.istitle string1.rstrip
string1.decode string1.isupper string1.split
string1.encode string1.join string1.splitlines
string1.endswith string1.ljust string1.startswith
string1.expandtabs string1.lower string1.strip
string1.find string1.lstrip string1.swapcase
string1.format string1.partition string1.title
string1.index string1.replace string1.translate
string1.isalnum string1.rfind string1.upper
string1.isalpha string1.rindex string1.zfill
string1.isdigit string1.rjust
= "my string"
string1
string1.upper()
string1.upper?
+ " is your string."
string1 "*"*10
3:]
string1[3:4]
string1[4::2]
string1[
3:5] = 'ts' string1[
TypeError: 'str' object does not support item assignment
> "ab"
string1 > "zz"
string1 string1.__
string1.__add__ string1.__len__
string1.__class__ string1.__lt__
string1.__contains__ string1.__mod__
string1.__delattr__ string1.__mul__
string1.__doc__ string1.__ne__
string1.__eq__ string1.__new__
string1.__format__ string1.__reduce__
string1.__ge__ string1.__reduce_ex__
string1.__getattribute__ string1.__repr__
string1.__getitem__ string1.__rmod__
string1.__getnewargs__ string1.__rmul__
string1.__getslice__ string1.__setattr__
string1.__gt__ string1.__sizeof__
string1.__hash__ string1.__str__
string1.__init__ string1.__subclasshook__
What do you think is invoked when one does string1 > 'ab'
?
Exercises
- Using this string:
x = 'The ant wants what all ants want.'
, solve the following string manipulation problems using string indexing, slicing, methods, and subsequence testing:- Convert the string to all lower case letters (don’t change x).
- Count the number of occurrences of the substring
ant
. - Create a list of the words occurring in
x
. Make sure to remove punctuation and convert all words to lowercase. - Using only string methods on
x
, create the following string:The chicken wants what all chickens want.
- Using indexing and the
+
operator, create the following string:The tna wants what all ants want.
- Do the same thing except using a string method instead.
- What can you do with the
in
andnot in
operators? For those coming from R, what R operator is this like and how is it different? - Figure out what code you could run to figure out if Python is explicitly counting the number of characters when it does
len(x)
? - Compare the time for computing the length of a (long) string in Python and R. What can you infer about what is happening behind the scenes?
Tuples
Tuples are immutable sequences of (zero or more) objects. Functions in Python often return tuples.
= 1; y = 'foo'
x
= (x, y)
xy type(xy)
= x,y
xy type(xy)
xy1]
xy[
1] = 3 # immutable! xy[
TypeError: 'tuple' object does not support item assignment
Exercises
What’s weird about this? What are the types involved?
= x,y z = x,y a,b
Create the following:
x=5
andy=6
. Now swap their values using a single line of code. (For R users, how would you do this in R?)What happens when you multiply a tuple by a number? For R users, how is this different than similar syntax in R?
What’s nice about using immutable objects in your code?
Lists
Lists are mutable sequences of (zero or more) objects.
= [1, 2, 3, 4, 5, 6]
dice
1::2]
dice[
1::2] = dice[::2]
dice[
dice
*2
dice
+dice[::-1]
dice
1 in dice
dice.extend(dice)
= dice.copy()
dice2 id(dice) # the 'id' gives the location in memory where the object is stored
id(dice2)
dice.append(dice2)# if use dice.append(dice) it embeds a pointer to itself
Exercises
Create a list of numbers. Reverse the order of the items in the list using slicing. Now reverse the order of the items using a list method. How does using the method differ from slicing?
Do you think tuples have a method to reverse the order of its items? Why or why not? Check to see if you are correct or not.
Using a list method sort your numbers. Create a list of strings and sort it.
Figure out some different ways of combining your list of numbers with your list of strings to create a single list of mixed type elements.
Now try to sort the resulting list. What happens?
What does the following tell you about copying and use of memory in Python?
= [1, 3, 5] a = a b id(a) id(b) # this should confirm what you might suspect 1] = 5 a[
Dictionaries
Dictionaries are mutable, unordered collections of key-value pairs.
= {"Jarrod Millman": ['A', 'B+', 'A-'],
students "Thomas Kluyver": ['A-', 'A-'],
"Stefan van der Walt": 'and now for something completely different'}
students
students.keys()
students.values()"Jarrod Millman"]
students["Jarrod Millman"][1] students[
'B+'
students.
students.clear( students.items( students.setdefault(
students.copy( students.keys( students.update(
students.fromkeys( students.pop( students.values(
students.get( students.popitem()
Exercises
- How can you add an item to a dictionary.
- How do you combine two dictionaries into a single dictionary?
- What are some analogs to dictionaries in R?
- How are dictionaries different from such analogous structures in R?
Sets
Sets are immutable, unordered collections of unique elements.
= {1, 2, 4, 1, 4}
x
x x.
x.add x.issubset
x.clear x.issuperset
x.copy x.pop
x.difference x.remove
x.difference_update x.symmetric_difference
x.discard x.symmetric_difference_update
x.intersection x.union
x.intersection_update x.update
x.isdisjoint
Built-in functions
Python has several built-in functions (you can find a full list using the link above). We’ve already used a few (e.g., len(), type(), print()
). Here are a few more that we you will find useful.
zip
and enumerate
create objects that you can iterate through, but you can’t view the elements directly except by looping over them. They’re useful either to convert to lists or to loop through element by element.
= zip([1, 2], ["a", "b"])
x list(x)
[(1, 'a'), (2, 'b')]
That’s sort of like cbind
in R.
enumerate(["a", "b"])
list(enumerate(["a", "b"]))
[(0, 'a'), (1, 'b')]
We’ll soon see an example usage of enumerate within the context of looping.
Some other useful built-in functions are abs()
, all()
, any()
, dict()
, dir()
, id()
, list()
, and set()
.
Control flow
If-then-else
This is as expected based on your experience with other languages. As previously noted, the indentation is important.
= 2
x
if a >=4 :
print('a is big')
if(a == 4):
print('a is 4')
else:
print('a is small')
if a >=4 :
print('a is big')
if(a == 4):
print('a is 4')
else:
print('a is not 4')
a is small
For-loops (and list comprehension)
- https://docs.python.org/3/tutorial/controlflow.html#for-statements
- https://docs.python.org/3/whatsnew/2.0.html#list-comprehensions
Here’s basic use of a for loop. Once again indentation is critical, in this case for indicating where the loop ends.
for x in [1,2,3,4]:
print(x)
for x in [1,2,3,4]:
= x*2
y print(y, end=" ")
print("\n")
for x in range(30):
= x
y print(y, end=" ")
1
2
3
4
2 4 6 8
29
Building up a list piece-by-piece is a common task, which can easily be done in a for-loop. List comprehension provides a compact syntax to handle this task.
for x in range(4)]
[x
= [-4, 3, -1, 2.5, 7]
vals for x in vals if x > 0] # list comprehension
[x
import string
= string.ascii_lowercase
letters
# concise but terse:
1] for l in enumerate(letters) if l[0] > 13]
[l[# better style:
for index, letter in enumerate(letters) if index > 13]
[letter
= zip(['clinton', 'bush', 'obama', 'trump'], ['Dem', 'Rep', 'Dem', 'Rep'])
x for pres,party in x:
print(pres, party)
# note if we try to do the loop again, we get nothing, because
# we have emptied the iterator when we loop through it
for pres,party in x:
print(pres, party)
clinton Dem
bush Rep
obama Dem
trump Rep
Exercises
- See what
[1, 2, 3] + 3
returns. Try to explain what happened and why. - Use list comprehension to perform element-wise addition of a scalar to a list of scalars.
- How would you do the same task using a for loop? The
range
function may be helpful as might theenumerate
function. - Use a for loop to iterate through the elements of a zip object and determine the type of the individual elements.
Functions
Here’s an example that illustrates both positional arguments (always first) and named arguments.
def add(x, y=1, absol=False):
if absol:
return(abs(x+y))
else:
return(x+y)
3)
add(3, 5)
add(
3, absol=True, y=-5)
add(
=-5, x=3)
add(y=-5, 3) add(y
SyntaxError: positional argument follows keyword argument (1024892447.py, line 13)
Exercises
Define a function that will take the sqrt of a number and will (if requested by the user) set the square root of a negative number to 0.
Now use the function in a list comprehension to operate on a list of numbers.
What happens if you modify a list within a function in Python; why do you think this is?
What happens if you modify a single number (scalar) within a function in Python; why do you think this is?
Recall how scoping works in R in terms of where objects are looked for if not found locally in a function (i.e., lexical scoping). Empirically assess how scoping works in Python. In other words, consider this function and how it will behave depending on where
f
is defined.def f(x): return(x+y)
Classes
We’ve already seen a bunch of object-oriented behavior. Here we’ll see how to make our own classes and objects that are instances (realizations) of a class.
class Rectangle(object):
= 2 # class variable
dim = 0
counter def __init__(self, height, width):
self.height = height # instance variable
self.width = width # instance variable
self.set_diagonal()
+= 1
Rectangle.counter def __repr__(self):
return("{0} by {1} rectangle".format(self.height, self.width))
def area(self, verbose = False):
if verbose:
print('Computing the area... ')
return(self.height*self.width)
def set_diagonal(self):
self.diagonal = pow(self.height**2 + self.width**2, 0.5)
= Rectangle(10, 5)
x
x.dim= 'foo'
x.dim # hmmm
x.dim
x.area()
Rectangle.area(x)
= Rectangle(4, 8)
y
y.counter x.counter
2
Data formats
CSV
The Python standard library provides a package for reading and writing CSV files. This is a somewhat low-level library, so in practice you will often use Pandas (or perhaps NumPy/SciPy) CSV functionality.
JSON
However the JSON package in the standard library is much more useful.
import json
= {"name": "Jarrod", "department": "Biostatistics"}
x
with open("tmp.json", "w") as outfile:
json.dump(x, outfile)
# cat tmp.json # special IPython functionality
with open("tmp.json") as infile:
= json.load(infile)
y
y
{'name': 'Jarrod', 'department': 'Biostatistics'}
Note that cat
is not a Python statement. IPython is clever enough to guess that you want it to call out to the underlying operating system.
Exercise
- One of the nice things above the JSON format is that it so well structured that it easy for a machine to parse, but simple enough that it easy for humans to read. By default
json.dump
writes everything out to disk without line breaks. For readability purposes, usejson.dump?
to figure out how to pretty-print the text as well as sort it alphabetically by key. - Read in one of the JSON files in the
project
directory and experiment with pretty-printing it when you dump it back out to a file.
Standard library
Python provides a wealth of functionality in its huge standard library. We’ve already seen several packages in the standard library (e.g., math
, csv
, json
). If you need some functionality the standard library is one of the first places to look.
Here are a couple packages that you may find useful.
os
import os
os.getcwd()# IPython operating system call
pwd
'..')
os.chdir( os.getcwd()
Exercise
- Use
os?
anddir(os)
to explore the os package.
re
The re
package provides support for regular expressions.
Math, statistics, and plotting
Numpy and scipy
Standard lists in Python are not amenable to mathematical manipulation, unlike standard vectors in R. Instead we generally work with numpy arrays. These arrays can be of various dimensions (i.e., vectors, matrices, multi-dimensional arrays).
import numpy as np
= [0, 1, 2]
z
= np.array(z)
y *3
y
y.dtype
= np.array([[1, 2], [3, 4]], dtype=np.float64)
x *x
x
x.dot(x)
x.T
np.linalg.svd(x)
= np.linalg.eig(x)
e
0] # first eigenvalue (not the largest in this case...)
e[1][:, 0] # corresponding eigenvector e[
array([-0.82456484, 0.56576746])
All of the elements of the array must be of the same type.
There are a variety of numpy functions that allow us to do standard mathematical/statistical manipulations.
Here we’ll use some of those functions in addition to some syntax for subsetting and vectorized calculations.
0, 1, 5)
np.linspace(
0)
np.random.seed(= np.random.normal(size=10)
x
= x > 0
pos
= x[pos]
y
1, 3, 4]]
x[[
= 0
x[pos]
np.cos(x)
array([1. , 1. , 1. , 1. , 1. ,
0.55928119, 1. , 0.98856735, 0.99467766, 1. ])
scipy has even more numerical routines, including working with distributions and additional linear algebra.
import scipy.stats as st
1.96, 0, 1)
st.norm.cdf(1.96, 0.5, 2)
st.norm.cdf(0.5, 2).cdf(1.96) st.norm(
0.7673049076991025
Exercise
- See what happens if you try to create a 2-d numpy array with one column of numbers and one column of characters/strings.
- Try to add a vector to a matrix; how does this compare to R?
- How do you compute the variance of each column of a matrix (akin to
apply(x, 2, var)
in R)? Try this first without ChatGPT, and then (if you have access to ChatGPT) see what ChatGPT tells you.
Pandas
Pandas provides a Python implementation of R’s dataframe capabilities. Let’s see some example code.
import pandas as pd
= pd.read_csv('gapminder.csv')
dat
dat.head()
dat.columns'year']
dat[
dat.year0:5]
dat[
'year', 'country'])
dat.sort_values([
0:5, ['year', 'country']] # R-style indexing
dat.loc[
== 1952]
dat[dat.year
= dat[['pop','lifeExp','gdpPercap']]
ndat apply(lambda col: col.max() - col.min()) ndat.
pop 1.318623e+09
lifeExp 5.900400e+01
gdpPercap 1.132820e+05
dtype: float64
Now let’s see the sort of split-apply-combine functionality that is popular in dplyr and related R packages.
= dat[dat.year == 2007].copy()
dat2007
'continent', as_index=False)[['lifeExp','gdpPercap']].mean()
dat2007.groupby(
def stdize(vals):
return((vals - vals.mean()) / vals.std())
'lifeExpZ'] = dat2007.groupby('continent')['lifeExp'].transform(stdize) dat2007[
Exercise
- Use
pd.merge()
to merge the continent means for life expectancy for 2007 back into the originaldat2007
dataFrame. - Explore the pandas documentation to see if there is a way to add the continent means as a column without an explicit merge (i.e., mimicing capability built into R’s dplyr mutate function)?
Matplotlib
Work through the material in the Matplotlib tutorial and then solve the following problems.
Exercise
- Make a scatterplot of lifeExp vs gdpPerCap for 2007; make sure you have nice axis labels and title.
- Consider whether plotting income on a logarithmic axis is a better way to display the data.
- Now make an array of plots (in one figure) where each subplot is a different year.
- Now modify the plot so that the color of the point indicates the continent and add a legend that explains this (hint: it’s probably easier to do this with plt.scatter than plt.plot). Hint: This snippet of code can help with relating continents to particular colors:
dict(zip(conts, colors))
. - Find an outlier whose lifeExp is unexpectedly high or low and add text in the plot that associates the point with the name of the country (e.g., I suspect Cuba will be unexpectedly high and an oil-producing country may be unexpectedly low).
Mini-project
Please open project.html
and work through the analysis of the Senatorial tweets with your group.
A Note on the Contents
This content was adapted by Chris Paciorek from material prepared by K. Jarrod Millman and is licensed under the CC BY-NC-SA 4.0 license.