Numerics, reproducibility, and automation

Overview

This module covers numerical topics (random number generation and floating point precision), packaging and Conda environments, reproducibility, and automation.

8. Numerical analysis and random number generation

Random number generation

Random numbers on a computer are actually (periodic) deterministic sequences that (hopefully) behave as if they were random.

Random number seed

The seed determines where in the cycle of the periodic sequence you are.

Setting the seed is critical for ensuring reproducibility when running code that uses random numbers.

Code
import numpy as np
np.random.seed(1)
np.random.normal(size = 5)
array([ 1.62434536, -0.61175641, -0.52817175, -1.07296862,  0.86540763])
Code
np.random.normal(size = 5)
array([-2.3015387 ,  1.74481176, -0.7612069 ,  0.3190391 , -0.24937038])
Code
np.random.seed(1)
np.random.normal(size = 10)
array([ 1.62434536, -0.61175641, -0.52817175, -1.07296862,  0.86540763,
       -2.3015387 ,  1.74481176, -0.7612069 ,  0.3190391 , -0.24937038])

Random number generators

If you don’t do anything (i.e., you use the legacy np.random functions, as above), numpy uses a generator called the Mersenne Twister (that’s also the default in R).

However, the RNG behind numpy’s newer default_rng interface is a better algorithm called PCG64.

Code
## Mersenne twister
np.random.seed(1)
np.random.normal(size = 5)
array([ 1.62434536, -0.61175641, -0.52817175, -1.07296862,  0.86540763])
Code
## Mersenne twister, selected specifically
rng = np.random.Generator(np.random.MT19937(seed = 1))
rng.normal(size = 5)
## These differ from the values above: the Generator interface seeds MT19937 and draws normals differently than the legacy interface.
array([ 2.04676909, -1.03653009,  0.72997546,  0.41584996, -0.1922969 ])
Code
## 'Default' PCG64
rng = np.random.Generator(np.random.PCG64(seed = 1))
rng.normal(size = 5)
array([ 0.34558419,  0.82161814,  0.33043708, -1.30315723,  0.90535587])
Code
rng = np.random.default_rng(seed = 1)
rng.normal(size = 5)
array([ 0.34558419,  0.82161814,  0.33043708, -1.30315723,  0.90535587])

Some cautions (detailed by Philip Stark and Kellie Ottoboni):

  • seeds with many zeros can be problematic.
  • Mersenne Twister not as good as PCG64.
  • Be particularly careful to use a good RNG when doing permutations and taking random samples (see the sketch below).
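
For instance, here’s a quick sketch (not from the original materials) of doing a permutation and taking a sample without replacement with the newer Generator interface; the array values are just for illustration.

Code
import numpy as np

rng = np.random.default_rng(seed = 1)       ## uses PCG64 by default

x = np.arange(10)
rng.permutation(x)                          ## a random permutation of x
rng.choice(x, size = 3, replace = False)    ## a random sample without replacement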

Floating point precision

Code
0.3 - 0.2 == 0.1   # Hmmm...
False

Double-precision accuracy

Numbers in most languages are stored by default as double precision floating point numbers.

  • 8 bytes = 64 bits = double precision
  • 4 bytes = 32 bits = single precision

(GPU libraries often use 4 bytes or even fewer.)
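
As a rough illustration of the difference (a sketch using numpy’s scalar types):

Code
import numpy as np

np.float64(1/3)   ## about 16 decimal digits: 0.3333333333333333
np.float32(1/3)   ## about 7 decimal digits: 0.33333334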

Code
def dg(x, form = '.20f'):
    print(format(x, form))

a = 0.3
b = 0.2
dg(a)
dg(b)
dg(a-b)
dg(0.1)
0.29999999999999998890
0.20000000000000001110
0.09999999999999997780
0.10000000000000000555

How many digits of accuracy do we have?

We get about 16 decimal digits of accuracy, regardless of the magnitude of the number:

Code
12345678.1234567812345678
12345678.123456782
Code
12345678123456781234.5678
1.234567812345678e+19
Code
dg(12345678123456781234.5678)
12345678123456780288.00000000000000000000

This occurs because of how the 64 bits in a double precision floating point number are allocated: 1 bit for the sign, 11 bits for the exponent (giving a large range of magnitudes, roughly 1e-308 to 1e308), and 52 bits for the mantissa (giving roughly 16 decimal digits of accuracy).
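
If you’re curious, here’s a sketch (using Python’s standard struct module, which isn’t used elsewhere in this module) that prints the sign, exponent, and mantissa bits of a double:

Code
import struct

## Reinterpret the 8 bytes of the double 0.1 as a 64-bit integer and print its bits:
## 1 sign bit | 11 exponent bits | 52 mantissa bits.
bits = format(struct.unpack('>Q', struct.pack('>d', 0.1))[0], '064b')
print(bits[0], bits[1:12], bits[12:], sep = ' | ')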

Comparisons

Let’s see what kinds of numbers we can safely compare for exact equality with ==.

Code
x = .5 - .2
x == 0.3
x = 1.5 - 1.2
x == 0.3

x = .5 - .5
x == 0
x = .5 - .2 - .3
x == 0
x = (.5 - .2) - (.51 - .21)
x == 0

## Rather than testing for exact equality, compare within a tolerance.
np.isclose(x, 0)

Calculation errors

Let’s consider the accuracy of subtraction.

Code
1.1234 - 1.0
0.12339999999999995
Code
123456781234.1234 - 123456781234.0
0.1233978271484375

This loss of accuracy when subtracting nearly equal numbers is called catastrophic cancellation.

Do you think calling it “catastrophic” is too extreme? If so, consider this.

Code
81.0 - 80.0
1.0
Code
12345678123456781.0 - 12345678123456780.0
0.0
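
Relatedly, numpy can report the gap between adjacent representable doubles at a given magnitude, which shows why the subtraction above loses all accuracy (a quick sketch):

Code
import numpy as np

np.spacing(1.0)                   ## about 2.2e-16
np.spacing(12345678123456780.0)   ## 2.0 -- adjacent doubles at this magnitude differ by 2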

What’s the derivative of \(\sin(x)\)? Let’s see what kind of accuracy we can get.

Code
def deriv(f, x, eps=1e-8):
    ## Forward difference approximation to f'(x).
    return (f(x+eps) - f(x)) / eps

np.cos(0.2)

deriv(np.sin, 0.2)

deriv(np.sin, 0.2, 1e-12)
deriv(np.sin, 0.2, 1e-14)
deriv(np.sin, 0.2, 1e-15)
deriv(np.sin, 0.2, 1e-16)
deriv(np.sin, 0.2, 1e-17)

Linear algebra errors

Errors in individual calculations, such as catastrophic cancellation, cascaded through the derivative estimates above. The same thing happens in linear algebra operations.

Here’s an example where mathematically all the eigenvalues are real-valued and positive, but not on the computer.

Code
import scipy as sp

xs = np.arange(100)
dists = np.abs(xs[:, np.newaxis] - xs)
# This is a p.d. matrix (mathematically).
corr_matrix = np.exp(-(dists/10)**2)
# But not numerically...
sp.linalg.eigvals(corr_matrix)[80:99]
array([ 1.43024256e-16+1.28433951e-16j,  1.43024256e-16-1.28433951e-16j,
        1.49210347e-16+4.87070558e-17j,  1.49210347e-16-4.87070558e-17j,
       -1.54661687e-16+1.25601177e-16j, -1.54661687e-16-1.25601177e-16j,
       -2.07808543e-16+3.73312254e-17j, -2.07808543e-16-3.73312254e-17j,
        9.83783628e-17+7.20174529e-17j,  9.83783628e-17-7.20174529e-17j,
        2.37821218e-18+1.00101181e-16j,  2.37821218e-18-1.00101181e-16j,
        5.29254936e-17+0.00000000e+00j, -1.62444574e-16+0.00000000e+00j,
       -2.74621910e-17+7.55223038e-17j, -2.74621910e-17-7.55223038e-17j,
       -9.08016006e-17+3.79199791e-17j, -9.08016006e-17-3.79199791e-17j,
       -3.25243092e-17+0.00000000e+00j])
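
A common remedy (a sketch, not part of the original example): use a symmetric eigensolver, which at least returns real eigenvalues, and add a small “nugget” to the diagonal so the matrix is numerically positive definite. The nugget value of 1e-10 here is just an illustrative choice.

Code
import numpy as np
import scipy as sp

xs = np.arange(100)
dists = np.abs(xs[:, np.newaxis] - xs)
corr_matrix = np.exp(-(dists/10)**2)

## A symmetric eigensolver at least returns real eigenvalues (some still slightly negative).
sp.linalg.eigvalsh(corr_matrix)

## Adding a small 'nugget' to the diagonal makes the matrix numerically positive
## definite, e.g., so that a Cholesky decomposition succeeds.
corr_fixed = corr_matrix + 1e-10 * np.eye(100)
np.linalg.cholesky(corr_fixed)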

Overflow and underflow

Having a finite number of bits to represent each number means there are minimum and maximum numbers that can be expressed.
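
You can query those limits directly; here’s a quick sketch using numpy’s finfo:

Code
import numpy as np

np.finfo(np.float64).max    ## largest double, about 1.8e308
np.finfo(np.float64).tiny   ## smallest positive (normal) double, about 2.2e-308
np.finfo(np.float64).eps    ## machine epsilon, about 2.2e-16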

Code
1.38e5000
inf
Code
1.38e-400
0.0
How much do we need to worry about overflow and underflow?

Q: Roughly how many observations would it take before we underflow in calculating a likelihood?

  • 10
  • 100
  • 1000
  • 100000
  • 100000000
Code
import numpy as np
import scipy as sp

n = 10

x = np.random.normal(size=n)
np.prod(sp.stats.norm.pdf(x))

loglik = np.sum(sp.stats.norm.logpdf(x))
loglik
np.exp(loglik)
2.0739518337814167e-08
Work with probabilities and densities on the log scale. Almost always.
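
As a sketch of why (the choice of n = 1000 here is just illustrative), the direct product of densities underflows long before the log-likelihood runs into any trouble:

Code
import numpy as np
import scipy as sp

rng = np.random.default_rng(seed = 1)
x = rng.normal(size = 1000)

np.prod(sp.stats.norm.pdf(x))     ## underflows to 0.0
np.sum(sp.stats.norm.logpdf(x))   ## roughly -1400, which is perfectly representable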

Exercise: forking and pull requests for numerical issues with Newton’s method

Look at how your partner’s code handles the stopping criterion or (for the 1-d case) the finite difference estimates of the gradient and Hessian.

Fork your partner’s repository. In the local version of that repository on your DataHub, make a change to try to improve how the stopping criterion or finite difference estimate is handled. Then make a pull request back into your partner’s repository.

Here are the detailed steps of how to do all that:

  1. Go to your partner’s GitHub page and fork the repository (https://github.com/<username>/newton-practice/fork). You can name it newton-practice-<partner_name> or something else.
  2. On DataHub, clone the forked repository: git clone https://github.com/<your_username>/newton-practice-<partner_name>.
  3. On DataHub, enter the directory of the cloned repository: cd newton-practice-<partner_name>.
  4. Make a branch, make changes, add, and commit.
  5. Give DataHub access to the new repository via Step 3 of the steps for accessing a GitHub repository from DataHub.
  6. Push the branch to your GitHub repository: git push -u origin <branch_name>.
  7. Go to https://github.com/<your_username>/newton-practice-<partner_name> and open a pull request (“Compare & pull request”). The pull request will automatically be opened against your partner’s repository.

Then review the pull request your partner makes in your own repository. Iterate with your partner until you are happy to merge the pull request into your repository. Remember to then run git pull to get the changes in the local copy of the repository.

9. Packaging and Conda environments

Packages

What’s a package? In general it is software that is provided as a bundle that can be downloaded and made available on your computer. Some packages (e.g., R, Python themselves) are stand-alone packages. Others (such as R packages and Python packages) are add-on functionality that works with and extends the functionality of the stand-alone software.

What do we need to have a Python package?

A basic Python package

A Python package is a directory containing one or more modules, along with a file named __init__.py that is run when the package is imported and serves to initialize the package.

Let’s create a basic package.

mkdir mypkg

cat << EOF > mypkg/__init__.py
## Make objects from mymod.py available as mypkg.foo
from mypkg.mymod import *

print("Welcome to my package.")
EOF

cat << EOF > mypkg/mymod.py
x = 7

def myfun(val):
    print("The arg is: ", str(val), ".", sep = '')
EOF

Note that if there were other modules, we could have imported from those as well.

Now we can use the objects from the module without having to know that it was in a particular module (because of how __init__.py was set up).

Code
import mypkg
mypkg.x
mypkg.myfun(7)

Note that one can set __all__ in an __init__.py to define what is imported by from mypkg import *, which makes clear what is publicly available and hides what is considered internal.
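
For example, here’s a sketch of what that might look like for mypkg (the choice of exported names is just illustrative):

Code
## mypkg/__init__.py
from mypkg.mymod import *

## Only these names are imported by `from mypkg import *`;
## anything not listed is treated as internal.
__all__ = ['myfun']

print("Welcome to my package.")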

A minimal “real” package

Fernando has a toy example package.

Let’s clone the repository and see if we can install it.

git clone https://github.com/fperez/mytoy
cd mytoy
pip install --user .
cd
ls .local/lib/python3.11/site-packages
Code
import mytoy

mytoy.toy(7)

Let’s look in the top directory of the repository. There we see various files related to building and installing the package, namely pyproject.toml, setup.py, setup.cfg, etc.

In fact, one can install the package with only setup.py or only pyproject.toml, but the other files listed here are recommended:

  • pyproject.toml: this is a configuration file used by packaging tools. In the mytoy example it specifies that setuptools should be used to build and install the package.
  • setup.py: this is run when the package is built and installed using setuptools. In the example, it simply runs setuptools.setup(). With recent versions of setuptools, you don’t actually need this as long as you have the pyproject.toml file.
  • setup.cfg: provides metadata about the package when using setuptools.
  • environment.yml: provides information about the full environment in which your package should be used (including examples, documentation, etc.). For projects using setuptools, a minimal list of dependencies needed for installation and use of the package can instead be included in the install_requires option of setup.cfg.
  • LICENSE: specifies the license for your package giving the terms under which others can use it.

The postBuild file is a completely optional file only needed if you want to use the package with a MyBinder environment.

At the numpy GitHub repository, by looking in pyproject.toml, you can see that numpy is built and installed using a system called Meson, while at the Jupyter GitHub repository you can see that the jupyter package is built and installed using setuptools.

Building a package usually refers to compiling source code, but for a Python package that contains only Python code, nothing needs to be compiled. Installing a package means putting the built package into a location on your computer where packages are installed.

It would take more work to make the package available in a repository such as PyPI or via Conda.

Exercise: Make your own package

Make a package out of your Newton method code, following the structure of mytoy. Include your tests as part of the package.

See if you can install your package, run the tests, and use the package, following the README.md of mytoy.

Installing packages

If a package is on PyPI or available through Conda but not on your system, you can install it easily (usually). You don’t need root permission on a machine to install a package, though you may need to use pip install --user or set up a new Conda environment.

Packages often depend on other packages. If one package depends on another, pip or conda will generally install the dependency automatically.

One advantage of Conda is that it can also install non-Python packages on which a Python package depends, whereas with pip you sometimes need to install a system package to satisfy a dependency.

Faster installation with Mamba

It’s not uncommon to run into a case where conda has trouble installing a package because of version inconsistencies amongst the dependencies. mamba is a drop-in replacement for conda and often does a better job of this “dependency resolution”. We use mamba by default on the SCF.

With newer versions of Conda, you can use the libmamba dependency “resolver” by running conda config --set solver libmamba, which adds solver: libmamba to your .condarc file.

Reproducibility and package management

For reproducibility, it’s important to know the versions of the packages you use (and the version of Python). pip and conda make it easy to do this. You can create a requirements file that captures the packages you are currently using (and, critically, their versions) and then install exactly that set of packages (and versions) based on that requirements file.

## Reproducing using pip.
pip freeze > requirements.txt
pip install -r requirements.txt

## Reproducing a Conda environment.
conda env export > environment.yml
conda env create -f environment.yml

Conda is a general package manager. You can use it to manage Python packages but lots of other software as well, including R and Julia.

Fully isolating your Conda environment

Conda environments provide an additional layer of modularity/reproducibility, allowing you to set up a fully reproducible environment for your computation. Here (by explicitly giving python=3.11) the Python 3.11 executable and all packages you install in the environment are fully independent of whatever Python executables are installed on the system. (That said, there are some caveats that can make it a bit of a headache to ensure full isolation.)

type python
conda create -n my_iso_env python=3.11
source activate my_iso_env
type python
conda install numpy
Activating an environment

If you use conda activate rather than source activate, Conda will prompt you to run conda init, which will make changes to your ~/.bashrc that, for one, activate the Conda base environment automatically when a shell is started. This may be fine, but it’s helpful to be aware.

Package locations

Packages in Python (and in R, Julia, etc.) may be installed in various places on the filesystem, and it is sometimes helpful (e.g., if you end up with multiple versions of a package installed on your system) to be able to figure out where on the filesystem a package is being loaded from.

  • pkgname.__file__ will show where the imported package is installed (see the example below).
  • pkgname.__version__ will show the version of the package (as will pip list or conda list, for all packages).
  • sys.path shows where Python looks for packages on your system.
    • packages installed via pip install --user will generally be in ~/.local/lib/python3.X/site-packages.
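
For example, a quick sketch using numpy (the exact paths will differ on your system):

Code
import sys
import numpy as np

np.__file__      ## where the imported numpy package lives on disk
np.__version__   ## version of the imported package
sys.path         ## directories Python searches when importing packages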

Source vs. binary packages

The difference between a source package and a binary package is that

  • a source package has the raw Python (and C/C++ and Fortran, in some cases) code as text files
  • a binary package has all the non-Python code in a binary/non-text format, with the C/C++ and Fortran code already having been compiled.

If you install a package from source, C/C++/Fortran code will be compiled on your system (if the package has such code). That should mean the compiled code will work on your system, but it requires you to have a compiler available and things properly configured. A binary package doesn’t need to be compiled on your system, but in some cases (e.g., if you got the wrong binary version) the code may not run because it was compiled in a way that is not compatible with your system.

Python wheels are a binary package format for Python packages. Wheels for some packages will vary by platform (i.e., operating system) so that the package will install correctly on the system where it is being installed.

10. GitHub Actions and Continuous Integration

Having reproducible environments (e.g., Conda environments or Docker containers) provides the ability to run (and automate) various actions on remote/cloud computational resources and know that they (should) work.

Docker containers

Docker containers share some similarities with Conda environments, except that you provide a full specification of the Linux operating system you want to use and the software you want installed.

You need to provide the recipe for the environment or container. The automation system can then set up the environment as a virtual environment/container on the remote machine and run the action/code you request. This could simply be giving you a terminal or starting a Jupyter notebook in the environment and giving you access via your browser.

In this section, we’ll show some examples of automation with reproducible environments.

There are many more examples of this kind of automation out there. In fact, DataHub is based on Docker containers (and a container management service called Kubernetes). Let’s see what operating system is being run in the container that your DataHub server is using.

$ cat /etc/issue
Ubuntu 22.04.4 LTS \n \l

This is all quite powerful, and the use of reproducible environments generally helps with debugging because you can also set up the environment on your own machine. That said, running code remotely or in the cloud can also be hard to debug at times.

Example 1: Binder

One example is using Binder to open a Jupyter notebook in an executable environment.

Based on a configuration (various options are possible for how to specify the configuration), Binder will create a Docker container with the needed software installed.

The most common configuration format is to provide a Conda environment file. Binder will then create a Conda environment inside the Docker (Linux) container. The Conda environment will contain the packages specified in the environment file.

Let’s set up a Binder environment for “Geospatial analytics in Python with Geopandas” that gives access to the notebooks and materials provided in the repository, in a computational environment designed to run the notebooks reliably and reproducibly.

In some cases an image for the container will already be available, and all that needs to happen is to start a virtual machine using that image as the computational environment. In other cases (e.g., with a configuration you’ve just created), the Docker container and Conda environment need to be set up.

If we’re familiar with what happens when Docker containers and Conda environments are set up, we can see in the Binder logs that a Docker container is set up (using Ubuntu Linux) and one of the steps of doing that is to create a Conda environment within the container.

Let’s see the steps that are run if we use the mytoy repository to start a Binder environment at ovh.mybinder.org, specifying fperez/mytoy as the repository. (Note that in a number of cases the startup has failed, probably because mybinder has very limited resources for providing cloud compute. We are using ovh.mybinder.org rather than mybinder.org in hopes that it is more likely to succeed.)

Example 2: GitHub Actions for automated testing

A second example is using GitHub Actions (GHA) to automate activities such as testing. To set up a GitHub Actions workflow, one

  • specifies when the workflow will run (e.g., when a push or pull request is made, or only manually)
  • provides instructions for how to set up the environment for the workflow
  • provides the operations that the workflow should run.

A good example is running tests whenever you make a commit to a repository containing software.

With GHA, you specify the operating system and then the steps to run, all in .github/workflows/some_action.yml. The initial steps customize the environment, and then additional step(s) run shell or other code to carry out your workflow. You can use pre-specified operations (called actions) to do common things (such as checking out a GitHub repository or installing commonly used software).

When triggered, GitHub will run the steps in a virtual machine, which is called the runner.

Joint demo/exercise

We’ll walk through the steps of setting up automated testing for the mytoy example package. Why do this? It’s common to have testing pass locally on your own machine, but fail for various reasons when done elsewhere.

The workflow is specified using a YAML file placed in the .github/workflows directory of the repository.

Here’s an example YAML file for testing the mytoy package:

on:
  push:
    branches:
    - main

jobs:
  CI:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4

    # Install dependencies
    - name: Set up Python 
      uses: actions/setup-python@v5
      with:
        python-version: 3.12

    - name: Install mytoy
      run: |
        # There is a small bug here.
        cd mytoy
        pip install --user .

    - name: Run tests
      run: |
        # Another small bug here.
        cd mytoy
        pytest

We’ll first run this with the current tests for mytoy, which should pass when run via GitHub Actions. We can take a look at the output.

Next, let’s introduce a failing test to see what the output looks like and get a sense for how to debug it.

Example 3: GitHub Actions for publishing webpages

We can use GitHub Actions with GitHub pages to easily publish websites, where we make changes to the repository and the workflow causes the website to be updated.

Here’s the GitHub Actions workflow (a YAML file) that publishes the website for the spring 2023 version of Statistics 159/259. Any commits made in the main branch cause the website to be updated.

Configuring GitHub Pages

To set up GitHub Pages for a repository, go to https://github.com/github_org/repository/settings/pages and select Source -> Deploy from a branch and then choose gh-pages as the branch (first creating the gh-pages branch if necessary). The website should appear at github_org.github.io/repository.

Quarto-based websites

In fact the website for this workshop was done using Quarto, which uses GitHub Pages and GitHub Actions behind the scenes.

Chris runs quarto publish gh-pages to update the site. Quarto will (on the local machine) render the HTML (convert the Markdown to HTML) and then behind the scenes Quarto uses GitHub Actions and GitHub Pages (via the gh-pages branch) to publish the website.

Feedback survey

This is the first time we’ve done this, so it’s particularly valuable now to get feedback on it. Please fill out this brief survey.