R bootcamp, Module 9: Workflows, managing projects and good practices

August 2022, UC Berkeley

Chris Paciorek

Tips for avoiding bugs

Common syntax errors and bugs

library(codetools)
f <- function() {y <- 3; print(x + y)}  ## 'x' is never defined locally
findGlobals(f)
## [1] "{"     "+"     "<-"    "print" "x"
mat <- matrix(1:4, 2, 2)[1, ]  ## subsetting to a single row drops to a vector by default
dim(mat); print(mat)
## NULL
## [1] 1 3
colSums(mat)
## Error in base::colSums(x, na.rm = na.rm, dims = dims, ...): 'x' must be an array of at least two dimensions
mat <- matrix(1:4, 2, 2)[1, , drop = FALSE]  ## drop = FALSE preserves the matrix structure
colSums(mat)
## [1] 1 3

Debugging

As an interpreted language, R essentially gives you a debugger for free: you can often just run code line by line to see where things go wrong.

But there is an official debugger and other tools that greatly help in figuring out problems, particularly for more complicated situations.

Let’s briefly see these in action. I’ll demo this in a very basic way, but hopefully this will give you an idea of the power of these tools.

buggyFun <- function(myDF) {
   print(names(myDF))
   myDF$id <- seq_len(nrow(myDF))
   sums <- rowSums(myDF)
   return(sums)
}

buggyFun(gapminder)
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
## Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...): 'x' must be numeric
if(FALSE) {
  traceback()
  debug(buggyFun)
  buggyFun(gapminder)

  undebug(buggyFun)
  options(error = recover)
  buggyFun(gapminder)
}
  1. We can use debug() to step through a function line by line

  2. After an error occurs, we can use traceback() to look at the call stack

  3. More helpfully, if we set options(error = recover) before running code, we can go into the function in which the error occurred

  4. We can insert browser() inside a function and R will stop there and allow us to proceed with debugging statements

  5. You can temporarily insert code into a function (including built-in functions) with trace(fxnName, edit = TRUE)
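As a minimal sketch of item 4 (the function and its name are made up for illustration), you can drop a browser() call into a function; when run interactively, R pauses there and lets you inspect the local variables:

```r
## Hypothetical function: sum the integers 1..n
sumTo <- function(n) {
  total <- 0
  for (i in seq_len(n)) {
    # browser()   ## uncomment to pause here and inspect 'i' and 'total' interactively
    total <- total + i
  }
  total
}

sumTo(10)
```

At the browser prompt you can print variables, step with `n`, continue with `c`, or quit with `Q`.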

Testing

Testing should be performed on multiple levels and begun as early as possible in the development process. For programs that accept input either from a user or file, it is important that the code validates the input is what it expects to receive. Tests that ensure individual code elements (e.g., functions, classes, and class methods) behave correctly are called unit tests. Writing unit tests early in the process of implementing new functionality helps you think about what you want a piece of code to do, rather than just how it does it. This practice improves code quality by focusing your attention on use cases rather than getting lost in implementation details.

The testthat package is very helpful for setting up tests. Also, RUnit is a testing framework for R that helps automate test setup, creation, execution, and reporting. For more information, see Bioconductor’s unit testing guidelines.
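As a small sketch of what a testthat unit test looks like (the function being tested here, c_to_f, is made up for illustration):

```r
library(testthat)

## A hypothetical function to test: convert Celsius to Fahrenheit
c_to_f <- function(celsius) celsius * 9/5 + 32

test_that("c_to_f handles basic cases", {
  expect_equal(c_to_f(0), 32)
  expect_equal(c_to_f(100), 212)
  expect_equal(c_to_f(c(0, 100)), c(32, 212))  ## vectorized input
})
```

In a package, tests like this typically live in tests/testthat/ and run automatically via devtools::test().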

Timing your code

First, a cautionary note…

premature optimization is the root of all evil

— Donald Knuth, 1974

There are a few tools in R for timing your code.

system.time(mean(rnorm(1e7)))
##    user  system elapsed 
##   0.686   0.024   0.711
library(rbenchmark)
x <- rnorm(1e7)
benchmark(ifelse(x < 0, x, 0),
          x[x < 0] <- 0,
          replications = 5,
          columns = c('replications', 'elapsed'))
##   replications elapsed
## 1            5   1.483
## 2            5   0.175

Profiling your code

For more advanced assessment of bottlenecks in your code, consider Rprof(). The raw output from Rprof can be hard to decipher, so you may want to use the functions in the proftools package, which use Rprof under the hood.

Here’s a function that does the linear algebra to implement a linear regression, assuming X is the matrix of predictors, including a column for the intercept:


\hat{\beta} = (X^{\top}X )^{-1}X^{\top}y

lr_slow <- function(y, X) {
  XtX <- t(X) %*% X
  Xty <- t(X) %*% y
  inv <- solve(XtX)   ## explicit matrix inverse is slow and generally a bad idea numerically
  return(inv %*% Xty)
}                   

lr_medium <- function(y, X) {
  XtX <- crossprod(X)
  Xty <- crossprod(X, y)
  inv <- solve(XtX)   ## explicit matrix inverse is slow and generally a bad idea numerically
  return(inv %*% Xty)
}                   

lr_fast <- function(y, X) {
  XtX <- crossprod(X)
  Xty <- crossprod(X, y)
  U <- chol(XtX)
  tmp <- backsolve(U, Xty, transpose = TRUE)
  return(backsolve(U, tmp))
}                   

Now let’s try these three functions with profiling turned on.

## generate random observations and random matrix of predictors
y <- rnorm(5000)
x <- matrix(rnorm(5000*1000), nrow = 5000)

library(proftools)

pd1 <- profileExpr(lr_slow(y, x))
hotPaths(pd1)
##  path                  total.pct self.pct
##  lazyLoadDBfetch       20.83      0.00   
##  . <Anonymous>         20.83      0.00   
##  . . lazyLoadDBfetch   20.83     20.83   
##  lr_slow               79.17      0.00   
##  . %*% (<text>:2)      17.71     17.71   
##  . %*% (<text>:3)       3.12      3.12   
##  . solve (<text>:4)    37.50      0.00   
##  . . standardGeneric   37.50      0.00   
##  . . . solve           37.50      0.00   
##  . . . . solve.default 37.50     37.50   
##  . t (<text>:2)         2.08      0.00   
##  . . standardGeneric    2.08      0.00   
##  . . . t                2.08      0.00   
##  . . . . t.default      2.08      2.08   
##  . t (<text>:3)        18.75      0.00   
##  . . standardGeneric   18.75      0.00   
##  . . . t               18.75      0.00   
##  . . . . t.default     18.75     18.75
hotPaths(pd1, value = 'time')
##  path                  total.time self.time
##  lazyLoadDBfetch       0.40       0.00     
##  . <Anonymous>         0.40       0.00     
##  . . lazyLoadDBfetch   0.40       0.40     
##  lr_slow               1.52       0.00     
##  . %*% (<text>:2)      0.34       0.34     
##  . %*% (<text>:3)      0.06       0.06     
##  . solve (<text>:4)    0.72       0.00     
##  . . standardGeneric   0.72       0.00     
##  . . . solve           0.72       0.00     
##  . . . . solve.default 0.72       0.72     
##  . t (<text>:2)        0.04       0.00     
##  . . standardGeneric   0.04       0.00     
##  . . . t               0.04       0.00     
##  . . . . t.default     0.04       0.04     
##  . t (<text>:3)        0.36       0.00     
##  . . standardGeneric   0.36       0.00     
##  . . . t               0.36       0.00     
##  . . . . t.default     0.36       0.36
pd2 <- profileExpr(lr_medium(y, x))
hotPaths(pd2)
##  path                    total.pct self.pct
##  lazyLoadDBfetch          2.56      2.56   
##  lr_medium               97.44      0.00   
##  . %*% (<text>:12)        2.56      2.56   
##  . crossprod (<text>:10)  5.13      0.00   
##  . . crossprod            5.13      0.00   
##  . . . base::crossprod    5.13      5.13   
##  . crossprod (<text>:9)  51.28      0.00   
##  . . crossprod           51.28      0.00   
##  . . . base::crossprod   51.28     51.28   
##  . solve (<text>:11)     38.46      0.00   
##  . . standardGeneric     38.46      0.00   
##  . . . solve             38.46      0.00   
##  . . . . solve.default   38.46     35.90   
##  . . . . . diag           2.56      2.56
hotPaths(pd2, value = 'time')
##  path                    total.time self.time
##  lazyLoadDBfetch         0.02       0.02     
##  lr_medium               0.76       0.00     
##  . %*% (<text>:12)       0.02       0.02     
##  . crossprod (<text>:10) 0.04       0.00     
##  . . crossprod           0.04       0.00     
##  . . . base::crossprod   0.04       0.04     
##  . crossprod (<text>:9)  0.40       0.00     
##  . . crossprod           0.40       0.00     
##  . . . base::crossprod   0.40       0.40     
##  . solve (<text>:11)     0.30       0.00     
##  . . standardGeneric     0.30       0.00     
##  . . . solve             0.30       0.00     
##  . . . . solve.default   0.30       0.28     
##  . . . . . diag          0.02       0.02
pd3 <- profileExpr(lr_fast(y, x))
hotPaths(pd3)
##  path                    total.pct self.pct
##  lr_fast                 100.00     0.00   
##  . chol (<text>:18)       15.79     0.00   
##  . . standardGeneric      15.79     0.00   
##  . . . chol               15.79     0.00   
##  . . . . chol.default     15.79    15.79   
##  . crossprod (<text>:16)  73.68     0.00   
##  . . crossprod            73.68     0.00   
##  . . . base::crossprod    73.68    73.68   
##  . crossprod (<text>:17)  10.53     0.00   
##  . . crossprod            10.53     0.00   
##  . . . base::crossprod    10.53    10.53
hotPaths(pd3, value = 'time')
##  path                    total.time self.time
##  lr_fast                 0.38       0.00     
##  . chol (<text>:18)      0.06       0.00     
##  . . standardGeneric     0.06       0.00     
##  . . . chol              0.06       0.00     
##  . . . . chol.default    0.06       0.06     
##  . crossprod (<text>:16) 0.28       0.00     
##  . . crossprod           0.28       0.00     
##  . . . base::crossprod   0.28       0.28     
##  . crossprod (<text>:17) 0.04       0.00     
##  . . crossprod           0.04       0.00     
##  . . . base::crossprod   0.04       0.04

You might also check out profvis as an alternative way to display the profiling information generated by Rprof.

Memory use

You should know how much memory (RAM) the computer you are using has, and keep in mind how big your objects are and how much memory your code might use. Unlike in, e.g., SAS or a database, all objects in R are stored in RAM.

If the jobs on a machine collectively approach its physical RAM, the machine may (depending on how it is set up) start to use the hard disk as ‘virtual memory’. This is called paging or swapping, and once it happens you’re often toast (i.e., your code may take essentially forever to finish). If the machine is not set up to page, your job will instead die with an out-of-memory (OOM) error.

You can assess memory use with top or ps in Linux/Mac or the Task Manager in Windows.

Often it’s a good idea to roughly estimate how much memory an object will take up even before creating it in R. You can do this with some simple arithmetic. Every real number takes 8 bytes (integers and logicals take less; character strings are complicated), so an object with, say, 1 million rows and 10 columns, all numbers, would take roughly 8 * 1000000 * 10 bytes, or 80 MB.
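That arithmetic is easy to do directly in R before committing to creating the object:

```r
## Quick back-of-the-envelope estimate: 1 million rows x 10 numeric columns,
## 8 bytes per number
n <- 1e6
p <- 10
n * p * 8 / 1e6    ## size in MB
```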

x <- rnorm(1e7)
object.size(x)
## 80000048 bytes
1e7*8/1e6  # direct calculation of Mb
## [1] 80
print(object.size(x), units = 'auto')
## 76.3 Mb
x <- rnorm(1e8)
gc()
##             used  (Mb) gc trigger   (Mb)  max used  (Mb)
## Ncells   2083510 111.3    4077525  217.8   3162653 169.0
## Vcells 113614165 866.9  159622940 1217.9 123668420 943.6
rm(x)
gc()
##            used  (Mb) gc trigger  (Mb)  max used  (Mb)
## Ncells  2083522 111.3    4077525 217.8   3162653 169.0
## Vcells 13614197 103.9  127698352 974.3 123668420 943.6

Memory use: quick quiz

POLL 9A: Suppose you have a CSV file with one million rows and one thousand columns, filled with numbers. How much memory will this take up if you read the data into R?

(respond at https://pollev.com/chrispaciorek428)

  1. 800 MB
  2. 8 GB
  3. 1 GB
  4. 80 GB
  5. something else

Scripting

If you use a good editor that understands the syntax of the language you are using, it’s easier to write and understand your code. Some editors to consider:

To run all the code in an entire file, do source('myCodeFile.R').

RStudio projects help to organize your work. In many cases you may also want to create an R package to share your code, even if you don’t plan to make it publicly available. The devtools and usethis packages can help you create and manage projects and packages.

Batch jobs / background execution

To run code as a background job in a Linux/Mac context:

R CMD BATCH --no-save myCodeFile.R myOutput.Rout &

You then don’t need to keep RStudio, R, or even a terminal window open; the job runs by itself. If you are logged into a remote machine, you can log out and come back later.

IMPORTANT: make sure to write any needed output to a file (e.g. .Rda files, CSV files, text output from print() or cat()).
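For example, a script run via R CMD BATCH might end by saving its results to files (the file and object names here are made up for illustration):

```r
## Sketch of the end of a batch script: persist results so they
## survive after the background job finishes
results <- colMeans(matrix(rnorm(1e6), ncol = 10))

saveRDS(results, file = "results.rds")   ## reload later with readRDS("results.rds")
write.csv(data.frame(mean = results), file = "results.csv", row.names = FALSE)
```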

Good coding practices: functions

Use functions whenever possible. In particular, try to write functions rather than carrying out your work using blocks of code. Why? Functions allow us to reuse blocks of code easily for later use and for recreating an analysis (reproducible research). A function is also more transparent than sourcing a file of code because the inputs and outputs are specified formally, so you don’t have to read through the code to figure out what it does.
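As a small sketch of this idea (the function and data names are made up), a block of standardization code that would otherwise be copy-pasted for each variable becomes a single reusable function:

```r
## Rather than repeating (x - mean(x)) / sd(x) for every variable...
standardize <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

dat <- data.frame(a = rnorm(100, mean = 5, sd = 2), b = runif(100))
dat_std <- lapply(dat, standardize)   ## reuse the same code on every column
```

The function's name and arguments document what the code does, and the same logic can be applied to any new dataset without editing it.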

Good use of functions includes:

Functions should:

Good coding practices: syntax

For an example that demonstrates some of these ideas, see goodCode.R. And for a counterexample, see badCode.R.

c <- 7  ## masks the built-in function c() -- avoid reusing base function names
c(3,5)
## [1] 3 5
c
## [1] 7
rm(c)
c
## function (...)  .Primitive("c")

Reproducible research

An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.

— Jonathan Buckheit and David Donoho, WaveLab and Reproducible Research (1995)

Here are some useful articles talking about reproducibility.

Some ideas for improving reproducibility

Breakout

Please fill out the feedback survey. See the Ed Discussion board for the link.