R bootcamp, Module 9: Workflows, managing projects and good practices

August 2022, UC Berkeley

Chris Paciorek

Tips for avoiding bugs

Common syntax errors and bugs

library(codetools)
f <- function() {y <- 3; print(x + y)}  ## 'x' is never defined locally
findGlobals(f)
## [1] "{"     "+"     "<-"    "print" "x"
mat <- matrix(1:4, 2, 2)[1, ]  ## subsetting to a single row drops to a vector by default
dim(mat); print(mat)
## NULL
## [1] 1 3
colSums(mat)
## Error in base::colSums(x, na.rm = na.rm, dims = dims, ...): 'x' must be an array of at least two dimensions
mat <- matrix(1:4, 2, 2)[1, , drop = FALSE]  ## drop = FALSE preserves the matrix structure
colSums(mat)
## [1] 1 3

Debugging

As an interpreted language, R essentially gives you a debugger for free: you can often just run code line by line to see where things go wrong.

But there is an official debugger and other tools that greatly help in figuring out problems, particularly for more complicated situations.

Let’s briefly see these in action. I’ll demo this in a very basic way, but hopefully this will give you an idea of the power of these tools.

buggyFun <- function(myDF) {
   print(names(myDF))
   myDF$id <- seq_len(nrow(myDF))
   sums <- rowSums(myDF)
   return(sums)
}

buggyFun(gapminder)
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
## Error in base::rowSums(x, na.rm = na.rm, dims = dims, ...): 'x' must be numeric
if(FALSE) {
  traceback()
  debug(buggyFun)
  buggyFun(gapminder)

  undebug(buggyFun)
  options(error = recover)
  buggyFun(gapminder)
}
  1. We can use debug() to step through a function line by line

  2. After an error occurs, we can use traceback() to look at the call stack

  3. More helpfully, if we set options(error = recover) before running code, we can go into the function in which the error occurred

  4. We can insert browser() inside a function and R will stop there and allow us to proceed with debugging statements

  5. You can temporarily insert code into a function (including built-in functions) with trace(fxnName, edit = TRUE)
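As a minimal sketch of item 4 (the function and its name are made up for illustration), you can drop a browser() call into a function; when run interactively, R pauses there and lets you inspect the local variables:

```r
## Hypothetical function: sum the integers 1..n
sumTo <- function(n) {
  total <- 0
  for (i in seq_len(n)) {
    # browser()   ## uncomment to pause here and inspect 'i' and 'total' interactively
    total <- total + i
  }
  total
}

sumTo(10)
```

At the browser prompt you can print variables, step with `n`, continue with `c`, or quit with `Q`.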

Testing

Testing should be performed on multiple levels and begun as early as possible in the development process. For programs that accept input either from a user or file, it is important that the code validates the input is what it expects to receive. Tests that ensure individual code elements (e.g., functions, classes, and class methods) behave correctly are called unit tests. Writing unit tests early in the process of implementing new functionality helps you think about what you want a piece of code to do, rather than just how it does it. This practice improves code quality by focusing your attention on use cases rather than getting lost in implementation details.

The testthat package is very helpful for setting up tests. Also, RUnit is a testing framework for R that helps automate test setup, creation, execution, and reporting. For more information, see Bioconductor’s unit testing guidelines.
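As a small sketch of what a testthat unit test looks like (the function being tested here, c_to_f, is made up for illustration):

```r
library(testthat)

## A hypothetical function to test: convert Celsius to Fahrenheit
c_to_f <- function(celsius) celsius * 9/5 + 32

test_that("c_to_f handles basic cases", {
  expect_equal(c_to_f(0), 32)
  expect_equal(c_to_f(100), 212)
  expect_equal(c_to_f(c(0, 100)), c(32, 212))  ## vectorized input
})
```

In a package, tests like this typically live in tests/testthat/ and run automatically via devtools::test().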

Timing your code

First, a cautionary note…

premature optimization is the root of all evil

— Donald Knuth, 1974

There are a few tools in R for timing your code.

system.time(mean(rnorm(1e7)))
##    user  system elapsed 
##   0.686   0.024   0.711
library(rbenchmark)
x <- rnorm(1e7)
benchmark(ifelse(x < 0, x, 0),
          x[x < 0] <- 0,
          replications = 5,
          columns = c('replications', 'elapsed'))
##   replications elapsed
## 1            5   1.483
## 2            5   0.175

Profiling your code

For more advanced assessment of bottlenecks in your code, consider Rprof(). The raw output from Rprof can be hard to decipher, so you may want to use the functions in the proftools package, which use Rprof under the hood.

Here’s a function that does the linear algebra to implement a linear regression, assuming X is the matrix of predictors, including a column for the intercept:


\hat{\beta} = (X^{\top}X )^{-1}X^{\top}y

lr_slow <- function(y, X) {
  XtX <- t(X) %*% X
  Xty <- t(X) %*% y
  inv <- solve(XtX)   ## explicit matrix inverse is slow and generally a bad idea numerically
  return(inv %*% Xty)
}                   

lr_medium <- function(y, X) {
  XtX <- crossprod(X)
  Xty <- crossprod(X, y)
  inv <- solve(XtX)   ## explicit matrix inverse is slow and generally a bad idea numerically
  return(inv %*% Xty)
}                   

lr_fast <- function(y, X) {
  XtX <- crossprod(X)
  Xty <- crossprod(X, y)
  U <- chol(XtX)
  tmp <- backsolve(U, Xty, transpose = TRUE)
  return(backsolve(U, tmp))
}                   

Now let’s try these three functions with profiling turned on.

## generate random observations and random matrix of predictors
y <- rnorm(5000)
x <- matrix(rnorm(5000*1000), nrow = 5000)

library(proftools)

pd1 <- profileExpr(lr_slow(y, x))
hotPaths(pd1)
##  path                  total.pct self.pct
##  lazyLoadDBfetch       20.83      0.00   
##  . <Anonymous>         20.83      0.00   
##  . . lazyLoadDBfetch   20.83     20.83   
##  lr_slow               79.17      0.00   
##  . %*% (<text>:2)      17.71     17.71   
##  . %*% (<text>:3)       3.12      3.12   
##  . solve (<text>:4)    37.50      0.00   
##  . . standardGeneric   37.50      0.00   
##  . . . solve           37.50      0.00   
##  . . . . solve.default 37.50     37.50   
##  . t (<text>:2)         2.08      0.00   
##  . . standardGeneric    2.08      0.00   
##  . . . t                2.08      0.00   
##  . . . . t.default      2.08      2.08   
##  . t (<text>:3)        18.75      0.00   
##  . . standardGeneric   18.75      0.00   
##  . . . t               18.75      0.00   
##  . . . . t.default     18.75     18.75
hotPaths(pd1, value = 'time')
##  path                  total.time self.time
##  lazyLoadDBfetch       0.40       0.00     
##  . <Anonymous>         0.40       0.00     
##  . . lazyLoadDBfetch   0.40       0.40     
##  lr_slow               1.52       0.00     
##  . %*% (<text>:2)      0.34       0.34     
##  . %*% (<text>:3)      0.06       0.06     
##  . solve (<text>:4)    0.72       0.00     
##  . . standardGeneric   0.72       0.00     
##  . . . solve           0.72       0.00     
##  . . . . solve.default 0.72       0.72     
##  . t (<text>:2)        0.04       0.00     
##  . . standardGeneric   0.04       0.00     
##  . . . t               0.04       0.00     
##  . . . . t.default     0.04       0.04     
##  . t (<text>:3)        0.36       0.00     
##  . . standardGeneric   0.36       0.00     
##  . . . t               0.36       0.00     
##  . . . . t.default     0.36       0.36
pd2 <- profileExpr(lr_medium(y, x))
hotPaths(pd2)
##  path                    total.pct self.pct
##  lazyLoadDBfetch          2.56      2.56   
##  lr_medium               97.44      0.00   
##  . %*% (<text>:12)        2.56      2.56   
##  . crossprod (<text>:10)  5.13      0.00   
##  . . crossprod            5.13      0.00   
##  . . . base::crossprod    5.13      5.13   
##  . crossprod (<text>:9)  51.28      0.00   
##  . . crossprod           51.28      0.00   
##  . . . base::crossprod   51.28     51.28   
##  . solve (<text>:11)     38.46      0.00   
##  . . standardGeneric     38.46      0.00   
##  . . . solve             38.46      0.00   
##  . . . . solve.default   38.46     35.90   
##  . . . . . diag           2.56      2.56
hotPaths(pd2, value = 'time')
##  path                    total.time self.time
##  lazyLoadDBfetch         0.02       0.02     
##  lr_medium               0.76       0.00     
##  . %*% (<text>:12)       0.02       0.02     
##  . crossprod (<text>:10) 0.04       0.00     
##  . . crossprod           0.04       0.00     
##  . . . base::crossprod   0.04       0.04     
##  . crossprod (<text>:9)  0.40       0.00     
##  . . crossprod           0.40       0.00     
##  . . . base::crossprod   0.40       0.40     
##  . solve (<text>:11)     0.30       0.00     
##  . . standardGeneric     0.30       0.00     
##  . . . solve             0.30       0.00     
##  . . . . solve.default   0.30       0.28     
##  . . . . . diag          0.02       0.02
pd3 <- profileExpr(lr_fast(y, x))
hotPaths(pd3)
##  path                    total.pct self.pct
##  lr_fast                 100.00     0.00   
##  . chol (<text>:18)       15.79     0.00   
##  . . standardGeneric      15.79     0.00   
##  . . . chol               15.79     0.00   
##  . . . . chol.default     15.79    15.79   
##  . crossprod (<text>:16)  73.68     0.00   
##  . . crossprod            73.68     0.00   
##  . . . base::crossprod    73.68    73.68   
##  . crossprod (<text>:17)  10.53     0.00   
##  . . crossprod            10.53     0.00   
##  . . . base::crossprod    10.53    10.53
hotPaths(pd3, value = 'time')
##  path                    total.time self.time
##  lr_fast                 0.38       0.00     
##  . chol (<text>:18)      0.06       0.00     
##  . . standardGeneric     0.06       0.00     
##  . . . chol              0.06       0.00     
##  . . . . chol.default    0.06       0.06     
##  . crossprod (<text>:16) 0.28       0.00     
##  . . crossprod           0.28       0.00     
##  . . . base::crossprod   0.28       0.28     
##  . crossprod (<text>:17) 0.04       0.00     
##  . . crossprod           0.04       0.00     
##  . . . base::crossprod   0.04       0.04

You might also check out profvis as an alternative way to display the profiling information generated by Rprof.

Memory use

You should know how much memory (RAM) the computer you are using has, and keep in mind how big your objects are and how much memory your code might use. Unlike in, e.g., SAS or a database, all objects in R are stored in RAM.

If the jobs on a machine collectively approach its physical RAM, the machine may (depending on how it is set up) start to use the hard disk as ‘virtual memory’. This is called paging or swapping, and once it happens you’re often toast (i.e., your code may take essentially forever to finish). If the machine is not set up to page, your job will instead die with an out-of-memory (OOM) error.

You can assess memory use with top or ps in Linux/Mac or the Task Manager in Windows.

Often it’s a good idea to roughly estimate how much memory an object will take up even before creating it in R. You can do this with some simple arithmetic. Every real number takes 8 bytes (integers and logicals take less; character strings are complicated), so an object with, say, 1 million rows and 10 columns, all numbers, would take roughly 8 * 1000000 * 10 bytes, or 80 MB.
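That arithmetic is easy to do directly in R before committing to creating the object:

```r
## Quick back-of-the-envelope estimate: 1 million rows x 10 numeric columns,
## 8 bytes per number
n <- 1e6
p <- 10
n * p * 8 / 1e6    ## size in MB
```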

x <- rnorm(1e7)
object.size(x)
## 80000048 bytes
1e7*8/1e6  # direct calculation of Mb
## [1] 80
print(object.size(x), units = 'auto')
## 76.3 Mb
x <- rnorm(1e8)
gc()
##             used  (Mb) gc trigger   (Mb)  max used  (Mb)
## Ncells   2083510 111.3    4077525  217.8   3162653 169.0
## Vcells 113614165 866.9  159622940 1217.9 123668420 943.6
rm(x)
gc()
##            used  (Mb) gc trigger  (Mb)  max used  (Mb)
## Ncells  2083522 111.3    4077525 217.8   3162653 169.0
## Vcells 13614197 103.9  127698352 974.3 123668420 943.6

Memory use: quick quiz

POLL 9A: Suppose you have a CSV file with one million rows and one thousand columns, filled with numbers. How much memory will this take up if you read the data into R?

(respond at https://pollev.com/chrispaciorek428)

  1. 800 MB
  2. 8 GB
  3. 1 GB
  4. 80 GB
  5. something else

Scripting

If you use a good editor that understands the syntax of the language you are using, it’s easier to write and understand your code. Some editors to consider:

To run all the code in an entire file, do source('myCodeFile.R').

RStudio projects help to organize your work. In many cases you may also want to create an R package to share your code, even if you don’t plan to make it publicly available. The devtools and usethis packages can help you create and manage projects and packages.

Batch jobs / background execution

To run code as a background job in a Linux/Mac context:

R CMD BATCH --no-save myCodeFile.R myOutput.Rout &

You then don’t need to keep RStudio, R, or even a terminal window open; the job runs by itself. If you are logged into a remote machine, you can log out and come back later.

IMPORTANT: make sure to write any needed output to a file (e.g. .Rda files, CSV files, text output from print() or cat()).
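For example, a script run via R CMD BATCH might end by saving its results to files (the file and object names here are made up for illustration):

```r
## Sketch of the end of a batch script: persist results so they
## survive after the background job finishes
results <- colMeans(matrix(rnorm(1e6), ncol = 10))

saveRDS(results, file = "results.rds")   ## reload later with readRDS("results.rds")
write.csv(data.frame(mean = results), file = "results.csv", row.names = FALSE)
```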

Good coding practices: functions

Use functions whenever possible. In particular, try to write functions rather than carrying out your work using blocks of code. Why? Functions allow us to reuse blocks of code easily for later use and for recreating an analysis (reproducible research). A function is also more transparent than sourcing a file of code because the inputs and outputs are specified formally, so you don’t have to read through the code to figure out what it does.
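As a small sketch of this idea (the function and data names are made up), a block of standardization code that would otherwise be copy-pasted for each variable becomes a single reusable function:

```r
## Rather than repeating (x - mean(x)) / sd(x) for every variable...
standardize <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

dat <- data.frame(a = rnorm(100, mean = 5, sd = 2), b = runif(100))
dat_std <- lapply(dat, standardize)   ## reuse the same code on every column
```

The function's name and arguments document what the code does, and the same logic can be applied to any new dataset without editing it.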

Good use of functions includes:

Functions should:

Good coding practices: syntax

For an example that demonstrates some of these ideas, see goodCode.R. And for a counterexample, see badCode.R.

c <- 7  ## masks the built-in function c() -- avoid reusing base function names
c(3,5)
## [1] 3 5
c
## [1] 7
rm(c)
c
## function (...)  .Primitive("c")

Reproducible research

An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.

— Jonathan Buckheit and David Donoho, WaveLab and Reproducible Research (1995)

Here are some useful articles talking about reproducibility.

Some ideas for improving reproducibility

Breakout

Please fill out the feedback survey. See the Ed Discussion board for the link.