R bootcamp, Module 10: Advanced topics

August 2022, UC Berkeley

Chris Paciorek

The purpose of this module

For some of the topics here, my goal is not to teach you how to fish, but merely to tell you that fish exist, that they are delicious, that they can be caught, and where one might go to figure out how to catch them.

Object-oriented programming (OOP) in R

R has three (well, actually five) different systems for OOP. This can be confusing, but they get the job done for a lot of tasks.

To understand how base R works, it’s helpful to understand S3.

To use OOP in a fashion similar to Python and C++, I suggest using R6.

Basics of object-oriented programming (OOP)

The basic idea is that coding is structured around objects, which belong to a class, and methods that operate on objects in the class.

Objects are like lists, but with methods that are specifically associated with particular classes, as we’ve seen with the lm class.

Objects have fields, analogous to the components of a list. For S4 and reference classes, the fields of the class are fixed and all objects of the class must contain those fields.

Working with S3 classes and methods

S3 objects are generally built upon lists.

mod <- lm(gapminder$lifeExp ~ log(gapminder$gdpPercap))
class(mod)
## [1] "lm"
is.list(mod)
## [1] TRUE
names(mod)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
mod$coefficients
##              (Intercept) log(gapminder$gdpPercap) 
##                -9.100889                 8.405085
mod[['coefficients']]
##              (Intercept) log(gapminder$gdpPercap) 
##                -9.100889                 8.405085
mod[[1]]
##              (Intercept) log(gapminder$gdpPercap) 
##                -9.100889                 8.405085

The magic of the S3 OOP approach here is that ‘generic’ methods (i.e., functions) can be tailored to work with particular classes of objects. This has been called “functional OOP” because the generic methods look like regular functions.

summary(gapminder$lifeExp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   23.60   48.20   60.71   59.47   70.85   82.60
summary(mod)
## 
## Call:
## lm(formula = gapminder$lifeExp ~ log(gapminder$gdpPercap))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.778  -4.204   1.212   4.658  19.285 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -9.1009     1.2277  -7.413 1.93e-13 ***
## log(gapminder$gdpPercap)   8.4051     0.1488  56.500  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.62 on 1702 degrees of freedom
## Multiple R-squared:  0.6522, Adjusted R-squared:  0.652 
## F-statistic:  3192 on 1 and 1702 DF,  p-value: < 2.2e-16

Question: What do you think R is doing behind the scenes?

Consider summary.lm.
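
To peek at the code that actually runs when summary() is called on an lm object, you can retrieve the method directly. (getS3method() is a standard utils function; it also works for methods that aren't exported from a package's namespace.)

getS3method("summary", "lm")  # prints the definition of the dispatched method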

More on working with S3 classes and methods

library(methods)
yb <- gapminder$lifeExp > 75
yc <- gapminder$lifeExp
x <- log(gapminder$gdpPercap)
mod1 <- lm(yc ~ x)
mod2 <- glm(yb ~ x, family = binomial)
mod2$residuals[1:20] # access field with list-like syntax
##         1         2         3         4         5         6         7         8 
## -1.000026 -1.000031 -1.000035 -1.000033 -1.000022 -1.000027 -1.000054 -1.000035 
##         9        10        11        12        13        14        15        16 
## -1.000015 -1.000014 -1.000021 -1.000053 -1.000258 -1.000477 -1.000832 -1.001461 
##        17        18        19        20 
## -1.002613 -1.003204 -1.003496 -1.003838
class(mod2)
## [1] "glm" "lm"
is(mod2, "lm")
## [1] TRUE
is.list(mod2)
## [1] TRUE
names(mod2)
##  [1] "coefficients"      "residuals"         "fitted.values"    
##  [4] "effects"           "R"                 "rank"             
##  [7] "qr"                "family"            "linear.predictors"
## [10] "deviance"          "aic"               "null.deviance"    
## [13] "iter"              "weights"           "prior.weights"    
## [16] "df.residual"       "df.null"           "y"                
## [19] "converged"         "boundary"          "model"            
## [22] "call"              "formula"           "terms"            
## [25] "data"              "offset"            "control"          
## [28] "method"            "contrasts"         "xlevels"
methods(class = "glm")
##  [1] add1           anova          coerce         confint        cooks.distance
##  [6] deviance       drop1          effects        extractAIC     family        
## [11] formula        influence      initialize     logLik         model.frame   
## [16] nobs           predict        print          profile        residuals     
## [21] rstandard      rstudent       show           slotsFromS3    summary       
## [26] vcov           weights       
## see '?methods' for accessing help and source code
methods(predict)
##  [1] predict.ar*                predict.Arima*            
##  [3] predict.arima0*            predict.bam               
##  [5] predict.bs*                predict.bSpline*          
##  [7] predict.fastTps            predict.gam               
##  [9] predict.glm                predict.glmmPQL*          
## [11] predict.gls*               predict.gnls*             
## [13] predict.HoltWinters*       predict.interp.surface    
## [15] predict.jam*               predict.Krig              
## [17] predict.lda*               predict.lm                
## [19] predict.lme*               predict.lmList*           
## [21] predict.lmList4*           predict.loess*            
## [23] predict.lqs*               predict.mca*              
## [25] predict.merMod*            predict.mKrig             
## [27] predict.mlm*               predict.nbSpline*         
## [29] predict.nlme*              predict.nls*              
## [31] predict.npolySpline*       predict.ns*               
## [33] predict.pbSpline*          predict.polr*             
## [35] predict.poly*              predict.polySpline*       
## [37] predict.ppolySpline*       predict.ppr*              
## [39] predict.prcomp*            predict.princomp*         
## [41] predict.qda*               predict.qsreg             
## [43] predict.rlm*               predict.smooth.spline*    
## [45] predict.smooth.spline.fit* predict.sreg              
## [47] predict.StructTS*          predict.surface           
## [49] predict.surface.default    predict.Tps               
## see '?methods' for accessing help and source code
predict
## function (object, ...) 
## UseMethod("predict")
## <bytecode: 0x55e1588c7140>
## <environment: namespace:stats>
# predict.glm

When predict() is called on a GLM object, R invokes the generic predict(), which recognizes that its first argument is of class glm and dispatches to the appropriate class-specific method, predict.glm() in this case.

Making your own S3 class/object/method

Making an object and class-specific methods under S3 is simple.

rboot2022 <- list(month = 'August', year = 2022, 
  instructor = 'Paciorek', attendance = 100)
class(rboot2022) <- "workshop"

rboot2022
## $month
## [1] "August"
## 
## $year
## [1] 2022
## 
## $instructor
## [1] "Paciorek"
## 
## $attendance
## [1] 100
## 
## attr(,"class")
## [1] "workshop"
is(rboot2022, "workshop")
## [1] TRUE
rboot2022$instructor 
## [1] "Paciorek"
print.workshop <- function(x, ...) {  # include '...' to match the generic's signature
    with(x,
       cat("A workshop held in ", month, " ", year, "; taught by ", instructor, ".\nThe attendance was ", attendance, ".\n", sep = ""))
    invisible(x)
}

# doesn't execute correctly in the slide creation, so comment out here:
# rboot2022 

Note that we rely on the generic print() already existing in R. Otherwise we’d need to create it.
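
As a minimal sketch of what that would look like, suppose we wanted a (made-up) generic called greet(). We'd define the generic ourselves with UseMethod() and then write the class-specific method:

greet <- function(x, ...) UseMethod("greet")  # 'greet' is hypothetical, not an existing R function
greet.workshop <- function(x, ...) {
    cat("Welcome to the workshop taught by", x$instructor, "\n")
    invisible(x)
}
greet(rboot2022)
## Welcome to the workshop taught by Paciorek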

So what is happening behind the scenes here?

Using S4 classes and methods (optional)

Unlike S3, S4 classes have a formal definition, and objects in the class must have the specified fields.

The fields of an S4 class are called ‘slots’. Instead of x$field you do x@field.
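
For a sense of the syntax, here's a minimal sketch of defining an S4 class of our own (this 'workshopS4' class and its slots are invented for illustration). You define the class formally with setClass() and write methods with setMethod():

library(methods)
setClass("workshopS4",
    slots = c(instructor = "character", attendance = "numeric"))
setMethod("show", "workshopS4", function(object) {
    cat("An S4 workshop taught by", object@instructor, "\n")
})
ws <- new("workshopS4", instructor = "Paciorek", attendance = 100)
ws@instructor
## [1] "Paciorek"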

Here’s a bit of an example of an S4 class for a linear mixed effects model from lme4:

library(lme4)
library(methods)
fm1 <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
class(fm1)
## [1] "lmerMod"
## attr(,"package")
## [1] "lme4"
methods(class = "lmerMod")
## [1] getL show
## see '?methods' for accessing help and source code
slotNames(fm1)
##  [1] "resp"    "Gp"      "call"    "frame"   "flist"   "cnms"    "lower"  
##  [8] "theta"   "beta"    "u"       "devcomp" "pp"      "optinfo"
fm1$theta
## Error in fm1$theta: $ operator not defined for this S4 class
fm1@theta
## [1] 0.96674177 0.01516906 0.23090995

Brief introduction to R6 Classes

R6 classes are a relatively new feature in R that provide object-oriented programming with behavior more like that in other languages such as Python and C++.

Here’s an extended example that simulates random time series.

library(R6)

tsSimClass <- R6Class("tsSimClass",
    ## class for holding time series simulators
    public = list(
        initialize = function(times, mean = 0, corParam = 1){
            library(fields)
            stopifnot(is.numeric(corParam), length(corParam) == 1)
            stopifnot(is.numeric(times))
            private$times <- times
            private$n <- length(times)
            private$mean <- mean
            private$corParam <- corParam
            private$currentU <- FALSE
            private$calcMats()
        },          
        
        changeTimes = function(newTimes){
            # modifies a private member field (i.e., a 'setter') and recalculates
            private$times <- newTimes
            private$calcMats()
        },
        
        getTimes = function(){
            # a 'getter' method
            return(private$times)
        },

        print = function(){ # 'print' method
            cat("R6 Object of class 'tsSimClass' with ",
                private$n, " time points.\n", sep = '')
            invisible(self)
        }
    ),

    ## private methods and functions not accessible externally
    private = list(
        calcMats = function() {
            ## calculates correlation matrix and Cholesky factor
            lagMat <- fields::rdist(private$times) # local variable
            corMat <- exp(-lagMat^2 / private$corParam^2)
            private$U <- chol(corMat) # square root matrix
            cat("Done updating correlation matrix and Cholesky factor.\n")
            private$currentU <- TRUE
            invisible(self)
        },
        # internal (private) fields not directly accessible from outside the object
        n = NULL, 
        times = NULL,
        mean = NULL,
        corParam = NULL,
        U = NULL,
        currentU = FALSE
    )
)   

## add another method after class is already created
tsSimClass$set("public", "simulate", function() {
    if(!private$currentU)
        private$calcMats()
    ## analogous to mu+sigma*z for generating N(mu, sigma^2)
    return(private$mean + crossprod(private$U, rnorm(private$n)))
})

my_ts <- tsSimClass$new(1:100, 2, 1)
## Done updating correlation matrix and Cholesky factor.
my_ts
## R6 Object of class 'tsSimClass' with 100 time points.
set.seed(1)
y <- my_ts$simulate()   # generate a time series
plot(my_ts$getTimes(), y, type = 'l', xlab = 'time',
      ylab = 'process values')
## We can't directly access private fields or methods
my_ts$times
## NULL
my_ts$calcMats()
## Error in eval(expr, envir, enclos): attempt to apply non-function
my_ts <- tsSimClass$new(1:100, 2, 3)
## Done updating correlation matrix and Cholesky factor.
set.seed(1)
y <- my_ts$simulate()   # generate a second time series
lines(my_ts$getTimes(), y, col = 'red')

Error and warning messages

When you write your own functions, particularly functions you'll distribute to others, it's a good idea to check that the inputs are valid and to report any problems with informative messages.

We can use stop() and warning() to do this. They’re the same functions that are being called when you see an error message or a warning in reaction to your own work in R.

mysqrt <- function(x) {
  if(is.list(x)) {
    warning("x is a list; converting to a vector")
    x <- unlist(x)
  }
  if(!is.numeric(x)) {
    stop("What is the square root of 'bob'?")
  } else {
      if(any(x < 0)) {
        warning("mysqrt: found negative values; proceeding anyway")
        x[x >= 0] <- (x[x >= 0])^(1/2)
        x[x < 0] <- NaN
        return(x)
      } else return(x^(1/2))
  }
}

mysqrt(c(1, 2, 3))
## [1] 1.000000 1.414214 1.732051
mysqrt(c(5, -7))
## Warning in mysqrt(c(5, -7)): mysqrt: found negative values; proceeding anyway
## [1] 2.236068      NaN
mysqrt(c('asdf', 'sdf'))
## Error in mysqrt(c("asdf", "sdf")): What is the square root of 'bob'?
mysqrt(list(5, 3, 'ab'))
## Warning in mysqrt(list(5, 3, "ab")): x is a list; converting to a vector
## Error in mysqrt(list(5, 3, "ab")): What is the square root of 'bob'?
sqrt(c(5, -7))
## Warning in sqrt(c(5, -7)): NaNs produced
## [1] 2.236068      NaN
sqrt('asdf')
## Error in sqrt("asdf"): non-numeric argument to mathematical function
sqrt(list(5, 3, 2))
## Error in sqrt(list(5, 3, 2)): non-numeric argument to mathematical function

So we’ve done something similar to what sqrt() actually does in R.

‘Catching’ errors

When you automate analyses, sometimes an R function call will fail. But you don’t want all of your analyses to grind to a halt because one failed. Rather, you want to catch the error, record that it failed, and move on.

For me this is most critical when I’m doing stratified analyses or sequential operations.

The try() function is a powerful tool here.

Why we need to try()

Suppose we tried to do a stratified analysis of life expectancy on GDP within continents, for 2007. I’m going to do this as a for loop for pedagogical reasons, but again, it would be better to do this with dplyr/lapply/by type tools.

For the purpose of illustration, I’m going to monkey a bit with the data such that there is an error in fitting Oceania. This is artificial, but when you stratify data into smaller groups it’s not uncommon that the analysis can fail for one of the groups (often because of small sample size or missing data).

mod <- list()
fakedat <- gapminder[gapminder$year == 2007, ]
fakedat$gdpPercap[fakedat$continent == 'Oceania'] <- NA

for(cont in c('Asia', 'Oceania', 'Europe', 'Americas', 'Africa')) {
            cat("Fitting model for continent ", cont, ".\n")
            tmp <- subset(fakedat, continent == cont)
            mod[[cont]] <- lm(lifeExp ~ log(gdpPercap), data = tmp)
}
## Fitting model for continent  Asia .
## Fitting model for continent  Oceania .
## Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases

What happened?

How we can try() harder

mod <- list()

for(cont in c('Asia', 'Oceania', 'Europe', 'Americas', 'Africa')) {
            cat("Fitting model for continent ", cont, ".\n")
            tmp <- subset(fakedat, continent == cont)
            curMod <- try(lm(lifeExp ~ log(gdpPercap), data = tmp))
            if(is(curMod, "try-error")) mod[[cont]] <- NA 
                       else mod[[cont]] <- curMod            
}
## Fitting model for continent  Asia .
## Fitting model for continent  Oceania .
## Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
##   0 (non-NA) cases
## Fitting model for continent  Europe .
## Fitting model for continent  Americas .
## Fitting model for continent  Africa .
mod[[1]]
## 
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = tmp)
## 
## Coefficients:
##    (Intercept)  log(gdpPercap)  
##         25.650           5.157
mod[[2]]
## [1] NA
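
For finer-grained control, base R also provides tryCatch(), which lets you supply handler functions for errors (and warnings). Here's a minimal sketch of how the loop body above could use it instead of try():

curMod <- tryCatch(
    lm(lifeExp ~ log(gdpPercap), data = tmp),
    error = function(e) {
        message("lm failed: ", conditionMessage(e))  # report the error, then carry on
        NA
    }
)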

Computing on the language

One of the powerful capabilities you have in R is the ability to use R to modify and create R code.

First we need to understand a bit about how R code is stored and manipulated when we don’t want to immediately evaluate it.

When you send some code to R to execute, it has to ‘parse’ the input; i.e., to process it so that it knows how to evaluate it. The parsed input can then be evaluated in the proper context (i.e., the right frame, holding the objects to be operated on and created).

We can capture parsed code before it is evaluated, manipulate it, and execute the modified result.

Here’s a teaser:

x7 <- 3

idx <- 7
varName <- paste0('x', idx)
x <- eval(as.name(varName))
print(x)
## [1] 3

(Advanced) Capturing and evaluating parsed code (optional)

code <- quote(n <- 100)
code
## n <- 100
class(code)
## [1] "<-"
n
## Error in eval(expr, envir, enclos): object 'n' not found
eval(code)
n
## [1] 100
results <- rep(0, n)
moreCode <- quote(for(i in 1:n) {
    tmp <- rnorm(30)
    results[i] <- min(tmp)
})
class(moreCode)
## [1] "for"
as.list(moreCode)
## [[1]]
## `for`
## 
## [[2]]
## i
## 
## [[3]]
## 1:n
## 
## [[4]]
## {
##     tmp <- rnorm(30)
##     results[i] <- min(tmp)
## }
newN <- 200
codeText <- paste("n", "<-", newN)
codeText
## [1] "n <- 200"
codeFromText <- parse(text = codeText)
codeFromText
## expression(n <- 200)
eval(codeFromText)
n
## [1] 200

So you could use R’s string manipulation capabilities to write and then evaluate R code. Meta.

(Advanced) Using R to automate working with object names (optional)

Suppose you were given a bunch of objects named “x1”, “x2”, “x3”, … and you wanted to write code to automatically do some computation on them. Here I’ll just demonstrate with three, but this is obviously more compelling if there are many of them.

# assume these objects were provided to you
x1 <- rnorm(5)
x2 <- rgamma(10, 1)
x3 <- runif(20)
# now you want to work with them
nVals <- 3
results <- rep(0, nVals)
for(i in 1:nVals) { 
  varName <- paste("x", i, sep = "")
  tmp <- eval(as.name(varName))
  # tmp <- get(varName) # an alternative
   results[i] <- mean(tmp)
}
results
## [1] -0.3971457  1.4027052  0.4788078
varName
## [1] "x3"
tmp
##  [1] 0.5549006 0.4296244 0.4527201 0.3064433 0.5783539 0.9103703 0.1426041
##  [8] 0.4150476 0.2109258 0.4287504 0.1326900 0.4600964 0.9429571 0.7619739
## [15] 0.9329098 0.4706785 0.6035881 0.4849897 0.1088063 0.2477268

Or suppose you needed to create “x1”, “x2”, “x3” automatically.

nVals <- 3
results <- rep(0, nVals)
for(i in 1:nVals) {  
   varName <- paste("x", i, sep = "")
   assign(varName, rnorm(10))
}
x2
##  [1] -0.91068068  0.74127631  0.06851153 -0.32375075 -1.08650305 -1.01592895
##  [7] -0.76779018 -1.11972006 -0.44817424  0.47173637

Can you think of any uses of this ability of R to generate its own code?

File encodings (optional)

Text (either in the form of a file with regular language in it or a data file with fields of character strings) will often contain characters that are not part of the limited ASCII set, which has only 128 code points (printable characters plus control codes): basically what you see on a standard US keyboard.

UTF-8 is an encoding for the Unicode characters that include more than 110,000 characters from 100 different alphabets/scripts. It’s widely used on the web.

Latin-1 encodes a small subset of Unicode and contains the characters used in many European languages (e.g., letters with accents).

Dealing with encodings in R (optional)

To read files with other characters correctly into R, you may need to tell R what encoding the file is in. E.g., see help on read.table() for the fileEncoding and encoding arguments.
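
For example, a quick sketch (the file name here is hypothetical):

dat <- read.table("survey_latin1.txt", header = TRUE, fileEncoding = "latin1")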

With strings already in R, you can convert between encodings with iconv():

text <- "Melhore sua seguran\xe7a"
iconv(text, from = "latin1", to = "UTF-8")
## [1] "Melhore sua segurança"
iconv(text, from = "latin1", to = "ASCII", sub = "???")
## [1] "Melhore sua seguran???a"

You can mark a string with an encoding so R can display it correctly:

x <- "fa\xE7ile"
Encoding(x) <- "latin1"
x
## [1] "façile"
# playing around...
x <- "\xa1 \xa2 \xa3 \xf1 \xf2"
Encoding(x) <- "latin1"
x
## [1] "¡ ¢ £ ñ ò"

Line endings in text files (optional)

Windows, Mac, and Linux handle line endings in text files somewhat differently. So if you read a text file into R that was created in a different operating system you can run into difficulties.

So in UNIX you might see ^M at the end of lines when you open a Windows file in a text editor. The dos2unix or fromdos commands in UNIX can do the necessary conversion.

In Windows you might have a UNIX text file appear to be all one line. The unix2dos or todos commands in UNIX can do the conversion.

There may also be Windows tools to deal with this.

Working with databases

R has the capability to read and write from a variety of relational database management systems (DBMS). Basically a database is a collection of rectangular format datasets (tables). Some of these tables have fields in common so it makes sense to merge (i.e., join) information from multiple tables. E.g., you might have a database with a table of student information, a table of teacher information and a table of school information.

The DBI package provides a front-end for manipulating databases from a variety of DBMS (MySQL, SQLite, Oracle, among others).

Basically, you tell the package what DBMS is being used on the back-end, link to the actual database, and then you can use the standard functions in the package regardless of the back-end.

Database example

You can get an example database of information about Stack Overflow questions and answers from http://www.stat.berkeley.edu/share/paciorek/stackoverflow-2016.db. Stack Overflow is a website where programmers and software users can pose questions and get answers to technical problems.

library(RSQLite)  # DBI is a dependency
db <- dbConnect(SQLite(), dbname = "../data/stackoverflow-2016.db") 
## stackoverflow-2016.db is an SQLite database

## metadata
dbListTables(db)
## [1] "answers"          "maxRepByQuestion" "questions"        "questionsAugment"
## [5] "questions_tags"   "users"
dbListFields(db, "questions")
## [1] "questionid"   "creationdate" "score"        "viewcount"    "title"       
## [6] "ownerid"
## simple filter operation
popular <- dbGetQuery(db, "select * from questions 
   where viewcount > 10000")
## a join followed by a filter operation
popularR <- dbGetQuery(db, "select * from questions join questions_tags
   on questions.questionid = questions_tags.questionid
   where viewcount > 10000 and
   tag = 'r'")

dbDisconnect(db)

POLL 10A:

What kind of R object do you think is returned when you query a database using dbGetQuery?

(respond at https://pollev.com/chrispaciorek428)

  1. a list of dataframes
  2. a matrix
  3. a dataframe
  4. a list of vectors
  5. some sort of specialized object (recall that lm() returns a specialized object)

Computer architecture

Note to participants: I’m having trouble with parallelization in RStudio, so we’ll just run the demo code in this module in a command line R session. You can open the basic R GUI for Mac or Windows, or, on a Mac, start R in a terminal window.

We’ll focus on shared memory computation here.

How do I know how many cores a computer has?
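
One easy way is from within R, using the parallel package (detectCores() counts logical cores, which may be more than the number of physical cores):

library(parallel)
detectCores()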

To see if multiple cores are being used by your job, you can watch the job in a system monitor such as top or htop and check whether your R process's CPU usage exceeds 100%.

How can we make use of multiple cores?

Some basic approaches, each covered in the rest of this module, are:

  1. threaded linear algebra (a threaded BLAS)
  2. parallelized versions of apply/lapply operations (e.g., future_lapply from the future.apply package)
  3. parallel for loops via foreach

Threaded linear algebra

R comes with a default BLAS (basic linear algebra subroutines) and LAPACK (linear algebra package) that carry out the core linear algebra computations. However, you can generally improve performance (sometimes by an order of magnitude) by using a different BLAS. Furthermore a threaded BLAS will allow you to use multiple cores.

A ‘thread’ is a lightweight process, and the operating system sees multiple threads as part of a single process.

We’ll show by demonstration that my desktop in my office is using multiple cores for linear algebra operations.

# note to CJP: don't run on laptop with slow BLAS
n <- 5000
x <- matrix(rnorm(n^2), n)
U <- chol(crossprod(x))

You should see that your R process is using more than 100% of CPU. Inconceivable!

More details on the BLAS (optional)

You can talk with your systems administrator about linking R to a fast BLAS or you can look into it yourself for your personal machine; see the R Installation and Administration manual.

Note that in some cases, in particular for small matrix operations, using multiple threads may actually slow down computation, so you may want to experiment, particularly with Linux. You can force the linear algebra to use only a single core by doing (assuming you’re using the bash shell) export OMP_NUM_THREADS=1 in the terminal window before starting R in the same terminal. Or see the RhpcBLASctl package to do it from within R.
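
For instance, a quick sketch using RhpcBLASctl (assuming the package is installed):

library(RhpcBLASctl)
blas_get_num_procs()     # number of threads the BLAS is currently set to use
blas_set_num_threads(1)  # restrict the BLAS to a single thread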

Finally, note that threaded BLAS and either foreach or parallel versions of apply() can conflict and cause R to hang, so you’re likely to want to set the number of threads to 1 as above if you’re doing explicit parallelization.

What is an embarrassingly parallel (EP) problem?

Do you think you should be asking?

An EP problem is one that can be solved by doing independent computations as separate processes without communication between the processes. You can get the answer by doing separate tasks and then collecting the results.

Examples in statistics / data science / machine learning include

  1. stratified analyses
  2. cross-validation
  3. simulations with many independent replicates
  4. bootstrapping
  5. random forest models

Can you think of others in your work?

Some things that are not EP (at least not in a basic formulation):

  1. optimization
  2. Markov chain Monte Carlo for fitting Bayesian models

POLL 10B: Have you used R’s parallel processing capabilities?

(respond at https://pollev.com/chrispaciorek428)

  1. No
  2. Yes, I’ve used foreach
  3. Yes, I’ve used mclapply or parLapply
  4. Yes, I’ve used other tools from the parallel package
  5. Yes, I’ve used the future package
  6. Yes, I’ve used something else, not listed above

Using multiple cores for EP problems: parallel apply using future

The future package provides a lot of nice features for parallelization. We’ll just scratch the surface here to parallelize operations over the elements of a list (note that this is essentially equivalent to parLapply and mclapply).

First, make sure your computations on the elements are independent of each other and don’t involve sequential calculations!

We’ll use a new dataset here, which is a dataset of airline departure times (in particular delays) for all flights from SFO over a period of several years. We’ll do a stratified analysis, fitting a GAM (see Unit 8) for each of the destination airports.

fitFun <- function(curDest) {
            library(mgcv)
            tmp <- subset(air, Dest == curDest)
            tmp$Hour <- tmp$CRSDepTime %/% 100
            curMod <- try(gam(DepDelay ~ Year + s(Month) + s(Hour) + 
                 as.factor(DayOfWeek), data = tmp))
            if(is(curMod, "try-error")) curMod <- NA  # check the try() result, not the data
            return(curMod)
}


library(future.apply)
nCores <- 4
plan(multisession, workers = nCores)  
out <- future_lapply(unique(air$Dest), fitFun)
out[[1]]
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## DepDelay ~ Year + s(Month) + s(Hour) + as.factor(DayOfWeek)
## 
## Estimated degrees of freedom:
## 7.51 7.49  total = 23 
## 
## GCV score: 1051.858
out[[81]]
## [1] "Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) : \n  A term has fewer unique covariate combinations than specified maximum degrees of freedom\n"
## attr(,"class")
## [1] "try-error"
## attr(,"condition")
## <simpleError in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots): A term has fewer unique covariate combinations than specified maximum degrees of freedom>

Note that the plan statement determines how the parallelization is done behind the scenes. As shown here, it will start up workers locally on your computer, but if you have access to a cluster, you can modify the plan to make use of multiple compute nodes in a cluster.
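
For example, something along these lines (the node names here are hypothetical):

## use workers on remote machines rather than local R sessions
plan(cluster, workers = c("node001", "node002", "node003", "node004"))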

One thing to keep in mind is whether the different tasks all take about the same amount of time or widely different times. In the latter case, one wants to sequentially dispatch tasks as earlier tasks finish, rather than dispatching a block of tasks to each core. See the future.scheduling argument for user control over how the allocation is done.
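
For instance, a sketch using the future_lapply() call from above (per the future.apply documentation, future.scheduling = Inf creates one future per element, so tasks are dispatched as workers become free):

out <- future_lapply(unique(air$Dest), fitFun, future.scheduling = Inf)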

Using multiple cores for EP problems: foreach

First, make sure your iterations are independent and don’t involve sequential calculations!

The foreach package provides a way to do a for loop using multiple cores. It can use a variety of ‘back-ends’ that handle the nitty-gritty of the parallelization. Happily, it integrates nicely with the future package.

library(parallel)
library(doFuture)
## Loading required package: foreach
library(foreach)

nCores <- 4  # actually only 2 on my laptop, but appears hyperthreaded
registerDoFuture()
plan(multisession, workers = nCores)  

out <- foreach(dest = unique(air$Dest)) %dopar% {
    cat("Starting job for ", dest, ".\n", sep = "")
    outSub <- fitFun(dest)
    cat("Finishing job for ", dest, ".\n", sep = "")
    outSub # this will become part of the out object
}
## Starting job for SLC.
## Finishing job for SLC.
## Starting job for PDX.
## Finishing job for PDX.
## Starting job for SNA.
## Finishing job for SNA.
## Starting job for DEN.
## Finishing job for DEN.
## Starting job for ORD.
## Finishing job for ORD.
## Starting job for RNO.
## Finishing job for RNO.
## Starting job for SAN.
## Finishing job for SAN.
## Starting job for IAH.
## Finishing job for IAH.
## Starting job for SEA.
## Finishing job for SEA.
## Starting job for IAD.
## Finishing job for IAD.
## Starting job for LAX.
## Finishing job for LAX.
## Starting job for EWR.
## Finishing job for EWR.
## Starting job for JFK.
## Finishing job for JFK.
## Starting job for OGG.
## Finishing job for OGG.
## Starting job for KOA.
## Finishing job for KOA.
## Starting job for BUR.
## Finishing job for BUR.
## Starting job for LAS.
## Finishing job for LAS.
## Starting job for PHX.
## Finishing job for PHX.
## Starting job for CLT.
## Finishing job for CLT.
## Starting job for PHL.
## Finishing job for PHL.
## Starting job for PIT.
## Finishing job for PIT.
## Starting job for MEM.
## Finishing job for MEM.
## Starting job for CIC.
## Finishing job for CIC.
## Starting job for MFR.
## Finishing job for MFR.
## Starting job for ACV.
## Finishing job for ACV.
## Starting job for BFL.
## Finishing job for BFL.
## Starting job for SBA.
## Finishing job for SBA.
## Starting job for FAT.
## Finishing job for FAT.
## Starting job for MOD.
## Finishing job for MOD.
## Starting job for MRY.
## Finishing job for MRY.
## Starting job for CEC.
## Finishing job for CEC.
## Starting job for RDD.
## Finishing job for RDD.
## Starting job for RDM.
## Finishing job for RDM.
## Starting job for SBP.
## Finishing job for SBP.
## Starting job for SMF.
## Finishing job for SMF.
## Starting job for BOI.
## Finishing job for BOI.
## Starting job for EUG.
## Finishing job for EUG.
## Starting job for AUS.
## Finishing job for AUS.
## Starting job for TWF.
## Finishing job for TWF.
## Starting job for IDA.
## Finishing job for IDA.
## Starting job for PSC.
## Finishing job for PSC.
## Starting job for MDW.
## Finishing job for MDW.
## Starting job for IND.
## Finishing job for IND.
## Starting job for HNL.
## Finishing job for HNL.
## Starting job for LIH.
## Finishing job for LIH.
## Starting job for BOS.
## Finishing job for BOS.
## Starting job for ATL.
## Finishing job for ATL.
## Starting job for MCO.
## Finishing job for MCO.
## Starting job for BWI.
## Finishing job for BWI.
## Starting job for MSY.
## Finishing job for MSY.
## Starting job for DFW.
## Finishing job for DFW.
## Starting job for CVG.
## Finishing job for CVG.
## Starting job for DTW.
## Finishing job for DTW.
## Starting job for MSP.
## Finishing job for MSP.
## Starting job for MIA.
## Finishing job for MIA.
## Starting job for STL.
## Finishing job for STL.
## Starting job for PSP.
## Finishing job for PSP.
## Starting job for CLE.
## Finishing job for CLE.
## Starting job for COS.
## Finishing job for COS.
## Starting job for SAT.
## Finishing job for SAT.
## Starting job for ABQ.
## Finishing job for ABQ.
## Starting job for ANC.
## Finishing job for ANC.
## Starting job for SMX.
## Finishing job for SMX.
## Starting job for DRO.
## Finishing job for DRO.
## Starting job for PIH.
## Finishing job for PIH.
## Starting job for SJC.
## Finishing job for SJC.
## Starting job for ONT.
## Finishing job for ONT.
## Starting job for TUS.
## Finishing job for TUS.
## Starting job for EGE.
## Finishing job for EGE.
## Starting job for GJT.
## Finishing job for GJT.
## Starting job for ASE.
## Finishing job for ASE.
## Starting job for ELP.
## Finishing job for ELP.
## Starting job for PMD.
## Finishing job for PMD.
## Starting job for BZN.
## Finishing job for BZN.
## Starting job for BIL.
## Finishing job for BIL.
## Starting job for OAK.
## Finishing job for OAK.
## Starting job for MKE.
## Finishing job for MKE.
## Starting job for MSO.
## Finishing job for MSO.
## Starting job for FCA.
## Finishing job for FCA.
## Starting job for LMT.
## Finishing job for LMT.
## Starting job for OTH.
## Finishing job for OTH.
## Starting job for LGB.
## Finishing job for LGB.
out[1:5]
## [[1]]
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## DepDelay ~ Year + s(Month) + s(Hour) + as.factor(DayOfWeek)
## 
## Estimated degrees of freedom:
## 7.51 7.49  total = 23 
## 
## GCV score: 1051.858     
## 
## [[2]]
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## DepDelay ~ Year + s(Month) + s(Hour) + as.factor(DayOfWeek)
## 
## Estimated degrees of freedom:
## 7.46 8.64  total = 24.1 
## 
## GCV score: 1325.755     
## 
## [[3]]
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## DepDelay ~ Year + s(Month) + s(Hour) + as.factor(DayOfWeek)
## 
## Estimated degrees of freedom:
## 5.93 8.70  total = 22.63 
## 
## GCV score: 639.5956     
## 
## [[4]]
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## DepDelay ~ Year + s(Month) + s(Hour) + as.factor(DayOfWeek)
## 
## Estimated degrees of freedom:
## 7.84 8.51  total = 24.35 
## 
## GCV score: 880.4363     
## 
## [[5]]
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## DepDelay ~ Year + s(Month) + s(Hour) + as.factor(DayOfWeek)
## 
## Estimated degrees of freedom:
## 8.00 5.68  total = 21.69 
## 
## GCV score: 1367.66

Question: What do you think are the advantages and disadvantages of having many small tasks vs. a few large tasks?

Parallelization and Random Number Generation

A tale of the good, the bad, and the ugly

Random numbers on a computer are not truly random but are generated as a sequence of pseudo-random numbers. The sequence is finite (but very, very, very, very long) and eventually repeats itself.

A random number seed determines where in the sequence one starts when generating random numbers.

set.seed(1)
rnorm(5)
## [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078
set.seed(1)
rnorm(5)
## [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078

When you parallelize, you need to make sure the random number streams used by the different workers are independent; naive seeding can give identical or overlapping streams. The L'Ecuyer-CMRG generator is designed for this, and the syntax for using L'Ecuyer is available in my parallel computing tutorial.
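
As a brief sketch of the basic idiom with the parallel package (see the tutorial for details):

library(parallel)
cl <- makeCluster(4)
clusterSetRNGStream(cl, iseed = 1)  # sets up an independent L'Ecuyer stream on each worker
unlist(parLapply(cl, 1:4, function(i) rnorm(1)))
stopCluster(cl)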

(Advanced) A brief note on distributed computing (optional)

If you have access to multiple machines within a networked environment, such as the compute servers in the Statistics Department or the campus-wide Savio cluster or machines on Amazon’s EC2 service, there are a few (sometimes) straightforward ways to parallelize EP jobs across machines.

  1. Use the future package (recommended).
  2. Use foreach with doMPI as the back-end. You’ll need MPI and Rmpi installed on all machines.
  3. Use foreach with doSNOW as the back-end and start a cluster of type “SOCK” to avoid needing MPI/Rmpi.
  4. Use sockets to make a cluster in R and then use parLapply(), parSapply(), etc.

For option 1, see my tutorial on the future package (and also Python’s dask package) for syntax and more details.

For option 4, see my general tutorial on parallel computing for syntax and more details.