R bootcamp, Module 2: Managing R and R resources

August 2022, UC Berkeley

Chris Paciorek

How to be lazy

If you’re starting to type something you’ve typed before, or the long name of an R object or function, STOP! You likely don’t need to type all of that.

Question: Are there other tricks that anyone knows of? Please share in the online discussion forum.

Managing your objects

R has a number of functions for getting metadata about your objects. Some of this information is also shown in RStudio’s Environment tab/panel.

library(gapminder)  # provides the gapminder data frame used below
v1 <- gapminder$year
v2 <- gapminder$continent
v3 <- gapminder$lifeExp

length(v1)
## [1] 1704
str(v1)
##  int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
class(v1)
## [1] "integer"
typeof(v1)
## [1] "integer"
class(v2)
## [1] "factor"
typeof(v2)
## [1] "integer"
class(v3)
## [1] "numeric"
typeof(v3)
## [1] "double"
is.vector(v1)
## [1] TRUE
is.list(v1)
## [1] FALSE
myList <- list(3, c("uganda", "bulgaria"), matrix(1:4, 2))
is.list(myList)
## [1] TRUE
is.vector(myList)
## [1] TRUE
is.data.frame(myList)
## [1] FALSE

Question: What have you learned? Does it make sense?
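One way to make sense of the class()/typeof() difference for a factor like v2: a factor is stored as integer codes plus a set of labels (the levels). Here’s a minimal sketch using a small made-up factor (the values are for illustration only, not taken from gapminder):

```r
## A small hand-made factor (hypothetical values, for illustration only)
f <- factor(c("Asia", "Europe", "Asia"))
class(f)        # "factor": how R treats the object
typeof(f)       # "integer": how the values are stored internally
levels(f)       # the labels, sorted alphabetically by default
as.integer(f)   # the underlying integer codes that index into the levels
```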

Managing objects: quick quiz

POLL 2A: Which of these is true about the gapminder object in R?

(respond at https://pollev.com/chrispaciorek428)

  1. gapminder is a data frame
  2. gapminder is a matrix
  3. gapminder is a vector
  4. gapminder is a list
  5. gapminder is a function

Managing the workspace

R has functions for learning about the collection of objects in your workspace. Some of this is built in to RStudio.

## Let's first create a few objects
x <- rnorm(5)
y <- c(5L, 2L, 7L)
z <- list(a = 3, b = c('sam', 'yang'))
ls()  # search the user workspace (global environment)
## [1] "myList" "v1"     "v2"     "v3"     "x"      "y"      "z"
rm(x)    # delete a variable
ls()
## [1] "myList" "v1"     "v2"     "v3"     "y"      "z"
ls.str() # list and describe variables
## myList : List of 3
##  $ : num 3
##  $ : chr [1:2] "uganda" "bulgaria"
##  $ : int [1:2, 1:2] 1 2 3 4
## v1 :  int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## v2 :  Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## v3 :  num [1:1704] 28.8 30.3 32 34 36.1 ...
## y :  int [1:3] 5 2 7
## z : List of 2
##  $ a: num 3
##  $ b: chr [1:2] "sam" "yang"
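The same functions work on any environment, not just the global workspace. As a sketch, here I use a scratch environment (the name scratch is just for illustration) so that nothing in your own workspace gets removed:

```r
scratch <- new.env()                        # an empty environment to experiment in
assign("x", rnorm(5), envir = scratch)
assign("y", c(5L, 2L, 7L), envir = scratch)
ls(scratch)                                 # objects in the scratch environment
exists("x", envir = scratch, inherits = FALSE)
rm("x", envir = scratch)                    # delete x from scratch only
ls(scratch)
```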

Saving and reloading the workspace

Finally we can save the objects in our R session:

ls()
## [1] "myList" "v1"     "v2"     "v3"     "y"      "z"
save.image('module2.Rda')
rm(list = ls())
ls()
## character(0)
load('module2.Rda') 
ls()
## [1] "myList" "v1"     "v2"     "v3"     "y"      "z"
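save.image() saves everything in the workspace; save() lets you pick specific objects. A minimal sketch, writing to a temporary file (the object names ages and label are made up for the example):

```r
tmp <- tempfile(fileext = ".Rda")
ages <- c(31, 45, 27)
label <- "pilot study"
save(ages, label, file = tmp)   # save only these two objects
rm(ages, label)
restored <- load(tmp)           # load() returns the names of the objects it restored
restored
ages
```

Capturing the return value of load() is a handy way to see what a file contained, since load() otherwise silently overwrites any existing objects with the same names.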

Challenge: how would I find all of my objects that have ‘x’ in their names?

Packages (R’s killer app)

Let’s check out the packages on CRAN. In particular check out the CRAN Task Views.

Essentially any well-established statistical method, and many not-so-established ones, along with plenty of other functionality, is available in a package.

If you want to sound like an R expert, make sure to call them packages and not libraries. A library is the location in the directory structure where the packages are installed/stored.

Using packages

Two steps:

  1. Install the package on your machine
    • one-time only - the package will be a set of files in the filesystem
  2. Load the package
    • every time you start R and need to use a given package - the package will be loaded into memory

To install a package, in RStudio, just do Packages->Install Packages.

From the command line, you generally will just do

install.packages('gapminder') 

That should work without specifying the repository from which to download the package (though sometimes you will be given a menu of repositories from which to select). There may be some cases in which you might need to specify the repository explicitly, e.g.,

install.packages('gapminder', repos = 'https://cran.cnr.berkeley.edu') 

If you’re on a network and are not the administrator of the machine, you may need to explicitly tell R to install it in a directory you are able to write in:

install.packages('gapminder', lib = file.path('~', 'R'))

If you’re using R directly installed on your laptop (i.e., most of you), now (or at the break) would be a good point to install the various packages we need for the bootcamp, which can be done easily with the following command:

install.packages(c('chron','colorspace','codetools', 'DBI','devtools',
                   'dichromat','digest','doFuture','dplyr', 'fields',
                   'foreach','future.apply', 'gapminder', 'ggplot2',
                   'gridExtra','gtable','inline','iterators','knitr',
                   'labeling','lattice','lme4','mapproj','maps','munsell',
                   'proftools','proto','purrr','R6','rbenchmark',
                   'RColorBrewer','Rcpp','reshape2','rJava',
                   'RSQLite', 'scales','spam','stringr','tidyr','xlsx',
                   'xlsxjars','xtable'))
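If you have already installed some of these, you may want to install only what is missing. Here’s one possible sketch; missing_packages() is a hypothetical helper I’m defining here, not a built-in function, and the example call uses two packages that ship with R so that nothing is downloaded:

```r
## Hypothetical helper: which of the requested packages are not yet installed?
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(c("stats", "utils"))  # both ship with R
to_install
if (length(to_install) > 0) install.packages(to_install)
```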

Note that packages often are dependent on other packages so these dependencies may be installed and loaded automatically. E.g., fields depends on maps and on spam.

You can also install directly from a package zip/tarball rather than from CRAN by giving a filename instead of a package name.

General information about a package

You can use syntax as follows to get a list of the objects in a package and a brief description:

library(help = packageName)

On CRAN, if you click on a specific package, there are often vignettes that give an overview and describe usage of the package. The reference manual is just a single document with the help files for all of the objects/functions in a package, so it may be helpful, but it’s often hard to get the big-picture view from that.
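From within R you can also query some of this information directly. A small sketch using the stats package, which ships with R (substitute any installed package name):

```r
packageVersion("stats")               # version of the installed package
desc <- packageDescription("stats")   # parsed contents of the DESCRIPTION file
desc$Title                            # one-line title of the package
```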

More on packages

The search path

To see the packages that are loaded and the order in which packages are searched for functions/objects:

search()

To see what libraries (i.e., directory locations) R is retrieving packages from:

.libPaths()

And to see where R is getting specific packages:

searchpaths()
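To check where along the search path a particular function is found, something like the following should work. This is a sketch using median() from the stats package as an arbitrary example:

```r
find("median")                         # entries on the search path that define median()
environmentName(environment(median))   # the namespace the function belongs to
exists("median")                       # is it found anywhere along the search path?
```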

Package namespaces

Namespaces are a way to keep all the names for objects in a package together in a coherent way and allow R to look for objects in a principled way.

A few useful things to know:

ls('package:stats')[1:20]
##  [1] "acf"                  "acf2AR"               "add.scope"           
##  [4] "add1"                 "addmargins"           "aggregate"           
##  [7] "aggregate.data.frame" "aggregate.ts"         "AIC"                 
## [10] "alias"                "anova"                "ansari.test"         
## [13] "aov"                  "approx"               "approxfun"           
## [16] "ar"                   "ar.burg"              "ar.mle"              
## [19] "ar.ols"               "ar.yw"
lm <- function(i) {
   print(i)
}
lm(7) 
## [1] 7
lm(gapminder$lifeExp ~ gapminder$gdpPercap)
## gapminder$lifeExp ~ gapminder$gdpPercap
## <environment: 0x55b88e9f9dc8>
stats::lm(gapminder$lifeExp ~ gapminder$gdpPercap)
## 
## Call:
## stats::lm(formula = gapminder$lifeExp ~ gapminder$gdpPercap)
## 
## Coefficients:
##         (Intercept)  gapminder$gdpPercap  
##           5.396e+01            7.649e-04
rm(lm)

Can you explain what is going on? Consider the results of search().

More on packages, part 2 (optional)

(Advanced) Looking inside a package

Packages are available as “Package source”, namely the raw code and help files, and “binaries”, where stuff is packaged up for R to use efficiently.

To look at the raw R code (and possibly C/C++/Fortran code included in some packages), download and unzip the package source tarball. From the command line of a Linux/Mac terminal (note this won’t look right in the slides version of the HTML):

curl https://cran.r-project.org/src/contrib/fields_9.6.tar.gz \
     -o fields_9.6.tar.gz
tar -xvzf fields_9.6.tar.gz
cd fields
ls R
ls src
ls man
ls data

(Advanced) Creating your own R package

R is do-it-yourself - you can write your own package. At its most basic this is just some R scripts that are packaged together in a convenient format. And if giving it to someone else, it’s best to have some documentation in the form of function help files.

Why make a package?

See the devtools package and package.skeleton() for some useful tools to help you create a package. And there are lots of tips/tutorials online, in particular Hadley Wickham’s R packages book.

The working directory

To read and write from R, you need to have a firm grasp of where in the computer’s filesystem you are reading and writing from.

## What directory does R look for files in (working directory)?
getwd()

## Changing the working directory (Linux/Mac specific)
setwd('~/Desktop/r-bootcamp-fall-2022') # change the working directory
setwd('/Users/paciorek/Desktop') # absolute path
getwd()
setwd('r-bootcamp-fall-2022/modules') # relative path
setwd('../tmp') # relative path, up and back down the tree

## Changing the working directory (Windows specific)
## Windows - use either \\ or / to indicate directories
# setwd('C:\\Users\\Your_username\\Desktop\\r-bootcamp-fall-2022')
# setwd('..\\r-bootcamp-fall-2022')

## Changing the working directory (platform-agnostic)
setwd(file.path('~', 'Desktop', 'r-bootcamp-fall-2022', 'modules')) # change the working directory
setwd(file.path('/', 'Users', 'paciorek', 'Desktop', 'r-bootcamp-fall-2022', 'modules')) # absolute path
getwd()
setwd(file.path('..', 'data')) # relative path

Many errors and much confusion result from you and R not being on the same page in terms of where in the directory structure you are.

In RStudio, you can use Session -> Set Working Directory instead of setwd.
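If you change the working directory temporarily, it helps to remember where you started. A minimal sketch, using R’s per-session temporary directory as a stand-in for wherever you actually want to go:

```r
start <- getwd()
old <- setwd(tempdir())   # setwd() invisibly returns the previous working directory
getwd()                   # now the session's temporary directory
normalizePath(".")        # the absolute path that the relative path "." refers to
setwd(old)                # go back to where we started
identical(getwd(), start)
```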

The working directory: quick quiz

(respond at https://pollev.com/chrispaciorek428)

POLL 2B:

Suppose I am on a Mac that has the following directories:

Users
--paciorek
----Desktop
------r-bootcamp-fall-2022
--------data
--------modules
--------schedule
----Documents

Which of the following use relative paths?

  1. setwd('Users/paciorek/Desktop/r-bootcamp-fall-2022/data')
  2. setwd('/Users/paciorek/Desktop/r-bootcamp-fall-2022/data')
  3. setwd('data')
  4. setwd('../data')
  5. setwd('~paciorek/Desktop/r-bootcamp-fall-2022/data')
  6. setwd('paciorek', 'Desktop/r-bootcamp-fall-2022/data')
  7. setwd('./data')
  8. setwd('Desktop/r-bootcamp-fall-2022/data')

POLL 2C:

Suppose my current working directory is:

/Users/paciorek/Desktop/r-bootcamp-fall-2022/modules.

Windows users, just think of this as being: C:\Users\paciorek\Desktop\r-bootcamp-fall-2022\modules.

Which of the following will allow me to change to the ‘data’ subdirectory?

  1. setwd('Users/paciorek/Desktop/r-bootcamp-fall-2022/data')
  2. setwd('/Users/paciorek/Desktop/r-bootcamp-fall-2022/data')
  3. setwd('data')
  4. setwd('../data')
  5. setwd('~paciorek/Desktop/r-bootcamp-fall-2022/data')
  6. setwd('paciorek', 'Desktop/r-bootcamp-fall-2022/data')
  7. setwd('./data')
  8. setwd('Desktop/r-bootcamp-fall-2022/data')

Reading text files into R

The workhorse for reading into a data frame is read.table(), which allows any separator (CSV, tab-delimited, etc.). read.csv() is a special case of read.table() for CSV files.

Here’s a simple example where R is able to read the data in using the default arguments to read.csv().

getwd()
## [1] "/accounts/vis/paciorek/staff/workshops/r-bootcamp-fall-2022/modules"
cpds <- read.csv(file.path('..', 'data', 'cpds.csv'))
head(cpds)
##   year   country vturn outlays realgdpgr unemp
## 1 1960 Australia  95.5      NA        NA  1.42
## 2 1961 Australia  95.3      NA     -0.07  2.79
## 3 1962 Australia  95.3   23.17      5.71  2.63
## 4 1963 Australia  95.7   23.01      6.10  2.12
## 5 1964 Australia  95.7   22.88      6.28  1.15
## 6 1965 Australia  95.7   24.90      4.97  1.15

It’s good to first look at your data in plain text format outside of R and then to check it after you’ve read it into R.

More details on reading data into R

Remember that you’ll need to know the current working directory so that you know where R is looking for files.

Next let’s work through a more involved example, so you can see some of the steps and tricks involved in reading data into R.

rta <- read.table("../data/RTAData.csv", sep = ",", header = TRUE)
rta[1:5, 1:5]
##               time X40010 X40015 X40020 X40025
## 1 2010-03-01 14:58    821    209    828    258
## 2 2010-03-01 15:01    804    209    804    248
## 3 2010-03-01 15:04    892    212    801    237
## 4 2010-03-01 15:07    857    214    821    243
## 5 2010-03-01 15:10    849    222    834    252
dim(rta)
## [1] 120822     62
# great, we're all set, right?
# Not so fast...
rta[5, 2]
## [1] "849"
class(rta[ , 2])
## [1] "character"
# let's delve more deeply
# unique(rta[ , 2])  # don't run when creating slides
head(sort(unique(rta[ , 2])))
## [1] ""     "1000" "1001" "1002" "1003" "1004"
tail(sort(unique(rta[ , 2])))
## [1] "995" "996" "997" "998" "999" "x"
# can we handle that with read.table?
# help(read.table)

rta2 <- read.table("../data/RTAData.csv", sep = ",", header = TRUE,
      na.strings = c('NA', 'x'))
class(rta2[ , 2])
## [1] "integer"
# checking...
sum(is.na( rta2[ , 2] ))
## [1] 24507
sum( rta[ , 2] %in% c('','x'))
## [1] 24507
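If you don’t have RTAData.csv handy, the same issue can be reproduced on a tiny made-up file. This is only a sketch: the file contents and column names are invented for illustration.

```r
## Write a small made-up CSV with a stray "x" and a blank field
tmp <- tempfile(fileext = ".csv")
writeLines(c("time,count", "14:58,821", "15:01,x", "15:04,", "15:07,857"), tmp)

bad <- read.csv(tmp)
class(bad$count)         # not numeric: the "x" forces the column to be read as text

good <- read.csv(tmp, na.strings = c("NA", "x", ""))
class(good$count)        # "integer" once "x" and blanks are treated as missing
sum(is.na(good$count))   # count the entries that were turned into NA
```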

It’s good to first look at your data in plain text format outside of R and then to check it after you’ve read it into R.

Other ways to read data into R

The read.table() family of functions just skims the surface of things…

  1. You can also read in a file as a vector of character strings, one string per line of the file, with readLines(), and then post-process it.
  2. You can read fixed width format (constant number of characters per field) with read.fwf().
  3. read_csv() (and read_lines(), read_fwf(), etc.) in the readr package is a faster, more helpful drop-in replacement for read.csv() that plays well with dplyr (see Module 6).
  4. The data.table package is great for reading and manipulating large datasets (on the order of gigabytes or tens of gigabytes).
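As a sketch of the readLines() approach in the first item, here is one way to read a small made-up file line by line, drop a comment line, and convert the rest to numbers (the file contents are invented for illustration):

```r
tmp <- tempfile(fileext = ".txt")
writeLines(c("# made-up file with a comment line", "1 2 3", "4 5 6"), tmp)

lines <- readLines(tmp)                                # one string per line
dataLines <- lines[!startsWith(lines, "#")]            # drop comment lines
vals <- as.numeric(unlist(strsplit(dataLines, " ")))   # split on spaces and convert
matrix(vals, nrow = length(dataLines), byrow = TRUE)   # one matrix row per data line
```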

Reading ‘foreign’ format data

Here’s an example of reading data produced by another statistical package (Stata) with read.dta().

library(foreign)
vote <- read.dta(file.path('..', 'data', '2004_labeled_processed_race.dta'))
head(vote)
##   state pres04    sex  race  age9 partyid income relign8 age60 age65 geocode
## 1     2      1 female white 25-29    <NA>   <NA>    <NA> 18-29 25-29       3
## 2     2      2   male white 18-24    <NA>   <NA>    <NA> 18-29 18-24       3
## 3     2      1 female black 30-39    <NA>   <NA>    <NA> 30-44 30-39       3
## 4     2      1 female black 30-39    <NA>   <NA>    <NA> 30-44 30-39       3
## 5     2      1 female white 40-44    <NA>   <NA>    <NA> 30-44 40-49       3
## 6     2      1 female white 30-39    <NA>   <NA>    <NA> 30-44 30-39       3
##   sizeplac brnagain attend year region y
## 1    rural     <NA>   <NA> 2004      4 0
## 2    rural     <NA>   <NA> 2004      4 1
## 3    rural     <NA>   <NA> 2004      4 0
## 4    rural     <NA>   <NA> 2004      4 0
## 5    rural     <NA>   <NA> 2004      4 0
## 6    rural     <NA>   <NA> 2004      4 0

There are a number of other formats that we can handle for either reading or writing. Let’s see:

library(help = foreign)

R can also read in (and write out) Excel files, netCDF files, HDF5 files, etc., in many cases through add-on packages from CRAN.

A pause for a (gentle) diatribe:

Please try to avoid using Excel files as a data storage format. The format is proprietary and complicated (a file can have multiple sheets), it allows only a limited number of rows/columns, and the files are not easily readable/viewable (unlike simple text files).

Writing data out from R

Here you have a number of options.

  1. You can write out R objects to an R Data file, as we’ve seen, using save() and save.image().
  2. You can use write.csv() and write.table() to write data frames/matrices to flat text files with delimiters such as comma and tab.
  3. You can use write() to write out matrices in a simple flat text format.
  4. You can use cat() to write to a file, while controlling the formatting to a fine degree.
  5. You can write out in the various file formats mentioned on the previous slide.
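Here’s a small sketch of write() and cat() from the list above, writing to a temporary file. One thing to watch for: write() outputs the values in column-major order, so I transpose the matrix to get one matrix row per line of the file.

```r
m <- matrix(1:6, nrow = 2)
tmp <- tempfile(fileext = ".txt")

## transpose so that each row of m becomes one line in the file
write(t(m), file = tmp, ncolumns = ncol(m))

## cat() gives fine control over formatting; append = TRUE adds to the existing file
cat(sprintf("n = %d\n", nrow(m)), file = tmp, append = TRUE)

readLines(tmp)   # check what actually ended up in the file
```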

Writing out plots and tables

pdf('myplot.pdf', width = 7, height = 7)
x <- rnorm(10); y <- rnorm(10)
plot(x, y)
dev.off()
## png 
##   2

xtable() formats tables for HTML and LaTeX (the default).

library(xtable)
print(xtable(table(gapminder$year, gapminder$continent)), type = "html")
## <!-- html table generated in R 4.2.0 by xtable 1.8-4 package -->
## <!-- Fri Aug 19 11:16:27 2022 -->
## <table border=1>
## <tr> <th>  </th> <th> Africa </th> <th> Americas </th> <th> Asia </th> <th> Europe </th> <th> Oceania </th>  </tr>
##   <tr> <td align="right"> 1952 </td> <td align="right">  52 </td> <td align="right">  25 </td> <td align="right">  33 </td> <td align="right">  30 </td> <td align="right">   2 </td> </tr>
##   <tr> <td align="right"> 1957 </td> <td align="right">  52 </td> <td align="right">  25 </td> <td align="right">  33 </td> <td align="right">  30 </td> <td align="right">   2 </td> </tr>
##   <tr> <td align="right"> 1962 </td> <td align="right">  52 </td> <td align="right">  25 </td> <td align="right">  33 </td> <td align="right">  30 </td> <td align="right">   2 </td> </tr>
##   <tr> <td align="right"> 1967 </td> <td align="right">  52 </td> <td align="right">  25 </td> <td align="right">  33 </td> <td align="right">  30 </td> <td align="right">   2 </td> </tr>
##   <tr> <td align="right"> 1972 </td> <td align="right">  52 </td> <td align="right">  25 </td> <td align="right">  33 </td> <td align="right">  30 </td> <td align="right">   2 </td> </tr>
##   <tr> <td align="right"> 1977 </td> <td align="right">  52 </td> <td align="right">  25 </td> <td align="right">  33 </td> <td align="right">  30 </td> <td align="right">   2 </td> </tr>
##   <tr> <td align="right"> 1982 </td> <td align="right">  52 </td> <td align="right">  25 </td> <td align="right">  33 </td> <td align="right">  30 </td> <td align="right">   2 </td> </tr>
##   <tr> <td align="right"> 1987 </td> <td align="right">  52 </td> <td align="right">  25 </td> <td align="right">  33 </td> <td align="right">  30 </td> <td align="right">   2 </td> </tr>
##   <tr> <td align="right"> 1992 </td> <td align="right">  52 </td> <td align="right">  25 </td> <td align="right">  33 </td> <td align="right">  30 </td> <td align="right">   2 </td> </tr>
##   <tr> <td align="right"> 1997 </td> <td align="right">  52 </td> <td align="right">  25 </td> <td align="right">  33 </td> <td align="right">  30 </td> <td align="right">   2 </td> </tr>
##   <tr> <td align="right"> 2002 </td> <td align="right">  52 </td> <td align="right">  25 </td> <td align="right">  33 </td> <td align="right">  30 </td> <td align="right">   2 </td> </tr>
##   <tr> <td align="right"> 2007 </td> <td align="right">  52 </td> <td align="right">  25 </td> <td align="right">  33 </td> <td align="right">  30 </td> <td align="right">   2 </td> </tr>
##    </table>

Version control (optional)

Overview

At a basic level, a simple principle is to have version numbers for all your work: code, datasets, manuscripts. Whenever you make a change to a dataset, increment the version number. For code and manuscripts, increment when you make substantial changes or have obvious breakpoints in your workflow.

However, this is a hassle to do manually. Instead of manually trying to keep track of what changes you’ve made to code, data, and documents, you can use software to help you manage the process. This has several benefits:

Git and GitHub

Git is a popular tool for version control. Git is based around the notion of a repository, which is basically a version-controlled project directory. Many people use it with the GitHub, GitLab, or Bitbucket online hosting services for repositories.

In the introductory material, we’ve already seen how to get a copy of a GitHub repository on your local machine.

As you’re gathering by now, I’ve used Git and GitHub to manage all the content for this workshop.

Making changes to a repository

We’ll go through a short example of making changes to the r-bootcamp-fall-2022 repository. In this case you don’t have permission to make changes so you’ll just have to follow along as I do it. However, you could start your own repository and then you’d be able to do similar things.

Note that there are graphical interfaces to Git that you might want to check out, but here I’m just going to do it from the command line on my Mac.

The basic notion we need is a commit. As we make changes to our files, we want to commit those changes to the repository regularly. A commit is a set of changes recorded with Git. We will often then push those changes to a remote copy of the repository, such as on GitHub.

Here’s a basic workflow:

  1. Add a file to, or make changes to a file in, a repository
  2. Commit the changes
  3. Push the changes to the remote version of the repository

Here’s how this would look from the command line (this won’t look right in the slides version of the HTML):

git add myfile
# make changes to mycode.R
git commit -am 'added myfile and fixed bug in mycode.R'
git push

The changes are then available for anyone to pull from the remote repository, using git pull or a graphical interface, such as RStudio’s tools for pulling changes to your machine, discussed in the GitHub slide in module 0.

Getting R help online (optional)

Mailing lists

There are several mailing lists that have lots of useful postings. In general if you have an error, others have already posted about it.

If you are searching you often want to search for a specific error message. Remember to use double quotes around your error message so it is not broken into individual words by the search engine.

Posting your own questions

The main rule of thumb is to do your homework first to make sure the answer is not already available on the mailing list or in other documentation. Some of the folks who respond to mailing list questions are not the friendliest so it helps to have a thick skin, even if you have done your homework. On the plus side, they are very knowledgeable and include the world’s foremost R experts/developers.

Here are some guidelines for posting to one of the R mailing lists: https://www.r-project.org/posting-guide.html

sessionInfo() is a function that will give information about your R version, OS, etc., that you can include in your posting.

You also want to include a short, focused, reproducible example of your problem that others can run.
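As a rough sketch of what such an example might look like, the code below uses a few made-up data values defined inline, so that anyone can paste it into R and run it without needing your files:

```r
## Minimal reproducible example: made-up data, nothing read from disk
d <- data.frame(x = 1:5, y = c(2.1, 3.9, 6.2, 8.1, 9.8))
fit <- lm(y ~ x, data = d)
coef(fit)

R.version.string   # include this (or the full sessionInfo() output) in your post
```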

Breakout

Basics

  1. Make sure you are able to install packages from CRAN. E.g., try to install lmtest.

  2. Figure out what your current working directory is.

Using the ideas

  1. Put the data/cpds.csv file in some other directory on your computer, such as Downloads. Use setwd() to set your working directory to be that directory. Read the file in using read.csv(). Now use setwd() to point to a different directory such as Desktop. Write the data frame out to a file without any row names and without quotes on the character strings.

  2. Make a plot with the gapminder data. Save it as a PDF in Desktop. Now see what happens if you set the width and height arguments to be very small and see how it affects the resulting PDF. Do the same but setting width and height to be very large.

  3. Figure out where (what directory) the graphics package is stored on your machine. Is it the same as where the fields package is stored?

Advanced

  1. Load the spam package. Note the message about backsolve() being masked from package:base. Now if you enter backsolve, you’ll see the code associated with the version of backsolve() provided by the spam package. Now enter base::backsolve and you’ll see the code for the version of backsolve() provided by base R. Explain why typing backsolve shows the spam version rather than the base version.