R bootcamp, Module 4: Calculations

August 2022, UC Berkeley

Alan Aw

Vectorized calculations and comparisons

At the core of R is the idea of doing calculations on entire vectors.

## Vectorized arithmetic
gdpTotal <- gapminder$gdpPercap * gapminder$pop

## Vectorized comparisons
wealthy <- gapminder2007$gdpPercap >= 30000

vec1 <- rnorm(5)
vec2 <- rnorm(5)
vec1 > vec2
## [1] FALSE FALSE  TRUE  TRUE FALSE
## Vectorized boolean operations
vec1 == vec2
## [1] FALSE FALSE FALSE FALSE FALSE
vec1 != vec2
## [1] TRUE TRUE TRUE TRUE TRUE
## careful: 
vec1 = vec2
identical(vec1, vec2)
## [1] TRUE
## using 'or'
gdpSubset >= 100000 | gdpSubset <= 1000
## Error in eval(expr, envir, enclos): object 'gdpSubset' not found
## using 'and'
gapminder$gdpPercap[1:10] >= 100000 & gapminder$continent[1:10] == "Asia"
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Recycling

An important related concept is recycling: what R does when you operate on vectors of different lengths.

vec10 <- sample(1:10, 10, replace = TRUE)
vec3 <- sample(1:10, 3, replace = TRUE)
vec5 <- sample(1:10, 5, replace = TRUE)
vec10
##  [1]  4  4  2  2  9  9  5  1 10  7
vec3
## [1] 8 6 9
vec5
## [1] 7 5 5 7 6
vec10 + vec5
##  [1] 11  9  7  9 15 16 10  6 17 13
vec10 + vec3
## Warning in vec10 + vec3: longer object length is not a multiple of shorter
## object length
##  [1] 12 10 11 10 15 18 13  7 19 15

Question: Tell me what’s going on. What choices were made by the R developers?
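A deterministic example (toy values, not from gapminder) can help you check your answer:

```r
c(1, 2, 3, 4, 5, 6) + c(10, 20)
## [1] 11 22 13 24 15 26
```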

Vectorized calculations: quick quiz

POLL 4A: 2^3 is raising 2 to the power 3, so you get 8.

(respond at https://pollev.com/chrispaciorek428)

What will this return?

(1:4)^(1:4)

  1. an error
  2. 1, 2, 3, 4
  3. 1, 4, 27, 256
  4. 1, 4, 9, 16
  5. 1, 16, 81, 256

POLL 4B: What will this return?

vec1 <- c(1,2,3,4)
vec2 <- c(1,2)
vec1 + vec2
  1. 2, 4
  2. 2, 4, 5, 6
  3. 2, 4, 4, 6
  4. an error

Why vectorize?

As we’ve seen, R has many functions that allow you to operate on each element of a vector all at once. Here’s an example of a statistical calculation with simulated values.

Imagine how this code would look if written using a loop, or three separate loops.

vals <- rnorm(1000)
chi2vals <- vals^2
chi2_df1000 <- sum(chi2vals)

Advantages: vectorized code is concise, easier to read, and usually much faster than the equivalent loop.

Sometimes there are surprises in terms of what is fast, as well as tricks for vectorizing things in unexpected ways:

vals <- rnorm(1e6)
system.time(trunc <- ifelse(vals > 0, vals, 0))
##    user  system elapsed 
##   0.026   0.008   0.034
system.time(vals <- vals * (vals > 0))
##    user  system elapsed 
##   0.004   0.000   0.004

Question: What am I doing with vals * (vals > 0) ? What happens behind the scenes in R?

If you use a trick like this, having a comment in your code is a good idea.

Vectorized functions

Lots of functions in R are vectorized (i.e., they can take a single value or a vector as an input argument), such as some we’ve already used.

tmp <- as.character(gapminder$year)
gapminder$year2 <- substring(tmp, 3, 4)
head(gapminder$year2)
## [1] "52" "57" "62" "67" "72" "77"

Question: The code above runs, and the syntax is perfectly good R syntax, but in terms of what it does, there is a bug in it. See if you can spot it.
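More generally, most string and math functions are vectorized too; a quick illustration with made-up inputs:

```r
nchar(c("Chad", "Iceland", "Peru"))   # characters per element
## [1] 4 7 4
sqrt(c(1, 4, 9))                      # square root of each element
## [1] 1 2 3
```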

Vectorized vector/matrix calculations

Vectorized arithmetic

Recall that +, -, *, and / do vectorized calculations:

A <- matrix(1:9, 3)
B <- matrix(seq(4,36, by = 4), 3)

A + B
##      [,1] [,2] [,3]
## [1,]    5   20   35
## [2,]   10   25   40
## [3,]   15   30   45
A + B[ , 1]
##      [,1] [,2] [,3]
## [1,]    5    8   11
## [2,]   10   13   16
## [3,]   15   18   21
A * B
##      [,1] [,2] [,3]
## [1,]    4   64  196
## [2,]   16  100  256
## [3,]   36  144  324
A * B[ , 1]
##      [,1] [,2] [,3]
## [1,]    4   16   28
## [2,]   16   40   64
## [3,]   36   72  108

Linear algebra: Matrix/vector multiplication

A %*% B[ , 1]
##      [,1]
## [1,]  120
## [2,]  144
## [3,]  168
A %*% B
##      [,1] [,2] [,3]
## [1,]  120  264  408
## [2,]  144  324  504
## [3,]  168  384  600
identical(t(A)%*%A, crossprod(A))
## [1] TRUE

Linear algebra (optional)

R can do essentially any linear algebra you need. It uses system-level packages called BLAS (basic linear algebra subroutines) and LAPACK (linear algebra package). Note that these calculations will be essentially as fast as if you wrote C code because R just calls C and Fortran routines to do the calculations.

The BLAS that comes with R is fairly slow. It’s possible to use a faster BLAS, as well as one that uses multiple cores automatically. This can in some cases give you an order of magnitude speedup if your work involves a lot of matrix manipulations/linear algebra. More details in Module 10.

(Advanced) Matrix decompositions

Here are some examples of common matrix decompositions: Cholesky decomposition, eigenvalues/eigenvectors, and SVD. These all use BLAS+LAPACK.

## next 3 lines generate a positive definite matrix
library(fields)
times <- seq(0, 1, length = 100)
R <- exp(-rdist(times) / 0.2) # a correlation matrix
######################################################
e <- eigen(R)
range(e$values)
## [1]  0.02525338 32.85537225
e$vectors[ , 1]
##   [1] 0.05195413 0.05448567 0.05698864 0.05946173 0.06190363 0.06431308
##   [7] 0.06668879 0.06902954 0.07133409 0.07360123 0.07582978 0.07801856
##  [13] 0.08016643 0.08227226 0.08433494 0.08635341 0.08832659 0.09025345
##  [19] 0.09213298 0.09396420 0.09574615 0.09747789 0.09915851 0.10078713
##  [25] 0.10236291 0.10388500 0.10535262 0.10676499 0.10812137 0.10942106
##  [31] 0.11066337 0.11184765 0.11297327 0.11403965 0.11504623 0.11599249
##  [37] 0.11687791 0.11770205 0.11846447 0.11916476 0.11980256 0.12037754
##  [43] 0.12088940 0.12133786 0.12172270 0.12204370 0.12230071 0.12249358
##  [49] 0.12262222 0.12268655 0.12268655 0.12262222 0.12249358 0.12230071
##  [55] 0.12204370 0.12172270 0.12133786 0.12088940 0.12037754 0.11980256
##  [61] 0.11916476 0.11846447 0.11770205 0.11687791 0.11599249 0.11504623
##  [67] 0.11403965 0.11297327 0.11184765 0.11066337 0.10942106 0.10812137
##  [73] 0.10676499 0.10535262 0.10388500 0.10236291 0.10078713 0.09915851
##  [79] 0.09747789 0.09574615 0.09396420 0.09213298 0.09025345 0.08832659
##  [85] 0.08635341 0.08433494 0.08227226 0.08016643 0.07801856 0.07582978
##  [91] 0.07360123 0.07133409 0.06902954 0.06668879 0.06431308 0.06190363
##  [97] 0.05946173 0.05698864 0.05448567 0.05195413
sv <- svd(R)
U <- chol(R)

devs <- rnorm(100)
Rinvb <- solve(R, devs)  # R^{-1} b
Rinv <- solve(R) # R^{-1} -- try to avoid this

Pre-allocation

This is slow.

vals <- 0
n <- 50000
system.time({
for(i in 1:n)
      vals <- c(vals, i)
})
##    user  system elapsed 
##   2.945   0.007   2.954

The same holds for using rbind(), cbind(), or adding to a list, one element at a time.

Question: Thoughts on why this is so slow? Think about what R might be doing behind the scenes.

Note: This is one area where Python and some other languages handle the situation in a more sophisticated way.

The answer is to pre-allocate memory

This is not so slow. (Please ignore the for-loop hypocrisy and the fact that I could do this as vals <- 1:n.)

n <- 50000
system.time({
vals <- rep(0, n)
# alternatively: vals <- as.numeric(NA); length(vals) <- n
for(i in 1:n)
      vals[i] <- i
})
##    user  system elapsed 
##   0.027   0.001   0.027

Here’s how to pre-allocate an empty list:

vals <- list(); length(vals) <- n
head(vals)
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
## 
## [[5]]
## NULL
## 
## [[6]]
## NULL
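Equivalently, vector() can create the pre-allocated empty list in one step:

```r
n <- 5
vals <- vector("list", n)   # a list of n NULL elements
length(vals)
## [1] 5
```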

apply

Some functions aren’t vectorized, or you may want to use a function on every row or column of a matrix/data frame, every element of a list, etc.

For this we use the apply() family of functions.

mat <- matrix(rnorm(100*1000), nr = 100)
row_min <- apply(mat, MARGIN = 1, FUN = min)
col_max <- apply(mat, MARGIN = 2, FUN = max)

There are actually some even faster specialized functions:

row_mean <- rowMeans(mat)
col_sum <- colSums(mat)

lapply() and sapply()

These are ‘map’ operations that apply a function to each element of a list.

myList <- list(rnorm(3), rnorm(3), rnorm(5))
lapply(myList, min)
## [[1]]
## [1] -0.280493
## 
## [[2]]
## [1] -0.07862098
## 
## [[3]]
## [1] -1.474092
sapply(myList, min)
## [1] -0.28049304 -0.07862098 -1.47409165

Note that we don’t generally want to use apply() on a data frame.
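The reason is that apply() first coerces its argument to a matrix, so a data frame with mixed column types gets converted to a common type (here, character). A small illustration with a toy data frame:

```r
df <- data.frame(num = 1:3, chr = c("a", "b", "c"))
apply(df, 2, class)   # df is coerced to a character matrix first
##         num         chr 
## "character" "character"
```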

Recall that a dataframe is a list, hence this works:

sapply(gapminder, class)
##     country   continent        year     lifeExp         pop   gdpPercap 
##    "factor"    "factor"   "integer"   "numeric"   "integer"   "numeric" 
##       year2 
## "character"

lapply() and sapply() with vectors

You can use lapply() and sapply() on regular vectors, such as vectors of indices, which can come in handy. This is a bit silly but it illustrates the idea:

myfun <- function(i) {
   max(rnorm(100))
}   

out <- lapply(1:6, myfun)
out
## [[1]]
## [1] 2.376375
## 
## [[2]]
## [1] 2.164874
## 
## [[3]]
## [1] 2.816697
## 
## [[4]]
## [1] 1.717506
## 
## [[5]]
## [1] 3.356898
## 
## [[6]]
## [1] 3.28338
## Or, 'in-line' the function:
out <- sapply(1:10, function(x) x^2)
out
##  [1]   1   4   9  16  25  36  49  64  81 100

POLL 4C: What do you think this will return:

(respond at https://pollev.com/chrispaciorek428)

sapply(1:6, function(n) rnorm(n))
  1. a list, like lapply
  2. a vector of 21 (1+2+3+4+5+6) elements
  3. an error

And more apply()s (optional)

There are a bunch of apply() variants (e.g., vapply(), mapply(), tapply()), as well as parallelized versions of them (e.g., mclapply() and parLapply() in the parallel package).
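For instance, mapply() maps a function over corresponding elements of multiple vectors (a minimal sketch with toy inputs):

```r
mapply(function(a, b) a + b, 1:3, c(10, 20, 30))
## [1] 11 22 33
```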

Tabulation

tbl <- table(gapminder$country, gapminder$continent)
head(tbl)
##              
##               Africa Americas Asia Europe Oceania
##   Afghanistan      0        0   12      0       0
##   Albania          0        0    0     12       0
##   Algeria         12        0    0      0       0
##   Angola          12        0    0      0       0
##   Argentina        0       12    0      0       0
##   Australia        0        0    0      0      12
rowSums(tbl)[1:20]
##            Afghanistan                Albania                Algeria 
##                     12                     12                     12 
##                 Angola              Argentina              Australia 
##                     12                     12                     12 
##                Austria                Bahrain             Bangladesh 
##                     12                     12                     12 
##                Belgium                  Benin                Bolivia 
##                     12                     12                     12 
## Bosnia and Herzegovina               Botswana                 Brazil 
##                     12                     12                     12 
##               Bulgaria           Burkina Faso                Burundi 
##                     12                     12                     12 
##               Cambodia               Cameroon 
##                     12                     12
## Here we do a data validity check: all countries should have 12 obs in the same continent
all(rowSums(tbl) == 12)
## [1] TRUE

Discretization

You may need to discretize a continuous variable [or a discrete variable with many levels], e.g., by life expectancy.

We can use cut to do it:

gapminder2007$lifeExpBin <- cut(gapminder2007$lifeExp,
                                breaks = c(0, 40, 50, 60, 70, 75, 80, Inf))
tbl <- table(gapminder2007$continent, gapminder2007$lifeExpBin)
round( prop.table(tbl, margin = 1), 2 )
##           
##            (0,40] (40,50] (50,60] (60,70] (70,75] (75,80] (80,Inf]
##   Africa     0.02    0.33    0.42    0.10    0.12    0.02     0.00
##   Americas   0.00    0.00    0.00    0.12    0.48    0.36     0.04
##   Asia       0.00    0.03    0.06    0.24    0.39    0.18     0.09
##   Europe     0.00    0.00    0.00    0.00    0.27    0.50     0.23
##   Oceania    0.00    0.00    0.00    0.00    0.00    0.00     1.00
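For a self-contained look at what cut() returns (arbitrary toy values): it produces a factor whose levels are the intervals.

```r
x <- c(35, 45, 55, 72, 81)
cut(x, breaks = c(0, 40, 60, 80, Inf))
## [1] (0,40]   (40,60]  (40,60]  (60,80]  (80,Inf] 
## Levels: (0,40] (40,60] (60,80] (80,Inf]
```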

Stratified analyses I

Often we want to do individual analyses within subsets or clusters of our data.

As a first step, we might want to just split our dataset by a stratifying variable (perhaps after using cut to create a categorical variable).

subsets <- split(gapminder,  gapminder$year)
length(subsets)
## [1] 12
dim(subsets[['2007']])
## [1] 142   7
par(mfrow = c(1,2))
plot(lifeExp ~ gdpPercap, data = subsets[['1952']], main = '1952')
abline(h = 0, col = 'grey')
plot(lifeExp ~ gdpPercap, data = subsets[['2007']], main = '2007')
abline(h = 0, col = 'grey')

Obviously, we’d want to iterate to improve that plot given the outlier.

POLL 4D: What kind of object is produced by split, e.g.,

(respond at https://pollev.com/chrispaciorek428)

subsets <- split(gapminder,  gapminder$year)
  1. a dataframe
  2. a list
  3. a list of lists
  4. a list of dataframes
  5. some new kind of object we haven’t encountered yet

Stratified analyses I - map operations

Once you have the stratified dataset you can use lapply to apply a function to each item in the list.

This is known as a ‘map’ operation in other languages.

meanfun <- function(sub) {
   mean(sub$lifeExp)
}   

meanLifeExpByYear <- sapply(subsets, meanfun)
names(subsets)
##  [1] "1952" "1957" "1962" "1967" "1972" "1977" "1982" "1987" "1992" "1997"
## [11] "2002" "2007"
meanLifeExpByYear
##     1952     1957     1962     1967     1972     1977     1982     1987 
## 49.05762 51.50740 53.60925 55.67829 57.64739 59.57016 61.53320 63.21261 
##     1992     1997     2002     2007 
## 64.16034 65.01468 65.69492 67.00742

We’ll talk in detail about writing functions in Module 5.

Stratified analyses II (optional)

Often we want to do individual analyses within subsets or clusters of our data. R has a variety of tools for this; for now we’ll look at aggregate() and by(). These are wrappers of tapply().
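Since aggregate() and by() wrap tapply(), it’s worth seeing the underlying call directly (toy data, chosen for illustration):

```r
vals  <- c(1, 2, 3, 10, 20, 30)
group <- c("a", "a", "a", "b", "b", "b")
tapply(vals, group, median)   # apply median within each group
##  a  b 
##  2 20
```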

aggregate()

gmSmall <- gapminder[ , c('lifeExp', 'gdpPercap')]  # reduce to only numeric columns
aggregate(gmSmall, by = list(year = gapminder$year), FUN = median, na.rm = TRUE) # na.rm not needed here but illustrates use of additional arguments to FUN
##    year lifeExp gdpPercap
## 1  1952 45.1355  1968.528
## 2  1957 48.3605  2173.220
## 3  1962 50.8810  2335.440
## 4  1967 53.8250  2678.335
## 5  1972 56.5300  3339.129
## 6  1977 59.6720  3798.609
## 7  1982 62.4415  4216.228
## 8  1987 65.8340  4280.300
## 9  1992 67.7030  4386.086
## 10 1997 69.3940  4781.825
## 11 2002 70.8255  5319.805
## 12 2007 71.9355  6124.371
aggregate(lifeExp ~ year + continent, data = gapminder, FUN = median)
##    year continent lifeExp
## 1  1952    Africa 38.8330
## 2  1957    Africa 40.5925
## 3  1962    Africa 42.6305
## 4  1967    Africa 44.6985
## 5  1972    Africa 47.0315
## 6  1977    Africa 49.2725
## 7  1982    Africa 50.7560
## 8  1987    Africa 51.6395
## 9  1992    Africa 52.4290
## 10 1997    Africa 52.7590
## 11 2002    Africa 51.2355
## 12 2007    Africa 52.9265
## 13 1952  Americas 54.7450
## 14 1957  Americas 56.0740
## 15 1962  Americas 58.2990
## 16 1967  Americas 60.5230
## 17 1972  Americas 63.4410
## 18 1977  Americas 66.3530
## 19 1982  Americas 67.4050
## 20 1987  Americas 69.4980
## 21 1992  Americas 69.8620
## 22 1997  Americas 72.1460
## 23 2002  Americas 72.0470
## 24 2007  Americas 72.8990
## 25 1952      Asia 44.8690
## 26 1957      Asia 48.2840
## 27 1962      Asia 49.3250
## 28 1967      Asia 53.6550
## 29 1972      Asia 56.9500
## 30 1977      Asia 60.7650
## 31 1982      Asia 63.7390
## 32 1987      Asia 66.2950
## 33 1992      Asia 68.6900
## 34 1997      Asia 70.2650
## 35 2002      Asia 71.0280
## 36 2007      Asia 72.3960
## 37 1952    Europe 65.9000
## 38 1957    Europe 67.6500
## 39 1962    Europe 69.5250
## 40 1967    Europe 70.6100
## 41 1972    Europe 70.8850
## 42 1977    Europe 72.3350
## 43 1982    Europe 73.4900
## 44 1987    Europe 74.8150
## 45 1992    Europe 75.4510
## 46 1997    Europe 76.1160
## 47 2002    Europe 77.5365
## 48 2007    Europe 78.6085
## 49 1952   Oceania 69.2550
## 50 1957   Oceania 70.2950
## 51 1962   Oceania 71.0850
## 52 1967   Oceania 71.3100
## 53 1972   Oceania 71.9100
## 54 1977   Oceania 72.8550
## 55 1982   Oceania 74.2900
## 56 1987   Oceania 75.3200
## 57 1992   Oceania 76.9450
## 58 1997   Oceania 78.1900
## 59 2002   Oceania 79.7400
## 60 2007   Oceania 80.7195
agg <- aggregate(lifeExp ~ year + continent , data = gapminder, FUN = median)
xtabs(lifeExp ~ ., data = agg)
##       continent
## year    Africa Americas    Asia  Europe Oceania
##   1952 38.8330  54.7450 44.8690 65.9000 69.2550
##   1957 40.5925  56.0740 48.2840 67.6500 70.2950
##   1962 42.6305  58.2990 49.3250 69.5250 71.0850
##   1967 44.6985  60.5230 53.6550 70.6100 71.3100
##   1972 47.0315  63.4410 56.9500 70.8850 71.9100
##   1977 49.2725  66.3530 60.7650 72.3350 72.8550
##   1982 50.7560  67.4050 63.7390 73.4900 74.2900
##   1987 51.6395  69.4980 66.2950 74.8150 75.3200
##   1992 52.4290  69.8620 68.6900 75.4510 76.9450
##   1997 52.7590  72.1460 70.2650 76.1160 78.1900
##   2002 51.2355  72.0470 71.0280 77.5365 79.7400
##   2007 52.9265  72.8990 72.3960 78.6085 80.7195

Notice the ‘long’ vs. ‘wide’ formats. You’ll see more about that sort of thing in Module 6.

While you can use aggregate for this, many people use dplyr, as we’ll discuss in Module 6.

by()

aggregate() works fine when the output is univariate, but what about analyses more complicated than computing a median, such as fitting a set of regressions?

out <- by(gapminder, gapminder$year, 
    function(sub) {
      lm(lifeExp ~ log(gdpPercap), data = sub)
    }          
)
length(out)
## [1] 12
summary(out[['2007']])
## 
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.947  -2.661   1.215   4.469  13.115 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.9496     3.8577   1.283    0.202    
## log(gdpPercap)   7.2028     0.4423  16.283   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.122 on 140 degrees of freedom
## Multiple R-squared:  0.6544, Adjusted R-squared:  0.652 
## F-statistic: 265.2 on 1 and 140 DF,  p-value: < 2.2e-16
summary(out[['1952']])
## 
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = sub)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.9571  -5.7319   0.7517   6.5770  13.7361 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -17.8457     5.0668  -3.522 0.000578 ***
## log(gdpPercap)   8.8298     0.6626  13.326  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.146 on 140 degrees of freedom
## Multiple R-squared:  0.5592, Adjusted R-squared:  0.556 
## F-statistic: 177.6 on 1 and 140 DF,  p-value: < 2.2e-16

This is handy, but it’s only slightly simpler syntax than using split() followed by lapply().
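To see the equivalence, here’s the same computation both ways on a toy data frame:

```r
df <- data.frame(g = c("a", "a", "b", "b"), y = c(1, 2, 10, 20))
res_by    <- by(df, df$g, function(sub) mean(sub$y))
res_split <- sapply(split(df, df$g), function(sub) mean(sub$y))
res_by[[1]]
## [1] 1.5
res_split
##    a    b 
##  1.5 15.0
```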

Sorting

sort() applied to a vector does what you expect.

Sorting a matrix or dataframe based on one or more columns is a somewhat manual process, but once you get the hang of it, it’s not bad.
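The key distinction is between sort(), which returns the sorted values, and order(), which returns the indices that would sort the vector:

```r
x <- c(30, 10, 20)
sort(x)
## [1] 10 20 30
order(x)
## [1] 2 3 1
x[order(x)]   # same as sort(x)
## [1] 10 20 30
```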

ord <- order(gapminder$year, gapminder$lifeExp, decreasing = TRUE)
ord[1:5]
## [1]  804  672  696 1488   72
gm_ord <- gapminder[ord, ]
head(gm_ord)
## # A tibble: 6 × 7
##   country          continent  year lifeExp       pop gdpPercap year2
##   <fct>            <fct>     <int>   <dbl>     <int>     <dbl> <chr>
## 1 Japan            Asia       2007    82.6 127467972    31656. 07   
## 2 Hong Kong, China Asia       2007    82.2   6980412    39725. 07   
## 3 Iceland          Europe     2007    81.8    301931    36181. 07   
## 4 Switzerland      Europe     2007    81.7   7554661    37506. 07   
## 5 Australia        Oceania    2007    81.2  20434176    34435. 07   
## 6 Spain            Europe     2007    80.9  40448191    28821. 07

You could of course write your own sort function that uses order(). More in Module 5.

Or, most people probably use dplyr::arrange. More in Module 6.

Joining/Merging Data

We often need to combine data across multiple data frames, merging on common fields (i.e., keys). In database terminology, this is a join operation.

Suppose that our dataset did not have ‘continent’ in it, but that we had a separate data frame that matches country to continent.

# ignore the 'wizard' behind the curtain...
c2c <- unique(gapminder[ , c('country', 'continent')])
gapminderSave <- gapminder
gapminder <- gapminder[ , -which(names(gapminder) == "continent")]

# Now our gapminder dataset doesn't have continent
head(gapminder)
## # A tibble: 6 × 6
##   country      year lifeExp      pop gdpPercap year2
##   <fct>       <int>   <dbl>    <int>     <dbl> <chr>
## 1 Afghanistan  1952    28.8  8425333      779. 52   
## 2 Afghanistan  1957    30.3  9240934      821. 57   
## 3 Afghanistan  1962    32.0 10267083      853. 62   
## 4 Afghanistan  1967    34.0 11537966      836. 67   
## 5 Afghanistan  1972    36.1 13079460      740. 72   
## 6 Afghanistan  1977    38.4 14880372      786. 77

using dplyr’s join functions

The dplyr package (to be discussed in detail in Module 6) provides a variety of join operations like those used in databases.

Now let’s add the continent info in:

head(c2c)
## # A tibble: 6 × 2
##   country     continent
##   <fct>       <fct>    
## 1 Afghanistan Asia     
## 2 Albania     Europe   
## 3 Algeria     Africa   
## 4 Angola      Africa   
## 5 Argentina   Americas 
## 6 Australia   Oceania
head(gapminder)
## # A tibble: 6 × 6
##   country      year lifeExp      pop gdpPercap year2
##   <fct>       <int>   <dbl>    <int>     <dbl> <chr>
## 1 Afghanistan  1952    28.8  8425333      779. 52   
## 2 Afghanistan  1957    30.3  9240934      821. 57   
## 3 Afghanistan  1962    32.0 10267083      853. 62   
## 4 Afghanistan  1967    34.0 11537966      836. 67   
## 5 Afghanistan  1972    36.1 13079460      740. 72   
## 6 Afghanistan  1977    38.4 14880372      786. 77
library(dplyr)

gapminder <- inner_join(gapminder, c2c, by = "country")

dim(gapminderSave)
## [1] 1704    7
dim(gapminder)
## [1] 1704    7
identical(gapminderSave, gapminder)
## [1] FALSE
identical(gapminderSave, gapminder[ , names(gapminderSave)])
## [1] TRUE

You can also do left, right, and full joins.

Using merge (optional)

Alternatively, we could use the base R function, merge().

Now let’s add the continent info in:

gapminder <- gapminder[ , -which(names(gapminder) == "continent")]  ## reset

head(c2c)
## # A tibble: 6 × 2
##   country     continent
##   <fct>       <fct>    
## 1 Afghanistan Asia     
## 2 Albania     Europe   
## 3 Algeria     Africa   
## 4 Angola      Africa   
## 5 Argentina   Americas 
## 6 Australia   Oceania
head(gapminder)
## # A tibble: 6 × 6
##   country      year lifeExp      pop gdpPercap year2
##   <fct>       <int>   <dbl>    <int>     <dbl> <chr>
## 1 Afghanistan  1952    28.8  8425333      779. 52   
## 2 Afghanistan  1957    30.3  9240934      821. 57   
## 3 Afghanistan  1962    32.0 10267083      853. 62   
## 4 Afghanistan  1967    34.0 11537966      836. 67   
## 5 Afghanistan  1972    36.1 13079460      740. 72   
## 6 Afghanistan  1977    38.4 14880372      786. 77
gapminder <- merge(gapminder, c2c, by.x = 'country', by.y = 'country',
                   all.x = TRUE, all.y = FALSE)

dim(gapminderSave)
## [1] 1704    7
dim(gapminder)
## [1] 1704    7
identical(gapminderSave, gapminder)
## [1] FALSE
identical(as.data.frame(gapminderSave), gapminder[ , names(gapminderSave)])
## [1] TRUE

What’s the deal with all.x and all.y? They let you request the equivalent of left, right, and full joins: you tell R whether to keep all the x rows, all the y rows, both, or neither, when some rows in one dataset have no match in the other.
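A toy sketch of these options (the data frames d1 and d2 here are made up for illustration):

```r
d1 <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
d2 <- data.frame(id = c(2, 3, 4), y = c("B", "C", "D"))
nrow(merge(d1, d2))                 # inner join: only ids 2 and 3 match
## [1] 2
nrow(merge(d1, d2, all.x = TRUE))   # left join: all of d1's ids; y is NA for id 1
## [1] 3
nrow(merge(d1, d2, all = TRUE))     # full join: ids 1 through 4
## [1] 4
```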

Breakout

Basics

  1. Create a vector that concatenates the country and year to create a ‘country-year’ variable in a vectorized way using the string processing functions.

  2. Use table() to figure out the number of countries available for each continent.

Using the ideas

  1. Explain the steps of what this code is doing: tmp <- gapminder[ , -which(names(gapminder) == "continent")].

  2. Compute the number of NAs in each column of the gapminder dataset using sapply() and making use of the is.na() function. It’s possible to do this without writing a function (which is a topic we’ll cover in Module 5).

  3. Discretize gdpPercap into some bins and create a gdpPercap_binned variable. Count the number of values in each bin.

  4. Create a boxplot of life expectancy by binned gdpPercap.

  5. Create a dataframe that has the total (i.e., world) population across all the countries for each year.

  6. Merge the info from the previous problem back into the original gapminder dataset. Now plot life expectancy as a function of world population.

  7. Suppose we have two categorical variables and we conduct a hypothesis test of independence. The chi-square statistic is:


\chi^2 = \sum_{i=1}^{n}\sum_{j=1}^{m} \frac{(y_{ij} - e_{ij})^2}{e_{ij}},

where e_{ij} = \frac{y_{i\cdot} y_{\cdot j}}{y_{\cdot \cdot}}, with y_{i\cdot} the sum of the values in the i’th row, y_{\cdot j} the sum of values in the j’th column, and y_{\cdot\cdot} the sum of all the values. Suppose I give you a matrix in R with the y_{ij} values.

You can generate a test matrix as:

y <- matrix(sample(1:10, 12, replace = TRUE), 
nrow = 3, ncol = 4)

Compute the statistic without using any loops.

Advanced

  1. For each combination of year and continent, find the 95th percentile of life expectancy.

  2. Here’s a cool trick to pull off a particular element of a list of lists:

params <- list(a = list(mn = 7, sd = 3), b = list(mn = 6,sd = 1), 
  c = list(mn = 2, sd = 1))
sapply(params, "[[", 1)
## a b c 
## 7 6 2

Explain what that does and why it works.

Hint:

test <- list(5, 7, 3)
test[[2]]
## [1] 7
# `[[`(test, 2)  # need it commented or R Markdown processing messes it up...

# `+`(3, 7)