R bootcamp, Module 3: Working with objects and data

August 2022, UC Berkeley

Chris Paciorek

Lists

Collections of disparate or complicated objects

myList <- list(stuff = 3, mat = matrix(1:4, nrow = 2), 
   moreStuff = c("china", "japan"), list(5, "bear"))
myList
## $stuff
## [1] 3
## 
## $mat
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## $moreStuff
## [1] "china" "japan"
## 
## [[4]]
## [[4]][[1]]
## [1] 5
## 
## [[4]][[2]]
## [1] "bear"
myList[[3]] # result is not (usually) a list (unless you have nested lists)
## [1] "china" "japan"
identical(myList[[3]], myList$moreStuff)
## [1] TRUE
myList$moreStuff[2]
## [1] "japan"
myList[[4]][[2]]
## [1] "bear"
myList[1:3] # subset of a list is a list
## $stuff
## [1] 3
## 
## $mat
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## $moreStuff
## [1] "china" "japan"
myList$newOne <- 'more weird stuff'
names(myList)
## [1] "stuff"     "mat"       "moreStuff" ""          "newOne"

Lists can be used as vectors of complicated objects. E.g., suppose you have a linear regression for each value of a stratifying variable. You could have a list of regression fits. Each regression fit will itself be a list, so you’ll have a list of lists.
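
As a minimal sketch of the list-of-fits idea, the code below fits one regression per stratum. It uses R's built-in mtcars data rather than the bootcamp data so that it is self-contained; the stratifying variable (cyl) and the model (mpg ~ wt) are assumptions chosen only for illustration.

```r
# Split the built-in mtcars data by number of cylinders (the stratifying variable)
byCyl <- split(mtcars, mtcars$cyl)

# Fit one linear regression per stratum; the result is a list of fits
fits <- lapply(byCyl, function(d) lm(mpg ~ wt, data = d))

names(fits)           # one element per stratum: "4" "6" "8"
is.list(fits[["4"]])  # each fit is itself a list, so this is a list of lists

# Pull the slope out of every fit in one go
slopes <- sapply(fits, function(f) coef(f)[["wt"]])
slopes
```

Because the fits live in a list, the same extraction code works however many strata there are.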

Lists: quick quiz

POLL 3A: How would you extract “china” from this list?

(respond at https://pollev.com/chrispaciorek428)

myList <- list(stuff = 3, mat = matrix(1:4, nrow = 2), 
   moreStuff = c("china", "japan"), list(5, "bear"))
  1. myList$moreStuff[1]
  2. myList$moreStuff[[1]]
  3. myList[[1]]
  4. myList[[3]][2]
  5. myList[[3]][1]
  6. myList[3][1]
  7. myList[['moreStuff']][1]

Data frames

A review from Module 1…

class(gapminder)
## [1] "tbl_df"     "tbl"        "data.frame"
head(gapminder)
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
str(gapminder)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

Data frames are (special) lists!

is.list(gapminder)
## [1] TRUE
length(gapminder)
## [1] 6
gapminder[[3]][1:5]
## [1] 1952 1957 1962 1967 1972
lapply(gapminder, class) 
## $country
## [1] "factor"
## 
## $continent
## [1] "factor"
## 
## $year
## [1] "integer"
## 
## $lifeExp
## [1] "numeric"
## 
## $pop
## [1] "integer"
## 
## $gdpPercap
## [1] "numeric"

lapply() is a function used on lists; it works here to apply the class() function to each element of the list, which in this case is each field/column.
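
A small sketch of the same idea on a made-up data frame (the column names here are hypothetical). It also shows sapply(), which does the same looping as lapply() but tries to simplify the result to a vector.

```r
# A small made-up data frame (not part of the bootcamp data)
df <- data.frame(id = 1:3, name = c("a", "b", "c"), score = c(1.5, 2.5, 3.5),
                 stringsAsFactors = FALSE)

colClassesList <- lapply(df, class)  # a list with one element per column
colClassesVec  <- sapply(df, class)  # simplified to a named character vector

is.list(colClassesList)
colClassesVec  # id is "integer", name is "character", score is "numeric"
```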

But lists are also vectors!

length(gapminder)
## [1] 6
someFields <- gapminder[c(3,5)]
head(someFields)
## # A tibble: 6 × 2
##    year      pop
##   <int>    <int>
## 1  1952  8425333
## 2  1957  9240934
## 3  1962 10267083
## 4  1967 11537966
## 5  1972 13079460
## 6  1977 14880372
identical(gapminder[c(3,5)], gapminder[ , c(3,5)])
## [1] TRUE

In general the placement of commas in R is crucial, but here, two different operations give the same result because of the underlying structure of data frames.
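
To see where the comma does matter, here is a hedged sketch using a made-up base R data frame and the matrix built from it. (Tibbles such as the gapminder object above behave somewhat differently from base data frames when a single column is selected, so this example sticks to base R.)

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# Selecting several columns: list-style and matrix-style indexing agree
identical(df[c(1, 3)], df[, c(1, 3)])

# Selecting one column: the two forms differ for a base data frame
class(df[2])     # "data.frame" (list-style: still a data frame)
class(df[, 2])   # "integer"    (matrix-style: dropped to a vector)

# For a matrix the difference is larger still
m <- as.matrix(df)
m[2]     # the 2nd element of the underlying vector
m[, 2]   # the entire 2nd column
```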

Matrices

If you need to do numeric calculations on an entire non-vector object (dimension > 1), you generally want to use matrices and arrays, not data frames.

mat <- matrix(rnorm(12), nrow = 3, ncol = 4)
mat
##              [,1]       [,2]       [,3]       [,4]
## [1,] -0.997563234 -1.4843480 0.67704596 -0.6704612
## [2,]  0.001128528 -0.2429397 0.01286389 -0.5274799
## [3,]  0.667443893 -0.6124777 1.34872524  0.7868177
# vectorized calcs work with matrices too
mat*4
##              [,1]       [,2]       [,3]      [,4]
## [1,] -3.990252937 -5.9373919 2.70818386 -2.681845
## [2,]  0.004514113 -0.9717588 0.05145555 -2.109920
## [3,]  2.669775572 -2.4499108 5.39490097  3.147271
mat <- cbind(mat, 1:3)
mat
##              [,1]       [,2]       [,3]       [,4] [,5]
## [1,] -0.997563234 -1.4843480 0.67704596 -0.6704612    1
## [2,]  0.001128528 -0.2429397 0.01286389 -0.5274799    2
## [3,]  0.667443893 -0.6124777 1.34872524  0.7868177    3
# Let's convert the gapminder dataframe to a matrix:
gm_mat <- as.matrix(gapminder[ , c('lifeExp', 'gdpPercap')])
head(gm_mat)
##      lifeExp gdpPercap
## [1,]  28.801  779.4453
## [2,]  30.332  820.8530
## [3,]  31.997  853.1007
## [4,]  34.020  836.1971
## [5,]  36.088  739.9811
## [6,]  38.438  786.1134

Matrices: quick quiz

POLL 3B: Recall that the gapminder data frame has both numeric columns and columns holding character strings (stored as factors). What do you think will happen if we do this:

as.matrix(gapminder)

(respond at https://pollev.com/chrispaciorek428)

  1. it will convert to a matrix with no changes
  2. all numeric columns will be converted to character strings
  3. R will throw an error
  4. all character columns will be converted to numeric values
  5. R will drop some of the columns

Arrays

Arrays are like matrices but can have more or fewer than two dimensions.

arr <- array(rnorm(12), c(2, 3, 4))
arr
## , , 1
## 
##            [,1]      [,2]      [,3]
## [1,]  0.8342853 1.0857439 0.2890717
## [2,] -0.4741951 0.3268233 0.5573720
## 
## , , 2
## 
##           [,1]      [,2]       [,3]
## [1,] 0.4952659 -1.309363  0.8070183
## [2,] 0.8302991 -1.280746 -0.4620758
## 
## , , 3
## 
##            [,1]      [,2]      [,3]
## [1,]  0.8342853 1.0857439 0.2890717
## [2,] -0.4741951 0.3268233 0.5573720
## 
## , , 4
## 
##           [,1]      [,2]       [,3]
## [1,] 0.4952659 -1.309363  0.8070183
## [2,] 0.8302991 -1.280746 -0.4620758
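
Note that the output above repeats itself: slices 1 and 3 are the same, as are slices 2 and 4, because only 12 values were supplied for 24 cells and R recycled them. The sketch below, using integers so the pattern is easy to see, shows array indexing and the same recycling behavior.

```r
# A 2 x 3 x 4 array filled with 1..24 (column-major, first index varies fastest)
arr <- array(1:24, dim = c(2, 3, 4))
dim(arr)          # 2 3 4
arr[1, 2, 3]      # 15
dim(arr[, , 1])   # a single slice is a 2 x 3 matrix

# Supplying too few values silently recycles them
short <- array(1:12, dim = c(2, 3, 4))
identical(short[, , 1], short[, , 3])  # TRUE
```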

Attributes

Objects have attributes.

attributes(mat)
## $dim
## [1] 3 5
rownames(mat) <- c('first', 'middle', 'last')
mat
##                [,1]       [,2]       [,3]       [,4] [,5]
## first  -0.997563234 -1.4843480 0.67704596 -0.6704612    1
## middle  0.001128528 -0.2429397 0.01286389 -0.5274799    2
## last    0.667443893 -0.6124777 1.34872524  0.7868177    3
attributes(mat)
## $dim
## [1] 3 5
## 
## $dimnames
## $dimnames[[1]]
## [1] "first"  "middle" "last"  
## 
## $dimnames[[2]]
## NULL
names(attributes(gapminder))
## [1] "names"     "class"     "row.names"
attributes(gapminder)$names
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
attributes(gapminder)$row.names[1:10]
##  [1]  1  2  3  4  5  6  7  8  9 10

Now let’s do a bit of manipulation and see if you can infer how R represents matrices internally.

Attributes: quick quiz

POLL 3C: Consider our matrix ‘mat’:

(respond at https://pollev.com/chrispaciorek428)

mat <- matrix(1:16, nrow = 4, ncol = 4)
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

Suppose I run this code: mat[4]

What do you think will be returned?

  1. 13
  2. 4
  3. 13, 14, 15, 16
  4. 4, 8, 12, 16
  5. an error
mat[4]
attributes(mat) <- NULL
mat
is.matrix(mat)

Question: What can you infer about what a matrix is in R?

Question: What kind of object are the attributes themselves? How do I check?

Matrices are stored column-major

This is like Fortran, MATLAB, and Julia, but unlike C or Python (NumPy).

mat <- matrix(1:12, 3, 4)
mat
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
vals <- c(mat)

You can go smoothly back and forth between a matrix (or an array) and a vector:

identical(mat, matrix(vals, 3, 4))
## [1] TRUE
identical(mat, matrix(vals, 3, 4, byrow = TRUE))
## [1] FALSE

Mixing up column-wise and row-wise filling (for example, assuming R fills a matrix row by row) is a common cause of bugs!
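
A short sketch of the difference, using small integer values so the fill order is visible:

```r
vals <- 1:12

mCol <- matrix(vals, 3, 4)                # default: fill column by column
mRow <- matrix(vals, 3, 4, byrow = TRUE)  # fill row by row instead

mCol[2, 1]   # 2: the second value goes down the first column
mRow[2, 1]   # 5: the second row starts after four values

# To recover the original order from the row-filled matrix, transpose first
identical(c(t(mRow)), vals)  # TRUE
```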

Missing values and other special values

Since it was designed by statisticians, R handles missing values very well relative to other languages.

NA is a missing value

vec <- rnorm(12)
vec[c(3, 5)] <- NA
vec
##  [1] -1.04710949 -0.25433306          NA  1.30207420          NA  0.19756982
##  [7] -1.44054992  0.76560416  0.15789745 -0.04049116 -1.03669646 -1.51556459
length(vec)
## [1] 12
sum(vec)
## [1] NA
sum(vec, na.rm = TRUE)
## [1] -2.911599
hist(vec)

is.na(vec)
##  [1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Be careful because many R functions won’t warn you that they are ignoring the missing values.
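
As a hedged illustration on a made-up vector: table() is one function that drops missing values without saying so unless you ask it to keep them, so it is worth counting the NAs yourself.

```r
x <- c(1, NA, 3)

mean(x)                 # NA: the missing value propagates
mean(x, na.rm = TRUE)   # 2
sum(is.na(x))           # 1: how many values are missing

tabDefault <- table(x)                   # silently omits the NA
tabWithNA  <- table(x, useNA = "ifany")  # includes a count for NA
length(tabDefault)
length(tabWithNA)
```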

To infinity and beyond

big <- 1e500 
big
## [1] Inf
big + 7
## [1] Inf

NaN stands for Not a Number

sqrt(-5)
## Warning in sqrt(-5): NaNs produced
## [1] NaN
big - big
## [1] NaN
1/0
## [1] Inf
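
The special values can be told apart with is.finite(), is.na() and is.nan(). A small sketch on a made-up vector:

```r
vals <- c(1, Inf, -Inf, NA, NaN)

is.finite(vals)  # TRUE only for the ordinary number
is.na(vals)      # TRUE for both NA and NaN
is.nan(vals)     # TRUE only for NaN

0 / 0    # NaN
-1 / 0   # -Inf
```

Note that is.na() treats NaN as missing, while is.nan() does not treat NA as NaN.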

NULL

vec <- c(vec, NULL) 
vec
##  [1] -1.04710949 -0.25433306          NA  1.30207420          NA  0.19756982
##  [7] -1.44054992  0.76560416  0.15789745 -0.04049116 -1.03669646 -1.51556459
length(vec)
## [1] 12
a <- NULL
a + 7
## numeric(0)
a[3, 4]
## NULL
is.null(a)
## [1] TRUE
myList <- list(a = 7, b = 5)
myList$a <- NULL  # works for data frames too
myList
## $b
## [1] 5

NA can hold a place but NULL cannot. NULL is useful for having a function argument default to ‘nothing’. See help(crossprod), which can compute either X^{\top}X or X^{\top}Y.
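
Here is a minimal sketch of the NULL-as-default pattern. The function name myCrossprod is hypothetical and the body is only an illustration of the idea, not how crossprod() is actually implemented.

```r
# Hypothetical helper: computes t(x) %*% x when y is not supplied,
# and t(x) %*% y when it is
myCrossprod <- function(x, y = NULL) {
  if (is.null(y)) {
    y <- x   # 'nothing' was passed, so fall back to x
  }
  t(x) %*% y
}

X <- matrix(1:6, nrow = 3)
Y <- matrix(c(1, 0, 2, 1, 1, 0), nrow = 3)

all.equal(myCrossprod(X), crossprod(X))
all.equal(myCrossprod(X, Y), crossprod(X, Y))
```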

Missing values: quick quiz

POLL 3D

(just respond in your head; I won’t collect the answers online)

Question 1: Consider the following vector:

vec <- c(3, NA, 7)

What is vec[2]:

  1. NA
  2. 7

Question 2: Consider this vector:

vec <- c(3, NULL, 7)

What is vec[2]:

  1. NULL
  2. NA
  3. 7
  4. 3

Question 3: Consider this list:

mylist <- list(3, NULL, 7)

What is mylist[[2]]:

  1. 7
  2. NULL
  3. NA
  4. 3

Question 4: Consider this code:

mylist <- list(3, 5, 7)
mylist[[2]] <- NULL

What is length(mylist):

  1. 3
  2. 2
  3. 1

Logical vectors and boolean arithmetic

gapminder2007 <- gapminder[gapminder$year == 2007, ]

wealthy <- gapminder2007$gdpPercap > 35000
healthy <- gapminder2007$lifeExp > 75

head(wealthy)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
table(wealthy)
## wealthy
## FALSE  TRUE 
##   130    12
# note the vectorized boolean arithmetic
gapminder2007[wealthy & healthy, ]
## # A tibble: 12 × 6
##    country          continent  year lifeExp       pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Austria          Europe     2007    79.8   8199783    36126.
##  2 Canada           Americas   2007    80.7  33390141    36319.
##  3 Denmark          Europe     2007    78.3   5468120    35278.
##  4 Hong Kong, China Asia       2007    82.2   6980412    39725.
##  5 Iceland          Europe     2007    81.8    301931    36181.
##  6 Ireland          Europe     2007    78.9   4109086    40676.
##  7 Kuwait           Asia       2007    77.6   2505559    47307.
##  8 Netherlands      Europe     2007    79.8  16570613    36798.
##  9 Norway           Europe     2007    80.2   4627926    49357.
## 10 Singapore        Asia       2007    80.0   4553009    47143.
## 11 Switzerland      Europe     2007    81.7   7554661    37506.
## 12 United States    Americas   2007    78.2 301139947    42952.
gapminder2007[wealthy | healthy, ]
## # A tibble: 44 × 6
##    country    continent  year lifeExp      pop gdpPercap
##    <fct>      <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Albania    Europe     2007    76.4  3600523     5937.
##  2 Argentina  Americas   2007    75.3 40301927    12779.
##  3 Australia  Oceania    2007    81.2 20434176    34435.
##  4 Austria    Europe     2007    79.8  8199783    36126.
##  5 Bahrain    Asia       2007    75.6   708573    29796.
##  6 Belgium    Europe     2007    79.4 10392226    33693.
##  7 Canada     Americas   2007    80.7 33390141    36319.
##  8 Chile      Americas   2007    78.6 16284741    13172.
##  9 Costa Rica Americas   2007    78.8  4133884     9645.
## 10 Croatia    Europe     2007    75.7  4493312    14619.
## # … with 34 more rows
## # ℹ Use `print(n = ...)` to see more rows
gapminder2007[wealthy & !healthy, ]
## # A tibble: 0 × 6
## # … with 6 variables: country <fct>, continent <fct>, year <int>,
## #   lifeExp <dbl>, pop <int>, gdpPercap <dbl>
## # ℹ Use `colnames()` to see all variable names
# what am I doing here?
sum(healthy)
## [1] 44
mean(healthy)
## [1] 0.3098592

Question: What do you think R does behind the scenes in order to perform arithmetic on logical vectors?

Converting between different types of objects

You can use the as() family of functions.

ints <- 1:10
as.character(ints)
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
as.numeric(c('3.7', '4.8'))
## [1] 3.7 4.8
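
A few more conversions, as a sketch on made-up values. The last three lines show that c() also converts implicitly, choosing a single type that can represent every element.

```r
as.integer("7")                   # character -> integer
as.logical(c("TRUE", "false"))    # TRUE FALSE

# Strings that are not numbers become NA (with a warning, suppressed here)
converted <- suppressWarnings(as.numeric(c("3.7", "abc")))
converted

class(c(TRUE, 2L))   # "integer"
class(c(1L, 2.5))    # "numeric"
class(c(1, "a"))     # "character"
```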

Be careful: R tries to be helpful and convert between types/classes when it thinks it’s a good idea. Sometimes it is overly optimistic.

indices <- c(1.7, 2.3)
ints[indices]
## [1] 1 2
ints[0.999999999]
## integer(0)

Converting between different types: quick quiz

POLL 3E:

(just respond in your head; I won’t collect the answers online)

Question 1: What do you think this will do?

ints <- 1:5
ints[0.9999]
  1. return an error
  2. return 1
  3. return an empty vector

Question 2: What does the code do when it tries to use 0.9999 to subset?

  1. round the 0.9999 to 1
  2. truncate the 0.9999 to 0
  3. return an error

Factors

## let's read the Gapminder data from a file with a special argument:
gapminder <- read.csv(file.path('..', 'data', 'gapminder-FiveYearData.csv'),
          stringsAsFactors = TRUE) # This was the default before R 4.0
class(gapminder$continent)
## [1] "factor"
head(gapminder$continent) # What order are the factors in?
## [1] Asia Asia Asia Asia Asia Asia
## Levels: Africa Americas Asia Europe Oceania
levels(gapminder[["continent"]])  # note alternate way to get the variable
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
summary(gapminder$continent)
##   Africa Americas     Asia   Europe  Oceania 
##      624      300      396      360       24

(Advanced) Ordering the Factor (optional)

This example is a bit artificial as ‘continent’ doesn’t really have a natural ordering.

gapminder$continent2 <- ordered(gapminder$continent, 
     levels = levels(gapminder$continent)[c(2,1,3,4,5)])

head(gapminder$continent2)
## [1] Asia Asia Asia Asia Asia Asia
## Levels: Americas < Africa < Asia < Europe < Oceania
levels(gapminder$continent2)
## [1] "Americas" "Africa"   "Asia"     "Europe"   "Oceania"
boxplot(lifeExp ~ continent2, data = gapminder)

(Advanced) Reclassifying Factors

students <- factor(c('basic','proficient','advanced','basic', 
      'advanced', 'minimal'))
levels(students)
## [1] "advanced"   "basic"      "minimal"    "proficient"
unclass(students)
## [1] 2 4 1 2 1 3
## attr(,"levels")
## [1] "advanced"   "basic"      "minimal"    "proficient"
students <- factor(c('basic','proficient','advanced','basic', 
      'advanced', 'minimal'))
score <- c(minimal = 65, basic = 75, advanced = 95, proficient = 85) # a named vector
score["advanced"]  # look up by name
## advanced 
##       95
students[3]
## [1] advanced
## Levels: advanced basic minimal proficient
score[students[3]]
## minimal 
##      65
score[as.character(students[3])]
## advanced 
##       95

What went wrong and how did we fix it? Notice how easily this could be a big bug in your code.
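
A closely related trap, sketched on a made-up factor whose levels look like numbers: converting it directly to numeric returns the internal integer codes, not the values shown.

```r
f <- factor(c("10", "20", "10"))

as.numeric(f)                 # 1 2 1: the internal codes
as.numeric(as.character(f))   # 10 20 10: the values we wanted
```

As with the score lookup above, going through as.character() first avoids silently using the codes.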

Strings

R has lots of functionality for character strings. Usually these are stored as vectors of strings, each string of arbitrary length.

chars <- c('hi', 'hallo', "mother's", 'father\'s', "He said, \"hi\"" )
length(chars)
## [1] 5
nchar(chars)
## [1]  2  5  8  8 13
paste("bill", "clinton", sep = " ")  # paste together a set of strings
## [1] "bill clinton"
paste(chars, collapse = ' ')  # paste together things from a vector
## [1] "hi hallo mother's father's He said, \"hi\""
strsplit("This is the R bootcamp", split = " ")
## [[1]]
## [1] "This"     "is"       "the"      "R"        "bootcamp"
countries <- as.character(gapminder2007$country)
substring(countries, 1, 3)
##   [1] "Afg" "Alb" "Alg" "Ang" "Arg" "Aus" "Aus" "Bah" "Ban" "Bel" "Ben" "Bol"
##  [13] "Bos" "Bot" "Bra" "Bul" "Bur" "Bur" "Cam" "Cam" "Can" "Cen" "Cha" "Chi"
##  [25] "Chi" "Col" "Com" "Con" "Con" "Cos" "Cot" "Cro" "Cub" "Cze" "Den" "Dji"
##  [37] "Dom" "Ecu" "Egy" "El " "Equ" "Eri" "Eth" "Fin" "Fra" "Gab" "Gam" "Ger"
##  [49] "Gha" "Gre" "Gua" "Gui" "Gui" "Hai" "Hon" "Hon" "Hun" "Ice" "Ind" "Ind"
##  [61] "Ira" "Ira" "Ire" "Isr" "Ita" "Jam" "Jap" "Jor" "Ken" "Kor" "Kor" "Kuw"
##  [73] "Leb" "Les" "Lib" "Lib" "Mad" "Mal" "Mal" "Mal" "Mau" "Mau" "Mex" "Mon"
##  [85] "Mon" "Mor" "Moz" "Mya" "Nam" "Nep" "Net" "New" "Nic" "Nig" "Nig" "Nor"
##  [97] "Oma" "Pak" "Pan" "Par" "Per" "Phi" "Pol" "Por" "Pue" "Reu" "Rom" "Rwa"
## [109] "Sao" "Sau" "Sen" "Ser" "Sie" "Sin" "Slo" "Slo" "Som" "Sou" "Spa" "Sri"
## [121] "Sud" "Swa" "Swe" "Swi" "Syr" "Tai" "Tan" "Tha" "Tog" "Tri" "Tun" "Tur"
## [133] "Uga" "Uni" "Uni" "Uru" "Ven" "Vie" "Wes" "Yem" "Zam" "Zim"
tmp <- countries
substring(tmp, 5, 10) <- "______"
tmp[1:20]
##  [1] "Afgh______n"            "Alba___"                "Alge___"               
##  [4] "Ango__"                 "Arge_____"              "Aust_____"             
##  [7] "Aust___"                "Bahr___"                "Bang______"            
## [10] "Belg___"                "Beni_"                  "Boli___"               
## [13] "Bosn______ Herzegovina" "Bots____"               "Braz__"                
## [16] "Bulg____"               "Burk______so"           "Buru___"               
## [19] "Camb____"               "Came____"

We can search for patterns in character vectors and replace them (both operations are vectorized!):

indexes <- grep("Korea", countries)
indexes
## [1] 70 71
countries[indexes]
## [1] "Korea, Dem. Rep." "Korea, Rep."
countries2 <- gsub("Korea, Dem. Rep.", "North Korea", countries)
countries2[indexes]
## [1] "North Korea" "Korea, Rep."
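
One caveat: gsub() treats its pattern as a regular expression by default, so each "." in the pattern above matches any character, not just a literal period. The replacement works here, but on other data it could match more than intended. A sketch on made-up strings, using fixed = TRUE to request literal matching:

```r
x <- c("Korea, Dem. Rep.", "Korea, Dem! Rep?")

# As a regular expression, "." matches any character, so both strings match
asRegex <- gsub("Korea, Dem. Rep.", "North Korea", x)

# With fixed = TRUE the pattern is matched literally, so only the first matches
asFixed <- gsub("Korea, Dem. Rep.", "North Korea", x, fixed = TRUE)

asRegex
asFixed
```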

Regular expressions (regex or regexp) (optional)

Some of you may be familiar with regular expressions, a tool for sophisticated pattern matching and replacement in strings. Python and Perl are both used extensively for such text manipulation.

R has a full set of regular expression capabilities available through the grep(), gregexpr(), and gsub() functions (among others - many R functions will work with regular expressions). However, a particularly nice way to make use of this functionality is to use the stringr package, which is more user-friendly than directly using the core R functions.

You can do essentially any regular expression or string manipulation in R.
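
A brief sketch using only base R functions on made-up strings (the stringr package offers equivalents, not shown here):

```r
x <- c("Korea, Rep.", "Congo, Dem. Rep.", "Japan")

# Which strings start with "Ko"?
startsKo <- grepl("^Ko", x)

# Capture the text before and after the comma and swap the two parts
swapped <- sub("^(.*), (.*)$", "\\2 \\1", x)

# Extract every run of digits from a string
digits <- regmatches("a1b22c333", gregexpr("[0-9]+", "a1b22c333"))[[1]]

startsKo   # TRUE FALSE FALSE
swapped    # "Rep. Korea" "Dem. Rep. Congo" "Japan"
digits     # "1" "22" "333"
```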

Subsetting

There are many ways to select subsets in R. The syntax below is useful for vectors, matrices, data frames, arrays and lists.

vec <- gapminder2007$lifeExp
mat <- matrix(1:20, 4, 5)
rownames(mat) <- letters[1:4]
mat
##   [,1] [,2] [,3] [,4] [,5]
## a    1    5    9   13   17
## b    2    6   10   14   18
## c    3    7   11   15   19
## d    4    8   12   16   20

1) by direct indexing

vec[c(3, 5, 12:14)]
## [1] 72.301 75.320 65.554 74.852 50.728
vec[-c(3,5)]
##   [1] 43.828 76.423 42.731 81.235 79.829 75.635 64.062 79.441 56.728 65.554
##  [11] 74.852 50.728 72.390 73.005 52.295 49.580 59.723 50.430 80.653 44.741
##  [21] 50.651 78.553 72.961 72.889 65.152 46.462 55.322 78.782 48.328 75.748
##  [31] 78.273 76.486 78.332 54.791 72.235 74.994 71.338 71.878 51.579 58.040
##  [41] 52.947 79.313 80.657 56.735 59.448 79.406 60.022 79.483 70.259 56.007
##  [51] 46.388 60.916 70.198 82.208 73.338 81.757 64.698 70.650 70.964 59.545
##  [61] 78.885 80.745 80.546 72.567 82.603 72.535 54.110 67.297 78.623 77.588
##  [71] 71.993 42.592 45.678 73.952 59.443 48.303 74.241 54.467 64.164 72.801
##  [81] 76.195 66.803 74.543 71.164 42.082 62.069 52.906 63.785 79.762 80.204
##  [91] 72.899 56.867 46.859 80.196 75.640 65.483 75.537 71.752 71.421 71.688
## [101] 75.563 78.098 78.746 76.442 72.476 46.242 65.528 72.777 63.062 74.002
## [111] 42.568 79.972 74.663 77.926 48.159 49.339 80.941 72.396 58.556 39.613
## [121] 80.884 81.701 74.143 78.400 52.517 70.616 58.420 69.819 73.923 71.777
## [131] 51.542 79.425 78.242 76.384 73.747 74.249 73.422 62.698 42.384 43.487
gapminder[c(2,4), 5]
## [1] 30.332 34.020
gapminder[c(2,4), 'lifeExp']
## [1] 30.332 34.020
## Advanced: subset using a 2-column matrix of indices:
rowInd <- c(1, 3, 4)
colInd <- c(2, 2, 1)
elemInd <- cbind(rowInd, colInd)
elemInd
##      rowInd colInd
## [1,]      1      2
## [2,]      3      2
## [3,]      4      1
gapminder[elemInd]
## [1] "1952"        "1962"        "Afghanistan"

2) by a vector of logicals

wealthy <- gapminder$gdpPercap > 50000
gapminder$gdpPercap[wealthy]
## [1] 108382.35 113523.13  95458.11  80894.88 109347.87  59265.48
gapminder[wealthy, ]
##     country year     pop continent lifeExp gdpPercap continent2
## 853  Kuwait 1952  160000      Asia  55.565 108382.35       Asia
## 854  Kuwait 1957  212846      Asia  58.033 113523.13       Asia
## 855  Kuwait 1962  358266      Asia  60.470  95458.11       Asia
## 856  Kuwait 1967  575003      Asia  64.624  80894.88       Asia
## 857  Kuwait 1972  841934      Asia  67.712 109347.87       Asia
## 858  Kuwait 1977 1140357      Asia  69.343  59265.48       Asia

What happened in the last subsetting operation?

3) by a vector of names

mat[c('a', 'd', 'a'), ]
##   [,1] [,2] [,3] [,4] [,5]
## a    1    5    9   13   17
## d    4    8   12   16   20
## a    1    5    9   13   17

4) using subset()

subset(gapminder, gdpPercap > 50000)
##     country year     pop continent lifeExp gdpPercap continent2
## 853  Kuwait 1952  160000      Asia  55.565 108382.35       Asia
## 854  Kuwait 1957  212846      Asia  58.033 113523.13       Asia
## 855  Kuwait 1962  358266      Asia  60.470  95458.11       Asia
## 856  Kuwait 1967  575003      Asia  64.624  80894.88       Asia
## 857  Kuwait 1972  841934      Asia  67.712 109347.87       Asia
## 858  Kuwait 1977 1140357      Asia  69.343  59265.48       Asia
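
subset() and logical indexing give the same result above, but they differ when the condition involves missing values. A hedged sketch on a made-up data frame (the column names are hypothetical):

```r
df <- data.frame(country = c("A", "B", "C"), gdp = c(60000, NA, 1000),
                 stringsAsFactors = FALSE)

viaBracket <- df[df$gdp > 50000, ]         # an NA condition yields a row of NAs
viaSubset  <- subset(df, gdp > 50000)      # an NA condition is treated as FALSE
viaWhich   <- df[which(df$gdp > 50000), ]  # which() also drops the NA case

nrow(viaBracket)  # 2
nrow(viaSubset)   # 1
nrow(viaWhich)    # 1
```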

5) using dplyr tools such as filter() and select() – more in Module 6

Assignment into subsets

We can assign into subsets by using similar syntax, as we saw with vectors.

vec <- rnorm(20)
vec[c(3, 5, 12:14)] <- 1:5
vec
##  [1] -0.54609384  0.41276992  1.00000000 -0.08751844  2.00000000 -0.04387560
##  [7] -1.71994174 -1.52028202  0.30287244 -0.57162163  0.12997778  3.00000000
## [13]  4.00000000  5.00000000  0.38895028  0.91324059 -0.56111335  0.65813862
## [19] -0.86476739  1.43320661
mat <- matrix(rnorm(6*5), nrow = 6)
mat[2, 3:5] <- rnorm(3)
mat
##           [,1]        [,2]         [,3]       [,4]       [,5]
## [1,] 0.1133369  1.42521557  1.146829016  3.8904220 -1.9544089
## [2,] 0.3710806 -0.28580184 -0.008796788  0.9700758 -0.9604171
## [3,] 0.6533810 -0.31371728  0.667937581  0.7597577 -1.2873426
## [4,] 2.6515761 -0.02834754 -0.971263419 -0.7769350  0.4070947
## [5,] 0.7967564  1.24422818 -1.651554999  0.8570243 -1.3529498
## [6,] 0.8221053 -0.80660057  0.728385214  0.9393530  1.2343110
mat[mat[,1] > 0, ] <- -Inf
mat
##      [,1] [,2] [,3] [,4] [,5]
## [1,] -Inf -Inf -Inf -Inf -Inf
## [2,] -Inf -Inf -Inf -Inf -Inf
## [3,] -Inf -Inf -Inf -Inf -Inf
## [4,] -Inf -Inf -Inf -Inf -Inf
## [5,] -Inf -Inf -Inf -Inf -Inf
## [6,] -Inf -Inf -Inf -Inf -Inf
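
The same ideas on small made-up objects, where the result is easy to check by eye:

```r
vec <- c(5, -2, 7, -1)
vec[vec < 0] <- 0            # a single value is recycled to every selected element
vec                          # 5 0 7 0

m <- matrix(1:6, nrow = 2)
m[m %% 2 == 0] <- 0L         # a logical matrix selects individual elements
m[1, ] <- c(10L, 20L, 30L)   # replace an entire row
m
```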

Subsetting: quick quiz

POLL 3F: Suppose I want to select the 3rd elements from the 2nd and 4th columns of a matrix or dataframe. Which syntax will work?

(respond at https://pollev.com/chrispaciorek428)

Here’s a test matrix:

mat <- matrix(1:16, nrow = 4, ncol = 4)
  1. mat[3, (2, 4)]
  2. mat[c(FALSE, FALSE, TRUE, FALSE), c(FALSE, TRUE, FALSE, TRUE)]
  3. mat[c(FALSE, FALSE, TRUE, FALSE), c(2, 4)]
  4. mat[3, c(2, 4)]
  5. mat(3, c(2, 4))
  6. mat[3, ][c(2, 4)]
  7. mat[ , c(2, 4)][3, ]
  8. mat[ , c(2, 4)][3]
  9. mat[c(2, 4)][3, ]

POLL 3F: (Advanced) One of those answers won’t work with a matrix but will work with a dataframe. Which one?

  1. mat[3, (2, 4)]
  2. mat[c(FALSE, FALSE, TRUE, FALSE), c(FALSE, TRUE, FALSE, TRUE)]
  3. mat[c(FALSE, FALSE, TRUE, FALSE), c(2, 4)]
  4. mat[3, c(2, 4)]
  5. mat(3, c(2, 4))
  6. mat[3, ][c(2, 4)]
  7. mat[ , c(2, 4)][3, ]
  8. mat[ , c(2, 4)][3]
  9. mat[c(2, 4)][3, ]

Breakout

Basics

  1. Extract the 5th row from the gapminder dataset.

  2. Extract the last row from the gapminder dataset.

  3. Count the number of gdpPercap values greater than 50000 in the gapminder dataset.

  4. Set all of the gdpPercap values greater than 50000 to NA. You should probably first copy the gapminder object and work on the copy so that the dataset is unchanged (or just read the data into R again afterwards to get a clean copy).

  5. Consider the first row of the gapminder dataset, which has Afghanistan for 1952. How do I create a string “Afghanistan-1952” using gapminder$country[1] and gapminder$year[1]?

Using the ideas

  1. Create a character string using paste() that tells the user how many rows there are in the data frame - do this programmatically such that it would work for any data frame regardless of how many rows it has. The result should look like this: “There are 1704 rows in the dataset”

  2. If you didn’t do it this way already in problem #2, extract the last row from the gapminder dataset without typing the number ‘1704’.

  3. Create a boolean vector indicating if the life expectancy is greater than 75 and the gdpPercap is less than 10000 and calculate the proportion of all the records these represent.

  4. Use that vector to create a new data frame that is a subset of the original data frame.

  5. Consider the attributes of the gapminder dataset. What kind of R object is the set of attributes?

Advanced

  1. Create row names for the data frame based on concatenating the Continent, Country, and Year fields.