R Bootcamp, Module 7: Graphics

August 2022, UC Berkeley

Florica Constantine

By way of introduction…

And here’s some motivation - we can produce a plot like this with a few lines of code.

(Compare to the famous gapminder plot.)

Base graphics

The general call for base plot looks something like this:

plot(x = , y = , ...)

Additional parameters can be passed in to customize the plot, for example type, main, xlab, ylab, pch, and col.

More layers can be added to the plot with additional calls to lines, points, text, etc.

# Packages used throughout this module
library(gapminder)
library(dplyr)
library(ggplot2)
gapChina <- gapminder %>% filter(country == "China")
plot(gapChina$year, gapChina$gdpPercap)
plot(gapChina$year, gapChina$gdpPercap, type = "l",
     main = "China GDP over time",
     xlab = "Year", ylab = "GDP per capita") # with updated parameters
points(gapChina$year, gapChina$gdpPercap, pch = 16)
points(x = 1977, y = gapChina$gdpPercap[gapChina$year == 1977],
       col = "red", pch = 16)
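As a hedged sketch of the layering idea above, the following adds a reference line and a text label to the same line plot using abline() and text(). It assumes the gapminder package is installed and uses base subset() so that the block stands alone; gdp1977 is a helper name introduced here only for illustration.

```r
library(gapminder)  # assumed installed; provides the gapminder data frame

gapChina <- subset(gapminder, country == "China")
gdp1977 <- gapChina$gdpPercap[gapChina$year == 1977]  # illustrative helper

# Start with the base layer: a line plot
plot(gapChina$year, gapChina$gdpPercap, type = "l",
     main = "China GDP over time", xlab = "Year", ylab = "GDP per capita")

# Each later call draws on top of the existing plot
abline(v = 1977, lty = 2, col = "gray")         # dashed vertical reference line
points(1977, gdp1977, col = "red", pch = 16)    # highlight one observation
text(1977, gdp1977, labels = "1977", pos = 3)   # label placed above the point
```

Because each call draws on the current device, the order of the calls determines what ends up on top.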

Other plot types in base graphics

There is a variety of other plot types you can make in base graphics.

boxplot(lifeExp ~ year, data = gapminder)
hist(gapminder$lifeExp[gapminder$year == 2007])
plot(density(gapminder$lifeExp[gapminder$year == 2007]))
barplot(gapChina$pop, width = 4, names.arg = gapChina$year, 
                               main = "China population")

Object-oriented plots

Here are two examples:

gap_lm <- lm(lifeExp ~ log(gdpPercap) + year, data = gapminder)

# Calls plotting method for class of the dataset ("data.frame")
plot(gapminder[,c('pop','lifeExp','gdpPercap')])

# Calls plotting method for class of gap_lm object ("lm"), print first two plots only
plot(gap_lm, which=1:2)
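To make the dispatch in the two calls above more concrete, here is a small sketch that inspects the classes involved and looks up the S3 methods that plot() would select. It assumes the gapminder package is installed; lm_method and df_method are names introduced here for illustration.

```r
library(gapminder)  # assumed installed

gap_lm <- lm(lifeExp ~ log(gdpPercap) + year, data = gapminder)

# plot() is an S3 generic: it dispatches on the class of its first argument
class(gap_lm)                        # "lm"
inherits(gapminder, "data.frame")    # TRUE: a tibble is also a data.frame

# Look up the methods that the two plot() calls end up using
lm_method <- getS3method("plot", "lm")
df_method <- getS3method("plot", "data.frame")
is.function(lm_method) && is.function(df_method)
```

Running methods(plot) lists every plotting method registered in the current session, which is a quick way to see which classes have their own plot.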

Pros/cons of base graphics, ggplot2, and lattice

Base graphics is

  1. good for exploratory data analysis and sanity checks

  2. inconsistent in syntax across functions: some take x,y while others take formulas

  3. ugly in its default plotting parameters, and can be difficult to customize

  4. that said, one can do essentially anything in base graphics with some work

ggplot2 is

  1. generally more elegant

  2. more syntactically logical (and therefore simpler, once you learn it)

  3. better at grouping

  4. able to interface with maps

lattice is

  1. faster than ggplot2 (though only noticeable over many and large plots)

  2. simpler than ggplot2 (at first)

  3. perhaps better at trellis plots than ggplot2 (some think this)

  4. able to do 3d plots (but be cautious about using 3d plots)

We’ll focus on ggplot2, as it is very powerful, very widely used, and allows one to produce very nice-looking graphics without a lot of coding.

Basic usage: ggplot2

The general call for ggplot2 graphics looks something like this:

# NOT run
ggplot(data = , aes(x = ,y = , [options])) + geom_xxxx() + ... + ... + ...

Note that ggplot2 builds a graph in layers within one continuing call (hence the endless +…+…+…); each + adds another layer to the plot, and each layer can take its own data and aesthetics:

... + geom_xxxx(data = , aes(x = , y = ,[options]), [options]) + ... + ... + ...

You can see the layering effect by comparing the same graph with different colors for each layer:

p <- ggplot(data = gapChina, aes(x = year, y = lifeExp)) +
                 geom_point(color = "red")
p
p + geom_point(aes(x = year, y = lifeExp), color = "gray") + ylab("life expectancy") +
    theme_minimal()
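One way to check what each layer contributes is to pull out the data that ggplot2 computed for it. The sketch below is only illustrative: it assumes ggplot2 and gapminder are installed, enlarges the first layer so both layers stay visible, and uses layer_data() from ggplot2; p2, layer1, and layer2 are names introduced here.

```r
library(ggplot2)
library(gapminder)  # assumed installed

gapChina <- subset(gapminder, country == "China")

p <- ggplot(gapChina, aes(x = year, y = lifeExp)) +
  geom_point(color = "red", size = 4)     # layer 1: large red points
p2 <- p + geom_point(color = "gray")      # layer 2: smaller gray points drawn on top

# layer_data() returns the data frame ggplot2 built for one layer
layer1 <- layer_data(p2, 1)
layer2 <- layer_data(p2, 2)
unique(layer1$colour)
unique(layer2$colour)

p2
```

Layers are drawn in the order they are added, so the gray points sit on top of the red ones.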

And, if you’re desperate for the quick and dirty functionality of base plot, or just like the more familiar syntax at first, ggplot2 offers the qplot() function as a wrapper for most basic plots (note that qplot() is deprecated as of ggplot2 3.4.0, so prefer ggplot() in new code):

qplot(x = year, y = lifeExp, data = gapChina)
qplot(x = year, y = lifeExp, data = gapChina, geom = "line")

Grammar of graphics

ggplot2 syntax is very different from base graphics and lattice. It’s built on the grammar of graphics. The basic idea is that the visualization of all data requires four items:

  1. One or more statistics conveying information about the data (identities, means, medians, etc.)

  2. A coordinate system that characterizes the intersections of statistics (at most two for ggplot, three for lattice)

  3. Geometries that differentiate between off-coordinate variation in kind

  4. Scales that differentiate between off-coordinate variation in degree

ggplot2 allows the user to manipulate all four of these items through the stat_*, coord_*, geom_*, and scale_* functions.

All of these are important to truly becoming a ggplot2 master, but today we are going to focus on the item most important to basic users and their data layers: ggplot2’s geometries.
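As a rough illustration of how the four families of functions fit together, the sketch below uses one function from each in a single plot. It assumes ggplot2 (version 3.3.0 or later, for the fun argument of stat_summary()) and gapminder are installed; the choice of plot is ours and is only one of many possible combinations.

```r
library(ggplot2)
library(gapminder)  # assumed installed

gapminder2007 <- subset(gapminder, year == 2007)

p <- ggplot(gapminder2007, aes(x = continent, y = lifeExp)) +
  geom_jitter(width = 0.2, color = "gray") +               # geometry: the raw values
  stat_summary(fun = mean, geom = "point", size = 3) +     # statistic: mean per continent
  scale_y_continuous(limits = c(0, 90)) +                  # scale: range of the y variable
  coord_flip()                                             # coordinate system: swap x and y

means <- layer_data(p, 2)   # one row per continent, with the mean in column y
p
```

Note that stat_summary() still needs a geometry to draw its result, which is why the statistic and geometry families overlap in practice.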

Some Examples

## Scatterplot
ggplot(gapChina, aes(x = year, y = lifeExp)) + geom_point() +
                          ggtitle("China's life expectancy")
## Line (time series) plot
ggplot(gapChina, aes(x = year, y = lifeExp)) + geom_line() +
                          ggtitle("China's life expectancy")
## Boxplot
ggplot(gapminder, aes(x = factor(year), y = lifeExp)) + geom_boxplot() +
                          ggtitle("World's life expectancy")
## Histogram
gapminder2007 <- gapminder %>% filter(year == 2007)
ggplot(gapminder2007, aes(x = lifeExp)) + geom_histogram(binwidth = 5) +
                          ggtitle("World's life expectancy")

ggplot2 and tidy data

# This combines the subsetting and plotting into one step
gapminder %>% filter(year == 2007) %>% 
        ggplot(aes(x = lifeExp)) + geom_histogram(binwidth = 5) +
                          ggtitle("World's life expectancy")

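ggplot2 expects data in tidy (long) format, while base graphics can be easier to use with wide data. For example, below ggplot treats country as an aesthetic parameter that differentiates groups of values, whereas base graphics treats each (year, country) pair of columns as a set of inputs to the plot.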

Here’s ggplot with the data in a tidy format.

# ggplot2 call
head(gapminder)
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
ggplot(data = gapminder, aes(x = year, y = lifeExp)) +
            geom_line(aes(color = country), show.legend = FALSE)

Is that a useful plot?

And here’s use of base graphics, taking advantage of non-tidy, wide-formatted data.

# Base graphics call
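# pivot_wider() is the current tidyr replacement for spread(), which still works but is superseded
gapminder_wide <- gapminder %>% select(country, year, lifeExp) %>%
  tidyr::pivot_wider(names_from = country, values_from = lifeExp)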
gapminder_wide[1:5, 1:5]
## # A tibble: 5 × 5
##    year Afghanistan Albania Algeria Angola
##   <int>       <dbl>   <dbl>   <dbl>  <dbl>
## 1  1952        28.8    55.2    43.1   30.0
## 2  1957        30.3    59.3    45.7   32.0
## 3  1962        32.0    64.8    48.3   34  
## 4  1967        34.0    66.2    51.4   36.0
## 5  1972        36.1    67.7    54.5   37.9
plot(gapminder_wide$year, gapminder_wide$China, col = 'red', type = 'l', ylim = c(40, 85))
lines(gapminder_wide$year, gapminder_wide$Turkey, col = 'green')
lines(gapminder_wide$year, gapminder_wide$Italy, col = 'blue')
legend("right", legend = c("China", "Turkey", "Italy"),
                fill = c("red", "green", "blue")) # colors in the same order as the labels

Of course, you can always filter your tidy data to replicate this plot with ggplot2:

gapminder %>%
  filter(country %in% c("China", "Turkey", "Italy")) %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(color = country))
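If your data start out wide, they need to be reshaped before ggplot2 can use them. The following sketch round-trips the gapminder life expectancy data from tidy to wide and back. It assumes the dplyr, tidyr, and gapminder packages are installed; gapminder_long is a name introduced here for illustration.

```r
library(dplyr)
library(tidyr)
library(gapminder)  # assumed installed

# Tidy -> wide: one column per country
gapminder_wide <- gapminder %>%
  select(country, year, lifeExp) %>%
  pivot_wider(names_from = country, values_from = lifeExp)

# Wide -> tidy: one row per (year, country) observation
gapminder_long <- gapminder_wide %>%
  pivot_longer(cols = -year, names_to = "country", values_to = "lifeExp")

dim(gapminder_wide)   # 12 143
dim(gapminder_long)   # 1704 3
```

The long version is what you would pass to ggplot(), mapping country to color or group.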

Pros/cons of ggplot2

An overview of syntax for various ggplot2 geoms

We’ve already seen these initial ones.

X-Y scatter plots: geom_point()

ggplot(gapChina, aes(x = year, y = lifeExp)) + geom_point() +
                          ggtitle("China's life expectancy")

X-Y line plots: geom_line() or geom_path()

ggplot(gapChina, aes(x = year, y = lifeExp)) + geom_line() +
                          ggtitle("China's life expectancy")

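Histograms and bar charts: geom_histogram() (bins a continuous variable), geom_bar() (counts rows per category), or geom_col() (bar heights taken from the data)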

gapminder2007 <- gapminder %>% filter(year == 2007)
ggplot(gapminder2007, aes(x = lifeExp)) + geom_histogram(binwidth = 5) +
                          ggtitle("World's life expectancy")

Densities: geom_density(), geom_density2d()

ggplot(gapminder2007, aes(x = lifeExp)) + geom_density() + 
                          ggtitle("World's life expectancy")

Boxplots: geom_boxplot()

# Notice that here, you must explicitly convert numeric years to factors
ggplot(data = gapminder, aes(x = factor(year), y = lifeExp)) +
            geom_boxplot() 

Contour plots: geom_contour()

data(volcano) # Load volcano contour data
volcano[1:10, 1:10] # Examine volcano dataset (first 10 rows and columns)
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]  100  100  101  101  101  101  101  100  100   100
##  [2,]  101  101  102  102  102  102  102  101  101   101
##  [3,]  102  102  103  103  103  103  103  102  102   102
##  [4,]  103  103  104  104  104  104  104  103  103   103
##  [5,]  104  104  105  105  105  105  105  104  104   103
##  [6,]  105  105  105  106  106  106  106  105  105   104
##  [7,]  105  106  106  107  107  107  107  106  106   105
##  [8,]  106  107  107  108  108  108  108  107  107   106
##  [9,]  107  108  108  109  109  109  109  108  108   107
## [10,]  108  109  109  110  110  110  110  109  109   108
library(reshape2) # provides melt()
volcano3d <- melt(volcano) # Use reshape2 package to melt the data into tidy form
head(volcano3d) # Examine volcano3d dataset (head)
##   Var1 Var2 value
## 1    1    1   100
## 2    2    1   101
## 3    3    1   102
## 4    4    1   103
## 5    5    1   104
## 6    6    1   105
names(volcano3d) <- c("xvar", "yvar", "zvar") # Rename volcano3d columns

ggplot(data = volcano3d, aes(x = xvar, y = yvar, z = zvar)) +
            geom_contour() 
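To show what the melt step does to a matrix, here is a sketch that builds the same long-format table with base R only. It is an alternative to melt(), not what the module itself uses; volcano3d_base is a name introduced here, and the column names follow the renaming above.

```r
data(volcano)  # built-in 87 x 61 matrix of elevations

# One row per matrix cell: row index, column index, and the cell value
volcano3d_base <- data.frame(
  xvar = as.vector(row(volcano)),   # row index of each cell
  yvar = as.vector(col(volcano)),   # column index of each cell
  zvar = as.vector(volcano)         # elevation, read column by column
)

dim(volcano3d_base)      # 5307 3
head(volcano3d_base, 3)
```

Each of the 87 x 61 cells becomes one observation, which is the tidy shape that geom_contour() and geom_tile() expect.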

Tile/image/level plots, heatmaps: geom_tile(), geom_rect(), geom_raster()

ggplot(data = volcano3d, aes(x = xvar, y = yvar, z = zvar)) +
            geom_tile(aes(fill = zvar)) 

“Trellis” plots

Trellis plots allow you to stratify by a variable, with one panel per categorical value. One uses either facet_grid() or facet_wrap():

ggplot(data = gapminder, aes(x = lifeExp)) + geom_histogram(binwidth = 5) +
            facet_wrap(~year)

This can be quite powerful. It gives you the ability to take account of an additional variable.
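The example above uses facet_wrap(); as a sketch of the other option, facet_grid() lays panels out by two variables at once. This assumes ggplot2 and gapminder are installed; the three years are an arbitrary choice to keep the grid small, and gap_sub is a name introduced here.

```r
library(ggplot2)
library(gapminder)  # assumed installed

# Keep three years so that the grid stays readable
gap_sub <- subset(gapminder, year %in% c(1957, 1982, 2007))

p <- ggplot(gap_sub, aes(x = lifeExp)) +
  geom_histogram(binwidth = 5) +
  facet_grid(continent ~ year)   # rows: continent, columns: year
p
```

facet_wrap() is usually the better fit for a single stratifying variable with many levels, while facet_grid() suits a full cross of two variables.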

Fitted lines and curves with ggplot2 (optional)

ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10()

# Add linear model (lm) smoother
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10() +
  geom_smooth(method = "lm")

# Add local regression (loess) smoother, span of 0.75 (the default; more smoothed)
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10() +
  geom_smooth(method = "loess", span = .75)

# Add local regression (loess) smoother, span of 0.25 (less smoothed)
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10() +
  geom_smooth(method = "loess", span = .25)

# Add linear model (lm) smoother, no standard error shading
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10() +
  geom_smooth(method = "lm", se = FALSE)

# Add local regression (loess) smoother, no standard error shading
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() + scale_x_log10() +
  geom_smooth(method = "loess", se = FALSE)
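It can help to see what the "lm" smoother is actually fitting. Because scale_x_log10() transforms the data before any statistics are computed, the smoother regresses lifeExp on log10(gdpPercap). The sketch below fits that model directly and compares it with the values ggplot2 computes; it assumes ggplot2 and gapminder are installed, and fit, smooth, and manual are names introduced here.

```r
library(ggplot2)
library(gapminder)  # assumed installed

gapminder2007 <- subset(gapminder, year == 2007)

# The model that geom_smooth(method = "lm") fits on the log10 scale
fit <- lm(lifeExp ~ log10(gdpPercap), data = gapminder2007)
coef(fit)

p <- ggplot(gapminder2007, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() + scale_x_log10() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Compare the smoother's fitted values with predictions from our own model
smooth <- layer_data(p, 2)   # x is on the log10 scale here
manual <- predict(fit, newdata = data.frame(gdpPercap = 10^smooth$x))
all.equal(as.numeric(smooth$y), as.numeric(manual))
```

If you want the model fitted on the original scale instead, transform the coordinate system with coord_trans() rather than the scale.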

Anatomy of aes()

# NOT run
ggplot(data = , aes(x = , y = , color = , linetype = , shape = , size = ))

These four aesthetic parameters (color, linetype, shape, size) can be used to show variation in kind (categories) and variation in degree (numeric).

Parameters passed into aes should be variables in your dataset.

Parameters passed to geom_xxx outside of aes should not be related to your dataset; they are fixed values that apply to the whole layer.

ggplot(data = gapminder, aes(x = year, y = lifeExp)) +
            geom_line(aes(color = country), show.legend = FALSE)

Note what happens when we specify the color parameter outside of the aesthetic operator. ggplot2 views these specifications as invalid graphical parameters.

ggplot(data = gapminder, aes(x = year, y = lifeExp)) +
            geom_line(color = country)
## Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomLine, : object 'country' not found
ggplot(data = gapminder, aes(x = year, y = lifeExp)) +
            geom_line(color = "country")
## Error: Unknown colour name: country
# This works but only makes sense if we restrict to one country
ggplot(data = gapChina, aes(x = year, y = lifeExp)) +
            geom_line(color = "red")

Note: Aesthetics automatically show up in your legend. Parameters (those not mapped to a variable in your data frame) do not!
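A related pitfall is putting a constant inside aes(). The sketch below contrasts setting a color with mapping the string "red" as if it were data. It assumes ggplot2 and gapminder are installed; p_set and p_mapped are names introduced here for illustration.

```r
library(ggplot2)
library(gapminder)  # assumed installed

gapChina <- subset(gapminder, country == "China")

# Setting: color is a fixed parameter of the layer, so the line is red
p_set <- ggplot(gapChina, aes(x = year, y = lifeExp)) +
  geom_line(color = "red")

# Mapping: "red" is treated as a one-level variable, so ggplot2 picks a
# color from its default palette and adds a legend entry labelled "red"
p_mapped <- ggplot(gapChina, aes(x = year, y = lifeExp)) +
  geom_line(aes(color = "red"))

unique(layer_data(p_set)$colour)      # "red"
unique(layer_data(p_mapped)$colour)   # a default palette color, not "red"
```

If a plot shows an unexpected legend with a single entry, a constant inside aes() is a likely cause.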

Using aesthetics to highlight features

Differences in kind

## color as the aesthetic to differentiate by continent
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(aes(color = continent)) + scale_x_log10()

## point shape as the aesthetic to differentiate by continent
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(aes(shape = continent)) + scale_x_log10()

## line type as the aesthetic to differentiate by country
gapOceania <- gapminder %>% filter(continent %in% 'Oceania')
ggplot(data = gapOceania, aes(x = year, y = lifeExp)) +
            geom_line(aes(linetype = country)) # no log scale needed: the x-axis is year

Differences in degree

## point size as the aesthetic to differentiate by population
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(aes(size = pop)) + scale_x_log10()

## color as the aesthetic to differentiate by population
ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(aes(color = pop)) + scale_x_log10() +
            scale_color_gradient(low = 'lightgray', high = 'black')

Multiple non-coordinate aesthetics (differences in kind using color, degree using point size)

ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(aes(size = pop, color = continent)) + scale_x_log10()

How many variables have we represented? If we used a trellis plot we could represent yet another variable!

Using aesthetics: quick quiz

POLL 7A: Which of these ggplot2 calls will work (in the sense of not giving an error, not in the sense of being a useful plot)?

(respond at https://pollev.com/chrispaciorek428)

  1. gapminder %>% ggplot(aes(x = gdpPercap, y = lifeExp)) %>% geom_point()
  2. gapminder %>% ggplot(aes(x = gdpPercap, y = lifeExp))
  3. gapminder %>% ggplot(aes(x = gdpPercap, y = lifeExp)) + geom_point()
  4. gapminder %>% ggplot(aes(x = gdpPercap, y = lifeExp, shape = 'a')) + geom_point()
  5. gapminder %>% ggplot(aes(x = gdpPercap, y = lifeExp)) + geom_point(aes(shape = country), show.legend = FALSE)
  6. gapminder %>% ggplot(aes(x = gdpPercap, y = lifeExp)) + geom_point(shape = 'a', show.legend = FALSE)
  7. gapminder %>% ggplot() + geom_point(aes(x = gdpPercap, y = lifeExp, shape = country), show.legend = FALSE)

Scaling Aesthetics

Aesthetics are handled by their very own scale functions, which allow you to set the limits, breaks, transformations, and any palettes that determine how your data are plotted. ggplot2 includes a number of helpful default scale functions.

For example, our data might be better represented using a log10 transformation of per capita GDP:

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(aes(color = continent)) +
  scale_x_log10()

And perhaps we want colors that are a little different:

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(aes(color = continent)) +
  scale_x_log10() +
  scale_color_viridis_d()

Or perhaps we want to set our palettes, breaks, or labels manually:

ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(aes(color = continent)) +
  scale_x_log10(labels = scales::dollar) +
  scale_color_manual("The continents", 
                     values = c("red", "blue", "green", "yellow", "#800080")) # hex codes work!

For more info about setting scales in ggplot2, and for more helper functions, consider diving into the scales package, which is the backend for much of the scales functionality in ggplot2.
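As a small sketch of the scales helpers in use, the following formats axis labels with dollar(), sets explicit breaks, and pins each color to a continent by name so that the mapping does not depend on the order of the values. It assumes ggplot2, scales, and gapminder are installed; the break values and the name continent_cols are choices made here for illustration.

```r
library(ggplot2)
library(gapminder)  # assumed installed
library(scales)     # assumed installed (it is a dependency of ggplot2)

# Label helpers are ordinary functions that turn numbers into strings
dollar(c(1000, 10000))   # "$1,000"  "$10,000"

# A named vector ties each color to a specific factor level
continent_cols <- c(Africa = "red", Americas = "blue", Asia = "green",
                    Europe = "yellow", Oceania = "#800080")

p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent)) +
  scale_x_log10(breaks = c(300, 3000, 30000), labels = dollar) +
  scale_color_manual("The continents", values = continent_cols)
p
```

Naming the values is optional, but it protects against colors silently shifting if a level is dropped or reordered.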

Fine tuning your plot

ggplot handles many plot options as additional layers.

Labels

ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() +
  xlab(label = "GDP per capita") +
  ylab(label = "Life expectancy") +
  ggtitle(label = "Gapminder") 

Or even more simply use the labs() function

ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) + geom_point() +
  labs(x = "GDP per capita", y = "Life expectancy", title = "Gapminder")

Axis and point scales

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
            geom_point() 
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(size=3) 
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(size=1) 

Colors

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(color = colors()[11]) 
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(color = "red") 

Point Styles and Widths

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(shape = 3) 
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(shape = "w") 
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
            geom_point(shape = "$", size=5) 

Line Styles and Widths

ggplot(data = gapChina, aes(x = year, y = lifeExp)) +
            geom_line(linetype = 1) 
ggplot(data = gapChina, aes(x = year, y = lifeExp)) +
            geom_line(linetype = 2) 
ggplot(data = gapChina, aes(x = year, y = lifeExp)) +
            geom_line(linetype = 5, size = 2) # in ggplot2 >= 3.4.0, use linewidth = 2 for lines

Themes with ggplot2 (optional)

Elements of the plot not associated with geometries can be adjusted using ggplot themes.

There are some “complete” themes already included with the package:

  - theme_gray() (the default)
  - theme_minimal()
  - theme_bw()
  - theme_light()
  - theme_dark()
  - theme_classic()

In addition to these, you can tweak just about any element of your plot’s appearance using the theme() function.

For instance, perhaps you want to move the legend from the right (the default) to the bottom of your plot; this would be part of the plot theme. Note how you can add options to a complete theme already in the plot:

gapminder %>%
  filter(country %in% c("China", "Turkey", "Italy")) %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(color = country)) +
  theme_minimal() + 
  theme(legend.position = "bottom")
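Beyond the legend position, theme() takes element functions for individual pieces of the plot. The sketch below shows a few commonly adjusted elements. It assumes ggplot2 and gapminder are installed; the particular settings are illustrative choices, and gap3 is a name introduced here.

```r
library(ggplot2)
library(gapminder)  # assumed installed

gap3 <- subset(gapminder, country %in% c("China", "Turkey", "Italy"))

p <- ggplot(gap3, aes(x = year, y = lifeExp)) +
  geom_line(aes(color = country)) +
  ggtitle("Life expectancy") +
  theme_bw() +                                               # complete theme first
  theme(legend.position = "bottom",                          # then individual tweaks
        panel.grid.minor = element_blank(),                  # remove minor grid lines
        axis.text.x = element_text(angle = 45, hjust = 1),   # rotate x tick labels
        plot.title = element_text(face = "bold"))            # bold title
p
```

Order matters here: adding a complete theme such as theme_bw() after theme() would overwrite the individual tweaks.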

Combining Multiple Plots

# Load the gridExtra package
library(gridExtra)

# Create 3 plots to combine in a table
plot1 <- ggplot(data = gapminder2007, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() + scale_x_log10() + annotate('text', 150, 80, label = '(a)')
plot2 <- ggplot(data = gapminder2007, aes(x = pop, y = lifeExp)) +
  geom_point() + scale_x_log10() + annotate('text', 1.8e5, 80, label = '(b)')
plot3 <- ggplot(data = gapminder, aes(x = year, y = lifeExp)) +
      geom_line(aes(color = country), show.legend = FALSE) +
      annotate('text', 1951, 80, label = '(c)')


# Call grid.arrange
grid.arrange(plot1, plot2, plot3, nrow=3, ncol = 1)

patchwork: Combining Multiple ggplot2 plots (optional)

library(patchwork)

# use the patchwork operators
# stack plots horizontally
plot1 + plot2 + plot3

# stack plots vertically
plot1 / plot2 / plot3

# side-by-side plots with third plot below
(plot1 | plot2) / plot3

# side-by-side plots with a space in between, and a third plot below
(plot1 | plot_spacer() | plot2) / plot3

# stack plots vertically and alter with a single "gg_theme"
(plot1 / plot2 / plot3) & theme_bw()

Feel free to explore more at https://github.com/thomasp85/patchwork.

Note: patchwork is an example of a ggplot2 extension package of which there are many! One of the benefits to learning and using ggplot2 is that there is a huge community of developers that build separate graphics packages that generally use the same syntax to extend the ggplot2 functionality into things like animation and 3D plotting! Check them out here.

Exporting

Two basic image types:

Raster/Bitmap (.png, .jpeg)

The plot is stored pixel by pixel, so quality degrades if you resize the image.

jpeg(filename = "example.jpg", width = , height =)
plot(x,y)
dev.off()

Vector (.pdf, .ps)

Each element of the plot is stored as a drawing instruction rather than as pixels; this is great for resizing, but files for plots with many elements (for example, scatterplots with many points) can be very large.

pdf(file = "example.pdf", width = , height =)
plot(x,y)
dev.off()

Exporting with ggplot

# NOT run

# Assume we saved our plot as an object called `plot1`.

ggsave(filename = "example.pdf", plot = plot1, scale = , width = , height = )
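The templates above leave the sizes blank, so here is a filled-in sketch of both export routes. It assumes ggplot2 and gapminder are installed, writes to a temporary directory so that nothing is overwritten, and uses 6 x 4 inches as an arbitrary size; the file names are made up for illustration.

```r
library(ggplot2)
library(gapminder)  # assumed installed

gapChina <- subset(gapminder, country == "China")
out_dir <- tempdir()   # temporary directory; replace with your own path

# Base graphics: open a device, draw, then close the device
base_file <- file.path(out_dir, "china_base.pdf")
pdf(file = base_file, width = 6, height = 4)   # size in inches
plot(gapChina$year, gapChina$lifeExp, type = "l",
     xlab = "Year", ylab = "Life expectancy")
dev.off()

# ggplot2: ggsave() chooses the device from the file extension
plot1 <- ggplot(gapChina, aes(x = year, y = lifeExp)) + geom_line()
gg_file <- file.path(out_dir, "china_gg.pdf")
ggsave(filename = gg_file, plot = plot1, width = 6, height = 4, units = "in")

file.exists(c(base_file, gg_file))
```

Forgetting dev.off() is a common reason for an empty or unreadable file in the base graphics route; for raster output with ggsave(), the dpi argument controls the resolution.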

Breakout

These questions ask you to work with the gapminder dataset.

Basics

  1. Plot a histogram of life expectancy.

  2. Plot the gdp per capita against population. Put the x-axis on the log scale.

  3. Clean up your scatterplot with a title and axis labels. Output it as a PDF and see if you’d be comfortable with including it in a report/paper.

Using the ideas

  1. Create a trellis plot of life expectancy by gdpPercap scatterplots, one subplot per continent. Use a 2x3 layout of panels in the plot. Now have the size of the points vary with population. Use coord_cartesian() (or scale_x_continuous()) to set the x-axis limits to be in the range from 100 to 50000.

  2. Make a boxplot of life expectancy conditional on binned values of gdp per capita.

Advanced

  1. Using the data for 2007, recreate as much as you can of this famous Gapminder plot, where the colors are different continents. (Don’t worry about the ‘2015’ in the background and ignore the ‘play’ button at the bottom.)

  2. Create a “trellis” plot where, for a given year, each panel uses a) hollow circles to plot lifeExp as a function of log(gdpPercap), and b) a red loess smoother without standard errors to plot the trend. Turn off the grey background. Figure out how to use partially-transparent points to reduce the effect of the overplotting of points.

  3. In the following code, I try to plot the life expectancy over time, one line per country but where the color is by continent (in the module, we saw how to have the color vary by country).

ggplot(data = gapminder, aes(x = year, y = lifeExp)) +
            geom_line(aes(color = continent))

This doesn’t work. See if you can understand why and how to fix it. Hint: look up “group” in the ggplot2 help information in R.