String processing tutorial

Training materials for string processing in R and Python.

View the Project on GitHub berkeley-scf/tutorial-string-processing

This project is maintained by berkeley-scf, the UC Berkeley Statistical Computing Facility.

Text manipulations in R, Python, Perl, and bash have a number of things in common, as many of these evolved from UNIX. When I use the term string here, I’ll be referring to any sequence of characters that may include numbers, white space, and special characters. Note that in R a character vector is a vector of one or more such strings.

Some of the basic things we need to do are paste/concatenate strings together, split strings apart, take subsets of strings, and replace characters within strings. Often these operations are done based on patterns rather than a fixed string sequence. This involves the use of regular expressions, covered in Section 3.

1 R

In general, strings in R are stored in character vectors. R’s functions for string manipulation are fully vectorized and will work on all of the strings in a vector at once.

Here’s a cheatsheet from RStudio on manipulating strings using the stringr package in R.

1.1 String manipulation in base R

A few of the basic R functions for manipulating strings are paste, strsplit, substring, and nchar. paste and strsplit are basically inverses of each other:

  • paste concatenates an arbitrary set of strings (or the elements of a vector, if using the collapse argument) with a user-specified separator
  • strsplit splits strings apart based on a delimiter/separator
  • substring extracts (or replaces) parts of the elements of a character vector based on start and end positions
  • nchar returns the number of characters in a string

Note that all of these operate in a vectorized fashion.

out <- paste("My", "name", "is", "Chris", ".", sep = " ")
paste(c("My", "name", "is", "Chris", "."), collapse = " ") # equivalent
## [1] "My name is Chris ."
nchar(out)
## [1] 18
strsplit(out, split = ' ')
## [[1]]
## [1] "My"    "name"  "is"    "Chris" "."

Note

Some string processing functions (such as strsplit above) can return multiple values for each input string (each element of the character vector). As a result, these functions return a list with one element per input string, so the result is a list of length one when the function operates on a single string.

out <- c("Her name is Maya", "Hello everyone")
strsplit(out, split = ' ')
## [[1]]
## [1] "Her"  "name" "is"   "Maya"
## 
## [[2]]
## [1] "Hello"    "everyone"

Here are some examples of using substring:

times <- c("04:18:04", "12:12:53", "13:47:00")
substring(times, 7, 8)
## [1] "04" "53" "00"
substring(times[3], 1, 2) <- '01'   ## replacement
times
## [1] "04:18:04" "12:12:53" "01:47:00"

To identify particular subsequences in strings, there are several closely related R functions. grep looks for a specified pattern within a character vector and reports the indices of the elements in which it was found. By default the pattern is interpreted as a regular expression; using the fixed=TRUE argument ensures that regular expressions are NOT used. gregexpr reports the position(s) in each string at which the pattern is found, with -1 indicating no match (use regexpr if you only want the first occurrence). gsub replaces each occurrence of the pattern with a replacement string (use sub if you want to replace only the first occurrence).

dates <- c("2016-08-03", "2007-09-05", "2016-01-02")
grep("2016", dates)
## [1] 1 3
gregexpr("2016", dates)
## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1] 1
## attr(,"match.length")
## [1] 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
gsub("2016", "16", dates)
## [1] "16-08-03"   "2007-09-05" "16-01-02"

1.2 String manipulation using stringr

The stringr package wraps the various core string manipulation functions to provide a common interface. It also removes some of the clunkiness of the base string functions, such as having to call gregexpr and then regmatches to pull out the matched strings. In general, I’d suggest using stringr functions in place of R’s base string functions.

First let’s see stringr’s versions of some of the base R string functions mentioned in the previous section.

The basic interface to stringr functions is function(character_vector, pattern, [replacement]).

Table 1 provides an overview of the key functions related to working with patterns, which are basically wrappers for grep, gsub, gregexpr, etc.

Function                       What it does
str_detect                     detects pattern, returning TRUE/FALSE
str_count                      counts matches
str_locate/str_locate_all      detects pattern, returning positions of matching characters
str_extract/str_extract_all    detects pattern, returning matches
str_replace/str_replace_all    detects pattern and replaces matches

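The analog of regexpr vs. gregexpr and sub vs. gsub is that most of the functions have versions that return all the matches, not just the first match, e.g., str_locate_all, str_extract_all, etc. Note that str_locate_all and str_extract_all return lists (with one element per input string), while str_locate returns a matrix and str_extract returns a vector. str_replace_all, like str_replace, returns a character vector.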

To specify options, you can wrap the pattern argument in one of these functions: fixed(pattern, ignore_case) or regex(pattern, ignore_case). The default is regex, so you only need to specify it explicitly if you also want to pass additional arguments, such as ignore_case or others listed under help(regex) (invoke the help after loading stringr).

Here’s an example:

library(stringr)
str <- c("Apple Computer", "IBM", "Apple apps")

str_detect(str, fixed("app", ignore_case = TRUE))
## [1]  TRUE FALSE  TRUE
str_count(str, fixed("app", ignore_case = TRUE))
## [1] 1 0 2
str_locate(str, fixed("app", ignore_case = TRUE))
##      start end
## [1,]     1   3
## [2,]    NA  NA
## [3,]     1   3
str_locate_all(str, fixed("app", ignore_case = TRUE))
## [[1]]
##      start end
## [1,]     1   3
## 
## [[2]]
##      start end
## 
## [[3]]
##      start end
## [1,]     1   3
## [2,]     7   9
dates <- c("2016-08-03", "2007-09-05", "2016-01-02")
str_locate(dates, "20[^0][0-9]") ## regular expression: years 2010 and later (third character is not 0)
##      start end
## [1,]     1   4
## [2,]    NA  NA
## [3,]     1   4
str_extract_all(dates, "20[^0][0-9]")
## [[1]]
## [1] "2016"
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "2016"
str_replace_all(dates, "20[^0][0-9]", "XXXX")
## [1] "XXXX-08-03" "2007-09-05" "XXXX-01-02"

2 Basic text manipulation in Python

Let’s see basic concatenation, splitting, working with substrings, and searching/replacing substrings. Notice that Python’s string functionality is object-oriented (though len is not).

Here, we’ll just cover the basic methods for the str type. There’s lots of additional functionality for working with strings in the re module, which is discussed in this tutorial. Of course, in many cases one needs the full power of regular expressions to get the job done.

First let’s look at combining/concatenating strings. We can do this with the + operator or using the join method, which is (perhaps confusingly) called based on the separator of interest with the input strings as arguments.

out = "My" + "name" + "is" + "Chris" +  "."
out
## 'MynameisChris.'
out = ' '.join(("My", "name", "is", "Chris", "."))
out
## 'My name is Chris .'
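One detail that the join example above does not show is that join expects every item to already be a string. As a minimal sketch (the mixed list below is made up for illustration), non-string items can be converted with str first:

```python
# Hypothetical list mixing a string with numbers
values = ["Chris", 3, 4.5]

# ' '.join(values) would raise a TypeError because of the non-string items,
# so convert each item to a string before joining
joined = ' '.join(str(v) for v in values)

print(joined)  # Chris 3 4.5
```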

len simply returns the number of characters in the string.

len(out) 
## 18
out.split(' ')
## ['My', 'name', 'is', 'Chris', '.']
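It may be worth noting that split with an explicit separator and split with no argument treat repeated whitespace differently. Here is a small sketch using an illustrative string with a doubled space:

```python
# Illustrative input with two spaces between the first two words
text = "My  name is Chris"

# An explicit separator produces an empty string between adjacent separators
by_space = text.split(' ')

# With no argument, any run of whitespace is treated as a single separator
by_whitespace = text.split()

print(by_space)       # ['My', '', 'name', 'is', 'Chris']
print(by_whitespace)  # ['My', 'name', 'is', 'Chris']
```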

To see the various string methods, we can hit tab after typing str. or after typing the name of any specific string followed by a period:

out.
out.capitalize()    out.index(          out.isspace()       out.removesuffix(   out.startswith(
out.casefold()      out.isalnum()       out.istitle()       out.replace(        out.strip(
out.center(         out.isalpha()       out.isupper()       out.rfind(          out.swapcase()
out.count(          out.isascii()       out.join(           out.rindex(         out.title()
out.encode(         out.isdecimal()     out.ljust(          out.rjust(          out.translate(
out.endswith(       out.isdigit()       out.lower()         out.rpartition(     out.upper()
out.expandtabs(     out.isidentifier()  out.lstrip(         out.rsplit(         out.zfill(
out.find(           out.islower()       out.maketrans(      out.rstrip(         
out.format(         out.isnumeric()     out.partition(      out.split(          
out.format_map(     out.isprintable()   out.removeprefix(   out.splitlines(     

Unlike in R, you cannot apply the string methods directly to a list or tuple of strings, but you can of course use tools such as list comprehensions to easily process multiple strings.
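As a rough illustration of processing several strings at once, the sketch below applies string methods inside list comprehensions. The list reuses the time strings from the R substring example; the variable names are just for illustration.

```python
# Time strings borrowed from the R example earlier in the tutorial
times = ["04:18:04", "12:12:53", "13:47:00"]

# Call a string method on each element in turn
hours = [t.split(':')[0] for t in times]
dotted = [t.replace(':', '.') for t in times]

print(hours)   # ['04', '12', '13']
print(dotted)  # ['04.18.04', '12.12.53', '13.47.00']
```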

Working with substrings relies on the fact that Python works with strings as if they are vectors of individual characters.

var = "13:47:00"
var[3:5]
## '47'
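Slicing supports a few other forms that can be handy for substrings. The following is a brief sketch on the same string, showing omitted endpoints, negative indices, and a negative step:

```python
var = "13:47:00"

hours = var[:2]         # from the start up to (not including) index 2
seconds = var[-2:]      # the last two characters
backwards = var[::-1]   # a step of -1 reverses the string

print(hours)      # 13
print(seconds)    # 00
print(backwards)  # 00:74:31
```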

However, strings are immutable: you cannot alter a subset of characters in place, so the assignment below fails. Instead, you can build a new string or work with the characters as a list.

var[0:2] = "01"
## Error: TypeError: 'str' object does not support item assignment
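Two possible ways around the immutability of strings are sketched below: building a new string from slices, or converting to a list of characters and joining back. These are just illustrations and not the only approaches.

```python
var = "13:47:00"

# Option 1: build a new string from a replacement plus a slice of the old one
via_slices = "01" + var[2:]

# Option 2: convert to a list of characters, modify it, and join back
chars = list(var)
chars[0:2] = "01"   # slice assignment on a list accepts any iterable
via_list = ''.join(chars)

print(via_slices)  # 01:47:00
print(via_list)    # 01:47:00
print(var)         # 13:47:00 (the original string is unchanged)
```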

Now let’s consider finding substrings. Here Python tells us that ‘2016’ starts at index 6 of the string (with 0-based indexing), i.e., at the seventh character.

var = "08-03-2016"
var.find("2016")
## 6
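When the substring is absent, the behavior depends on which tool is used. As a small sketch (searching for "2007" is just an illustrative miss), find returns -1, index raises a ValueError, and the in operator gives a simple True/False:

```python
var = "08-03-2016"

# Membership test: no position information, just True/False
found = "2016" in var

# find returns -1 rather than raising an error when there is no match
missing_pos = var.find("2007")

# index behaves like find but raises ValueError when there is no match
try:
    var.index("2007")
    raised = False
except ValueError:
    raised = True

print(found)        # True
print(missing_pos)  # -1
print(raised)       # True
```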

We can count occurrences with .count():

var = "08-03-2016; 07-09-2016"
var.count("2016")
## 2

And we can replace like this:

var = "13:47:00"
var.replace("13", "01")
## '01:47:00'
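Note that replace returns a new string and replaces every occurrence by default. The sketch below uses an illustrative string in which "13" appears twice, and passes the optional third argument to limit the number of replacements:

```python
# Illustrative string where the target appears twice
var = "13:13:00"

replaced_all = var.replace("13", "01")        # all occurrences
replaced_first = var.replace("13", "01", 1)   # at most one occurrence

print(replaced_all)    # 01:01:00
print(replaced_first)  # 01:13:00
print(var)             # 13:13:00 (the original string is unchanged)
```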