Extracts the first match, or all matches, of a pattern from each element of a character vector.

string_extract(
  x,
  pattern,
  single = FALSE,
  simplify = TRUE,
  fixed = FALSE,
  ignore.case = FALSE,
  word = FALSE,
  unlist = FALSE,
  envir = parent.frame()
)

stextract(
  x,
  pattern,
  single = FALSE,
  simplify = TRUE,
  fixed = FALSE,
  ignore.case = FALSE,
  word = FALSE,
  unlist = FALSE,
  envir = parent.frame()
)

Arguments

x

A character vector.

pattern

A character scalar. It represents the pattern to be extracted from x. By default this is a regular expression. You can use flags in the pattern in the form "flag1, flag2/pattern". Available flags are ignore (case), fixed (no regex), word (add word boundaries), single (select only the first element), and magic (add interpolation with {}). Example: with the pattern "ignore/hello" and x = "Hello world", the extracted text is "Hello". Shortcut: use the first letters of the flags. Ex: "iw/one" will extract the word "one" (flags 'ignore' + 'word').

single

Logical scalar, default is FALSE. If TRUE, only the first match of the pattern is returned. Note that in that case, a character vector of the same length as the input vector is returned.

simplify

Logical scalar, default is TRUE. If TRUE, then when the vector input x is of length 1, a character vector is returned instead of a list.

fixed

Logical scalar, default is FALSE. Whether to trigger a fixed search instead of a regular expression search (default).

ignore.case

Logical scalar, default is FALSE. If TRUE, then case insensitive search is triggered.

word

Logical scalar, default is FALSE. If TRUE, then a) word boundaries are added to the pattern, and b) patterns can be chained by separating them with a comma, in which case they are combined with a logical OR. Example: if word = TRUE, then pattern = "The, mountain" will match either the word 'The' or the word 'mountain'.

unlist

Logical scalar, default is FALSE. If TRUE, the function unlist is applied to the resulting list, leading to a character vector in output (instead of a list).

envir

Environment in which to evaluate the interpolations if the flag "magic" is provided. Default is parent.frame().

Value

The object returned by this function can be a list or a character vector.

If single = TRUE, a character vector is returned, containing the value of the first match. If no match is found, an empty string is returned.

If single = FALSE (the default) and simplify = TRUE (default), the object returned is:

  • a character vector if x, the vector in input, is of length 1: the character vector contains all the matches and is of length 0 if no match is found.

  • a list of the same length as x otherwise. The ith element of the list is a character vector of the matches for the ith element of x.

If single = FALSE (default) and simplify = FALSE, the object returned is always a list.
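
The return shapes described above can be illustrated with base R alone. The following is a minimal sketch using regmatches() and gregexpr(); it is not stringmagic's implementation, and the input vector is a hypothetical example.

```r
# Base-R sketch of the return shapes; not stringmagic's implementation.
x <- c("Mazda RX4", "Datsun 710", "710")   # hypothetical input

# All matches: a list with one character vector per element of x.
# Elements without a match yield character(0).
all_matches <- regmatches(x, gregexpr("[[:alpha:]]+", x))

# First match only: a character vector as long as x, with "" when
# nothing matched (the shape described for single = TRUE).
first_match <- vapply(
  all_matches,
  function(m) if (length(m) > 0) m[1] else "",
  character(1)
)

# Flattening the list gives a single character vector (the shape
# described for unlist = TRUE).
flat <- unlist(all_matches)
```

Here all_matches has the same length as x, first_match has one string per element of x, and flat loses the correspondence between matches and elements of x.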

Functions

  • stextract(): Alias to string_extract

Generic regular expression flags

All stringmagic functions support generic flags in regular-expression patterns. The flags are useful to quickly give extra instructions, similarly to usual regular expression flags.

Here the syntax is "flag1, flag2/pattern". That is: flags are a comma separated list of flag-names separated from the pattern with a slash (/). Example: string_which(c("hello...", "world"), "fixed/.") returns 1. Here the flag "fixed" removes the regular expression meaning of "." which would have otherwise meant "any character". The no-flag version string_which(c("hello...", "world"), ".") returns 1:2.

Alternatively, and this is recommended, you can collate the initials of the flags instead of using a comma separated list. For example: "if/dt[" will apply the flags "ignore" and "fixed" to the pattern "dt[".

The four flags always available are: "ignore", "fixed", "word" and "magic".

  • "ignore" makes the search case-insensitive. Technically, it adds the perl-flag "(?i)" at the beginning of the pattern.

  • "fixed" removes the regular expression interpretation, so that the characters ".", "$", "^", "[" (among others) lose their special meaning and are treated for what they are: simple characters.

  • "word" adds word boundaries ("\\b" in regex language) to the pattern. Further, the comma (",") becomes a word separator. Technically, "word/one, two" is treated as "\\b(one|two)\\b". Example: string_clean("Am I ambushed?", "wi/am") leads to " I ambushed?" thanks to the flags "ignore" and "word".

  • "magic" allows interpolating variables inside the pattern before regex interpretation. For example, if letters = "aiou" then string_clean("My great goose!", "magic/[{letters}] => e") leads to "My greet geese!"
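
To make the four flags concrete, here is a rough base-R equivalent of each. This is only a sketch of what the flags amount to, written with grep(), gsub() and regmatches(); it does not show how stringmagic processes them internally. The variable name vowels is a hypothetical stand-in for letters in the text above, chosen to avoid masking R's built-in letters.

```r
# Base-R sketch of what each flag amounts to; not stringmagic's implementation.
x <- c("hello...", "world")

# "fixed": the dot is treated as a literal character
fixed_hits <- grep(".", x, fixed = TRUE)
# without the flag, the dot matches any character
regex_hits <- grep(".", x)

# "ignore" + "word" on the pattern "am": prepend (?i), wrap in \\b ... \\b
cleaned <- gsub("(?i)\\bam\\b", "", "Am I ambushed?", perl = TRUE)

# "word" with the comma-separated pattern "one, two"
txt <- "one, twofold, two"
words <- regmatches(txt, gregexpr("\\b(one|two)\\b", txt, perl = TRUE))[[1]]

# "magic": interpolate a variable into the pattern before matching
vowels <- "aiou"   # hypothetical name, stands in for `letters` in the text
replaced <- gsub(paste0("[", vowels, "]"), "e", "My great goose!")
```

Note that "twofold" is not matched in the word example, because the word boundary after "two" is missing.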

Examples


cars = head(row.names(mtcars))

# Let's extract the first word:
string_extract(cars, "\\w+", single = TRUE)
#> [1] "Mazda"   "Mazda"   "Datsun"  "Hornet"  "Hornet"  "Valiant"

# same using flags
string_extract(cars, "s/\\w+")
#> [1] "Mazda"   "Mazda"   "Datsun"  "Hornet"  "Hornet"  "Valiant"

# extract all words composed only of letters
# NOTE: we use the flag word (`w/`)
string_extract(cars, "w/[[:alpha:]]+")
#> [[1]]
#> [1] "Mazda"
#> 
#> [[2]]
#> [1] "Mazda" "Wag"  
#> 
#> [[3]]
#> [1] "Datsun"
#> 
#> [[4]]
#> [1] "Hornet" "Drive" 
#> 
#> [[5]]
#> [1] "Hornet"     "Sportabout"
#> 
#> [[6]]
#> [1] "Valiant"
#> 

# version without flag:
string_extract(cars, "\\b[[:alpha:]]+\\b")
#> [[1]]
#> [1] "Mazda"
#> 
#> [[2]]
#> [1] "Mazda" "Wag"  
#> 
#> [[3]]
#> [1] "Datsun"
#> 
#> [[4]]
#> [1] "Hornet" "Drive" 
#> 
#> [[5]]
#> [1] "Hornet"     "Sportabout"
#> 
#> [[6]]
#> [1] "Valiant"
#> 

# If x is of length 1 => a character vector is returned
greet = "Hi Tom, how's Mary doing?"
string_extract(greet, "w/[[:upper:]]\\w+")
#> [1] "Hi"   "Tom"  "Mary"

# version with simplify = FALSE => a list is returned
string_extract(greet, "w/[[:upper:]]\\w+", simplify = FALSE)
#> [[1]]
#> [1] "Hi"   "Tom"  "Mary"
#>