Splits a character string with respect to a pattern

string_split(
  x,
  split,
  simplify = TRUE,
  fixed = FALSE,
  ignore.case = FALSE,
  word = FALSE,
  envir = parent.frame()
)

stsplit(
  x,
  split,
  simplify = TRUE,
  fixed = FALSE,
  ignore.case = FALSE,
  word = FALSE,
  envir = parent.frame()
)

Arguments

x

A character vector.

split

A character scalar, used to split the elements of x. By default it is interpreted as a regular expression. You can add flags to the pattern in the form "flag1, flag2/pattern". Available flags are ignore (case), fixed (no regex), word (add word boundaries), magic (add interpolation with "{}"). Example: with "ignore/hello", a text containing "Hello" is split at "Hello". Shortcut: use the first letters of the flags. Ex: "iw/one" will split at the word "one" (flags 'ignore' + 'word').

simplify

Logical scalar, default is TRUE. If TRUE, then when the vector input x is of length 1, a character vector is returned instead of a list.

fixed

Logical, default is FALSE. Whether to consider the argument split as fixed (and not as a regular expression).

ignore.case

Logical scalar, default is FALSE. If TRUE, then case insensitive search is triggered.

word

Logical scalar, default is FALSE. If TRUE then a) word boundaries are added to the pattern, and b) patterns can be chained by separating them with a comma; they are combined with a logical OR. Example: if word = TRUE, then split = "The, mountain" will split at either the word 'The' or the word 'mountain'.

envir

Environment in which to evaluate the interpolations if the flag "magic" is provided. Default is parent.frame().
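
As a rough sketch of how the logical arguments change the split (the input strings are made up for illustration, and it assumes the stringmagic package is installed):

```r
library(stringmagic)

x = "v1.2.3"
# default: `split` is a regular expression, so "." matches any character
string_split(x, ".")
# fixed = TRUE: "." is taken as a literal dot
dot_split = string_split(x, ".", fixed = TRUE)

y = "salt AND pepper and sand"
# ignore.case = TRUE: " and " matches both " AND " and " and "
and_split = string_split(y, " and ", ignore.case = TRUE)

# word = TRUE: word boundaries are added and the comma chains patterns,
# so the split happens at the word "and" or the word "AND",
# but not at the "and" inside "sand"
string_split(y, "and, AND", word = TRUE)
```

The same behaviours can also be requested with flags placed directly in the pattern, as described in the section on generic regular expression flags.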

Value

If simplify = TRUE (default), the object returned is:

  • a character vector if x, the vector in input, is of length 1: the character vector contains the result of the split.

  • otherwise, a list of the same length as x. The ith element of the list is a character vector containing the result of the split of the ith element of x.

If simplify = FALSE, the object returned is always a list.
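
The rules above can be checked with a short sketch (the inputs are illustrative, and it assumes the stringmagic package is installed):

```r
library(stringmagic)

x1 = "a-b-c"
x2 = c("a-b", "c-d-e")

# length-1 input, simplify = TRUE (default): a plain character vector
res_vec = string_split(x1, "-")

# length-1 input, simplify = FALSE: a list with one element
res_list1 = string_split(x1, "-", simplify = FALSE)

# longer input: a list with one element per element of x
res_list2 = string_split(x2, "-")

is.character(res_vec)
is.list(res_list1)
length(res_list2) == length(x2)
```

Setting simplify = FALSE is convenient when the length of x is not known in advance, since downstream code can then always expect a list.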

Functions

  • stsplit(): Alias of string_split()

Generic regular expression flags

All stringmagic functions support generic flags in regular-expression patterns. The flags are useful to quickly give extra instructions, similarly to usual regular expression flags.

Here the syntax is "flag1, flag2/pattern". That is: flags are a comma separated list of flag-names separated from the pattern with a slash (/). Example: string_which(c("hello...", "world"), "fixed/.") returns 1. Here the flag "fixed" removes the regular expression meaning of "." which would have otherwise meant "any character". The no-flag version string_which(c("hello...", "world"), ".") returns 1:2.

Alternatively, and this is recommended, you can collate the initials of the flags instead of using a comma separated list. For example: "if/dt[" will apply the flags "ignore" and "fixed" to the pattern "dt[".

The four flags always available are: "ignore", "fixed", "word" and "magic".

  • "ignore" makes the match case insensitive. Technically, it adds the perl-flag "(?i)" at the beginning of the pattern.

  • "fixed" removes the regular expression interpretation, so that the characters ".", "$", "^", "[" (among others) lose their special meaning and are treated for what they are: simple characters.

  • "word" adds word boundaries ("\\b" in regex language) to the pattern. Further, the comma (",") becomes a word separator. Technically, "word/one, two" is treated as "\\b(one|two)\\b". Example: string_clean("Am I ambushed?", "wi/am") leads to " I ambushed?" thanks to the flags "ignore" and "word".

  • "magic" interpolates variables inside the pattern before the regex is interpreted. For example if letters = "aiou" then string_clean("My great goose!", "magic/[{letters}] => e") leads to "My greet geese!"
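
A minimal sketch of the same flags applied to string_split (the inputs and the variable name `sep` are hypothetical, and it assumes the stringmagic package is installed):

```r
library(stringmagic)

x = "v1.2.3"
# "f/" is short for "fixed/": the dot is taken literally
by_flag = string_split(x, "f/.")
# the same request made through the `fixed` argument
by_arg = string_split(x, ".", fixed = TRUE)
identical(by_flag, by_arg)

# "iw/" collates the initials of "ignore" and "word":
# split at the whole word "and", whatever its case
string_split("salt AND pepper and sand", "iw/and")

# "m/" (magic) interpolates `sep` into the pattern before it is read
sep = "-"
string_split("a-b-c", "m/{sep}")
```

Flags keep the whole instruction inside the pattern string, which can be handy when patterns are stored in or passed around as plain character values.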

Examples


time = "This is the year 2024."

# we break the sentence
string_split(time, " ")
#> [1] "This"  "is"    "the"   "year"  "2024."

# simplify = FALSE leads to a list
string_split(time, " ", simplify = FALSE)
#> [[1]]
#> [1] "This"  "is"    "the"   "year"  "2024."
#> 

# let's break at "is"
string_split(time, "is")
#> [1] "Th"              " "               " the year 2024."

# now breaking at the word "is"
# NOTE: we use the flag `word` (`w/`)
string_split(time, "w/is")
#> [1] "This "           " the year 2024."

# same but using a pattern from a variable
# NOTE: we use the `magic` flag
pat = "is"
string_split(time, "mw/{pat}")
#> [1] "This "           " the year 2024."