Extracts the first match, or all matches, of a pattern from each element of a character vector.
string_extract(
x,
pattern,
single = FALSE,
simplify = TRUE,
fixed = FALSE,
ignore.case = FALSE,
word = FALSE,
unlist = FALSE,
envir = parent.frame()
)
stextract(
x,
pattern,
single = FALSE,
simplify = TRUE,
fixed = FALSE,
ignore.case = FALSE,
word = FALSE,
unlist = FALSE,
envir = parent.frame()
)
x: A character vector.
pattern: A character scalar. It represents the pattern to be extracted from x. By default this is a regular expression. You can use flags in the pattern in the form "flag1, flag2/pattern". Available flags are ignore (case), fixed (no regex), word (add word boundaries), single (select only the first element), and magic (add interpolation with {}). Example: if pattern = "ignore/hello" and x = "Hello world", the extracted text is "Hello".
Shortcut: use the first letters of the flags. Ex: "iw/one" will extract the word "one" (flags 'ignore' + 'word').
single: Logical scalar, default is FALSE. If TRUE, only the first match that is detected is returned for each element. Note that in that case, the function returns a character vector of the same length as the input vector.
simplify: Logical scalar, default is TRUE. If TRUE, then when the input vector x is of length 1, a character vector is returned instead of a list.
fixed: Logical scalar, default is FALSE. Whether to trigger a fixed search instead of a regular expression search (default).
ignore.case: Logical scalar, default is FALSE. If TRUE, a case-insensitive search is triggered.
word: Logical scalar, default is FALSE. If TRUE, then a) word boundaries are added to the pattern, and b) patterns can be chained by separating them with a comma, in which case they are combined with a logical OR.
Example: if word = TRUE, then pattern = "The, mountain" will select strings containing either the word 'The' or the word 'mountain'.
unlist: Logical scalar, default is FALSE. If TRUE, the function unlist is applied to the resulting list, leading to a character vector in output (instead of a list).
envir: Environment in which to evaluate the interpolations if the flag "magic" is provided. Default is parent.frame().
The object returned by this function can be a list or a character vector.
If single = TRUE, a character vector is returned, containing the value of the first match. If no match is found, an empty string is returned.
If single = FALSE (the default) and simplify = TRUE (default), the object returned is:
a character vector if x, the vector in input, is of length 1: the character vector contains all the matches and is of length 0 if no match is found.
a list of the same length as x otherwise. The ith element of the list is a character vector of the matches for the ith element of x.
If single = FALSE (default) and simplify = FALSE, the object returned is always a list.
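The return shapes described above can be mimicked with base R. The following is a minimal sketch, not stringmagic's implementation: extract_sketch is a hypothetical helper introduced only for illustration, and it handles plain regular expressions without any flag.

```r
# Hypothetical helper (not part of stringmagic): reproduces the documented
# return shapes using only base R.
extract_sketch <- function(x, pattern, single = FALSE, simplify = TRUE) {
  # one character vector of matches per element of x (character(0) if none)
  all_matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

  if (single) {
    # first match per element, "" when there is no match
    first <- vapply(
      all_matches,
      function(m) if (length(m) > 0) m[1] else "",
      character(1)
    )
    return(first)
  }

  if (simplify && length(x) == 1) {
    # length-1 input: return the matches directly as a character vector
    return(all_matches[[1]])
  }

  all_matches
}

x_demo <- c("Mazda RX4", "710")
extract_sketch(x_demo, "[[:alpha:]]+", single = TRUE)  # character vector, same length as x_demo
extract_sketch(x_demo, "[[:alpha:]]+")                 # list, same length as x_demo
extract_sketch("710", "[[:alpha:]]+")                  # character vector of length 0
```

The sketch puts the single = TRUE branch first because, as documented, it always yields a character vector regardless of simplify.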
stextract(): Alias to string_extract
All stringmagic functions support generic flags in regular-expression patterns. The flags are useful to quickly give extra instructions, similarly to usual regular expression flags.
Here the syntax is "flag1, flag2/pattern". That is: flags are a comma separated list of flag-names separated from the pattern with a slash (/). Example: string_which(c("hello...", "world"), "fixed/.") returns 1. Here the flag "fixed" removes the regular expression meaning of "." which would have otherwise meant "any character". The no-flag version string_which(c("hello...", "world"), ".") returns 1:2.
Alternatively, and this is recommended, you can collate the initials of the flags instead of using a comma separated list. For example: "if/dt[" will apply the flags "ignore" and "fixed" to the pattern "dt[".
The four flags always available are: "ignore", "fixed", "word" and "magic".
"ignore" makes the search case-insensitive. Technically, it adds the perl-flag "(?i)" at the beginning of the pattern.
"fixed" removes the regular expression interpretation, so that the characters ".", "$", "^", "[" (among others) lose their special meaning and are treated for what they are: simple characters.
"word" adds word boundaries ("\\b" in regex language) to the pattern. Further, the comma (",") becomes a word separator. Technically, "word/one, two" is treated as "\b(one|two)\b". Example: string_clean("Am I ambushed?", "wi/am") leads to " I ambushed?" thanks to the flags "ignore" and "word".
"magic" allows interpolating variables inside the pattern before regex interpretation. For example, if letters = "aiou", then string_clean("My great goose!", "magic/[{letters}] => e") leads to "My greet geese!".
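To make the flag semantics concrete, here is a rough base R counterpart of each example above. It is a sketch of what the flags stand for, assuming plain grepl, gsub and regmatches calls; it does not show how stringmagic processes patterns internally, and the variable names are chosen only for illustration.

```r
x <- c("hello...", "world")

# "fixed/.": the dot is a literal character
fixed_hits <- which(grepl(".", x, fixed = TRUE))
# no flag: the dot means "any character"
regex_hits <- which(grepl(".", x))

# "ignore/hello": the perl flag (?i) is prepended to the pattern
ignore_match <- regmatches(
  "Hello world",
  regexpr("(?i)hello", "Hello world", perl = TRUE)
)

# "word/one, two": word boundaries around an alternation
txt <- "one stone, two networks"
word_matches <- regmatches(
  txt,
  gregexpr("\\b(one|two)\\b", txt, perl = TRUE)
)[[1]]

# "wi/am": flags "word" + "ignore" combined, here used for removal
cleaned <- gsub("(?i)\\bam\\b", "", "Am I ambushed?", perl = TRUE)

# "magic/[{letters}] => e": paste0 stands in for the {} interpolation
letters_set <- "aiou"
interpolated <- gsub(paste0("[", letters_set, "]"), "e", "My great goose!")

print(fixed_hits)    # 1
print(regex_hits)    # 1 2
print(ignore_match)  # "Hello"
print(word_matches)  # "one" "two"
print(cleaned)       # " I ambushed?"
print(interpolated)  # "My greet geese!"
```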
cars = head(row.names(mtcars))
# Let's extract the first word:
string_extract(cars, "\\w+", single = TRUE)
#> [1] "Mazda" "Mazda" "Datsun" "Hornet" "Hornet" "Valiant"
# same using flags
string_extract(cars, "s/\\w+")
#> [1] "Mazda" "Mazda" "Datsun" "Hornet" "Hornet" "Valiant"
# extract all words composed of only letters
# NOTE: we use the flag word (`w/`)
string_extract(cars, "w/[[:alpha:]]+")
#> [[1]]
#> [1] "Mazda"
#>
#> [[2]]
#> [1] "Mazda" "Wag"
#>
#> [[3]]
#> [1] "Datsun"
#>
#> [[4]]
#> [1] "Hornet" "Drive"
#>
#> [[5]]
#> [1] "Hornet" "Sportabout"
#>
#> [[6]]
#> [1] "Valiant"
#>
# version without flag:
string_extract(cars, "\\b[[:alpha:]]+\\b")
#> [[1]]
#> [1] "Mazda"
#>
#> [[2]]
#> [1] "Mazda" "Wag"
#>
#> [[3]]
#> [1] "Datsun"
#>
#> [[4]]
#> [1] "Hornet" "Drive"
#>
#> [[5]]
#> [1] "Hornet" "Sportabout"
#>
#> [[6]]
#> [1] "Valiant"
#>
# If a vector of length 1 => a vector is returned
greet = "Hi Tom, how's Mary doing?"
string_extract(greet, "w/[[:upper:]]\\w+")
#> [1] "Hi" "Tom" "Mary"
# version with simplify = FALSE => a list is returned
string_extract(greet, "w/[[:upper:]]\\w+", simplify = FALSE)
#> [[1]]
#> [1] "Hi" "Tom" "Mary"
#>