vignettes/ref_regex_flags.rmd
All functions in stringmagic accept optional regular expression (regex) flags wherever a regular expression is expected. The idea is similar to standard regex flags, but the flags differ in name and effect.
Use "flag1, flag2/regex" to add the flags flag1 and flag2 to the regular expression regex. For example, "ignore, fixed/dt[" will add the flags ignore and fixed to the regex dt[.
Alternatively, use only the initials of the flags. Hence, "if/dt[" would also add the flags ignore and fixed.
If the regex does not contain a slash (/), no flags are added. If your regex should contain a slash, see the section on escaping.
Ex: let’s find lines containing "dt[":
code = c("DT = as.data.table(iris)",
"DT[, .(pl_sl = string_magic('PL/SL = {Petal.Length / Sepal.Length}')]")
string_get(code, "if/dt[")
#> [1] "DT[, .(pl_sl = string_magic('PL/SL = {Petal.Length / Sepal.Length}')]"
There are 6 flags: ignore, fixed, word, magic, total and single.
The flag "ignore" leads to a case-insensitive search.
Ex: let’s extract words starting with the last letters of the alphabet.
unhappy = "Rumble thy bellyful! Spit, fire! spout, rain!
Nor rain, wind, thunder, fire are my daughters.
I tax not you, you elements, with unkindness.
I never gave you kingdom, call'd you children,
You owe me no subscription. Then let fall
Your horrible pleasure. Here I stand your slave,
A poor, infirm, weak, and despis'd old man."
# the ignore flag also retains words starting with
# upper-case letters
# ex: getting words starting with the letters 'r' to 'z'
cat_magic("{'ignore/\\b[r-z]\\w+'extract, c, 60 swidth ? unhappy}")
#> Rumble thy Spit spout rain rain wind thunder tax you you
#> with unkindness you you You subscription Then Your stand
#> your slave weak
Technically, the perl expression "(?i)" is added at the beginning of the pattern.
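The effect of the "(?i)" prefix can be reproduced in base R. This is a minimal sketch that only uses base::grepl with perl = TRUE; the vector `words` is made up for the illustration and is not part of stringmagic.

```r
# toy vector, introduced here for illustration only
words = c("Rumble", "rain", "bellyful")

# "(?i)" switches on case-insensitive matching for the rest of the pattern
with_flag = grepl("(?i)^[r-z]", words, perl = TRUE)
print(with_flag)
#> [1]  TRUE  TRUE FALSE

# without "(?i)", the upper-case 'R' of "Rumble" is not matched by [r-z]
without_flag = grepl("^[r-z]", words, perl = TRUE)
print(without_flag)
#> [1] FALSE  TRUE FALSE
```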
The flag "fixed" removes any special regular expression meaning from the pattern, and treats it as verbatim.
Ex: let’s fix the equation by changing the operators.
x = "50 + 5 * 5 = 40"
string_clean(x, "f/+", "f/*", replacement = "-")
#> [1] "50 - 5 - 5 = 40"
# Without the fixed flag, we would get an error: '+' and '*' have a
# special meaning in regular expressions (they are quantifiers)
# and expect something before them
# Here's the error
try(string_clean(x, "+", "*", replacement = "-"))
#> Error : in string_clean(x, "+", "*", replacement = "-"):
#> CONTEXT: evaluation of "+"
#> pattern = "+"
#> replacement = "-"
#> EXPECTATION: the pattern must be a valid regular expression
#> PROBLEM: `gsub` led to an error, see below:
#> invalid regular expression '+'
#> PCRE pattern compilation error
#> 'quantifier does not follow a repeatable item'
#> at '+'
Technically, if "fixed" is the only flag, then the functions base::grepl or base::gsub are run with the argument fixed = TRUE. If the flags "ignore" or "word" are also present, the pattern is nested into the perl boundaries \\Q and \\E, which strip any special meaning from the pattern.
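Both mechanisms can be tried directly in base R. The sketch below is an approximation of the behavior described above, not the package's internal code; the string `code_line` is an assumption made for the example.

```r
x = "50 + 5 * 5 = 40"

# "fixed" alone: the pattern is matched verbatim thanks to fixed = TRUE
step1 = gsub("+", "-", x, fixed = TRUE)
fixed_only = gsub("*", "-", step1, fixed = TRUE)
print(fixed_only)
#> [1] "50 - 5 - 5 = 40"

# "fixed" combined with "ignore": fixed = TRUE cannot ignore case, so the
# pattern is quoted with \\Q...\\E and handed to the perl engine instead
code_line = "DT[, x := 1]"  # hypothetical input
fixed_ignore = grepl("(?i)\\Qdt[\\E", code_line, perl = TRUE)
print(fixed_ignore)
#> [1] TRUE
```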
The flag "word" adds word boundaries around the pattern and accepts comma-separated enumerations of words.
The logic of accepting comma-separated enumerations is to increase readability. For example, with the flag "word", "is, are, were" is equivalent to "\\b(is|are|were)\\b".
Ex: we hide a few words from Alfred de Vigny’s poem.
le_mont_des_oliviers = "S'il est vrai qu'au Jardin sacré des Écritures,
Le Fils de l'homme ai dit ce qu'on voit rapporté ;
Muet, aveugle et sourd au cri des créatures,
Si le Ciel nous laissa comme un monde avorté,
Alors le Juste opposera le dédain à l'absence
Et ne répondra plus que par un froid silence
Au silence éternel de la Divinité."
# we hide a few words from this poem
string_magic("{'wi/et, le, il, au, des?, ce => _'replace ? le_mont_des_oliviers}")
#> [1] "S'_ est vrai qu'_ Jardin sacré _ Écritures,\n_ Fils _ l'homme ai dit _ qu'on voit rapporté ;\nMuet, aveugle _ sourd _ cri _ créatures,\nSi _ Ciel nous laissa comme un monde avorté,\nAlors _ Juste opposera _ dédain à l'absence\n_ ne répondra plus que par un froid silence\n_ silence éternel _ la Divinité."
Technically, the pattern is first split with respect to ",[ \t\n]+", then all elements are collapsed with "|". If the flag "fixed" is also present, each element is first wrapped into "\\Q" and "\\E". Finally, parentheses (to enable capture) and word boundaries ("\\b") are added on both sides.
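The steps above can be sketched in a few lines of base R. The helper `word_pattern` is hypothetical (it does not exist in stringmagic) and only mirrors the description: split, optionally quote, collapse, then wrap.

```r
# hypothetical helper mirroring the steps described above
word_pattern = function(pattern, fixed = FALSE) {
  # 1) split the enumeration on a comma followed by whitespace
  words = strsplit(pattern, ",[ \t\n]+")[[1]]
  # 2) with "fixed", quote each element so that it is matched verbatim
  if (fixed) words = paste0("\\Q", words, "\\E")
  # 3) collapse with "|", then add parentheses and word boundaries
  paste0("\\b(", paste(words, collapse = "|"), ")\\b")
}

pat = word_pattern("is, are, were")
print(pat)
#> [1] "\\b(is|are|were)\\b"

# "This" is left alone: there is no word boundary before its "is"
hidden = gsub(pat, "_", "This is where we were", perl = TRUE)
print(hidden)
#> [1] "This _ where we _"
```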
Use the "magic" flag to interpolate variables inside the regular expression before the regex is evaluated.
Ex: interpolating variables inside regular expressions.
vowels = "aeiouy"
# let's keep only the vowels
# we want the pattern: "[^aeiouy]"
lmb = "'Tis safer to be that which we destroy
Than by destruction dwell in doubtful joy."
string_replace(lmb, "magic/[^{vowels}]", "_")
#> [1] "__i___a_e___o__e___a____i____e__e___oy___a___y__e___u__io____e___i___ou___u___oy_"
#
# Illustration of `string_magic` operations before regex application
#
cars = row.names(mtcars)
# Which of these models contain a digit?
models = c("Toyota", "Hornet", "Porsche")
# we want the pattern "(Toyota|Hornet|Porsche).+\\d"
# we collapse the models with a pipe using '|'c
string_get(cars, "m/({'|'c ? models}).+\\d")
#> [1] "Hornet 4 Drive" "Porsche 914-2"
# alternative: same as above but we first comma-split the vector
models_comma = "Toyota, Hornet, Porsche"
string_get(cars, "m/({S, '|'c ? models_comma}).+\\d")
#> [1] "Hornet 4 Drive" "Porsche 914-2"
#
# Interpolation does not apply to regex-specific curly brackets
#
# We delete only successions of 2+ vowels
# {2,} has a regex meaning and is not interpolated:
string_replace(lmb, "magic/[{vowels}]{2,}", "_")
#> [1] "'Tis safer to be that which we destr_\nThan by destruct_n dwell in d_btful j_."
Technically, the algorithm does not interpolate curly brackets having a regular expression meaning. An expression of the form "{a, b}", with "a" and "b" digits, means that the previous symbol is repeated at least "a" times and at most "b" times. The variables are fetched in the calling environment. To fetch them from a different location, you can use the argument envir.
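For comparison, here is what the same patterns look like when the interpolation is done by hand in base R with paste0. This is only a sketch of the equivalent manual work, using the first line of the quote; it says nothing about how the package performs the interpolation internally.

```r
vowels = "aeiouy"
line1 = "'Tis safer to be that which we destroy"

# manual interpolation: builds the pattern "[^aeiouy]"
pat_non_vowels = paste0("[^", vowels, "]")
only_vowels = gsub(pat_non_vowels, "_", line1)
print(only_vowels)
#> [1] "__i___a_e___o__e___a____i____e__e___oy"

# "{2,}" is a quantifier: it is pasted as is and reaches the regex engine
pat_2plus = paste0("[", vowels, "]{2,}")
no_double = gsub(pat_2plus, "_", line1)
print(no_double)
#> [1] "'Tis safer to be that which we destr_"
```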
The flag "total" is only available to functions performing a replacement. In that case, if a pattern is detected, the full character string is replaced (instead of just the pattern).
Ex: let’s replace a few car models.
cars_small = head(row.names(mtcars))
print(cars_small)
#> [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
#> [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
string_replace(cars_small, "ti/mazda", "Mazda: sold out!")
#> [1] "Mazda: sold out!" "Mazda: sold out!" "Datsun 710"
#> [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
On top of this, the "total" flag allows you to perform logical operations across several regex patterns. More information is available in the dedicated vignette. In a nutshell, you can write "pat1 & !pat2 | pat3", with "patx" regular expression patterns. This means: contains pat1 and does not contain pat2, or contains pat3.
Ex: detect car brands with a digit and no ‘e’.
cars_small = head(row.names(mtcars))
print(cars_small)
#> [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
#> [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
string_replace(cars_small, "total, ignore/\\d & !e", "I don't like that brand!")
#> [1] "I don't like that brand!" "I don't like that brand!"
#> [3] "I don't like that brand!" "Hornet 4 Drive"
#> [5] "Hornet Sportabout" "Valiant"
Technically, instead of using gsub to replace the pattern, string_is is used to detect which elements contain the pattern. Each element containing the pattern is then substituted with the replacement.
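The detect-then-overwrite logic can be imitated in base R, with grepl standing in for string_is. This is a sketch under that assumption; the variable names are introduced for the example.

```r
cars_small = head(row.names(mtcars))

# equivalent of "ti/mazda": detect case-insensitively, overwrite whole elements
sold_out = cars_small
sold_out[grepl("mazda", sold_out, ignore.case = TRUE)] = "Mazda: sold out!"
print(sold_out[1:3])
#> [1] "Mazda: sold out!" "Mazda: sold out!" "Datsun 710"

# equivalent of "total, ignore/\\d & !e": a digit AND no 'e' (any case)
has_digit = grepl("\\d", cars_small, perl = TRUE)
has_e = grepl("e", cars_small, ignore.case = TRUE)
disliked = cars_small
disliked[has_digit & !has_e] = "I don't like that brand!"
print(disliked[4:6])
#> [1] "Hornet 4 Drive"    "Hornet Sportabout" "Valiant"
```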
The flag "single" is only available to functions performing a replacement. It allows only a single substitution to take place. Said differently, only the first match in each string is replaced.
Ex: single substitutions.
encounter = string_vec("Hi Cyclops., Hi you. What's your name?, Odysseus is my name.")
# we only remove the first word
string_replace(encounter, "single/\\w+", "...")
#> [1] "... Cyclops." "... you. What's your name?"
#> [3] "... is my name."
Technically, the function base::sub is used instead of base::gsub.
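The difference between the two base functions is easy to see on the same input. The vector below is written out by hand rather than built with string_vec, so that the sketch only depends on base R.

```r
encounter = c("Hi Cyclops.", "Hi you. What's your name?", "Odysseus is my name.")

# sub: only the first match of each element is replaced
first_only = sub("\\w+", "...", encounter, perl = TRUE)
print(first_only)
#> [1] "... Cyclops."               "... you. What's your name?"
#> [3] "... is my name."

# gsub: every match is replaced
all_words = gsub("\\w+", "...", encounter, perl = TRUE)
print(all_words[1])
#> [1] "... ...."
```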
If your regular expression contains a slash ("/"), this will conflict with the parsing of the optional flags. As soon as a / is present in a pattern, the algorithm throws an error if the expected flags are not written correctly.
To use a slash in the regex without adding flags, there is only one solution: escape the "/" with a double backslash.
Ex: let’s invert the numerator and denominator of a division.
eq = "5/x = 3/2"
# escaping with backslashes
string_replace(eq, "(\\w)\\/(\\w)", "\\2/\\1")
#> [1] "x/5 = 2/3"
Warning: to avoid unexpected behavior when applying regular expressions to file paths, the flags algorithm is very strict. Every time a pattern contains a slash that is not associated with valid flags, an error is thrown unless that slash is escaped.
path = "my/path/to/the/file.tex"
# we keep the directory only
# first try: an error because flags are expected before the first '/'
try(string_replace(path, "/[^/]+$"))
#> Error : in string_replace(path, "/[^/]+$"):
#> Problem found in the regex pattern "/[^/]+$".
#> PROBLEM: The first flag is empty. To use a regular slash, escape it with
#> a double backslah.
#>
#> INFO: regex flags write 'flag1, flag2/pattern' where a flag = lower case
#> letters.
#> Ex: 'ignore, fixed/dt[' leads to the flags 'ignore' and 'fixed' for
#> the pattern 'dt['.
#> Or simply collate the first letters of the flags: 'if/dt['.
#> To use '/' without flag parsing, escape it. Ex: '\\/usr': leads to
#> '/usr' without any flag on.
# after escaping: works (only the first slash requires escaping)
string_replace(path, "\\/[^/]+$")
#> [1] "my/path/to/the"
# if we did add a flag, we would need to double the slash
# compare...
string_replace(path, "i//[^/]+$")
#> [1] "my/path/to/the"
# to...
string_replace(path, "i/[^/]+$")
#> [1] "my/path/to/the/"
Hence, if you need to write path-related regexes, you will very likely need to escape the first slash.
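The two results in the comparison above can be reproduced in base R, which has no flag parsing. In this sketch, the second call uses the pattern that is left once the first slash of "i/[^/]+$" has been consumed as the flag separator; the case-insensitive flag is omitted because it does not change the match here.

```r
path = "my/path/to/the/file.tex"

# pattern with its leading slash: the last path component is removed
dir_only = sub("/[^/]+$", "", path)
print(dir_only)
#> [1] "my/path/to/the"

# pattern without the leading slash: the trailing "/" survives
dir_slash = sub("[^/]+$", "", path)
print(dir_slash)
#> [1] "my/path/to/the/"
```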