Regular expression flags

All functions in stringmagic accept optional regular expression (regex) flags wherever a regular expression is expected. The idea is similar to standard regex flags, but the flags differ in name and effect.

Regex flags: Syntax

Use "flag1, flag2/regex" to add the flags flag1 and flag2 to the regular expression regex. For example, "ignore, fixed/dt[" adds the flags ignore and fixed to the regex dt[.

Alternatively, use only the initials of the flags. Hence, "if/dt[" would also add the flags ignore and fixed.

If the regex does not contain a slash (/), no flags are added. If your regex should contain a slash, see the section on escaping.

Ex: let’s find lines containing "dt[":

code = c("DT = as.data.table(iris)", 
         "DT[, .(pl_sl = string_magic('PL/SL = {Petal.Length / Sepal.Length}')]")

string_get(code, "if/dt[")
#> [1] "DT[, .(pl_sl = string_magic('PL/SL = {Petal.Length / Sepal.Length}')]"
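
To make the syntax concrete, here is a minimal base R sketch of how a "flags/regex" string could be split. The function parse_flags_sketch is hypothetical and is not stringmagic's parser: it only covers the two notations described above (full names separated by commas, or collated initials) and returns patterns without a valid flag prefix unchanged, whereas the package itself may throw an error in such cases.

```r
# Hypothetical helper, for illustration only (not part of stringmagic)
parse_flags_sketch = function(pattern) {
  all_flags = c("ignore", "fixed", "word", "magic", "total", "single")
  # lower-case flags, optionally comma separated, then the first slash
  rx = "(?s)^([a-z]+(?:, *[a-z]+)*)/(.*)$"
  m = regmatches(pattern, regexec(rx, pattern, perl = TRUE))[[1]]
  if (length(m) == 0) {
    # no flag prefix found: the whole string is the regex
    return(list(flags = character(0), regex = pattern))
  }
  tokens = strsplit(m[2], ", *")[[1]]
  if (length(tokens) == 1 && !tokens %in% all_flags) {
    # collated initials, e.g. "if" means ignore + fixed
    tokens = strsplit(tokens, "")[[1]]
  }
  # each flag is identified by its (unique) first letter
  flags = all_flags[match(substr(tokens, 1, 1), substr(all_flags, 1, 1))]
  if (anyNA(flags)) stop("unknown flag in: ", m[2])
  list(flags = flags, regex = m[3])
}

long_form  = parse_flags_sketch("ignore, fixed/dt[")
short_form = parse_flags_sketch("if/dt[")
no_flag    = parse_flags_sketch("dt[")
```

Under these assumptions, long_form and short_form both hold the flags ignore and fixed with the regex dt[, while no_flag has no flags at all.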

Regex flags: Reference

There are 6 flags:

ignore: always available
fixed: always available
word: always available
magic: always available
total: only available in functions performing a replacement
single: only available in functions performing a replacement

ignore

The flag "ignore" leads to a case-insensitive search.

Ex: let’s extract words starting with the last letters of the alphabet.

unhappy = "Rumble thy bellyful! Spit, fire! spout, rain!
Nor rain, wind, thunder, fire are my daughters.
I tax not you, you elements, with unkindness.
I never gave you kingdom, call'd you children,
You owe me no subscription. Then let fall
Your horrible pleasure. Here I stand your slave,
A poor, infirm, weak, and despis'd old man."

# the ignore flag lets us also retain words starting with
# upper-case letters
# ex: getting words starting with the letters 'r' to 'z'
cat_magic("{'ignore/\\b[r-z]\\w+'extract, c, 60 swidth ? unhappy}")
#> Rumble thy Spit spout rain rain wind thunder tax you you
#> with unkindness you you You subscription Then Your stand
#> your slave weak

Technically, the perl expression "(?i)" is added at the beginning of the pattern.
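
The effect of the "(?i)" prefix can be checked directly in base R. The following sketch does not use stringmagic; it only calls base::grepl with perl = TRUE on a small made-up vector.

```r
x = c("DT[, 1]", "dt[, 1]", "data")

# case-sensitive search: only the lower-case "dt[" matches
strict = grepl("dt\\[", x, perl = TRUE)

# "(?i)" at the start of the pattern turns on case-insensitive matching
relaxed = grepl("(?i)dt\\[", x, perl = TRUE)

print(strict)
print(relaxed)
```

Here strict matches only the second element, while relaxed matches the first two.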

fixed

The flag "fixed" removes any special regular expression meaning from the pattern, and treats it as verbatim.

Ex: let’s fix the equation by changing the operators.

x = "50 + 5 * 5 = 40"
string_clean(x, "f/+", "f/*", replacement = "-")
#> [1] "50 - 5 - 5 = 40"

# Without the fixed flag, we would have gotten an error since '+' and '*'
# have a special meaning in regular expressions: they are quantifiers
# and expect something before them

# Here's the error
try(string_clean(x, "+", "*", replacement = "-"))
#> Error : in string_clean(x, "+", "*", replacement = "-"): 
#> CONTEXT: evaluation of "+"
#>          pattern     = "+" 
#>          replacement = "-"
#> EXPECTATION: the pattern must be a valid regular expression
#> PROBLEM: `gsub` led to an error, see below:
#> invalid regular expression '+'
#> PCRE pattern compilation error
#>  'quantifier does not follow a repeatable item'
#>  at '+'

Technically, if "fixed" is the only flag, then the functions base::grepl or base::gsub are run with the argument fixed = TRUE. If the flags "ignore" or "word" are also present, the pattern is instead wrapped in the perl delimiters \\Q and \\E, which strip any special meaning from the pattern.
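
The two mechanisms can be compared in base R. This is a sketch with base::gsub only, not a reproduction of stringmagic's internals. It also shows why \\Q and \\E are useful: in base R, fixed = TRUE cannot be combined with case-insensitive matching, while a quoted perl pattern can.

```r
x = "50 + 5 * 5 = 40"

# fixed = TRUE: the pattern is matched verbatim, without any regex parsing
minus_fixed = gsub("+", "-", x, fixed = TRUE)

# \\Q...\\E: same verbatim treatment, but inside a perl regex
minus_quoted = gsub("\\Q+\\E", "-", x, perl = TRUE)

# the quoted form can still be combined with "(?i)"
y = "A+B and a+b"
both_cases = gsub("(?i)\\Qa+b\\E", "_", y, perl = TRUE)
```

Both minus_fixed and minus_quoted give "50 - 5 * 5 = 40", and both_cases gives "_ and _".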

word

The flag "word":

adds word boundaries to the pattern
accepts comma-separated enumerations of words which are concatenated with a logical ‘or’

The logic of accepting comma-separated enumerations is to increase readability. For example, with the flag "word", "is, are, were" is equivalent to "\\b(is|are|were)\\b".

Ex: we hide a few words from Alfred de Vigny’s poem.

le_mont_des_oliviers = "S'il est vrai qu'au Jardin sacré des Écritures,
Le Fils de l'homme ai dit ce qu'on voit rapporté ;
Muet, aveugle et sourd au cri des créatures,
Si le Ciel nous laissa comme un monde avorté,
Alors le Juste opposera le dédain à l'absence
Et ne répondra plus que par un froid silence
Au silence éternel de la Divinité."

# we hide a few words from this poem
string_magic("{'wi/et, le, il, au, des?, ce => _'replace ? le_mont_des_oliviers}")
#> [1] "S'_ est vrai qu'_ Jardin sacré _ Écritures,\n_ Fils _ l'homme ai dit _ qu'on voit rapporté ;\nMuet, aveugle _ sourd _ cri _ créatures,\nSi _ Ciel nous laissa comme un monde avorté,\nAlors _ Juste opposera _ dédain à l'absence\n_ ne répondra plus que par un froid silence\n_ silence éternel _ la Divinité."

Technically, first the pattern is split with respect to ",[ \t\n]+", then all elements are collapsed with "|". If the flag "fixed" was also present, each element is first wrapped into "\\Q" and "\\E". Finally, we add parentheses (to enable capture) and word boundaries ("\\b") on both sides.
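
The steps just described can be sketched in a few lines of base R. The function build_word_pattern is a hypothetical name used for illustration; it follows the description above but is not the package's own code.

```r
# Hypothetical helper mirroring the three steps: split, collapse, wrap
build_word_pattern = function(pattern, fixed = FALSE) {
  # 1) split the enumeration on commas followed by white space
  words = strsplit(pattern, ",[ \t\n]+")[[1]]
  # 2) with the flag "fixed", quote each element verbatim
  if (fixed) words = paste0("\\Q", words, "\\E")
  # 3) collapse with "|", add parentheses and word boundaries
  paste0("\\b(", paste(words, collapse = "|"), ")\\b")
}

pat = build_word_pattern("is, are, were")
print(pat)

# "This" contains "is" but not as a full word, so it is left untouched
hidden = gsub(pat, "_", "This is what we were", perl = TRUE)
print(hidden)
```

With this sketch, pat is the pattern \b(is|are|were)\b and hidden is "This _ what we _".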

magic

Use the "magic" flag to interpolate variables inside the regular expression before the regex is evaluated.

Ex: interpolating variables inside regular expressions.

vowels = "aeiouy"
# let's keep only the vowels
# we want the pattern: "[^aeiouy]"
lmb = "'Tis safer to be that which we destroy
Than by destruction dwell in doubtful joy."
string_replace(lmb, "magic/[^{vowels}]", "_")
#> [1] "__i___a_e___o__e___a____i____e__e___oy___a___y__e___u__io____e___i___ou___u___oy_"

#
# Illustration of `string_magic` operations before regex application
#

cars = row.names(mtcars)
# Which of these models contain a digit?
models = c("Toyota", "Hornet", "Porsche")
# we want the pattern "(Toyota|Hornet|Porsche).+\\d"
# we collapse the models with a pipe using '|'c
string_get(cars, "m/({'|'c ? models}).+\\d")
#> [1] "Hornet 4 Drive" "Porsche 914-2"

# alternative: same as above but we first comma-split the vector
models_comma = "Toyota, Hornet, Porsche"
string_get(cars, "m/({S, '|'c ? models_comma}).+\\d")
#> [1] "Hornet 4 Drive" "Porsche 914-2"

#
# Interpolation does not apply to regex-specific curly brackets
#

# We delete only successions of 2+ vowels
# {2,} has a regex meaning and is not interpolated:
string_replace(lmb, "magic/[{vowels}]{2,}", "_")
#> [1] "'Tis safer to be that which we destr_\nThan by destruct_n dwell in d_btful j_."

Technically, the algorithm does not interpolate curly brackets that have a regular expression meaning. An expression of the form "{a,b}", with "a" and "b" numbers, is a quantifier: it means that the previous item is repeated at least "a" times and at most "b" times. The variables are fetched in the calling environment. To fetch them from a different location, you can use the argument envir.
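
As an illustration of the distinction between interpolated brackets and quantifiers, here is a deliberately simplified base R sketch. The function interpolate_sketch is hypothetical: it only replaces "{name}" when name looks like a plain variable name holding a single string, and it supports none of the string_magic operations (such as '|'c). It is not the algorithm used by the package.

```r
# Hypothetical helper: replaces {name} by the value of the variable `name`
interpolate_sketch = function(pattern, envir = parent.frame()) {
  force(envir)
  # only brackets containing a plain variable name are candidates,
  # hence quantifiers like {2,} or {1,3} are never touched
  m = gregexpr("\\{[A-Za-z.][A-Za-z0-9._]*\\}", pattern, perl = TRUE)
  found = regmatches(pattern, m)[[1]]
  values = vapply(found, function(tok) {
    name = substr(tok, 2, nchar(tok) - 1)
    as.character(get(name, envir = envir))
  }, character(1), USE.NAMES = FALSE)
  regmatches(pattern, m) = list(values)
  pattern
}

vowels = "aeiouy"
pat = interpolate_sketch("[{vowels}]{2,}")
print(pat)

res = gsub(pat, "_", "doubtful joy", perl = TRUE)
print(res)
```

In this sketch, pat becomes "[aeiouy]{2,}" (the quantifier is left as is) and res is "d_btful j_".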

total

The flag "total" is only available to functions performing a replacement. In that case, if a pattern is detected, the full character string is replaced (instead of just the pattern).

Ex: let’s replace a few car models.

cars_small = head(row.names(mtcars))
print(cars_small)
#> [1] "Mazda RX4"         "Mazda RX4 Wag"     "Datsun 710"       
#> [4] "Hornet 4 Drive"    "Hornet Sportabout" "Valiant"

string_replace(cars_small, "ti/mazda", "Mazda: sold out!")
#> [1] "Mazda: sold out!"  "Mazda: sold out!"  "Datsun 710"       
#> [4] "Hornet 4 Drive"    "Hornet Sportabout" "Valiant"

On top of this, the "total" flag allows you to perform logical operations across several regex patterns. See the dedicated vignette for more information. In a nutshell, you can write "pat1 & !pat2 | pat3" with "patx" regular expression patterns. This means: contains pat1 and does not contain pat2, or contains pat3.

Ex: detect car brands with a digit and no ‘e’.

cars_small = head(row.names(mtcars))
print(cars_small)
#> [1] "Mazda RX4"         "Mazda RX4 Wag"     "Datsun 710"       
#> [4] "Hornet 4 Drive"    "Hornet Sportabout" "Valiant"

string_replace(cars_small, "total, ignore/\\d & !e", "I don't like that brand!")
#> [1] "I don't like that brand!" "I don't like that brand!"
#> [3] "I don't like that brand!" "Hornet 4 Drive"          
#> [5] "Hornet Sportabout"        "Valiant"

Technically, instead of using gsub to replace the pattern, string_is is used to detect which elements contain the pattern. Each element containing the pattern is then replaced in full by the replacement.
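
The detect-then-replace logic can be mimicked in base R. In the sketch below, base::grepl stands in for string_is (an assumption made to keep the example free of package calls), and the logical operators are applied to the detection results.

```r
cars_small = head(row.names(mtcars))

# detection step: one logical vector per pattern
has_digit = grepl("\\d", cars_small, perl = TRUE)
has_e     = grepl("(?i)e", cars_small, perl = TRUE)

# replacement step: matching elements are replaced in full
res = cars_small
res[has_digit & !has_e] = "I don't like that brand!"
print(res)
```

The first three elements (the two Mazdas and the Datsun) are replaced in full, and the other three are left unchanged.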

single

The flag "single" is only available to functions performing a replacement. It allows only a single substitution to take place. Said differently, only the first replacement is performed.

Ex: single substitutions.

encounter = string_vec("Hi Cyclops., Hi you. What's your name?, Odysseus is my name.")
# we only remove the first word
string_replace(encounter, "single/\\w+", "...")
#> [1] "... Cyclops."               "... you. What's your name?"
#> [3] "... is my name."

Technically, the function base::sub is used instead of base::gsub.
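
The difference between the two base functions is easy to see on the same data. This sketch uses base R only and builds the vector with c() rather than string_vec.

```r
encounter = c("Hi Cyclops.", "Hi you. What's your name?", "Odysseus is my name.")

# sub: only the first match of each element is replaced
first_only = sub("\\w+", "...", encounter, perl = TRUE)

# gsub: every match is replaced
every_word = gsub("\\w+", "...", encounter, perl = TRUE)

print(first_only)
print(every_word)
```

For the first element, first_only gives "... Cyclops." while every_word gives "... ....".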

Escaping flags: How to, and a word of caution with paths

If your regular expression contains a slash ("/"), this conflicts with the parsing of the optional flags. As soon as a "/" is present in a pattern, the algorithm throws an error if what precedes the first slash is not a valid set of flags.

To use a slash in the regex without adding flags, there is only one solution:

escape the first "/" with a double backslash

Ex: let’s invert the numerator and denominator of a division.

eq = "5/x = 3/2"
# escaping with backslashes
string_replace(eq, "(\\w)\\/(\\w)", "\\2/\\1")
#> [1] "x/5 = 2/3"

Warning: when applying regular expressions to file paths, the flags algorithm is very strict in order to avoid unexpected behavior. Every time a pattern contains a slash that is not associated with valid flags, an error is thrown unless that slash is escaped.

path = "my/path/to/the/file.tex"
# we keep the directory only

# first try: an error because flags are expected before the first '/'
try(string_replace(path, "/[^/]+$"))
#> Error : in string_replace(path, "/[^/]+$"): 
#> Problem found in the regex pattern "/[^/]+$".
#> PROBLEM: The first flag is empty. To use a regular slash, escape it with
#> a double backslah.
#> 
#> INFO: regex flags write 'flag1, flag2/pattern' where a flag = lower case
#> letters.
#>   Ex: 'ignore, fixed/dt[' leads to the flags 'ignore' and 'fixed' for
#>   the pattern 'dt['.
#>   Or simply collate the first letters of the flags: 'if/dt['.
#>   To use '/' without flag parsing, escape it. Ex: '\\/usr': leads to
#>   '/usr' without any flag on.

# after escaping: works (only the first slash requires escaping)
string_replace(path, "\\/[^/]+$")
#> [1] "my/path/to/the"

# if we add a flag, the first slash ends the flags, so the slash
# belonging to the pattern must be written after it (hence the double slash)
# compare...
string_replace(path, "i//[^/]+$")
#> [1] "my/path/to/the"
# to...
string_replace(path, "i/[^/]+$")
#> [1] "my/path/to/the/"

Hence, if you need to write path-related regexes, you will very likely need to escape the first slash.

Laurent R. Bergé

2024-06-20