The purpose of stringmagic is to facilitate the manipulation of character strings.
It introduces various functions to facilitate pattern detection via regular expression (regex) logic, or to efficiently clean character vectors in a readable way. Consistently across the package, regular expressions gain optional flags that control how the patterns behave (fixed search? ignore case? add word boundaries? etc.). For more information, see the vignette on string tools.
The main contribution of this package, and its flagship function, is string_magic, which introduces a new language tailored to creating complex character strings.
string_magic
string_magic behaves in a similar way to the well-known function glue. Use curly brackets to interpolate variables, i.e., to insert their values directly into the string:
library(stringmagic)
x = "John" ; y = "Mary"
string_magic("Hi {x}! How's {y} doing?")
#> [1] "Hi John! How's Mary doing?"
Almost anything glue can do, string_magic can do, and if you’re thinking about speed, they’re about as fast (see the section Performance).
The difference, and originality, of string_magic is that you can apply arbitrary operations to the interpolated variables. You want to create an enumeration? Sure. Add quotation marks? Check. Change the case? No problem. Sort on a substring? Of course! There are over 50 built-in operations and the magic is that applying these operations is about as simple as saying them.
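Operations. Operations are placed inside the curly brackets, before the variable, and separated from it by a question mark, as in {operation1, operation2 ? variable}. The operations are a comma-separated sequence of keywords, each keyword being bound to a specific function. Here is an example in which we apply two operations: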
lovers = c("romeo", "juliet")
string_magic("Famous lovers: {title, enum ? lovers}.")
#> [1] "Famous lovers: Romeo and Juliet."
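To make concrete what the two keywords do, here is a rough base R sketch that produces the same string. It is not how string_magic is implemented: the helpers to_title and enum_and are hypothetical names introduced for illustration, and to_title only capitalises the first letter of each element, which is a simplification of the title operation.

```r
lovers = c("romeo", "juliet")

# Hypothetical helper: upper-case the first character of each element.
to_title = function(x) paste0(toupper(substring(x, 1, 1)), substring(x, 2))

# Hypothetical helper: build an "a, b and c" style enumeration.
enum_and = function(x) {
  n = length(x)
  if (n <= 1) return(paste0(x, collapse = ""))
  paste0(paste0(x[-n], collapse = ", "), " and ", x[n])
}

res = paste0("Famous lovers: ", enum_and(to_title(lovers)), ".")
print(res)
#> [1] "Famous lovers: Romeo and Juliet."
```

The operations are applied left to right, which is why the base R version nests the calls in the opposite order: the case change runs first and the enumeration last.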
Arguments. Some operations, like split, require arguments. Arguments are passed as quoted text placed just before the operation, as in {'argument'operation ? variable}.
Let’s take the example of splitting an email address and keeping the text before the domain:
email = "John@Doe.com"
string_magic("This message comes from {'@'split, first ? email}.")
#> [1] "This message comes from John."
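For comparison, the same result can be sketched in base R. This is only an illustration of what the two operations amount to here, not the package's internals; the separator is treated as a fixed string, which is an assumption that makes no difference for "@".

```r
email = "John@Doe.com"

# Split on "@" (fixed string), then keep the first piece.
parts = strsplit(email, "@", fixed = TRUE)[[1]]
res = paste0("This message comes from ", parts[1], ".")
print(res)
#> [1] "This message comes from John."
```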
Options. Many operations accept options. These options are keywords that work like flags (i.e., things that can be turned on) and change the behavior of the current operation. Add options as a dot-separated sequence of keywords attached to the operation, as in {operation.option1.option2 ? variable}.
We have seen the enum operation in an earlier example; let’s add a couple of options to it.
fields = c("maths", "physics")
string_magic("This position requires a PhD in either: {enum.i.or ? fields}.")
#> [1] "This position requires a PhD in either: i) maths, or ii) physics."
Anything else? So far we have just scratched the surface of string_magic's possibilities. Other features are: advanced support for pluralization, nesting, conditional operations, grouped operations, compact if-else statements, and unlimited customization. For details, see the string_magic guide vignette: https://lrberge.github.io/stringmagic/articles/guide_string_magic.html
string_magic tries to be friendly to the user by providing useful error messages:
x = c("Zeus", "Hades", "Poseidon")
string_magic("The {len?x} brothers: {anum?x}.")
#> Error: in string_magic("The {len?x} brothers: {anum?x}."):
#> CONTEXT: Problem found in "The {len?x} brothers: {anum?x}.",
#> when dealing with the interpolation `{anum?x}`.
#> PROBLEM: `anum` is not a valid operator. Maybe you meant `enum`?
#>
#> INFO: Type string_magic(.help = "regex") or string_magic(.help = TRUE) for help.
#> Or look at the vignette: https://lrberge.github.io/stringmagic/articles/guide_string_magic.html
string_magic("The iris species are: {unik, sort, enum ? iris[['Species']}.")
#> Error: in string_magic("The iris species are: {unik, sort, enum ?...:
#> CONTEXT: Problem found in "The iris species are: {unik, sort, enum ? iris[['Species']}.",
#> when dealing with the interpolation `{unik, sort, enum ? iris[['Species']}.`.
#> PROBLEM: in the expression `iris[['Species']`, a bracket (`[`) open is not closed.
The cost of this feature in terms of computing time is on the order of 50 microseconds (of course, it depends on context). If you’re not interested in informative error messages, .string_magic (note the "." prefix) is identical to string_magic but skips error handling and is therefore slightly faster.
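When string_magic is called inside a script or a function, the informative message can be captured rather than left to stop execution. The sketch below assumes the error is signalled as a standard R condition, so that tryCatch and conditionMessage apply; that is an assumption about the package, not something stated above.

```r
library(stringmagic)

x = c("Zeus", "Hades", "Poseidon")

# Assumption: string_magic raises an ordinary R error, so tryCatch can
# intercept it and conditionMessage can extract the text.
msg = tryCatch(
  string_magic("The {len?x} brothers: {anum?x}."),
  error = function(e) conditionMessage(e)
)

# Print the captured message (CONTEXT / PROBLEM / INFO blocks).
cat(msg, "\n")
```

The same pattern would not give a comparable message with .string_magic, since that variant skips the error handling.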
Basic interpolation. For regular string interpolations, the performance of string_magic is similar to the performance of glue. That is to say, the price to pay for user experience is in the ballpark of 100 microseconds (on my rather slow computer). Let’s have a simple benchmark:
library(microbenchmark)
library(glue)
x = "Romeo" ; y = "Juliet"
microbenchmark(
  base = paste0(x, " seems to love ", y, "."),
  glue = glue("{x} seems to love {y}."),
  string_magic = string_magic("{x} seems to love {y}."),
  .string_magic = .string_magic("{x} seems to love {y}.")
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> base 1.3 1.50 1.604 1.60 1.70 2.4 100
#> glue 77.7 83.70 93.560 87.45 92.90 470.6 100
#> string_magic 69.9 73.75 78.257 77.50 80.75 106.5 100
#> .string_magic 43.2 45.70 50.205 48.20 52.15 137.9 100
The difference with the base function base::paste0 looks impressive (paste0 looks about 50 times faster), but it is in fact not really important. For both glue and string_magic, the processing time is mostly overhead: a fixed cost that does not depend on the size of the input vectors. Hence, for large vectors or operations that run in one millisecond or more, this difference is negligible.
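One way to check the fixed-cost claim yourself is to repeat the comparison on long vectors, where a constant overhead of some tens of microseconds should no longer be visible. The sketch below assumes that plain interpolation is elementwise over vectors, as with glue; the vector length n is arbitrary, and no timings are shown because they depend on the machine.

```r
library(stringmagic)

n = 1e5
x = rep("Romeo", n)
y = rep("Juliet", n)

# `<-` is used inside system.time() so the assignment is not parsed
# as a named argument.
t_base  = system.time(res_base  <- paste0(x, " seems to love ", y, "."))
t_magic = system.time(res_magic <- string_magic("{x} seems to love {y}."))

# Elapsed time in seconds for each approach.
print(c(base = t_base[["elapsed"]], string_magic = t_magic[["elapsed"]]))
```

For a finer comparison, the two calls could be passed to microbenchmark in the same way as in the benchmark above.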
As you can see, .string_magic (string_magic without error handling) is about twice as fast as glue. But I’m not sure that sacrificing user experience for a 50 µs overhead is really worth it!
Complex operations. The function string_magic shines when performing complex string manipulation. The question we ask here is: how much does it cost in terms of performance? Let’s look at the following operation:
x = c("Zeus", "Hades", "Poseidon")
string_magic("The {len?x} brothers: {enum?x}.")
#> [1] "The 3 brothers: Zeus, Hades and Poseidon."
Although the interface is different, let’s compare string_magic to glue and paste0:
x = c("Zeus", "Hades", "Poseidon")
microbenchmark(
  base = paste0("The ", length(x), " brothers: ",
                paste0(paste0(x[-length(x)], collapse = ", "),
                       " and ", x[length(x)]), "."),
  glue = glue("The {length(x)} brothers: {x_enum}.",
              x_enum = paste0(paste0(x[-length(x)], collapse = ", "),
                              " and ", x[length(x)])),
  string_magic = string_magic("The {len?x} brothers: {enum?x}."),
  string_magic_bis = string_magic("The {length(x)} brothers: {x_enum}.",
                                  x_enum = paste0(paste0(x[-length(x)], collapse = ", "),
                                                  " and ", x[length(x)]))
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> base 4.701 5.5510 54.37105 6.2010 6.802 4792.601 100
#> glue 93.900 104.6015 121.23295 118.8510 129.901 219.100 100
#> string_magic 316.701 331.3010 628.79896 351.2510 391.401 26576.501 100
#> string_magic_bis 94.601 106.5010 118.10602 114.8015 123.301 250.301 100
As we can see, the processing overhead of the string_magic-specific syntax is a few hundred microseconds: the median is about 351 microseconds with the len and enum operations, against about 115 microseconds for plain interpolation. Remember that since regular interpolation can be performed, you can always fall back to glue-like processing (and benefit from the same performance), as illustrated by the last command of the benchmark.