R/aliases.R, R/string_tools.R
string_clean.Rd
Recursively cleans a character vector of several patterns. The compact pattern
syntax makes quick work of the tedious task of data cleaning.
You can also apply a wide range of cleaning operations by calling string_ops()
operations.
string_clean_alias(
replacement = "",
pipe = " => ",
split = ",[ \n\t]+",
ignore.case = FALSE,
fixed = FALSE,
word = FALSE,
total = FALSE,
single = FALSE,
namespace = NULL
)
string_clean(
x,
...,
replacement = "",
pipe = " => ",
split = ",[ \n\t]+",
ignore.case = FALSE,
fixed = FALSE,
word = FALSE,
total = FALSE,
single = FALSE,
envir = parent.frame(),
namespace = NULL
)
string_replace(
x,
pattern,
replacement = "",
pipe = " => ",
ignore.case = FALSE,
fixed = FALSE,
word = FALSE,
total = FALSE,
single = FALSE,
envir = parent.frame()
)
stclean(
x,
...,
replacement = "",
pipe = " => ",
split = ",[ \n\t]+",
ignore.case = FALSE,
fixed = FALSE,
word = FALSE,
total = FALSE,
single = FALSE,
envir = parent.frame(),
namespace = NULL
)
streplace(
x,
pattern,
replacement = "",
pipe = " => ",
ignore.case = FALSE,
fixed = FALSE,
word = FALSE,
total = FALSE,
single = FALSE,
envir = parent.frame()
)
Character scalar, default is the empty string. It represents the default
value by which the patterns found in the character strings will be replaced. For example,
string_clean(x, "e", replacement = "a") turns all letters "e" in x into "a".
Character scalar, default is " => ". If the value of pipe is found in a pattern,
then the string is split w.r.t. the pipe and anything after the pipe becomes the replacement.
For example, in string_clean(x, "e => a") the default pipe is found in "e => a", so the pattern
"e" will be replaced with "a". In other words, this is equivalent to string_clean(x, "e", replacement = "a").
Example changing the pipe: you can obtain the previous result with string_clean(x, "e|>a", pipe = "|>").
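To make the pipe mechanism concrete, here is a minimal base-R sketch of the splitting step described above. The helper split_on_pipe() is hypothetical and is not part of the package; it assumes the pattern is cut at the first occurrence of the pipe, which may differ from what string_clean() does internally.

```r
# Hypothetical helper (NOT part of the package): split a pattern at the pipe.
# Assumption: the cut happens at the first occurrence of `pipe`.
split_on_pipe = function(pattern, pipe = " => ", replacement = "") {
  pos = as.integer(regexpr(pipe, pattern, fixed = TRUE))
  if (pos == -1) {
    # no pipe found: keep the default replacement
    return(list(pattern = pattern, replacement = replacement))
  }
  list(
    pattern = substr(pattern, 1, pos - 1),
    replacement = substr(pattern, pos + nchar(pipe), nchar(pattern))
  )
}

p = split_on_pipe("e => a")
res = gsub(p$pattern, p$replacement, "hello there")
res
#> [1] "hallo thara"

# a custom pipe gives the same pattern/replacement pair
q = split_on_pipe("e|>a", pipe = "|>")
identical(p, q)
#> [1] TRUE
```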
Character scalar, default is ",[ \t\n]+" (which means a comma followed by spaces, tabs
and/or new lines). By default the patterns to be replaced are comma separated, that is,
the pattern is split w.r.t. the argument split and a replacement is done for each sub-pattern.
Use NULL or the empty string to disable pattern separation.
For example: let's look at string_clean(x, "w/one, two => three"). First the flag "word" is extracted from
the pattern (see arg. ...) as well as the replacement (see arg. pipe), leading to "one, two" as the
pattern to be replaced. Then the pattern is split w.r.t. split, leading
to two patterns "one" and "two". Hence the words (thanks to the flag "w") "one" and "two" from
the string x will be replaced with "three".
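The three steps of this example (flag extraction, replacement extraction, pattern splitting) can be mimicked in base R. This is only an illustrative sketch, not the package's implementation; the input string and all variable names are made up for the illustration.

```r
# Base-R sketch of the steps applied to "w/one, two => three"
# (illustration only, not the package's code).
pattern = "w/one, two => three"

# 1) extract the flag located before the slash
flag = sub("/.*", "", pattern)                      # "w"
rest = sub("^[^/]*/", "", pattern)                  # "one, two => three"

# 2) separate the replacement at the pipe
parts = strsplit(rest, " => ", fixed = TRUE)[[1]]   # "one, two" "three"

# 3) split the remaining pattern w.r.t. `split`
sub_patterns = strsplit(parts[1], ",[ \n\t]+")[[1]] # "one" "two"

# apply each sub-pattern with word boundaries (the effect of the "word" flag)
x = "one stone, two networks"
for (p in sub_patterns) {
  x = gsub(paste0("\\b", p, "\\b"), parts[2], x, perl = TRUE)
}
x
#> [1] "three stone, three networks"
```

Note that "stone" and "networks" are left untouched: they contain "one" and "two" but not as whole words.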
Logical scalar, default is FALSE. If TRUE, then a case-insensitive search is triggered.
Logical scalar, default is FALSE. Whether to trigger a fixed search instead of a
regular expression search (default).
Logical scalar, default is FALSE. If TRUE then a) word boundaries are added to the pattern,
and b) patterns can be chained by separating them with a comma; they are combined with a logical OR.
Example: if word = TRUE, then pattern = "The, mountain" will select strings containing either the word
'The' or the word 'mountain'.
Logical scalar, default is FALSE. If TRUE, then when a pattern is found
in a string, the full string is replaced (instead of just the pattern). Note, importantly,
that when total = TRUE you can use logical operators in the patterns.
Example: string_clean(x, "wi/ & two, three & !four => ", total = TRUE)
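As a rough illustration of what logical operators mean under a total replacement, the base-R sketch below combines several grepl() tests and then replaces the whole string. It assumes that "&" means "both patterns are found" and "!" means "the pattern is not found"; the sample vector and variable names are made up, and the package's own parsing is not reproduced.

```r
# Base-R sketch (illustration only): total replacement driven by logical tests.
cars = c("Merc 280", "Merc 450SE", "Mazda RX4")

# analogue of "Merc & \\dS": both patterns must be found in the string
both = grepl("Merc", cars) & grepl("\\dS", cars, perl = TRUE)
res_and = ifelse(both, "Mercedes S!", cars)
res_and
#> [1] "Merc 280"    "Mercedes S!" "Mazda RX4"

# analogue of "Merc & !\\dS": the second pattern must be absent
only_plain = grepl("Merc", cars) & !grepl("\\dS", cars, perl = TRUE)
res_not = ifelse(only_plain, "Mercedes", cars)
res_not
#> [1] "Mercedes"   "Merc 450SE" "Mazda RX4"
```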
Logical scalar, default is FALSE. Whether, in substitutions, to stop at
the first match found. Example: string_clean("abc", "[[:alpha:]] => _", single = TRUE) leads
to "_bc", while string_clean("abc", "[[:alpha:]] => _") leads to "___".
Character scalar or NULL (default). Only useful for package developers.
As a regular end-user you shouldn't care! If your package uses string_magic, you should care.
It is useful only if your package uses 'custom' string_magic operations, set with
string_magic_register_fun() or string_magic_register_ops().
If so, pass the name of your package in this argument so that your function can access
the new string_magic operations defined within your package.
A character vector.
Character scalars representing patterns. A pattern is of the form
"flags/pat1, pat2 => replacement". This means that patterns 'pat1' and 'pat2' will be replaced
with the string 'replacement'. By default patterns are comma separated and the replacement comes
after a ' => ' (see args split and pipe to change this). By default the replacement is the empty string
(so "pat1, pat2" removes the patterns).
Available regex flags are: 'word' (add word boundaries), 'ignore' (the case), 'fixed' (no regex),
'total', 'single' and 'magic'.
The flag 'total' leads to a total replacement of the string if the pattern is found.
The flag 'magic' allows you to interpolate variables within the pattern. Use flags
with comma separation ("word, total/pat") or use only their initials ("wt/pat").
Starting with an '@' leads to operations in string_ops(). Example: "@ascii, lower, ws" turns
the string into ASCII, lowers the case and normalizes white spaces (see help of string_ops()).
Environment in which to evaluate the interpolations if the flag "magic"
is provided. Default is parent.frame().
A character scalar containing a regular expression pattern to be replaced.
You can write the replacement directly in the string after a pipe: ' => ' (see arg. pipe
to change this).
By default the replacement is the empty string (so "pat1" removes the pattern).
Available regex flags are: 'word' (add word boundaries), 'ignore' (the case), 'fixed' (no regex),
'total', 'single' and 'magic'.
The flag 'total' leads to a total replacement of the string if the pattern is found.
The flag 'magic' allows you to interpolate variables within the pattern. Use flags
with comma separation ("word, total/pat") or use only their initials ("wt/pat").
The main usage returns a character vector of the same length as the vector in input.
Note, however, that since you can apply arbitrary string_ops() operations, the length and type
of the final vector may depend on those (if they are used).
string_clean_alias(): Create a string_clean alias with custom defaults
string_replace(): Simplified version of string_clean
stclean(): Alias to string_clean
streplace(): Alias to string_replace
This function benefits from two specific regex flags: "total" and "single".
"total" replaces the complete string if the pattern is found (remember that the default behavior is to replace just the pattern).
"single" performs a single substitution for each string element and stops there.
Only the first match of each string is replaced. Technically we use base::sub() internally instead of base::gsub().
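The difference between the default behavior and these two flags can be reproduced with base R alone. The sketch below only illustrates the semantics described above (it does not call the package); the sample vector is made up.

```r
# Base-R sketch of the default, "single" and "total" behaviors (illustration only).
x = c("abc", "a-b", "123")

# default: every match is replaced (base::gsub)
res_default = gsub("[[:alpha:]]", "_", x)
res_default
#> [1] "___" "_-_" "123"

# "single": only the first match of each element is replaced (base::sub)
res_single = sub("[[:alpha:]]", "_", x)
res_single
#> [1] "_bc" "_-b" "123"

# "total": the whole string is replaced as soon as the pattern is found
res_total = ifelse(grepl("[[:alpha:]]", x), "_", x)
res_total
#> [1] "_"   "_"   "123"
```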
All stringmagic functions support generic flags in regular-expression patterns.
The flags are useful to quickly give extra instructions, similarly to usual
regular expression flags.
Here the syntax is "flag1, flag2/pattern". That is: flags are a comma separated list of flag-names
separated from the pattern with a slash (/). Example: string_which(c("hello...", "world"), "fixed/.") returns 1.
Here the flag "fixed" removes the regular expression meaning of "." which would have otherwise meant "any character".
The no-flag version string_which(c("hello...", "world"), ".") returns 1:2.
Alternatively, and this is recommended, you can collate the initials of the flags instead of using a comma separated list. For example: "if/dt[" will apply the flags "ignore" and "fixed" to the pattern "dt[".
The four flags always available are: "ignore", "fixed", "word" and "magic".
"ignore" makes the search case-insensitive. Technically, it adds the perl-flag "(?i)" at the beginning of the pattern.
"fixed" removes the regular expression interpretation, so that the characters ".", "$", "^", "[" (among others) lose their special meaning and are treated for what they are: simple characters.
"word" adds word boundaries ("\\b" in regex language) to the pattern. Further, the comma (",")
becomes a word separator. Technically, "word/one, two" is treated as "\b(one|two)\b". Example:
string_clean("Am I ambushed?", "wi/am") leads to " I ambushed?" thanks to the flags "ignore" and "word".
"magic" allows you to interpolate variables inside the pattern before regex interpretation.
For example if letters = "aiou" then string_clean("My great goose!", "magic/[{letters}] => e")
leads to "My greet geese!".
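The effect of each of the four flags can be approximated with base R, which may help when reading patterns that use them. This is a hedged sketch of the equivalent regular expressions as described above, not the package's code; the variable name vowels is made up (the text uses letters, which would mask a base R constant).

```r
# Base-R approximations of the four generic flags (illustration only).
x = "Am I ambushed?"

# "ignore": prepend the perl flag "(?i)" to the pattern
res_ignore = gsub("(?i)am", "", x, perl = TRUE)
res_ignore
#> [1] " I bushed?"

# "word" + "ignore" ("wi/am"): word boundaries around the pattern
res_word = gsub("(?i)\\b(am)\\b", "", x, perl = TRUE)
res_word
#> [1] " I ambushed?"

# "fixed": the dot is a plain character, not "any character"
idx_fixed = grep(".", c("hello...", "world"), fixed = TRUE)
idx_regex = grep(".", c("hello...", "world"))
idx_fixed
#> [1] 1
idx_regex
#> [1] 1 2

# "magic": the variable is inserted in the pattern before regex interpretation
vowels = "aiou"
res_magic = gsub(paste0("[", vowels, "]"), "e", "My great goose!")
res_magic
#> [1] "My greet geese!"
```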
String operations: string_is(), string_get(), string_clean(), string_split2df().
Chain basic operations with string_ops(). Clean character vectors efficiently
with string_clean().
Use string_vec() to create simple string vectors.
String interpolation combined with operation chaining: string_magic(). You can change string_magic
default values with string_magic_alias() and add custom operations with string_magic_register_fun().
Display messages while benefiting from string_magic interpolation with cat_magic() and message_magic().
Other tools with aliases: cat_magic_alias(), string_magic(), string_magic_alias(), string_ops_alias(), string_vec_alias()
x = c("hello world ", "it's 5 am....")
# we remove the o's and the dots (we use 'fixed' to trigger fixed-search)
string_clean(x, "o", "f/.")
#> [1] "hell wrld " "it's 5 am"
# equivalently
string_clean(x, "fixed/o, .")
#> [1] "hell wrld " "it's 5 am"
# equivalently
string_clean(x, "o, .", fixed = TRUE)
#> [1] "hell wrld " "it's 5 am"
# equivalently
string_clean(x, "o", ".", fixed = TRUE)
#> [1] "hell wrld " "it's 5 am"
#
# chaining operations: example using cars
#
cars = row.names(mtcars)
new = string_clean(cars,
  # replace strings containing "Maz" with Mazda
  "total/Maz => Mazda",
  # replace the word 'Merc' with Mercedes
  "wi/merc => Mercedes",
  # replace strings containing "Merc" and a digit followed by an 'S'
  "t/Merc & \\dS => Mercedes S!",
  # put to lower case, remove isolated characters and normalize white spaces
  "@lower, ws.isolated")
cbind(cars, new)
#> cars new
#> [1,] "Mazda RX4" "mazda"
#> [2,] "Mazda RX4 Wag" "mazda"
#> [3,] "Datsun 710" "datsun 710"
#> [4,] "Hornet 4 Drive" "hornet drive"
#> [5,] "Hornet Sportabout" "hornet sportabout"
#> [6,] "Valiant" "valiant"
#> [7,] "Duster 360" "duster 360"
#> [8,] "Merc 240D" "mercedes 240d"
#> [9,] "Merc 230" "mercedes 230"
#> [10,] "Merc 280" "mercedes 280"
#> [11,] "Merc 280C" "mercedes 280c"
#> [12,] "Merc 450SE" "mercedes s!"
#> [13,] "Merc 450SL" "mercedes s!"
#> [14,] "Merc 450SLC" "mercedes s!"
#> [15,] "Cadillac Fleetwood" "cadillac fleetwood"
#> [16,] "Lincoln Continental" "lincoln continental"
#> [17,] "Chrysler Imperial" "chrysler imperial"
#> [18,] "Fiat 128" "fiat 128"
#> [19,] "Honda Civic" "honda civic"
#> [20,] "Toyota Corolla" "toyota corolla"
#> [21,] "Toyota Corona" "toyota corona"
#> [22,] "Dodge Challenger" "dodge challenger"
#> [23,] "AMC Javelin" "amc javelin"
#> [24,] "Camaro Z28" "camaro z28"
#> [25,] "Pontiac Firebird" "pontiac firebird"
#> [26,] "Fiat X1-9" "fiat x1-9"
#> [27,] "Porsche 914-2" "porsche 914-2"
#> [28,] "Lotus Europa" "lotus europa"
#> [29,] "Ford Pantera L" "ford pantera"
#> [30,] "Ferrari Dino" "ferrari dino"
#> [31,] "Maserati Bora" "maserati bora"
#> [32,] "Volvo 142E" "volvo 142e"