vignettes/guide_string_tools.rmd
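This vignette describes the `stringmagic` tools for handling character vectors. It covers pattern detection (`string_is`, `string_which`, `string_get`), chained string operations (`string_ops`), cleaning (`string_clean`), the creation of small vectors (`string_vec`), and splitting a vector into a data frame (`string_split2df`).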
Detecting a single regex pattern is straightforward with regular tools like `base::grepl` or `stringr::str_detect`. Things become more complicated when we want to detect the presence of multiple patterns.
`stringmagic` offers three functions with an intuitive syntax to deal with complex pattern detection: `string_is`, `string_which` and `string_get`.
Use `string_is`, `string_which` and `string_get` to detect patterns in character vectors and obtain, respectively, a logical vector, an integer vector, or the matching values.
In this section we give examples for `string_get`, which should be explicit enough to illustrate how it works. For the record, `string_get` uses `string_is` internally, so these examples apply equally to `string_is` and `string_which`.
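The relationship between the three return types can be sketched in base R. This is only an illustration of the idea on a two-pattern AND query, not the package's internal code; the variable names are made up for the example.

```r
# Base R sketch (not stringmagic internals): the three return types
cars <- row.names(mtcars)

# logical vector: does each name contain both a lowercase 'a' and 'e'?
is_match <- grepl("a", cars, perl = TRUE) & grepl("e", cars, perl = TRUE)

# integer vector: positions of the matches
which_match <- which(is_match)

# the matching values themselves
get_match <- cars[is_match]
```

Indexing the data with either the logical or the integer vector gives the same values, which is why the examples below carry over from one function to the others.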
Ex.1: series of examples using the recommended syntax.
cars = row.names(mtcars)
cat_magic("All cars from mtcars:\n{C, 60 swidth ? cars}")
#> All cars from mtcars:
#> Mazda RX4, Mazda RX4 Wag, Datsun 710, Hornet 4 Drive, Hornet
#> Sportabout, Valiant, Duster 360, Merc 240D, Merc 230, Merc
#> 280, Merc 280C, Merc 450SE, Merc 450SL, Merc 450SLC,
#> Cadillac Fleetwood, Lincoln Continental, Chrysler Imperial,
#> Fiat 128, Honda Civic, Toyota Corolla, Toyota Corona, Dodge
#> Challenger, AMC Javelin, Camaro Z28, Pontiac Firebird, Fiat
#> X1-9, Porsche 914-2, Lotus Europa, Ford Pantera L, Ferrari
#> Dino, Maserati Bora and Volvo 142E
# cars with an 'a', an 'e', an 'i', and an 'o', all in lower case
string_get(cars, "a & e & i & o")
#> [1] "Cadillac Fleetwood" "Lincoln Continental" "Pontiac Firebird"
#> [4] "Ferrari Dino" "Maserati Bora"
# cars with no 'e' and at least one digit
string_get(cars, "!e & \\d")
#> [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Fiat 128"
#> [5] "Camaro Z28" "Fiat X1-9" "Volvo 142E"
# flags apply to all
# contains the 'words' 2, 9 or l
# alternative syntax for flags: "wi/2 | 9 | l"
string_get(cars, "word, ignore/2 | 9 | l")
#> [1] "Fiat X1-9" "Porsche 914-2" "Ford Pantera L"
The default syntax is `string_get(x, ...)` (same for `string_is` and `string_which`), where `...` contains any number of patterns to detect. By default, the results of these pattern detections are combined with a logical AND. To combine them with a logical OR, use the argument `or = TRUE`. You can also pass the flags as regular function arguments; they then apply to all patterns.
Ex.2: replication of Ex.1 using an alternative syntax.
# string_get(cars, "a & e & i & o")
# cars with an 'a', an 'e', an 'i', and an 'o', all in lower case
string_get(cars, "a", "e", "i", "o")
#> [1] "Cadillac Fleetwood" "Lincoln Continental" "Pontiac Firebird"
#> [4] "Ferrari Dino" "Maserati Bora"
# string_get(cars, "!e & \\d")
# cars with no 'e' and at least one digit
string_get(cars, "!e", "\\d")
#> [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Fiat 128"
#> [5] "Camaro Z28" "Fiat X1-9" "Volvo 142E"
# string_get(cars, "!/e & \\d")
# This example cannot be replicated directly: we apply the logical equivalence
# NOT (e AND digit) = (NOT e) OR (NOT digit)
string_get(cars, "!e", "!\\d", or = TRUE)
#> [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
#> [4] "Hornet Sportabout" "Valiant" "Cadillac Fleetwood"
#> [7] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
#> [10] "Honda Civic" "Toyota Corolla" "Toyota Corona"
#> [13] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
#> [16] "Pontiac Firebird" "Fiat X1-9" "Lotus Europa"
#> [19] "Ford Pantera L" "Ferrari Dino" "Maserati Bora"
#> [22] "Volvo 142E"
# string_get(cars, "wi/2 | 9 | l")
# contains the 'words' 2, 9 or l
string_get(cars, "2", "9", "l", or = TRUE, word = TRUE, ignore.case = TRUE)
#> [1] "Fiat X1-9" "Porsche 914-2" "Ford Pantera L"
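For readers who want to see the AND/OR logic spelled out, here is a minimal base R sketch. The helper `detect_patterns` is hypothetical (it is not part of `stringmagic`) and only handles plain regexes with an optional leading `!` for negation; flags such as `word` are not covered.

```r
# Hypothetical helper, base R only: combine several regex detections.
# A leading "!" negates a pattern, mimicking the syntax used in Ex.2.
detect_patterns <- function(x, ..., or = FALSE) {
  patterns <- c(...)
  hits <- lapply(patterns, function(p) {
    negate <- startsWith(p, "!")
    if (negate) p <- substring(p, 2)
    found <- grepl(p, x, perl = TRUE)
    if (negate) !found else found
  })
  # element-wise AND by default, element-wise OR on request
  combine <- if (or) `|` else `&`
  x[Reduce(combine, hits)]
}

cars <- row.names(mtcars)
all_vowels <- detect_patterns(cars, "a", "e", "i", "o")
no_e_digit <- detect_patterns(cars, "!e", "\\d")
# De Morgan: NOT (e AND digit) is (NOT e) OR (NOT digit)
not_both <- detect_patterns(cars, "!e", "!\\d", or = TRUE)
```

Under these assumptions the three calls select the same cars as the first three calls of Ex.2.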
string_get
On top of the detection previously described, the function `string_get` changes its behavior with the arguments `seq` or `seq.unik`. It also supports automatic caching.
As seen previously, patterns in `...` are combined with a logical AND. If you set `seq = TRUE`, this behavior changes: the results of each pattern are stacked sequentially. Schematically, you obtain the vector `c(x_that_contains_pat1, x_that_contains_pat2, etc)` with `pat1` the first pattern in `...`, `pat2` the second pattern, etc. Using `seq.unik = TRUE` is like `seq` but applies the function `unique()` at the end.
Ex: sequentially combining results.
# cars without digits, then cars with 2 'a's or 2 'e's and a digit
string_get(cars, "!\\d", "i/a.+a | e.+e & \\d", seq = TRUE)
#> [1] "Hornet Sportabout" "Valiant" "Cadillac Fleetwood"
#> [4] "Lincoln Continental" "Chrysler Imperial" "Honda Civic"
#> [7] "Toyota Corolla" "Toyota Corona" "Dodge Challenger"
#> [10] "AMC Javelin" "Pontiac Firebird" "Lotus Europa"
#> [13] "Ford Pantera L" "Ferrari Dino" "Maserati Bora"
#> [16] "Mazda RX4" "Mazda RX4 Wag" "Hornet 4 Drive"
#> [19] "Merc 450SE" "Camaro Z28"
# let's get the first word of each car name
car_first = string_ops(cars, "extract.first")
# we select car brands ending with 'a', then ending with 'i'
string_get(car_first, "a$", "i$", seq = TRUE)
#> [1] "Mazda" "Mazda" "Honda" "Toyota" "Toyota" "Ferrari" "Maserati"
# seq.unik is similar to seq but applies unique()
string_get(car_first, "a$", "i$", seq.unik = TRUE)
#> [1] "Mazda" "Honda" "Toyota" "Ferrari" "Maserati"
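The stacking behavior of the last two calls can be reproduced in base R as a rough sketch. The `sub()` call is only a stand-in for the `"extract.first"` operation and assumes the first word ends at the first space.

```r
cars <- row.names(mtcars)
# first word of each name (stand-in for the "extract.first" operation)
car_first <- sub(" .*$", "", cars)

ends_a <- car_first[grepl("a$", car_first)]
ends_i <- car_first[grepl("i$", car_first)]

# seq = TRUE: results are stacked in the order of the patterns
stacked <- c(ends_a, ends_i)
# seq.unik = TRUE: same stacking, then duplicates are dropped
stacked_unique <- unique(stacked)
```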
At the exploration stage, we often run the same command with a few variations on the same data set. Acknowledging this, `string_get` supports the caching of the data argument in interactive use. This means that the user can concentrate on the pattern to find and need not bother to write the data from which to fetch the values. Note that `string_get` is the only `stringmagic` function to have this ability.
Caching is always enabled; you don't need to do anything.
Ex: caching of the data.
# Since we used `car_first` in the previous example, we don't need to provide
# it explicitly now
# => brands containing 'M' and ending with 'a' or 'i'; brands containing 'M'
string_get("M & [ai]$", "M", seq.unik = TRUE)
#> [1] "Mazda" "Maserati" "Merc" "AMC"
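One way to picture this kind of caching is a closure that remembers the last data vector it received. The sketch below is purely illustrative: `make_cached_get` and `cached_get` are hypothetical names, and the rule used to tell data from pattern (a single argument means "pattern only") is an assumption, not a description of how `string_get` decides.

```r
# Hypothetical illustration of data caching with a closure (base R only)
make_cached_get <- function() {
  cache <- NULL
  function(x, pattern) {
    if (missing(pattern)) {
      # a single argument is treated as the pattern; reuse the cached data
      pattern <- x
      if (is.null(cache)) stop("no data cached yet")
      x <- cache
    } else {
      # data supplied explicitly: remember it for the next call
      cache <<- x
    }
    x[grepl(pattern, x, perl = TRUE)]
  }
}

cached_get <- make_cached_get()
brands <- c("Mazda", "Honda", "Ferrari", "Merc")
first_call <- cached_get(brands, "a$")   # data given, then cached
second_call <- cached_get("^M")          # data omitted, cache reused
```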
string_ops
Formatting text data often requires applying many functions (be it for parsing, text analysis, etc). Even for simple tasks, the number of operations can quickly balloon, adding many lines of code, reducing readability, and all this for basic processing.
The function `string_ops` tries to solve this problem. It has access to all (50+) `string_magic` operations, allowing for a compact and readable way to chain basic operations on character strings.
Below are a few motivating examples.
Ex.1: Parsing data.
# parsing an input: extracting the numbers
input = "8.5in, 5.5, .5 cm"
string_ops(input, "','split, tws, '^\\. => 0.'replace, '^\\D+|\\D+$'replace, num")
#> [1] 8.5 5.5 0.5
# Explanation------------------------------------------------------------------|
# ','split: splitting w.r.t. ',' |
# tws: trimming the whitespaces |
# '^\\. => 0.'replace: adds a 0 to strings starting with '.' |
# '^\\D+|\\D+$'replace: removes non-digits on both ends of the string |
# num: converts to numeric |
# now extracting the units
string_ops(input, "','split, '^[ \\d.]+'replace, tws")
#> [1] "in" "" "cm"
# Explanation------------------------------------------------------------------|
# ','split: splitting w.r.t. ',' |
# '^[ \\d.]+'replace: removes the ' ', digit |
# and '.' at the beginning of the string |
# tws: trimming the whitespaces |
Ex.2: extracting information from text.
# Now using the car data
cars = row.names(mtcars)
# let's get the brands starting with an "m"
string_ops(cars, "'i/^m'get, x, unik")
#> [1] "Mazda" "Merc" "Maserati"
# Explanation------------------------------------------------------------------|
# 'i/^m'get: keeps only the elements starting with an m, |
# i/ is the 'regex-flag' "ignore" to ignore the case |
# ^m means "starts with an m" in regex language |
# x: extracts the first pattern. The default pattern is "[[:alnum:]]+" |
# which means an alpha-numeric word |
# unik: applies unique() to the vector |
# let's get the 3 largest numbers appearing in the car models
string_ops(cars, "'\\d+'x, rm, unik, num, dsort, 3 first")
#> [1] 914 710 450
# Explanation------------------------------------------------------------------|
# '\\d+'x: extracts the first pattern, the pattern meaning |
# "a succession of digits" in regex language |
# rm: removes elements equal to the empty string (default behavior) |
# unik: applies unique() to the vector |
# num: converts to numeric |
# dsort: sorts in decreasing order |
# 3 first: keeps only the first three elements |
As you can see, an operation that would take multiple lines to read and understand now can be read from left to right in a single line.
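For comparison, the number-parsing chain of Ex.1 could be written step by step in base R roughly as follows. This is one possible translation of the five operations, not what `string_ops` runs internally.

```r
input <- "8.5in, 5.5, .5 cm"

# ','split: split with respect to ','
parts <- strsplit(input, ",", fixed = TRUE)[[1]]
# tws: trim the whitespaces
parts <- trimws(parts)
# '^\\. => 0.'replace: add a 0 to strings starting with '.'
parts <- sub("^\\.", "0.", parts, perl = TRUE)
# '^\\D+|\\D+$'replace: remove non-digits on both ends
parts <- gsub("^\\D+|\\D+$", "", parts, perl = TRUE)
# num: convert to numeric
values <- as.numeric(parts)
```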
string_clean: One function to clean them all
The function `string_clean` streamlines the cleaning of character vectors by providing a compact syntax for regex replacements together with direct access to `string_magic` operations.
This function is of the form `string_clean(x, ...)` with `x` the vector to clean and `...` any number of cleaning operations, which can be of two types:
- `"pat1, pat2 => replacement"` to replace the regex patterns `pat1` and `pat2` with the value `replacement`.
- `"@op1, op2"` to perform any arbitrary sequence of `string_magic` operations.
In the operation `"pat1, pat2 => replacement"`, the pattern is first split with respect to the pipe, `" => "` (change it with argument `pipe`), to get `replacement`. Then the pattern is split with respect to commas (i.e. `",[ \t\n]+"`, change it with argument `sep`) to get `pat1` and `pat2`. A sequence of `base::gsub` calls is performed to replace each `patx` with `replacement`.
By default the replacement is the empty string. This means that writing `"pat1, pat2"` will erase these two patterns.
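The parsing of a replacement operation described above can be sketched in base R. The function `clean_once` is a hypothetical, simplified stand-in: it follows the split-on-pipe, split-on-separator, then `gsub` sequence, but ignores flags and the `@` operations.

```r
# Hypothetical sketch of one "pat1, pat2 => replacement" operation
clean_once <- function(x, op, pipe = " => ", sep = ",[ \t\n]+") {
  # 1) split on the pipe to isolate the replacement (empty string by default)
  pieces <- strsplit(op, pipe, fixed = TRUE)[[1]]
  replacement <- if (length(pieces) > 1) pieces[2] else ""
  # 2) split the left-hand side on the separator to get the patterns
  patterns <- strsplit(pieces[1], sep, perl = TRUE)[[1]]
  # 3) apply one gsub per pattern, in order
  for (pat in patterns) {
    x <- gsub(pat, replacement, x, perl = TRUE)
  }
  x
}

stemmed <- clean_once("he takes and makes", "(m|t)akes => \\1ake")
erased <- clean_once("a-b_c", "-, _")
```

With these inputs, `stemmed` is `"he take and make"` and `erased` is `"abc"`, the latter showing the default empty replacement.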
If a pattern starts with an `"@"`, the subsequent character string is sent to `string_ops`. For example `"@ascii, lower"` is equivalent to `string_ops(x, "ascii, lower")`, which turns `x` to ASCII and lowers the case.
monologue = c("For who would bear the whips and scorns of time",
"Th' oppressor's wrong, the proud man's contumely,",
"The pangs of despis'd love, the law's delay,",
"The insolence of office, and the spurns",
"That patient merit of th' unworthy takes,",
"When he himself might his quietus make",
"With a bare bodkin? Who would these fardels bear,",
"To grunt and sweat under a weary life,",
"But that the dread of something after death-",
"The undiscover'd country, from whose bourn",
"No traveller returns- puzzles the will,",
"And makes us rather bear those ills we have",
"Than fly to others that we know not of?")
# Cleaning a text
string_clean(monologue,
# use string_magic to: lower the case and remove basic stopwords
"@lower, stopword",
# remove a few extra stopwords (we use the word flag 'w/')
"w/th, 's",
# manually stem some verbs
"despis'd => despise", "undiscover'd => undiscover", "(m|t)akes => \\1ake",
# still stemming: dropping the ending 's' when it follows 4+ word characters, except for quietus
"(\\w{3,}[^u])s\\b => \\1",
# normalizing the whitespaces + removing punctuation
"@ws.punct")
#> [1] "bear whip scorn time" "oppressor wrong proud man contumely"
#> [3] "pang despise love law delay" "insolence office spurn"
#> [5] "patient merit unworthy take" "might quietus make"
#> [7] "bare bodkin fardel bear" "grunt sweat weary life"
#> [9] "dread something death" "undiscover country whose bourn"
#> [11] "traveller return puzzle will" "make us rather bear ills"
#> [13] "fly other know"
string_vec
The function `string_vec` is dedicated to the creation of small character vectors. You feed it a comma-separated list of values in a string and it will turn it into a vector.
Ex.1: creating a simple vector.
fruits = string_vec("orange, apple, pineapple, strawberry")
fruits
#> [1] "orange" "apple" "pineapple" "strawberry"
Within the enumeration, you can use interpolation, with curly brackets (`{}`), to insert the elements from a vector into the current string.
Ex.2: adding a vector into an enumeration.
more_fruits = string_vec("lemon, {fruits}, peach")
more_fruits
#> [1] "lemon" "orange" "apple" "pineapple" "strawberry"
#> [6] "peach"
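As a point of comparison, Ex.1 and Ex.2 can be approximated in base R as below. This is a sketch of the end result only; it does not reproduce the interpolation mechanism.

```r
# Ex.1 equivalent: split on commas and trim the surrounding whitespace
fruits <- trimws(strsplit("orange, apple, pineapple, strawberry", ",", fixed = TRUE)[[1]])

# Ex.2 equivalent: splice the existing vector between two new values
more_fruits <- c("lemon", fruits, "peach")
```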
The interpolation is performed with `string_magic`. This means that any `string_magic` operation can be applied on-the-fly.
Ex.3: replicating Ex.2 but shortening long fruit names.
more_fruits = string_vec("lemon, {6 Shorten ? fruits}, peach")
more_fruits
#> [1] "lemon" "orange" "apple" "pinea.." "straw.." "peach"
Since interpolations are resolved with `string_magic`, you can add any text before/after the interpolation:
Ex.4: adding text before the interpolation.
pkgs = string_vec("pandas, os, time, re")
imports = string_vec("import numpy as np, import {pkgs}")
imports
#> [1] "import numpy as np" "import pandas" "import os"
#> [4] "import time" "import re"
You can transform the returned vector into a matrix or a data frame using the arguments `.cmat`, `.nmat` (character or numeric matrix) or `.df`.
Ex.5: returning a matrix.
string_vec("1, 5,
3, 2,
5, 12", .nmat = TRUE)
#> [,1] [,2]
#> [1,] 1 5
#> [2,] 3 2
#> [3,] 5 12
The number of rows is guessed from the number of newlines in the string. You can avoid using character strings, but in that case you need to explicitly give the number of rows.
Ex.5-bis: returning a numeric matrix, giving `.nmat` the number of rows.
string_vec(1, 5,
3, 2,
5, 12, .nmat = 3)
#> [,1] [,2]
#> [1,] 1 5
#> [2,] 3 2
#> [3,] 5 12
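The row-guessing rule of Ex.5 (one matrix row per line of the string) can be sketched in base R as follows. This is a simplified illustration that assumes every line holds the same number of values.

```r
txt <- "1, 5,
3, 2,
5, 12"

# one matrix row per line of the input string
rows <- strsplit(txt, "\n", fixed = TRUE)[[1]]
# split every line on commas, trim, and convert to numeric
values <- as.numeric(trimws(unlist(strsplit(rows, ",", fixed = TRUE))))
# fill the matrix row by row
m <- matrix(values, nrow = length(rows), byrow = TRUE)
```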
If you want to return a data.frame, you can add the column names in the `.df` argument, either as a regular vector or as a comma-separated list. Note that columns looking like numeric values are always converted.
Ex.6: returning a data frame.
# you can add the column names directly in the argument .df
df = string_vec("1, john,
3, marie,
5, harry", .df = "id, name")
df
#> id name
#> 1 1 john
#> 2 3 marie
#> 3 5 harry
# automatic conversion of numeric values
df$id * 5
#> [1] 5 15 25
The function `string_split2df` (and `string_split2dt`) splits a vector using a regular expression pattern and turns it into a data frame, remembering the original identifiers. You can get the original vectors back (almost) with the function `paste_conditional`.
Ex.1: breaking up two sentences with respect to punctuation and spaces; then merging them back.
x = c("Nor rain, wind, thunder, fire are my daughters.",
"When my information changes, I alter my conclusions.")
# we split at each word
sentences_split = string_split2df(x, "[[:punct:] ]+")
sentences_split
#> obs x
#> 1 1 Nor
#> 2 1 rain
#> 3 1 wind
#> 4 1 thunder
#> 5 1 fire
#> 6 1 are
#> 7 1 my
#> 8 1 daughters
#> 9 2 When
#> 10 2 my
#> 11 2 information
#> 12 2 changes
#> 13 2 I
#> 14 2 alter
#> 15 2 my
#> 16 2 conclusions
# recovering the original vectors (we only lose the punctuation)
paste_conditional(sentences_split$x, sentences_split$obs)
#> 1
#> "Nor rain wind thunder fire are my daughters"
#> 2
#> "When my information changes I alter my conclusions"
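The split-then-merge round trip of Ex.1 can be approximated in base R as below. The names `pieces`, `split_df` and `merged` are illustrative; the sketch covers only the default case without custom identifiers.

```r
x <- c("Nor rain, wind, thunder, fire are my daughters.",
       "When my information changes, I alter my conclusions.")

# split each sentence on runs of punctuation and spaces
pieces <- strsplit(x, "[[:punct:] ]+")

# one row per word, keeping track of the sentence it came from
split_df <- data.frame(obs = rep(seq_along(pieces), lengths(pieces)),
                       x = unlist(pieces))

# paste the words back together within each sentence
merged <- tapply(split_df$x, split_df$obs, paste, collapse = " ")
```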
If identifiers are associated with the elements of the vector, you can provide them so that the returned data frame contains them.
Ex.2: splitting with identifiers and merging back with a formula.
id = c("ws", "jmk")
# we add the identifier
base_words = string_split2df(x, "[[:punct:] ]+", id = list(author = id))
# merging back using a formula
paste_conditional(x ~ author, base_words)
#> author: jmk
#> "When my information changes I alter my conclusions"
#> author: ws
#> "Nor rain wind thunder fire are my daughters"