Splits a character vector and formats the resulting substrings into a data.frame
string_split2df(
  x,
  data = NULL,
  split = NULL,
  id = NULL,
  add.pos = FALSE,
  id_unik = TRUE,
  fixed = FALSE,
  ignore.case = FALSE,
  word = FALSE,
  envir = parent.frame(),
  dt = FALSE,
  ...
)
string_split2dt(
  x,
  data = NULL,
  split = NULL,
  id = NULL,
  add.pos = FALSE,
  id_unik = TRUE,
  fixed = FALSE
)
x
A character vector or a two-sided formula. If it is a two-sided formula, the argument data must be provided, since the variables are fetched from it. A formula has the form char_var ~ id1 + id2, where char_var on the left is a character variable, and id1 and id2 on the right are identifiers that will be included in the resulting table. Alternatively, you can provide identifiers via the argument id.
data
Optional, only used if the argument x is a formula. It should contain the variables of the formula.
split
A character scalar used to split the character vectors. By default it is a regular expression. You can add flags to the pattern in the form flag1, flag2/pattern. Available flags are ignore (case-insensitive), fixed (no regex), word (add word boundaries) and magic (add interpolation with "{}"). Example: with "ignore/hello", a text containing "Hello" is split at "Hello". Shortcut: use the first letters of the flags. Ex: "iw/one" splits at the word "one" (flags 'ignore' + 'word').
id
Optional. A character vector or a list of vectors. If provided, the values of id are considered as identifiers that will be included in the resulting table.
add.pos
Logical, default is FALSE. Whether to include the position of each split element.
id_unik
Logical, default is TRUE. When identifiers are provided, whether to trigger a message if they are not unique. If the identifiers are not unique, it is not possible to reconstruct the original texts from them alone.
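The uniqueness check described here can be approximated in base R with anyDuplicated(). The following is only a sketch of the idea, not the package's code; the names single_id, ids and has_dups are illustrative.

```r
# Sketch (base R only): detecting non-unique identifiers.

# A single identifier: each text has its own id, so nothing is duplicated
single_id <- c("ws", "jmk")
anyDuplicated(single_id) > 0
# FALSE

# Several identifiers: what matters is whether a whole combination repeats.
# In mtcars, "Mazda RX4" and "Mazda RX4 Wag" share am = 1, gear = 4, carb = 4.
ids <- mtcars[, c("am", "gear", "carb")]
has_dups <- anyDuplicated(ids) > 0
has_dups
# TRUE
```

When a combination repeats, rows coming from different texts carry the same identifiers, so only the obs column can tell them apart.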
fixed
Logical, default is FALSE. Whether to consider the argument split as fixed (and not as a regular expression).
ignore.case
Logical scalar, default is FALSE. If TRUE, the search is case-insensitive.
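The fixed and ignore.case arguments (like the fixed and ignore flags of split) change how the pattern is interpreted. As a rough illustration, the base R calls below behave similarly. This is a sketch built on strsplit() and assumes the semantics described above; it is not the package's implementation, and the input strings are made up.

```r
# Sketch (base R only): approximate equivalents of fixed and ignore.case.

# Case-insensitive regular expression, comparable to ignore.case = TRUE
ignore_case <- strsplit("Say Hello and hello again", "(?i)hello", perl = TRUE)[[1]]
ignore_case
# "Say ", " and ", " again"

# As a regular expression, "." matches any character, so every character
# is treated as a separator and only empty strings remain
regex_dot <- strsplit("a.b.c", ".")[[1]]

# As a fixed pattern, comparable to fixed = TRUE, the dot is taken literally
fixed_dot <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]
fixed_dot
# "a", "b", "c"
```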
word
Logical scalar, default is FALSE. If TRUE, then a) word boundaries are added to the pattern, and b) patterns can be chained by separating them with a comma, in which case they are combined with a logical OR. Example: if word = TRUE, then split = "The, mountain" will split at either the word 'The' or the word 'mountain'.
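To illustrate what word boundaries and the OR combination mean for a pattern, here is a base R sketch using strsplit() with an explicit regular expression. It assumes that word = TRUE is roughly equivalent to wrapping each comma-separated pattern in \b and joining them with |. That is an assumption made for illustration, not a description of the package's internals.

```r
# Sketch (base R only): word boundaries and OR-combined patterns.
x <- "Climb The hill or the mountain today"

# "The, mountain" read as: the word "The" OR the word "mountain".
# The lowercase "the" is left alone because the match is case-sensitive.
word_or <- strsplit(x, "\\b(The|mountain)\\b", perl = TRUE)[[1]]
word_or
# "Climb ", " hill or the ", " today"

# Effect of the boundaries: without them, "one" also matches inside "Stone"
no_word   <- strsplit("Stone and one more", "one")[[1]]
with_word <- strsplit("Stone and one more", "\\bone\\b", perl = TRUE)[[1]]
no_word
# "St", " and ", " more"
with_word
# "Stone and ", " more"
```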
envir
Environment in which to evaluate the interpolations if the flag "magic" is provided. Default is parent.frame().
dt
Logical, default is FALSE. Whether to return a data.table. See also the function string_split2dt.
...
Not currently used.
It returns a data.frame or a data.table which contains: i) obs: the observation index, ii) pos: the position of the text element in the initial string (optional, via add.pos), iii) the text element, iv) the identifier(s) (optional, only if id was provided).
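The shape of the returned table can be reproduced by hand in base R, which may help when reading the output. The sketch below builds the obs, pos, text and id columns from strsplit(). It only mirrors the layout described above and is not how the package computes its result; the shortened input texts and the names pieces, n and res are illustrative.

```r
# Sketch (base R only): a long table with one row per split element.
x  <- c("Nor rain, wind", "When my information changes")
id <- c("ws", "jmk")

pieces <- strsplit(x, "[[:punct:] ]+")   # one character vector per text
n <- lengths(pieces)                     # number of elements per text

res <- data.frame(
  obs = rep(seq_along(x), n),            # index of the original text
  pos = sequence(n),                     # position of the element in its text
  x   = unlist(pieces),                  # the text element itself
  id  = rep(id, n),                      # identifier repeated for each element
  stringsAsFactors = FALSE
)
res
```

Here the first text contributes three rows (obs = 1, pos = 1 to 3) and the second contributes four (obs = 2, pos = 1 to 4).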
string_split2dt(): Splits a string vector and returns a data.table
String operations: string_is(), string_get(), string_clean(), string_split2df().
Chain basic operations with string_ops(). Clean character vectors efficiently with string_clean().
Use string_vec() to create simple string vectors.
String interpolation combined with operation chaining: string_magic(). You can change string_magic default values with string_magic_alias() and add custom operations with string_magic_register_fun().
Display messages while benefiting from string_magic interpolation with cat_magic() and message_magic().
Other tools with aliases: cat_magic_alias(), string_magic(), string_magic_alias(), string_ops_alias(), string_vec_alias()
x = c("Nor rain, wind, thunder, fire are my daughters.",
      "When my information changes, I alter my conclusions.")
id = c("ws", "jmk")
# we split at each word
string_split2df(x, "[[:punct:] ]+")
#> obs x
#> 1 1 Nor
#> 2 1 rain
#> 3 1 wind
#> 4 1 thunder
#> 5 1 fire
#> 6 1 are
#> 7 1 my
#> 8 1 daughters
#> 9 2 When
#> 10 2 my
#> 11 2 information
#> 12 2 changes
#> 13 2 I
#> 14 2 alter
#> 15 2 my
#> 16 2 conclusions
# we add the 'id'
string_split2df(x, "[[:punct:] ]+", id = id)
#> obs x id
#> 1 1 Nor ws
#> 2 1 rain ws
#> 3 1 wind ws
#> 4 1 thunder ws
#> 5 1 fire ws
#> 6 1 are ws
#> 7 1 my ws
#> 8 1 daughters ws
#> 9 2 When jmk
#> 10 2 my jmk
#> 11 2 information jmk
#> 12 2 changes jmk
#> 13 2 I jmk
#> 14 2 alter jmk
#> 15 2 my jmk
#> 16 2 conclusions jmk
# TO NOTE:
# - the second argument is `data`
# - when it is missing, the argument `split` becomes implicitly the second
# - ex: above we did not use `split = "[[:punct:] ]+"`
#
# using the formula
base = data.frame(text = x, my_id = id)
string_split2df(text ~ my_id, base, "[[:punct:] ]+")
#> obs text my_id
#> 1 1 Nor ws
#> 2 1 rain ws
#> 3 1 wind ws
#> 4 1 thunder ws
#> 5 1 fire ws
#> 6 1 are ws
#> 7 1 my ws
#> 8 1 daughters ws
#> 9 2 When jmk
#> 10 2 my jmk
#> 11 2 information jmk
#> 12 2 changes jmk
#> 13 2 I jmk
#> 14 2 alter jmk
#> 15 2 my jmk
#> 16 2 conclusions jmk
#
# with 2+ identifiers
base = within(mtcars, carname <- rownames(mtcars))
# we have a message because the identifiers are not unique
string_split2df(carname ~ am + gear + carb, base, " +")
#> The identifiers are not unique, you will not be able to reconstruct the data using only them.
#> obs carname am gear carb
#> 1 1 Mazda 1 4 4
#> 2 1 RX4 1 4 4
#> 3 2 Mazda 1 4 4
#> 4 2 RX4 1 4 4
#> 5 2 Wag 1 4 4
#> 6 3 Datsun 1 4 1
#> 7 3 710 1 4 1
#> 8 4 Hornet 0 3 1
#> 9 4 4 0 3 1
#> 10 4 Drive 0 3 1
#> 11 5 Hornet 0 3 2
#> 12 5 Sportabout 0 3 2
#> 13 6 Valiant 0 3 1
#> 14 7 Duster 0 3 4
#> 15 7 360 0 3 4
#> 16 8 Merc 0 4 2
#> 17 8 240D 0 4 2
#> 18 9 Merc 0 4 2
#> 19 9 230 0 4 2
#> 20 10 Merc 0 4 4
#> 21 10 280 0 4 4
#> 22 11 Merc 0 4 4
#> 23 11 280C 0 4 4
#> 24 12 Merc 0 3 3
#> 25 12 450SE 0 3 3
#> 26 13 Merc 0 3 3
#> 27 13 450SL 0 3 3
#> 28 14 Merc 0 3 3
#> 29 14 450SLC 0 3 3
#> 30 15 Cadillac 0 3 4
#> 31 15 Fleetwood 0 3 4
#> 32 16 Lincoln 0 3 4
#> 33 16 Continental 0 3 4
#> 34 17 Chrysler 0 3 4
#> 35 17 Imperial 0 3 4
#> 36 18 Fiat 1 4 1
#> 37 18 128 1 4 1
#> 38 19 Honda 1 4 2
#> 39 19 Civic 1 4 2
#> 40 20 Toyota 1 4 1
#> 41 20 Corolla 1 4 1
#> 42 21 Toyota 0 3 1
#> 43 21 Corona 0 3 1
#> 44 22 Dodge 0 3 2
#> 45 22 Challenger 0 3 2
#> 46 23 AMC 0 3 2
#> 47 23 Javelin 0 3 2
#> 48 24 Camaro 0 3 4
#> 49 24 Z28 0 3 4
#> 50 25 Pontiac 0 3 2
#> 51 25 Firebird 0 3 2
#> 52 26 Fiat 1 4 1
#> 53 26 X1-9 1 4 1
#> 54 27 Porsche 1 5 2
#> 55 27 914-2 1 5 2
#> 56 28 Lotus 1 5 2
#> 57 28 Europa 1 5 2
#> 58 29 Ford 1 5 4
#> 59 29 Pantera 1 5 4
#> 60 29 L 1 5 4
#> 61 30 Ferrari 1 5 6
#> 62 30 Dino 1 5 6
#> 63 31 Maserati 1 5 8
#> 64 31 Bora 1 5 8
#> 65 32 Volvo 1 4 2
#> 66 32 142E 1 4 2
# adding the position of the words & removing the message
string_split2df(carname ~ am + gear + carb, base, " +", id_unik = FALSE, add.pos = TRUE)
#> obs pos carname am gear carb
#> 1 1 1 Mazda 1 4 4
#> 2 1 2 RX4 1 4 4
#> 3 2 1 Mazda 1 4 4
#> 4 2 2 RX4 1 4 4
#> 5 2 3 Wag 1 4 4
#> 6 3 1 Datsun 1 4 1
#> 7 3 2 710 1 4 1
#> 8 4 1 Hornet 0 3 1
#> 9 4 2 4 0 3 1
#> 10 4 3 Drive 0 3 1
#> 11 5 1 Hornet 0 3 2
#> 12 5 2 Sportabout 0 3 2
#> 13 6 1 Valiant 0 3 1
#> 14 7 1 Duster 0 3 4
#> 15 7 2 360 0 3 4
#> 16 8 1 Merc 0 4 2
#> 17 8 2 240D 0 4 2
#> 18 9 1 Merc 0 4 2
#> 19 9 2 230 0 4 2
#> 20 10 1 Merc 0 4 4
#> 21 10 2 280 0 4 4
#> 22 11 1 Merc 0 4 4
#> 23 11 2 280C 0 4 4
#> 24 12 1 Merc 0 3 3
#> 25 12 2 450SE 0 3 3
#> 26 13 1 Merc 0 3 3
#> 27 13 2 450SL 0 3 3
#> 28 14 1 Merc 0 3 3
#> 29 14 2 450SLC 0 3 3
#> 30 15 1 Cadillac 0 3 4
#> 31 15 2 Fleetwood 0 3 4
#> 32 16 1 Lincoln 0 3 4
#> 33 16 2 Continental 0 3 4
#> 34 17 1 Chrysler 0 3 4
#> 35 17 2 Imperial 0 3 4
#> 36 18 1 Fiat 1 4 1
#> 37 18 2 128 1 4 1
#> 38 19 1 Honda 1 4 2
#> 39 19 2 Civic 1 4 2
#> 40 20 1 Toyota 1 4 1
#> 41 20 2 Corolla 1 4 1
#> 42 21 1 Toyota 0 3 1
#> 43 21 2 Corona 0 3 1
#> 44 22 1 Dodge 0 3 2
#> 45 22 2 Challenger 0 3 2
#> 46 23 1 AMC 0 3 2
#> 47 23 2 Javelin 0 3 2
#> 48 24 1 Camaro 0 3 4
#> 49 24 2 Z28 0 3 4
#> 50 25 1 Pontiac 0 3 2
#> 51 25 2 Firebird 0 3 2
#> 52 26 1 Fiat 1 4 1
#> 53 26 2 X1-9 1 4 1
#> 54 27 1 Porsche 1 5 2
#> 55 27 2 914-2 1 5 2
#> 56 28 1 Lotus 1 5 2
#> 57 28 2 Europa 1 5 2
#> 58 29 1 Ford 1 5 4
#> 59 29 2 Pantera 1 5 4
#> 60 29 3 L 1 5 4
#> 61 30 1 Ferrari 1 5 6
#> 62 30 2 Dino 1 5 6
#> 63 31 1 Maserati 1 5 8
#> 64 31 2 Bora 1 5 8
#> 65 32 1 Volvo 1 4 2
#> 66 32 2 142E 1 4 2