Splits a character vector and formats the resulting substrings into a data.frame

string_split2df(
  x,
  data = NULL,
  split = NULL,
  id = NULL,
  add.pos = FALSE,
  id_unik = TRUE,
  fixed = FALSE,
  ignore.case = FALSE,
  word = FALSE,
  envir = parent.frame(),
  dt = FALSE,
  ...
)

string_split2dt(
  x,
  data = NULL,
  split = NULL,
  id = NULL,
  add.pos = FALSE,
  id_unik = TRUE,
  fixed = FALSE
)

Arguments

x

A character vector or a two-sided formula. If x is a two-sided formula, the argument data must be provided, since the variables are fetched from it. A formula has the form char_var ~ id1 + id2, where char_var on the left is a character variable and id1 and id2 on the right are identifiers that will be included in the resulting table. Alternatively, you can provide identifiers via the argument id.
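To make the formula interface more concrete, here is a minimal base R sketch of how the two sides of such a formula can be resolved against a data set. It only illustrates the idea and is not the package's actual implementation; the data set and the names char_var, id_vars, values and ids are made up for the example.

```r
# Hypothetical data set: one text column and two identifier columns
base <- data.frame(text = c("a b", "c"), id1 = 1:2, id2 = c("u", "v"),
                   stringsAsFactors = FALSE)
fml <- text ~ id1 + id2

# Variable names on each side of the formula
char_var <- all.vars(fml[[2]])   # left-hand side: "text"
id_vars  <- all.vars(fml[[3]])   # right-hand side: "id1" "id2"

# Fetch the corresponding columns from the data set
values <- base[[char_var]]       # the character vector to split
ids    <- base[id_vars]          # the identifiers, kept as a data.frame
```

This is also why data is mandatory with a formula: the names on both sides only have a meaning relative to the columns of the data set.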

data

Optional, only used if the argument x is a formula. It should contain the variables of the formula.

split

A character scalar used to split the character vector. By default it is interpreted as a regular expression. You can add flags to the pattern in the form flag1, flag2/pattern. Available flags are ignore (ignore case), fixed (no regex), word (add word boundaries) and magic (add interpolation with "{}"). Example: with "ignore/hello", a text containing "Hello" is split at "Hello". Shortcut: use only the first letters of the flags. Example: "iw/one" splits at the word "one" (flags 'ignore' + 'word').
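As a rough illustration of what the ignore, fixed and word flags mean, the sketch below produces comparable splits in base R with strsplit(). It is only an analogy, not a description of how the package processes the flags; the input strings and variable names are invented for the example.

```r
# 'ignore' analogue: case-insensitive regex via the inline (?i) modifier
parts_ignore <- strsplit("Say Hello twice", "(?i)hello", perl = TRUE)[[1]]

# 'fixed' analogue: the pattern is taken literally, so "." is not a wildcard
parts_fixed <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]

# 'word' analogue: word boundaries around the pattern, so the "one" inside
# "stone" is not a split point
parts_word <- strsplit("stone one two", "\\bone\\b", perl = TRUE)[[1]]

print(parts_ignore)
print(parts_fixed)
print(parts_word)
```

With the flag syntax, the same intent is written directly in the pattern, for instance "i/hello" or "w/one".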

id

Optional. A character vector or a list of vectors. If provided, the values of id are considered as identifiers that will be included in the resulting table.

add.pos

Logical, default is FALSE. Whether to include the position of each split element.

id_unik

Logical, default is TRUE. If identifiers are provided, whether to trigger a message when they are not unique. When the identifiers are not unique, the original texts cannot be reconstructed from them alone.
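The following base R sketch shows why non-unique identifiers prevent reconstruction. The small table is written by hand to mimic a split result (it is not produced by the package), and by_obs and by_id are names chosen for the example.

```r
# Hand-written table mimicking a split result: two texts sharing one identifier
pieces <- data.frame(
  obs = c(1, 1, 2, 2, 2),
  x   = c("Mazda", "RX4", "Mazda", "RX4", "Wag"),
  id  = c("4-4", "4-4", "4-4", "4-4", "4-4"),
  stringsAsFactors = FALSE
)

# Regrouping by the observation index recovers both original strings
by_obs <- vapply(split(pieces$x, pieces$obs), paste, character(1), collapse = " ")

# Regrouping by the non-unique identifier merges them into a single string
by_id <- vapply(split(pieces$x, pieces$id), paste, character(1), collapse = " ")

print(unname(by_obs))
print(unname(by_id))
```

Setting id_unik = FALSE only silences the message; it does not make the identifiers unique.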

fixed

Logical, default is FALSE. Whether to consider the argument split as fixed (and not as a regular expression).

ignore.case

Logical scalar, default is FALSE. If TRUE, then case insensitive search is triggered.

word

Logical scalar, default is FALSE. If TRUE, then a) word boundaries are added to the pattern, and b) patterns can be chained by separating them with a comma, in which case they are combined with a logical OR. Example: if word = TRUE, then split = "The, mountain" will match either the word 'The' or the word 'mountain'.

envir

Environment in which to evaluate the interpolations if the flag "magic" is provided. Default is parent.frame().

dt

Logical, default is FALSE. Whether to return a data.table. See also the function string_split2dt.

...

Not currently used.

Value

A data.frame, or a data.table if dt = TRUE, containing: i) obs: the observation index, ii) pos: the position of the text element in the initial string (optional, via add.pos), iii) the text element, iv) the identifier(s) (optional, only if id was provided).
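The layout of the returned table can be sketched in base R as follows. This is an illustration of the structure under simple assumptions (split on spaces, one identifier), not the package's code; the identifier values "first" and "second" are invented for the example.

```r
x <- c("a b c", "d e")
parts <- strsplit(x, " +")   # one vector of pieces per input string
n <- lengths(parts)          # number of pieces per input string

res <- data.frame(
  obs = rep(seq_along(x), n),          # index of the originating string
  pos = sequence(n),                   # position of the piece in that string
  x   = unlist(parts),                 # the piece itself
  id  = rep(c("first", "second"), n),  # identifier repeated for each piece
  stringsAsFactors = FALSE
)

print(res)
```

Each input string contributes as many rows as it has pieces, which is why obs and the identifiers are repeated.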

Functions

  • string_split2dt(): Splits a string vector and returns a data.table

See also

String operations: string_is(), string_get(), string_clean(), string_split2df(). Chain basic operations with string_ops(). Clean character vectors efficiently with string_clean().

Use string_vec() to create simple string vectors.

String interpolation combined with operation chaining: string_magic(). You can change string_magic default values with string_magic_alias() and add custom operations with string_magic_register_fun().

Display messages while benefiting from string_magic interpolation with cat_magic() and message_magic().

Other tools with aliases: cat_magic_alias(), string_magic(), string_magic_alias(), string_ops_alias(), string_vec_alias()

Examples


x = c("Nor rain, wind, thunder, fire are my daughters.",
      "When my information changes, I alter my conclusions.")

id = c("ws", "jmk")

# we split at each word
string_split2df(x, "[[:punct:] ]+")
#>    obs           x
#> 1    1         Nor
#> 2    1        rain
#> 3    1        wind
#> 4    1     thunder
#> 5    1        fire
#> 6    1         are
#> 7    1          my
#> 8    1   daughters
#> 9    2        When
#> 10   2          my
#> 11   2 information
#> 12   2     changes
#> 13   2           I
#> 14   2       alter
#> 15   2          my
#> 16   2 conclusions

# we add the 'id'
string_split2df(x, "[[:punct:] ]+", id = id)
#>    obs           x  id
#> 1    1         Nor  ws
#> 2    1        rain  ws
#> 3    1        wind  ws
#> 4    1     thunder  ws
#> 5    1        fire  ws
#> 6    1         are  ws
#> 7    1          my  ws
#> 8    1   daughters  ws
#> 9    2        When jmk
#> 10   2          my jmk
#> 11   2 information jmk
#> 12   2     changes jmk
#> 13   2           I jmk
#> 14   2       alter jmk
#> 15   2          my jmk
#> 16   2 conclusions jmk

# NOTE:
# - the second argument is `data`
# - when `data` is missing, `split` implicitly becomes the second argument
# - e.g. above we did not need to write `split = "[[:punct:] ]+"`

#
# using the formula

base = data.frame(text = x, my_id = id)
string_split2df(text ~ my_id, base, "[[:punct:] ]+")
#>    obs        text my_id
#> 1    1         Nor    ws
#> 2    1        rain    ws
#> 3    1        wind    ws
#> 4    1     thunder    ws
#> 5    1        fire    ws
#> 6    1         are    ws
#> 7    1          my    ws
#> 8    1   daughters    ws
#> 9    2        When   jmk
#> 10   2          my   jmk
#> 11   2 information   jmk
#> 12   2     changes   jmk
#> 13   2           I   jmk
#> 14   2       alter   jmk
#> 15   2          my   jmk
#> 16   2 conclusions   jmk

#
# with 2+ identifiers

base = within(mtcars, carname <- rownames(mtcars))

# we have a message because the identifiers are not unique
string_split2df(carname ~ am + gear + carb, base, " +")
#> The identifiers are not unique, you will not be able to reconstruct the data using only them.
#>    obs     carname am gear carb
#> 1    1       Mazda  1    4    4
#> 2    1         RX4  1    4    4
#> 3    2       Mazda  1    4    4
#> 4    2         RX4  1    4    4
#> 5    2         Wag  1    4    4
#> 6    3      Datsun  1    4    1
#> 7    3         710  1    4    1
#> 8    4      Hornet  0    3    1
#> 9    4           4  0    3    1
#> 10   4       Drive  0    3    1
#> 11   5      Hornet  0    3    2
#> 12   5  Sportabout  0    3    2
#> 13   6     Valiant  0    3    1
#> 14   7      Duster  0    3    4
#> 15   7         360  0    3    4
#> 16   8        Merc  0    4    2
#> 17   8        240D  0    4    2
#> 18   9        Merc  0    4    2
#> 19   9         230  0    4    2
#> 20  10        Merc  0    4    4
#> 21  10         280  0    4    4
#> 22  11        Merc  0    4    4
#> 23  11        280C  0    4    4
#> 24  12        Merc  0    3    3
#> 25  12       450SE  0    3    3
#> 26  13        Merc  0    3    3
#> 27  13       450SL  0    3    3
#> 28  14        Merc  0    3    3
#> 29  14      450SLC  0    3    3
#> 30  15    Cadillac  0    3    4
#> 31  15   Fleetwood  0    3    4
#> 32  16     Lincoln  0    3    4
#> 33  16 Continental  0    3    4
#> 34  17    Chrysler  0    3    4
#> 35  17    Imperial  0    3    4
#> 36  18        Fiat  1    4    1
#> 37  18         128  1    4    1
#> 38  19       Honda  1    4    2
#> 39  19       Civic  1    4    2
#> 40  20      Toyota  1    4    1
#> 41  20     Corolla  1    4    1
#> 42  21      Toyota  0    3    1
#> 43  21      Corona  0    3    1
#> 44  22       Dodge  0    3    2
#> 45  22  Challenger  0    3    2
#> 46  23         AMC  0    3    2
#> 47  23     Javelin  0    3    2
#> 48  24      Camaro  0    3    4
#> 49  24         Z28  0    3    4
#> 50  25     Pontiac  0    3    2
#> 51  25    Firebird  0    3    2
#> 52  26        Fiat  1    4    1
#> 53  26        X1-9  1    4    1
#> 54  27     Porsche  1    5    2
#> 55  27       914-2  1    5    2
#> 56  28       Lotus  1    5    2
#> 57  28      Europa  1    5    2
#> 58  29        Ford  1    5    4
#> 59  29     Pantera  1    5    4
#> 60  29           L  1    5    4
#> 61  30     Ferrari  1    5    6
#> 62  30        Dino  1    5    6
#> 63  31    Maserati  1    5    8
#> 64  31        Bora  1    5    8
#> 65  32       Volvo  1    4    2
#> 66  32        142E  1    4    2

# adding the position of the words & removing the message
string_split2df(carname ~ am + gear + carb, base, " +", id_unik = FALSE, add.pos = TRUE)
#>    obs pos     carname am gear carb
#> 1    1   1       Mazda  1    4    4
#> 2    1   2         RX4  1    4    4
#> 3    2   1       Mazda  1    4    4
#> 4    2   2         RX4  1    4    4
#> 5    2   3         Wag  1    4    4
#> 6    3   1      Datsun  1    4    1
#> 7    3   2         710  1    4    1
#> 8    4   1      Hornet  0    3    1
#> 9    4   2           4  0    3    1
#> 10   4   3       Drive  0    3    1
#> 11   5   1      Hornet  0    3    2
#> 12   5   2  Sportabout  0    3    2
#> 13   6   1     Valiant  0    3    1
#> 14   7   1      Duster  0    3    4
#> 15   7   2         360  0    3    4
#> 16   8   1        Merc  0    4    2
#> 17   8   2        240D  0    4    2
#> 18   9   1        Merc  0    4    2
#> 19   9   2         230  0    4    2
#> 20  10   1        Merc  0    4    4
#> 21  10   2         280  0    4    4
#> 22  11   1        Merc  0    4    4
#> 23  11   2        280C  0    4    4
#> 24  12   1        Merc  0    3    3
#> 25  12   2       450SE  0    3    3
#> 26  13   1        Merc  0    3    3
#> 27  13   2       450SL  0    3    3
#> 28  14   1        Merc  0    3    3
#> 29  14   2      450SLC  0    3    3
#> 30  15   1    Cadillac  0    3    4
#> 31  15   2   Fleetwood  0    3    4
#> 32  16   1     Lincoln  0    3    4
#> 33  16   2 Continental  0    3    4
#> 34  17   1    Chrysler  0    3    4
#> 35  17   2    Imperial  0    3    4
#> 36  18   1        Fiat  1    4    1
#> 37  18   2         128  1    4    1
#> 38  19   1       Honda  1    4    2
#> 39  19   2       Civic  1    4    2
#> 40  20   1      Toyota  1    4    1
#> 41  20   2     Corolla  1    4    1
#> 42  21   1      Toyota  0    3    1
#> 43  21   2      Corona  0    3    1
#> 44  22   1       Dodge  0    3    2
#> 45  22   2  Challenger  0    3    2
#> 46  23   1         AMC  0    3    2
#> 47  23   2     Javelin  0    3    2
#> 48  24   1      Camaro  0    3    4
#> 49  24   2         Z28  0    3    4
#> 50  25   1     Pontiac  0    3    2
#> 51  25   2    Firebird  0    3    2
#> 52  26   1        Fiat  1    4    1
#> 53  26   2        X1-9  1    4    1
#> 54  27   1     Porsche  1    5    2
#> 55  27   2       914-2  1    5    2
#> 56  28   1       Lotus  1    5    2
#> 57  28   2      Europa  1    5    2
#> 58  29   1        Ford  1    5    4
#> 59  29   2     Pantera  1    5    4
#> 60  29   3           L  1    5    4
#> 61  30   1     Ferrari  1    5    6
#> 62  30   2        Dino  1    5    6
#> 63  31   1    Maserati  1    5    8
#> 64  31   2        Bora  1    5    8
#> 65  32   1       Volvo  1    4    2
#> 66  32   2        142E  1    4    2