Splits a character vector into a data frame

Splits a character vector and formats the resulting substrings into a data.frame

string_split2df(
  x,
  data = NULL,
  split = NULL,
  id = NULL,
  add.pos = FALSE,
  id_unik = TRUE,
  fixed = FALSE,
  ignore.case = FALSE,
  word = FALSE,
  envir = parent.frame(),
  dt = FALSE,
  ...
)

string_split2dt(
  x,
  data = NULL,
  split = NULL,
  id = NULL,
  add.pos = FALSE,
  id_unik = TRUE,
  fixed = FALSE
)

Arguments

x: A character vector or a two-sided formula. If x is a two-sided formula, the argument data must be provided, since the variables are fetched from it. The formula has the form char_var ~ id1 + id2, where char_var on the left is a character variable and id1 and id2 on the right are identifiers which will be included in the resulting table. Alternatively, you can provide identifiers via the argument id.
data: Optional, only used if the argument x is a formula. It should contain the variables of the formula.
split: A character scalar used to split the character vector. By default it is interpreted as a regular expression. You can add flags to the pattern in the form flag1, flag2/pattern. Available flags are ignore (case), fixed (no regex), word (add word boundaries), magic (add interpolation with "{}"). Example: with "ignore/hello", a text containing "Hello" is split at "Hello". Shortcut: use the first letters of the flags. Ex: "iw/one" will split at the word "one" (flags 'ignore' + 'word').
id: Optional. A character vector or a list of vectors. If provided, the values of id are considered as identifiers that will be included in the resulting table.
add.pos: Logical, default is FALSE. Whether to include the position of each split element.
id_unik: Logical, default is TRUE. When identifiers are provided, whether to trigger a message if they are not unique. If the identifiers are not unique, it is not possible to reconstruct the original texts from them.
fixed: Logical, default is FALSE. Whether to consider the argument split as fixed (and not as a regular expression).
ignore.case: Logical scalar, default is FALSE. If TRUE, then case insensitive search is triggered.
word: Logical scalar, default is FALSE. If TRUE then a) word boundaries are added to the pattern, and b) patterns can be chained by separating them with a comma, in which case they are combined with a logical OR. Example: if word = TRUE, then split = "The, mountain" will split at either the word 'The' or the word 'mountain'.
envir: Environment in which to evaluate the interpolations if the flag "magic" is provided. Default is parent.frame().
dt: Logical, default is FALSE. Whether to return a data.table. See also the function string_split2dt.
...: Not currently used.
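
To make the flags ignore, fixed and word more concrete, here is a rough sketch of what each one corresponds to in plain regular-expression terms. It uses only base R's strsplit and is an approximation for illustration, not the package's implementation; the variable names are made up for this example.

```r
# Base-R analogues of the three regex-related flags (an approximation only).

# fixed: the pattern is taken literally, so "." is a plain dot
fixed_parts = strsplit("one.two.THREE", ".", fixed = TRUE)[[1]]

# ignore: case-insensitive matching, here via the inline (?i) modifier
ignore_parts = strsplit("a AND b and c", "(?i) and ", perl = TRUE)[[1]]

# word: word boundaries keep "sun" from matching inside "sunday"
word_parts = strsplit("rain sun sunday", "\\bsun\\b", perl = TRUE)[[1]]

# without the boundaries, "sunday" is cut as well
no_word_parts = strsplit("rain sun sunday", "sun", perl = TRUE)[[1]]

print(fixed_parts)
print(ignore_parts)
print(word_parts)
print(no_word_parts)
```

With the flag syntax described above, the same intent would be written inside the split argument itself, for instance with a "fixed/", "ignore/" or "word/" prefix.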

Value

A data.frame or a data.table containing: i) obs: the observation index, ii) pos: the position of the text element in the initial string (optional, via add.pos), iii) the text element, in a column named x or after the left-hand side variable of the formula, iv) the identifier(s) (optional, only if id or a formula was provided).
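
The layout of the returned table can be sketched in base R as follows. This is only an illustration of the obs / pos / text / id structure under simplified assumptions (a short input vector, a plain regular expression), not the code the function actually runs. The last step shows why unique identifiers matter: with them, the split elements can be grouped back into one text per identifier.

```r
# Illustrative inputs: two short texts and one identifier per text
x = c("Nor rain, wind", "When my information changes")
id = c("ws", "jmk")

# split each text at runs of punctuation or spaces
parts = strsplit(x, "[[:punct:] ]+")
n = lengths(parts)

res = data.frame(
  obs = rep(seq_along(x), n),  # index of the original text
  pos = sequence(n),           # position of the element within its text
  x = unlist(parts)            # the text element itself
)

# identifiers are repeated for every element of the same observation
res$id = id[res$obs]

# with unique identifiers, elements can be regrouped per original text
rebuilt = tapply(res$x, res$id, paste, collapse = " ")

print(res)
print(rebuilt)
```

The regrouped texts lose the original separators, since those are consumed by the split.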

Functions

string_split2dt(): Splits a string vector and returns a data.table

Examples


x = c("Nor rain, wind, thunder, fire are my daughters.",
      "When my information changes, I alter my conclusions.")

id = c("ws", "jmk")

# we split at each word
string_split2df(x, "[[:punct:] ]+")
#>    obs           x
#> 1    1         Nor
#> 2    1        rain
#> 3    1        wind
#> 4    1     thunder
#> 5    1        fire
#> 6    1         are
#> 7    1          my
#> 8    1   daughters
#> 9    2        When
#> 10   2          my
#> 11   2 information
#> 12   2     changes
#> 13   2           I
#> 14   2       alter
#> 15   2          my
#> 16   2 conclusions

# we add the 'id'
string_split2df(x, "[[:punct:] ]+", id = id)
#>    obs           x  id
#> 1    1         Nor  ws
#> 2    1        rain  ws
#> 3    1        wind  ws
#> 4    1     thunder  ws
#> 5    1        fire  ws
#> 6    1         are  ws
#> 7    1          my  ws
#> 8    1   daughters  ws
#> 9    2        When jmk
#> 10   2          my jmk
#> 11   2 information jmk
#> 12   2     changes jmk
#> 13   2           I jmk
#> 14   2       alter jmk
#> 15   2          my jmk
#> 16   2 conclusions jmk

# TO NOTE:
# - the second argument is `data`
# - when it is missing, the argument `split` becomes implicitly the second
# - ex: above we did not use `split = "[[:punct:] ]+"`

#
# using the formula

base = data.frame(text = x, my_id = id)
string_split2df(text ~ my_id, base, "[[:punct:] ]+")
#>    obs        text my_id
#> 1    1         Nor    ws
#> 2    1        rain    ws
#> 3    1        wind    ws
#> 4    1     thunder    ws
#> 5    1        fire    ws
#> 6    1         are    ws
#> 7    1          my    ws
#> 8    1   daughters    ws
#> 9    2        When   jmk
#> 10   2          my   jmk
#> 11   2 information   jmk
#> 12   2     changes   jmk
#> 13   2           I   jmk
#> 14   2       alter   jmk
#> 15   2          my   jmk
#> 16   2 conclusions   jmk

#
# with 2+ identifiers

base = within(mtcars, carname <- rownames(mtcars))

# we have a message because the identifiers are not unique
string_split2df(carname ~ am + gear + carb, base, " +")
#> The identifiers are not unique, you will not be able to reconstruct the data using only them.
#>    obs     carname am gear carb
#> 1    1       Mazda  1    4    4
#> 2    1         RX4  1    4    4
#> 3    2       Mazda  1    4    4
#> 4    2         RX4  1    4    4
#> 5    2         Wag  1    4    4
#> 6    3      Datsun  1    4    1
#> 7    3         710  1    4    1
#> 8    4      Hornet  0    3    1
#> 9    4           4  0    3    1
#> 10   4       Drive  0    3    1
#> 11   5      Hornet  0    3    2
#> 12   5  Sportabout  0    3    2
#> 13   6     Valiant  0    3    1
#> 14   7      Duster  0    3    4
#> 15   7         360  0    3    4
#> 16   8        Merc  0    4    2
#> 17   8        240D  0    4    2
#> 18   9        Merc  0    4    2
#> 19   9         230  0    4    2
#> 20  10        Merc  0    4    4
#> 21  10         280  0    4    4
#> 22  11        Merc  0    4    4
#> 23  11        280C  0    4    4
#> 24  12        Merc  0    3    3
#> 25  12       450SE  0    3    3
#> 26  13        Merc  0    3    3
#> 27  13       450SL  0    3    3
#> 28  14        Merc  0    3    3
#> 29  14      450SLC  0    3    3
#> 30  15    Cadillac  0    3    4
#> 31  15   Fleetwood  0    3    4
#> 32  16     Lincoln  0    3    4
#> 33  16 Continental  0    3    4
#> 34  17    Chrysler  0    3    4
#> 35  17    Imperial  0    3    4
#> 36  18        Fiat  1    4    1
#> 37  18         128  1    4    1
#> 38  19       Honda  1    4    2
#> 39  19       Civic  1    4    2
#> 40  20      Toyota  1    4    1
#> 41  20     Corolla  1    4    1
#> 42  21      Toyota  0    3    1
#> 43  21      Corona  0    3    1
#> 44  22       Dodge  0    3    2
#> 45  22  Challenger  0    3    2
#> 46  23         AMC  0    3    2
#> 47  23     Javelin  0    3    2
#> 48  24      Camaro  0    3    4
#> 49  24         Z28  0    3    4
#> 50  25     Pontiac  0    3    2
#> 51  25    Firebird  0    3    2
#> 52  26        Fiat  1    4    1
#> 53  26        X1-9  1    4    1
#> 54  27     Porsche  1    5    2
#> 55  27       914-2  1    5    2
#> 56  28       Lotus  1    5    2
#> 57  28      Europa  1    5    2
#> 58  29        Ford  1    5    4
#> 59  29     Pantera  1    5    4
#> 60  29           L  1    5    4
#> 61  30     Ferrari  1    5    6
#> 62  30        Dino  1    5    6
#> 63  31    Maserati  1    5    8
#> 64  31        Bora  1    5    8
#> 65  32       Volvo  1    4    2
#> 66  32        142E  1    4    2

# adding the position of the words & removing the message
string_split2df(carname ~ am + gear + carb, base, " +", id_unik = FALSE, add.pos = TRUE)
#>    obs pos     carname am gear carb
#> 1    1   1       Mazda  1    4    4
#> 2    1   2         RX4  1    4    4
#> 3    2   1       Mazda  1    4    4
#> 4    2   2         RX4  1    4    4
#> 5    2   3         Wag  1    4    4
#> 6    3   1      Datsun  1    4    1
#> 7    3   2         710  1    4    1
#> 8    4   1      Hornet  0    3    1
#> 9    4   2           4  0    3    1
#> 10   4   3       Drive  0    3    1
#> 11   5   1      Hornet  0    3    2
#> 12   5   2  Sportabout  0    3    2
#> 13   6   1     Valiant  0    3    1
#> 14   7   1      Duster  0    3    4
#> 15   7   2         360  0    3    4
#> 16   8   1        Merc  0    4    2
#> 17   8   2        240D  0    4    2
#> 18   9   1        Merc  0    4    2
#> 19   9   2         230  0    4    2
#> 20  10   1        Merc  0    4    4
#> 21  10   2         280  0    4    4
#> 22  11   1        Merc  0    4    4
#> 23  11   2        280C  0    4    4
#> 24  12   1        Merc  0    3    3
#> 25  12   2       450SE  0    3    3
#> 26  13   1        Merc  0    3    3
#> 27  13   2       450SL  0    3    3
#> 28  14   1        Merc  0    3    3
#> 29  14   2      450SLC  0    3    3
#> 30  15   1    Cadillac  0    3    4
#> 31  15   2   Fleetwood  0    3    4
#> 32  16   1     Lincoln  0    3    4
#> 33  16   2 Continental  0    3    4
#> 34  17   1    Chrysler  0    3    4
#> 35  17   2    Imperial  0    3    4
#> 36  18   1        Fiat  1    4    1
#> 37  18   2         128  1    4    1
#> 38  19   1       Honda  1    4    2
#> 39  19   2       Civic  1    4    2
#> 40  20   1      Toyota  1    4    1
#> 41  20   2     Corolla  1    4    1
#> 42  21   1      Toyota  0    3    1
#> 43  21   2      Corona  0    3    1
#> 44  22   1       Dodge  0    3    2
#> 45  22   2  Challenger  0    3    2
#> 46  23   1         AMC  0    3    2
#> 47  23   2     Javelin  0    3    2
#> 48  24   1      Camaro  0    3    4
#> 49  24   2         Z28  0    3    4
#> 50  25   1     Pontiac  0    3    2
#> 51  25   2    Firebird  0    3    2
#> 52  26   1        Fiat  1    4    1
#> 53  26   2        X1-9  1    4    1
#> 54  27   1     Porsche  1    5    2
#> 55  27   2       914-2  1    5    2
#> 56  28   1       Lotus  1    5    2
#> 57  28   2      Europa  1    5    2
#> 58  29   1        Ford  1    5    4
#> 59  29   2     Pantera  1    5    4
#> 60  29   3           L  1    5    4
#> 61  30   1     Ferrari  1    5    6
#> 62  30   2        Dino  1    5    6
#> 63  31   1    Maserati  1    5    8
#> 64  31   2        Bora  1    5    8
#> 65  32   1       Volvo  1    4    2
#> 66  32   2        142E  1    4    2
