Merges data to a HDD file

This function merges in-memory/HDD data to a HDD file.

hdd_merge(
  x,
  y,
  newfile,
  chunkMB,
  rowsPerChunk,
  all = FALSE,
  all.x = all,
  all.y = all,
  allow.cartesian = FALSE,
  replace = FALSE,
  verbose
)

Arguments

x: A HDD object or a data.frame.
y: A data set, either a data.frame or a HDD object.
newfile: Destination of the result, i.e., a destination folder that will receive the HDD data.
chunkMB: Numeric, default is missing. If provided, the data 'x' is split in chunks of 'chunkMB' MB and the merge is applied chunkwise.
rowsPerChunk: Integer, default is missing. If provided, the data 'x' is split in chunks of 'rowsPerChunk' rows and the merge is applied chunkwise.
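all: Logical, default is FALSE. As in merge: if TRUE, the rows of both x and y that have no match are kept. Sets the default of all.x and all.y.
all.x: Logical, default is all. If TRUE, the rows of x that have no match in y are kept.
all.y: Logical, default is all. If TRUE, the rows of y that have no match in x are kept.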
allow.cartesian: Logical: whether to allow cartesian merge. Defaults to FALSE.
replace: Logical, default is FALSE. Whether to replace the existing data if the destination folder already contains some.
verbose: Numeric. Whether information on the progress of the merge should be displayed. If equal to 0, nothing is displayed. By default it is equal to 1 if the size of x is greater than 1GB.
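
The effect of all, all.x and all.y can be previewed on small in-memory data. The sketch below uses base R's merge and assumes that hdd_merge follows the same join semantics, as its matching argument names and defaults suggest. The data sets and object names are made up for illustration.

```r
# Base R illustration only; assumes hdd_merge applies the same join rules.
x = data.frame(id = 1:3, a = c("p", "q", "r"))
y = data.frame(id = c(2L, 3L, 4L), b = c(20, 30, 40))

# all = FALSE (the default): only the ids present in both x and y
inner = merge(x, y)

# all.x = TRUE: unmatched rows of x are kept, 'b' is NA for them
left = merge(x, y, all.x = TRUE)

# all.y = TRUE: unmatched rows of y are kept, 'a' is NA for them
right = merge(x, y, all.y = TRUE)

# all = TRUE: unmatched rows of both x and y are kept
full = merge(x, y, all = TRUE)
```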

Value

This function does not return anything. It merges two potentially large (out-of-memory) data sets and saves the result on disk at the location given by newfile, a destination folder that is populated with .fst files, each holding a chunk of the merged data.

To interact with the newly created data on disk, use the function hdd().

Details

If x (resp. y) is a HDD object, the merge is performed chunkwise, using the object's original chunks. To change the size of the chunks of x, use the argument chunkMB or rowsPerChunk.

To change the chunk size of y, you can rewrite y with a new chunk size using write_hdd.

Note that the merging operation could also be achieved with hdd_slice (although it would require setting up an ad hoc function).
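
To make the idea of a chunkwise merge concrete, here is a minimal sketch in base R. It is not the package's implementation: it only illustrates, on in-memory toy data, that merging x chunk by chunk and stacking the pieces gives the same rows as a single merge when only matching rows are kept (all = FALSE). The variable rows_per_chunk is a hypothetical stand-in for rowsPerChunk.

```r
# Illustrative sketch in base R, not the internals of hdd_merge.
x = data.frame(id = 1:10, a = letters[1:10])
y = data.frame(id = c(2L, 4L, 6L, 8L, 20L), b = c(10, 20, 30, 40, 50))

rows_per_chunk = 4  # hypothetical chunk size, plays the role of rowsPerChunk

# Assign each row of x to a chunk of at most rows_per_chunk consecutive rows
chunk_id = ceiling(seq_len(nrow(x)) / rows_per_chunk)
chunks = split(x, chunk_id)

# Merge each chunk of x with y on the common column "id" (inner join)
merged_chunks = lapply(chunks, function(chunk) merge(chunk, y, by = "id"))

# Stack the per-chunk results into a single data.frame
res_chunked = do.call(rbind, merged_chunks)
rownames(res_chunked) = NULL

# Reference: the same merge done in one go
res_full = merge(x, y, by = "id")
```

In hdd_merge each merged chunk is written to disk rather than kept in memory, so the full result never has to fit in RAM.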

Author

Laurent Berge

Examples


# Toy example with iris data

# Cartesian merge example
iris_bis = iris
names(iris_bis) = c(paste0("x_", 1:4), "species_bis")
# We must have a common key on which to merge
iris_bis$id = iris$id = 1

# merge, we chunk 'x' by 50 rows
hdd_path = tempfile()
hdd_merge(iris, iris_bis, newfile = hdd_path,
      rowsPerChunk = 50, allow.cartesian = TRUE)

base_merged = hdd(hdd_path)
summary(base_merged)
#> Hard drive data of 63.9 KB. Made of 3 files.
#> Location: C:/Users/lrberge/AppData/Local/Temp/Rtmpa0wfuK/file56884b051eb/
#> 22,500 lines, 11 variables.
print(base_merged)
#>         id Sepal.Length Sepal.Width Petal.Length Petal.Width   Species x_1 x_2
#> 1        1          5.1         3.5          1.4         0.2    setosa 5.1 3.5
#> 2        1          5.1         3.5          1.4         0.2    setosa 4.9   3
#> 3        1          5.1         3.5          1.4         0.2    setosa 4.7 3.2
#> -------                                                                       
#> 22,498   1          5.9           3          5.1         1.8 virginica 6.5   3
#> 22,499   1          5.9           3          5.1         1.8 virginica 6.2 3.4
#> 22,500   1          5.9           3          5.1         1.8 virginica 5.9   3
#>         x_3 x_4 species_bis
#> 1       1.4 0.2      setosa
#> 2       1.4 0.2      setosa
#> 3       1.3 0.2      setosa
#> -------                    
#> 22,498  5.2   2   virginica
#> 22,499  5.4 2.3   virginica
#> 22,500  5.1 1.8   virginica
