This function merges two data sets, each either in memory or on disk (HDD), and writes the result to a HDD file.

hdd_merge(
  x,
  y,
  newfile,
  chunkMB,
  rowsPerChunk,
  all = FALSE,
  all.x = all,
  all.y = all,
  allow.cartesian = FALSE,
  replace = FALSE,
  verbose
)

Arguments

x

A HDD object or a data.frame.

y

A data set: either a data.frame or a HDD object.

newfile

Destination of the result, i.e., a destination folder that will receive the HDD data.

chunkMB

Numeric, default is missing. If provided, the data 'x' is split into chunks of 'chunkMB' MB and the merge is applied chunkwise.

rowsPerChunk

Integer, default is missing. If provided, the data 'x' is split into chunks of 'rowsPerChunk' rows and the merge is applied chunkwise.

all

Logical, default is FALSE. Sets the default value of both all.x and all.y.

all.x

Logical. Default is the value of all.

all.y

Logical. Default is the value of all.

allow.cartesian

Logical: whether to allow cartesian merge. Defaults to FALSE.

replace

Logical, default is FALSE. Whether to replace the existing data if the destination folder already contains some.

verbose

Numeric. Whether information on the progress of the merge should be displayed. If equal to 0, nothing is displayed. By default, it is equal to 1 if the size of x is greater than 1GB.
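
The arguments all, all.x and all.y share their names with those of base R's merge. The sketch below uses merge on two small in-memory tables to show what each setting keeps. It assumes that hdd_merge follows the same join semantics, which this page does not state explicitly. The tables x and y are made up for the illustration.

```r
# Two small tables sharing the key column 'id' (illustrative data)
x <- data.frame(id = c(1, 2, 3), a = c("x1", "x2", "x3"))
y <- data.frame(id = c(2, 3, 4), b = c("y2", "y3", "y4"))

inner <- merge(x, y)                # all = FALSE: only ids found in both tables
left  <- merge(x, y, all.x = TRUE)  # keep every row of x, fill b with NA
right <- merge(x, y, all.y = TRUE)  # keep every row of y, fill a with NA
full  <- merge(x, y, all = TRUE)    # keep every row of both tables

# Number of rows produced by each kind of join
c(inner = nrow(inner), left = nrow(left), right = nrow(right), full = nrow(full))
```

Since all.x and all.y both default to all, setting all = TRUE alone requests the full join.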

Value

This function does not return anything. It merges two potentially large (out-of-memory) data sets and saves the result on disk at the location newfile, the destination folder, which is populated with .fst files representing chunks of the resulting merge.

To interact with the newly created data (on disk), use the function hdd().

Details

If x (resp. y) is a HDD object, the merge is performed chunkwise, using the original chunks of the object. To change the size of the chunks of x, use the argument chunkMB or rowsPerChunk.

To change the chunk size of y, you can rewrite y with a new chunk size using write_hdd.

Note that the merging operation could also be achieved with hdd_slice (although it would require setting up an ad hoc function).
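
The chunkwise logic described in this section can be sketched in base R. The helper chunkwise_merge below is hypothetical and is not part of the hdd package: it splits x into blocks of rows, merges each block with y, and stacks the partial results. It only illustrates the idea for an inner join on in-memory data and says nothing about how hdd_merge is actually implemented.

```r
# Hypothetical helper, NOT part of the hdd package: inner join applied chunk by chunk
chunkwise_merge <- function(x, y, rows_per_chunk) {
  # Assign each row of x to a chunk of at most rows_per_chunk rows
  chunk_id <- ceiling(seq_len(nrow(x)) / rows_per_chunk)
  chunks <- split(x, chunk_id)
  # Merge each chunk with y, then stack the partial results
  parts <- lapply(chunks, function(chunk) merge(chunk, y))
  do.call(rbind, unname(parts))
}

# Illustrative data: only ids 2, 4, 6 and 8 are present in both tables
x <- data.frame(id = 1:10, a = letters[1:10])
y <- data.frame(id = c(2L, 4L, 6L, 8L, 20L), b = c("p", "q", "r", "s", "t"))

res_chunked <- chunkwise_merge(x, y, rows_per_chunk = 3)
res_full <- merge(x, y)

# For an inner join, the chunkwise result holds the same rows as the one-shot merge
isTRUE(all.equal(res_chunked$id, res_full$id))
```

Only one chunk of x needs to be combined with y at a time, which is why smaller chunks lower peak memory use at the cost of more merge calls.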

See also

See hdd, sub-.hdd and cash-.hdd for the extraction and manipulation of out-of-memory data. To import HDD data sets from text files, see txt2hdd.

See hdd_slice to apply functions to chunks of data (and create HDD objects) and hdd_merge to merge large files.

To create/reshape HDD objects from memory or from other HDD objects, see write_hdd.

To display general information from HDD objects, see origin, summary.hdd, print.hdd, dim.hdd and names.hdd.

Author

Laurent Berge

Examples


# Toy example with iris data

# Cartesian merge example
iris_bis = iris
names(iris_bis) = c(paste0("x_", 1:4), "species_bis")
# We must have a common key on which to merge
iris_bis$id = iris$id = 1

# merge, we chunk 'x' by 50 rows
hdd_path = tempfile()
hdd_merge(iris, iris_bis, newfile = hdd_path,
      rowsPerChunk = 50, allow.cartesian = TRUE)

base_merged = hdd(hdd_path)
summary(base_merged)
#> Hard drive data of 63.9 KB. Made of 3 files.
#> Location: C:/Users/lrberge/AppData/Local/Temp/Rtmpa0wfuK/file56884b051eb/
#> 22,500 lines, 11 variables.
print(base_merged)
#>         id Sepal.Length Sepal.Width Petal.Length Petal.Width   Species x_1 x_2
#> 1        1          5.1         3.5          1.4         0.2    setosa 5.1 3.5
#> 2        1          5.1         3.5          1.4         0.2    setosa 4.9   3
#> 3        1          5.1         3.5          1.4         0.2    setosa 4.7 3.2
#> -------                                                                       
#> 22,498   1          5.9           3          5.1         1.8 virginica 6.5   3
#> 22,499   1          5.9           3          5.1         1.8 virginica 6.2 3.4
#> 22,500   1          5.9           3          5.1         1.8 virginica 5.9   3
#>         x_3 x_4 species_bis
#> 1       1.4 0.2      setosa
#> 2       1.4 0.2      setosa
#> 3       1.3 0.2      setosa
#> -------                    
#> 22,498  5.2   2   virginica
#> 22,499  5.4 2.3   virginica
#> 22,500  5.1 1.8   virginica
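
The dimensions reported above can be cross-checked in memory with base R's merge, as a sanity check that does not involve the hdd package. Because every row carries the same key (id = 1), each of the 150 rows of the first table is paired with each of the 150 rows of the second. The names x and y below are local copies made for this check.

```r
# Rebuild the two inputs of the example as local copies
x <- iris
y <- iris
names(y) <- c(paste0("x_", 1:4), "species_bis")

# Same constant key on both sides, so every row matches every row
x$id <- 1
y$id <- 1

merged <- merge(x, y, by = "id")

# 150 * 150 rows; columns are id + 5 from x + 5 from y
dim(merged)
```

This matches the 22,500 lines and 11 variables shown by summary(base_merged).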