hdd_merge.Rd
This function merges in-memory/HDD data to a HDD file.
hdd_merge(
  x,
  y,
  newfile,
  chunkMB,
  rowsPerChunk,
  all = FALSE,
  all.x = all,
  all.y = all,
  allow.cartesian = FALSE,
  replace = FALSE,
  verbose
)
x: A HDD object or a data.frame.
y: A data set, either a data.frame or a HDD object.
newfile: Destination of the result, i.e., a destination folder that will receive the HDD data.
chunkMB: Numeric, default is missing. If provided, the data 'x' is split in chunks of 'chunkMB' MB and the merge is applied chunkwise.
rowsPerChunk: Integer, default is missing. If provided, the data 'x' is split in chunks of 'rowsPerChunk' rows and the merge is applied chunkwise.
all: Default is FALSE.
all.x: Default is all.
all.y: Default is all.
allow.cartesian: Logical: whether to allow cartesian merge. Defaults to FALSE.
replace: Default is FALSE. If the destination folder already contains data, whether to replace it.
verbose: Numeric. Whether progress information should be displayed. If equal to 0, nothing is displayed. By default it is equal to 1 if the size of x is greater than 1GB.
This function does not return anything. It merges two potentially large (out of memory) data sets and saves the result on disk at the location newfile, the destination folder, which will be populated with .fst files representing chunks of the resulting merge.
To interact with the newly created data (on disk), use the function hdd().
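The idea of a chunkwise merge written to a folder can be illustrated in base R. This is only a conceptual sketch, not the implementation used by hdd_merge: the helper chunked_merge and its arguments are hypothetical, it handles only the inner merge case, it assumes x has at least one row, and it writes .rds files instead of .fst files to avoid extra dependencies.

```r
# Conceptual sketch only (hypothetical helper, not part of the hdd package):
# split 'x' by rows, merge each chunk with 'y', save each result to disk.
chunked_merge <- function(x, y, by, out_dir, rows_per_chunk) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  # First row index of each chunk (assumes nrow(x) >= 1)
  starts <- seq(1, nrow(x), by = rows_per_chunk)
  for (i in seq_along(starts)) {
    rows <- starts[i]:min(starts[i] + rows_per_chunk - 1, nrow(x))
    # Inner merge of the current chunk of 'x' with the whole of 'y'
    res <- merge(x[rows, , drop = FALSE], y, by = by)
    saveRDS(res, file.path(out_dir, sprintf("slice_%03d.rds", i)))
  }
  invisible(NULL)
}

x <- data.frame(id = rep(1:3, each = 4), value = 1:12)
y <- data.frame(id = c(1, 3), label = c("a", "c"))

out_dir <- tempfile()
chunked_merge(x, y, by = "id", out_dir = out_dir, rows_per_chunk = 5)

# Reading the chunks back and stacking them gives the full merge
files <- list.files(out_dir, full.names = TRUE)
merged <- do.call(rbind, lapply(files, readRDS))
```

Only one chunk of x is merged at a time, so peak memory depends on the chunk size rather than on the full size of x.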
If x (resp. y) is a HDD object, then the merge is performed chunkwise, using the original chunks of the object. To change the size of the chunks for x, you can use the argument chunkMB or rowsPerChunk. To change the chunk size of y, you can rewrite y with a new chunk size using write_hdd.
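As a hedged illustration of the two ways of controlling chunk sizes, the sketch below chunks x through rowsPerChunk and rewrites y on disk with write_hdd before merging. It assumes the hdd package is installed. The arguments of write_hdd are not documented on this page, so the call (destination folder as second argument, rowsPerChunk for the chunk size) is an assumption to check against the write_hdd help page. The toy key is built so that every chunk of x has matches in every chunk of y.

```r
library(hdd)

# Toy data: 'id' is the only variable common to 'x' and 'y'
x <- iris
x$id <- rep(1:4, length.out = nrow(x))
y <- data.frame(id = 1:4, group = c("a", "b", "c", "d"))

# Rewrite 'y' on disk in chunks of 2 rows
# (assumed call: see the write_hdd help page for the exact arguments)
y_path <- tempfile()
write_hdd(y, y_path, rowsPerChunk = 2)

# 'x' is chunked by 75 rows, 'y' keeps the chunks written above
out_path <- tempfile()
hdd_merge(x, hdd(y_path), newfile = out_path, rowsPerChunk = 75)

res <- hdd(out_path)
summary(res)
```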
Note that the merging operation could also be achieved with hdd_slice
(although it would require setting up an ad hoc function).
See hdd, sub-.hdd and cash-.hdd for the extraction and manipulation of out of memory data. For importation of HDD data sets from text files: see txt2hdd.
See hdd_slice to apply functions to chunks of data (and create HDD objects) and hdd_merge to merge large files.
To create/reshape HDD objects from memory or from other HDD objects, see write_hdd.
To display general information from HDD objects: origin, summary.hdd, print.hdd, dim.hdd and names.hdd.
# Toy example with iris data
# Cartesian merge example
iris_bis = iris
names(iris_bis) = c(paste0("x_", 1:4), "species_bis")
# We must have a common key on which to merge
iris_bis$id = iris$id = 1
# merge, we chunk 'x' by 50 rows
hdd_path = tempfile()
hdd_merge(iris, iris_bis, newfile = hdd_path,
          rowsPerChunk = 50, allow.cartesian = TRUE)
base_merged = hdd(hdd_path)
summary(base_merged)
#> Hard drive data of 63.9 KB. Made of 3 files.
#> Location: C:/Users/lrberge/AppData/Local/Temp/Rtmpa0wfuK/file56884b051eb/
#> 22,500 lines, 11 variables.
print(base_merged)
#>        id Sepal.Length Sepal.Width Petal.Length Petal.Width   Species x_1 x_2
#> 1       1          5.1         3.5          1.4         0.2    setosa 5.1 3.5
#> 2       1          5.1         3.5          1.4         0.2    setosa 4.9   3
#> 3       1          5.1         3.5          1.4         0.2    setosa 4.7 3.2
#> -------
#> 22,498  1          5.9           3          5.1         1.8 virginica 6.5   3
#> 22,499  1          5.9           3          5.1         1.8 virginica 6.2 3.4
#> 22,500  1          5.9           3          5.1         1.8 virginica 5.9   3
#>        x_3 x_4 species_bis
#> 1      1.4 0.2      setosa
#> 2      1.4 0.2      setosa
#> 3      1.3 0.2      setosa
#> -------
#> 22,498 5.2   2   virginica
#> 22,499 5.4 2.3   virginica
#> 22,500 5.1 1.8   virginica