hdd_slice.Rd
This function applies complex R functions to data sets that are too large to fit in memory. It slices the input data, applies the function to each slice, then saves each result into a folder on the hard drive. That folder can then be used as a HDD data set.
hdd_slice(
x,
fun,
dir,
chunkMB = 500,
rowsPerChunk,
replace = FALSE,
verbose = 1,
...
)
A data set (data.frame, HDD).
A function to be applied to slices of the data set. The function must return a data frame-like object.
The destination directory where the data is saved.
The size of the slices in MB, default is 500. That is: the function fun is applied to each 500MB of data x. If the function creates a lot of additional information, you may want to lower this number. On the other hand, if the function reduces the information, you may want to raise it. In the end it depends on the amount of memory available.
Integer, default is missing. Alternative to the argument chunkMB. If provided, the function is applied to chunks of rowsPerChunk rows of x.
Whether all information on the destination directory should be erased beforehand. Default is FALSE.
Integer, defaults to 1. If greater than 0 then the progress is displayed.
Other parameters to be passed to fun.
It doesn't return anything: the output is "hard drive data" (a HDD data set) saved in the destination directory.
This function splits the original data into several slices and then applies a function to each of them, saving the results into a HDD data set.
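The slice/apply/save pattern described above can be sketched in base R. This is only a conceptual illustration, not the way hdd_slice is implemented: the helper name slice_apply_save, its rows_per_chunk argument and the .rds file naming are all assumptions made for the example.

```r
# Conceptual sketch only: this is NOT the hdd implementation.
# slice_apply_save, rows_per_chunk and the file names are hypothetical.
slice_apply_save = function(x, fun, dir, rows_per_chunk, ...){
    stopifnot(nrow(x) > 0, rows_per_chunk >= 1)
    dir.create(dir, showWarnings = FALSE, recursive = TRUE)

    n = nrow(x)
    # first row index of each slice
    starts = seq(1, n, by = rows_per_chunk)

    for(i in seq_along(starts)){
        rows = starts[i]:min(starts[i] + rows_per_chunk - 1, n)
        # apply the function to the current slice only
        res = fun(x[rows, , drop = FALSE], ...)
        # one file per slice in the destination folder
        saveRDS(res, file.path(dir, sprintf("slice_%04d.rds", i)))
    }

    invisible(length(starts))
}

out_dir = tempfile()
keep_long = function(d) d[d$Sepal.Length > 5, ]
n_slices = slice_apply_save(iris, keep_long, out_dir, rows_per_chunk = 30)

# The results only exist on disk; read them back one file at a time
files = list.files(out_dir, full.names = TRUE)
total_rows = sum(sapply(files, function(f) nrow(readRDS(f))))
```

Only one slice and its result are held in memory at any time, which is why the slice size has to be chosen with the function's output size in mind.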
You can perform merging operations with hdd_slice, but for regular merges note that the function hdd_merge may prove more convenient (no need to write an ad hoc function).
See hdd, sub-.hdd and cash-.hdd for the extraction and manipulation of out of memory data. For importation of HDD data sets from text files: see txt2hdd.
See hdd_slice to apply functions to chunks of data (and create HDD objects) and hdd_merge to merge large files.
To create/reshape HDD objects from memory or from other HDD objects, see write_hdd.
To display general information from HDD objects: origin, summary.hdd, print.hdd, dim.hdd and names.hdd.
# Toy example with iris data.
# Say you want to perform a cartesian merge.
# If the result of the function does not fit in memory,
# you can use hdd_slice (not the case for this example)
# preparing the cartesian merge
iris_bis = iris
names(iris_bis) = c(paste0("x_", 1:4), "species_bis")
fun_cartesian = function(x){
    # Note that x is treated as a data.table
    # => we need the argument allow.cartesian
    merge(x, iris_bis, allow.cartesian = TRUE)
}
hdd_result = tempfile() # => folder where results are saved
hdd_slice(iris, fun_cartesian, dir = hdd_result, rowsPerChunk = 30)
#> 1..
#> 2..
#> 3..
#> 4..
#> 5..
#> end.
# Let's look at the result
base_hdd = hdd(hdd_result)
summary(base_hdd)
#> Hard drive data of 63.6 KB. Made of 5 files.
#> Location: C:/Users/lrberge/AppData/Local/Temp/Rtmpa0wfuK/file568834da5d0c/
#> 22,500 lines, 10 variables.
head(base_hdd)
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species x_1 x_2 x_3 x_4
#> 1:          5.1         3.5          1.4         0.2  setosa 5.1 3.5 1.4 0.2
#> 2:          4.9         3.0          1.4         0.2  setosa 5.1 3.5 1.4 0.2
#> 3:          4.7         3.2          1.3         0.2  setosa 5.1 3.5 1.4 0.2
#> 4:          4.6         3.1          1.5         0.2  setosa 5.1 3.5 1.4 0.2
#> 5:          5.0         3.6          1.4         0.2  setosa 5.1 3.5 1.4 0.2
#> 6:          5.4         3.9          1.7         0.4  setosa 5.1 3.5 1.4 0.2
#>    species_bis
#> 1:      setosa
#> 2:      setosa
#> 3:      setosa
#> 4:      setosa
#> 5:      setosa
#> 6:      setosa
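A second, smaller sketch shows how extra arguments might be forwarded to fun through `...`. It assumes the hdd package is installed (the block is skipped otherwise). The helper keep_long and its argument min_len are hypothetical names introduced for the illustration; only hdd_slice, hdd and the arguments shown in the usage above come from this page.

```r
# Assumes the hdd package is installed; the block is skipped otherwise.
if(requireNamespace("hdd", quietly = TRUE)){
    library(hdd)

    # Hypothetical helper: keeps the rows whose Sepal.Length exceeds min_len.
    # As in the example above, x is treated as a data.table inside fun.
    keep_long = function(x, min_len){
        x[x$Sepal.Length > min_len, ]
    }

    dir_long = tempfile() # => folder where results are saved

    # min_len is not an argument of hdd_slice:
    # it is passed on to keep_long through `...`
    hdd_slice(iris, keep_long, dir = dir_long, rowsPerChunk = 50,
              verbose = 0, min_len = 5)

    # Let's look at the result
    res_long = hdd(dir_long)
    dim(res_long)
}
```

Since this function reduces the information in each slice, larger slices (a higher rowsPerChunk or chunkMB) would likely be fine here, whereas the cartesian merge above expands each slice and calls for smaller ones.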