Sorts HDD objects

This function sets a key on an HDD file: it creates a copy of the HDD file sorted by that key. Note that sorting is very time-consuming.

hdd_setkey(x, key, newfile, chunkMB = 500, replace = FALSE, verbose = 1)

Arguments

x: An HDD file.
key: A character vector of the keys.
newfile: Destination of the result, i.e., a destination folder that will receive the HDD data.
chunkMB: The size of the chunks used to sort the data, in MB. Default is 500. The larger this number, the faster the sorting, subject to the memory you have available.
replace: Logical, default is FALSE. If the destination folder already contains data, whether to replace it.
verbose: Numeric, default is 1. Whether to display information on the progress of the algorithm. If equal to 0, nothing is displayed.

Value

This function does not return anything in R. Instead, its result is a new folder populated with .fst files, which represent a data set that can be loaded with the function hdd().

Details

This function is provided for convenience: it sorts the data and ensures consistency across files, but it is very slow because it copies the entire data set several times. Use it sparingly.
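
To give an intuition of why chunked sorting rewrites the data several times, here is a minimal base R sketch of a generic chunk-then-merge sort. It is an illustration under stated assumptions, not the code used by hdd_setkey(): the function chunked_sort() and its argument chunk_rows are hypothetical names, with chunk_rows standing in for chunkMB, and the merge step simply re-sorts each pair of chunks.

```r
# Generic chunk-then-merge sort on an in-memory data frame.
# NOT the hdd implementation: chunked_sort() and chunk_rows are hypothetical.
chunked_sort = function(df, key, chunk_rows) {
  n = nrow(df)
  starts = seq(1, n, by = chunk_rows)

  # Step 1: sort each chunk independently (first full pass over the data)
  chunks = lapply(starts, function(s) {
    chunk = df[s:min(s + chunk_rows - 1, n), , drop = FALSE]
    chunk[order(chunk[[key]]), , drop = FALSE]
  })

  # Step 2: combine sorted chunks pairwise until a single one remains.
  # Each round goes through the whole data set once more.
  passes = 1
  while (length(chunks) > 1) {
    merged = list()
    for (i in seq(1, length(chunks), by = 2)) {
      if (i == length(chunks)) {
        # odd chunk out: carried over unchanged to the next round
        merged[[length(merged) + 1]] = chunks[[i]]
      } else {
        # "merge" done here by re-sorting the two chunks together
        both = rbind(chunks[[i]], chunks[[i + 1]])
        merged[[length(merged) + 1]] = both[order(both[[key]]), , drop = FALSE]
      }
    }
    chunks = merged
    passes = passes + 1
  }

  list(data = chunks[[1]], passes = passes)
}

# Small chunks: 4 chunks of at most 40 rows, so several rounds are needed
res = chunked_sort(iris, key = "Sepal.Width", chunk_rows = 40)
res$passes

# One chunk holding everything: a single pass
res_big = chunked_sort(iris, key = "Sepal.Width", chunk_rows = 150)
res_big$passes
```

In this sketch, larger chunks mean fewer rounds over the data, which is consistent with the advice to use the largest chunkMB that your memory allows.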

Author

Laurent Berge

Examples


# Toy example with iris data

# Creating HDD data to be sorted
hdd_path = tempfile() # => folder where the data will be saved
write_hdd(iris, hdd_path)
# Let's add data to it
for(i in 1:5) write_hdd(iris, hdd_path, add = TRUE)

base_hdd = hdd(hdd_path)
summary(base_hdd)
#> Hard drive data of 26.5 KB. Made of 6 files.
#> Location: C:/Users/lrberge/AppData/Local/Temp/Rtmpa0wfuK/file56887d781cf3/
#> 900 lines, 5 variables.

# Sorting by Sepal.Width
hdd_sorted = tempfile()
# we use a very small chunkMB to show how the function works
hdd_setkey(base_hdd, key = "Sepal.Width",
       newfile = hdd_sorted, chunkMB = 0.010)
#> Guessing R size
#> ...
#> 5 files...
#> .
#> .
#> +
#> 2
#> .
#> .
#> +
#> 0
#> 


base_hdd_sorted = hdd(hdd_sorted)
summary(base_hdd_sorted) # => additional line "Sorted by:"
#> Hard drive data of 9.24 KB. Made of 4 files.
#> Sorted by: Sepal.Width 
#> Location: C:/Users/lrberge/AppData/Local/Temp/Rtmpa0wfuK/file5688185915d9/
#> 900 lines, 5 variables.
print(base_hdd_sorted)
#>       Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 1                5           2          3.5           1 versicolor
#> 2                5           2          3.5           1 versicolor
#> 3                5           2          3.5           1 versicolor
#> -----                                                             
#> 898            5.7         4.4          1.5         0.4     setosa
#> 899            5.7         4.4          1.5         0.4     setosa
#> 900            5.7         4.4          1.5         0.4     setosa

# Sort with two keys:
hdd_sorted = tempfile()
# we use a very small chunkMB to show how the function works
hdd_setkey(base_hdd, key = c("Species", "Sepal.Width"),
       newfile = hdd_sorted, chunkMB = 0.010)
#> Guessing R size
#> ...
#> 5 files...
#> .
#> .
#> +
#> 2
#> .
#> .
#> +
#> 1
#> .
#> +
#> 0
#> 


base_hdd_sorted = hdd(hdd_sorted)
summary(base_hdd_sorted)
#> Hard drive data of 9.02 KB. Made of 4 files.
#> Sorted by: Species, Sepal.Width 
#> Location: C:/Users/lrberge/AppData/Local/Temp/Rtmpa0wfuK/file5688384d1a4/
#> 900 lines, 5 variables.
print(base_hdd_sorted)
#>       Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 1              4.5         2.3          1.3         0.3    setosa
#> 2              4.5         2.3          1.3         0.3    setosa
#> 3              4.5         2.3          1.3         0.3    setosa
#> -----                                                            
#> 898            7.9         3.8          6.4           2 virginica
#> 899            7.7         3.8          6.7         2.2 virginica
#> 900            7.9         3.8          6.4           2 virginica
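
The following sketch illustrates the replace and verbose arguments, assuming the hdd package is installed. It only uses the functions and arguments documented on this page; the variable names src and dest are illustrative, and the behavior of replace = FALSE on a non-empty destination is deliberately not shown.

```r
library(hdd)

# Small HDD data set made of two files
src = tempfile()
write_hdd(iris, src)
write_hdd(iris, src, add = TRUE)

# First sort: the destination folder does not exist yet
dest = tempfile()
hdd_setkey(hdd(src), key = "Sepal.Width", newfile = dest, verbose = 0)

# Second sort into the same folder: the destination now contains data,
# so we explicitly ask for it to be replaced
hdd_setkey(hdd(src), key = "Species", newfile = dest,
       replace = TRUE, verbose = 0)

summary(hdd(dest))
```

With verbose = 0, no progress information is printed during the sort.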
