This function connects to a hard drive data set (HDD). You can access the hard drive data in a similar way to a data.table.

hdd(dir)

Arguments

dir

The directory where the hard drive data set is.

Value

This function returns an object of class hdd which is linked to a folder on disk containing the data. The data itself is not loaded into memory.

This object is not intended to be manipulated directly as a regular list. Please use the methods sub-.hdd (the `[` operator) and cash-.hdd (the `$` operator) to extract the data.

Details

HDD has been created to deal with out-of-memory data sets. The data set is stored on the hard drive, split into multiple files, each of which is small enough to be worked on in memory.

You can perform extraction and manipulation operations as with a regular data set with sub-.hdd. Each operation is performed chunk-by-chunk behind the scenes.
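As an illustration of chunk-by-chunk extraction, here is a minimal sketch. It assumes the hdd package is installed and that write_hdd accepts a rowsPerChunk argument, mirroring the one used by txt2hdd in the examples of this page. The object names (hdd_demo_path, demo_hdd, setosa, sepal_length) are only illustrative.

```r
library(hdd)

# Write iris to disk as an HDD data set, in chunks of 40 rows
# (rowsPerChunk in write_hdd is assumed to mirror the txt2hdd argument)
hdd_demo_path = tempfile()
write_hdd(iris, hdd_demo_path, rowsPerChunk = 40)

# Link to the folder; the data stays on disk
demo_hdd = hdd(hdd_demo_path)

# Filter rows and select columns with data.table-like syntax;
# each chunk is processed in turn and the result is returned in memory
setosa = demo_hdd[Species == "setosa", .(Sepal.Length, Species)]
nrow(setosa)

# Extract a single variable with the $ operator
sepal_length = demo_hdd$Sepal.Length
length(sepal_length)
```

Because the work is done one chunk at a time, it is worth checking in the sub-.hdd help page how grouped operations behave: a by clause is understood to be applied within each chunk rather than across the whole data set.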

In terms of performance, working with complete data sets in memory will always be faster, because read/write operations on disk are orders of magnitude slower than read/write operations in memory. However, working from disk might be the only way to deal with out-of-memory data.

See also

See hdd, sub-.hdd and cash-.hdd for the extraction and manipulation of out-of-memory data. To import HDD data sets from text files, see txt2hdd.

See hdd_slice to apply functions to chunks of data (and create HDD objects) and hdd_merge to merge large files.

To create/reshape HDD objects from memory or from other HDD objects, see write_hdd.

To display general information on HDD objects, see origin, summary.hdd, print.hdd, dim.hdd and names.hdd.

Author

Laurent Berge

Examples


# Packages used below (fwrite comes from data.table)
library(hdd)
library(data.table)

# Toy example with iris data
iris_path = tempfile()
fwrite(iris, iris_path)

# destination path
hdd_path = tempfile()

# importing the text file in chunks of 50 rows:
txt2hdd(iris_path, dirDest = hdd_path, rowsPerChunk = 50)

# creating an HDD object linked to the destination folder
base_hdd = hdd(hdd_path)

# Summary information on the whole data set
summary(base_hdd)
#> Hard drive data of 7.28 KB. Made of 3 files.
#> Location: C:/Users/lrberge/AppData/Local/Temp/Rtmpa0wfuK/file568828a73083/
#> 150 lines, 5 variables.

# Looking at it like a regular data.frame
print(base_hdd)
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 1             5.1         3.5          1.4         0.2    setosa
#> 2             4.9           3          1.4         0.2    setosa
#> 3             4.7         3.2          1.3         0.2    setosa
#> ----                                                            
#> 148           6.5           3          5.2           2 virginica
#> 149           6.2         3.4          5.4         2.3 virginica
#> 150           5.9           3          5.1         1.8 virginica
dim(base_hdd)
#> [1] 150   5
names(base_hdd)
#> [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"