Extraction of HDD data

This function extracts data from HDD files, in a fashion similar to data.table but with additional arguments.

# S3 method for hdd
[(x, index, ..., file, newfile, replace = FALSE, all.vars = FALSE)

Arguments

x: An HDD file.
index: An index. You can use .N and variable names, as in data.table.
...: Other components of the extraction to be passed to data.table.
file: Which file to extract from? (Remember that HDD data is split across several files.) You can use .N.
newfile: A destination directory. Default is missing. Should the result of the query be saved into a new HDD directory? Otherwise, it is put in memory.
replace: Only used if argument newfile is not missing: default is FALSE. If newfile points to an existing HDD data set, you must set replace = TRUE to replace it.
all.vars: Logical, default is FALSE. By default, if the first argument of ... is provided (i.e. argument j) then only variables appearing in all ... plus the variable names found in index are extracted. If TRUE all variables are extracted before any selection is done. (This can be useful when the algorithm getting the variable names gets confused in case of complex queries.)

Value

Returns a data.table extracted from an HDD file, unless newfile is provided, in which case the result is written to disk instead.

Details

The extraction of variables looks like a regular data.table extraction, but in fact all operations are performed chunk by chunk behind the scenes.
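
To make the chunk-by-chunk idea concrete, here is a minimal conceptual sketch using data.table alone. It is not the package's implementation: the chunking by blocks of 40 rows and the names chunk_id, chunks and res are assumptions made for illustration. The same query is applied to each chunk and the partial results are stacked.

```r
library(data.table)

dt = as.data.table(iris)

# Hypothetical chunking: blocks of at most 40 rows (illustration only)
chunk_id = ceiling(seq_len(nrow(dt)) / 40)
chunks = lapply(split(seq_len(nrow(dt)), chunk_id), function(idx) dt[idx])

# Apply the same query to every chunk, then stack the partial results
res_list = lapply(chunks, function(chunk) {
  chunk[Petal.Width < 0.2, .(Species, Petal.Width)]
})
res = rbindlist(res_list)

# For a row filter, the stacked result matches the query on the full table
expected = dt[Petal.Width < 0.2, .(Species, Petal.Width)]
```

Row filters and column selections recombine trivially this way; grouped aggregations do not, since each chunk only sees its own rows.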

The extra arguments file, newfile and replace are added to a regular data.table call. Argument file selects the chunks; you can use the special variable .N to identify the last chunk.
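
As a short sketch of chunk selection with file, the snippet below builds the same toy HDD data set as in the Examples and reads only some of its chunks. The object names last_chunk and first_two are illustrative; the exact number of rows per chunk depends on how write_hdd splits the data.

```r
library(hdd)

# Toy HDD data set, as in the Examples section
hdd_path = tempfile()
write_hdd(iris, hdd_path, rowsPerChunk = 40)
data_hdd = hdd(hdd_path)

# Only the last chunk: .N stands for the number of files
last_chunk = data_hdd[, file = .N]

# Only the first two chunks
first_two = data_hdd[, file = 1:2]
```

Both results are data.tables held in memory, so regular functions such as summary or nrow apply to them directly.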

By default, the operation loads the result in memory. If the expected size of the result is too large for memory, you can use the argument newfile to create a new HDD data set without size restriction. If an HDD data set already exists at the newfile destination, use replace = TRUE to overwrite it.
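
The following sketch illustrates newfile together with replace. The destination out_path and the variable names pl2 and pl3 are illustrative. Based on the argument description, a second write to the same destination is expected to be refused unless replace = TRUE is given, so the second call sets it.

```r
library(hdd)

# Toy HDD data set, as in the Examples section
hdd_path = tempfile()
write_hdd(iris, hdd_path, rowsPerChunk = 40)
data_hdd = hdd(hdd_path)

# First write: the destination does not exist yet
out_path = tempfile()
data_hdd[, pl2 := Petal.Length^2, newfile = out_path]

# Second write to the same destination: replace = TRUE is required
data_hdd[, pl3 := Petal.Length^3, newfile = out_path, replace = TRUE]

# Read back the first chunk of the replaced data set to inspect it
first_chunk = hdd(out_path)[, file = 1]
```

After the second call, the data set at out_path should contain pl3 rather than pl2, since the whole destination was replaced.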

Author

Laurent Berge

Examples


# Toy example with iris data

# First we create a hdd data set to run the example
hdd_path = tempfile()
write_hdd(iris, hdd_path, rowsPerChunk = 40)

# your data set is now on the hard drive, in hdd format.
data_hdd = hdd(hdd_path)

# summary information on the whole file:
summary(data_hdd)
#> Hard drive data of 7.36 KB. Made of 4 files.
#> Location: C:/Users/lrberge/AppData/Local/Temp/Rtmpa0wfuK/file56884add139a/
#> 150 lines, 5 variables.

# You can use the argument 'file' to subselect slices.
# Let's have some descriptive statistics of the first slice of HDD
summary(data_hdd[, file = 1])
#>   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width    
#>  Min.   :4.300   Min.   :2.900   Min.   :1.000   Min.   :0.1000  
#>  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.2000  
#>  Median :5.000   Median :3.400   Median :1.500   Median :0.2000  
#>  Mean   :5.057   Mean   :3.462   Mean   :1.465   Mean   :0.2405  
#>  3rd Qu.:5.400   3rd Qu.:3.700   3rd Qu.:1.600   3rd Qu.:0.3000  
#>  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.5000  
#>        Species  
#>  setosa    :37  
#>  versicolor: 0  
#>  virginica : 0  
#>                 
#>                 
#>                 

# This extracts the data from the first HDD slice and
# returns a data.table in memory; we then apply summary to it.
# You can use the special variable .N, as in data.table.

# the following query shows the first and last lines of
# each slice of the HDD data set:
data_hdd[c(1, .N), file = 1:.N]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 1:          5.1         3.5          1.4         0.2     setosa
#> 2:          5.5         3.5          1.3         0.2     setosa
#> 3:          4.9         3.6          1.4         0.1     setosa
#> 4:          6.4         2.9          4.3         1.3 versicolor
#> 5:          6.6         3.0          4.4         1.4 versicolor
#> 6:          6.4         2.7          5.3         1.9  virginica
#> 7:          6.8         3.0          5.5         2.1  virginica
#> 8:          5.9         3.0          5.1         1.8  virginica

# Extraction of observations for which the variable
# Petal.Width is lower than 0.2
data_hdd[Petal.Width < 0.2, ]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1:          4.9         3.1          1.5         0.1  setosa
#> 2:          4.8         3.0          1.4         0.1  setosa
#> 3:          4.3         3.0          1.1         0.1  setosa
#> 4:          5.2         4.1          1.5         0.1  setosa
#> 5:          4.9         3.6          1.4         0.1  setosa

# You can apply data.table syntax:
data_hdd[, .(pl = Petal.Length)]
#>       pl
#>   1: 1.4
#>   2: 1.4
#>   3: 1.3
#>   4: 1.5
#>   5: 1.4
#>  ---    
#> 146: 5.2
#> 147: 5.0
#> 148: 5.2
#> 149: 5.4
#> 150: 5.1

# and create variables
data_hdd[, pl2 := Petal.Length^2]
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species   pl2
#>   1:          5.1         3.5          1.4         0.2    setosa  1.96
#>   2:          4.9         3.0          1.4         0.2    setosa  1.96
#>   3:          4.7         3.2          1.3         0.2    setosa  1.69
#>   4:          4.6         3.1          1.5         0.2    setosa  2.25
#>   5:          5.0         3.6          1.4         0.2    setosa  1.96
#>  ---                                                                  
#> 146:          6.7         3.0          5.2         2.3 virginica 27.04
#> 147:          6.3         2.5          5.0         1.9 virginica 25.00
#> 148:          6.5         3.0          5.2         2.0 virginica 27.04
#> 149:          6.2         3.4          5.4         2.3 virginica 29.16
#> 150:          5.9         3.0          5.1         1.8 virginica 26.01

# You can use the by clause, but then
# the by is applied slice by slice, NOT on the full data set:
data_hdd[, .(mean_pl = mean(Petal.Length)), by = Species]
#>       Species  mean_pl
#> 1:     setosa 1.464865
#> 2:     setosa 1.453846
#> 3: versicolor 4.312000
#> 4: versicolor 4.208000
#> 5:  virginica 5.675000
#> 6:  virginica 5.513158

# If the data you extract does not fit into memory,
# you can create a new HDD file with the argument 'newfile':
hdd_path_new = tempfile()
data_hdd[, pl2 := Petal.Length^2, newfile = hdd_path_new]
#> NULL
# check the result:
data_hdd_bis = hdd(hdd_path_new)
summary(data_hdd_bis)
#> Hard drive data of 8.77 KB. Made of 4 files.
#> Location: C:/Users/lrberge/AppData/Local/Temp/Rtmpa0wfuK/file56887d156215/
#> 150 lines, 6 variables.
print(data_hdd_bis)
#>      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species  pl2
#> 1             5.1         3.5          1.4         0.2    setosa 1.96
#> 2             4.9           3          1.4         0.2    setosa 1.96
#> 3             4.7         3.2          1.3         0.2    setosa 1.69
#> ----                                                                 
#> 148           6.5           3          5.2           2 virginica   27
#> 149           6.2         3.4          5.4         2.3 virginica 29.2
#> 150           5.9           3          5.1         1.8 virginica   26
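
Because by is applied slice by slice, a species that spans two slices yields two partial means in the example above. One way to obtain means over the full data set, sketched here under the assumption that the partial results fit in memory, is to extract per-slice sums and counts and then aggregate them a second time with data.table. The names partial, full, sum_pl and n are illustrative.

```r
library(hdd)
library(data.table)

# Toy HDD data set, as in the examples above
hdd_path = tempfile()
write_hdd(iris, hdd_path, rowsPerChunk = 40)
data_hdd = hdd(hdd_path)

# Step 1: per-slice partial sums and counts (one row per slice and species)
partial = data_hdd[, .(sum_pl = sum(Petal.Length),
                       n = length(Petal.Length)),
                   by = Species]

# Step 2: combine the partial results in memory
full = partial[, .(mean_pl = sum(sum_pl) / sum(n)), by = Species]
```

Averaging the per-slice means directly would be wrong whenever slices hold different numbers of rows per group, which is why sums and counts are carried through instead.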
