txt2hdd.Rd
Imports text data and saves it as an HDD file. It uses read_delim_chunked
to extract the data, and also allows the data to be preprocessed on the fly.
txt2hdd(
  path,
  dirDest,
  chunkMB = 500,
  rowsPerChunk,
  col_names,
  col_types,
  nb_skip,
  delim,
  preprocessfun,
  replace = FALSE,
  encoding = "UTF-8",
  verbose = 0,
  locale = NULL,
  ...
)
path: Character vector giving the path to the data. Note that it can be a pattern, so that multiple files with the same name pattern are imported (if so, it must be a fixed pattern, NOT a regular expression).
dirDest: The destination directory, where the new HDD data should be saved.
chunkMB: The chunk size in MB, defaults to 500MB. Alternatively, you can use the argument rowsPerChunk, which sets the chunk size as a number of rows.
rowsPerChunk: Number of rows per chunk. By default it is missing: its value is deduced from the argument chunkMB and the size of the file. If provided, it takes precedence over any value of chunkMB.
col_names: The column names; by default it uses the ones of the data set. If the data set lacks column names, you must provide them.
col_types: The column types, in the readr fashion. You can use guess_col_types to find them.
nb_skip: Number of lines to skip.
delim: The delimiter. By default the function tries to find the delimiter, but sometimes it fails.
preprocessfun: A function applied to the data before saving. Missing by default. Note that if a function is provided, it MUST return a data.frame; anything other than a data.frame is ignored.
replace: If the destination directory already exists, you need to set replace = TRUE to overwrite all the HDD files in it.
encoding: Character scalar containing the encoding of the file to be read. By default it is "UTF-8"; it is passed to the readr function locale, which is used by read_delim_chunked (the reading function). A common encoding in Western Europe is "ISO-8859-1" (use "file filename" in a non-Windows console to get the encoding). Note that this argument is ignored if the argument locale is not NULL.
verbose: Logical scalar or NULL (default). If TRUE, the progress of the import as well as the time taken are reported. If NULL, it becomes TRUE when the data to import is greater than 5GB or there is more than one chunk.
locale: Either NULL (default) or an object created with locale. This object is passed to the reading function read_delim_chunked and controls how the data is imported.
...: Other arguments passed to read_delim_chunked; quote = "" can be useful sometimes.
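As a sketch of how the arguments above fit together, here is a fully explicit call on a small throwaway file. This assumes the hdd and readr packages are installed; the file and its contents are purely illustrative:

```r
library(hdd)
library(readr)

# create a small CSV on disk
csv_path = tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:6, value = letters[1:6]),
          csv_path, row.names = FALSE)

# destination directory for the HDD data
dest = tempfile()

txt2hdd(csv_path, dirDest = dest,
        # explicit readr-style column specification; id is read as
        # character, as advised for long integer-like identifiers
        col_types = cols(id = col_character(), value = col_character()),
        delim = ",",       # bypasses delimiter guessing
        rowsPerChunk = 2)  # three chunks of two rows each
```

Here rowsPerChunk is set absurdly low only to force several chunks on a tiny file; in practice the chunkMB default is usually fine.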
This function does not return anything in R. Instead, it creates a folder on disk containing .fst files. These files contain the imported data converted to the hdd format. You can then read the created data with the function hdd().
This function uses read_delim_chunked from readr to read a large text file chunk by chunk, and generates an HDD data set.
Since the main importing function comes from readr, the column specification must also be in readr's style (namely cols or cols_only).
By default, the column types are guessed from the first 10,000 rows, by applying guess_col_types to these rows.
Note that by default, columns found to be integers are imported as double (for want of an integer64 type in readr). Also note that for large data sets, integer-like identifiers can sometimes exceed 16 digits: in that case you must import them as character to avoid losing information.
The delimiter is found with the function guess_delim, which relies on the guessing of fread. Note that fixed-width delimited files are not supported.
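The helper functions mentioned above can also be called directly to inspect a file before importing it. A minimal sketch, assuming the hdd package is installed (the exact signatures of guess_delim and guess_col_types are as documented in the package's own help pages; if in doubt, check them there):

```r
library(hdd)

# a small semicolon-delimited file to inspect
tmp = tempfile()
writeLines(c("x;y", "1;a", "2;b"), tmp)

# guess the delimiter of the file (relies on fread's guessing)
guess_delim(tmp)

# guess readr-style column types, to be passed on to col_types
guess_col_types(tmp)
```

This is handy when the automatic guessing in txt2hdd fails: you can run the guesses manually, adjust them, and pass the corrected values via the delim and col_types arguments.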
See hdd, [.hdd and $.hdd for the extraction and manipulation of out-of-memory data. For the importation of HDD data sets from text files, see txt2hdd.
See hdd_slice to apply functions to chunks of data (and create HDD objects) and hdd_merge to merge large files.
To create/reshape HDD objects from memory or from other HDD objects, see write_hdd.
To display general information from HDD objects: origin, summary.hdd, print.hdd, dim.hdd and names.hdd.
# Toy example with iris data
# we create a text file on disk
iris_path = tempfile()
fwrite(iris, iris_path)
# destination path
hdd_path = tempfile()
# reading the text file with HDD, with approx. 50 rows per chunk:
txt2hdd(iris_path, dirDest = hdd_path, rowsPerChunk = 50)
base_hdd = hdd(hdd_path)
summary(base_hdd)
#> Hard drive data of 7.28 KB. Made of 3 files.
#> Location: C:/Users/lrberge/AppData/Local/Temp/Rtmpa0wfuK/file56884ab37227/
#> 150 lines, 5 variables.
# Same example with preprocessing
sl_keep = sort(unique(sample(iris$Sepal.Length, 40)))
fun = function(x){
  # we keep only some observations & vars + renaming
  res = x[Sepal.Length %in% sl_keep, .(sl = Sepal.Length, Species)]
  # we create some variables
  res[, sl2 := sl**2]
  res
}
# reading with preprocessing
hdd_path_preprocess = tempfile()
txt2hdd(iris_path, hdd_path_preprocess,
        preprocessfun = fun, rowsPerChunk = 50)
base_hdd_preprocess = hdd(hdd_path_preprocess)
summary(base_hdd_preprocess)
#> Hard drive data of 4.11 KB. Made of 3 files.
#> Location: C:/Users/lrberge/AppData/Local/Temp/Rtmpa0wfuK/file56884cd168ba/
#> 121 lines, 3 variables.