Transforms text data into an HDD file

Imports text data and saves it into an HDD file. It uses read_delim_chunked to extract the data. It also allows preprocessing the data.

txt2hdd(
  path,
  dirDest,
  chunkMB = 500,
  rowsPerChunk,
  col_names,
  col_types,
  nb_skip,
  delim,
  preprocessfun,
  replace = FALSE,
  encoding = "UTF-8",
  verbose = 0,
  locale = NULL,
  ...
)

Arguments

path

Character vector that represents the path to the data. Note that it can be a pattern if multiple files sharing the same name pattern are to be imported (if so, it must be a fixed pattern, NOT a regular expression).

dirDest

The destination directory, where the new HDD data should be saved.

chunkMB

The chunk size in MB; defaults to 500. Alternatively, you can use the argument rowsPerChunk, which sets the chunk size as a number of lines.

rowsPerChunk

Number of rows per chunk. By default it is missing and its value is deduced from the argument chunkMB and the size of the file. If provided, it overrides any value given in chunkMB.

col_names

The column names. By default it uses the ones of the data set. If the data set lacks column names, you must provide them.

col_types

The column types, in the readr fashion. You can use guess_col_types to find them.

nb_skip

Number of lines to skip.

delim

The delimiter. By default the function tries to find the delimiter, but sometimes it fails.

preprocessfun

A function that is applied to the data before saving. Default is missing. Note that if a function is provided, it MUST return a data.frame; anything other than a data.frame is ignored.

replace

If the destination directory already exists, you need to set the argument replace=TRUE to overwrite all the HDD files in it.

encoding

Character scalar containing the encoding of the file to be read. By default it is "UTF-8" and is passed to the readr function locale which is used in read_delim_chunked (the reading function). A common encoding in Western Europe is "ISO-8859-1" (simply use "file filename" in a non-Windows console to get the encoding).

Note that this argument is ignored if the argument locale is not NULL.
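
As an illustration of how encoding and locale relate, the sketch below builds a locale object by hand with readr's locale() function. The file name and the decimal mark are assumptions made for the example, not values taken from this documentation.

```r
library(readr)

# A locale for a Latin-1 file that uses a comma as decimal mark
# (both settings are assumptions made for this example)
loc <- locale(encoding = "ISO-8859-1", decimal_mark = ",")

# Hypothetical input file: replace it with your own path
latin1_path <- "my_latin1_file.csv"

# Because `locale` is not NULL here, the `encoding` argument is ignored
if (file.exists(latin1_path)) {
  hdd::txt2hdd(latin1_path, dirDest = tempfile(), locale = loc)
}
```

If only the encoding differs from the default, passing encoding = "ISO-8859-1" alone should be enough, since that value is forwarded to locale.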

verbose

Logical scalar or NULL (default). If TRUE, then the evolution of the importing process as well as the time to import are reported. If NULL, it becomes TRUE when the data to import is greater than 5GB or there is more than one chunk.

locale

Either NULL (default) or an object created with locale. This object will be passed to the reading function read_delim_chunked and handles how the data is imported.

...

Other arguments to be passed to read_delim_chunked. For instance, quote = "" can sometimes be useful.

Value

This function does not return anything in R. Instead it creates a folder on disk containing .fst files. These files represent the data that has been imported and converted to the hdd format.

You can then read the created data with the function hdd().

Details

This function uses read_delim_chunked from readr to read a large text file chunk by chunk and generates an HDD data set.
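
To give an idea of what chunked reading looks like, here is a minimal sketch that uses readr directly. It is not the internal code of txt2hdd: the callback count_rows and the variable rows_per_chunk are hypothetical names introduced for the illustration.

```r
library(readr)

# Write the iris data to a temporary CSV file (150 rows plus a header)
csv_path <- tempfile(fileext = ".csv")
write_csv(iris, csv_path)

# Collect the number of rows seen in each chunk
rows_per_chunk <- integer(0)
count_rows <- function(chunk, pos) {
  # `chunk` is a data frame; `pos` is the position of its first row
  rows_per_chunk <<- c(rows_per_chunk, nrow(chunk))
}

# Read at most 50 rows at a time; the callback is run for its side effect
read_delim_chunked(
  csv_path,
  callback = SideEffectChunkCallback$new(count_rows),
  delim = ",",
  chunk_size = 50,
  col_types = cols(.default = col_double(), Species = col_character())
)

rows_per_chunk
```

Only one chunk is held in memory at a time, which is what makes it possible to process files larger than the available RAM.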

Since the main function for importation uses readr, the column specification must also be in readr's style (namely cols or cols_only).
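
As a sketch of what such a specification can look like for the iris data used in the examples, the code below builds one full and one partial specification. The object names spec_all and spec_some are hypothetical.

```r
library(readr)

# Full specification: every column is typed explicitly
spec_all <- cols(
  Sepal.Length = col_double(),
  Sepal.Width  = col_double(),
  Petal.Length = col_double(),
  Petal.Width  = col_double(),
  Species      = col_character()
)

# Partial specification: only the listed columns are imported
spec_some <- cols_only(
  Sepal.Length = col_double(),
  Species      = col_character()
)
```

Either object can then be supplied through the col_types argument, for instance txt2hdd(iris_path, dirDest = hdd_path, col_types = spec_all).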

By default a guess of the column types is made on the first 10,000 rows. The guess is the application of guess_col_types on these rows.

Note that by default, columns that are found to be integers are imported as double (for want of an integer64 type in readr). Note that for large data sets, integer-like identifiers can sometimes be larger than 16 digits: in these cases you must import them as character so as not to lose information.
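
The loss of information comes from the double type itself, which represents integers exactly only up to 2^53 (about 9.007e15). The short sketch below shows the effect and one possible way to guard against it; the column name id is hypothetical.

```r
library(readr)

# Beyond 2^53, consecutive integers can no longer be told apart as doubles
big_id <- 2^53
same <- (big_id + 1 == big_id)
same

# Hypothetical identifier column `id`: read it as character to keep all digits
id_spec <- cols(id = col_character())
```

A specification like id_spec can be passed through col_types so that the identifier is never converted to a number.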

The delimiter is found with the function guess_delim, which relies on the delimiter guessing of fread. Note that fixed-width files are not supported.

Author

Laurent Berge

Examples


# Toy example with iris data

# we load the packages (fwrite comes from data.table)
library(hdd)
library(data.table)

# we create a text file on disk
iris_path = tempfile()
fwrite(iris, iris_path)

# destination path
hdd_path = tempfile()
# reading the text file with HDD, with approx. 50 rows per chunk:
txt2hdd(iris_path, dirDest = hdd_path, rowsPerChunk = 50)

base_hdd = hdd(hdd_path)
summary(base_hdd)
#> Hard drive data of 7.28 KB. Made of 3 files.
#> Location: C:/Users/lrberge/AppData/Local/Temp/Rtmpa0wfuK/file56884ab37227/
#> 150 lines, 5 variables.

# Same example with preprocessing
sl_keep = sort(unique(sample(iris$Sepal.Length, 40)))
fun = function(x){
  # we keep only some observations & vars + renaming
  res = x[Sepal.Length %in% sl_keep, .(sl = Sepal.Length, Species)]
  # we create some variables
  res[, sl2 := sl**2]
  res
}
# reading with preprocessing
hdd_path_preprocess = tempfile()
txt2hdd(iris_path, hdd_path_preprocess,
    preprocessfun = fun, rowsPerChunk = 50)

base_hdd_preprocess = hdd(hdd_path_preprocess)
summary(base_hdd_preprocess)
#> Hard drive data of 4.11 KB. Made of 3 files.
#> Location: C:/Users/lrberge/AppData/Local/Temp/Rtmpa0wfuK/file56884cd168ba/
#> 121 lines, 3 variables.