Function to update the local data records by recursively reading the yaml files in the specified directory.

Usage

dataUpdate(
  dir,
  cachePath = "ReUseData",
  outMeta = FALSE,
  keepTags = TRUE,
  cleanup = FALSE,
  cloud = FALSE,
  remote = FALSE,
  checkData = TRUE,
  duplicate = FALSE
)

Arguments

dir

A character string for the directory where all data are saved. Data information will be collected recursively within this directory.

cachePath

A character string specifying the name for the BiocFileCache object to store all the curated data resources. Once specified, must match the cachePath argument in dataSearch. Default is "ReUseData".

outMeta

Logical. If TRUE, a "meta_data.csv" file will be generated in dir, containing information about all available datasets in the directory: the file path to each yaml file, and yaml entries including recipe parameter values, file paths to the datasets, notes, version (from getData(), if available), and the data generating date.

keepTags

Whether to keep previously assigned data tags. Default is TRUE.

cleanup

Whether to remove invalid intermediate files. Default is FALSE. If a data recipe (with the same parameter values) was evaluated multiple times, the same data file(s) will match multiple intermediate files (e.g., .yml). cleanup removes the older intermediate files and keeps only the most recent ones that match the data file. cleanup also removes any intermediate files that do not match any data file.

cloud

Whether to return the pregenerated data from the Google Cloud bucket of ReUseData. Default is FALSE.

remote

Whether to use the csv file (containing information about pregenerated data on Google Cloud) from GitHub, which is the most up-to-date. Only works when cloud = TRUE. Default is FALSE.

checkData

Check whether the data (listed under "# output: " in the yml file) exists. If not, the data is excluded from the output csv file. This argument was added for internal testing purposes. Default is TRUE.

duplicate

Whether to remove duplicates. If TRUE, older versions of duplicated data records are removed. Default is FALSE.

Value

A dataHub object containing information about the local data cache, e.g., data name, data path, etc.
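The returned dataHub object can be inspected with accessor functions, a sketch assuming data were previously generated with getData() under outdir (dataNames() and dataPaths() are assumed to be accessors exported by ReUseData):

library(ReUseData)
outdir <- file.path(tempdir(), "SharedData")  # assumed to hold getData() outputs
dh <- dataUpdate(dir = outdir)
dataNames(dh)   ## names of the cached datasets
dataPaths(dh)   ## file paths to the cached data files
dh[1]           ## dataHub objects can be subset like a vector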

Details

Users can directly retrieve information for all available datasets with meta_data(dir=), which returns a data frame in R with the same information as described above and can be saved out. dataUpdate performs extra checks on all datasets (by validating the file path in the "output" column), removes invalid entries (e.g., those with empty or non-existing file paths), and creates a data cache for all valid datasets.
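A minimal sketch of retrieving the metadata table directly without updating the cache (assuming outdir already contains data generated with getData()):

library(ReUseData)
outdir <- file.path(tempdir(), "SharedData")  # assumed getData() output directory
mt <- meta_data(dir = outdir)
head(mt)  ## yaml path, recipe parameters, output path, notes, date
## The table can be saved out manually, equivalent to outMeta = TRUE:
write.csv(mt, file.path(outdir, "meta_data.csv"), row.names = FALSE)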

Examples

## Generate data
if (FALSE) {
library(Rcwl)
outdir <- file.path(tempdir(), "SharedData")

echo_out <- recipeLoad("echo_out")
Rcwl::inputs(echo_out)
echo_out$input <- "Hello World!"
echo_out$outfile <- "outfile"
res <- getData(echo_out,
               outdir = outdir,
               notes = c("echo", "hello", "world", "txt"),
               showLog = TRUE)

ensembl_liftover <- recipeLoad("ensembl_liftover")
Rcwl::inputs(ensembl_liftover)
ensembl_liftover$species <- "human"
ensembl_liftover$from <- "GRCh37"
ensembl_liftover$to <- "GRCh38"
res <- getData(ensembl_liftover,
        outdir = outdir, 
        notes = c("ensembl", "liftover", "human", "GRCh37", "GRCh38"),
        showLog = TRUE)

## Update data cache (with or without prebuilt data sets from ReUseData cloud bucket)
dataUpdate(dir = outdir)
dataUpdate(dir = outdir, cloud = TRUE)

## newly generated data are now cached and searchable
dataSearch(c("hello", "world"))
dataSearch(c("ensembl", "liftover"))  ## both locally generated data and Google Cloud data
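
## Optionally write a "meta_data.csv" into `outdir` and remove stale
## intermediate files (a sketch; see the `outMeta` and `cleanup` arguments):
dataUpdate(dir = outdir, outMeta = TRUE, cleanup = TRUE)

## Remove older versions of duplicated data records:
dataUpdate(dir = outdir, duplicate = TRUE)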
}