Function to update the local data records by recursively reading the yaml files in the specified directory.
Usage
dataUpdate(
dir,
cachePath = "ReUseData",
outMeta = FALSE,
keepTags = TRUE,
cleanup = FALSE,
cloud = FALSE,
remote = FALSE,
checkData = TRUE,
duplicate = FALSE
)
Arguments
- dir
A character string for the directory where all data are saved. Data information will be collected recursively within this directory.
- cachePath
A character string specifying the name for the
BiocFileCache
object to store all the curated data resources. Once specified, it must match the cachePath
argument in dataSearch.
Default is "ReUseData".
- outMeta
Logical. If TRUE, a "meta_data.csv" file will be generated in the
dir
, containing information about all available datasets in the directory: the file path to the yaml files, and yaml entries including parameter values for the data recipe, file paths to datasets, notes, version (from getData()
), if available, and the data-generating date.
- keepTags
Whether to keep previously assigned data tags. Default is TRUE.
- cleanup
Whether to remove any invalid intermediate files. Default is FALSE. In cases where one data recipe (with the same parameter values) was evaluated multiple times, the same data file(s) will match multiple intermediate files (e.g., .yml).
cleanup
will remove the older intermediate files and keep only the most recent ones that match the data file. When any intermediate files don't match a data file, cleanup
will also remove those.
- cloud
Whether to return the pregenerated data from the ReUseData Google Cloud bucket. Default is FALSE.
- remote
Whether to use the csv file (containing information about pregenerated data on Google Cloud) from GitHub, which is most up-to-date. Only works when
cloud = TRUE
. Default is FALSE.
- checkData
Whether to check that the data (listed as "# output: " in the yml file) exists. If it does not, the dataset is not included in the output csv file. This argument was added for internal testing purposes.
- duplicate
Whether to remove duplicates. If TRUE, older versions of duplicates will be removed.
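For instance, a routine cache refresh that removes stale intermediate files and older duplicate entries could be sketched as follows (assuming outdir is a directory already populated by getData(), as in the Examples section):

```r
## Not run: requires a directory of ReUseData outputs
library(ReUseData)
## Refresh the cache, dropping outdated .yml intermediates
## and older versions of duplicated datasets
dh <- dataUpdate(dir = outdir, cleanup = TRUE, duplicate = TRUE)
dh
```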
Value
a dataHub
object containing information about the local
data cache, e.g., data name, data path, etc.
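The returned object can be inspected with the dataHub accessor functions; a minimal sketch, assuming the accessors dataNames(), dataPaths(), and dataTags() exported by ReUseData:

```r
## Not run: requires a populated data directory
library(ReUseData)
dh <- dataUpdate(dir = outdir)
dataNames(dh)   ## dataset names in the local cache
dataPaths(dh)   ## file paths to the cached datasets
dataTags(dh)    ## any previously assigned data tags
```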
Details
Users can directly retrieve information for all available
datasets by using meta_data(dir=)
, which generates a data
frame in R with the same information as described above, which can be
saved out. dataUpdate
does extra checks for all datasets
(checking the file path in the "output" column), removes invalid
ones, e.g., those with an empty or non-existing file path, and creates a data cache
for all valid datasets.
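The metadata-only retrieval described above can be sketched as follows (assuming outdir holds data generated by getData(); meta_data() returns a data frame that can be saved out with base R tools):

```r
## Not run: requires a directory of ReUseData outputs
library(ReUseData)
## Retrieve dataset metadata as a data frame, without updating the cache
mt <- meta_data(dir = outdir)
head(mt)
## Save the metadata table out for later inspection
write.csv(mt, file.path(outdir, "meta_data.csv"), row.names = FALSE)
```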
Examples
## Generate data
if (FALSE) {
library(Rcwl)
outdir <- file.path(tempdir(), "SharedData")
echo_out <- recipeLoad("echo_out")
Rcwl::inputs(echo_out)
echo_out$input <- "Hello World!"
echo_out$outfile <- "outfile"
res <- getData(echo_out,
outdir = outdir,
notes = c("echo", "hello", "world", "txt"),
showLog = TRUE)
ensembl_liftover <- recipeLoad("ensembl_liftover")
Rcwl::inputs(ensembl_liftover)
ensembl_liftover$species <- "human"
ensembl_liftover$from <- "GRCh37"
ensembl_liftover$to <- "GRCh38"
res <- getData(ensembl_liftover,
outdir = outdir,
notes = c("ensembl", "liftover", "human", "GRCh37", "GRCh38"),
showLog = TRUE)
## Update data cache (with or without prebuilt data sets from ReUseData cloud bucket)
dataUpdate(dir = outdir)
dataUpdate(dir = outdir, cloud = TRUE)
## newly generated data are now cached and searchable
dataSearch(c("hello", "world"))
dataSearch(c("ensembl", "liftover")) ## both locally generated data and google cloud data!
}