vignettes/ReUseData_Rcwl_Workshop.Rmd
ReUseData_Rcwl_Workshop.Rmd
The workshop format is a 45 minute session consisting of hands-on demos, exercises and Q&A.
In this workshop, we will demonstrate three Bioconductor
packages: Rcwl
as an R interface for CWL
;
RcwlPipelines
with >200 pre-built bioinformatics tools
and best practice pipelines in R, that are easily usable and
highly customizable; and ReUseData
for the management of
reusable genomic data.
This workshop will implement variant calling workflows within R using
the Mutect2
from GATK. This whole workflow is based on R
programming language and can be deployed in local computer, HPC and
cloud computing platforms, using docker
,
singularity
or udocker
. [FIXME]
With these tools, we should be able to conduct reproducible data analysis using commonly used bioinformatics tools (including command-line based tools and R/Bioconductor packages) and validated, best practice workflows (based on workflow languages such as CWL) within a unified R programming environment.
For the somatic variant calling, we will need to prepare the following:
.bam
, .bam.bai
filesb37
or hg38
)Mutect2
to Call somatic SNVs and indels via
local assembly of haplotypes. ref
There will be three ways to do the analysis task.
Command-line interface. We can install the software and write shell scripts.
Workflow language. We can use workflow language (e.g., CWL, WDL, nextflow, snakemake) to streamline the workflow of data analysis, using containerized tools (no software installation, version tracking, etc.)
Rcwl/RcwlPipelines
. Write/use CWL-based data
analysis tools/workflows within R programming language
ReUseData
: Reproducible and reusable genomic data
management.
This workshop will implement variant calling and annotation workflows
within R using the Mutect2
from GATK. This whole
workflow is based on a unified R programming language and can
be deployed in local computer, HPC and cloud computing platforms, using
docker
, singularity
or udocker
.
[FIXME] With the pre-built tools in RcwlPipelines
, we
should be able to conduct reproducible data analysis using commonly used
bioinformatics tools (including command-line based tools and
R/Bioconductor packages) and validated, best practice workflows
(based on workflow languages such as CWL) within a unified R
programming environment.
RcwlPipelines
RcwlPipelines
includes more than 200 pre-built, commonly
used bioinformatics tools, such as BWA
for DNA read
alignment, STAR
for RNA read quantification,
Mutect2
for somatic variant calling. Here we use
Mutect2
to demonstrate the use of these tools to run
CWL-based data analysis workflow within R.
Below we will show 3 core functions: cwlUpdate
,
cwlSearch
and cwlLoad
to update, search and
load the needed tool or pipeline in R.
The cwlUpdate
function syncs the current
Rcwl
tool recipes in the local cache. It returns a
cwlHub
object that contains the most updated
Rcwl
recipes. User need to call this function for
first-time use or if they want to use a newly added tool/pipeline from
RcwlPipelines
.
library(RcwlPipelines)
library(Rcwl)
workdir <- "/tmp/Bioc2023_rcwl"
dir.create(workdir)
#> Warning in dir.create(workdir): '/tmp/Bioc2023_rcwl' already exists
Then the existing tool recipes can be queried using with
cwlSearch
with multiple keywords. Then it’s ready to be
loaded into R with cwlLoad
.
Multiple tool/recipes are available with Mutect2
. We
first show the pipeline which includes multiple tools for the tasks of
variant calling, QC, and filtering https://rcwl.org/RcwlRecipes/Mutect2PL.html, where
Mutect2
is part of the pipeline.
cwlUpdate()
#> Update scripts...
#> cwlHub with 232 records
#> cache path: ~/.cache/Rcwl
#> # last modified date: 2023-01-05
#> # cwlSearch() to query scripts
#> # cwlLoad('title') to load the script
#> # additional mcols(): rid, rpath, Type, Container, mtime, ...
#>
#> title
#> BFC178 | pl_alignMerge
#> BFC179 | pl_AnnPhaseVcf
#> BFC180 | pl_BaseRecal
#> BFC181 | pl_bwaAlign
#> BFC182 | pl_bwaMMRecal
#> ... ...
#> BFC405 | tl_samtools_collate
#> BFC406 | tl_samtools_fastq
#> BFC407 | tl_strip_sam
#> BFC408 | tl_TPMCalculator
#> BFC409 | tl_trimRRBSdiversity
#> Command
#> BFC178 bwaAlign+mergeBamDup
#> BFC179 VCFvep+dVCFcoverage+rVCFcoverage+VCFexpression+PhaseVcf
#> BFC180 BaseRecalibrator+ApplyBQSR+samtools_index+samtools_flagstat+samt...
#> BFC181 bwa+sam2bam+sortBam+idxBam
#> BFC182 bwaAlign+mergeBamDup+BaseRecal
#> ... ...
#> BFC405 <NA>
#> BFC406 <NA>
#> BFC407 <NA>
#> BFC408 <NA>
#> BFC409 <NA>
cwlSearch("mutect2")
#> cwlHub with 6 records
#> cache path: ~/.cache/Rcwl
#> # last modified date: 2022-08-16
#> # cwlSearch() to query scripts
#> # cwlLoad('title') to load the script
#> # additional mcols(): rid, rpath, Type, Container, mtime, ...
#>
#> title
#> BFC198 | pl_Mutect2PL
#> BFC203 | pl_SomaticCallers
#> BFC291 | tl_Mutect2_gatk3
#> BFC292 | tl_Mutect2
#> BFC369 | pl_SomaticCaller4
#> BFC377 | pl_SomaticCaller_mouse
#> Command
#> BFC198 Mutect2+GetPileupSummariesT+GetPileupSummariesN+CalculateContami...
#> BFC203 Mutect2PL+MuSE+bgzip+tabixIndex+mantaStrelka+SomaticSniper+VarDi...
#> BFC291 java -jar /usr/GenomeAnalysisTK.jar -T MuTect2
#> BFC292 gatk Mutect2
#> BFC369 Mutect2PL+MuSE+bgzip+tabixIndex+mantaStrelka+VarDict+combine
#> BFC377 Mutect2+MuSE+bgzip+tabixIndex+mantaStrelka+VarDict+combine
mutect2pl <- cwlLoad("pl_Mutect2PL")
#> Mutect2 loaded
#> GetPileupSummaries loaded
#> CalculateContamination loaded
#> LearnReadOrientationModel loaded
#> FilterMutectCalls loaded
#> bcfview loaded
plotCWL(mutect2pl)
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
#> `.name_repair` is omitted as of tibble 2.0.0.
#> ℹ Using compatibility `.name_repair`.
#> ℹ The deprecated feature was likely used in the DiagrammeR package.
#> Please report the issue at
#> <https://github.com/rich-iannone/DiagrammeR/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
In this workshop, we will use the simple tool of
tl_Mutect2
for variant calling only. We can customize the
tools by using a smaller docker image for demo purposes.
mutect2 <- cwlLoad("tl_Mutect2")
## the official GATK docker image
requirements(mutect2)[[1]]
#> $class
#> [1] "DockerRequirement"
#>
#> $dockerPull
#> [1] "broadinstitute/gatk:latest"
## change to smaller BioConda docker image
ds <- searchContainer("gatk4")
req1 <- requireDocker(ds[1, "container"])
requirements(mutect2)[[1]] <- req1
requirements(mutect2)[[1]]
#> $class
#> [1] "DockerRequirement"
#>
#> $dockerPull
#> [1] "quay.io/biocontainers/gatk4:4.4.0.0--py36hdfd78af_0"
We can use the Rcwl::inputs
function to check the
required input parameters for the tool. For Mutect2
,
multiple parameters are defined to map to the tool arguments. Major
inputs would be the BAM files containing reads (tbam
and
nbam
for tumor and normal separately), the reference
sequence file (Ref
), and the pon
for panel of
normals VCF files.
Some input parameters come with default values (such as
f1r2
, which can be changed easily. More details about the
tool arguments can be found here.
inputs(mutect2)
#> inputs:
#> tbam (File): -I
#> nbam (File?): -I
#> Ref (File): -R
#> normal (string?): -normal
#> germline (File?): --germline-resource
#> pon (File?): --panel-of-normals
#> interval (File?): -L
#> out (string): -O
#> f1r2 (string?): --f1r2-tar-gz f1r2.tar.gz
#> threads (int?): --native-pair-hmm-threads
We will assign values to the input parameters before evaluation.
For the experiment data, we have prepared a small DNA-seq dataset for demo purposes.
For the public genomic data resources that are needed to run
mutect2
for somatic variant calling: the reference sequence
file (Ref
) and PON VCF file (pon
), we can
assign the file path directly to the corresponding parameters if you
already have them, and evaluate the data recipe now with
runCWL
function.
However, as these genomic data files can be repeatedly used in many
different projects (e.g., DNA-seq data), we propose an
R/Bioconductor package ReUseData
to manage these
files so that they can be easily tracked for tool/data provenance to
safely reproduce and reuse.
mutect2$tbam <- system.file("extdata/tumor.bam", package="RcwlWorkshop")
mutect2$nbam <- system.file("extdata/normal.bam", package="RcwlWorkshop")
mutect2$normal <- "NA12892"
mutect2$out <- "somatic.vcf"
## prepare capture region
write.table(rbind(c(21, 10400000, 10500000)), file.path(workdir, "region.bed"),
row.names=FALSE, col.names = FALSE, quote=FALSE, sep="\t")
mutect2$interval <- file.path(workdir, "region.bed")
inputs(mutect2)
#> inputs:
#> tbam (File): -I /tmp/RtmpF8xa6f/temp_libpath362b397748e5d2/RcwlWorkshop/extdata/tumor.bam
#> nbam (File?): -I File /tmp/RtmpF8xa6f/temp_libpath362b397748e5d2/RcwlWorkshop/extdata/normal.bam
#> Ref (File): -R
#> normal (string?): -normal NA12892
#> germline (File?): --germline-resource
#> pon (File?): --panel-of-normals
#> interval (File?): -L File /tmp/Bioc2023_rcwl/region.bed
#> out (string): -O somatic.vcf
#> f1r2 (string?): --f1r2-tar-gz f1r2.tar.gz
#> threads (int?): --native-pair-hmm-threads
## mutect2$Ref <-
## mutect2$pon <-
## runCWL(mutect2, outdir = file.path(workdir, "output"), docker = "udocker", showLog = TRUE)
ReUseData
facilitates transformation of shell scripts
for data preprocessing (e.g., downloading, indexing) into workflow-based
data recipes. Evaluation of data recipes generate curated data files in
their generic formats (e.g., VCF, bed) with automatically generated meta
files, this will help track data and tool provenance for subsequent data
reuse.
There are some pre-built data recipes in ReUseData
that
are ready to be queried and used. Basically, we run the
recipeUpdate
function to synchronize the existing and newly
added recipes (by default the recipes included in ReUseData
package, can also be a user-specific GitHub repository with similar
settings).
The prebuilt data recipe scripts are included in the package, and are
physically residing in a dedicated GitHub
repository. These can serve as template scripts for recipe
construction in different situations. The most common case is that a
data recipe can manage multiple data resources with different input
parameters (species, versions, etc.). For example, the
gencode_transcripts
recipe download from GENCODE, unzip and
index the transcript fasta file for human or mouse with different
versions. A simple data downloading (using wget
) for a
specific file can be written as a data recipe without any input
parameter. For example, The pre-built recipes of
gcp_gatk_mutect2_b37
and gcp_gatk_mutect2_hg38
are for the downloading of genomic data resources from the GATK resource
bundle on Google
Cloud Bucket based on reference build of b37
and
hg38
separately. See more details here.
In this workshop, we will use the gcp_gatk_mutect2_b37
data recipe to download necessary genomic data files for the somatic
variant calling.
We will 4 core functions to update (recipeUpdate
),
search (recipeSearch
), load (recipeLoad
) the
data recipe, assign values to the input parameters
(filename
and idx
), and then evaluate the
recipe (getData
) to download the data to your local machine
with automatic tracking and user-specified notes/keywords.
The getData
function evaluates the data recipe and
generate the desired data output in a user-specified
outdir
. Arbitrary notes
can be added for the
dataset for easy query later.
Internally, the data recipe is submitted as a cwl task evaluting the shell script for data downloading/processing.
recipeUpdate()
#> Updating recipes...
#>
#> recipeHub with 17 records
#> cache path: /home/qian/.cache/R/ReUseDataRecipe
#> # recipeSearch() to query specific recipes using multipe keywords
#> # recipeUpdate() to update the local recipe cache
#>
#> name
#> BFC1906 | bowtie2_index
#> BFC1907 | echo_out
#> BFC1908 | ensembl_liftover
#> BFC1909 | gcp_broad_gatk_hg19
#> BFC1910 | gcp_broad_gatk_hg38
#> ... ...
#> BFC1918 | reference_genome
#> BFC1919 | salmon_index
#> BFC1920 | STAR_index
#> BFC1921 | ucsc_database
#> BFC1922 | ReUseDataRecipe-master
recipeSearch("mutect2")
#> recipeHub with 2 records
#> cache path: /home/qian/.cache/R/ReUseDataRecipe
#> # recipeSearch() to query specific recipes using multipe keywords
#> # recipeUpdate() to update the local recipe cache
#>
#> name
#> BFC1911 | gcp_gatk_mutect2_b37
#> BFC1912 | gcp_gatk_mutect2_hg38
rcp <- recipeLoad("gcp_gatk_mutect2_b37")
#> Note: you need to assign a name for the recipe: rcpName <- recipeLoad('xx')
#> Data recipe loaded!
#> Use inputs() to check required input parameters before evaluation.
#> Check here: https://rcwl.org/dataRecipes/gcp_gatk_mutect2_b37.html
#> for user instructions (e.g., eligible input values, data source, etc.)
## get panel of normal vcf
rcp$filename <- "Mutect2-exome-panel.vcf"
rcp$idx <- "idx"
getData(rcp, outdir = file.path(workdir, "shareData"),
notes = c("mutect2", "panel of normal"),
showLog = TRUE)
#> }
#> List of length 4
#> names(4): command output logs yml
## check output
list.files(file.path(workdir, "shareData"), pattern= c("panel", "normal"))
#> [1] "gcp_gatk_mutect2_b37_Mutect2-exome-panel.vcf_idx.cwl"
#> [2] "gcp_gatk_mutect2_b37_Mutect2-exome-panel.vcf_idx.md5"
#> [3] "gcp_gatk_mutect2_b37_Mutect2-exome-panel.vcf_idx.sh"
#> [4] "gcp_gatk_mutect2_b37_Mutect2-exome-panel.vcf_idx.yml"
#> [5] "Mutect2-exome-panel.vcf"
#> [6] "Mutect2-exome-panel.vcf.idx"
The output directory includes the downloaded genomic file dataset, as
well as some automatically generated meta files (ending with
md5
, cwl
, yml
and
sh
) by getData
, where the the meta information
for data recipe are recorded for subsequent reuse. More details about
these meta files, please see the For developers
section
below.
Similarly, we will download the reference genome files using the same data recipe:
## Reference genome files
rcp$filename <- "Homo_sapiens_assembly19.fasta"
rcp$idx = "fai"
getData(rcp, outdir = file.path(workdir, "shareData"),
notes = c("human", "reference genome", "hg19", "b37"),
showLog = TRUE)
#> }
#> List of length 4
#> names(4): command output logs yml
rcp$filename <- "Homo_sapiens_assembly19.dict"
rcp$idx <- ""
getData(rcp, outdir = file.path(workdir, "shareData"),
notes = c("human", "reference genome dict", "hg19", "b37"),
showLog = TRUE)
#> }
#> List of length 4
#> names(4): command output logs yml
## check output
list.files(file.path(workdir, "shareData"), pattern = c("assembly19"))
#> [1] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.dict_.cwl"
#> [2] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.dict_.md5"
#> [3] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.dict_.sh"
#> [4] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.dict_.yml"
#> [5] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.fasta_fai.cwl"
#> [6] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.fasta_fai.md5"
#> [7] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.fasta_fai.sh"
#> [8] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.fasta_fai.yml"
#> [9] "Homo_sapiens_assembly19.dict"
#> [10] "Homo_sapiens_assembly19.fasta"
#> [11] "Homo_sapiens_assembly19.fasta.fai"
Now we have all the data needed. Here we show some utility functions to manage, query, and use the data.
There are 2 core functions to manage the data locally for easy query and use.
Then we use dataUpdate
to cache the data in specified
data directory dir
. Then the data can be easily tracked and
queried using dataSearch
function. Meta information about
the data, such as file paths, can be easily retrieved for later use.
dataUpdate
: Cache the data in specified data directory
dir
(works recursively). Need to call for first-time use or
if any new datasets are generated locally with
getData
.dataSearch
: Query of cached datasets with (multiple)
keyword(s) (can be keywords from the file name or the user-specified
notes
.
dataUpdate(dir = workdir, keepTags=FALSE, cleanup = TRUE)
#>
#> Updating data record...
#> normal.bam added
#> normal.bam.bai added
#> tumor.bam added
#> tumor.bam.bai added
#> Homo_sapiens_assembly19.dict added
#> Homo_sapiens_assembly19.fasta added
#> Homo_sapiens_assembly19.fasta.fai added
#> Mutect2-exome-panel.vcf added
#> Mutect2-exome-panel.vcf.idx added
#> dataHub with 9 records
#> cache path: /home/qian/.cache/R/ReUseData
#> # dataUpdate() to update the local data cache
#> # dataSearch() to query a specific dataset
#> # Additional information can be retrieved using:
#> # dataNames(), dataParams(), dataNotes(), dataPaths(), dataTag() or mcols()
#>
#> name
#> BFC32429 | normal.bam
#> BFC32430 | normal.bam.bai
#> BFC32431 | tumor.bam
#> BFC32432 | tumor.bam.bai
#> BFC32433 | Homo_sapiens_assembly19.dict
#> BFC32434 | Homo_sapiens_assembly19.fasta
#> BFC32435 | Homo_sapiens_assembly19.fasta.fai
#> BFC32436 | Mutect2-exome-panel.vcf
#> BFC32437 | Mutect2-exome-panel.vcf.idx
#> Path
#> BFC32429 /tmp/Bioc2023_rcwl/data/normal.bam
#> BFC32430 /tmp/Bioc2023_rcwl/data/normal.bam.bai
#> BFC32431 /tmp/Bioc2023_rcwl/data/tumor.bam
#> BFC32432 /tmp/Bioc2023_rcwl/data/tumor.bam.bai
#> BFC32433 /tmp/Bioc2023_rcwl/shareData/Homo_sapiens_assembly19.dict
#> BFC32434 /tmp/Bioc2023_rcwl/shareData/Homo_sapiens_assembly19.fasta
#> BFC32435 /tmp/Bioc2023_rcwl/shareData/Homo_sapiens_assembly19.fasta.fai
#> BFC32436 /tmp/Bioc2023_rcwl/shareData/Mutect2-exome-panel.vcf
#> BFC32437 /tmp/Bioc2023_rcwl/shareData/Mutect2-exome-panel.vcf.idx
rs1 <- dataSearch("mutect2")
rs2 <- dataSearch("reference")
Besides the notes
added in getData
, users
can further tag the data. For example, if a set of datasets can be used
as inputs for a specific software tool, we’ll tag these datasets with
the software name, so that they can be retrieved easily with that
keyword in dataSearch
.
Here we’ll tag the reference genome data with mutect2
,
and use some utility functions to retrieve specific information about
the data. Data path can be retrieved using dataPaths
function, which will be passed directly to the Rcwl
tool
recipe for reproducible data analysis in R.
dataTags(rs2[2]) <- "mutect2"
rs <- dataSearch("mutect2")
rs
#> dataHub with 3 records
#> cache path: /home/qian/.cache/R/ReUseData
#> # dataUpdate() to update the local data cache
#> # dataSearch() to query a specific dataset
#> # Additional information can be retrieved using:
#> # dataNames(), dataParams(), dataNotes(), dataPaths(), dataTag() or mcols()
#>
#> name
#> BFC32434 | Homo_sapiens_assembly19.fasta
#> BFC32436 | Mutect2-exome-panel.vcf
#> BFC32437 | Mutect2-exome-panel.vcf.idx
#> Path
#> BFC32434 /tmp/Bioc2023_rcwl/shareData/Homo_sapiens_assembly19.fasta
#> BFC32436 /tmp/Bioc2023_rcwl/shareData/Mutect2-exome-panel.vcf
#> BFC32437 /tmp/Bioc2023_rcwl/shareData/Mutect2-exome-panel.vcf.idx
ref <- dataPaths(rs)[1]
pon <- dataPaths(rs[2])
## try: dataNames, dataYml, dataNotes, dataParams, etc.
Coming back to the RcwlPipelines
tool recipe, here we
will assign values to the tool for the reusable genomic data resources
that are managed by ReUseData
, and evaluate the tool
(internally submit the CWL task) using runCWL
for
reproducible data analysis in R.
## check inputs
inputs(mutect2)
#> inputs:
#> tbam (File): -I /tmp/RtmpF8xa6f/temp_libpath362b397748e5d2/RcwlWorkshop/extdata/tumor.bam
#> nbam (File?): -I File /tmp/RtmpF8xa6f/temp_libpath362b397748e5d2/RcwlWorkshop/extdata/normal.bam
#> Ref (File): -R
#> normal (string?): -normal NA12892
#> germline (File?): --germline-resource
#> pon (File?): --panel-of-normals
#> interval (File?): -L File /tmp/Bioc2023_rcwl/region.bed
#> out (string): -O somatic.vcf
#> f1r2 (string?): --f1r2-tar-gz f1r2.tar.gz
#> threads (int?): --native-pair-hmm-threads
## assign inputs of reusable genomic data
mutect2$Ref <- ref
mutect2$pon <- pon
inputs(mutect2)
#> inputs:
#> tbam (File): -I /tmp/RtmpF8xa6f/temp_libpath362b397748e5d2/RcwlWorkshop/extdata/tumor.bam
#> nbam (File?): -I File /tmp/RtmpF8xa6f/temp_libpath362b397748e5d2/RcwlWorkshop/extdata/normal.bam
#> Ref (File): -R /tmp/Bioc2023_rcwl/shareData/Homo_sapiens_assembly19.fasta
#> normal (string?): -normal NA12892
#> germline (File?): --germline-resource
#> pon (File?): --panel-of-normals File /tmp/Bioc2023_rcwl/shareData/Mutect2-exome-panel.vcf
#> interval (File?): -L File /tmp/Bioc2023_rcwl/region.bed
#> out (string): -O somatic.vcf
#> f1r2 (string?): --f1r2-tar-gz f1r2.tar.gz
#> threads (int?): --native-pair-hmm-threads
runCWL(mutect2, outdir = file.path(workdir, "output"),
docker = "udocker", showLog = TRUE)
#> }
#> List of length 3
#> names(3): command output logs
The results should be successfully generated in the user-specified
output directory. The somatic.vcf
contains all the somatic
variants that are called from the input bam files using the
bioinformatics software tool Mutect2
.
## checkout results
list.files(file.path(workdir, "output"))
#> [1] "f1r2.tar.gz" "somatic.vcf" "somatic.vcf.idx"
#> [4] "somatic.vcf.stats"
The workflow/tool result files can be conveniently passed to the downstream R/Bioconductor packages for a unified R programming environment.
library(VariantAnnotation)
fl <- file.path(workdir, "output/somatic.vcf")
readVcf(fl)
#> class: CollapsedVCF
#> dim: 17 2
#> rowRanges(vcf):
#> GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
#> info(vcf):
#> DataFrame with 24 columns: AS_SB_TABLE, AS_UNIQ_ALT_READ_COUNT, CONTQ, DP,...
#> info(header(vcf)):
#> Number Type Description
#> AS_SB_TABLE 1 String Allele-specific forward/reverse rea...
#> AS_UNIQ_ALT_READ_COUNT A Integer Number of reads with unique start a...
#> CONTQ 1 Float Phred-scaled qualities that alt all...
#> DP 1 Integer Approximate read depth; some reads ...
#> ECNT 1 Integer Number of events in this haplotype
#> GERMQ 1 Integer Phred-scaled quality that alt allel...
#> MBQ R Integer median base quality by allele
#> MFRL R Integer median fragment length by allele
#> MMQ R Integer median mapping quality by allele
#> MPOS A Integer median distance from end of read
#> NALOD A Float Negative log 10 odds of artifact in...
#> NCount 1 Integer Count of N bases in the pileup
#> NLOD A Float Normal log 10 likelihood ratio of d...
#> OCM 1 Integer Number of alt reads whose original ...
#> PON 0 Flag site found in panel of normals
#> POPAF A Float negative log 10 population allele f...
#> ROQ 1 Float Phred-scaled qualities that alt all...
#> RPA R Integer Number of times tandem repeat unit ...
#> RU 1 String Tandem repeat unit (bases)
#> SEQQ 1 Integer Phred-scaled quality that alt allel...
#> STR 0 Flag Variant is a short tandem repeat
#> STRANDQ 1 Integer Phred-scaled quality of strand bias...
#> STRQ 1 Integer Phred-scaled quality that alt allel...
#> TLOD A Float Log 10 likelihood ratio score of va...
#> geno(vcf):
#> List of length 13: GT, AD, AF, DP, F1R2, F2R1, FAD, GQ, PGT, PID, ...
#> geno(header(vcf)):
#> Number Type Description
#> GT 1 String Genotype
#> AD R Integer Allelic depths for the ref and alt alleles in the ord...
#> AF A Float Allele fractions of alternate alleles in the tumor
#> DP 1 Integer Approximate read depth (reads with MQ=255 or with bad...
#> F1R2 R Integer Count of reads in F1R2 pair orientation supporting ea...
#> F2R1 R Integer Count of reads in F2R1 pair orientation supporting ea...
#> FAD R Integer Count of fragments supporting each allele.
#> GQ 1 Integer Genotype Quality
#> PGT 1 String Physical phasing haplotype information, describing ho...
#> PID 1 String Physical phasing ID information, where each unique ID...
#> PL G Integer Normalized, Phred-scaled likelihoods for genotypes as...
#> PS 1 Integer Phasing set (typically the position of the first vari...
#> SB 4 Integer Per-sample component statistics which comprise the Fi...
ReUseData
, as the name suggests, commits to promoting
the data reuse. Data (rom dataSearch
) can be prepared in
standard input formats (toList
), e.g., YAML and JSON, to be
easily integrated in workflow methods that are locally or
cloud-hosted.
Edits may need for the “key” of data value for different workflow scripts.
dataUpdate(dir = workdir)
#>
#> Updating data record...
#> normal.bam added
#> normal.bam.bai added
#> tumor.bam added
#> tumor.bam.bai added
#> Homo_sapiens_assembly19.dict added
#> Homo_sapiens_assembly19.fasta added
#> Homo_sapiens_assembly19.fasta.fai added
#> Mutect2-exome-panel.vcf added
#> Mutect2-exome-panel.vcf.idx added
#> dataHub with 9 records
#> cache path: /home/qian/.cache/R/ReUseData
#> # dataUpdate() to update the local data cache
#> # dataSearch() to query a specific dataset
#> # Additional information can be retrieved using:
#> # dataNames(), dataParams(), dataNotes(), dataPaths(), dataTag() or mcols()
#>
#> name
#> BFC32438 | normal.bam
#> BFC32439 | normal.bam.bai
#> BFC32440 | tumor.bam
#> BFC32441 | tumor.bam.bai
#> BFC32442 | Homo_sapiens_assembly19.dict
#> BFC32443 | Homo_sapiens_assembly19.fasta
#> BFC32444 | Homo_sapiens_assembly19.fasta.fai
#> BFC32445 | Mutect2-exome-panel.vcf
#> BFC32446 | Mutect2-exome-panel.vcf.idx
#> Path
#> BFC32438 /tmp/Bioc2023_rcwl/data/normal.bam
#> BFC32439 /tmp/Bioc2023_rcwl/data/normal.bam.bai
#> BFC32440 /tmp/Bioc2023_rcwl/data/tumor.bam
#> BFC32441 /tmp/Bioc2023_rcwl/data/tumor.bam.bai
#> BFC32442 /tmp/Bioc2023_rcwl/shareData/Homo_sapiens_assembly19.dict
#> BFC32443 /tmp/Bioc2023_rcwl/shareData/Homo_sapiens_assembly19.fasta
#> BFC32444 /tmp/Bioc2023_rcwl/shareData/Homo_sapiens_assembly19.fasta.fai
#> BFC32445 /tmp/Bioc2023_rcwl/shareData/Mutect2-exome-panel.vcf
#> BFC32446 /tmp/Bioc2023_rcwl/shareData/Mutect2-exome-panel.vcf.idx
ds_mutect2 <- dataSearch("mutect2")
toList(ds_mutect2, format = "json", file=file.path(workdir, "data.json"))
#> File is saved as: "/tmp/Bioc2023_rcwl/data.json"
#> {
#> "Homo_sapiens_assembly19.fasta": "/tmp/Bioc2023_rcwl/shareData/Homo_sapiens_assembly19.fasta",
#> "Mutect2-exome-panel.vcf": "/tmp/Bioc2023_rcwl/shareData/Mutect2-exome-panel.vcf",
#> "Mutect2-exome-panel.vcf.idx": "/tmp/Bioc2023_rcwl/shareData/Mutect2-exome-panel.vcf.idx"
#> }
By evaluating a ReUseData
data recipe with
getData
, some meta files for the data will be automatically
generated for data and tool provenance.
By default, the meta file names for these meta files are paste of
recipe name, and input parameter values separated by _
,
which can be changed by the prefix
argument in
getData
function if needed.
list.files(file.path(workdir, "shareData"), pattern = "cp_gatk_mutect2_b37") ## meta files
#> [1] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.dict_.cwl"
#> [2] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.dict_.md5"
#> [3] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.dict_.sh"
#> [4] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.dict_.yml"
#> [5] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.fasta_fai.cwl"
#> [6] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.fasta_fai.md5"
#> [7] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.fasta_fai.sh"
#> [8] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.fasta_fai.yml"
#> [9] "gcp_gatk_mutect2_b37_Mutect2-exome-panel.vcf_idx.cwl"
#> [10] "gcp_gatk_mutect2_b37_Mutect2-exome-panel.vcf_idx.md5"
#> [11] "gcp_gatk_mutect2_b37_Mutect2-exome-panel.vcf_idx.sh"
#> [12] "gcp_gatk_mutect2_b37_Mutect2-exome-panel.vcf_idx.yml"
[recipeName_params].cwl
: Rcwl object defined in
Rcwl
for tool or data recipe.[recipeName_params].yml
: File containing values for
input parameters for both tool and data recipes through recipe
evaluation functions (e.g., runCWL
, getData
).
For data recipes, it includes additional meta information for the output
files, notes, and date, etc for data tracking purposes, which are added
by getData
function.[recipeName_params].sh
: The command lines in a shell
script.[recipeName_params].md5
: unique identifier for each
dataset.Here we use a simple example to show how to create a data recipe. This recipe will do a simple task of just downloading specific files from fixed web source without any input parameter.
## library(ReUseData)
The recipeMake
function will wrap the command line
script for data downloading/processings into an executable data recipe
in R. We will specify the inputs and outputs of the data recipe and use
outputGlob
to specify the output pattern (for internal
check).
script <- '
wget https://github.com/hubentu/somatic-snv-test-data/raw/master/tumor.bam
wget https://github.com/hubentu/somatic-snv-test-data/raw/master/tumor.bam.bai
wget https://github.com/hubentu/somatic-snv-test-data/raw/master/normal.bam
wget https://github.com/hubentu/somatic-snv-test-data/raw/master/normal.bam.bai
'
rcp_expdata <- recipeMake(shscript = script,
outputID = "bams",
outputGlob = "*.bam*")
We need to assign values to the input parameters if they exist (data
inputs for the shell script). Then the data recipe is ready for the
evaluation using getData
function (see the
ReUseData
section [FIXME] above).
Roswell Park Comprehensive Cancer Center, Qian.Liu@RoswellPark.org↩︎
Roswell Park Comprehensive Cancer Center↩︎