Overview

Pre-requisites

  • Basic familiarity with DNA-seq data variant calling
  • Interest of using workflow language

Workshop Participation

The workshop format is a 45 minute session consisting of hands-on demos, exercises and Q&A.

R / Bioconductor packages used

  • RcwlPipelines
  • Rcwl
  • ReUseData
  • VariantAnnotation

Description

In this workshop, we will demonstrate three Bioconductor packages: Rcwl as an R interface for CWL; RcwlPipelines with >200 pre-built bioinformatics tools and best practice pipelines in R, that are easily usable and highly customizable; and ReUseData for the management of reusable genomic data.

This workshop will implement variant calling workflows within R using the Mutect2 from GATK. This whole workflow is based on R programming language and can be deployed in local computer, HPC and cloud computing platforms, using docker, singularity or udocker. [FIXME]

With these tools, we should be able to conduct reproducible data analysis using commonly used bioinformatics tools (including command-line based tools and R/Bioconductor packages) and validated, best practice workflows (based on workflow languages such as CWL) within a unified R programming environment.

Workshop: Somatic variant calling

For the somatic variant calling, we will need to prepare the following:

  • Experiment data
    • In the format of .bam, .bam.bai files
  • ReUsable Genomic data
    • reference sequence file (b37 or hg38)
    • Panel of Normals (PON) ref
  • Software tool:
    • Here we use Mutect2to Call somatic SNVs and indels via local assembly of haplotypes. ref

There will be three ways to do the analysis task.

  1. Command-line interface. We can install the software and write shell scripts.

  2. Workflow language. We can use workflow language (e.g., CWL, WDL, nextflow, snakemake) to streamline the workflow of data analysis, using containerized tools (no software installation, version tracking, etc.)

    • More efficient (especially if multiple tools are included)
    • REPRODUCIBLE!
    • Need to install a workflow runner (e.g., cwltool, arvados-cwl-runner)
    • Learning curve.
  3. Rcwl/RcwlPipelines. Write/use CWL-based data analysis tools/workflows within R programming language

    • All benefits of workflow language
    • Hands-on experience
    • Easy connection with downstream analysis tools in R/Bioconductor.

    ReUseData: Reproducible and reusable genomic data management.

    • Public genomic data resources can be easily managed and reused in different projects.

This workshop will implement variant calling and annotation workflows within R using the Mutect2 from GATK. This whole workflow is based on a unified R programming language and can be deployed in local computer, HPC and cloud computing platforms, using docker, singularity or udocker. [FIXME] With the pre-built tools in RcwlPipelines, we should be able to conduct reproducible data analysis using commonly used bioinformatics tools (including command-line based tools and R/Bioconductor packages) and validated, best practice workflows (based on workflow languages such as CWL) within a unified R programming environment.

Prepare the software tools in R using RcwlPipelines

RcwlPipelines includes more than 200 pre-built, commonly used bioinformatics tools, such as BWA for DNA read alignment, STAR for RNA read quantification, Mutect2 for somatic variant calling. Here we use Mutect2 to demonstrate the use of these tools to run CWL-based data analysis workflow within R.

Below we will show 3 core functions: cwlUpdate, cwlSearch and cwlLoad to update, search and load the needed tool or pipeline in R.

Load tool or pipeline

The cwlUpdate function syncs the current Rcwl tool recipes in the local cache. It returns a cwlHub object that contains the most updated Rcwl recipes. User need to call this function for first-time use or if they want to use a newly added tool/pipeline from RcwlPipelines.

library(RcwlPipelines)
library(Rcwl)
workdir <- "/tmp/Bioc2023_rcwl"
dir.create(workdir)
#> Warning in dir.create(workdir): '/tmp/Bioc2023_rcwl' already exists

Then the existing tool recipes can be queried using with cwlSearch with multiple keywords. Then it’s ready to be loaded into R with cwlLoad.

Multiple tool/recipes are available with Mutect2. We first show the pipeline which includes multiple tools for the tasks of variant calling, QC, and filtering https://rcwl.org/RcwlRecipes/Mutect2PL.html, where Mutect2 is part of the pipeline.

cwlUpdate()
#> Update scripts...
#> cwlHub with 232 records
#> cache path:  ~/.cache/Rcwl 
#> # last modified date:  2023-01-05 
#> # cwlSearch() to query scripts
#> # cwlLoad('title') to load the script
#> # additional mcols(): rid, rpath, Type, Container, mtime, ...
#> 
#>            title               
#>   BFC178 | pl_alignMerge       
#>   BFC179 | pl_AnnPhaseVcf      
#>   BFC180 | pl_BaseRecal        
#>   BFC181 | pl_bwaAlign         
#>   BFC182 | pl_bwaMMRecal       
#>   ...      ...                 
#>   BFC405 | tl_samtools_collate 
#>   BFC406 | tl_samtools_fastq   
#>   BFC407 | tl_strip_sam        
#>   BFC408 | tl_TPMCalculator    
#>   BFC409 | tl_trimRRBSdiversity
#>          Command                                                            
#>   BFC178 bwaAlign+mergeBamDup                                               
#>   BFC179 VCFvep+dVCFcoverage+rVCFcoverage+VCFexpression+PhaseVcf            
#>   BFC180 BaseRecalibrator+ApplyBQSR+samtools_index+samtools_flagstat+samt...
#>   BFC181 bwa+sam2bam+sortBam+idxBam                                         
#>   BFC182 bwaAlign+mergeBamDup+BaseRecal                                     
#>   ...    ...                                                                
#>   BFC405 <NA>                                                               
#>   BFC406 <NA>                                                               
#>   BFC407 <NA>                                                               
#>   BFC408 <NA>                                                               
#>   BFC409 <NA>
cwlSearch("mutect2")
#> cwlHub with 6 records
#> cache path:  ~/.cache/Rcwl 
#> # last modified date:  2022-08-16 
#> # cwlSearch() to query scripts
#> # cwlLoad('title') to load the script
#> # additional mcols(): rid, rpath, Type, Container, mtime, ...
#> 
#>            title                 
#>   BFC198 | pl_Mutect2PL          
#>   BFC203 | pl_SomaticCallers     
#>   BFC291 | tl_Mutect2_gatk3      
#>   BFC292 | tl_Mutect2            
#>   BFC369 | pl_SomaticCaller4     
#>   BFC377 | pl_SomaticCaller_mouse
#>          Command                                                            
#>   BFC198 Mutect2+GetPileupSummariesT+GetPileupSummariesN+CalculateContami...
#>   BFC203 Mutect2PL+MuSE+bgzip+tabixIndex+mantaStrelka+SomaticSniper+VarDi...
#>   BFC291 java -jar /usr/GenomeAnalysisTK.jar -T MuTect2                     
#>   BFC292 gatk Mutect2                                                       
#>   BFC369 Mutect2PL+MuSE+bgzip+tabixIndex+mantaStrelka+VarDict+combine       
#>   BFC377 Mutect2+MuSE+bgzip+tabixIndex+mantaStrelka+VarDict+combine
mutect2pl <- cwlLoad("pl_Mutect2PL")
#> Mutect2 loaded
#> GetPileupSummaries loaded
#> CalculateContamination loaded
#> LearnReadOrientationModel loaded
#> FilterMutectCalls loaded
#> bcfview loaded
plotCWL(mutect2pl)
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
#> `.name_repair` is omitted as of tibble 2.0.0.
#>  Using compatibility `.name_repair`.
#>  The deprecated feature was likely used in the DiagrammeR package.
#>   Please report the issue at
#>   <https://github.com/rich-iannone/DiagrammeR/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

In this workshop, we will use the simple tool of tl_Mutect2 for variant calling only. We can customize the tools by using a smaller docker image for demo purposes.

mutect2 <- cwlLoad("tl_Mutect2")
## the official GATK docker image
requirements(mutect2)[[1]]  
#> $class
#> [1] "DockerRequirement"
#> 
#> $dockerPull
#> [1] "broadinstitute/gatk:latest"
## change to smaller BioConda docker image
ds <- searchContainer("gatk4")
req1 <- requireDocker(ds[1, "container"])
requirements(mutect2)[[1]] <- req1
requirements(mutect2)[[1]]
#> $class
#> [1] "DockerRequirement"
#> 
#> $dockerPull
#> [1] "quay.io/biocontainers/gatk4:4.4.0.0--py36hdfd78af_0"

Check data inputs

We can use the Rcwl::inputs function to check the required input parameters for the tool. For Mutect2, multiple parameters are defined to map to the tool arguments. Major inputs would be the BAM files containing reads (tbam and nbam for tumor and normal separately), the reference sequence file (Ref), and the pon for panel of normals VCF files.

Some input parameters come with default values (such as f1r2, which can be changed easily. More details about the tool arguments can be found here.

inputs(mutect2)
#> inputs:
#>   tbam (File): -I 
#>   nbam (File?): -I 
#>   Ref (File): -R 
#>   normal (string?): -normal 
#>   germline (File?): --germline-resource 
#>   pon (File?): --panel-of-normals 
#>   interval (File?): -L 
#>   out (string): -O 
#>   f1r2 (string?): --f1r2-tar-gz f1r2.tar.gz
#>   threads (int?): --native-pair-hmm-threads

We will assign values to the input parameters before evaluation.

For the experiment data, we have prepared a small DNA-seq dataset for demo purposes.

For the public genomic data resources that are needed to run mutect2 for somatic variant calling: the reference sequence file (Ref) and PON VCF file (pon), we can assign the file path directly to the corresponding parameters if you already have them, and evaluate the data recipe now with runCWL function.

However, as these genomic data files can be repeatedly used in many different projects (e.g., DNA-seq data), we propose an R/Bioconductor package ReUseData to manage these files so that they can be easily tracked for tool/data provenance to safely reproduce and reuse.

mutect2$tbam <- system.file("extdata/tumor.bam", package="RcwlWorkshop")
mutect2$nbam <- system.file("extdata/normal.bam", package="RcwlWorkshop")
mutect2$normal <- "NA12892"
mutect2$out <- "somatic.vcf"
## prepare capture region
write.table(rbind(c(21, 10400000, 10500000)), file.path(workdir, "region.bed"),
            row.names=FALSE, col.names = FALSE, quote=FALSE, sep="\t")
mutect2$interval <- file.path(workdir, "region.bed")
inputs(mutect2)
#> inputs:
#>   tbam (File): -I /tmp/RtmpF8xa6f/temp_libpath362b397748e5d2/RcwlWorkshop/extdata/tumor.bam
#>   nbam (File?): -I File /tmp/RtmpF8xa6f/temp_libpath362b397748e5d2/RcwlWorkshop/extdata/normal.bam
#>   Ref (File): -R 
#>   normal (string?): -normal NA12892
#>   germline (File?): --germline-resource 
#>   pon (File?): --panel-of-normals 
#>   interval (File?): -L File /tmp/Bioc2023_rcwl/region.bed
#>   out (string): -O somatic.vcf
#>   f1r2 (string?): --f1r2-tar-gz f1r2.tar.gz
#>   threads (int?): --native-pair-hmm-threads
## mutect2$Ref <-
## mutect2$pon <-
## runCWL(mutect2, outdir = file.path(workdir, "output"), docker = "udocker", showLog = TRUE)

ReUseData: Reusable and Reproducible Management of Genomic Data Resources

ReUseData facilitates transformation of shell scripts for data preprocessing (e.g., downloading, indexing) into workflow-based data recipes. Evaluation of data recipes generate curated data files in their generic formats (e.g., VCF, bed) with automatically generated meta files, this will help track data and tool provenance for subsequent data reuse.

There are some pre-built data recipes in ReUseData that are ready to be queried and used. Basically, we run the recipeUpdate function to synchronize the existing and newly added recipes (by default the recipes included in ReUseData package, can also be a user-specific GitHub repository with similar settings).

The prebuilt data recipe scripts are included in the package, and are physically residing in a dedicated GitHub repository. These can serve as template scripts for recipe construction in different situations. The most common case is that a data recipe can manage multiple data resources with different input parameters (species, versions, etc.). For example, the gencode_transcripts recipe download from GENCODE, unzip and index the transcript fasta file for human or mouse with different versions. A simple data downloading (using wget) for a specific file can be written as a data recipe without any input parameter. For example, The pre-built recipes of gcp_gatk_mutect2_b37 and gcp_gatk_mutect2_hg38 are for the downloading of genomic data resources from the GATK resource bundle on Google Cloud Bucket based on reference build of b37 and hg38 separately. See more details here.

In this workshop, we will use the gcp_gatk_mutect2_b37 data recipe to download necessary genomic data files for the somatic variant calling.

Data recipes

We will 4 core functions to update (recipeUpdate), search (recipeSearch), load (recipeLoad) the data recipe, assign values to the input parameters (filename and idx), and then evaluate the recipe (getData) to download the data to your local machine with automatic tracking and user-specified notes/keywords.

The getData function evaluates the data recipe and generate the desired data output in a user-specified outdir. Arbitrary notes can be added for the dataset for easy query later.

Internally, the data recipe is submitted as a cwl task evaluting the shell script for data downloading/processing.

recipeUpdate()
#> Updating recipes...
#> 
#> recipeHub with 17 records
#> cache path:  /home/qian/.cache/R/ReUseDataRecipe 
#> # recipeSearch() to query specific recipes using multipe keywords
#> # recipeUpdate() to update the local recipe cache
#> 
#>             name                  
#>   BFC1906 | bowtie2_index         
#>   BFC1907 | echo_out              
#>   BFC1908 | ensembl_liftover      
#>   BFC1909 | gcp_broad_gatk_hg19   
#>   BFC1910 | gcp_broad_gatk_hg38   
#>   ...       ...                   
#>   BFC1918 | reference_genome      
#>   BFC1919 | salmon_index          
#>   BFC1920 | STAR_index            
#>   BFC1921 | ucsc_database         
#>   BFC1922 | ReUseDataRecipe-master
recipeSearch("mutect2")
#> recipeHub with 2 records
#> cache path:  /home/qian/.cache/R/ReUseDataRecipe 
#> # recipeSearch() to query specific recipes using multipe keywords
#> # recipeUpdate() to update the local recipe cache
#> 
#>             name                 
#>   BFC1911 | gcp_gatk_mutect2_b37 
#>   BFC1912 | gcp_gatk_mutect2_hg38
rcp <- recipeLoad("gcp_gatk_mutect2_b37")
#> Note: you need to assign a name for the recipe: rcpName <- recipeLoad('xx')
#> Data recipe loaded!
#> Use inputs() to check required input parameters before evaluation.
#> Check here: https://rcwl.org/dataRecipes/gcp_gatk_mutect2_b37.html
#> for user instructions (e.g., eligible input values, data source, etc.)

## get panel of normal vcf
rcp$filename <- "Mutect2-exome-panel.vcf"
rcp$idx <- "idx"
getData(rcp, outdir = file.path(workdir, "shareData"),
        notes = c("mutect2", "panel of normal"),
        showLog = TRUE)
#> }
#> List of length 4
#> names(4): command output logs yml

## check output
list.files(file.path(workdir, "shareData"), pattern= c("panel", "normal"))
#> [1] "gcp_gatk_mutect2_b37_Mutect2-exome-panel.vcf_idx.cwl"
#> [2] "gcp_gatk_mutect2_b37_Mutect2-exome-panel.vcf_idx.md5"
#> [3] "gcp_gatk_mutect2_b37_Mutect2-exome-panel.vcf_idx.sh" 
#> [4] "gcp_gatk_mutect2_b37_Mutect2-exome-panel.vcf_idx.yml"
#> [5] "Mutect2-exome-panel.vcf"                             
#> [6] "Mutect2-exome-panel.vcf.idx"

The output directory includes the downloaded genomic file dataset, as well as some automatically generated meta files (ending with md5, cwl, yml and sh) by getData, where the the meta information for data recipe are recorded for subsequent reuse. More details about these meta files, please see the For developers section below.

Similarly, we will download the reference genome files using the same data recipe:

## Reference genome files
rcp$filename <- "Homo_sapiens_assembly19.fasta"
rcp$idx = "fai"
getData(rcp, outdir = file.path(workdir, "shareData"),
        notes = c("human", "reference genome", "hg19", "b37"),
        showLog = TRUE)
#> }
#> List of length 4
#> names(4): command output logs yml

rcp$filename <- "Homo_sapiens_assembly19.dict"
rcp$idx <- ""
getData(rcp, outdir = file.path(workdir, "shareData"),
        notes = c("human", "reference genome dict", "hg19", "b37"),
        showLog = TRUE)
#> }
#> List of length 4
#> names(4): command output logs yml
## check output
list.files(file.path(workdir, "shareData"), pattern = c("assembly19"))
#>  [1] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.dict_.cwl"    
#>  [2] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.dict_.md5"    
#>  [3] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.dict_.sh"     
#>  [4] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.dict_.yml"    
#>  [5] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.fasta_fai.cwl"
#>  [6] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.fasta_fai.md5"
#>  [7] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.fasta_fai.sh" 
#>  [8] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.fasta_fai.yml"
#>  [9] "Homo_sapiens_assembly19.dict"                              
#> [10] "Homo_sapiens_assembly19.fasta"                             
#> [11] "Homo_sapiens_assembly19.fasta.fai"

Now we have all the data needed. Here we show some utility functions to manage, query, and use the data.

Data management

There are 2 core functions to manage the data locally for easy query and use.

Then we use dataUpdate to cache the data in specified data directory dir. Then the data can be easily tracked and queried using dataSearch function. Meta information about the data, such as file paths, can be easily retrieved for later use.

  • dataUpdate: Cache the data in specified data directory dir (works recursively). Need to call for first-time use or if any new datasets are generated locally with getData.
  • dataSearch: Query of cached datasets with (multiple) keyword(s) (can be keywords from the file name or the user-specified notes.
dataUpdate(dir = workdir, keepTags=FALSE, cleanup = TRUE)
#> 
#> Updating data record...
#> normal.bam added
#> normal.bam.bai added
#> tumor.bam added
#> tumor.bam.bai added
#> Homo_sapiens_assembly19.dict added
#> Homo_sapiens_assembly19.fasta added
#> Homo_sapiens_assembly19.fasta.fai added
#> Mutect2-exome-panel.vcf added
#> Mutect2-exome-panel.vcf.idx added
#> dataHub with 9 records
#> cache path:  /home/qian/.cache/R/ReUseData 
#> # dataUpdate() to update the local data cache
#> # dataSearch() to query a specific dataset
#> # Additional information can be retrieved using: 
#> # dataNames(), dataParams(), dataNotes(), dataPaths(), dataTag() or mcols()
#> 
#>              name                             
#>   BFC32429 | normal.bam                       
#>   BFC32430 | normal.bam.bai                   
#>   BFC32431 | tumor.bam                        
#>   BFC32432 | tumor.bam.bai                    
#>   BFC32433 | Homo_sapiens_assembly19.dict     
#>   BFC32434 | Homo_sapiens_assembly19.fasta    
#>   BFC32435 | Homo_sapiens_assembly19.fasta.fai
#>   BFC32436 | Mutect2-exome-panel.vcf          
#>   BFC32437 | Mutect2-exome-panel.vcf.idx      
#>            Path                                                          
#>   BFC32429 /tmp/Bioc2023_rcwl/data/normal.bam                            
#>   BFC32430 /tmp/Bioc2023_rcwl/data/normal.bam.bai                        
#>   BFC32431 /tmp/Bioc2023_rcwl/data/tumor.bam                             
#>   BFC32432 /tmp/Bioc2023_rcwl/data/tumor.bam.bai                         
#>   BFC32433 /tmp/Bioc2023_rcwl/shareData/Homo_sapiens_assembly19.dict     
#>   BFC32434 /tmp/Bioc2023_rcwl/shareData/Homo_sapiens_assembly19.fasta    
#>   BFC32435 /tmp/Bioc2023_rcwl/shareData/Homo_sapiens_assembly19.fasta.fai
#>   BFC32436 /tmp/Bioc2023_rcwl/shareData/Mutect2-exome-panel.vcf          
#>   BFC32437 /tmp/Bioc2023_rcwl/shareData/Mutect2-exome-panel.vcf.idx
rs1 <- dataSearch("mutect2")
rs2 <- dataSearch("reference")

Besides the notes added in getData, users can further tag the data. For example, if a set of datasets can be used as inputs for a specific software tool, we’ll tag these datasets with the software name, so that they can be retrieved easily with that keyword in dataSearch.

Here we’ll tag the reference genome data with mutect2, and use some utility functions to retrieve specific information about the data. Data path can be retrieved using dataPaths function, which will be passed directly to the Rcwl tool recipe for reproducible data analysis in R.

dataTags(rs2[2]) <- "mutect2"
rs <- dataSearch("mutect2")
rs
#> dataHub with 3 records
#> cache path:  /home/qian/.cache/R/ReUseData 
#> # dataUpdate() to update the local data cache
#> # dataSearch() to query a specific dataset
#> # Additional information can be retrieved using: 
#> # dataNames(), dataParams(), dataNotes(), dataPaths(), dataTag() or mcols()
#> 
#>              name                         
#>   BFC32434 | Homo_sapiens_assembly19.fasta
#>   BFC32436 | Mutect2-exome-panel.vcf      
#>   BFC32437 | Mutect2-exome-panel.vcf.idx  
#>            Path                                                      
#>   BFC32434 /tmp/Bioc2023_rcwl/shareData/Homo_sapiens_assembly19.fasta
#>   BFC32436 /tmp/Bioc2023_rcwl/shareData/Mutect2-exome-panel.vcf      
#>   BFC32437 /tmp/Bioc2023_rcwl/shareData/Mutect2-exome-panel.vcf.idx
ref <- dataPaths(rs)[1]
pon <- dataPaths(rs[2])
## try: dataNames, dataYml, dataNotes, dataParams, etc.

Run reproducible data analysis in R

Coming back to the RcwlPipelines tool recipe, here we will assign values to the tool for the reusable genomic data resources that are managed by ReUseData, and evaluate the tool (internally submit the CWL task) using runCWL for reproducible data analysis in R.

## check inputs
inputs(mutect2)
#> inputs:
#>   tbam (File): -I /tmp/RtmpF8xa6f/temp_libpath362b397748e5d2/RcwlWorkshop/extdata/tumor.bam
#>   nbam (File?): -I File /tmp/RtmpF8xa6f/temp_libpath362b397748e5d2/RcwlWorkshop/extdata/normal.bam
#>   Ref (File): -R 
#>   normal (string?): -normal NA12892
#>   germline (File?): --germline-resource 
#>   pon (File?): --panel-of-normals 
#>   interval (File?): -L File /tmp/Bioc2023_rcwl/region.bed
#>   out (string): -O somatic.vcf
#>   f1r2 (string?): --f1r2-tar-gz f1r2.tar.gz
#>   threads (int?): --native-pair-hmm-threads
## assign inputs of reusable genomic data
mutect2$Ref <- ref
mutect2$pon <- pon
inputs(mutect2)
#> inputs:
#>   tbam (File): -I /tmp/RtmpF8xa6f/temp_libpath362b397748e5d2/RcwlWorkshop/extdata/tumor.bam
#>   nbam (File?): -I File /tmp/RtmpF8xa6f/temp_libpath362b397748e5d2/RcwlWorkshop/extdata/normal.bam
#>   Ref (File): -R /tmp/Bioc2023_rcwl/shareData/Homo_sapiens_assembly19.fasta
#>   normal (string?): -normal NA12892
#>   germline (File?): --germline-resource 
#>   pon (File?): --panel-of-normals File /tmp/Bioc2023_rcwl/shareData/Mutect2-exome-panel.vcf
#>   interval (File?): -L File /tmp/Bioc2023_rcwl/region.bed
#>   out (string): -O somatic.vcf
#>   f1r2 (string?): --f1r2-tar-gz f1r2.tar.gz
#>   threads (int?): --native-pair-hmm-threads
runCWL(mutect2, outdir = file.path(workdir, "output"),
       docker = "udocker", showLog = TRUE)
#> }
#> List of length 3
#> names(3): command output logs

The results should be successfully generated in the user-specified output directory. The somatic.vcf contains all the somatic variants that are called from the input bam files using the bioinformatics software tool Mutect2.

## checkout results
list.files(file.path(workdir, "output"))
#> [1] "f1r2.tar.gz"       "somatic.vcf"       "somatic.vcf.idx"  
#> [4] "somatic.vcf.stats"

Data use

In R/Bioconductor packages for downstream data analysis

The workflow/tool result files can be conveniently passed to the downstream R/Bioconductor packages for a unified R programming environment.

library(VariantAnnotation)
fl <- file.path(workdir, "output/somatic.vcf")
readVcf(fl)
#> class: CollapsedVCF 
#> dim: 17 2 
#> rowRanges(vcf):
#>   GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
#> info(vcf):
#>   DataFrame with 24 columns: AS_SB_TABLE, AS_UNIQ_ALT_READ_COUNT, CONTQ, DP,...
#> info(header(vcf)):
#>                           Number Type    Description                           
#>    AS_SB_TABLE            1      String  Allele-specific forward/reverse rea...
#>    AS_UNIQ_ALT_READ_COUNT A      Integer Number of reads with unique start a...
#>    CONTQ                  1      Float   Phred-scaled qualities that alt all...
#>    DP                     1      Integer Approximate read depth; some reads ...
#>    ECNT                   1      Integer Number of events in this haplotype    
#>    GERMQ                  1      Integer Phred-scaled quality that alt allel...
#>    MBQ                    R      Integer median base quality by allele         
#>    MFRL                   R      Integer median fragment length by allele      
#>    MMQ                    R      Integer median mapping quality by allele      
#>    MPOS                   A      Integer median distance from end of read      
#>    NALOD                  A      Float   Negative log 10 odds of artifact in...
#>    NCount                 1      Integer Count of N bases in the pileup        
#>    NLOD                   A      Float   Normal log 10 likelihood ratio of d...
#>    OCM                    1      Integer Number of alt reads whose original ...
#>    PON                    0      Flag    site found in panel of normals        
#>    POPAF                  A      Float   negative log 10 population allele f...
#>    ROQ                    1      Float   Phred-scaled qualities that alt all...
#>    RPA                    R      Integer Number of times tandem repeat unit ...
#>    RU                     1      String  Tandem repeat unit (bases)            
#>    SEQQ                   1      Integer Phred-scaled quality that alt allel...
#>    STR                    0      Flag    Variant is a short tandem repeat      
#>    STRANDQ                1      Integer Phred-scaled quality of strand bias...
#>    STRQ                   1      Integer Phred-scaled quality that alt allel...
#>    TLOD                   A      Float   Log 10 likelihood ratio score of va...
#> geno(vcf):
#>   List of length 13: GT, AD, AF, DP, F1R2, F2R1, FAD, GQ, PGT, PID, ...
#> geno(header(vcf)):
#>         Number Type    Description                                             
#>    GT   1      String  Genotype                                                
#>    AD   R      Integer Allelic depths for the ref and alt alleles in the ord...
#>    AF   A      Float   Allele fractions of alternate alleles in the tumor      
#>    DP   1      Integer Approximate read depth (reads with MQ=255 or with bad...
#>    F1R2 R      Integer Count of reads in F1R2 pair orientation supporting ea...
#>    F2R1 R      Integer Count of reads in F2R1 pair orientation supporting ea...
#>    FAD  R      Integer Count of fragments supporting each allele.              
#>    GQ   1      Integer Genotype Quality                                        
#>    PGT  1      String  Physical phasing haplotype information, describing ho...
#>    PID  1      String  Physical phasing ID information, where each unique ID...
#>    PL   G      Integer Normalized, Phred-scaled likelihoods for genotypes as...
#>    PS   1      Integer Phasing set (typically the position of the first vari...
#>    SB   4      Integer Per-sample component statistics which comprise the Fi...

In workflow language scripts

ReUseData, as the name suggests, commits to promoting the data reuse. Data (rom dataSearch) can be prepared in standard input formats (toList), e.g., YAML and JSON, to be easily integrated in workflow methods that are locally or cloud-hosted.

Edits may need for the “key” of data value for different workflow scripts.

dataUpdate(dir = workdir)
#> 
#> Updating data record...
#> normal.bam added
#> normal.bam.bai added
#> tumor.bam added
#> tumor.bam.bai added
#> Homo_sapiens_assembly19.dict added
#> Homo_sapiens_assembly19.fasta added
#> Homo_sapiens_assembly19.fasta.fai added
#> Mutect2-exome-panel.vcf added
#> Mutect2-exome-panel.vcf.idx added
#> dataHub with 9 records
#> cache path:  /home/qian/.cache/R/ReUseData 
#> # dataUpdate() to update the local data cache
#> # dataSearch() to query a specific dataset
#> # Additional information can be retrieved using: 
#> # dataNames(), dataParams(), dataNotes(), dataPaths(), dataTag() or mcols()
#> 
#>              name                             
#>   BFC32438 | normal.bam                       
#>   BFC32439 | normal.bam.bai                   
#>   BFC32440 | tumor.bam                        
#>   BFC32441 | tumor.bam.bai                    
#>   BFC32442 | Homo_sapiens_assembly19.dict     
#>   BFC32443 | Homo_sapiens_assembly19.fasta    
#>   BFC32444 | Homo_sapiens_assembly19.fasta.fai
#>   BFC32445 | Mutect2-exome-panel.vcf          
#>   BFC32446 | Mutect2-exome-panel.vcf.idx      
#>            Path                                                          
#>   BFC32438 /tmp/Bioc2023_rcwl/data/normal.bam                            
#>   BFC32439 /tmp/Bioc2023_rcwl/data/normal.bam.bai                        
#>   BFC32440 /tmp/Bioc2023_rcwl/data/tumor.bam                             
#>   BFC32441 /tmp/Bioc2023_rcwl/data/tumor.bam.bai                         
#>   BFC32442 /tmp/Bioc2023_rcwl/shareData/Homo_sapiens_assembly19.dict     
#>   BFC32443 /tmp/Bioc2023_rcwl/shareData/Homo_sapiens_assembly19.fasta    
#>   BFC32444 /tmp/Bioc2023_rcwl/shareData/Homo_sapiens_assembly19.fasta.fai
#>   BFC32445 /tmp/Bioc2023_rcwl/shareData/Mutect2-exome-panel.vcf          
#>   BFC32446 /tmp/Bioc2023_rcwl/shareData/Mutect2-exome-panel.vcf.idx
ds_mutect2 <- dataSearch("mutect2")
toList(ds_mutect2, format = "json", file=file.path(workdir, "data.json"))
#> File is saved as: "/tmp/Bioc2023_rcwl/data.json"
#> {
#>   "Homo_sapiens_assembly19.fasta": "/tmp/Bioc2023_rcwl/shareData/Homo_sapiens_assembly19.fasta",
#>   "Mutect2-exome-panel.vcf": "/tmp/Bioc2023_rcwl/shareData/Mutect2-exome-panel.vcf",
#>   "Mutect2-exome-panel.vcf.idx": "/tmp/Bioc2023_rcwl/shareData/Mutect2-exome-panel.vcf.idx"
#> }

For developers

Data annotation files

By evaluating a ReUseData data recipe with getData, some meta files for the data will be automatically generated for data and tool provenance.

By default, the meta file names for these meta files are paste of recipe name, and input parameter values separated by _, which can be changed by the prefix argument in getData function if needed.

list.files(file.path(workdir, "shareData"), pattern = "cp_gatk_mutect2_b37")  ## meta files
#>  [1] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.dict_.cwl"    
#>  [2] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.dict_.md5"    
#>  [3] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.dict_.sh"     
#>  [4] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.dict_.yml"    
#>  [5] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.fasta_fai.cwl"
#>  [6] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.fasta_fai.md5"
#>  [7] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.fasta_fai.sh" 
#>  [8] "gcp_gatk_mutect2_b37_Homo_sapiens_assembly19.fasta_fai.yml"
#>  [9] "gcp_gatk_mutect2_b37_Mutect2-exome-panel.vcf_idx.cwl"      
#> [10] "gcp_gatk_mutect2_b37_Mutect2-exome-panel.vcf_idx.md5"      
#> [11] "gcp_gatk_mutect2_b37_Mutect2-exome-panel.vcf_idx.sh"       
#> [12] "gcp_gatk_mutect2_b37_Mutect2-exome-panel.vcf_idx.yml"
  • [recipeName_params].cwl: Rcwl object defined in Rcwl for tool or data recipe.
  • [recipeName_params].yml: File containing values for input parameters for both tool and data recipes through recipe evaluation functions (e.g., runCWL, getData). For data recipes, it includes additional meta information for the output files, notes, and date, etc for data tracking purposes, which are added by getData function.
  • [recipeName_params].sh: The command lines in a shell script.
  • [recipeName_params].md5: unique identifier for each dataset.

Create data recipe

Here we use a simple example to show how to create a data recipe. This recipe will do a simple task of just downloading specific files from fixed web source without any input parameter.

## library(ReUseData)

The recipeMake function will wrap the command line script for data downloading/processings into an executable data recipe in R. We will specify the inputs and outputs of the data recipe and use outputGlob to specify the output pattern (for internal check).

script <- '
wget https://github.com/hubentu/somatic-snv-test-data/raw/master/tumor.bam
wget https://github.com/hubentu/somatic-snv-test-data/raw/master/tumor.bam.bai
wget https://github.com/hubentu/somatic-snv-test-data/raw/master/normal.bam
wget https://github.com/hubentu/somatic-snv-test-data/raw/master/normal.bam.bai
'
rcp_expdata <- recipeMake(shscript = script,
                          outputID = "bams",
                          outputGlob = "*.bam*")

We need to assign values to the input parameters if they exist (data inputs for the shell script). Then the data recipe is ready for the evaluation using getData function (see the ReUseData section [FIXME] above).


  1. Roswell Park Comprehensive Cancer Center, ↩︎

  2. Roswell Park Comprehensive Cancer Center↩︎