Chapter 6 RcwlPipelines

6.1 Rcwl recipes and CWL scripts

The R scripts to build the CWL tools and pipelines are now residing in a dedicated GitHub repository, which is intended to be a community effort to collect and contribute Bioinformatics tools and pipelines using Rcwl and CWL. Script names are prefixed with tl_ and pl_ for tools and pipelines respectively.

6.2 RcwlPipelines core functions

6.2.1 cwlUpdate

The cwlUpdate function syncs the current Rcwl recipes and returns a cwlHub object which contains the most updated Rcwl recipes. The mcols() function returns all related information about each available tool or pipeline. Currently, we have integrated 113 command line tools and 26 pipelines.

The recipes will be locally cached, so users don’t need to call cwlUpdate every time unless they want to use a tool/pipeline that is newly added to RcwlPipelines.

atls <- cwlUpdate(branch = "dev") ## sync the tools/pipelines.
atls
## cwlHub with 139 records
## cache path:  ~/Library/Caches/Rcwl 
## # last modified date:  2021-02-22 
## # cwlSearch() to query scripts
## # cwlLoad('title') to load the script
## # additional mcols(): rid, rpath, Type, Container, mtime, ...
## 
##             title                      
##   BFC2772 | pl_alignMerge              
##   BFC2773 | pl_AnnPhaseVcf             
##   BFC2774 | pl_BaseRecal               
##   BFC2775 | pl_bwaAlign                
##   BFC2776 | pl_bwaMMRecal              
##   ...       ...                        
##   BFC2906 | tl_VarScan2                
##   BFC2907 | tl_vcf_expression_annotator
##   BFC2908 | tl_vcf_readcount_annotator 
##   BFC2909 | tl_vep                     
##   BFC2910 | tl_vt_decompose            
##           Command                                                            
##   BFC2772 bwaAlign+mergeBamDup                                               
##   BFC2773 VCFvep+dVCFcoverage+rVCFcoverage+VCFexpression+PhaseVcf            
##   BFC2774 BaseRecalibrator+ApplyBQSR+samtools_index+samtools_flagstat+samt...
##   BFC2775 bwa+sam2bam+sortBam+idxBam                                         
##   BFC2776 bwaAlign+mergeBamDup+BaseRecal                                     
##   ...     ...                                                                
##   BFC2906                                                                    
##   BFC2907 vcf-expression-annotator                                           
##   BFC2908 vcf-readcount-annotator                                            
##   BFC2909 vep                                                                
##   BFC2910 vt decompose

Currently, we have integrated 113 command line tools and 26 pipelines.

table(mcols(atls)$Type)

We can also get the commands and docker containers for specific tool or pipeline.

mcols(atls)[, c("Command", "Container")]
## DataFrame with 139 rows and 2 columns
##                    Command              Container
##                <character>            <character>
## 1     bwaAlign+mergeBamDup                     NA
## 2   VCFvep+dVCFcoverage+..                     NA
## 3   BaseRecalibrator+App..                     NA
## 4   bwa+sam2bam+sortBam+..                     NA
## 5   bwaAlign+mergeBamDup..                     NA
## ...                    ...                    ...
## 135                        serge2016/varscan:v0..
## 136 vcf-expression-annot.. griffithlab/vatools:..
## 137 vcf-readcount-annota.. griffithlab/vatools:..
## 138                    vep hubentu/ensembl-vep-..
## 139           vt decompose             hubentu/vt

6.2.2 cwlSearch

We can use (multiple) keywords to search for specific tools/pipelines of interest, which internally search the mcols of “rname”, “rpath”, “fpath”, “Command” and “Containers”. Here we show how to search the alignment tool bwa mem.

t1 <- cwlSearch(c("bwa", "mem"))
t1
## cwlHub with 1 records
## cache path:  ~/Library/Caches/Rcwl 
## # last modified date:  2021-02-22 
## # cwlSearch() to query scripts
## # cwlLoad('title') to load the script
## # additional mcols(): rid, rpath, Type, Container, mtime, ...
## 
##             title  Command
##   BFC2815 | tl_bwa bwa mem
mcols(t1)
## DataFrame with 1 row and 14 columns
##           rid       rname         create_time         access_time
##   <character> <character>         <character>         <character>
## 1     BFC2815      tl_bwa 2021-02-26 20:56:55 2021-02-26 20:56:55
##                    rpath       rtype                  fpath last_modified_time
##              <character> <character>            <character>          <numeric>
## 1 /Users/qi31566/Libra..       local /Users/qi31566/Libra..                 NA
##          etag   expires        Type     Command              Container
##   <character> <numeric> <character> <character>            <character>
## 1          NA        NA        tool     bwa mem biocontainers/bwa:v0..
##                 mtime
##           <character>
## 1 2021-02-22 14:45:16

6.2.3 cwlLoad

The last core function cwlLoad loads the Rcwl tool/pipeline into the R working environment. The code below loads the tool with a user-defined name bwa to do the read alignment.

bwa <- cwlLoad(title(t1)[1])  ## "tl_bwa"
bwa <- cwlLoad(mcols(t1)$fpath[1]) ## equivalent to the above. 
bwa
## class: cwlProcess 
##  cwlClass: CommandLineTool 
##  cwlVersion: v1.0 
##  baseCommand: bwa mem 
## requirements:
## - class: DockerRequirement
##   dockerPull: biocontainers/bwa:v0.7.17-3-deb_cv1
## inputs:
##   threads (int): -t 
##   RG (string): -R 
##   Ref (File):  
##   FQ1 (File):  
##   FQ2 (File?):  
## outputs:
## sam:
##   type: File
##   outputBinding:
##     glob: '*.sam'
## stdout: bwaOutput.sam

Now the R tool of bwa is ready to use.

6.3 Tool/pipeline customization

To fit users’ specific needs,the existing tool or pipline can be easily customized. Here we use the rnaseq_Sf pipeline to demonstrate how to access and change the arguments of a specific tool inside a pipeline. This pipeline covers RNA-seq reads quality summary by fastQC, alignment by STAR, quantification by featureCounts and quality control by RSeQC.

rnaseq_Sf <- cwlLoad("pl_rnaseq_Sf")
## fastqc loaded
## STAR loaded
## sortBam loaded
## samtools_index loaded
## samtools_flagstat loaded
## featureCounts loaded
## gtfToGenePred loaded
## genePredToBed loaded
## read_distribution loaded
## geneBody_coverage loaded
## gCoverage loaded
## STAR loaded
plotCWL(rnaseq_Sf)

There are many default arguments defined for the tool of STAR inside the pipeline. Users might want to change some of them. For example, we can change the value for --outFilterMismatchNmax argument from 2 to 5 for longer reads.

arguments(rnaseq_Sf, "STAR")[5:6]
## [[1]]
## [1] "--outFilterMismatchNmax"
## 
## [[2]]
## [1] "2"
arguments(rnaseq_Sf, "STAR")[[6]] <- 5
arguments(rnaseq_Sf, "STAR")[5:6]
## [[1]]
## [1] "--outFilterMismatchNmax"
## 
## [[2]]
## [1] "5"

We can also change the docker image for a specific tool (e.g., to a specific version). First, we search for all available docker images for STAR in biocontainers repository. The Source server could be quay or dockerhub.

searchContainer("STAR", repo = "biocontainers", source = "quay")
## DataFrame with 29 rows and 5 columns
##                  tool          repo        name          last_modified
##           <character>   <character> <character>            <character>
## 2.7.8a--0        STAR biocontainers   2.7.8a--0 Sun, 21 Feb 2021 12:..
## 2.7.7a--0        STAR biocontainers   2.7.7a--0 Tue, 29 Dec 2020 13:..
## 2.7.6a--0        STAR biocontainers   2.7.6a--0 Sun, 20 Sep 2020 09:..
## 2.7.5c--0        STAR biocontainers   2.7.5c--0 Mon, 17 Aug 2020 09:..
## 2.7.5b--0        STAR biocontainers   2.7.5b--0 Sat, 01 Aug 2020 17:..
## ...               ...           ...         ...                    ...
## 2.4.0j--0        STAR biocontainers   2.4.0j--0 Tue, 06 Mar 2018 12:..
## 2.5.4a--0        STAR biocontainers   2.5.4a--0 Fri, 26 Jan 2018 21:..
## 2.5.3a--0        STAR biocontainers   2.5.3a--0 Sat, 18 Mar 2017 11:..
## 2.5.2b--0        STAR biocontainers   2.5.2b--0 Tue, 06 Sep 2016 07:..
## 2.5.1b--0        STAR biocontainers   2.5.1b--0 Wed, 11 May 2016 08:..
##                  size
##           <character>
## 2.7.8a--0     7985763
## 2.7.7a--0     7824392
## 2.7.6a--0     7820491
## 2.7.5c--0     7806635
## 2.7.5b--0     7805390
## ...               ...
## 2.4.0j--0     4734325
## 2.5.4a--0     9225952
## 2.5.3a--0     9119736
## 2.5.2b--0     9086803
## 2.5.1b--0    11291827

Then, we can change the STAR version into 2.7.8a (tag name: 2.7.8a–0).

requirements(rnaseq_Sf, "STAR")[[1]]
## $class
## [1] "DockerRequirement"
## 
## $dockerPull
## [1] "quay.io/biocontainers/star:2.7.3a--0"
requirements(rnaseq_Sf, "STAR")[[1]] <- requireDocker(
    docker = "quay.io/biocontainers/star:2.7.8a--0")
requirements(rnaseq_Sf, "STAR")[[1]]
## $class
## [1] "DockerRequirement"
## 
## $dockerPull
## [1] "quay.io/biocontainers/star:2.7.8a--0"

6.4 Build a pipeline

We can build a pipline using the available tools. Here we demonstrate how to build a simple alignment pipeline with mapping and marking duplicates.

First, we check whether the required tools (bwa, samtools and picard markduplicates) are available in RcwlPipelines.

tls <- cwlSearch("bwa|sam2bam|sortBam|samtools_index|markdup",
                 type = "tool")
tls
## cwlHub with 6 records
## cache path:  ~/Library/Caches/Rcwl 
## # last modified date:  2021-02-22 
## # cwlSearch() to query scripts
## # cwlLoad('title') to load the script
## # additional mcols(): rid, rpath, Type, Container, mtime, ...
## 
##             title             Command              
##   BFC2814 | tl_bwa_index      bwa index            
##   BFC2815 | tl_bwa            bwa mem              
##   BFC2851 | tl_markdup        picard MarkDuplicates
##   BFC2877 | tl_sam2bam        samtools view        
##   BFC2881 | tl_samtools_index samtools index       
##   BFC2887 | tl_sortBam        samtools sort

Then we load all the required tools.

bwa <- cwlLoad("tl_bwa")
bwa_index <- cwlLoad("tl_bwa_index")
markdup <- cwlLoad("tl_markdup")
sam2bam <- cwlLoad("tl_sam2bam")
samtools_index <- cwlLoad("tl_samtools_index")
sortBam <- cwlLoad("tl_sortBam")

Next, we will need to define the input parameters for the pipeline (instead of for each tool).

p1 <- InputParam(id = "threads", type = "int")
p2 <- InputParam(id = "RG", type = "string")
p3 <- InputParam(id = "Ref", type = "string")
p4 <- InputParam(id = "FQ1", type = "File")
p5 <- InputParam(id = "FQ2", type = "File?")

Then we define the pipeline steps, to connect the inputs and outputs of each tool to form a pipeline.

## bwa
s1 <- cwlStep(id = "bwa", run = bwa,
              In = list(threads = "threads",
                        RG = "RG",
                        Ref = "Ref",
                        FQ1 = "FQ1",
                        FQ2 = "FQ2"))
## sam to bam
s2 <- cwlStep(id = "sam2bam", run = sam2bam,
              In = list(sam = "bwa/sam"))
## sort bam
s3 <- cwlStep(id = "sortBam", run = sortBam,
              In = list(bam = "sam2bam/bam"))
## mark duplicates
s4 <- cwlStep(id = "markdup", run = markdup,
              In = list(ibam = "sortBam/sbam",
                        obam = list(
                            valueFrom="$(inputs.ibam.nameroot).mdup.bam"),
                        matrix = list(
                            valueFrom="$(inputs.ibam.nameroot).markdup.txt")))
## index bam
s5 <- cwlStep(id = "idxBam", run = samtools_index,
              In = list(bam = "markdup/mBam"))

Last, we will define the pipeline outputs and connect all the above defined steps into a new pipeline.

req1 <- requireStepInputExpression()
req2 <- requireJS()
## outputs
o1 <- OutputParam(id = "Bam", type = "File", outputSource = "markdup/mBam")
o2 <- OutputParam(id = "Idx", type = "File", outputSource = "idxBam/idx")
## cwlWorkflow
Align <- cwlWorkflow(requirements = list(req1, req2),
                     inputs = InputParamList(p1, p2, p3, p4, p5),
                     outputs = OutputParamList(o1, o2))
## build pipeline
Align <- Align + s1 + s2 + s3 + s4 + s5

Now the pipeline is successfully built and ready for use. We can visualize the pipeline with plotCWL from the Rcwl package.

plotCWL(Align)