Chapter 6 RcwlPipelines
6.1 Rcwl recipes and CWL scripts
The R scripts to build the CWL tools and pipelines are now residing
in a dedicated GitHub
repository, which is
intended to be a community effort to collect and contribute
Bioinformatics tools and pipelines using Rcwl
and CWL. Script names
are prefixed with tl_
and pl_
for tools and pipelines
respectively.
6.2 RcwlPipelines core functions
6.2.1 cwlUpdate
The cwlUpdate
function syncs the current Rcwl
recipes and returns
a cwlHub
object which contains the most updated Rcwl
recipes. The
mcols()
function returns all related information about each
available tool or pipeline. Currently, we have integrated 113 command
line tools and 26 pipelines.
The recipes will be locally cached, so users don’t need to call
cwlUpdate
every time unless they want to use a tool/pipeline that is
newly added to RcwlPipelines
.
## cwlHub with 139 records
## cache path: ~/Library/Caches/Rcwl
## # last modified date: 2021-02-22
## # cwlSearch() to query scripts
## # cwlLoad('title') to load the script
## # additional mcols(): rid, rpath, Type, Container, mtime, ...
##
## title
## BFC2772 | pl_alignMerge
## BFC2773 | pl_AnnPhaseVcf
## BFC2774 | pl_BaseRecal
## BFC2775 | pl_bwaAlign
## BFC2776 | pl_bwaMMRecal
## ... ...
## BFC2906 | tl_VarScan2
## BFC2907 | tl_vcf_expression_annotator
## BFC2908 | tl_vcf_readcount_annotator
## BFC2909 | tl_vep
## BFC2910 | tl_vt_decompose
## Command
## BFC2772 bwaAlign+mergeBamDup
## BFC2773 VCFvep+dVCFcoverage+rVCFcoverage+VCFexpression+PhaseVcf
## BFC2774 BaseRecalibrator+ApplyBQSR+samtools_index+samtools_flagstat+samt...
## BFC2775 bwa+sam2bam+sortBam+idxBam
## BFC2776 bwaAlign+mergeBamDup+BaseRecal
## ... ...
## BFC2906
## BFC2907 vcf-expression-annotator
## BFC2908 vcf-readcount-annotator
## BFC2909 vep
## BFC2910 vt decompose
Currently, we have integrated 113 command line tools and 26 pipelines.
table(mcols(atls)$Type)
We can also get the commands and docker containers for specific tool or pipeline.
## DataFrame with 139 rows and 2 columns
## Command Container
## <character> <character>
## 1 bwaAlign+mergeBamDup NA
## 2 VCFvep+dVCFcoverage+.. NA
## 3 BaseRecalibrator+App.. NA
## 4 bwa+sam2bam+sortBam+.. NA
## 5 bwaAlign+mergeBamDup.. NA
## ... ... ...
## 135 serge2016/varscan:v0..
## 136 vcf-expression-annot.. griffithlab/vatools:..
## 137 vcf-readcount-annota.. griffithlab/vatools:..
## 138 vep hubentu/ensembl-vep-..
## 139 vt decompose hubentu/vt
6.2.2 cwlSearch
We can use (multiple) keywords to search for specific tools/pipelines
of interest, which internally search the mcols
of “rname”, “rpath”,
“fpath”, “Command” and “Containers”. Here we show how to search the
alignment tool bwa mem
.
## cwlHub with 1 records
## cache path: ~/Library/Caches/Rcwl
## # last modified date: 2021-02-22
## # cwlSearch() to query scripts
## # cwlLoad('title') to load the script
## # additional mcols(): rid, rpath, Type, Container, mtime, ...
##
## title Command
## BFC2815 | tl_bwa bwa mem
## DataFrame with 1 row and 14 columns
## rid rname create_time access_time
## <character> <character> <character> <character>
## 1 BFC2815 tl_bwa 2021-02-26 20:56:55 2021-02-26 20:56:55
## rpath rtype fpath last_modified_time
## <character> <character> <character> <numeric>
## 1 /Users/qi31566/Libra.. local /Users/qi31566/Libra.. NA
## etag expires Type Command Container
## <character> <numeric> <character> <character> <character>
## 1 NA NA tool bwa mem biocontainers/bwa:v0..
## mtime
## <character>
## 1 2021-02-22 14:45:16
6.2.3 cwlLoad
The last core function cwlLoad
loads the Rcwl
tool/pipeline into
the R working environment. The code below loads the tool with a
user-defined name bwa
to do the read alignment.
bwa <- cwlLoad(title(t1)[1]) ## "tl_bwa"
bwa <- cwlLoad(mcols(t1)$fpath[1]) ## equivalent to the above.
bwa
## class: cwlProcess
## cwlClass: CommandLineTool
## cwlVersion: v1.0
## baseCommand: bwa mem
## requirements:
## - class: DockerRequirement
## dockerPull: biocontainers/bwa:v0.7.17-3-deb_cv1
## inputs:
## threads (int): -t
## RG (string): -R
## Ref (File):
## FQ1 (File):
## FQ2 (File?):
## outputs:
## sam:
## type: File
## outputBinding:
## glob: '*.sam'
## stdout: bwaOutput.sam
Now the R tool of bwa
is ready to use.
6.3 Tool/pipeline customization
To fit users’ specific needs,the existing tool or pipline can be
easily customized. Here we use the rnaseq_Sf
pipeline to demonstrate
how to access and change the arguments of a specific tool inside a
pipeline. This pipeline covers RNA-seq reads quality summary by
fastQC
, alignment by STAR
, quantification by featureCounts
and
quality control by RSeQC
.
## fastqc loaded
## STAR loaded
## sortBam loaded
## samtools_index loaded
## samtools_flagstat loaded
## featureCounts loaded
## gtfToGenePred loaded
## genePredToBed loaded
## read_distribution loaded
## geneBody_coverage loaded
## gCoverage loaded
## STAR loaded
There are many default arguments defined for the tool of STAR
inside
the pipeline. Users might want to change some of them. For example, we
can change the value for --outFilterMismatchNmax
argument from 2 to
5 for longer reads.
## [[1]]
## [1] "--outFilterMismatchNmax"
##
## [[2]]
## [1] "2"
## [[1]]
## [1] "--outFilterMismatchNmax"
##
## [[2]]
## [1] "5"
We can also change the docker image for a specific tool (e.g., to a
specific version). First, we search for all available docker images
for STAR
in biocontainers repository. The Source server could be
quay or dockerhub.
## DataFrame with 29 rows and 5 columns
## tool repo name last_modified
## <character> <character> <character> <character>
## 2.7.8a--0 STAR biocontainers 2.7.8a--0 Sun, 21 Feb 2021 12:..
## 2.7.7a--0 STAR biocontainers 2.7.7a--0 Tue, 29 Dec 2020 13:..
## 2.7.6a--0 STAR biocontainers 2.7.6a--0 Sun, 20 Sep 2020 09:..
## 2.7.5c--0 STAR biocontainers 2.7.5c--0 Mon, 17 Aug 2020 09:..
## 2.7.5b--0 STAR biocontainers 2.7.5b--0 Sat, 01 Aug 2020 17:..
## ... ... ... ... ...
## 2.4.0j--0 STAR biocontainers 2.4.0j--0 Tue, 06 Mar 2018 12:..
## 2.5.4a--0 STAR biocontainers 2.5.4a--0 Fri, 26 Jan 2018 21:..
## 2.5.3a--0 STAR biocontainers 2.5.3a--0 Sat, 18 Mar 2017 11:..
## 2.5.2b--0 STAR biocontainers 2.5.2b--0 Tue, 06 Sep 2016 07:..
## 2.5.1b--0 STAR biocontainers 2.5.1b--0 Wed, 11 May 2016 08:..
## size
## <character>
## 2.7.8a--0 7985763
## 2.7.7a--0 7824392
## 2.7.6a--0 7820491
## 2.7.5c--0 7806635
## 2.7.5b--0 7805390
## ... ...
## 2.4.0j--0 4734325
## 2.5.4a--0 9225952
## 2.5.3a--0 9119736
## 2.5.2b--0 9086803
## 2.5.1b--0 11291827
Then, we can change the STAR
version into 2.7.8a (tag name: 2.7.8a–0).
## $class
## [1] "DockerRequirement"
##
## $dockerPull
## [1] "quay.io/biocontainers/star:2.7.3a--0"
requirements(rnaseq_Sf, "STAR")[[1]] <- requireDocker(
docker = "quay.io/biocontainers/star:2.7.8a--0")
requirements(rnaseq_Sf, "STAR")[[1]]
## $class
## [1] "DockerRequirement"
##
## $dockerPull
## [1] "quay.io/biocontainers/star:2.7.8a--0"
6.4 Build a pipeline
We can build a pipline using the available tools. Here we demonstrate how to build a simple alignment pipeline with mapping and marking duplicates.
First, we check whether the required tools (bwa, samtools and picard
markduplicates) are available in RcwlPipelines
.
## cwlHub with 6 records
## cache path: ~/Library/Caches/Rcwl
## # last modified date: 2021-02-22
## # cwlSearch() to query scripts
## # cwlLoad('title') to load the script
## # additional mcols(): rid, rpath, Type, Container, mtime, ...
##
## title Command
## BFC2814 | tl_bwa_index bwa index
## BFC2815 | tl_bwa bwa mem
## BFC2851 | tl_markdup picard MarkDuplicates
## BFC2877 | tl_sam2bam samtools view
## BFC2881 | tl_samtools_index samtools index
## BFC2887 | tl_sortBam samtools sort
Then we load all the required tools.
bwa <- cwlLoad("tl_bwa")
bwa_index <- cwlLoad("tl_bwa_index")
markdup <- cwlLoad("tl_markdup")
sam2bam <- cwlLoad("tl_sam2bam")
samtools_index <- cwlLoad("tl_samtools_index")
sortBam <- cwlLoad("tl_sortBam")
Next, we will need to define the input parameters for the pipeline (instead of for each tool).
p1 <- InputParam(id = "threads", type = "int")
p2 <- InputParam(id = "RG", type = "string")
p3 <- InputParam(id = "Ref", type = "string")
p4 <- InputParam(id = "FQ1", type = "File")
p5 <- InputParam(id = "FQ2", type = "File?")
Then we define the pipeline steps, to connect the inputs and outputs of each tool to form a pipeline.
## bwa
s1 <- cwlStep(id = "bwa", run = bwa,
In = list(threads = "threads",
RG = "RG",
Ref = "Ref",
FQ1 = "FQ1",
FQ2 = "FQ2"))
## sam to bam
s2 <- cwlStep(id = "sam2bam", run = sam2bam,
In = list(sam = "bwa/sam"))
## sort bam
s3 <- cwlStep(id = "sortBam", run = sortBam,
In = list(bam = "sam2bam/bam"))
## mark duplicates
s4 <- cwlStep(id = "markdup", run = markdup,
In = list(ibam = "sortBam/sbam",
obam = list(
valueFrom="$(inputs.ibam.nameroot).mdup.bam"),
matrix = list(
valueFrom="$(inputs.ibam.nameroot).markdup.txt")))
## index bam
s5 <- cwlStep(id = "idxBam", run = samtools_index,
In = list(bam = "markdup/mBam"))
Last, we will define the pipeline outputs and connect all the above defined steps into a new pipeline.
req1 <- requireStepInputExpression()
req2 <- requireJS()
## outputs
o1 <- OutputParam(id = "Bam", type = "File", outputSource = "markdup/mBam")
o2 <- OutputParam(id = "Idx", type = "File", outputSource = "idxBam/idx")
## cwlWorkflow
Align <- cwlWorkflow(requirements = list(req1, req2),
inputs = InputParamList(p1, p2, p3, p4, p5),
outputs = OutputParamList(o1, o2))
## build pipeline
Align <- Align + s1 + s2 + s3 + s4 + s5
Now the pipeline is successfully built and ready for use. We can
visualize the pipeline with plotCWL
from the Rcwl
package.