Chapter 9 Bulk RNAseq
This pipeline covers RNA-seq reads quality summary by fastQC
,
alignment by STAR
, quantification by featureCounts
and quality
control by RSeQC
.
## fastqc loaded
## STAR loaded
## sortBam loaded
## samtools_index loaded
## samtools_flagstat loaded
## featureCounts loaded
## gtfToGenePred loaded
## genePredToBed loaded
## read_distribution loaded
## geneBody_coverage loaded
## gCoverage loaded
## STAR loaded
The pipeline includes 10 steps, each step runs a single command as follows:
## List of length 10
## names(10): fastqc STAR sortBam ... genePredToBed r_distribution gCoverage
fastqc
: to run quality summary for raw fastqs with base commandfastqc
.STAR
: to align fastqs withSTAR
.sortBam
: to sort bam files withsamtools
.samtools_index
: to index aligned bam file withsamtools
.samtools_flagstat
: to summarize alignment flags withsamtools
.featureCounts
: to quantify gene abundance withfeatureCounts
.gtfToGenePred
: to convert GTF annotation to ‘genePred’ format withRSeQC
.genePredToBed
: to convert ‘genePred’ annotation to ‘bed’ format withRSeQC
.r_distribution
: to run read distribution over genome features withRSeQC
.gCoverage
: to summarize read coverage over gene body withRSeQC
.
The rnaseq_Sf
pipepine output includes the QC result from fastqc
step, indexed bam files from samtools_index
step, log and read
counts from STAR
step, flag summary from samtools_flagstat
step,
feature counts from featureCounts
step, alignment QC results from
RSeQC
steps.
## outputs:
## out_fastqc:
## type: File[]
## outputSource: fastqc/QCfile
## out_BAM:
## type: File
## outputSource: samtools_index/idx
## out_Log:
## type: File
## outputSource: STAR/outLog
## out_Count:
## type: File
## outputSource: STAR/outCount
## out_stat:
## type: File
## outputSource: samtools_flagstat/flagstat
## out_count:
## type: File
## outputSource: featureCounts/Count
## out_distribution:
## type: File
## outputSource: r_distribution/distOut
## out_gCovP:
## type: File
## outputSource: gCoverage/gCovPDF
## out_gCovT:
## type: File
## outputSource: gCoverage/gCovTXT
9.1 Prepare data
An RNASeq test data with paired-end fastq files for 6 samples can be downloaded from genomedata. Create a local directory and follow the code below to download and uncompress.
9.2 Submit parallel jobs
Powered by BiocParallel
,Rcwl
supports parallel job running for
multiple samples using the runCWLBatch
function.
The BPPARAM
argument in runCWLBatch()
defines the parallel
parameters. It can be defined by BiocParallel::BatchtoolsParam
function, where the cluster
argument takes different values for
different cluster job manager, such as “multicore”, “sge” and
“slurm”. More details about available options can be checked by
?BiocParallel::BatchtoolsParam
.
In the following example, we are using “multicore” for the parallel running.
When submitting parallel jobs using runCWLBatch
function, two other
arguments: inputList
and paramList
, need to be defined.
The inputList
argument is required to be a list of input parameter
values for samples that are to be computed parallelly. NOTE that
the names of the list must be consistent with the ids of input
parameters. In this example, they are:
in_seqfiles
: A list with the fastq files of each sample in each element. The names of the list need to be defined and can be the sample IDs. The length of the list will be the same as the number of samples, so that the list of samples can be assigned to different nodes for parallel computing. Here we only use 2 samples for demonstration purposes.in_prefix
: A list of sample IDs.
files <- normalizePath(list.files("data/RNAseq/", ".gz",
full.names = TRUE))[1:4]
files <- tapply(files, substring(basename(files), 1, 8), as.list)
inputList <- list(in_seqfiles = files,
in_prefix = as.list(names(files)))
The paramList
argument is required to be a list of input parameter
values that are to be shared for all parallelly running samples. In
this example, they are:
in_genomeDir
: The reference genome indexes for STAR.in_GTFfile
: The gene annotation file in GTF format.in_runThreadN
: The number of threads to run for each job.
paramList <- list(
in_genomeDir = "data/resources/GRCh38_chr22",
in_GTFfile = "data/resources/GRCh38_chr22/gencode.v32.annotation_chr22.gtf",
in_runThreadN = 2
)
Here we can also modify the default argument values in some steps of a pipeline. For example,
## [[1]]
## [1] "--outFilterMultimapNmax"
##
## [[2]]
## [1] "3"
## [[1]]
## [1] "--outFilterMultimapNmax"
##
## [[2]]
## [1] "2"
Now that the fastqc files of each sample will be submitted to different nodes to run the whole pipeline parallelly.
res <- runCWLBatch(cwl = rnaseq_Sf,
outdir = "output/RNAseq_bulk",
inputList = inputList,
paramList = paramList,
BPPARAM = bpparam,
showLog = TRUE)
Pipeline results are collected in the output directory (defined in
outdir
) for each sample.
## [1] "HBR_Rep1" "HBR_Rep2"
9.3 QC Summary
The tool multiqc
can aggregate results from the multiple outputs of
the pipeline and generate a single page report, which also was
implemented in the RcwlPipelines
package:
We can also run the tool using Rcwl
locally with the option docker = TRUE
:
9.4 Abundances summary
Here we use the R/Bioconductor package edgeR
functions to
calculate the RPKM and CPM abundances.
countfiles <- list.files("output/RNAseq_bulk", "featureCounts.txt$",
recursive = TRUE, full.names = TRUE)
samples <- basename(dirname(countfiles))
rExp <- function(countfile){
count1 <- read.table(countfile, header = TRUE)[, c(1,6,7)]
rpkm1 <- edgeR::rpkm(count1[,3,drop=F], gene.length = count1$Length)
cpm1 <- edgeR::cpm(count1[,3])
count1 <- data.frame(count1, rpkm1, cpm1)
colnames(count1)[3:5] <- c("count", "rpkm", "cpm")
return(count1)
}
## Geneid Length count rpkm cpm
## 1 ENSG00000223972.5 1735 0 0 0
## 2 ENSG00000227232.5 1351 0 0 0
## 3 ENSG00000278267.1 68 0 0 0
## 4 ENSG00000243485.5 1021 0 0 0
## 5 ENSG00000284332.1 138 0 0 0
## 6 ENSG00000237613.2 1219 0 0 0
We combine the files into one file, and then the data is ready for
statistical analysis using R/Bioconductor packages, such as DESeq2
or edgeR
.
9.5 transcriptome quantification
There are many tools available for transcriptome quantification, such
as kallisto, StringTie, salmon, and Trinity. Here show the usage of
Kallisto
and salmon
.
9.5.1 Kallisto
The kallisto is a tool to quantify transcript abundance with raw fastq
reads and indexed reference transcriptomes. The Rcwl
tool of
kallisto_index
below is used to build an index file from reference
transcriptome fasta.
inputs(kallisto_index)
kallisto_index$fasta <- "data/resources/gencode.v33.transcripts.fa"
kallisto_index$index <- "gencode.v33.transcripts_kallisto"
runCWL(kallisto_index, outdir = "data/resources/kallisto", showLog = TRUE)
The Rcwl
tool kallisto_quant
runs the quantification algorithm to
estimate the transcripts expression abundances.
inputList <- list(fastq = files)
paramList <- list(index = "data/resources/gencode.v33.transcripts_kallisto",
threads = 16)
bpparam <- BatchtoolsParam(workers = length(samples),
cluster = "multicore")
res2 <- runCWLBatch(kallisto_quant, outdir = "output/RNAseq_bulk/kallisto",
inputList, paramList,
BPPARAM = bpparam,
log = TRUE, logdir = ".")
Then the tool results are collected here:
## character(0)
9.5.2 salmon
To be added.