The aim of Rmmquant is to quantify the expression of the genes. It is
similar to featureCounts (Liao, Smyth, and Shi 2014) and Rsubread,
or HTSeq-counts (Anders, Pyl, and Huber 2015) and the
countOverlaps
of GenomicRanges.
The main difference with other approaches is that Rmmquant explicitely handles multi-mapping reads, in a way described by (Robert and Watson 2015).
Rmmquant is the R port of the C++ mmquant, which has been published previously (Zytnicki 2017).
The easiest method to get the expression of the genes stored in a GTF file, using RNA-Seq data stored in a BAM file is:
dir <- system.file("extdata", package="Rmmquant", mustWork = TRUE)
gtfFile <- file.path(dir, "test.gtf")
bamFile <- file.path(dir, "test.bam")
output <- RmmquantRun(gtfFile, bamFile)
In this example, the output is a SummarizedExperiment.
The matrix of counts can be accessed through:
## test
## geneA 1
Rmmquant expects at least two kinds of data:
Many aspects of Rmmquant can be controlled with parameters.
All these notions are detailed hereafter.
The Rmmquant package mainly consists in one function,
RmmquantRun
function, that supports the options described
hereafter,
Rmmquant also internally implements an S4 class,
RmmquantClass
, that stores all the inputs and the outputs
of the RmmquantRun
function, as well as a validator (that
checks the consistency of the inputs).
The annotation file should be in GTF. GFF might work too. The tool only uses the gene/transcript/exon types.
Alternatively, the structure can be given using GenomicRanges, or GenomicRangesList.
When a GenomicRanges is provided, the quantification will be performed on each element of the GenomicRanges.
When a GenomicRangesList is provided, the quantification will be performed on each element too, i.e. on each GenomicRanges. Implicitely, each GenomicRanges of a GenomicRangesList is interpreted as a series of exons, and only the transcript (or the gene) will be quantified.
One or several reads files can be given.
The reads should be given in SAM or BAM format, and preferably be sorted by position, but it is not compulsory. The reads can be single end or paired-end (or a mixture thereof).
You can use the samtools (Li et al. 2009) or the Rsamtools
to sort them. Rmmquant uses the NH
flag (that provides the
number of hits for each read, see the SAM format
specification), so be sure that your mapping tool sets it adequately
(yes, TopHat2 (Kim et al. 2013) and STAR (Dobin et al. 2013) do it fine). You should also
check how your mapping tool handles multi-mapping reads (this can
usually be tuned using the appropriate parameters).
The output SummarizedExperiment contains:
The count table can be accessed through the assays
method of SummarizedExperiment.
This methods returns a list, with only one element:
counts
.
## test
## geneA 1
The columns are the samples (here, only test.bam
), and
the rows the gene counts (here, only gene_A
).
The count table can be used by DESeq2
(Love, Huber, and Anders 2014), for
instance, using the DESeqDataSetFromMatrix
function.
If the user provided \(n\) reads files, the output will contain \(n\) columns:
Gene | sample_1 | sample_2 | … |
---|---|---|---|
gene_A | … | … | … |
gene_B | … | … | … |
gene_B–gene_C | … | … | … |
The row names are the ID of the genes/features. The column names are
the sample names. If a read maps several genes (say, gene_B
and gene_C
), a new feature is added to the matrix,
gene_B--gene_C
. The reads that can be mapped to these genes
will be counted there (but not in the gene_B
nor
gene_C
lines).
The statistics can be accessed using the colData
method
of SummarizedExperiment:
## DataFrame with 1 row and 5 columns
## n.hits n.uniquely.mapped.reads n.ambiguously.mapped.hits
## <numeric> <numeric> <numeric>
## test 2 1 0
## n.non.uniquely.mapped.hits n.unassigned.hits
## <numeric> <numeric>
## test 0 1
This is a DataFrame
(defined in S4Vectors),
with one column per sample. The content of each column is:
n.hits
: the number of hitsn.uniquely.mapped.reads
: the number of uniquely mapped
readsn.non.uniquely.mapped.hits
: the number of non-uniquely
mapped hitsn.ambiguously.mapped.hits
: the number of hits with
several corresponding featuresn.unassigned.hits
: the number of hits with no
corresponding featureHere, a hit is a possible mapping for a read. When a read maps several times, it means that several hits correspond to a read. A read may have zero, one, or several hits, which correspond to unmapped, uniquely mapped, and non-uniquely mapped reads.
If printGeneName
is set to TRUE
, the row
names of the count table are the gene names, instead of the gene IDs. If
two different genes have the same name, the systematic name is added,
like: Mat2a (ENSMUSG00000053907)
.
The gene IDs and gene names should be given in the GTF file after the
gene_id
and gene_name
tags respectively.
The column names of the count matrix should be given in the
sampleNames
. If not given, the column names are inferred
from the file names of the SAM/BAM files.
The library types can be specified using the strands
parameter. This parameter can be:
F
: the reads are single-end, and the forward strand is
sequenced,R
: the reads are single-end, and the reverse strand is
sequenced,FR
: the reads are paired-end, the forward strand is
sequenced first, then the reverse strand,RF
, FF
, FF
: other similar
cases,U
: the reads are single- or pair-ends, and the strand
is unknown.The strands
parameters expect one value per SAM/BAM
file. If only one value is given, it is recycled.
The format is usually inferred from the file name, but you can
mention it using the formats
option.
This parameters expect one value per SAM/BAM file. If only one value is given, it is recycled.
The way a read \(R\) is mapped to a
gene \(A\) depends on the
overlap=
\(n\) value:
if \(n\) is | then \(R\) is mapped to \(A\) iff |
---|---|
a negative value | \(R\) is included in \(A\) |
a positive integer | they have at least \(n\) nucleotides in common |
a float value (0, 1) | \(n\)% of the nucleotides of \(R\) are shared with \(A\) |
We will suppose here that the overlap=1
strategy is used
(i.e. a read is attributed to a gene as soon as at least 1 nucleotide
overlap). The example can be extended to other strategies as well.
If a read (say, of size 100), maps unambiguously and overlaps with
gene A and B, it will be counted as 1 for the new “gene”
gene_A--gene_B
. However, suppose that only 1 nucleotide
overlaps with gene A, whereas 100 nucleotides overlap with gene B (yes,
genes A and B overlap). You probably would like to attribute the read to
gene B.
The options nOverlapDiff
and pcOverlapDiff
control this. We compute the number of overlapping nucleotides between a
read and the overlapping genes. If a read overlaps “significantly” more
with one gene than with all the other genes, they will attribute the
read to the former gene only.
The option nOverlapDiff=
\(n\) computes the differences of overlapping
nucleotides. Let us name \(N_A\) and
\(N_B\) the number of overlapping
nucleotides with genes A and B respectively. If \(N_A \geq N_B + n\), then the read will be
attributed to gene A only.
The option pcOverlapDiff=
\(m\) compares the ratio of overlapping
nucleotides. If \(N_A / N_B \geq m\%\),
then the read will be attributed to gene A only.
If both option nOverlapDiff=
\(n\) and pcOverlapDiff=
\(m\) are used, then the read will be
attributed to gene A only iff both \(N_A \geq
N_B + n\) and \(N_A / N_B \geq
m\%\).
If the maximum number of reads for a gene is less than
countThreshold
(a non-negative integer), then the
corresponding line is discarded.
Sometimes, there are very few reads that can be mapped unambiguously to a gene A, because it is very similar to gene B.
Gene | sample_1 |
---|---|
gene_A | \(x\) |
gene_B | \(y\) |
gene_A–gene_B | \(z\) |
In the previous example, suppose that \(x
\ll z\). In this case, you can move all the reads from
gene_A
to gene_A--gene_B
, using the
mergeThreshold=
\(t\), a
float in (0, 1). If \(x < t \cdot
y\), then the reads are transferred.
The next examples uses data generated in TBX20BamSubset, where the expression of the genes is compared between the wild type and the TXB20 knock-out mice. The data have been mapped to the mm9 reference, and restricted to chromosome 19.
You can extract the annotation a BioConductor package, such as TxDb.Mmusculus.UCSC.mm9.knownGene.
The default gene IDs are Entrez ID (Maglott et al. 2015). If you prefer Ensembl IDs (Zerbino et al. 2018), you can modify the GenomicRanges accordingly. Notice that Entrez IDs may have zero, one, or more Ensembl counterparts.
DESeq2 can be executed right after Rmmquant.
output <- RmmquantRun(genomicRanges=gr, readsFiles=bamFiles,
sampleNames=sampleNames, sorts=FALSE)
coldata <- data.frame(condition=factor(c(rep("control", 3), rep("KO", 3)),
levels=c("control", "KO")),
row.names=sampleNames)
dds <- DESeqDataSetFromMatrix(countData=assays(output)$counts,
colData =coldata,
design =~ condition)
dds <- DESeq(dds)
res <- lfcShrink(dds, coef=2)
res$padj <- ifelse(is.na(res$padj), 1, res$padj)
res[res$padj < 0.05, ]
## log2 fold change (MAP): condition KO vs control
## Wald test p-value: condition KO vs control
## DataFrame with 1 row and 5 columns
## baseMean log2FoldChange lfcSE pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSMUSG00000024997 1083.54 -2.01713 0.33904 3.30589e-10 2.64471e-09
While installing the package, if the compiler complains and says
#error This file requires compiler and library support for the ISO C++ 2011 standard.
This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
Add this line
Sys.setenv("PKG_CXXFLAGS"="-std=c++11")
before installing the package.
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R Under development (unstable) (2024-10-21 r87258)
## os Ubuntu 24.04.1 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate C
## ctype en_US.UTF-8
## tz America/New_York
## date 2024-10-29
## pandoc 3.1.3 @ /usr/bin/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## abind 1.4-8 2024-09-12 [2] CRAN (R 4.5.0)
## AnnotationDbi * 1.69.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## apeglm 1.29.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## bbmle 1.0.25.1 2023-12-09 [2] CRAN (R 4.5.0)
## bdsmatrix 1.3-7 2024-03-02 [2] CRAN (R 4.5.0)
## Biobase * 2.67.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## BiocGenerics * 0.53.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## BiocIO 1.17.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## BiocManager 1.30.25 2024-08-28 [2] CRAN (R 4.5.0)
## BiocParallel 1.41.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## BiocStyle * 2.35.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## Biostrings * 2.75.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## bit 4.5.0 2024-09-20 [2] CRAN (R 4.5.0)
## bit64 4.5.2 2024-09-22 [2] CRAN (R 4.5.0)
## bitops 1.0-9 2024-10-03 [2] CRAN (R 4.5.0)
## blob 1.2.4 2023-03-17 [2] CRAN (R 4.5.0)
## bslib 0.8.0 2024-07-29 [2] CRAN (R 4.5.0)
## cachem 1.1.0 2024-05-16 [2] CRAN (R 4.5.0)
## cli 3.6.3 2024-06-21 [2] CRAN (R 4.5.0)
## coda 0.19-4.1 2024-01-31 [2] CRAN (R 4.5.0)
## codetools 0.2-20 2024-03-31 [3] CRAN (R 4.5.0)
## colorspace 2.1-1 2024-07-26 [2] CRAN (R 4.5.0)
## crayon 1.5.3 2024-06-20 [2] CRAN (R 4.5.0)
## curl 5.2.3 2024-09-20 [2] CRAN (R 4.5.0)
## DBI 1.2.3 2024-06-02 [2] CRAN (R 4.5.0)
## DelayedArray 0.33.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## DESeq2 * 1.47.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## devtools 2.4.5 2022-10-11 [2] CRAN (R 4.5.0)
## digest 0.6.37 2024-08-19 [2] CRAN (R 4.5.0)
## dplyr 1.1.4 2023-11-17 [2] CRAN (R 4.5.0)
## ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.5.0)
## emdbook 1.3.13 2023-07-03 [2] CRAN (R 4.5.0)
## evaluate 1.0.1 2024-10-10 [2] CRAN (R 4.5.0)
## fansi 1.0.6 2023-12-08 [2] CRAN (R 4.5.0)
## fastmap 1.2.0 2024-05-15 [2] CRAN (R 4.5.0)
## fs 1.6.4 2024-04-25 [2] CRAN (R 4.5.0)
## generics 0.1.3 2022-07-05 [2] CRAN (R 4.5.0)
## GenomeInfoDb * 1.43.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## GenomeInfoDbData 1.2.13 2024-10-23 [2] Bioconductor
## GenomicAlignments 1.43.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## GenomicFeatures * 1.59.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## GenomicRanges * 1.59.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## ggplot2 3.5.1 2024-04-23 [2] CRAN (R 4.5.0)
## glue 1.8.0 2024-09-30 [2] CRAN (R 4.5.0)
## gtable 0.3.6 2024-10-25 [2] CRAN (R 4.5.0)
## htmltools 0.5.8.1 2024-04-04 [2] CRAN (R 4.5.0)
## htmlwidgets 1.6.4 2023-12-06 [2] CRAN (R 4.5.0)
## httpuv 1.6.15 2024-03-26 [2] CRAN (R 4.5.0)
## httr 1.4.7 2023-08-15 [2] CRAN (R 4.5.0)
## IRanges * 2.41.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## jquerylib 0.1.4 2021-04-26 [2] CRAN (R 4.5.0)
## jsonlite 1.8.9 2024-09-20 [2] CRAN (R 4.5.0)
## KEGGREST 1.47.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## knitr * 1.48 2024-07-07 [2] CRAN (R 4.5.0)
## later 1.3.2 2023-12-06 [2] CRAN (R 4.5.0)
## lattice 0.22-6 2024-03-20 [3] CRAN (R 4.5.0)
## lifecycle 1.0.4 2023-11-07 [2] CRAN (R 4.5.0)
## locfit 1.5-9.10 2024-06-24 [2] CRAN (R 4.5.0)
## magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.5.0)
## MASS 7.3-61 2024-06-13 [3] CRAN (R 4.5.0)
## Matrix 1.7-1 2024-10-18 [3] CRAN (R 4.5.0)
## MatrixGenerics * 1.19.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## matrixStats * 1.4.1 2024-09-08 [2] CRAN (R 4.5.0)
## memoise 2.0.1 2021-11-26 [2] CRAN (R 4.5.0)
## mime 0.12 2021-09-28 [2] CRAN (R 4.5.0)
## miniUI 0.1.1.1 2018-05-18 [2] CRAN (R 4.5.0)
## munsell 0.5.1 2024-04-01 [2] CRAN (R 4.5.0)
## mvtnorm 1.3-1 2024-09-03 [2] CRAN (R 4.5.0)
## numDeriv 2016.8-1.1 2019-06-06 [2] CRAN (R 4.5.0)
## org.Mm.eg.db * 3.20.0 2024-10-24 [2] Bioconductor
## pillar 1.9.0 2023-03-22 [2] CRAN (R 4.5.0)
## pkgbuild 1.4.5 2024-10-28 [2] CRAN (R 4.5.0)
## pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.5.0)
## pkgload 1.4.0 2024-06-28 [2] CRAN (R 4.5.0)
## plyr 1.8.9 2023-10-02 [2] CRAN (R 4.5.0)
## png 0.1-8 2022-11-29 [2] CRAN (R 4.5.0)
## profvis 0.4.0 2024-09-20 [2] CRAN (R 4.5.0)
## promises 1.3.0 2024-04-05 [2] CRAN (R 4.5.0)
## purrr 1.0.2 2023-08-10 [2] CRAN (R 4.5.0)
## R6 2.5.1 2021-08-19 [2] CRAN (R 4.5.0)
## Rcpp 1.0.13 2024-07-17 [2] CRAN (R 4.5.0)
## RCurl 1.98-1.16 2024-07-11 [2] CRAN (R 4.5.0)
## remotes 2.5.0 2024-03-17 [2] CRAN (R 4.5.0)
## restfulr 0.0.15 2022-06-16 [2] CRAN (R 4.5.0)
## rjson 0.2.23 2024-09-16 [2] CRAN (R 4.5.0)
## rlang 1.1.4 2024-06-04 [2] CRAN (R 4.5.0)
## rmarkdown * 2.28 2024-08-17 [2] CRAN (R 4.5.0)
## Rmmquant * 1.25.0 2024-10-29 [1] Bioconductor 3.21 (R 4.5.0)
## Rsamtools * 2.23.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## RSQLite 2.3.7 2024-05-27 [2] CRAN (R 4.5.0)
## rtracklayer 1.67.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## S4Arrays 1.7.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## S4Vectors * 0.45.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## sass 0.4.9 2024-03-15 [2] CRAN (R 4.5.0)
## scales 1.3.0 2023-11-28 [2] CRAN (R 4.5.0)
## sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.5.0)
## shiny 1.9.1 2024-08-01 [2] CRAN (R 4.5.0)
## SparseArray 1.7.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## SummarizedExperiment * 1.37.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## TBX20BamSubset * 1.41.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## tibble 3.2.1 2023-03-20 [2] CRAN (R 4.5.0)
## tidyselect 1.2.1 2024-03-11 [2] CRAN (R 4.5.0)
## TxDb.Mmusculus.UCSC.mm9.knownGene * 3.2.2 2024-10-25 [2] Bioconductor
## UCSC.utils 1.3.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## urlchecker 1.0.1 2021-11-30 [2] CRAN (R 4.5.0)
## usethis 3.0.0 2024-07-29 [2] CRAN (R 4.5.0)
## utf8 1.2.4 2023-10-22 [2] CRAN (R 4.5.0)
## vctrs 0.6.5 2023-12-01 [2] CRAN (R 4.5.0)
## xfun 0.48 2024-10-03 [2] CRAN (R 4.5.0)
## XML 3.99-0.17 2024-06-25 [2] CRAN (R 4.5.0)
## xtable 1.8-4 2019-04-21 [2] CRAN (R 4.5.0)
## XVector * 0.47.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
## yaml 2.3.10 2024-07-26 [2] CRAN (R 4.5.0)
## zlibbioc 1.53.0 2024-10-29 [2] Bioconductor 3.21 (R 4.5.0)
##
## [1] /tmp/RtmpLAgTaw/Rinst18d71cd5c37ca
## [2] /home/biocbuild/bbs-3.21-bioc/R/site-library
## [3] /home/biocbuild/bbs-3.21-bioc/R/library
##
## ──────────────────────────────────────────────────────────────────────────────