This vignette describes the classes implemented in package. It is intended as a starting point for developers or users who would like to learn more or further develop/extend mass spectrometry and proteomics data structures.
MSnbase 2.33.0
This software is free and open-source software. If you use it, please support the project by citing it in publications:
Gatto L, Lilley KS. MSnbase-an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics. 2012 Jan 15;28(2):288-9. doi: 10.1093/bioinformatics/btr645. PMID: 22113085.
MSnbase
, efficient and elegant R-based processing and visualisation of raw mass spectrometry data. Laurent Gatto, Sebastian Gibb, Johannes Rainer. bioRxiv 2020.04.29.067868; doi: https://doi.org/10.1101/2020.04.29.067868
For bugs, typos, suggestions or other questions, please file an issue
in our tracking system (https://github.com/lgatto/MSnbase/issues)
providing as much information as possible, a reproducible example and
the output of sessionInfo()
.
If you don’t have a GitHub account or wish to reach a broader audience for general questions about proteomics analysis using R, you may want to use the Bioconductor support site: https://support.bioconductor.org/.
NB This document is going to be updated based on current major
development plans in MSnbase
.
This document is not a replacement for the individual manual pages, that document the slots of the MSnbase classes. It is a centralised high-level description of the package design.
MSnbase aims at being compatible with the
Biobase infrastructure (Gentleman et al. 2004). Many meta data
structures that are used in eSet
and associated classes are also
used here. As such, knowledge of the Biobase development and the new
eSet vignette would be beneficial; the vignette can directly be
accessed with vignette("BiobaseDevelopment", package="Biobase")
.
The initial goal is to use the MSnbase infrastructure
for MS2 labelled (iTRAQ (Ross et al. 2004) and TMT (Thompson et al. 2003)) and
label-free (spectral counting, index and abundance) quantitation
- see the documentation for the quantify
function for details. The
infrastructure is currently extended to support a wider range of
technologies, including metabolomics.
MSnbase
follows the Bioconductor style
guide. In
particular
.
when naming symbols..
can be used for hidden/local functions or variables.##
to start full-line comments.#'
is preferred, although ##'
is tolerated.=
in function arguments or class definition:
f(a = 1, b = 2)
.a, b, c
.a + b
.# no wrap at 80
someVeryLongVariableName <- someVeryLongFunctionName(withSomeEvenLongerFunctionArgumentA = 1, withSomeEvenLongerFunctionArgumentB = 2)
and should be wrapped as shown below:
# alternative 1
someVeryLongVariableName <-
someVeryLongFunctionName(withSomeEvenLongerFunctionArgumentA = 1,
withSomeEvenLongerFunctionArgumentB = 2)
# alternative 2
someVeryLongVariableName <- someVeryLongFunctionName(
withSomeEvenLongerFunctionArgumentA = 1,
withSomeEvenLongerFunctionArgumentB = 2)
All classes have a .__classVersion__
slot, of class Versioned
from
the Biobase package. This slot documents the class
version for any instance to be used for debugging and object update
purposes. Any change in a class implementation should trigger a
version change.
pSet
: a virtual class for raw mass spectrometry data and meta dataThis virtual class is the main container for mass spectrometry data,
i.e spectra, and meta data. It is based on the eSet
implementation
for genomic data. The main difference with eSet
is that the
assayData
slot is an environment containing any number of
Spectrum
instances (see the Spectrum
section).
One new slot is introduced, namely processingData
, that contains one
MSnProcess
instance (see the MSnProcess
section).
and the experimentData
slot is now expected to contain MIAPE
data.
The annotation
slot has not been implemented, as no prior feature
annotation is known in shotgun proteomics.
getClass("pSet")
## Virtual Class "pSet" [package "MSnbase"]
##
## Slots:
##
## Name: assayData phenoData featureData
## Class: environment AnnotatedDataFrame AnnotatedDataFrame
##
## Name: experimentData protocolData processingData
## Class: MIAxE AnnotatedDataFrame MSnProcess
##
## Name: .cache .__classVersion__
## Class: environment Versions
##
## Extends: "Versioned"
##
## Known Subclasses:
## Class "MSnExp", directly
## Class "OnDiskMSnExp", by class "MSnExp", distance 2, with explicit coerce
MSnExp
: a class for MS experimentsMSnExp
extends pSet
to store MS experiments. It does not add any
new slots to pSet
. Accessors and setters are all inherited from
pSet
and new ones should be implemented for pSet
. Methods that
manipulate actual data in experiments are implemented for
MSnExp
objects.
getClass("MSnExp")
## Class "MSnExp" [package "MSnbase"]
##
## Slots:
##
## Name: assayData phenoData featureData
## Class: environment AnnotatedDataFrame AnnotatedDataFrame
##
## Name: experimentData protocolData processingData
## Class: MIAxE AnnotatedDataFrame MSnProcess
##
## Name: .cache .__classVersion__
## Class: environment Versions
##
## Extends:
## Class "pSet", directly
## Class "Versioned", by class "pSet", distance 2
##
## Known Subclasses:
## Class "OnDiskMSnExp", directly, with explicit coerce
OnDiskMSnExp
: a on-disk implementation of the MSnExp
classThe OnDiskMSnExp
class extends MSnExp
and inherits all of its
functionality but is aimed to use as little memory as possible based
on a balance between memory demand and performance. Most of the
spectrum-specific data, like retention time, polarity, total ion
current are stored within the object’s featureData
slot. The actual
M/Z and intensity values from the individual spectra are, in contrast
to MSnExp
objects, not kept in memory (in the assayData
slot), but
are fetched from the original files on-demand. Because mzML files are
indexed, using the mzR package to read the relevant
spectrum data is fast and only moderately slower than for in-memory
MSnExp
1 The benchmarking vignette compares data size and operation speed of the two implementations..
To keep track of data manipulation steps that are applied to spectrum
data (such as performed by methods removePeaks
or clean
) a lazy
execution framework was implemented. Methods that manipulate or
subset a spectrum’s M/Z or intensity values can not be applied
directly to a OnDiskMSnExp
object, since the relevant data is not
kept in memory. Thus, any call to a processing method that changes or
subset M/Z or intensity values are added as ProcessingStep
items to
the object’s spectraProcessingQueue
. When the spectrum data is then
queried from an OnDiskMSnExp
, the spectra are read in from the file
and all these processing steps are applied on-the-fly to the spectrum
data before being returned to the user.
The operations involving extracting or manipulating spectrum data are
applied on a per-file basis, which enables parallel processing. Thus,
all corresponding method implementations for OnDiskMSnExp
objects
have an argument BPPARAM
and users can set a PARALLEL_THRESH
option flag2 see ?MSnbaseOptions
for details. that enables to
define how and when parallel processing should be performed (using the
BiocParallel package).
Note that all data manipulations that are not applied to M/Z or
intensity values of a spectrum (e.g. sub-setting by retention time
etc) are very fast as they operate directly to the object’s
featureData
slot.
getClass("OnDiskMSnExp")
## Class "OnDiskMSnExp" [package "MSnbase"]
##
## Slots:
##
## Name: spectraProcessingQueue backend assayData
## Class: list character environment
##
## Name: phenoData featureData experimentData
## Class: AnnotatedDataFrame AnnotatedDataFrame MIAxE
##
## Name: protocolData processingData .cache
## Class: AnnotatedDataFrame MSnProcess environment
##
## Name: .__classVersion__
## Class: Versions
##
## Extends:
## Class "MSnExp", directly
## Class "pSet", by class "MSnExp", distance 2
## Class "Versioned", by class "MSnExp", distance 3
The distinction between MSnExp
and OnDiskMSnExp
is often not
explicitly stated as it should not matter, from a user’s perspective,
which data structure they are working with, as both behave in
equivalent ways. Often, they are referred to as in-memory and
on-disk MSnExp
implementations.
MSnSet
: a class for quantitative proteomics dataThis class stores quantitation data and meta data after running
quantify
on an MSnExp
object or by creating an MSnSet
instance
from an external file, as described in the MSnbase-io vignette and
in ?readMSnSet
, readMzTabData
, etc. The quantitative data is in
form of a n by p matrix, where n is the number of
features/spectra originally in the MSnExp
used as parameter in
quantify
and p is the number of reporter ions. If read from an
external file, n corresponds to the number of features (protein
groups, proteins, peptides, spectra) in the file and \(p\) is the number
of columns with quantitative data (samples) in the file.
This prompted to keep a similar implementation as the ExpressionSet
class, while adding the proteomics-specific annotation slot introduced
in the pSet
class, namely processingData
for objects of class
MSnProcess
.
getClass("MSnSet")
## Class "MSnSet" [package "MSnbase"]
##
## Slots:
##
## Name: experimentData processingData qual
## Class: MIAPE MSnProcess data.frame
##
## Name: assayData phenoData featureData
## Class: AssayData AnnotatedDataFrame AnnotatedDataFrame
##
## Name: annotation protocolData .__classVersion__
## Class: character AnnotatedDataFrame Versions
##
## Extends:
## Class "eSet", directly
## Class "VersionedBiobase", by class "eSet", distance 2
## Class "Versioned", by class "eSet", distance 3
The MSnSet
class extends the virtual eSet
class to provide
compatibility for ExpressionSet
-like behaviour. The experiment
meta-data in experimentData
is also of class MIAPE
. The
annotation
slot, inherited from eSet
is not used. As a result, it
is easy to convert ExpressionSet
data from/to MSnSet
objects with
the coersion method as
.
data(msnset)
class(msnset)
## [1] "MSnSet"
## attr(,"package")
## [1] "MSnbase"
class(as(msnset, "ExpressionSet"))
## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"
data(sample.ExpressionSet)
class(sample.ExpressionSet)
## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"
class(as(sample.ExpressionSet, "MSnSet"))
## [1] "MSnSet"
## attr(,"package")
## [1] "MSnbase"
MSnProcess
: a class for logging processing meta dataThis class aims at recording specific manipulations applied to
MSnExp
or MSnSet
instances. The processing
slot is a character
vector that describes major
processing. Most other slots are of class logical
that
indicate whether the data has been centroided, smoothed,
although many of the functionality is not implemented yet. Any new
processing that is implemented should be documented and logged here.
It also documents the raw data file from which the data originates
(files
slot) and the MSnbase version that was in
use when the MSnProcess
instance, and hence the
MSnExp
/MSnSet
objects, were originally created.
getClass("MSnProcess")
## Class "MSnProcess" [package "MSnbase"]
##
## Slots:
##
## Name: files processing merged cleaned
## Class: character character logical logical
##
## Name: removedPeaks smoothed trimmed normalised
## Class: character logical numeric logical
##
## Name: MSnbaseVersion .__classVersion__
## Class: character Versions
##
## Extends: "Versioned"
MIAPE
: Minimum Information About a Proteomics ExperimentThe Minimum Information About a Proteomics Experiment
(Chris F. Taylor et al. 2007; Chris F. Taylor et al. 2008) MIAPE
class describes the experiment,
including contact details, information about the mass spectrometer and
control and analysis software.
getClass("MIAPE")
## Class "MIAPE" [package "MSnbase"]
##
## Slots:
##
## Name: title url
## Class: character character
##
## Name: abstract pubMedIds
## Class: character character
##
## Name: samples preprocessing
## Class: list list
##
## Name: other dateStamp
## Class: list character
##
## Name: name lab
## Class: character character
##
## Name: contact email
## Class: character character
##
## Name: instrumentModel instrumentManufacturer
## Class: character character
##
## Name: instrumentCustomisations softwareName
## Class: character character
##
## Name: softwareVersion switchingCriteria
## Class: character character
##
## Name: isolationWidth parameterFile
## Class: numeric character
##
## Name: ionSource ionSourceDetails
## Class: character character
##
## Name: analyser analyserDetails
## Class: character character
##
## Name: collisionGas collisionPressure
## Class: character numeric
##
## Name: collisionEnergy detectorType
## Class: character character
##
## Name: detectorSensitivity .__classVersion__
## Class: character Versions
##
## Extends:
## Class "MIAxE", directly
## Class "Versioned", by class "MIAxE", distance 2
Spectrum
et al.: classes for MS spectraSpectrum
is a virtual class that defines common attributes to all
types of spectra. MS1 and MS2 specific attributes are defined in the
Spectrum1
and Spectrum2
classes, that directly extend Spectrum
.
getClass("Spectrum")
## Virtual Class "Spectrum" [package "MSnbase"]
##
## Slots:
##
## Name: msLevel peaksCount rt acquisitionNum
## Class: integer integer numeric integer
##
## Name: scanIndex tic mz intensity
## Class: integer numeric numeric numeric
##
## Name: fromFile centroided smoothed polarity
## Class: integer logical logical integer
##
## Name: .__classVersion__
## Class: Versions
##
## Extends: "Versioned"
##
## Known Subclasses: "Spectrum2", "Spectrum1"
getClass("Spectrum1")
## Class "Spectrum1" [package "MSnbase"]
##
## Slots:
##
## Name: msLevel peaksCount rt acquisitionNum
## Class: integer integer numeric integer
##
## Name: scanIndex tic mz intensity
## Class: integer numeric numeric numeric
##
## Name: fromFile centroided smoothed polarity
## Class: integer logical logical integer
##
## Name: .__classVersion__
## Class: Versions
##
## Extends:
## Class "Spectrum", directly
## Class "Versioned", by class "Spectrum", distance 2
getClass("Spectrum2")
## Class "Spectrum2" [package "MSnbase"]
##
## Slots:
##
## Name: merged precScanNum precursorMz
## Class: numeric integer numeric
##
## Name: precursorIntensity precursorCharge collisionEnergy
## Class: numeric integer numeric
##
## Name: msLevel peaksCount rt
## Class: integer integer numeric
##
## Name: acquisitionNum scanIndex tic
## Class: integer integer numeric
##
## Name: mz intensity fromFile
## Class: numeric numeric integer
##
## Name: centroided smoothed polarity
## Class: logical logical integer
##
## Name: .__classVersion__
## Class: Versions
##
## Extends:
## Class "Spectrum", directly
## Class "Versioned", by class "Spectrum", distance 2
Chromatogram
and MChromatograms
: classes to handle chromatographic dataThe Chromatogram
class represents chromatographic MS data, i.e. retention time
and intensity duplets for one file/sample. The MChromatograms
class (Matrix of
Chromatograms) allows to arrange multiple Chromatogram
instances in a
two-dimensional grid, with columns supposed to represent different samples and
rows two-dimensional areas in the plane spanned by the m/z and retention time
dimensions from which the intensities are extracted (e.g. an extracted ion
chromatogram for a specific ion). The MChromatograms
class extends the base
matrix
class. MChromatograms
objects can be extracted from an MSnExp
or
OnDiskMSnExp
object using the chromatogram
method.
getClass("Chromatogram")
## Class "Chromatogram" [package "MSnbase"]
##
## Slots:
##
## Name: rtime intensity mz filterMz
## Class: numeric numeric numeric numeric
##
## Name: precursorMz productMz fromFile aggregationFun
## Class: numeric numeric integer character
##
## Name: msLevel .__classVersion__
## Class: integer Versions
##
## Extends: "Versioned"
getClass("MChromatograms")
## Class "MChromatograms" [package "MSnbase"]
##
## Slots:
##
## Name: .Data phenoData featureData
## Class: matrix AnnotatedDataFrame AnnotatedDataFrame
##
## Extends:
## Class "matrix", from data part
## Class "array", by class "matrix", distance 2
## Class "structure", by class "matrix", distance 3
## Class "matrix_OR_array_OR_table_OR_numeric", by class "matrix", distance 3
## Class "vector", by class "matrix", distance 4, with explicit coerce
## Class "vector_OR_factor", by class "matrix", distance 5, with explicit coerce
## Class "vector_OR_Vector", by class "matrix", distance 5, with explicit coerce
MSnSet
instancesWhen several MSnSet
instances are related to each other and should
be stored together as different objects, they can be grouped as a list
into and MSnSetList
object. In addition to the actual list
slot,
this class also has basic logging functionality and enables iteration
over the MSnSet
instances using a dedicated lapply
methods.
getClass("MSnSetList")
## Class "MSnSetList" [package "MSnbase"]
##
## Slots:
##
## Name: x log featureData .__classVersion__
## Class: list list DataFrame Versions
##
## Extends: "Versioned"
Methods that process raw data, i.e. spectra should be implemented for
Spectrum
objects first and then eapply
ed (or similar) to the
assayData
slot of an MSnExp
instance in the specific method.
sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] microbenchmark_1.5.0 pryr_0.1.6 magrittr_2.0.3
## [4] gplots_3.2.0 msdata_0.45.0 pRoloc_1.47.0
## [7] BiocParallel_1.41.0 MLInterfaces_1.87.0 cluster_2.1.6
## [10] annotate_1.85.0 XML_3.99-0.17 AnnotationDbi_1.69.0
## [13] IRanges_2.41.0 pRolocdata_1.43.3 Rdisop_1.67.0
## [16] zoo_1.8-12 MSnbase_2.33.0 ProtGenerics_1.39.0
## [19] S4Vectors_0.45.0 mzR_2.41.0 Rcpp_1.0.13
## [22] Biobase_2.67.0 BiocGenerics_0.53.0 ggplot2_3.5.1
## [25] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] splines_4.5.0 bitops_1.0-9
## [3] filelock_1.0.3 tibble_3.2.1
## [5] hardhat_1.4.0 preprocessCore_1.69.0
## [7] pROC_1.18.5 rpart_4.1.23
## [9] lifecycle_1.0.4 httr2_1.0.5
## [11] doParallel_1.0.17 globals_0.16.3
## [13] lattice_0.22-6 MASS_7.3-61
## [15] MultiAssayExperiment_1.33.0 dendextend_1.18.1
## [17] limma_3.63.0 plotly_4.10.4
## [19] sass_0.4.9 rmarkdown_2.28
## [21] jquerylib_0.1.4 yaml_2.3.10
## [23] lobstr_1.1.2 MsCoreUtils_1.19.0
## [25] DBI_1.2.3 RColorBrewer_1.1-3
## [27] lubridate_1.9.3 multcomp_1.4-26
## [29] abind_1.4-8 zlibbioc_1.53.0
## [31] GenomicRanges_1.59.0 purrr_1.0.2
## [33] mixtools_2.0.0 AnnotationFilter_1.31.0
## [35] TH.data_1.1-2 nnet_7.3-19
## [37] sandwich_3.1-1 rappdirs_0.3.3
## [39] ipred_0.9-15 lava_1.8.0
## [41] GenomeInfoDbData_1.2.13 listenv_0.9.1
## [43] parallelly_1.38.0 ncdf4_1.23
## [45] codetools_0.2-20 DelayedArray_0.33.0
## [47] xml2_1.3.6 tidyselect_1.2.1
## [49] farver_2.1.2 UCSC.utils_1.3.0
## [51] viridis_0.6.5 matrixStats_1.4.1
## [53] BiocFileCache_2.15.0 jsonlite_1.8.9
## [55] caret_6.0-94 e1071_1.7-16
## [57] survival_3.7-0 iterators_1.0.14
## [59] foreach_1.5.2 segmented_2.1-3
## [61] tools_4.5.0 progress_1.2.3
## [63] glue_1.8.0 prodlim_2024.06.25
## [65] gridExtra_2.3 SparseArray_1.7.0
## [67] mgcv_1.9-1 xfun_0.48
## [69] MatrixGenerics_1.19.0 GenomeInfoDb_1.43.0
## [71] dplyr_1.1.4 withr_3.0.2
## [73] BiocManager_1.30.25 fastmap_1.2.0
## [75] fansi_1.0.6 caTools_1.18.3
## [77] digest_0.6.37 timechange_0.3.0
## [79] R6_2.5.1 colorspace_2.1-1
## [81] gtools_3.9.5 lpSolve_5.6.21
## [83] biomaRt_2.63.0 RSQLite_2.3.7
## [85] utf8_1.2.4 tidyr_1.3.1
## [87] generics_0.1.3 hexbin_1.28.4
## [89] data.table_1.16.2 recipes_1.1.0
## [91] FNN_1.1.4.1 class_7.3-22
## [93] prettyunits_1.2.0 PSMatch_1.11.0
## [95] httr_1.4.7 htmlwidgets_1.6.4
## [97] S4Arrays_1.7.0 ModelMetrics_1.2.2.2
## [99] pkgconfig_2.0.3 gtable_0.3.6
## [101] timeDate_4041.110 blob_1.2.4
## [103] impute_1.81.0 XVector_0.47.0
## [105] htmltools_0.5.8.1 bookdown_0.41
## [107] MALDIquant_1.22.3 clue_0.3-65
## [109] scales_1.3.0 png_0.1-8
## [111] gower_1.0.1 knitr_1.48
## [113] reshape2_1.4.4 coda_0.19-4.1
## [115] nlme_3.1-166 curl_5.2.3
## [117] proxy_0.4-27 cachem_1.1.0
## [119] stringr_1.5.1 KernSmooth_2.23-24
## [121] parallel_4.5.0 mzID_1.45.0
## [123] vsn_3.75.0 pillar_1.9.0
## [125] vctrs_0.6.5 pcaMethods_1.99.0
## [127] randomForest_4.7-1.2 dbplyr_2.5.0
## [129] xtable_1.8-4 evaluate_1.0.1
## [131] magick_2.8.5 tinytex_0.53
## [133] mvtnorm_1.3-1 cli_3.6.3
## [135] compiler_4.5.0 rlang_1.1.4
## [137] crayon_1.5.3 future.apply_1.11.3
## [139] labeling_0.4.3 LaplacesDemon_16.1.6
## [141] mclust_6.1.1 QFeatures_1.17.0
## [143] affy_1.85.0 plyr_1.8.9
## [145] stringi_1.8.4 viridisLite_0.4.2
## [147] munsell_0.5.1 Biostrings_2.75.0
## [149] lazyeval_0.2.2 Matrix_1.7-1
## [151] hms_1.1.3 bit64_4.5.2
## [153] future_1.34.0 KEGGREST_1.47.0
## [155] statmod_1.5.0 highr_0.11
## [157] SummarizedExperiment_1.37.0 kernlab_0.9-33
## [159] igraph_2.1.1 memoise_2.0.1
## [161] affyio_1.77.0 bslib_0.8.0
## [163] sampling_2.10 bit_4.5.0