waddR package
The waddR package offers statistical tests based on the
2-Wasserstein distance for detecting and characterizing differences
between two distributions given in the form of samples. Functions for
calculating the 2-Wasserstein distance and testing for differential
distributions are provided, as well as a specifically tailored test for
differential expression in single-cell RNA sequencing data.
waddR provides tools to address the following tasks, each described in a separate vignette:
Two-sample tests to check for differences between two distributions,
Detection of differential gene expression distributions in single-cell RNA sequencing (scRNAseq) data.
These are bundled into one package because they depend on one another internally: the procedure for detecting differential distributions in scRNAseq data is an adaptation of the general two-sample test, which itself uses the 2-Wasserstein distance to compare two distributions.
The 2-Wasserstein distance is a metric that describes the distance
between two distributions, representing e.g. two different conditions
\(A\) and \(B\). The waddR package specifically considers the squared
2-Wasserstein distance, which can be decomposed into location, size, and
shape terms, thus providing a characterization of potential differences.
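To make the decomposition concrete, the following base-R sketch approximates the squared 2-Wasserstein distance between two samples on a grid of equidistant quantile levels and splits it into location, size, and shape terms. It is an illustration under stated assumptions, not the package's implementation: the function name `squared_wass_sketch`, the grid size `n_levels`, and the choice of `quantile(type = 1)` are all made up for this example.

```r
# Sketch: squared 2-Wasserstein distance on a quantile grid, plus its
# decomposition into location, size, and shape terms (hypothetical helper).
squared_wass_sketch <- function(a, b, n_levels = 1000) {
  # Equidistant quantile levels in (0, 1)
  probs <- (seq_len(n_levels) - 0.5) / n_levels
  # Empirical quantile functions of both samples on the common grid
  qa <- quantile(a, probs = probs, type = 1, names = FALSE)
  qb <- quantile(b, probs = probs, type = 1, names = FALSE)

  # Means and (population-style) standard deviations of the quantile vectors
  mu_a <- mean(qa)
  mu_b <- mean(qb)
  sd_a <- sqrt(mean((qa - mu_a)^2))
  sd_b <- sqrt(mean((qb - mu_b)^2))
  # Correlation of the two quantile functions (points of the Q-Q plot)
  rho <- cor(qa, qb)

  c(
    distance = mean((qa - qb)^2),
    location = (mu_a - mu_b)^2,
    size     = (sd_a - sd_b)^2,
    shape    = 2 * sd_a * sd_b * (1 - rho)
  )
}

set.seed(1)
a <- rnorm(200)
b <- rnorm(200, mean = 1, sd = 2)
res <- squared_wass_sketch(a, b)
res
```

On the quantile grid, the three terms add up to the distance, so each term can be read as the share of the difference attributable to a shift in mean, a change in spread, or a change in shape.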
The waddR package offers three functions to calculate
the (squared) 2-Wasserstein distance, which are implemented in C++ and
exported to R with Rcpp for faster computation. The function
wasserstein_metric is a C++ reimplementation of the wasserstein1d
function from the R package transport. The functions squared_wass_approx
and squared_wass_decomp compute approximations of the
squared 2-Wasserstein distance, with squared_wass_decomp
also returning the decomposition terms for location, size, and
shape.
See ?wasserstein_metric, ?squared_wass_approx, and ?squared_wass_decomp for more details.
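A short usage sketch of the three functions on simulated data might look as follows. The sample sizes, means, and seed are arbitrary choices for illustration, and the argument `p = 2` is assumed to select the 2-Wasserstein distance; the help pages remain the authoritative source for signatures and return values. The calls are guarded so that the snippet also runs where waddR is not installed.

```r
set.seed(24)
# Two simulated samples that differ in location (illustrative data)
x <- rnorm(100, mean = 0, sd = 1)
y <- rnorm(100, mean = 2, sd = 1)

if (requireNamespace("waddR", quietly = TRUE)) {
  # 2-Wasserstein distance (not squared)
  d <- waddR::wasserstein_metric(x, y, p = 2)
  # Approximation of the squared 2-Wasserstein distance
  d2 <- waddR::squared_wass_approx(x, y)
  # Same approximation, together with its decomposition terms
  decomp <- waddR::squared_wass_decomp(x, y)

  print(d^2)
  print(d2)
  str(decomp)
}
```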
The waddR package provides two testing procedures using
the 2-Wasserstein distance to test whether two distributions \(F_A\) and \(F_B\), given in the form of samples, are
different by testing the null hypothesis \(H_0: F_A = F_B\) against the alternative
hypothesis \(H_1: F_A \neq F_B\).
The first procedure, which is semi-parametric (SP), uses a permutation-based test combined with a generalized Pareto distribution approximation to estimate small p-values accurately.
The second procedure uses a test based on asymptotic theory (ASY), which is valid only if the samples can be assumed to come from continuous distributions.
See ?wasserstein.test for more details.
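The permutation part of the SP procedure can be sketched in base R as follows. This is a simplified illustration rather than the package's code: the helper names `wass2_stat` and `perm_test_sketch`, the number of permutations, and the `+1` correction in the p-value are assumptions made here, and the generalized Pareto approximation is left out entirely.

```r
# Squared 2-Wasserstein statistic on a common grid of quantile levels
# (hypothetical helper, not part of waddR)
wass2_stat <- function(a, b, n_levels = 1000) {
  probs <- (seq_len(n_levels) - 0.5) / n_levels
  qa <- quantile(a, probs = probs, type = 1, names = FALSE)
  qb <- quantile(b, probs = probs, type = 1, names = FALSE)
  mean((qa - qb)^2)
}

# Permutation p-value for H0: F_A = F_B against H1: F_A != F_B
perm_test_sketch <- function(a, b, n_perm = 500) {
  observed <- wass2_stat(a, b)
  pooled <- c(a, b)
  n_a <- length(a)
  # Re-split the pooled sample at random and recompute the statistic
  permuted <- replicate(n_perm, {
    idx <- sample(length(pooled), n_a)
    wass2_stat(pooled[idx], pooled[-idx])
  })
  # The +1 correction keeps the estimate strictly positive
  p_value <- (1 + sum(permuted >= observed)) / (1 + n_perm)
  list(statistic = observed, p_value = p_value)
}

set.seed(42)
a <- rnorm(60)
b <- rnorm(60, mean = 1.5)
out <- perm_test_sketch(a, b)
out$p_value
```

With this plain permutation scheme the smallest attainable p-value is 1 / (n_perm + 1), which illustrates why a tail approximation is useful when very small p-values need to be estimated.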
The waddR package provides an adaptation of the
semi-parametric testing procedure based on the 2-Wasserstein distance
which is specifically tailored to identify differential distributions in
scRNAseq data. In particular, a two-stage (TS) approach is implemented
that takes account of the specific nature of scRNAseq data by separately
testing for differential proportions of zero gene expression (using a
logistic regression model) and differences in non-zero gene expression
(using the semi-parametric 2-Wasserstein distance-based test) between two
conditions.
See ?wasserstein.sc and ?testZeroes for more details.
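The two-stage idea can be illustrated for a single gene with base R. The sketch below is an assumption-laden simplification: the data are simulated, the variable names are invented for this example, and a plain `glm` likelihood-ratio test stands in for the package's logistic regression model, which may differ in covariates and fitting method.

```r
set.seed(7)
# Simulated expression of one gene in two conditions (hypothetical data):
# condition B has a higher proportion of zero expression than condition A
n <- 150
condition <- factor(rep(c("A", "B"), each = n))
is_expressed <- rbinom(2 * n, 1, prob = rep(c(0.8, 0.5), each = n))
expression <- is_expressed * rexp(2 * n, rate = 0.2)

# Stage 1: test for differential proportions of zero expression
# with a likelihood-ratio test between two logistic regression models
zero <- as.integer(expression == 0)
fit_null <- glm(zero ~ 1, family = binomial())
fit_cond <- glm(zero ~ condition, family = binomial())
lrt <- anova(fit_null, fit_cond, test = "Chisq")
p_zero <- lrt[["Pr(>Chi)"]][2]

# Stage 2 input: non-zero expression values per condition, which would
# then be compared with the 2-Wasserstein distance-based test
nonzero_a <- expression[condition == "A" & expression > 0]
nonzero_b <- expression[condition == "B" & expression > 0]

p_zero
c(length(nonzero_a), length(nonzero_b))
```

Separating the two stages keeps an excess of zeros from dominating the comparison of the non-zero expression distributions.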
To install waddR from Bioconductor, use BiocManager with the following commands:
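The standard Bioconductor installation pattern is shown here as a sketch; it assumes an R version supported by the current Bioconductor release and network access to CRAN and Bioconductor.

```r
# Install BiocManager from CRAN if it is not yet available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
# Install waddR from Bioconductor
BiocManager::install("waddR")
```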
Using BiocManager, the package can also be installed directly from GitHub:
The package waddR can then be loaded in R with library(waddR).
sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] waddR_1.21.0
#>
#> loaded via a namespace (and not attached):
#> [1] SummarizedExperiment_1.37.0 xfun_0.48
#> [3] bslib_0.8.0 Biobase_2.67.0
#> [5] lattice_0.22-6 vctrs_0.6.5
#> [7] tools_4.5.0 generics_0.1.3
#> [9] stats4_4.5.0 curl_5.2.3
#> [11] parallel_4.5.0 tibble_3.2.1
#> [13] fansi_1.0.6 RSQLite_2.3.7
#> [15] blob_1.2.4 pkgconfig_2.0.3
#> [17] Matrix_1.7-1 arm_1.14-4
#> [19] dbplyr_2.5.0 S4Vectors_0.45.0
#> [21] lifecycle_1.0.4 GenomeInfoDbData_1.2.13
#> [23] compiler_4.5.0 codetools_0.2-20
#> [25] eva_0.2.6 GenomeInfoDb_1.43.0
#> [27] htmltools_0.5.8.1 sass_0.4.9
#> [29] yaml_2.3.10 nloptr_2.1.1
#> [31] pillar_1.9.0 crayon_1.5.3
#> [33] jquerylib_0.1.4 MASS_7.3-61
#> [35] BiocParallel_1.41.0 SingleCellExperiment_1.29.0
#> [37] DelayedArray_0.33.0 cachem_1.1.0
#> [39] boot_1.3-31 abind_1.4-8
#> [41] nlme_3.1-166 tidyselect_1.2.1
#> [43] digest_0.6.37 purrr_1.0.2
#> [45] dplyr_1.1.4 splines_4.5.0
#> [47] fastmap_1.2.0 grid_4.5.0
#> [49] SparseArray_1.7.0 cli_3.6.3
#> [51] magrittr_2.0.3 S4Arrays_1.7.0
#> [53] utf8_1.2.4 withr_3.0.2
#> [55] filelock_1.0.3 UCSC.utils_1.3.0
#> [57] bit64_4.5.2 rmarkdown_2.28
#> [59] XVector_0.47.0 httr_1.4.7
#> [61] matrixStats_1.4.1 lme4_1.1-35.5
#> [63] bit_4.5.0 coda_0.19-4.1
#> [65] memoise_2.0.1 evaluate_1.0.1
#> [67] knitr_1.48 GenomicRanges_1.59.0
#> [69] IRanges_2.41.0 BiocFileCache_2.15.0
#> [71] rlang_1.1.4 Rcpp_1.0.13
#> [73] glue_1.8.0 DBI_1.2.3
#> [75] BiocGenerics_0.53.0 minqa_1.2.8
#> [77] jsonlite_1.8.9 R6_2.5.1
#> [79] MatrixGenerics_1.19.0 zlibbioc_1.53.0