The waddR package

Introduction

The waddR package offers statistical tests based on the 2-Wasserstein distance for detecting and characterizing differences between two distributions given in the form of samples. Functions for calculating the 2-Wasserstein distance and testing for differential distributions are provided, as well as a specifically tailored test for differential expression in single-cell RNA sequencing data.

waddR provides tools to address the following tasks, each described in a separate vignette:

Calculation of the 2-Wasserstein distance,
Two-sample tests to check for differences between two distributions,
Detection of differential gene expression distributions in single-cell RNA sequencing (scRNAseq) data.

These are bundled into one package because they depend on each other internally: the procedure for detecting differential distributions in scRNAseq data is an adaptation of the general two-sample test, which itself uses the 2-Wasserstein distance to compare two distributions.

2-Wasserstein distance functions

The 2-Wasserstein distance is a metric that describes the distance between two distributions, representing e.g. two different conditions \(A\) and \(B\). The waddR package specifically considers the squared 2-Wasserstein distance, which can be decomposed into location, size, and shape terms, thus providing a characterization of potential differences.
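For orientation, the decomposition can be written out as follows. The notation is an assumption introduced here and is not taken from the package documentation: \(F_A^{-1}\) and \(F_B^{-1}\) denote the quantile functions of the two distributions, \(\mu\) and \(\sigma\) their means and standard deviations, and \(\rho^{A,B}\) the Pearson correlation between the quantiles of \(F_A\) and \(F_B\) (the points of a quantile-quantile plot). Finite second moments are assumed.

\[
d^2(F_A, F_B)
  = \int_0^1 \left( F_A^{-1}(u) - F_B^{-1}(u) \right)^2 \, du
  = \underbrace{(\mu_A - \mu_B)^2}_{\text{location}}
  + \underbrace{(\sigma_A - \sigma_B)^2}_{\text{size}}
  + \underbrace{2 \sigma_A \sigma_B \left( 1 - \rho^{A,B} \right)}_{\text{shape}}
\]

Read this way, a pure shift between conditions shows up only in the location term, a pure change in spread only in the size term, and any remaining difference in the form of the distributions in the shape term.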

The waddR package offers three functions to calculate the (squared) 2-Wasserstein distance, which are implemented in C++ and exported to R with Rcpp for faster computation. The function wasserstein_metric is a C++ reimplementation of the wasserstein1d function from the R package transport. The functions squared_wass_approx and squared_wass_decomp compute approximations of the squared 2-Wasserstein distance, with squared_wass_decomp also returning the decomposition terms for location, size, and shape.

See ?wasserstein_metric, ?squared_wass_approx, and ?squared_wass_decomp for more details.
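To make the decomposition concrete, here is a minimal base-R sketch that approximates the squared 2-Wasserstein distance on a grid of quantiles and splits it into location, size, and shape terms. It is an illustration only and not the package's C++ implementation: the helper squared_wass_sketch, its n_grid argument, and the names of the returned list elements are hypothetical. The guarded calls at the end assume that the exported functions accept the two samples as their first two arguments and that wasserstein_metric takes a p argument like wasserstein1d in transport; check the help pages before relying on that.

```r
# Quantile-based sketch of the squared 2-Wasserstein distance and its
# decomposition (illustrative only; not the package's C++ code).
squared_wass_sketch <- function(a, b, n_grid = 1000) {
    # Evaluate both empirical quantile functions on a common grid of
    # equidistant probabilities (grid midpoints avoid 0 and 1).
    u <- (seq_len(n_grid) - 0.5) / n_grid
    qa <- quantile(a, probs = u, names = FALSE)
    qb <- quantile(b, probs = u, names = FALSE)

    # Population (1/n) standard deviations make the three terms add up
    # exactly to the mean squared quantile difference.
    pop_sd <- function(q) sqrt(mean((q - mean(q))^2))
    sa <- pop_sd(qa)
    sb <- pop_sd(qb)

    list(distance = mean((qa - qb)^2),
         location = (mean(qa) - mean(qb))^2,
         size     = (sa - sb)^2,
         shape    = 2 * sa * sb * (1 - cor(qa, qb)))
}

set.seed(24)
x <- rnorm(100, mean = 0, sd = 1)
y <- rnorm(150, mean = 1, sd = 2)
res <- squared_wass_sketch(x, y)
print(unlist(res))

# With waddR installed, the exported functions can be called on the same
# samples (argument usage assumed, see the lead-in).
if (requireNamespace("waddR", quietly = TRUE)) {
    print(waddR::wasserstein_metric(x, y, p = 2))
    print(waddR::squared_wass_approx(x, y))
    print(waddR::squared_wass_decomp(x, y))
}
```

In this simulated example the two samples differ in both mean and spread, so the location and size terms should account for most of the distance, with only a small shape term.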

Testing for differences between two distributions

The waddR package provides two testing procedures using the 2-Wasserstein distance to test whether two distributions \(F_A\) and \(F_B\), given in the form of samples, are different by testing the null hypothesis \(H_0: F_A = F_B\) against the alternative hypothesis \(H_1: F_A \neq F_B\).

The first, semi-parametric (SP), procedure uses a permutation-based test combined with a generalized Pareto distribution approximation to estimate small p-values accurately.

The second procedure uses a test based on asymptotic theory (ASY), which is valid only if the samples can be assumed to come from continuous distributions.

See ?wasserstein.test for more details.
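The permutation idea behind the semi-parametric procedure can be sketched in base R as follows. This is a plain permutation test under stated assumptions, not the package's procedure: it omits the generalized Pareto approximation, and the helpers wass_stat and perm_test_sketch are hypothetical names introduced here. The guarded call at the end assumes that wasserstein.test accepts method and permnum arguments; see ?wasserstein.test to confirm.

```r
# Test statistic: quantile-based approximation of the squared
# 2-Wasserstein distance between samples a and b.
wass_stat <- function(a, b, n_grid = 1000) {
    u <- (seq_len(n_grid) - 0.5) / n_grid
    mean((quantile(a, u, names = FALSE) - quantile(b, u, names = FALSE))^2)
}

perm_test_sketch <- function(a, b, permnum = 1000) {
    observed <- wass_stat(a, b)
    pooled <- c(a, b)
    n_a <- length(a)
    # Under H0 the group labels are exchangeable, so the statistic is
    # recomputed on random relabelings of the pooled sample.
    perm_stats <- replicate(permnum, {
        idx <- sample(length(pooled), n_a)
        wass_stat(pooled[idx], pooled[-idx])
    })
    # Add-one correction keeps the estimate away from exactly zero.
    pval <- (sum(perm_stats >= observed) + 1) / (permnum + 1)
    list(statistic = observed, pval = pval)
}

set.seed(24)
x <- rnorm(100)
y_same <- rnorm(150)               # same distribution as x
y_shift <- rnorm(150, mean = 1.5)  # shifted distribution

p_same <- perm_test_sketch(x, y_same, permnum = 500)$pval
p_shift <- perm_test_sketch(x, y_shift, permnum = 500)$pval
print(c(same = p_same, shift = p_shift))

# With waddR installed (argument names assumed, see the lead-in):
if (requireNamespace("waddR", quietly = TRUE)) {
    print(waddR::wasserstein.test(x, y_shift, method = "SP", permnum = 1000))
}
```

With this estimator the smallest attainable p-value is 1 / (permnum + 1), which is why a purely permutation-based test needs many permutations to resolve small p-values and why a tail approximation is useful.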

Testing for differences between two distributions in the context of scRNAseq data

The waddR package provides an adaptation of the semi-parametric testing procedure based on the 2-Wasserstein distance that is specifically tailored to identify differential distributions in scRNAseq data. In particular, a two-stage (TS) approach is implemented that accounts for the specific nature of scRNAseq data by testing two things separately between two conditions: differential proportions of zero gene expression (using a logistic regression model) and differences in non-zero gene expression (using the semi-parametric 2-Wasserstein distance-based test).

See ?wasserstein.sc and ?testZeroes for more details.
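The two-stage logic can be illustrated for a single gene with a short base-R sketch. Several things here are assumptions: the data are simulated, the first stage uses an ordinary glm with a likelihood-ratio test as a stand-in for whatever logistic regression model the package fits, and the guarded second-stage call assumes that wasserstein.test accepts method and permnum arguments. For real data, wasserstein.sc and testZeroes are the functions to use.

```r
set.seed(24)
# Simulated expression of one gene in two conditions (hypothetical data):
# condition B has more zeroes and a shifted non-zero distribution.
expr_a <- ifelse(runif(200) < 0.2, 0, rlnorm(200, meanlog = 1))
expr_b <- ifelse(runif(200) < 0.5, 0, rlnorm(200, meanlog = 1.5))

# Stage 1: differential proportions of zero expression, via a
# likelihood-ratio test on a logistic regression of the zero indicator.
dat <- data.frame(
    is_zero = as.integer(c(expr_a, expr_b) == 0),
    condition = factor(rep(c("A", "B"), each = 200))
)
fit_null <- glm(is_zero ~ 1, family = binomial, data = dat)
fit_full <- glm(is_zero ~ condition, family = binomial, data = dat)
p_zero <- anova(fit_null, fit_full, test = "Chisq")[2, "Pr(>Chi)"]
print(p_zero)

# Stage 2: only the non-zero values enter the distributional test.
nonzero_a <- expr_a[expr_a > 0]
nonzero_b <- expr_b[expr_b > 0]
print(c(n_nonzero_a = length(nonzero_a), n_nonzero_b = length(nonzero_b)))

# With waddR installed (argument names assumed, see the lead-in):
if (requireNamespace("waddR", quietly = TRUE)) {
    print(waddR::wasserstein.test(nonzero_a, nonzero_b,
                                  method = "SP", permnum = 1000))
}
```

Separating the two stages keeps the excess zeroes typical of scRNAseq data from dominating the comparison of the expressed values.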

Installation

To install waddR from Bioconductor, use BiocManager with the following commands:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("waddR")

Using BiocManager, the package can also be installed from GitHub directly:

BiocManager::install("goncalves-lab/waddR")

The package waddR can then be used in R:

library("waddR")

Session info

sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] waddR_1.21.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.37.0 xfun_0.48                  
#>  [3] bslib_0.8.0                 Biobase_2.67.0             
#>  [5] lattice_0.22-6              vctrs_0.6.5                
#>  [7] tools_4.5.0                 generics_0.1.3             
#>  [9] stats4_4.5.0                curl_5.2.3                 
#> [11] parallel_4.5.0              tibble_3.2.1               
#> [13] fansi_1.0.6                 RSQLite_2.3.7              
#> [15] blob_1.2.4                  pkgconfig_2.0.3            
#> [17] Matrix_1.7-1                arm_1.14-4                 
#> [19] dbplyr_2.5.0                S4Vectors_0.45.0           
#> [21] lifecycle_1.0.4             GenomeInfoDbData_1.2.13    
#> [23] compiler_4.5.0              codetools_0.2-20           
#> [25] eva_0.2.6                   GenomeInfoDb_1.43.0        
#> [27] htmltools_0.5.8.1           sass_0.4.9                 
#> [29] yaml_2.3.10                 nloptr_2.1.1               
#> [31] pillar_1.9.0                crayon_1.5.3               
#> [33] jquerylib_0.1.4             MASS_7.3-61                
#> [35] BiocParallel_1.41.0         SingleCellExperiment_1.29.0
#> [37] DelayedArray_0.33.0         cachem_1.1.0               
#> [39] boot_1.3-31                 abind_1.4-8                
#> [41] nlme_3.1-166                tidyselect_1.2.1           
#> [43] digest_0.6.37               purrr_1.0.2                
#> [45] dplyr_1.1.4                 splines_4.5.0              
#> [47] fastmap_1.2.0               grid_4.5.0                 
#> [49] SparseArray_1.7.0           cli_3.6.3                  
#> [51] magrittr_2.0.3              S4Arrays_1.7.0             
#> [53] utf8_1.2.4                  withr_3.0.2                
#> [55] filelock_1.0.3              UCSC.utils_1.3.0           
#> [57] bit64_4.5.2                 rmarkdown_2.28             
#> [59] XVector_0.47.0              httr_1.4.7                 
#> [61] matrixStats_1.4.1           lme4_1.1-35.5              
#> [63] bit_4.5.0                   coda_0.19-4.1              
#> [65] memoise_2.0.1               evaluate_1.0.1             
#> [67] knitr_1.48                  GenomicRanges_1.59.0       
#> [69] IRanges_2.41.0              BiocFileCache_2.15.0       
#> [71] rlang_1.1.4                 Rcpp_1.0.13                
#> [73] glue_1.8.0                  DBI_1.2.3                  
#> [75] BiocGenerics_0.53.0         minqa_1.2.8                
#> [77] jsonlite_1.8.9              R6_2.5.1                   
#> [79] MatrixGenerics_1.19.0       zlibbioc_1.53.0