Improving NCBI GEO submissions of scRNA-seq data

Code and full report

https://github.com/rnabioco/someta

The problem

To improve reproducibility and usability, we suggest the Gene Expression Omnibus (GEO) improve its submission guidelines for processed data stemming from single-cell RNA sequencing.

Current GEO guidelines require that a count matrix be included with a submission. However, because the current GEO requirements surrounding supplemental files are vague, many studies do not include downstream analysis results (e.g., cluster assignments, UMAP/t-SNE projection coordinates, and cell type classification metadata). These results cannot simply be regenerated from the counts, because they vary with algorithm parameter settings and non-deterministic steps in the processing pipeline.

For single-cell RNA-seq, cell type assignments are not labeled by sample names, as would be the case in bulk RNA-seq data. Hence, a count matrix is simply an intermediate file format, similar to other intermediate formats for bulk RNA-seq experiments (BAM/SAM/BED), which also do not enable reproduction of conclusions. This is counter to GEO’s own stance on processed data, “defined as the data on which the conclusions in the related manuscript are based”. The lack of a clear GEO guideline leads to reproducibility issues, as studies that only provide count matrices require significant domain-specific expertise to associate per-cell mRNA counts with a cell type or phenotype.

We suggest that in addition to count matrices, a metadata table containing relevant information (e.g., sample, clustering, cell-type, pseudo-time information) should be a required component of single-cell RNA-seq submissions. These metadata tables are already generated by commonly used single-cell RNA-seq analysis workflows (Seurat, SingleCellExperiment, scanpy) and could easily be included with other submission components.
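As a minimal sketch of how little work this involves, the base R code below writes a per-cell metadata table to a gzipped TSV file and reads it back. The table, its column names, and the output file name are invented for illustration. In a real analysis the table would come from the analysis object itself, for example `as.data.frame(colData(sce))` for a SingleCellExperiment, `seurat_obj@meta.data` for a Seurat object, or `adata.obs` in scanpy.

```r
# Toy per-cell metadata (hypothetical values and column names).
# In practice this data.frame would be extracted from the analysis object,
# e.g. as.data.frame(colData(sce)) or seurat_obj@meta.data.
set.seed(1)
n_cells <- 6
cell_metadata <- data.frame(
  cell_id   = paste0("cell_", seq_len(n_cells)),
  sample    = rep(c("sampleA", "sampleB"), each = 3),
  cluster   = c(0, 0, 1, 1, 2, 2),
  cell_type = c("T cell", "T cell", "B cell", "B cell", "Monocyte", "Monocyte"),
  UMAP_1    = round(rnorm(n_cells), 3),
  UMAP_2    = round(rnorm(n_cells), 3),
  stringsAsFactors = FALSE
)

# Write a gzip-compressed, tab-separated file suitable for upload
# as a supplementary file alongside the count matrix.
out_file <- file.path(tempdir(), "example_cell_metadata.tsv.gz")
con <- gzfile(out_file, "w")
write.table(cell_metadata, con, sep = "\t", quote = FALSE, row.names = FALSE)
close(con)

# Read the file back to confirm it round-trips (read.delim handles gzip transparently).
roundtrip <- read.delim(out_file, stringsAsFactors = FALSE)
```

One row per cell with an explicit cell identifier column keeps the table joinable to the columns of the count matrix, whichever toolkit produced it.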

For example, we compare the usability and reproducibility of two single-cell mRNA-seq submissions in GEO:

GEO accession GSE137710 contains a metadata file for each sample (e.g., “GSE137710_human_melanoma_cell_metadata_9315x14.tsv.gz”), enabling rapid downstream analyses to find expression patterns and markers for newly described cell types.
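To illustrate the kind of downstream analysis such a metadata file enables, here is a small base R sketch that aligns a metadata table to a genes-by-cells count matrix and averages expression per cell type. The matrix, cell identifiers, and cell type labels are toy values made up for this example and are not taken from GSE137710.

```r
# Toy genes-by-cells count matrix (hypothetical values).
counts <- matrix(
  c(5, 4, 0, 1,
    0, 1, 7, 6,
    2, 2, 3, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("cell_1", "cell_2", "cell_3", "cell_4"))
)

# Toy per-cell metadata, deliberately in a different order than the matrix columns.
meta <- data.frame(
  cell_id   = c("cell_3", "cell_1", "cell_4", "cell_2"),
  cell_type = c("B cell", "T cell", "B cell", "T cell"),
  stringsAsFactors = FALSE
)

# Reorder the metadata rows so they match the matrix columns.
meta <- meta[match(colnames(counts), meta$cell_id), ]
stopifnot(identical(meta$cell_id, colnames(counts)))

# Average expression of each gene within each annotated cell type.
avg_by_type <- sapply(
  split(seq_len(ncol(counts)), meta$cell_type),
  function(idx) rowMeans(counts[, idx, drop = FALSE])
)
```

With the published labels in hand, a per-cell-type expression summary like `avg_by_type` takes a few lines. Without them, the clustering and annotation would have to be redone first.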

In contrast, GEO accession GSE124494 includes only count matrix information, requiring a non-expert to recreate (i.e., guess) the published associations between cell type and gene expression. This is a particularly challenging example because the associated publication used now-outdated versions of Seurat and its integration algorithm, so reproducing the analysis requires even more work. A metadata file was certainly created during the original analysis and could easily have been attached to the submission.

Our proposal

Moving forward, we suggest two remedies to this issue:

  1. Updating the GEO submission guidelines to require metadata file submissions. This would be done on https://www.ncbi.nlm.nih.gov/geo/info/seq.html by updating the “Processed data files” section with a subsection specifically outlining required data types for single-cell mRNA-seq submissions. Suggested wording: “For single-cell mRNA-seq data, in addition to standard count matrices (genes-by-cells), we expect users to deposit metadata annotations generated during the course of analysis.”

  2. Encouraging previous depositors of single-cell mRNA-seq data to update their GEO records with metadata, if it was not included in the original submission.

Session info

sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] SingleCellExperiment_1.29.0 SummarizedExperiment_1.37.0
#>  [3] Biobase_2.67.0              GenomicRanges_1.59.0       
#>  [5] GenomeInfoDb_1.43.0         IRanges_2.41.0             
#>  [7] S4Vectors_0.45.0            BiocGenerics_0.53.0        
#>  [9] MatrixGenerics_1.19.0       matrixStats_1.4.1          
#> [11] cowplot_1.1.3               ggplot2_3.5.1              
#> [13] clustifyr_1.19.0            BiocStyle_2.35.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] rlang_1.1.4             magrittr_2.0.3          clue_0.3-65            
#>  [4] GetoptLong_1.0.5        compiler_4.5.0          png_0.1-8              
#>  [7] vctrs_0.6.5             pkgconfig_2.0.3         shape_1.4.6.1          
#> [10] crayon_1.5.3            fastmap_1.2.0           magick_2.8.5           
#> [13] XVector_0.47.0          labeling_0.4.3          utf8_1.2.4             
#> [16] rmarkdown_2.28          UCSC.utils_1.3.0        purrr_1.0.2            
#> [19] xfun_0.48               zlibbioc_1.53.0         cachem_1.1.0           
#> [22] jsonlite_1.8.9          highr_0.11              DelayedArray_0.33.0    
#> [25] BiocParallel_1.41.0     parallel_4.5.0          cluster_2.1.6          
#> [28] R6_2.5.1                bslib_0.8.0             RColorBrewer_1.1-3     
#> [31] parallelly_1.38.0       jquerylib_0.1.4         Rcpp_1.0.13            
#> [34] bookdown_0.41           iterators_1.0.14        knitr_1.48             
#> [37] future.apply_1.11.3     Matrix_1.7-1            tidyselect_1.2.1       
#> [40] abind_1.4-8             yaml_2.3.10             doParallel_1.0.17      
#> [43] codetools_0.2-20        listenv_0.9.1           lattice_0.22-6         
#> [46] tibble_3.2.1            withr_3.0.2             evaluate_1.0.1         
#> [49] future_1.34.0           circlize_0.4.16         pillar_1.9.0           
#> [52] BiocManager_1.30.25     foreach_1.5.2           generics_0.1.3         
#> [55] sp_2.1-4                munsell_0.5.1           scales_1.3.0           
#> [58] globals_0.16.3          glue_1.8.0              tools_4.5.0            
#> [61] data.table_1.16.2       fgsea_1.33.0            dotCall64_1.2          
#> [64] fastmatch_1.1-4         grid_4.5.0              Cairo_1.6-2            
#> [67] tidyr_1.3.1             colorspace_2.1-1        GenomeInfoDbData_1.2.13
#> [70] cli_3.6.3               spam_2.11-0             fansi_1.0.6            
#> [73] S4Arrays_1.7.0          ComplexHeatmap_2.23.0   dplyr_1.1.4            
#> [76] gtable_0.3.6            sass_0.4.9              digest_0.6.37          
#> [79] progressr_0.15.0        SparseArray_1.7.0       rjson_0.2.23           
#> [82] SeuratObject_5.0.2      farver_2.1.2            entropy_1.3.1          
#> [85] htmltools_0.5.8.1       lifecycle_1.0.4         httr_1.4.7             
#> [88] GlobalOptions_0.1.2