library(PRONE)
Here, we are directly working with the SummarizedExperiment data. For more information on how to create the SummarizedExperiment from a proteomics data set, please refer to the “Get Started” vignette.
The example TMT data set originates from Biadglegne et al. (2022).
data("tuberculosis_TMT_se")
se <- tuberculosis_TMT_se
As shown in the Preprocessing tutorial, samples “1.HC_Pool1” and “1.HC_Pool2” were removed from the data set because of their high proportion of missing values (more than 80% NAs per sample). We therefore remove these two samples here before imputing the data.
se <- remove_samples_manually(se, "Label", c("1.HC_Pool1", "1.HC_Pool2"))
#> 2 samples removed.
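The 80% criterion mentioned above can be checked by computing the fraction of NAs per sample. The following is a minimal sketch on a toy matrix with hypothetical values; with real data, the matrix would be extracted from the SummarizedExperiment (for example with SummarizedExperiment::assay()), and the object names toy, na_fraction, and high_na are only illustrative.

```r
# Toy intensity matrix: rows = proteins, columns = samples (hypothetical values)
toy <- cbind(
  S1 = c(10, 11, NA, 12, 13, 14),
  S2 = c(NA, NA, NA, NA, NA, 9),
  S3 = c(12, 13, 14, 15, 16, 17),
  S4 = rep(NA_real_, 6)
)

# Fraction of missing values per sample (column)
na_fraction <- colMeans(is.na(toy))

# Samples with more than 80% missing values are candidates for removal
high_na <- names(na_fraction)[na_fraction > 0.8]
high_na
```

Samples flagged this way could then be passed to remove_samples_manually as shown above.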
Since proteomics data is often affected by missing values and some statistical tests do not tolerate a high amount of missingness, there are two options to reduce the amount of missingness in the data: (1) remove proteins with missing values or (2) impute missing values.
(1): this option is already shown in the Preprocessing tutorial, where we removed proteins with a high amount of missing values using a predefined threshold.
(2): this option is discussed here.
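To illustrate the trade-off behind option (1), the sketch below filters a toy matrix in two ways: keeping only complete proteins, and keeping proteins quantified in at least a given fraction of samples. All values, the threshold of 0.7, and the object names are assumptions made for illustration; this is not the PRONE filtering function itself.

```r
# Toy matrix: rows = proteins, columns = samples (hypothetical values)
toy <- rbind(
  P1 = c(10, 11, 12, 13),
  P2 = c(NA, 11, 12, 13),
  P3 = c(NA, NA, NA, 13),
  P4 = rep(NA_real_, 4)
)
colnames(toy) <- paste0("S", 1:4)

# Fraction of samples in which each protein was quantified
valid_fraction <- rowMeans(!is.na(toy))

# Strict variant: keep only proteins without any missing value
complete_only <- toy[valid_fraction == 1, , drop = FALSE]

# Threshold variant: keep proteins quantified in at least 70% of samples
# (0.7 is an illustrative threshold, not a recommendation)
thr <- 0.7
filtered <- toy[valid_fraction >= thr, , drop = FALSE]

nrow(complete_only)
nrow(filtered)
```

The strict variant discards every protein with a single missing value, which is why imputation is sometimes preferred when many proteins would otherwise be lost.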
Since the initial focus of PRONE was on evaluating the performance of normalization methods, which were selected based on an extensive literature review, the available imputation methods are currently still limited. However, to ensure that PRONE covers all steps of a typical proteomics analysis workflow, we have included a basic imputation method, since in some cases imputation is preferable to removing a large number of proteins.
Currently, only a mixed imputation method is available in PRONE: k-nearest neighbor imputation for proteins with values missing at random and a left-shifted Gaussian distribution for proteins with values missing not at random. Imputation can be performed on a selection of normalized data sets using the “ain” parameter of the impute_se function. The default is to impute all assays (ain = NULL).
se <- impute_se(se, ain = NULL)
#> Condition of SummarizedExperiment used!
#> All assays of the SummarizedExperiment will be used.
#> Imputing along margin 1 (features/rows).
#> Imputing along margin 1 (features/rows).
#> Imputing along margin 1 (features/rows).
#> Imputing along margin 1 (features/rows).
#> Imputing along margin 1 (features/rows).
#> Imputing along margin 1 (features/rows).
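To give an intuition for the left-shifted Gaussian part of the mixed imputation, here is a minimal base R sketch. It is not the implementation used by impute_se: the helper impute_left_shifted, the toy values, and the shift and scale parameters (1.8 and 0.3 standard deviations, values often used for this kind of imputation) are assumptions chosen for illustration. The k-nearest neighbor part is not shown.

```r
set.seed(42)  # make the random draws reproducible

# Toy log2 intensities for one sample (hypothetical values)
x <- c(20.1, 21.4, NA, 19.8, 22.3, NA, 20.9, 21.7)

# Replace NAs with draws from a Gaussian shifted towards low intensities.
# shift and scale are illustrative defaults, not PRONE's settings.
impute_left_shifted <- function(x, shift = 1.8, scale = 0.3) {
  obs_mean <- mean(x, na.rm = TRUE)
  obs_sd <- sd(x, na.rm = TRUE)
  missing <- is.na(x)
  x[missing] <- rnorm(sum(missing),
                      mean = obs_mean - shift * obs_sd,
                      sd = scale * obs_sd)
  x
}

x_imputed <- impute_left_shifted(x)

# Observed values are untouched; only former NAs are filled in
anyNA(x_imputed)
```

The idea is that values missing not at random are typically low-abundance measurements below the detection limit, so the replacements are drawn from the lower tail of the observed distribution.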
ATTENTION:
Please note that imputation can introduce bias and should be used with caution. After imputing your data, have a look at the exploratory data analysis plots (such as boxplots and PCA plots) to check whether the imputation method has skewed the distributions of your samples or introduced biases in your data. These visualization options are already shown in the “Normalization” tutorial.
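Besides the plots, a quick numeric check can hint at such a skew. The sketch below compares per-sample medians before and after imputation on two toy matrices; all values and object names are hypothetical, and with real data the matrices would be the same assay extracted before and after calling impute_se.

```r
# Toy log2 matrices before and after imputation (hypothetical values)
before <- cbind(
  S1 = c(20, 21, NA, 22),
  S2 = c(19, NA, NA, 21)
)
after <- cbind(
  S1 = c(20, 21, 17, 22),
  S2 = c(19, 16, 16.5, 21)
)

# Per-sample medians before and after imputation
med_before <- apply(before, 2, median, na.rm = TRUE)
med_after <- apply(after, 2, median)

# Negative values mean imputation pulled the sample distribution downwards
median_shift <- med_after - med_before
median_shift
```

A sample whose median shifts much more than the others had many imputed values and deserves a closer look in the boxplots and PCA plots.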
utils::sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] PRONE_1.1.0 SummarizedExperiment_1.37.0
#> [3] Biobase_2.67.0 GenomicRanges_1.59.0
#> [5] GenomeInfoDb_1.43.0 IRanges_2.41.0
#> [7] S4Vectors_0.45.0 BiocGenerics_0.53.0
#> [9] MatrixGenerics_1.19.0 matrixStats_1.4.1
#>
#> loaded via a namespace (and not attached):
#> [1] rlang_1.1.4 magrittr_2.0.3
#> [3] GetoptLong_1.0.5 clue_0.3-65
#> [5] compiler_4.5.0 png_0.1-8
#> [7] vctrs_0.6.5 reshape2_1.4.4
#> [9] stringr_1.5.1 ProtGenerics_1.39.0
#> [11] shape_1.4.6.1 pkgconfig_2.0.3
#> [13] crayon_1.5.3 fastmap_1.2.0
#> [15] magick_2.8.5 XVector_0.47.0
#> [17] labeling_0.4.3 utf8_1.2.4
#> [19] rmarkdown_2.28 UCSC.utils_1.3.0
#> [21] preprocessCore_1.69.0 purrr_1.0.2
#> [23] xfun_0.48 MultiAssayExperiment_1.33.0
#> [25] zlibbioc_1.53.0 cachem_1.1.0
#> [27] jsonlite_1.8.9 highr_0.11
#> [29] DelayedArray_0.33.0 BiocParallel_1.41.0
#> [31] parallel_4.5.0 cluster_2.1.6
#> [33] R6_2.5.1 RColorBrewer_1.1-3
#> [35] bslib_0.8.0 stringi_1.8.4
#> [37] ComplexUpset_1.3.3 limma_3.63.0
#> [39] jquerylib_0.1.4 Rcpp_1.0.13
#> [41] bookdown_0.41 iterators_1.0.14
#> [43] knitr_1.48 Matrix_1.7-1
#> [45] igraph_2.1.1 tidyselect_1.2.1
#> [47] abind_1.4-8 yaml_2.3.10
#> [49] doParallel_1.0.17 ggtext_0.1.2
#> [51] codetools_0.2-20 affy_1.85.0
#> [53] lattice_0.22-6 tibble_3.2.1
#> [55] plyr_1.8.9 withr_3.0.2
#> [57] evaluate_1.0.1 xml2_1.3.6
#> [59] circlize_0.4.16 pillar_1.9.0
#> [61] affyio_1.77.0 BiocManager_1.30.25
#> [63] DT_0.33 foreach_1.5.2
#> [65] MSnbase_2.33.0 MALDIquant_1.22.3
#> [67] ncdf4_1.23 generics_0.1.3
#> [69] ggplot2_3.5.1 munsell_0.5.1
#> [71] scales_1.3.0 glue_1.8.0
#> [73] lazyeval_0.2.2 tools_4.5.0
#> [75] data.table_1.16.2 mzID_1.45.0
#> [77] QFeatures_1.17.0 vsn_3.75.0
#> [79] mzR_2.41.0 XML_3.99-0.17
#> [81] Cairo_1.6-2 grid_4.5.0
#> [83] impute_1.81.0 tidyr_1.3.1
#> [85] crosstalk_1.2.1 MsCoreUtils_1.19.0
#> [87] colorspace_2.1-1 patchwork_1.3.0
#> [89] GenomeInfoDbData_1.2.13 PSMatch_1.11.0
#> [91] cli_3.6.3 fansi_1.0.6
#> [93] S4Arrays_1.7.0 ComplexHeatmap_2.23.0
#> [95] dplyr_1.1.4 AnnotationFilter_1.31.0
#> [97] pcaMethods_1.99.0 gtable_0.3.6
#> [99] sass_0.4.9 digest_0.6.37
#> [101] SparseArray_1.7.0 htmlwidgets_1.6.4
#> [103] rjson_0.2.23 farver_2.1.2
#> [105] htmltools_0.5.8.1 lifecycle_1.0.4
#> [107] httr_1.4.7 GlobalOptions_0.1.2
#> [109] statmod_1.5.0 gridtext_0.1.5
#> [111] MASS_7.3-61