Short DNA sequence motifs provide key information for interpreting the instructions in DNA, for example by providing binding sites for proteins or altering the structure of the double-helix. A less studied but important feature of DNA sequence motifs is their periodicity. A famous example is the 10-bp periodicity of many k-mers in nucleosome positioning (reviewed in Travers et al. 2010 or in Struhl and Segal 2013).
periodicDNA provides a framework to quantify the periodicity of k-mers of interest in DNA sequences. For a chosen k-mer, periodicDNA can identify which periods are statistically enriched in a set of sequences, by using a randomized shuffling approach to compute an empirical p-value. It can also generate continuous linear tracks of k-mer periodicity strength over genomic loci.
To estimate the periodicity strength of a given k-mer in one or several sequences, periodicDNA performs the following steps:
The main goal of periodicDNA is to quantify periodicity of a given
k-mer in a set of sequences. For instance, one can assess the
periodicity of TT dinucleotides in sequences around TSSs of ubiquitous
promoters using getPeriodicity()
.
In the following example, getPeriodicity()
is directly
ran using a GRanges object, specifying from which genome this GRanges
comes from.
library(ggplot2)
library(magrittr)
library(periodicDNA)
#
data(ce11_TSSs)
periodicity_result <- getPeriodicity(
ce11_TSSs[['Ubiq.']][1:500],
genome = 'BSgenome.Celegans.UCSC.ce11',
motif = 'TT',
BPPARAM = setUpBPPARAM(1)
)
#> - Mapping k-mers.
#> - 523903 pairwise distances measured.
#> - Calculating pairwise distance distribution.
#> - Normalizing distogram vector.
#> - Applying Fast Fourier Transform to the normalized distogram.
The main output of getPeriodicity()
is a table of power
spectral density (PSD) values associated with discrete frequencies,
computed using a Fast Fourier Transform. For a given frequency, a high
PSD score indicates a high periodicity of the k-mer of interest.
In the previous example, TT dinucleotides in sequences around TSSs of ubiquitous promoters are highly periodic, with a periodicity of 10 bp.
head(periodicity_result$PSD)
#> freq period PSD
#> 1 0.005 200.00000 6.256976e-08
#> 2 0.010 100.00000 2.204282e-08
#> 3 0.015 66.66667 2.215522e-09
#> 4 0.020 50.00000 1.108237e-08
#> 5 0.025 40.00000 4.649689e-09
#> 6 0.030 33.33333 2.661198e-08
subset(periodicity_result$periodicityMetrics, Period == 10)
#> Freq Period PSD
#> 20 0.1 10 3.633071e-06
Graphical output of getPeriodicity()
can be obtained
using the plotPeriodicityResults()
function:
The first plot shows the raw distribution of distances between pairs of ‘TT’ in the sequences of the provided GRanges. The second plot shows the decay-normalised distribution. Finally, the third plot shows the PSD scores of the ‘TT’ k-mer, measured from the normalised distribution.
periodicDNA provides an approach to compare the periodicity of a
given k-mer in a set of sequences compared to background. For a given
k-mer at a period T in a set of input sequences, the fold-change over
background of its PSD is estimated by iteratively shuffling the input
sequences and estimating the resulting PSD values.
Eventually, the log2 fold-change (l2FC) between the observed PSD and the
median of the PSD values measured after shuffling is computed as
follows:
l2FC = log2(observed PSD / median(shuffled PSDs)).
periodicity_result <- getPeriodicity(
ce11_TSSs[['Ubiq.']][1:500],
genome = 'BSgenome.Celegans.UCSC.ce11',
motif = 'TT',
n_shuffling = 5
)
#> - Calculating observed PSD
#> - Mapping k-mers.
#> - 523903 pairwise distances measured.
#> - Calculating pairwise distance distribution.
#> - Normalizing distogram vector.
#> - Applying Fast Fourier Transform to the normalized distogram.
#> - Shuffling 1/5
#> - Shuffling 2/5
#> - Shuffling 3/5
#> - Shuffling 4/5
#> - Shuffling 5/5
#> Only 5 shufflings. Cannot compute accurate empirical p-values. To compute empirical p-values, set up n_shuffling to at least 100. Only l2FC values are returned
head(periodicity_result$periodicityMetrics)
#> Freq Period PSD_observed l2FC pval fdr
#> 1 0.005 200.00000 6.26e-08 -1.25607689 NA NA
#> 2 0.010 100.00000 2.20e-08 -0.61529441 NA NA
#> 3 0.015 66.66667 2.22e-09 -1.72189601 NA NA
#> 4 0.020 50.00000 1.11e-08 -0.06968687 NA NA
#> 5 0.025 40.00000 4.65e-09 -1.64472237 NA NA
#> 6 0.030 33.33333 2.66e-08 2.03437616 NA NA
subset(periodicity_result$periodicityMetrics, Period == 10)
#> Freq Period PSD_observed l2FC pval fdr
#> 20 0.1 10 3.63e-06 7.680109 NA NA
plotPeriodicityResults(periodicity_result)
#> Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
#> of ggplot2 3.3.4.
#> ℹ The deprecated feature was likely used in the periodicDNA package.
#> Please report the issue at <https://github.com/js2264/periodicDNA/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
If n_shuffling >= 100
, an associated empirical
p-value is calculated as well (North et al 2002). This metric indicates,
for each individual period T, whether the observed PSD is significantly
greater than the PSD values measured after shuffling the input
sequences. Note that empirical p-values are only an estimation of the
real p-value. Notably, small p-values are systematically under-estimated
(North et al 2002).
getPeriodicity()
can also be ran directly on a set of
sequences of interest as follows:
data(ce11_proms_seqs)
periodicity_result <- getPeriodicity(
ce11_proms_seqs,
motif = 'TT',
BPPARAM = setUpBPPARAM(1)
)
#> - Mapping k-mers.
#> - 117630 pairwise distances measured.
#> - Calculating pairwise distance distribution.
#> - Normalizing distogram vector.
#> - Applying Fast Fourier Transform to the normalized distogram.
subset(periodicity_result$periodicityMetrics, Period == 10)
#> Freq Period PSD
#> 20 0.1 10 1.16806e-06
The other aim of periodicDNA is to generate continuous linear tracks
of k-mer periodicity strength over genomic loci of interest.
getPeriodicityTrack()
can be used to generate suck genomic
tracks. In the following example,
WW_10bp_track <- getPeriodicityTrack(
genome = 'BSgenome.Celegans.UCSC.ce11',
granges = ce11_proms,
motif = 'WW',
period = 10,
BPPARAM = setUpBPPARAM(1),
bw_file = 'WW-10-bp-periodicity_over-proms.bw'
)
When plotted over sets of ubiquitous, germline or somatic TSSs, the resulting track clearly shows increase of WW 10-bp periodicity above the ubiquitous and germline TSSs, whereas somatic TSSs do not show such increase.
sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] BSgenome.Celegans.UCSC.ce11_1.4.2 periodicDNA_1.17.0
#> [3] BiocParallel_1.41.0 BSgenome_1.75.0
#> [5] rtracklayer_1.67.0 BiocIO_1.17.0
#> [7] Biostrings_2.75.0 XVector_0.47.0
#> [9] magrittr_2.0.3 ggplot2_3.5.1
#> [11] GenomicRanges_1.59.0 GenomeInfoDb_1.43.0
#> [13] IRanges_2.41.0 S4Vectors_0.45.0
#> [15] BiocGenerics_0.53.0
#>
#> loaded via a namespace (and not attached):
#> [1] SummarizedExperiment_1.37.0 gtable_0.3.6
#> [3] rjson_0.2.23 xfun_0.48
#> [5] bslib_0.8.0 Biobase_2.67.0
#> [7] lattice_0.22-6 vctrs_0.6.5
#> [9] tools_4.5.0 bitops_1.0-9
#> [11] generics_0.1.3 curl_5.2.3
#> [13] parallel_4.5.0 tibble_3.2.1
#> [15] fansi_1.0.6 highr_0.11
#> [17] pkgconfig_2.0.3 Matrix_1.7-1
#> [19] lifecycle_1.0.4 GenomeInfoDbData_1.2.13
#> [21] farver_2.1.2 compiler_4.5.0
#> [23] Rsamtools_2.23.0 munsell_0.5.1
#> [25] codetools_0.2-20 htmltools_0.5.8.1
#> [27] sass_0.4.9 RCurl_1.98-1.16
#> [29] yaml_2.3.10 pillar_1.9.0
#> [31] crayon_1.5.3 jquerylib_0.1.4
#> [33] cachem_1.1.0 DelayedArray_0.33.0
#> [35] abind_1.4-8 tidyselect_1.2.1
#> [37] digest_0.6.37 dplyr_1.1.4
#> [39] restfulr_0.0.15 labeling_0.4.3
#> [41] cowplot_1.1.3 fastmap_1.2.0
#> [43] grid_4.5.0 SparseArray_1.7.0
#> [45] colorspace_2.1-1 cli_3.6.3
#> [47] S4Arrays_1.7.0 XML_3.99-0.17
#> [49] utf8_1.2.4 withr_3.0.2
#> [51] scales_1.3.0 UCSC.utils_1.3.0
#> [53] rmarkdown_2.28 httr_1.4.7
#> [55] matrixStats_1.4.1 zoo_1.8-12
#> [57] evaluate_1.0.1 knitr_1.48
#> [59] rlang_1.1.4 glue_1.8.0
#> [61] jsonlite_1.8.9 R6_2.5.1
#> [63] MatrixGenerics_1.19.0 GenomicAlignments_1.43.0
#> [65] zlibbioc_1.53.0