The CEMiTool
R
package provides users with
an easy-to-use method to automatically run gene co-expression analyses.
In addition, it performs gene set enrichment analysis and over
representation analysis for the gene modules returned by the
analysis.
For the most basic usage of CEMiTool
only a
data.frame
containing expression data with gene symbols in
the rows and sample names in the columns is needed, as following:
BiocManager::install("CEMiTool")
library("CEMiTool")
# load expression data
data(expr0)
head(expr0[,1:4])
## X1913_d0 X1913_d3 X1913_d7 X1911_d0
## XIST 13.061894 13.290272 13.360468 13.178729
## DDX3Y 3.410819 3.164874 3.599792 3.400613
## RPS4Y1 6.326861 5.915121 6.341564 5.905167
## USP9Y 3.237749 3.362508 3.320674 3.365530
## CYorf15B 3.980988 4.201731 4.235020 4.046716
## EIF1AY 3.379857 3.229973 3.150274 3.196610
In this usage, the cemitool
function receives the
expression data, performs the co-expression modules analysis and returns
a CEMiTool
object:
cem <- cemitool(expr0)
To see a summary of the slots inside the CEMiTool
, just
call cem
cem
## CEMiTool Object
## - Number of modules: 4
## - Modules: (data.frame: 257x2):
## genes modules
## 1 HBA1 Not.Correlated
## 2 RPS26 Not.Correlated
## 3 LYZ Not.Correlated
## - Expression file: data.frame with 4000 genes and 45 samples
## - Selected data: 257 genes selected
## - Gene Set Enrichment Analysis: null
## - Over Representation Analysis: null
## - Profile plot: ok
## - Enrichment plot: null
## - ORA barplot: null
## - Beta x R2 plot: null
## - Mean connectivity plot: null
The cemitool()
function automatically executes some
common analyses, depending on the input data. The following sections
describes how to perform each of these analyses separately. Details on
how to perform all analyses together are at the end of this
vignette.
As a default, the cemitool
function first performs a
filtering of the gene expression data before running the remaining
analyses. This filtering is done in accordance to gene variance. In this
example the filtering step has reduced the gene number to 257.
The module analysis has produced 4 modules and the allocation of
genes to modules can be seen with the module_genes
function:
# inspect modules
nmodules(cem)
## [1] 4
head(module_genes(cem))
## genes modules
## 1 HBA1 Not.Correlated
## 2 RPS26 Not.Correlated
## 3 LYZ Not.Correlated
## 4 PPBP M3
## 5 abParts35 Not.Correlated
## 6 NAMPT M2
Genes that are allocated to Not.Correlated
are genes
that are not clustered into any module.
If you wish to adjust the module definition parameters of your
CEMiTool
object, use find_modules(cem)
.
You can use the get_hubs
function to identify the top
n
genes with the highest connectivity in each module:
hubs <- get_hubs(cem,n)
. A summary statistic of the
expression data within each module (either the module mean or eigengene)
can be obtained using: summary <- mod_summary(cem)
The information generated by CEMiTool, including tables and images
can be accessed by generating a report of the CEMiTool
object:
generate_report(cem)
Also, you can create tables with the analyses results using:
write_files(cem)
Plots containing analysis results can be saved using:
save_plots(cem, "all")
## $M1
##
## $M2
##
## $M3
##
## $Not.Correlated
##
## $beta_r2_plot
##
## $mean_k_plot
## [[1]]
## png
## 2
##
## [[2]]
## png
## 2
##
## [[3]]
## png
## 2
##
## [[4]]
## png
## 2
##
## [[5]]
## png
## 2
##
## [[6]]
## png
## 2
##
## [[7]]
## png
## 2
More information can be included in CEMiTool to build a more complete
object and generate richer reports about the expression data. Sample
annotation can be supplied in a data.frame
that specifies a
class for each sample. Classes can represent different conditions,
phenotypes, cell lines, time points, etc. In this example, classes are
defined as the time point at which the samples were collected.
# load your sample annotation data
data(sample_annot)
head(sample_annot)
## SampleName Class
## 1 X1913_d0 g0
## 2 X1911_d0 g0
## 3 X1908_d0 g0
## 4 X1909_d0 g0
## 5 X1910_d0 g0
## 6 X1912_d0 g0
Now you can construct a CEMiTool
object with both
expression data and sample annotation:
# run cemitool with sample annotation
cem <- cemitool(expr0, sample_annot)
The sample annotation of your CEMiTool object can be retrieved and
reassigned using the sample_annotation(cem)
function. This
function can also be used to define the columns with sample names and
sample groupings (which are “SampleName” and “Class”, by default):
sample_annotation(cem,
sample_name_column="SampleName",
class_column="Class") <- sample_annot
When sample annotation is provided, the cemitool
function will automatically evaluate how the modules are up or down
regulated between classes. This is performed using the gene set
enrichment analysis function from the fgsea
package.
You can generate a plot of how the enrichment of the modules varies
across classes with the plot_gsea
function. The size and
intensity of the circles in the figure correspond to the Normalised
Enrichment Score (NES), which is the enrichment score for a module in
each class normalised by the number of genes in the module. This
analysis is automatically run by the cemitool
function, but
it can be independently run with the function
mod_gsea(cem)
.
# generate heatmap of gene set enrichment analysis
cem <- mod_gsea(cem)
cem <- plot_gsea(cem)
show_plot(cem, "gsea")
## $enrichment_plot
You can generate a plot that displays the expression of each gene
within a module using the plot_profile
function:
# plot gene expression within each module
cem <- plot_profile(cem)
plots <- show_plot(cem, "profile")
plots[1]
## $M1
CEMiTool can determine which biological functions are associated with the modules by performing an over representation analysis (ORA). To do this you must provide a pathway list in the form of GMT file. CEMiTool will then analyze how these pathways are represented in the modules.
You can read in a pathway list formatted as a GMT file using the
read_gmt function. This example uses a GMT file that comes as part of
the CEMiTool
example data:
# read GMT file
gmt_fname <- system.file("extdata", "pathways.gmt", package = "CEMiTool")
gmt_in <- read_gmt(gmt_fname)
You can then perform ORA analysis on the modules in your
CEMiTool
object with the mod_ora
function:
# perform over representation analysis
cem <- mod_ora(cem, gmt_in)
The numerical results of the analysis can be accessed with the
ora_data
function. In order to visualise this, use
plot_ora
to add ORA plots to your CEMiTool
object. The plots can be accessed with the show_plot
function.
# plot ora results
cem <- plot_ora(cem)
plots <- show_plot(cem, "ora")
plots[1]
## $M1
## $M1$pl
##
## $M1$numsig
## [1] 9
Interaction data, such as protein-protein interactions can be added
to the CEMiTool
object in order to generate annotated
module graphs. Interaction files are formatted as a
data.frame
or matrix
containing two columns
for interacting pairs of genes.
# read interactions
int_fname <- system.file("extdata", "interactions.tsv", package = "CEMiTool")
int_df <- read.delim(int_fname)
head(int_df)
## gene1symbol gene2symbol
## 1 DBH REPIN1
## 2 RBFOX2 HERC5
## 3 ZNF460 CCDC22
## 4 SNRNP40 OAZ3
## 5 SRSF6 OAZ3
## 6 SPTAN1 ARL8A
You can add the interaction data to your CEMiTool
object
using the interactions_data
function and generate the plots
with plot_interactions
. Once again, the plots can be seen
with the show_plot
function:
# plot interactions
library(ggplot2)
interactions_data(cem) <- int_df # add interactions
cem <- plot_interactions(cem) # generate plot
plots <- show_plot(cem, "interaction") # view the plot for the first module
plots[1]
## $M1
Finally, a CEMiTool
object with all of the components
mentioned above can also be constructed using just the
cemitool
function:
# run cemitool
library(ggplot2)
cem <- cemitool(expr0, sample_annot, gmt_in, interactions=int_df,
filter=TRUE, plot=TRUE, verbose=TRUE)
# create report as html document
generate_report(cem, directory="./Report")
# write analysis results into files
write_files(cem, directory="./Tables")
# save all plots
save_plots(cem, "all", directory="./Plots")
Sometimes, CEMiTool analyses can fail, usually due to problems in the
input data. We provide a function, diagnostic_report
which
aims to try to assist in resolving these issues.
diagnostic_report(cem)
The function will return six different plots inside the report, all of which can be used to assess problems in the data. We will briefly discuss how each plot can be used to evaluate data problems.
This plot aims to show if there are closely related groups within samples. If a sample annotation file is given, the plot will show different colors for each sample group, and any numerical data given in other columns as a heatmap. This information can be used in order to see how homogeneous/heterogeneous the input data are. Highly heterogeneous sample groups may be the cause of batch effects, which should be removed.
In this plot, the mean and the variance of each gene in the
expression file is plotted as the x and y coordinates of the
scatterplot, and a line is plotted in order to show the relationship
between the two. Particularly for RNAseq data, if a strong R-squared
value is found, one should set the apply_vst
argument in
the cemitool()
function to TRUE
in order to
remove this correspondance.
These plots are intended to highlight the distribution of expression
values. A Q-Q plot is a mathematical approach to determine if data
possibly arose from a theoretical distribution such as the normal
distribution. This information can be used to guide the selection of an
appropriate correlation coeffiecient for analyses, which can be changed
via the cor_method
argument of the cemitool()
function. Currently accepted coefficients are “pearson” and
“spearman”.
These plots can only be generated if the
diagnostic_report()
function is run after the
cemitool()
function. The Beta x R2 plot is used to
visualize the selection of the soft-thresholding parameter Beta and its
corresponding adherence to the scale-free topology model. Selected Beta
values will be shown in red, unless NA. The Mean connectivity plot is
intended to show the tradeoff between the network’s underlying
connectivity and a higher adherence to the scale-free topology model
(via higher values of the soft-threshold Beta parameter).