library(HDF5Array)
library(edgeR)
library(SummarizedExperiment)
source("./rnaseq_functions.R")
se <- load_atlas_se("E-GEOD-72806")
logcounts <- get_normalized_counts(se)
coldata <- as.data.frame(colData(se)[, c("environmental_stress"), drop = FALSE])10 Principle Component Analysis and Differential Gene Expression
10.1 Dimensionality reduction
We looked at clustering the samples based on the expression of 2 genes, which worked pretty good when this was based on the two most variable genes. With two genes, we could also plot the samples in a 2 dimensional figure and ‘see’ the clusters. However, if we take all genes into account, this will not work anymore: with three genes we reach the limit of the number of dimensions we can plot and that third dimension is already less easy to interpret than the first two. Still, in exploratory data analysis, being able to ‘look’ at your data can be very informative.
A common way to address this is to perform dimensionality reduction, where the data are transformed into a lower dimensional space, while trying to preserve the main characteristics of the original data. Plotting the samples using the two most variable genes could be considered dimensionality reduction (through feature selection). Dimensionality reduction methods can be separated into linear and non-linear approaches. In this session we will look at a specific method for linear dimensionality reduction called principle component analysis or PCA, which works quite well for log-transformed bulk RNA-seq data. If the structure in de data becomes more complex, for instance in single-cell RNA-seq data which can consist of dozens of different cell types that each has its own unique expression profile, non-linear methods like t-SNE and UMAP are popular alternatives. It is good to realize that flattening the data points to only two dimensions will unavoidably lead to a simplification of the overall structure, which may lead to wrong conclusions.
10.2 Principle Component Analysis
To understand the algorithm for doing PCA involves leaning about matrix calculations like eigenvectors and eigenvalues, which is beyond the scope of this course, and fortunately this is not essential for developing a basic understanding of PCA and an intuition to interpret the results.
In this session we continue with the heat and salt stress gene expression data set from https://www.ebi.ac.uk/gxa/experiments/E-GEOD-72806/
First load the proper libraries and filter/transform the data like before
Let’s plot the samples based on the expression three genes in three dimensions. For that we will make use of the plotly library that offers a plethora of interactive graphs. Here we will use the scatter3d plot. First install and load plotly:
if(!requireNamespace("plotly", quietly = TRUE))
install.packages("plotly")
library(plotly)We extend the gene list with a third, and create a data frame with the counts and sample annotation.
genes3 <- c("AT4G27670", "AT1G56600", "AT4G09020")
logcounts_genes3 <- t(logcounts[genes3, , drop = FALSE]) # transpose to have samples as rows and genes as columns
expr_df <- as.data.frame(logcounts_genes3)
expr_df$sample <- rownames(expr_df)
# add sample annotation
expr_df$environmental_stress <- coldata[expr_df$sample, "environmental_stress"]
p3 <- plot_ly(
expr_df,
x = ~ .data[[genes3[1]]],
y = ~ .data[[genes3[2]]],
z = ~ .data[[genes3[3]]],
color = ~ environmental_stress,
colors = "Set2",
type = "scatter3d",
mode = "markers",
marker = list(size = 6)
) %>%
layout(
title = "Samples in 3-gene expression space",
scene = list(
xaxis = list(title = genes3[1]),
yaxis = list(title = genes3[2]),
zaxis = list(title = genes3[3])
)
)
p3Drag/rotate the gene space in such a way that the samples are best spread out in the 2 dimensions that you effectively see. This is what PCA tries to accomplish. To see how that works in on this data, we first use the prcomp() function:
pca <- prcomp(logcounts_genes3) # the PCA should be performed on the counts
summary(pca)In the summary you see the three principle components (PCs) of the data and the proportion of the total variance they explain. The first PC (PC1) already explains most of the variance, with a little left for PC2. We can visualize this as a barplot. This type of plot is called a screeplot.
screeplot(pca)To see how the samples separate in the space defined by the first two PCs, we can make a so called biplot.
biplot(pca)This is a bit hard to make sense of, so let us make a more pretty plot including the sample annotations. As before we can use ggplot(), we need to store the data in a data frame.
pca_df <- as.data.frame(pca$x) # ggplot needs a data.frame
pca_df$sample <- rownames(pca$x) # create a sample column to merge on
coldata$sample <- rownames(coldata) # create a sample column to merge on
pca_df <- merge(pca_df, # add sample annotation
coldata,
by = "sample",
all.x = TRUE)
var_expl <- summary(pca)$importance[2, 1:2] * 100 # variance explained by PC1 and PC2
p <- ggplot(pca_df, aes(PC1, PC2, color = environmental_stress)) +
geom_point(size = 4) +
labs(title = "PCA of Arabidopsis heat/salt stress RNA-seq",
x = paste0("PC1 (", round(var_expl[1], 1), "%)"),
y = paste0("PC2 (", round(var_expl[2], 1), "%)")) +
theme_bw()
pTo see how the 3D gene space was actually rotated, we can look at the loadings of the genes on the PCs. These indicate how much each gene contributes to each PC. The loadings are stored in the rotation component of the PCA object.
# loadings: variables (genes) x PCs
loadings <- as.data.frame(pca$rotation[, 1:2])
loadingsThese loadings per PC represent a so-called unit vector (a line in the gene-space of length one) originating from the center (0,0), projecting the original gene axes. PC1 is strongly aligned with the first gene, while PC2 is largely determined by gene 2. We can add these vectors to the plot,
loadings$gene <- rownames(loadings)
p + geom_segment( # p still contains the PCA plot from above
data = loadings,
aes(
x = 0, y = 0,
xend = PC1, yend = PC2
),
arrow = arrow(length = unit(0.25, "cm")),
color = "red",
inherit.aes = FALSE
) +
# loading labels
geom_text(
data = loadings,
aes(
x = PC1,
y = PC2,
label = gene
),
color = "red",
size = 4,
vjust = 0,
hjust = -0.2,
inherit.aes = FALSE
)Looking at the direction of the AT4G27670 arrow, it seems this gene contributes equally to PC1 and PC2?
Exercise 10.1 Check this with the loadings for AT4G27670 for PC1 and PC2. What is misleading in the axes of this plot?
Based on these three genes, the PCs separate the samples quite well. However, the results can be quite different if we change the way the data are preprocessed. For instance, if we do not log-transform the data, or if we center and/or scale the data before performing PCA, this can have a big impact on the results. The prcomp() function has options for centering and scaling the data, if you do not set them, the defaults are taken.
Exercise 10.2 To see the effect of centering and scaling, complete the pca_plot() function that performs PCA and plots the samples in PC1 and PC2 for different combinations of centering and scaling. Then plot it using the patchwork library to arrange the four plots in a 2x2 grid.
Add the pca_plot() function to your rnaseq_functions.R script
pca_plot <- function(logcounts, coldata, center, scale) {
# add code here to construct pca_df for the given center and scale
p <- ggplot(pca_df, aes(PC1, PC2, color = environmental_stress)) +
geom_point(size = 4) +
labs(title = paste0("Center:", center, " Scale:", scale),
x = paste0("PC1 (", round(var_expl[1], 1), "%)"),
y = paste0("PC2 (", round(var_expl[2], 1), "%)")) +
theme_bw()
return(p)
}
# Build the 4 plots for the different combinations of centering and scaling, to be plotted in a 2x2 grid
pFF <- make_pca_plot(logcounts_genes3, coldata, center = F, scale = F)
pTF <- make_pca_plot(logcounts_genes3, coldata, center = T, scale = F)
pFT <- make_pca_plot(logcounts_genes3, coldata, center = F, scale = T)
pTT <- make_pca_plot(logcounts_genes3, coldata, center = T, scale = T)
# Arrange in a 2x2 grid using the patchwork library
if (!requireNamespace("patchwork", quietly = TRUE))
install.packages("patchwork")
library(patchwork)
(pFF | pFT) / (pTF | pTT) # patchwork magicIf you see 4 different plots, great! If the plots are all the same, check the code of your function and make sure you are actually using the center and scale arguments in the prcomp() function. You should see that centering has a big impact on the results, while scaling does not have much effect. This is because the data are already on the same scale (log-transformed counts). PCA analyzes variance around the mean of the data. The principal component axes always pass through the origin (0,0,…). By centering the data, we ensure that the origin corresponds to the mean of the data.
Which combination of center/scale gives the best separation of the samples? In general, it is good practice to center the data before performing PCA, while scaling is not always necessary, especially if the data are already on a similar scale. How does this match the defaults of prcomp()?
Next, we can see what happens if we base the PCA on all genes (so not just on logcounts_genes3).
pca <- prcomp(t(logcounts))
summary(pca)
pca_df <- as.data.frame(pca$x)
pca_df$sample <- rownames(pca$x)
pca_df <- merge(pca_df,coldata,
by = "sample",
all.x = TRUE) # add sample annotation
var_expl <- summary(pca)$importance[2, 1:2] * 100 # variance explained by PC1 and PC2This gives us more PCs, but the first one still explain most of the variance.
To see how the samples separate now, plot them again using PC1 and PC2 of this PCA.
What do you conclude about the separation of the samples based on all genes compared to the three genes we selected? Do you think this is a better representation of the data? Why or why not?
We can again look at the loadings of the genes for the PCs, but since we included all genes in the analyses and have many PCs we cannot plot all of them. We can focus on the genes that contribute most to PC1, which are the ones with the highest absolute loadings.
pc = "PC1"
loadings <- pca$rotation[, pc]
top50 <- names(sort(abs(loadings), decreasing = TRUE))[1:50]
logcounts_top50 <- logcounts[top50,]
annotation_col <- coldata[,"environmental_stress",drop=FALSE]
library(pheatmap)
pheatmap( logcounts_top50,
annotation_col = annotation_col,
show_rownames=TRUE,
show_colnames=FALSE,
scale="none",
main=paste0("Top 50 genes by absolute ",pc," loading" ))We can use the PCs also for clustering the samples. This gives us the opportunity to select the most important aspects of the data. Starting with the top 10 PCs to do the K-means clustering gives a pretty good result.
pca_scores <- pca$x[, 1:10]
set.seed(42)
km_pca <- kmeans(pca_scores, centers = 4, nstart=250)
pca_df$cluster <- factor(km_pca$cluster[match(pca_df$sample, rownames(pca_scores))])
ggplot(pca_df, aes(PC1, PC2, color = environmental_stress, shape = cluster)) +
geom_point(size = 4) +
labs(
title = "PCA of Arabidopsis heat/salt stress RNA-seq",
x = paste0("PC1 (", round(var_expl[1], 1), "%)"),
y = paste0("PC2 (", round(var_expl[2], 1), "%)"),
shape = "k-means cluster",
color = "Stress"
) +
theme_bw()Exercise 10.3 Try out what happens if you try different numbers of top PCs than 10, does the clustering still match the treatments?
10.3 Differential Gene Expression
Up to now we have been doing true exploratory data analysis (EDA), looking at the data without a strong hypothesis. This is what is typically called unsupervised analysis. We only used the treatment information to interpret the results, not to guide the analysis. Now we will do a supervised analysis, where we take the sample information into account to do the analysis. The question we will address is: For which genes does the expression significantly change in response to the stress treatments?. As already mentioned on Monday, these genes are called Differentially Expressed Genes or DEG, and the analysis is called Differential Gene Expression or DGE. For this we group samples based on their treatment. For instance, all three samples that were only treated with heat are replicates of each other where we expect the same genes to be responding to the treatment. You may have noticed the term significantly, which implies we do a statistical test and get a p-value. The t-test is not appropriate for RNA-seq data, because these are discrete counts instead of continuous values. RNA-seq counts for a gene are generally assumed to follow a negative binomial distribution. Libraries like edgeR are specifically developed to use the count information from the replicates to predict the mean and variance for the expression of a gene and use these in a statistical test comparing different treatments or time points. Below the code for this analysis, not need to understand every step.
counts <- assay(se) # we start with the raw counts
group <- factor(coldata$environmental_stress)
levels(group) <- make.names(levels(group))
group <- relevel(group, ref = "none") # compare all treatments to the untreated samples
dge <- DGEList(counts = counts, group = group) # create the DGEList object
keep <- filterByExpr(dge, group = group) # using the edgeR count filtering
dge <- dge[keep, , keep.lib.sizes = FALSE] # remove genes based on the keep logical vector
dge <- calcNormFactors(dge) # for normalization
design <- model.matrix(~ 0 + group) # design of the experiement
colnames(design) <- levels(group)
fit <- glmQLFit(dge, design) # # fit model
heat_contrast <- makeContrasts( # extract results for heat treatment
heat_stress = `heat.stress` - none,
levels = design
)
qlf_heat <- glmQLFTest(fit, contrast = heat_contrast) # perform the statistical tests
res_heat <- topTags(qlf_heat, n = Inf)$table
salt_contrast <- makeContrasts( # extract results for salt treatment
salt_stress = `salt.stress` - none,
levels = design
)
qlf_salt <- glmQLFTest(fit, contrast = salt_contrast) # perform the statistical tests
res_salt <- topTags(qlf_salt, n = Inf)$table
salt_heat_contrast <- makeContrasts( # extract results for salt + heat treatment
heat_stress = `salt.and.heat.stress` - none,
levels = design
)
qlf_salt_heat <- glmQLFTest(fit, contrast = salt_heat_contrast)
res_salt_heat <- topTags(qlf_salt_heat, n = Inf)$table
res_list <- list( # combine in one list
heat = res_heat,
salt = res_salt,
salt_heat = res_salt_heat
)
sig_counts <- sapply(res_list, function(x) sum(x$FDR < 0.05))
sig_counts # significant counts per stressThis shows the number of genes that have significantly different expression comparing samples that did not experience stress with samples that did. The counts include both genes that have higher expression (upregulated) as well as genes that have lower expression (downregulated) due to the stress treatment. We can zoom into the upregulated genes and check the overlap between the different stress treatments using a Venn diagram.
if(!requireNamespace("VennDiagram", quietly = TRUE))
BiocManager::install(c("VennDiagram"))
library(VennDiagram)
library(grid)
# upregulated genes
deg_heat <- rownames(res_heat)[res_heat$FDR < 0.05 & res_heat$logFC >= 0]
deg_salt <- rownames(res_salt)[res_salt$FDR < 0.05 & res_salt$logFC >= 0]
deg_salt_heat <- rownames(res_salt_heat)[res_salt_heat$FDR < 0.05 & res_salt_heat$logFC >= 0]
draw.triple.venn(
area1 = length(deg_heat),
area2 = length(deg_salt),
area3 = length(deg_salt_heat),
n12 = length(intersect(deg_heat, deg_salt)),
n13 = length(intersect(deg_heat, deg_salt_heat)),
n23 = length(intersect(deg_salt, deg_salt_heat)),
n123 = length(Reduce(intersect, list(deg_heat, deg_salt, deg_salt_heat))),
category = c("Heat", "Salt", "Salt+Heat")
)Exercise 10.4 Look at figure 3A in the Suzuki et al paper.
How do the Venn diagrams from the paper compare to this one?
What would be a good next comparison to make?