8 Gene expression analysis: salt and heat stress in Arabidopsis

Author

Harm Nijveen

8.1 Introduction

Gene expression analyses can shed light on all kinds of biological processes and environmental responses, through identifying genes that are activated or silenced. In this session we will analyze RNA-seq gene expression data from Arabidopsis thaliana plants exposed to salt and heat stress and a combination of salt and heat stress.

The data originate from the study:
ABA Is Required for Plant Acclimation to a Combination of Salt and Heat Stress
https://doi.org/10.1371/journal.pone.0147625

We could start with the raw sequencing data, but that is beyond the scope of this course. Fortunately, the gene-level counts are already available through the EMBL-EBI Expression Atlas at https://www.ebi.ac.uk/gxa/experiments/E-GEOD-72806/

We will use this data set for:

exploratory RNA-seq analysis
clustering and principle component analysis
differential gene expression (DGE)

8.2 Installing and loading required packages

First we need to install and load some packages:

if (!requireNamespace("BiocManager", quietly = TRUE)) 
  install.packages("BiocManager")

if (!requireNamespace("ExpressionAtlas", quietly = TRUE)) 
  BiocManager::install("ExpressionAtlas")

if (!requireNamespace("SummarizedExperiment", quietly = TRUE)) 
  BiocManager::install("SummarizedExperiment")

if (!requireNamespace("edgeR", quietly = TRUE)) 
  BiocManager::install("edgeR")

if (!requireNamespace("pheatmap", quietly = TRUE)) 
  BiocManager::install("pheatmap")

if(!requireNamespace("HDF5Array", quietly = TRUE)) 
  BiocManager::install("HDF5Array")

if(!requireNamespace("GenomeInfoDb", quietly = TRUE)) 
  BiocManager::install("GenomeInfoDb")

library(ExpressionAtlas)
library(SummarizedExperiment)
library(tidyverse)

Installing these packages can take quite some time. If you encounter errors during installation, these can be challenging to solve. It is usually a good start to properly read the error and look for clues. If the error is a version conflict, for instance on the rlang or xfun packages, it can help to re-install these: install.packages("xfun") install.packages("rlang")

Clearing the workspace (under the Session menu in RStudio) and restarting the R session can help.

If you get asked whether you want to update certain packages, you can select a for all.

8.3 Loading RNA-seq expression data

we can download the gene expression data directly from the EMBL-EBI Expression Atlas website using the getaAtlasExperiment method in the ExpressionAtlas library:

exp <- getAtlasExperiment("E-GEOD-72806")

This returns a list (actually a SimpleList) of data structures, in this case only one, let’s extract that:

  se <- exp[[1]]

If the Expression Atlas site does not work, you can also download a copy from our website as a zip file, unzip it and load the data from there.

download.file(
  url = "https://www.bioinformatics.nl/courses/BIF-20806/E-GEOD-72806.zip",
  destfile = "./E-GEOD-72806.zip",
  mode = "wb")
unzip("E-GEOD-72806.zip")
library(HDF5Array)
se = loadHDF5SummarizedExperiment(dir="E-GEOD-72806")

The gene expression data are in this se object.

Exercise 8.1 To learn more about the resulting data in se use the class() function to see what kind we are dealing with.

To further explore the structure of se, use the str() function to get more information about the content. str stands for “structure of an R object”.

It is a RangedSummarizedExperiment from the package SummarizedExperiment. Try this to get more information about these:

?RangedSummarizedExperiment
?SummarizedExperiment

8.4 Storing the expression data in Hierarchical Data Format files

The Expression Atlas website can be unresponsive at times, so best to store the data structure locally. Because it is a complex data set, we cannot simply save it as a csv file. Instead, we can use the Hierarchical Data Format (HDF) file format. In the HDF5Array library there are specific functions to store and load SummarizedExperiment files:

library(HDF5Array)
saveHDF5SummarizedExperiment(se,dir="E-GEOD-72806")

Using the dir() function you can check that a directory was created called E-GEOD-72806.

From now on we can load the data as:

se = loadHDF5SummarizedExperiment(dir="E-GEOD-72806")

Exercise 8.2 Create a function called load_atlas_se in your rnaseq_functions.R script that returns a RangedSummarizedExperiment object either from disk or directly from the EMBL-EBI Expression Atlas. The function should take the accession as input and a Boolean (TRUE/FALSE) to determine whether to load the data from disk or from the website. The default value for the Boolean should be TRUE.

load_atlas_se <- function(accession,from_disk = TRUE) {
  ...
  return(se)
}

The different kinds of data in the se object are stored in different slots. To see which these are, you can use the slotNames() method on the se object. You can access these slots use the @ character. For instance, to see how the raw sequencing data were filtered you can look in the metadata slot, which returns a list, and then the filtering element like this:

se@metadata$filtering 
# or 
se@metadata[["filtering"]]

Notice that for accessing an element of a list you do not use an @ character, but instead a dollar sign or double square brackets.

However, accessing slots directly is like opening the bakery’s display case and taking a cake without going through the counter. It might work, but you are bypassing the process that keeps inventory, presentation, and handling consistent. The recommended way to access data in a complex data structure like this is via accessor functions:

metadata(se)$filtering

To find the proper accessor function, the help text for the class is usually the best place to look.

8.5 Exploring the count data

To get the actual expression data in the RangedSummarizedExperiment, we can use the assay() function:

counts <- assay(se) # genes x samples
counts

This returns a matrix with rows representing genes and columns representing samples. The class of counts is DelayedMatrix, which behaves like a regular numeric matrix, but computations are performed lazily. This means for instance that operations like log-transformation or scaling are not executed immediately; instead, they are applied only when the data are accessed. This allows efficient handling of large datasets that may reside on disk rather than in computer memory. To view the counts matrix in RStudio, try the following commands:

View(counts)
View(as.matrix(counts))

Let’s look at the dimensions of counts:

dim(counts)

There are more than 32000 genes and 12 samples with various treatments. The columns are the samples, to get information about the samples we can use the colData() function:

coldata <- colData(se) # sample level information
coldata

The samples have non-descriptive names like “SRR2302908”, but the environment_stress column tells us which stress the sample was subjected to. The other columns do not help to differentiate between the samples. We can clean up the coldata DataFrame to only keep the enviromental_stress column. You probably did not notice it, but coldata is a Bioconductor DataFrame object and not a base R data.frame. We can fix that in the same command:

coldata <- as.data.frame(coldata[, "environmental_stress", drop = FALSE])

In case you wonder about the drop = FALSE in coldata[, "environmental_stress", drop = FALSE]: By selecting only the “environmental_stress” column, we get a data.frame with only one column. Default R behavior is to ‘simplify’ this to a vector c(), which means we lose the row names which represent the sample labels. drop = FALSE prevents this default behavior, so the result is still a data.frame.

Now to get some idea of the gene expression values (the counts) for the RNA-seq experiment, we can first sum the counts per sample (so per column). We can use the apply() function for this, by allowing us to apply the sum() function to each column. DelayedArray objects also have a specific function called colSums() to do the same.

apply(counts,2,sum) # or colSums(counts)

apply (1)

The apply(X, MARGIN, FUN) function performs the function given by FUN on all rows or columns (indicated by the MARGIN value: 1 for rows and 2 for column) of the matrix X. So in the code above, we calculated the sum() per column of counts.

In the Suzuki et al. paper we can read that they generated on average 14 million sequencing reads per sample, the sums you see here are after filtering to remove bad quality reads. Do the sums match the number from the paper?

One sequencing read corresponds to one mRNA (fragment), so they can be used to determine the abundance of the different mRNAs and quantify the expression of each Arabidopsis gene.

To get an idea of the overall properties of the counts, we can plot their distribution. For this we look at the counts of all genes and samples together, so let’s flatten the counts matrix into one vector:

expr <- as.vector(counts)

And then plot this as a histogram with 100 bins:

df_expr <- data.frame(expr = expr) # ggplot needs a data.frame

library(ggplot2)

ggplot(df_expr, aes(x = expr)) +
  geom_histogram(bins = 100) +
  labs(title = "Distribution of raw gene counts",
  x = "Counts", y = "Frequency")

This is not very informative, is it? That is because there is very large difference in expression between genes. A common trick to bring values in the same range is to performa a log transformation of the counts. We can plot the log10 transformed counts with a simple addition:

ggplot(df_expr, aes(x = expr)) +
  geom_histogram(bins = 100) +
  scale_x_log10() +
  labs(title = "Distribution of raw gene counts (log10 x-axis)",
    x = "Counts", y = "Frequency")

Exercise 8.3 You may see a warning that the scale_x_log10() introduced infinite values, can you explain why and for which expression values this happened?

We can avoid the “infinite values” warning by using the the log1p method:

ggplot(df_expr, aes(x = log1p(expr))) +
  geom_histogram(bins = 100) +
  labs(title = "Distribution of log1p-transformed gene counts",
    x = "log1p(count)", y = "Frequency")

Exercise 8.4 What is the difference with the previous plot? In which two ways does log1p() differ from log10()?

For further analyses, we are not interested in genes that have little to no expression, so let’s filter these out:

keep <- rowSums(counts >= 10) >= 3
counts_filt <- as.matrix(counts[keep, ])

Let’s unpack what happens here: the counts >= 10 results in a matrix with Boolean/logical (TRUE or FALSE) values. rowSums() counts a TRUE as 1 and a FALSE as 0. The resulting keep variable is a vector of logical values, which is used to select rows (genes) of the counts matrix.

Exercise 8.5 How many genes are left? Describe which criteria genes have to meet to be kept.

The differences in expression level per gene can vary a lot. If we create a line plot of the (sorted) mean expression values per gene, we can see that a small proportion of genes are very highly expressed:

gene_means <- rowMeans(counts_filt)
gene_means_sorted <- sort(gene_means)

plot(
  gene_means_sorted,
  type = "l", #line
  xlab = "Genes (sorted by mean expression)",
  ylab = "Mean expression",
  main = "Sorted gene mean expression"
)

By now it should be clear that for plotting gene expression values a log transformation is advised to not let these highly expressed genes dominate the analyses

logcounts = log1p(counts_filt)

8.6 Top 10 most highly expressed genes

Now let’s look at the top 10 most highly expressed genes, which are actually the bottom 10 of the sorted list that we just plotted.

top_n <- 10
top_genes <- names(tail(gene_means_sorted,top_n))
mat_top <- logcounts[top_genes,] # select the rows by the row names of logcounts

To plot the expression of these genes in all samples we can use a fancy heatmap that also allows us to add annotation from the metadata of the samples.

library(pheatmap)

anno_col <- coldata  # sample annotations

pheatmap(
  mat_top,
  cluster_rows = FALSE,
  cluster_cols = FALSE,
  annotation_col = anno_col,
  show_colnames = FALSE,
  fontsize_row = 7,
  scale = "none",
  main = paste("Top", top_n, "most highly expressed genes")
)

Exercise 8.6 In this plot you can already discern a pattern, at least for the 2 or 3 most highly expressed genes. Still the plot can be improved in a number of ways:

the most highly expressed genes still take most of the attention, the scale argument can improve that by converting the values to z-scores per row or column. Set the scale argument to the appropriate direction (row or column) to see the relative expression over the different samples for the less highly expressed genes.
with clustering you can group samples and/or genes based on how similar their expression values are, more about that tomorrow in the clustering session. Try out what happens if you cluster genes and samples in this plot.

anno_col <- coldata

pheatmap(
  mat_top,
  cluster_rows = FALSE,
  cluster_cols = FALSE,
  annotation_col = anno_col,
  show_colnames = FALSE,
  fontsize_row = 7,
  scale = "none",
  main = paste("Top", top_n, "most highly expressed genes")
)

8.7 Normalization

So we have looked at the expression at the gene level, let us get back to the total number counts per sample and address the question do all samples have the same number of counts? The total number of counts per sample is also called library size:

libsize <- colSums(counts_filt)

barplot(
  libsize,
  las = 2,
  ylab = "Total counts",
  main = "Library size per sample"
)

The total counts are pretty similar but not identical. If we compare the sample with the most counts and the sample with the fewest counts, there is a 20% difference:

min(libsize)/max(libsize)

The total number of counts (reads) per sample is more a technical than a biological aspect of the data, you typically ask the DNA sequencing facility for a certain number of reads per sample. Deviation from this requested number is usually due to variation in sample quality and/or sequencing efficiency.

If we look at the correlation between the counts for the top 10 most highly expressed genes with the library size, some are very strongly correlated.

apply(mat_top, 1, cor, y = libsize, method = "pearson")

apply (2)

This use of apply() is a bit more complicated than before. That is because the cor() function requires more information than just the rows or columns of the given matrix. The additional arguments of apply() are passed on to cor(). So for every row (second argument is a 1) of mat_top, the function calculates the Pearson correlation with the libsize vector.

We just discovered something uncomfortable: different samples have quite different total numbers of reads. That means our analyses so far were biased. How do we deal with this? A simple and reasonably effective method is to just correct the counts for the total number of reads per sample. But just dividing each count by the total counts per sample would create very small numbers, so the result after dividing is commonly multiplied by a million to generate so-called Counts Per Million or CPM.

\[ CPM_{i} = \frac {counts_{i}}{\sum_{n=1}^{n}counts_n}\times10^6 \]

Summing all CPM normalized counts in a sample would then give a uniform total of $10^6$

In practice, this does not remove all bias, so more sophisticated normalization methods are used in RNA-seq analysis. To understand what the problem is with the simple method, think of the following scenario: a treatment causes a very strong increase in the expression of only the top 10 most highly expressed genes, but does not affect any other gene. The top 10 genes now take an even larger proportion of the total counts, which effectively lowers the relative counts for all the other genes making it seem like their expression decreased due to the treatment. The problem is that the simple scaling-based normalization is sensitive to a change in composition.

A normalization method that addresses this is called TMM, or “Trimmed Mean of M-values”. This specifically ignores genes that change a lot between different samples in calculating a scaling factor. It is good to realize that TMM only works well if the majority of genes do not change much in expression.

We can perform TMM normalization step-by-step, but we can also use an R library that was designed for RNA-seq analyses called edgeR.

library(edgeR)

dge <- DGEList(counts = counts_filt) # create a DGEList object
dge <- calcNormFactors(dge) # calculate the normalization factors and include these in the DGEList object

cpm <- cpm(dge, log = FALSE) # extract the CPM normalized counts from the DGEList object

barplot(
  colSums(cpm),
  las = 2,
  main = "Effective library sizes after TMM",
  ylab = "Normalized library size"
)

Did that help?

We will come back to edgeR on Wednesday when we will use it to find which genes are significantly up- or down-regulated in response to the stress treatments. These genes are called Differentially Expressed Genes or DEG, and the analysis is called Differential Gene Expression or DGE.

Exercise 8.7 Create a function called get_normalized_counts in your rnaseq_functions.R script that does the following:

extract the counts from a RangedSummarizedExperiment
removes lowly expressed genes
normalizes the counts using edgeR’s cpm() function
optionally log transforms the counts

It should take as input a RangedSummarizedExperiment object, a logical value log indicating whether the counts should be log2 transformed (default TRUE) and two numeric values that are used for filtering: min_count and min_samples. These should have default values 10 and 3. It should return the cpm values:

get_normalized_counts<- function(se, log = TRUE, min_count = 10, min_samples = 3) {
  ...
  return(cpm)
}

--- title: "Gene expression analysis: salt and heat stress in Arabidopsis" author: "Harm Nijveen" format: html: toc: true toc-depth: 3 code-fold: false code-tools: true link-external-newwindow: true link-external-icon: true editor: visual execute: eval: false warning: false message: false --- ## Introduction Gene expression analyses can shed light on all kinds of biological processes and environmental responses, through identifying genes that are activated or silenced. In this session we will analyze RNA-seq gene expression data from *Arabidopsis thaliana* plants exposed to salt and heat stress and a combination of salt and heat stress. The data originate from the study:\ **ABA Is Required for Plant Acclimation to a Combination of Salt and Heat Stress**\ <https://doi.org/10.1371/journal.pone.0147625> We could start with the [raw sequencing data](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE72806), but that is beyond the scope of this course. Fortunately, the gene-level counts are already available through the **EMBL-EBI Expression Atlas** at <https://www.ebi.ac.uk/gxa/experiments/E-GEOD-72806/> We will use this data set for: - exploratory RNA-seq analysis - clustering and principle component analysis - differential gene expression (DGE) ------------------------------------------------------------------------ ## Installing and loading required packages First we need to install and load some packages: ```{r} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") if (!requireNamespace("ExpressionAtlas", quietly = TRUE)) BiocManager::install("ExpressionAtlas") if (!requireNamespace("SummarizedExperiment", quietly = TRUE)) BiocManager::install("SummarizedExperiment") if (!requireNamespace("edgeR", quietly = TRUE)) BiocManager::install("edgeR") if (!requireNamespace("pheatmap", quietly = TRUE)) BiocManager::install("pheatmap") if(!requireNamespace("HDF5Array", quietly = TRUE)) BiocManager::install("HDF5Array") if(!requireNamespace("GenomeInfoDb", quietly = TRUE)) BiocManager::install("GenomeInfoDb") library(ExpressionAtlas) library(SummarizedExperiment) library(tidyverse) ``` :::{.callout-note appearance='simple'} Installing these packages can take quite some time. If you encounter errors during installation, these can be challenging to solve. It is usually a good start to properly read the error and look for clues. If the error is a version conflict, for instance on the `rlang` or `xfun` packages, it can help to re-install these: `install.packages("xfun")` `install.packages("rlang")` Clearing the workspace (under the Session menu in RStudio) and restarting the R session can help. If you get asked whether you want to update certain packages, you can select **a** for all. ::: ## Loading RNA-seq expression data we can download the gene expression data directly from the EMBL-EBI Expression Atlas website using the getaAtlasExperiment method in the ExpressionAtlas library: ```{r} exp <- getAtlasExperiment("E-GEOD-72806") ``` This returns a list (actually a `SimpleList`) of data structures, in this case only one, let's extract that: ```{r} se <- exp[[1]] ``` :::{.callout-note appearance='simple'} If the Expression Atlas site does not work, you can also download a copy from our website as a zip file, unzip it and load the data from there. ``` {r} download.file( url = "https://www.bioinformatics.nl/courses/BIF-20806/E-GEOD-72806.zip", destfile = "./E-GEOD-72806.zip", mode = "wb") unzip("E-GEOD-72806.zip") library(HDF5Array) se = loadHDF5SummarizedExperiment(dir="E-GEOD-72806") ``` ::: The gene expression data are in this `se` object. ::: {#exr-class_structure} To learn more about the resulting data in `se` use the `class()` function to see what kind we are dealing with. ```{r} ``` To further explore the structure of `se`, use the `str()` function to get more information about the content. `str` stands for "**st**ructure of an **R** object". ```{r} ``` ::: It is a `RangedSummarizedExperiment` from the package [SummarizedExperiment](https://bioconductor.org/help/course-materials/2019/BSS2019/04_Practical_CoreApproachesInBioconductor.html). Try this to get more information about these: ```{r} ?RangedSummarizedExperiment ?SummarizedExperiment ``` ## Storing the expression data in Hierarchical Data Format files The Expression Atlas website can be unresponsive at times, so best to store the data structure locally. Because it is a complex data set, we cannot simply save it as a csv file. Instead, we can use the [Hierarchical Data Format (HDF)](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) file format. In the `HDF5Array` library there are specific functions to store and load `SummarizedExperiment` files: ```{r} library(HDF5Array) saveHDF5SummarizedExperiment(se,dir="E-GEOD-72806") ``` Using the `dir()` function you can check that a directory was created called **E-GEOD-72806**. From now on we can load the data as: ```{r} se = loadHDF5SummarizedExperiment(dir="E-GEOD-72806") ``` ::: {#exr-create_load_function} Create a function called `load_atlas_se` in your **rnaseq_functions.R** script that returns a `RangedSummarizedExperiment` object either from disk or directly from the EMBL-EBI Expression Atlas. The function should take the accession as input and a Boolean (TRUE/FALSE) to determine whether to load the data from disk or from the website. The default value for the Boolean should be TRUE. ```{r} load_atlas_se <- function(accession,from_disk = TRUE) { ... return(se) } ``` ::: The different kinds of data in the **se** object are stored in different slots. To see which these are, you can use the **slotNames()** method on the **se** object. You can access these slots use the **\@** character. For instance, to see how the raw sequencing data were filtered you can look in the **metadata** slot, which returns a list, and then the **filtering** element like this: ```{r} se@metadata$filtering # or se@metadata[["filtering"]] ``` Notice that for accessing an element of a list you do not use an **\@** character, but instead a dollar sign or double square brackets. However, accessing slots directly is like opening the bakery’s display case and taking a cake without going through the counter. It might work, but you are bypassing the process that keeps inventory, presentation, and handling consistent. The recommended way to access data in a complex data structure like this is via accessor functions: ```{r} metadata(se)$filtering ``` To find the proper accessor function, the help text for the class is usually the best place to look. ## Exploring the count data To get the actual expression data in the `RangedSummarizedExperiment`, we can use the `assay()` function: ```{r} counts <- assay(se) # genes x samples counts ``` This returns a matrix with rows representing genes and columns representing samples. The class of counts is `DelayedMatrix`, which behaves like a regular numeric matrix, but computations are performed lazily. This means for instance that operations like log-transformation or scaling are not executed immediately; instead, they are applied only when the data are accessed. This allows efficient handling of large datasets that may reside on disk rather than in computer memory. To view the counts matrix in RStudio, try the following commands: ```{r} View(counts) View(as.matrix(counts)) ``` Let's look at the dimensions of counts: ```{r} dim(counts) ``` There are more than 32000 genes and 12 samples with various treatments. The columns are the samples, to get information about the samples we can use the `colData()` function: ```{r} coldata <- colData(se) # sample level information coldata ``` The samples have non-descriptive names like "SRR2302908", but the **environment_stress** column tells us which stress the sample was subjected to. The other columns do not help to differentiate between the samples. We can clean up the coldata DataFrame to only keep the enviromental_stress column. You probably did not notice it, but coldata is a Bioconductor `DataFrame` object and not a base R `data.frame`. We can fix that in the same command: ```{r} coldata <- as.data.frame(coldata[, "environmental_stress", drop = FALSE]) ``` ::: {.callout-tip appearance="simple"} In case you wonder about the `drop = FALSE` in `coldata[, "environmental_stress", drop = FALSE]`: By selecting only the "environmental_stress" column, we get a `data.frame` with only one column. Default R behavior is to 'simplify' this to a vector `c()`, which means we lose the row names which represent the sample labels. `drop = FALSE` prevents this default behavior, so the result is still a `data.frame`. ::: Now to get some idea of the gene expression values (the counts) for the RNA-seq experiment, we can first sum the counts per sample (so per column). We can use the `apply()` function for this, by allowing us to apply the `sum()` function to each column. `DelayedArray` objects also have a specific function called `colSums()` to do the same. ```{r} apply(counts,2,sum) # or colSums(counts) ``` ::: {.callout-tip appearance="simple"} ### apply (1) The `apply(X, MARGIN, FUN)` function performs the function given by FUN on all rows or columns (indicated by the MARGIN value: 1 for rows and 2 for column) of the matrix X. So in the code above, we calculated the `sum()` per column of `counts`. ::: In the Suzuki et al. paper we can read that they generated on average 14 million sequencing reads per sample, the sums you see here are after filtering to remove bad quality reads. Do the sums match the number from the paper? One sequencing read corresponds to one mRNA (fragment), so they can be used to determine the abundance of the different mRNAs and quantify the expression of each Arabidopsis gene. To get an idea of the overall properties of the counts, we can plot their distribution. For this we look at the counts of all genes and samples together, so let's flatten the `counts` matrix into one vector: ```{r} expr <- as.vector(counts) ``` And then plot this as a histogram with 100 bins: ```{r} df_expr <- data.frame(expr = expr) # ggplot needs a data.frame library(ggplot2) ggplot(df_expr, aes(x = expr)) + geom_histogram(bins = 100) + labs(title = "Distribution of raw gene counts", x = "Counts", y = "Frequency") ``` This is not very informative, is it? That is because there is very large difference in expression between genes. A common trick to bring values in the same range is to performa a log transformation of the counts. We can plot the log10 transformed counts with a simple addition: ```{r} ggplot(df_expr, aes(x = expr)) + geom_histogram(bins = 100) + scale_x_log10() + labs(title = "Distribution of raw gene counts (log10 x-axis)", x = "Counts", y = "Frequency") ``` ::: {#exr-log_warnings} You may see a warning that the `scale_x_log10()` introduced infinite values, can you explain why and for which expression values this happened? ::: We can avoid the "infinite values" warning by using the the `log1p` method: ```{r} ggplot(df_expr, aes(x = log1p(expr))) + geom_histogram(bins = 100) + labs(title = "Distribution of log1p-transformed gene counts", x = "log1p(count)", y = "Frequency") ``` ::: {#exr-log1p} What is the difference with the previous plot? In which two ways does `log1p()` differ from `log10()`? ::: For further analyses, we are not interested in genes that have little to no expression, so let's filter these out: ```{r} keep <- rowSums(counts >= 10) >= 3 counts_filt <- as.matrix(counts[keep, ]) ``` Let's unpack what happens here: the `counts >= 10` results in a matrix with Boolean/logical (TRUE or FALSE) values. `rowSums()` counts a TRUE as 1 and a FALSE as 0. The resulting `keep` variable is a vector of logical values, which is used to select rows (genes) of the counts matrix. ::: {#exr-filter} How many genes are left? Describe which criteria genes have to meet to be kept. ::: The differences in expression level per gene can vary a lot. If we create a line plot of the (sorted) mean expression values per gene, we can see that a small proportion of genes are very highly expressed: ```{r} gene_means <- rowMeans(counts_filt) gene_means_sorted <- sort(gene_means) plot( gene_means_sorted, type = "l", #line xlab = "Genes (sorted by mean expression)", ylab = "Mean expression", main = "Sorted gene mean expression" ) ``` By now it should be clear that for plotting gene expression values a log transformation is advised to not let these highly expressed genes dominate the analyses ```{r} logcounts = log1p(counts_filt) ``` ## Top 10 most highly expressed genes Now let's look at the top 10 most highly expressed genes, which are actually the bottom 10 of the sorted list that we just plotted. ```{r} top_n <- 10 top_genes <- names(tail(gene_means_sorted,top_n)) mat_top <- logcounts[top_genes,] # select the rows by the row names of logcounts ``` To plot the expression of these genes in all samples we can use a fancy heatmap that also allows us to add annotation from the metadata of the samples. ```{r} library(pheatmap) anno_col <- coldata # sample annotations pheatmap( mat_top, cluster_rows = FALSE, cluster_cols = FALSE, annotation_col = anno_col, show_colnames = FALSE, fontsize_row = 7, scale = "none", main = paste("Top", top_n, "most highly expressed genes") ) ``` ::: {#exr-improve_pheatmap} In this plot you can already discern a pattern, at least for the 2 or 3 most highly expressed genes. Still the plot can be improved in a number of ways: - the most highly expressed genes still take most of the attention, the *scale* argument can improve that by converting the values to z-scores per row or column. Set the scale argument to the appropriate direction (row or column) to see the relative expression over the different samples for the less highly expressed genes. - with clustering you can group samples and/or genes based on how similar their expression values are, more about that tomorrow in the clustering session. Try out what happens if you cluster genes and samples in this plot. ```{r} anno_col <- coldata pheatmap( mat_top, cluster_rows = FALSE, cluster_cols = FALSE, annotation_col = anno_col, show_colnames = FALSE, fontsize_row = 7, scale = "none", main = paste("Top", top_n, "most highly expressed genes") ) ``` ::: ## Normalization So we have looked at the expression at the gene level, let us get back to the total number counts per sample and address the question do all samples have the same number of counts? The total number of counts per sample is also called *library size*: ```{r} libsize <- colSums(counts_filt) barplot( libsize, las = 2, ylab = "Total counts", main = "Library size per sample" ) ``` The total counts are pretty similar but not identical. If we compare the sample with the most counts and the sample with the fewest counts, there is a 20% difference: ```{r} min(libsize)/max(libsize) ``` The total number of counts (reads) per sample is more a technical than a biological aspect of the data, you typically ask the DNA sequencing facility for a certain number of reads per sample. Deviation from this requested number is usually due to variation in sample quality and/or sequencing efficiency. If we look at the correlation between the counts for the top 10 most highly expressed genes with the library size, some are very strongly correlated. ```{r} apply(mat_top, 1, cor, y = libsize, method = "pearson") ``` ::: {.callout-tip appearance="simple"} ### apply (2) This use of `apply()` is a bit more complicated than before. That is because the `cor()` function requires more information than just the rows or columns of the given matrix. The additional arguments of `apply()` are passed on to `cor()`. So for every row (second argument is a `1`) of `mat_top`, the function calculates the Pearson correlation with the `libsize` vector. ::: We just discovered something uncomfortable: different samples have quite different total numbers of reads. That means our analyses so far were biased. How do we deal with this? A simple and reasonably effective method is to just correct the counts for the total number of reads per sample. But just dividing each count by the total counts per sample would create very small numbers, so the result after dividing is commonly multiplied by a million to generate so-called **Counts Per Million** or CPM. $$ CPM_{i} = \frac {counts_{i}}{\sum_{n=1}^{n}counts_n}\times10^6 $$ Summing all CPM normalized counts in a sample would then give a uniform total of $10^6$ In practice, this does not remove all bias, so more sophisticated normalization methods are used in RNA-seq analysis. To understand what the problem is with the simple method, think of the following scenario: a treatment causes a very strong increase in the expression of only the top 10 most highly expressed genes, but does not affect any other gene. The top 10 genes now take an even larger proportion of the total counts, which effectively lowers the relative counts for all the other genes making it seem like their expression decreased due to the treatment. The problem is that the simple scaling-based normalization is sensitive to a change in composition. A normalization method that addresses this is called TMM, or "Trimmed Mean of M-values". This specifically ignores genes that change a lot between different samples in calculating a scaling factor. It is good to realize that TMM only works well if the majority of genes do not change much in expression. We can perform TMM normalization step-by-step, but we can also use an R library that was designed for RNA-seq analyses called **edgeR**. ```{r} library(edgeR) dge <- DGEList(counts = counts_filt) # create a DGEList object dge <- calcNormFactors(dge) # calculate the normalization factors and include these in the DGEList object cpm <- cpm(dge, log = FALSE) # extract the CPM normalized counts from the DGEList object barplot( colSums(cpm), las = 2, main = "Effective library sizes after TMM", ylab = "Normalized library size" ) ``` Did that help? We will come back to `edgeR` on Wednesday when we will use it to find which genes are significantly up- or down-regulated in response to the stress treatments. These genes are called Differentially Expressed Genes or **DEG**, and the analysis is called Differential Gene Expression or **DGE**. ::: {#exr-create_function_get_normalized_counts} Create a function called **get_normalized_counts** in your **rnaseq_functions.R** script that does the following: - extract the counts from a `RangedSummarizedExperiment` - removes lowly expressed genes - normalizes the counts using edgeR's `cpm()` function - optionally log transforms the counts It should take as input a **RangedSummarizedExperiment** object, a logical value **log** indicating whether the counts should be log2 transformed (default TRUE) and two numeric values that are used for filtering: **min_count** and **min_samples**. These should have default values 10 and 3. It should return the cpm values: ```{r} get_normalized_counts<- function(se, log = TRUE, min_count = 10, min_samples = 3) { ... return(cpm) } ``` :::