2 From messy to clean

Author

Mark Sterken

Published

February 5, 2026

2.1 From messy to clean

Introduction to day 2

Today we will continue with goal 1 (Built basic knowledge on using R and R studio) and start with goal 2 (Be able to recognize and load common data formats) and goal 3 (Apply common statistical tools for inspecting and analyzing data)

We will make use of a fragmented simulated dataset. You need to puzzle it together using a couple of functions. Therafter you can analyze it further.

Potato cyst nematodes, virulent Globodera pallida within North-West Europe

One of the current projects that is running at the Laboratory of Nematology, is an investigation into the virulence of the potato cyst nematode Globodera pallida. Although plant susceptibility is a complex traits, there are ‘Mendelian nuggets’ to be found that control these parasites. These ‘Mendelian nuggets’ are loci (e.g. one or a couple of genes) that have a strong effect on a trait. In this case, a particular resistance gene that was bred into potato and which strongly controls the reproduction of G. pallida.

A PhD in our group, Arno Schaveling (who studied Plant Science at WUR) found out that the Resistance gene GpaV_vrn was broadly introduced into potato. This led to massive, parallel selection for virulence of G. pallida in the field¹. Currently resulting in a virulence outbreak in North-West Europe².

In particular, you can select for virulence in G. pallida over subsequent generations in one potato cultivar. Thereafter the ability of the resulting populations to reproduce in many potato cultivars is enhanced. This result meant that the resistance base in potato was relatively narrow.

Figure 2.1: Selection on the GpaV_vrn-containing potato variety Seresta also increases virulence on other commercial potato varieties. (a) The selection experiment that was carried out with four generations of selection on Seresta. The colours indicate the potato variety used (Red Desiree, Orange Seresta), and the circles with the numbers indicate which propagation was used in the standard potato cyst nematode (PCN) resistance test. The numbers indicate the generation of selection. Due to the amount of material required for the standard PCN resistance test, Generation 0 and Generation 1 could not be derived from the selection line Generations 2–4 were on. (b) The outcome of the standard PCN resistance test for selected Generations 1–4. Each dot represents the reproduction (P_f/P_i) measured in one 2-l pot inoculated with 10000 larvae. The lines are linear regressions added as a visual aid. Colours indicate varieties, where Seresta and Innovator belong to Cl_SER and Libero, Avarna and Festien belong to Cl_FES. The R₂ is the amount of variance explained by generation as derived from a PERMANOVA.

Next, we were able to identify a region on the G. pallida genome that was selected, ultimately identifying the gene Gp-pat-1. This gene is currently used as the target for a molecular assay to detect virulence in agricultural fields.

Figure 2.2: Single genetic locus is selected after repeated selection on Seresta. (a) Overview of the AMPOP02 (dots) and AMPOP10 (triangles) samples generated in the selection experiment and how many were successfully sequenced. As the larvae from which we isolated DNA underwent either a selection on Desiree (red) or Seresta (orange), we took the final number of generations selected on Seresta along in the analysis. (b) The number of variants that significantly correlate with virulence per scaffold plotted against the scaffold size in million bases (Mb). Dots indicate the variants found in AMPOP02, and triangles indicate variants found in AMPOP10. Text indicates which scaffold of the Rookmaker genome the datapoint belongs to. (c) The significance of the variants on scaffold 28 associated with generation in AMPOP10. The variants are plotted on their location on scaffold 28 (in Mb) vs the significance in −log₁₀(P). The variants where the frequency of the alternative allele is decreasing over generations are coloured red, while those where the alternative allele is increasing are coloured blue. Grey variants are not significant. The dashed horizontal line indicates the Bonferroni-corrected threshold. The dashed vertical lines indicate the candidate locus. Note that the y-axis has been cut off at −log₁₀(P) < 2.

Set up your R environment

For the basics of setting up your environment, you can check Chapter 1. The Markdown file to use for the exercises and documentation of your code is here.

Loading data into R

In this tutorial, we will start with loading different types of data files into R. This is an essential skill for working with real-world data. We’ll cover a couple of file-types that are common. Each of you has their data in these formats, you can find these via the fileshare on Brightspace. If you are familiar with R, this part might be easy.

Excel files (.xlsx)
Comma-separated values (.csv)
Semicolon-separated CSV files (.csv)
Tab-delimited text files (.txt or .tsv) (you encountered this filetype yesterday)

To make this part of the course realistic, we have provided you with data in these four formats. The reason being that this is also how it works in practice. Finding out the format of the data and loading it into memory is a crucial step in using any data-analysis pipeline.

You now need to load in data from all four data formats.

Exercise 2.1 (Loading data from an excel file (.xlsx)) To read Excel files, we use the read_excel() function from the readxl package. Excel files are very common but also very tricky. Namely, these files are mostly made by persons and people do not automatically follow conventions that are clear for a programming language. Therefore, always check the excel file before loading it in.

    ### Package
    install.packages("readxl")    # For reading Excel files
    library(readxl) # first activate the package

    # Replace with the path to your Excel file
    data_excel <- readxl::read_excel("data/YOURFILENAME.xlsx")
    
    ### Check after loading
    head(data_excel)

Exercise 2.2 (Loading data from a CSV File (Comma-separated)) CSV files are very common as format for data. Here each column is separated by a comma, rows are typically observations. We use the read.csv() function to load the data.

    # Replace with the path to your csv file
    data_csv <- read.csv("data/YOURFILENAME.csv")

    ### Check after loading
    head(data_csv)

Exercise 2.3 (Loading data from a CSV file (semicolon-separated)) Some CSV files (especially from European systems) use a semicolon (;) instead of a comma. Use read.csv2() for these files

    # Replace with the path to your csv file
    data_csv2 <- read.csv2("data/YOURFILENAME.csv")

    ### Check after loading
    head(data_csv2)

Exercise 2.4 (Loading a Tab-delimited File) These files use tabs to separate values. They often have a .txt or .tsv extension.

    # Replace with the path to your csv file
    data_tsv <- read.delim("data/YOURFILENAME.txt")

    ### Check after loading
    head(data_tsv)

2.2 Analysis of (a)virulence of Globodera pallida

This set of instructions follows the data analysis cycle, so it also includes setting up the R environment as well as loading in the data. If all is well, you already prepared this when going through the code above.

Setup R

Before we start, make sure you have set the work directory and have the following packages ready: {tidyverse} and {readxl}. Of course you also need to activate them. If you’re not sure how, check Chapter 1.

    ### type the code to install the packages

    ### type the code to activate the packages

Load data

Load your data for the tutorial into R, see the exercise above.

Inspecting data part 1

Using functions

Last time we used head(), tail(), and summarise(). Again apply these functions. But now we also want to check which piece is which. For that the functions length, ncol(), nrow(), and dim() are handy.

    # Check the first 6 rows
    head(data_csv)

    # Check the last 6 rows
    tail(data_csv)
    
    # get a summary of the data
    summary(data_csv)
 
    # get the length of a vector
    length(data_csv)
    
    # get the number of columns
    ncol(data_csv)
    
    # get the number of rows
    nrow(data_csv)
    
    # get the dimensions of an object
    dim(data_csv)

Exercise 2.5 (Which file is which puzzle piece?) Inspect the four files you have using dim(), nrow(), and ncol(). Which piece is which?

    ###Type the code to check these three properties of each file

The puzzle pieces and the functions you will need to combine them further on in the excercise.

Answer the following questions (note this is random for each personal dataset!). You can deduce which piece is which via their dimensions.

Which 2 files are piece 1 and 2?
which file is piece 3?
Which file is piece 4?

Cleaning and preparing the data

Now you have the dimensions of the puzzle pieces. You can combine them in only one way.

Exercise 2.6 (Combine the data) First you need to join the two pieces (1 and 2) with the equal number of rows using cbind(). Second, you need to add the piece (3) that now has the same number of columns using rbind(). Finally, you need to add the last piece (4) of data using merge().

Complete the code below

    ### bind objects together over columns
    pheno_data <- cbind(YOURPIECE1,YOURPIECE2)
    
    ### binds objects together over rows
    pheno_data <- rbind(pheno_data,YOURPIECE3)
    
    ### merges objects, seeks for the same identitier, or you can provide it using by.x and by.y
    pheno_data <- merge(pheno_data,YOURPIECE4)

Inspecting data using histograms

After combining, it is smart to inspect the data again to check if everything went well. You can use the same functions as before or, for instance, use geom_histogram().

    ggplot2::ggplot(pheno_data,aes(x=Gpal_vir)) +
    geom_histogram()

    ggplot2::ggplot(pheno_data,aes(x=Gpal_avir)) +
    geom_histogram()

Analyzing

You now have the data complete. As you can see we have two measurements linked to one plant ID. The reason is that this data represents various inbred lines of potato plants tested for both virulent and avirulent G. pallida populations.

Normal distribution

Same as before, let’s figure out whether the data is normally distributed. For this we can use the functions stat_qq() and stat_qq_line(). Find the code to do this in Chapter 1 and check the data in the Gpal_vir and Gpal_avir columns.

    ### Add the code to make the qqplot for Gpal_vir

    ### Add the code to make the qqplot for Gpal_avir

Of course, you can also use a Shapiro-Wilk test (shapiro.test), this calculates the likelyhood whether the data follows a normal distribution.

Code

    shapiro.test(pheno_data$Gpal_vir)

    shapiro.test(pheno_data$Gpal_avir)

Exercise 2.7 (Does the data follow a normal distribution?) You now conducted two types of analyses that tell you whether the data is normally distributed. (1) what do you see in the qqplots, and what does it mean? (2) what are the p-values from the Shapiro-Wilk test? (3) what does a p-value mean?

Wilcoxon test

As the data is not exactly normally distributed, let’s deal with that in the proper way. Now we can use a Wilcoxon Rank Sum test to compare between two samples. We first need to transform the dataset, as we do not have a single observation per row currently. For this you can use the gather() function from the {tidyr} package. Use functions like head() or tail() to check what happened.

    data.test <- tidyr::gather(pheno_data,key="pallida",value="juveniles",-ID,-Experiment)

Now we can test how the potato plants performed against virulent versus avirulent G. pallida. Using wilcox.test()

    wilcox.test(juveniles~pallida,data=data.test)
    ###you should have a very low p-value

Exercise 2.8 (How to interpret a p-value?) You now have tested a null-hypothesis using a non-parametric test and found that the chance his data agrees with this null-hypothesis is very small. What is the biological interpretation of this p-value?

Presenting

Boxplot

Can you now also make a plot of what you just tested? Use geom_boxplot() to visualize the distribution and geom_jitter() to show the individual measurements.

    ggplot(data.test,aes(x=pallida,y=juveniles)) +
    geom_boxplot() + geom_jitter(height=0,width=0.25)

Exercise 2.9 (Boxplot with facet_grid()) There is another variable in the data as well: ‘Experiment’. Use facet_grid() to split this out. Figure out how using the ggplot2 website.

    ggplot(data.test,aes(x=pallida,y=juveniles)) +
    geom_boxplot() + geom_jitter(height=0,width=0.25) +
    facet_grid() ### add something in-between the brackets.

Fancier boxplot

Now we can add p-value as well using annotate().

    ###I'm getting the statistics here
    tmp <- tapply(data.test[,c("pallida","juveniles")],data.test$Experiment,function(x){
                                broom::tidy(wilcox.test(x$juveniles~x$pallida))})
   
    ### I need to include the facets
    statplot <- do.call(rbind,tmp) %>%
                mutate(Experiment=1:3,pallida=1.5,juveniles=max(data.test$juveniles),
                       label=paste("p =",signif(p.value,2)))

    ###save the plot into an object
    p1 <- ggplot(data.test,aes(x=pallida,y=juveniles)) +
          geom_boxplot() + geom_jitter(height=0,width=0.25) +
          facet_grid(~Experiment) + geom_text(aes(label=label),data=statplot)
    
    ###plot the plot
    p1

Exercise 2.10 (Save plot) Now save the plot, check back at Chapter 1. Use a graphical device to save the plot with dimensions of 7 (width) by 4 (height) inch.

    ### use a graphical device (e.g. pdf())
        print(p1) ### we must use print, as we have a ggplot2 object
    ### close the graphical device

Inspecting data part 2

Correlation analysis

Correlation analysis is useful to discover patterns in large sets of data or between multiple measurements on the same sample (for instance different measurements on the same plant).

As the data is not exactly normally distributed, we also have to deal with that when calculating correlation. For normal (distribution) correlation, you should use the Pearson method. This is the standard setting for cor(). For data that is not normally distributed, the Spearman method should be used. To do this, you need to specify a parameter within a fuction. If you check the function cor() using the ?, you see that you can specify method; cor(x,y,method="Spearman").

    data.test <- pheno_data

    ### Pearson correlation
    cor(data.test$Gpal_vir,data.test$Gpal_avir)
    
    ### Spearman correlation
    cor(data.test$Gpal_vir,data.test$Gpal_avir,method = "spearman")
    
    ### Test of correlation
    ### it is possible you get a warning here; this is not an error.
    cor.test(data.test$Gpal_vir,data.test$Gpal_avir,method = "spearman")

Exercise 2.11 (Correlation analysis interpretation) What do you conclude from the correlation analysis, are the two values correlated? What is the biological interpretation of this result?

Clustering analysis

Clustering is another method to explore data. Instead of patterns that move in the same direction, clustering is useful to apply if you expect consistent value differences in values.

For clustering we use the functions dist() and hclust(). The first function calculates the distance matrix, there are multiple methods to calculate the distance-matrix, here we use the standard method (“euclidean”). The second performs the clustering. This process groups the measurements that are most similar.

Because the dataset is too large to perform clustering on all the measurements, we summarise the data by taking the mean per experiment for the virulent and avirulent populations. To achieve this, we group the data per experiment using dplyr::group_by(), then we summarise using dplyr::summarise(). Then we need to transform the data to long format, for this we use tidyr::gather(). To see what happens, the code below includes a before and after inspection of the dataframe.

    ### To make the distance matrix, we first need to summarise the data
    ### otherwise the matrix becomes too large
    data.test <- dplyr::group_by(pheno_data,Experiment) %>%
                 dplyr::summarise(Gpal_vir=mean(Gpal_vir),Gpal_avir=mean(Gpal_avir))
   
    ###check the shape of the data before gather
    head(data.test)
    
    data.test <- tidyr::gather(data.test,key=population,value=reproduction,-Experiment) 

    ###and after gather
    head(data.test)
    
    
    ### Here we calculate the distance matrix
    dist_mat <- dist(data.test$reproduction,diag=TRUE,upper=TRUE)
    
    ### to give understandable names to the distance matrix we re-write it as matrix
    dist_mat <- as.matrix(dist_mat)
        ### these two lines replace the column and row names in the matrix
        colnames(dist_mat) <- paste(data.test$population,data.test$Experiment)
        rownames(dist_mat) <- paste(data.test$population,data.test$Experiment)
    ### to conduct clustering the matrix is re-written als distance matrix
    dist_mat <- as.dist(dist_mat)
    
    ### here we use the base plot-function to look at the clustering
    plot(hclust(dist_mat))

Exercise 2.12 (Clustering analysis) What do you conclude from the clustering analysis, are there differences between the two G. pallida populations? What is the biological interpretation of this result?

heatmap

A handy visualization tool for correlation analysis as well as clustering analysis, is a heat map. There is a basic function in R, heatmap() that works with matrices. When you run the code below, you get a visualization of the distance matrix. The same can be done with a correlation matrix, but we do not use that today.

    ### To make the distance matrix, we first need to summarise the data
    ### otherwise the matrix becomes too large
    data.test <- dplyr::group_by(pheno_data,Experiment) %>%
                 dplyr::summarise(Gpal_vir=mean(Gpal_vir),Gpal_avir=mean(Gpal_avir)) %>%
                 tidyr::gather(key=population,value=reproduction,-Experiment) 
    
    ### Here we calculate the distance matrix
    dist_mat <- dist(data.test$reproduction,diag=TRUE,upper=TRUE)
    
    ### to give understandable names to the distance matrix we re-write it as matrix
    dist_mat <- as.matrix(dist_mat)
        ### these two lines replace the column and row names in the matrix
        colnames(dist_mat) <- paste(data.test$population,data.test$Experiment)
        rownames(dist_mat) <- paste(data.test$population,data.test$Experiment)
        
    heatmap(dist_mat)

Exercise 2.13 (Plot of correlation) Correlation can best be visualized with a scatter plot. complete the code below to generate a scatterplot of your G. pallida data. Add the correct column name for the y-axis and the right geom to plot all data points.

Code

    data.plot <- pheno_data

    ggplot(data.plot, aes(x = Gpal_vir, y = AAA, colour=Experiment)) + ###replace the AAA with the correct column name
    geom_ +  ### add the geom to make a scatter plot
    scale_color_gradient(low="black",high="grey") + theme_bw()

Some reading

In part, this assignment was inspired by a paper in the journal Genome Biology, by Itai Yanai and Martin Lercher³. Read this paper as food for thought about what we covered today in p-value testing and its role in data analysis. This editorial paper is part of an interesting series of editorials about data science in biology: Night science⁴.

--- title: "From messy to clean" author: "Mark Sterken" date: "2026-02-05" format: html: toc: true toc-depth: 3 code-fold: true code-tools: true link-external-newwindow: true link-external-icon: true editor: source execute: eval: false warning: false message: false --- ## From messy to clean ### Introduction to day 2 Today we will continue with goal 1 (Built basic knowledge on using R and R studio) and start with goal 2 (Be able to recognize and load common data formats) and goal 3 (Apply common statistical tools for inspecting and analyzing data) We will make use of a fragmented simulated dataset. You need to puzzle it together using a couple of functions. Therafter you can analyze it further. ::: {.callout-tip appearance="simple"} ### Potato cyst nematodes, virulent *Globodera pallida* within North-West Europe One of the current projects that is running at [the Laboratory of Nematology](https://www.wur.nl/en/chair-groups/laboratory-of-nematology), is an investigation into the virulence of the potato cyst nematode *Globodera pallida*. Although plant susceptibility is a complex traits, there are 'Mendelian nuggets' to be found that control these parasites. These 'Mendelian nuggets' are loci (e.g. one or a couple of genes) that have a strong effect on a trait. In this case, a particular resistance gene that was bred into potato and which strongly controls the reproduction of *G. pallida*. A PhD in our group, Arno Schaveling (who studied Plant Science at WUR) found out that the Resistance gene GpaV~vrn~ was broadly introduced into potato. This led to massive, parallel selection for virulence of *G. pallida* in the field[@Schaveling2026]. Currently resulting in a virulence outbreak in North-West Europe[@Schaveling2025]. In particular, you can select for virulence in *G. pallida* over subsequent generations in one potato cultivar. Thereafter the ability of the resulting populations to reproduce in many potato cultivars is enhanced. This result meant that the resistance base in potato was relatively narrow. ![Selection on the GpaV~vrn~-containing potato variety Seresta also increases virulence on other commercial potato varieties. (a) The selection experiment that was carried out with four generations of selection on Seresta. The colours indicate the potato variety used (Red Desiree, Orange Seresta), and the circles with the numbers indicate which propagation was used in the standard potato cyst nematode (PCN) resistance test. The numbers indicate the generation of selection. Due to the amount of material required for the standard PCN resistance test, Generation 0 and Generation 1 could not be derived from the selection line Generations 2--4 were on. (b) The outcome of the standard PCN resistance test for selected Generations 1--4. Each dot represents the reproduction (P~f~/P~i~) measured in one 2-l pot inoculated with 10000 larvae. The lines are linear regressions added as a visual aid. Colours indicate varieties, where Seresta and Innovator belong to Cl~SER~ and Libero, Avarna and Festien belong to Cl~FES~. The R~2~ is the amount of variance explained by generation as derived from a PERMANOVA.](Week1_day2_Schaveling1.jpeg){#fig-schaveling1} Next, we were able to identify a region on the *G. pallida* genome that was selected, ultimately identifying the gene *Gp-pat-1*. This gene is currently used as the target for a molecular assay to detect virulence in agricultural fields. ![Single genetic locus is selected after repeated selection on Seresta. (a) Overview of the AMPOP02 (dots) and AMPOP10 (triangles) samples generated in the selection experiment and how many were successfully sequenced. As the larvae from which we isolated DNA underwent either a selection on Desiree (red) or Seresta (orange), we took the final number of generations selected on Seresta along in the analysis. (b) The number of variants that significantly correlate with virulence per scaffold plotted against the scaffold size in million bases (Mb). Dots indicate the variants found in AMPOP02, and triangles indicate variants found in AMPOP10. Text indicates which scaffold of the Rookmaker genome the datapoint belongs to. (c) The significance of the variants on scaffold 28 associated with generation in AMPOP10. The variants are plotted on their location on scaffold 28 (in Mb) vs the significance in −log~10~(P). The variants where the frequency of the alternative allele is decreasing over generations are coloured red, while those where the alternative allele is increasing are coloured blue. Grey variants are not significant. The dashed horizontal line indicates the Bonferroni-corrected threshold. The dashed vertical lines indicate the candidate locus. Note that the y-axis has been cut off at −log~10~(P) \< 2.](Week1_day2_Schaveling2.jpeg){#fig-schaveling2} ::: ### Set up your R environment For the basics of setting up your environment, you can check [Chapter 1](week1/day1.qmd). The Markdown file to use for the exercises and documentation of your code is [here](day2_student_notebook.qmd). ### Loading data into R In this tutorial, we will start with loading different types of data files into R. This is an essential skill for working with real-world data. We'll cover a couple of file-types that are common. *Each of you has their data in these formats*, you can find these via the fileshare on [Brightspace](https://brightspace.wur.nl/). If you are familiar with R, this part might be easy. - Excel files (`.xlsx`) - Comma-separated values (`.csv`) - Semicolon-separated CSV files (`.csv`) - Tab-delimited text files (`.txt` or `.tsv`) (you encountered this filetype yesterday) To make this part of the course realistic, *we have provided you with data in these four formats*. The reason being that this is also how it works in practice. Finding out the format of the data and loading it into memory is a crucial step in using any data-analysis pipeline. You now need to load in data from all four data formats. ::: {#exr-excel} #### Loading data from an excel file (.xlsx) To read Excel files, we use the `read_excel()` function from the readxl package. Excel files are very common but also very tricky. Namely, these files are mostly made by persons and people *do not automatically follow conventions* that are clear for a programming language. Therefore, always check the excel file before loading it in. ```{r w1d2_load_xlsx, include = TRUE,eval=FALSE, class.source = 'fold-show'} #| code-fold: false ### Package install.packages("readxl") # For reading Excel files library(readxl) # first activate the package # Replace with the path to your Excel file data_excel <- readxl::read_excel("data/YOURFILENAME.xlsx") ### Check after loading head(data_excel) ``` ::: ::: {#exr-csv} #### Loading data from a CSV File (Comma-separated) CSV files are very common as format for data. Here each column is separated by a comma, rows are typically observations. We use the `read.csv()` function to load the data. ```{r w1d2_load_csv, include = TRUE,eval=FALSE, class.source = 'fold-show'} #| code-fold: false # Replace with the path to your csv file data_csv <- read.csv("data/YOURFILENAME.csv") ### Check after loading head(data_csv) ``` ::: ::: {#exr-csv2} #### Loading data from a CSV file (semicolon-separated) Some CSV files (especially from European systems) use a semicolon (;) instead of a comma. Use `read.csv2()` for these files ```{r w1d2_load_csv2, include = TRUE,eval=FALSE, class.source = 'fold-show'} #| code-fold: false # Replace with the path to your csv file data_csv2 <- read.csv2("data/YOURFILENAME.csv") ### Check after loading head(data_csv2) ``` ::: ::: {#exr-tab} #### Loading a Tab-delimited File These files use tabs to separate values. They often have a .txt or .tsv extension. ```{r w1d2_load_tabdelim, include = TRUE,eval=FALSE, class.source = 'fold-show'} #| code-fold: false # Replace with the path to your csv file data_tsv <- read.delim("data/YOURFILENAME.txt") ### Check after loading head(data_tsv) ``` ::: ## Analysis of (a)virulence of *Globodera pallida* This set of instructions follows the data analysis cycle, so it also includes setting up the R environment as well as loading in the data. If all is well, you already prepared this when going through the code above. ### Setup R Before we start, make sure you have set the work directory and have the following packages ready: {tidyverse} and {readxl}. Of course you also need to activate them. If you're not sure how, check [Chapter 1](/week1/day1.qmd). ```{r w1d2_packages_readxl} #| code-fold: false ### type the code to install the packages ### type the code to activate the packages ``` ### Load data Load your data for the tutorial into R, see the exercise above. ### Inspecting data part 1 #### Using functions Last time we used `head()`, `tail()`, and `summarise()`. Again apply these functions. But now we also want to check which piece is which. For that the functions `length`, `ncol()`, `nrow()`, and `dim()` are handy. ```{r w1d2_check_the_data, include = TRUE,eval=FALSE} #| code-fold: false # Check the first 6 rows head(data_csv) # Check the last 6 rows tail(data_csv) # get a summary of the data summary(data_csv) # get the length of a vector length(data_csv) # get the number of columns ncol(data_csv) # get the number of rows nrow(data_csv) # get the dimensions of an object dim(data_csv) ``` ::: {#exr-check} #### Which file is which puzzle piece? Inspect the four files you have using `dim()`, `nrow()`, and `ncol()`. Which piece is which? ```{r} #| code-fold: false ###Type the code to check these three properties of each file ``` ![The puzzle pieces and the functions you will need to combine them further on in the excercise.](Week1_day2_Puzzle.png) Answer the following questions (note this is *random* for each personal dataset!). You can deduce which piece is which via their dimensions. 1. Which 2 files are piece 1 and 2? 2. which file is piece 3? 3. Which file is piece 4? ::: ### Cleaning and preparing the data Now you have the dimensions of the puzzle pieces. You can combine them in only one way. ::: {#exr-combine} #### Combine the data First you need to join the two pieces (1 and 2) with the equal number of rows using `cbind()`. Second, you need to add the piece (3) that now has the same number of columns using `rbind()`. Finally, you need to add the last piece (4) of data using `merge()`. Complete the code below ```{r w1d2_combine_pieces_ASSIGNMENT, include = TRUE,eval=FALSE} #| code-fold: false ### bind objects together over columns pheno_data <- cbind(YOURPIECE1,YOURPIECE2) ### binds objects together over rows pheno_data <- rbind(pheno_data,YOURPIECE3) ### merges objects, seeks for the same identitier, or you can provide it using by.x and by.y pheno_data <- merge(pheno_data,YOURPIECE4) ``` ::: #### Inspecting data using histograms After combining, it is smart to inspect the data again to check if everything went well. You can use the same functions as before or, for instance, use `geom_histogram()`. ```{r w1d2_make_a_ggplot2_histogram_ASSIGNMENT, include = TRUE,eval=FALSE} #| code-fold: false ggplot2::ggplot(pheno_data,aes(x=Gpal_vir)) + geom_histogram() ggplot2::ggplot(pheno_data,aes(x=Gpal_avir)) + geom_histogram() ``` ### Analyzing You now have the data complete. As you can see we have two measurements linked to one plant ID. The reason is that this data represents various inbred lines of potato plants tested for both virulent and avirulent *G. pallida* populations. #### Normal distribution Same as before, let's figure out whether the data is normally distributed. For this we can use the functions `stat_qq()` and `stat_qq_line()`. Find the code to do this in [Chapter 1](/week1/day1.qmd) and check the data in the `Gpal_vir` and `Gpal_avir` columns. ```{r w1d2_qqplot_pallida_ASSIGNMENT, include = TRUE,eval=FALSE} #| code-fold: false ### Add the code to make the qqplot for Gpal_vir ### Add the code to make the qqplot for Gpal_avir ``` Of course, you can also use a Shapiro-Wilk test (`shapiro.test`), this calculates the likelyhood whether the data follows a normal distribution. ```{r w1d2_shapiro_pallida_ASSIGNMENT, include = TRUE,eval=FALSE} shapiro.test(pheno_data$Gpal_vir) shapiro.test(pheno_data$Gpal_avir) ``` ::: {#exr-normal_dist} #### Does the data follow a normal distribution? You now conducted two types of analyses that tell you whether the data is normally distributed. (1) what do you see in the qqplots, and what does it mean? (2) what are the p-values from the Shapiro-Wilk test? (3) what does a p-value mean? ::: #### Wilcoxon test As the data is not exactly normally distributed, let's deal with that in the proper way. Now we can use a Wilcoxon Rank Sum test to compare between two samples. We first need to transform the dataset, as we do not have a single observation per row currently. For this you can use the `gather()` function from the {tidyr} package. Use functions like `head()` or `tail()` to check what happened. ```{r w1d2_gather_pallida, include = TRUE,eval=FALSE, class.source = 'fold-show'} #| code-fold: false data.test <- tidyr::gather(pheno_data,key="pallida",value="juveniles",-ID,-Experiment) ``` Now we can test how the potato plants performed against virulent versus avirulent *G. pallida*. Using `wilcox.test()` ```{r w1d2_Wilcox_pallida_ASSIGNMENT, include = TRUE,eval=FALSE} #| code-fold: false wilcox.test(juveniles~pallida,data=data.test) ###you should have a very low p-value ``` ::: {#exr-ttest} #### How to interpret a p-value? You now have tested a null-hypothesis using a non-parametric test and found that the chance his data agrees with this null-hypothesis is very small. What is the *biological interpretation* of this p-value? ::: ### Presenting #### Boxplot Can you now also make a plot of what you just tested? Use `geom_boxplot()` to visualize the distribution and `geom_jitter()` to show the individual measurements. ```{r w1d2_wilcoxplot_pallida_ASSIGNMENT, include = TRUE,eval=FALSE} #| code-fold: false ggplot(data.test,aes(x=pallida,y=juveniles)) + geom_boxplot() + geom_jitter(height=0,width=0.25) ``` ::: {#exr-facet_grid} #### Boxplot with facet_grid() There is another variable in the data as well: 'Experiment'. Use `facet_grid()` to split this out. Figure out how using the [ggplot2 website](https://ggplot2.tidyverse.org/). ```{r w1d2_wilcoxplot_facet_pallida_ASSIGNMENT, include = TRUE,eval=FALSE} #| code-fold: false ggplot(data.test,aes(x=pallida,y=juveniles)) + geom_boxplot() + geom_jitter(height=0,width=0.25) + facet_grid() ### add something in-between the brackets. ``` ::: #### Fancier boxplot Now we can add p-value as well using `annotate()`. ```{r w1d2_wilcoxplot_pval_pallida_ASSIGNMENT, include = TRUE,eval=FALSE} #| code-fold: false ###I'm getting the statistics here tmp <- tapply(data.test[,c("pallida","juveniles")],data.test$Experiment,function(x){ broom::tidy(wilcox.test(x$juveniles~x$pallida))}) ### I need to include the facets statplot <- do.call(rbind,tmp) %>% mutate(Experiment=1:3,pallida=1.5,juveniles=max(data.test$juveniles), label=paste("p =",signif(p.value,2))) ###save the plot into an object p1 <- ggplot(data.test,aes(x=pallida,y=juveniles)) + geom_boxplot() + geom_jitter(height=0,width=0.25) + facet_grid(~Experiment) + geom_text(aes(label=label),data=statplot) ###plot the plot p1 ``` ::: {#exr-saveplot} #### Save plot Now save the plot, check back at [Chapter 1](/week1/day1.qmd). Use a graphical device to save the plot with dimensions of 7 (width) by 4 (height) inch. ```{r w1d2_pdf_of_pallida_plot, include = TRUE, eval=FALSE} #| code-fold: false ### use a graphical device (e.g. pdf()) print(p1) ### we must use print, as we have a ggplot2 object ### close the graphical device ``` ::: ### Inspecting data part 2 #### Correlation analysis Correlation analysis is useful to discover patterns in large sets of data or between multiple measurements on the same sample (for instance different measurements on the same plant). As the data is not exactly normally distributed, we also have to deal with that when calculating correlation. For normal (distribution) correlation, you should use the Pearson method. This is the standard setting for `cor()`. For data that is not normally distributed, the Spearman method should be used. To do this, you need to specify a parameter within a fuction. If you check the function `cor()` using the `?`, you see that you can specify `method`; `cor(x,y,method="Spearman")`. ```{r w1d2 correlation analysis ASSIGNMENT, include = TRUE,eval=FALSE} #| code-fold: false data.test <- pheno_data ### Pearson correlation cor(data.test$Gpal_vir,data.test$Gpal_avir) ### Spearman correlation cor(data.test$Gpal_vir,data.test$Gpal_avir,method = "spearman") ### Test of correlation ### it is possible you get a warning here; this is not an error. cor.test(data.test$Gpal_vir,data.test$Gpal_avir,method = "spearman") ``` ::: {#exr-correlation} #### Correlation analysis interpretation What do you conclude from the correlation analysis, are the two values correlated? What is the *biological* interpretation of this result? ::: #### Clustering analysis Clustering is another method to explore data. Instead of patterns that move in the same direction, clustering is useful to apply if you expect consistent value differences in values. For clustering we use the functions `dist()` and `hclust()`. The first function calculates the distance matrix, there are multiple methods to calculate the distance-matrix, here we use the standard method ("euclidean"). The second performs the clustering. This process groups the measurements that are most similar. Because the dataset is too large to perform clustering on all the measurements, we summarise the data by taking the mean per experiment for the virulent and avirulent populations. To achieve this, we group the data per experiment using `dplyr::group_by()`, then we summarise using `dplyr::summarise()`. Then we need to transform the data to long format, for this we use `tidyr::gather()`. To see what happens, the code below includes a before and after inspection of the dataframe. ```{r w1d2 clustering analysis ASSIGNMENT, include = TRUE,eval=FALSE} #| code-fold: false ### To make the distance matrix, we first need to summarise the data ### otherwise the matrix becomes too large data.test <- dplyr::group_by(pheno_data,Experiment) %>% dplyr::summarise(Gpal_vir=mean(Gpal_vir),Gpal_avir=mean(Gpal_avir)) ###check the shape of the data before gather head(data.test) data.test <- tidyr::gather(data.test,key=population,value=reproduction,-Experiment) ###and after gather head(data.test) ### Here we calculate the distance matrix dist_mat <- dist(data.test$reproduction,diag=TRUE,upper=TRUE) ### to give understandable names to the distance matrix we re-write it as matrix dist_mat <- as.matrix(dist_mat) ### these two lines replace the column and row names in the matrix colnames(dist_mat) <- paste(data.test$population,data.test$Experiment) rownames(dist_mat) <- paste(data.test$population,data.test$Experiment) ### to conduct clustering the matrix is re-written als distance matrix dist_mat <- as.dist(dist_mat) ### here we use the base plot-function to look at the clustering plot(hclust(dist_mat)) ``` ::: {#exr-correlation} #### Clustering analysis What do you conclude from the clustering analysis, are there differences between the two *G. pallida* populations? What is the *biological* interpretation of this result? ::: #### heatmap A handy visualization tool for correlation analysis as well as clustering analysis, is a heat map. There is a basic function in `R`, `heatmap()` that works with matrices. When you run the code below, you get a visualization of the distance matrix. The same can be done with a correlation matrix, but we do not use that today. ```{r w1d2 heatmap, include = TRUE,eval=FALSE} #| code-fold: false ### To make the distance matrix, we first need to summarise the data ### otherwise the matrix becomes too large data.test <- dplyr::group_by(pheno_data,Experiment) %>% dplyr::summarise(Gpal_vir=mean(Gpal_vir),Gpal_avir=mean(Gpal_avir)) %>% tidyr::gather(key=population,value=reproduction,-Experiment) ### Here we calculate the distance matrix dist_mat <- dist(data.test$reproduction,diag=TRUE,upper=TRUE) ### to give understandable names to the distance matrix we re-write it as matrix dist_mat <- as.matrix(dist_mat) ### these two lines replace the column and row names in the matrix colnames(dist_mat) <- paste(data.test$population,data.test$Experiment) rownames(dist_mat) <- paste(data.test$population,data.test$Experiment) heatmap(dist_mat) ``` ::: {#exr-correlation} #### Plot of correlation Correlation can best be visualized with a scatter plot. complete the code below to generate a scatterplot of your *G. pallida* data. Add the correct column name for the y-axis and the right geom to plot all data points. ```{r w1d2 correlation plot, include = TRUE,eval=FALSE} data.plot <- pheno_data ggplot(data.plot, aes(x = Gpal_vir, y = AAA, colour=Experiment)) + ###replace the AAA with the correct column name geom_ + ### add the geom to make a scatter plot scale_color_gradient(low="black",high="grey") + theme_bw() ``` ::: ### Some reading In part, this assignment was inspired by a paper in the journal Genome Biology, by Itai Yanai and Martin Lercher[@Yanai2020]. Read this paper as food for thought about what we covered today in p-value testing and its role in data analysis. This editorial paper is part of an interesting series of editorials about data science in biology: [Night science](https://link-springer-com.ezproxy.library.wur.nl/collections/cejajaicde)[@Yanai2019].