3 Your digital notebook

Author

Mark Sterken

Published

February 5, 2026

3.1 Your digital notebook

Introduction to day 3

Today we will continue with goal 1 (Built basic knowledge on using R and R studio) and goal 3 (Apply common statistical tools for inspecting and analyzing data). We will also start with goal 4 (Be able to document your code in a clear & concise way). You will find out that we’ve already covered this by requiring you to write a script. Now we’ll add one more layer to it, the notebook.

New today, is that you get an assignment. In this assignment, you are expected to go over goal 5 [Apply the data-science cycle (load, inspect, ~~clean~~, analyse, present)]. Also, this is not new, we’ve been doing it for the past two days.

As on the previous days, we will be making use of a real data-case.

Plant growth impairment by nematodes.

You already know that plant susceptibility to nematodes is a complex trait, with a lot of variation in susceptibility between individuals of the same species. This variation also exists for the dose-response relation between plants and nematodes. The data you will be using today is derived from a study by Jaap-Jan Willig, who used a camera platform to track the growth of 960 A. thaliana plants simultaneously¹.

Figure 3.1: The relationship between the inoculation density (P_i) of *H. schachtii* and green canopy area (relative yield). Nine-day-old Arabidopsis seedlings were inoculated with 20 different densities of *H. schachtii* juveniles (0–100 J2s g^–1 of dry sand) in 200 ml pots containing 200 g of dry sand. (A) Experimental set up and colour channel decomposition of photos to extract the green leaf surface from the images. (B) Average growth curve of Arabidopsis plants inoculated with different inoculum densities of *H. schachtii* from 0 to 21 dpi. Line fitting was based on a LOESS regression. The red box indicates the data that are used for fitting the Seinhorst yield loss equation (n=10–24). (C) Fitting according to the Seinhorst yield loss equation. Parameter values for Seinhorst’s equation for the relationship between initial population density (P_i) of *H. schachtii* and measured leaf surface area. P_i and tolerance limit (TSYLM) are expressed in *H. schachtii* (per g of dry sand) while, the minimal yield (m) is the lowest proportion of the maximum green canopy area (cm²) (Ymax) at 21 dpi. The goodness of fit of the model on the data is expressed as the coefficient of determination (R²).

The goal of this research was to develop a platform that allowed us to track in real-time plant growth under below-ground biotic stress caused by nematode infection. Doing this experiment, we identified a compensation response where, at low levels of infection, plants perform better than in uninfected conditions.

Set up your R environment

For the basics of setting up your environment, you can check Chapter 1.

Quarto markdown

The use of a markdown document

The last two days you have been working with .qmd files (Quarto markdown). This is a flavor of markdown, which is a markup language that relies on simple text to produce a layout of a document. As you have a lab-journal for experimental work in the field, greenhouse, or laboratory, a markdown notebook is useful to have when writing code.

The Quarto markdown file you make today for the assignment, should be handed in on Brightspace today, before the deadline. You hand in the .qmd file. Your reviewer will go through the file and check if it renders and whether your conclusions are correct.

Exercise 3.1 (Make your own markdown file) In RStudio go to File > New File > Quarto document…

Based on the documents provided last two days (Chapter 1 and Chapter 2), make your own markdown file. Include meta information (title, name, registration number) and headings for: setup of the R environment (working directory, packages), and the five steps in the data analysis cycle (see Chapter 1).

You can choose whether it renders a .html or a .pdf document. For now, settle on .html, this typically renders well and does not require many additional programs. You probably will need pandoc.

You can find general tips on how to write in markdown in Appendix QMD notebook cheatsheet.

Typically, when you make a markdown file, you can make errors. RStudio tries to make you aware of these errors by putting a notification at the line numbers. But some errors you only catch when you render the document. Or when you run code line-by-line.

Many errors were made by me - and the other teachers - in building the book you are now reading. Do ask us about our frustrations in working with R.

3.2 Analysis of plant-growth data

Today we want to cover one more aspect of data analysis, namely including the fit of a simple model into a plot. In this case, we will use some plant-growth data. And to be precise, we will use the linear part of the growth curve. It is perfectly possible to fit a growth-model within R, but we want to start a bit simpler with an actual linear curve. you can find the data for this assignment via the fileshare on Brightspace. You need the general file under day 3, not your personal dataset for the code-review assignment.

Exercise 3.2 (Loading data into R) Load the data using load(). We have not used this function before. Use ?load() to figure out how this function works.

    ### here add the code to read the data

Note that the name of the object you loaded into memory is not the same as the name of the file you have on your hard disk. The reason is that the filename is inherited from my R-environment. Also note that this can be a major source of annoyance; namely we have been using object name pheno_data consistently. If you were to run load(), that would also mean you overwrite any other file with that name in the R environment.

Up till now, we have been avoiding talking about many of the programming aspects of working with R. But, grasshoppers, I’m afraid it is due.

There are some things you probably noticed by now: any wrong wording, comma’s or other ‘unintended mistakes’ from your side are immediately recognized by R and rewarded with an ‘error’. There are also so-called ‘warnings’. The difference is that ‘error’ results in code that does not work (e.g. you will not get the specific calculation/processing executed) and a ‘warning’ means that you might have done something that was not intended (but you will get output). Next week in Chapter 5 we cover this more thoroughly.

The most important step in data-analysis is checking after an operation. Make sure that what you did was what you intended and that what comes back is correct. How often have we, your teachers, not wept when we ‘quickly’ skipped this and found out that the output we got was incorrect.

There are some general measures you can take to reduce risks:

Always start with a clean environment. At exit, do not save the environment, but start afresh. This ensures that you know which packages you loaded, which files you need to collect, which variables you need to define.
When writing a pipe (%>%) or any complex set of operations, check each step individually. Was the outcome of your function as intended?
Do a pilot. If you run a calculation that takes a long time, run it on a small subset of your data and thoroughly check the output.
Know what you do, especially with statistics. Always visualize the outcome of the analysis if you can. Otherwise it might be that you take a trip into statistical winter-wonder land.
Be wary of the class factors. Factors are indexed sets of data. If this indexed set of data is a number, the number on which the calculation is performed is the index number, not the actual number you see. In this case, you need to transform the factor to an actual number. You can check if this might be a problem using the function class().

Inspecting data

Using functions

Inspect the data using the same functions as in Chapter 2. Also, these functions you can find in Appdix Base R cheatsheet.

Code

    head(pheno_data)
    tail(pheno_data)
    summary(pheno_data)
    length(pheno_data)
    ncol(pheno_data)
    nrow(pheno_data)
    dim(pheno_data)

Exercise 3.3 (Inspecting data) Answer these three questions:

What does the head() function tell you about an object?
Similarly, what does the tail() function tell you about an object?
If you would not know this, what is the command you could execute within the R console to find out?

Cleaning and preparing the data

Not needed here today. We keep this heading in for completeness though.

Analyzing

Now the fun starts. We pick up on our old friend cor(). To determine which kind of correlation we need to perform, we need to figure out the statistical distribution of the data.

Normal distribution

Same as before, let’s figure out whether the data is normally distributed. If you don’t know the two tests anymore, check back at Chapter 2.

    ###Use the graphical method

    ###Use the statistical method

Exercise 3.4 (Does the data follow a normal distribution?) You now conducted two types of analyses that tell you whether the data is normally distributed. What is your assessment, is it normally distributed?

Correlation analysis

Yesterday we conducted an analysis of correlation using the cor() function. There is actually another way to conduct this type of analysis, namely using a linear model with lm(). The advantage of this method is that we will get a slope and an intercept.

We can use the function summary() to get the fit of the model.

    data.test <- pheno_data
    
    ### here the linear model is done
    tested <- lm(value~dpi,data=data.test)
    
    ### use summary to get the fit
    summary(tested)

The summary() function is a very versatile function, that can generate outputs from many objects. In this case, it presents the analysis from the linear model, which in essence fits a correlation line. The Estimate for the intercept of that line (-2.97528) and the slope (0.50663) are reported. Also the fit of the model (the Adjusted R-squared: 0.8274) and of the parameters. For instance, there is a significant correlation between day post infection (dpi) and the leaf area (value); p < 2e-16.

Presenting

Scatter plot

Now, let’s make a proper plot of correlation. Of course, you can find instructions in both Chapter 1 and Chapter 2. Also, check Appendix GGplot cheatsheet.

    data.plot <- pheno_data

    ###I'm getting the statistics here
    statplot <- broom::tidy(summary(lm(value~dpi,data=data.plot)))

    p1 <- ggplot(data.plot, aes(x = dpi, y = value)) +
          geom_point() + geom_smooth(method = "lm") + 
          theme_bw() + annotate("text", 
                                x=10, 
                                y=max(data.plot$value), 
                                label=paste("p =", signif(statplot$p.value[2], 2)))
    
    p1

Exercise 3.5 (Is the plant-growth correlated with time?) What is the biological interpretation of the correlation you calculated and observed?

What does the intercept represent?
What does the slope represent?
Is a linear model a good fit for plant growth in general?

3.3 Assignment for code-review week 1

The assignment

Via the fileshare on Brightspace, you can find your dataset for the code review. Everyone of you has their own folder, with their own dataset. This means that the conclusions you draw will be based on your individual dataset. We expect that you hand in a .html file generated using a Quarto Markdown (QMD) file you wrote yourself in R. Within the .html file you provide the code to:

Setup of the R environment
Load the data
Inspect the data
‘Clean’ the data
Analyse the data
Present the data

Also, you should provide a conclusion about the biological interpretation of the data. The exercises below take you through the assignment step-by step. Make sure that for the R steps you include code chunks. If you print figures or execute functions within these blocks, the output will be included in the .html.

Do not wait untill you are finished to make the first render. At the very start, when your file is still pristine (e.g without code chunks but after adapting a .qmd file from Chapter 1 or Chapter 2), do a check on whether it renders correctly. You render the document by hitting the ‘Render’ button on top of the top-left pane in RStudio.

Note that in day 1 and day 2 the parameter eval is set to false in the yaml header (this is the text at the start of the document between the --- and ---). This means that in the .html generated, none of the code blocks are executed. This is a good option to prevent R coding errors from rendering the document. However, if you want to include a figure, you need to set this option to true within that code-chunk #| eval: true.

If you need more tips in general on QMD, check Appendix QMD notebook cheatsheet.

Exercise 3.6 (Setup of the R environment) Set the working directory and install and activate the required packages to run the code within the file.

While going through the exercises, you might encounter steps requiring you to install additional packages. Simply add them to this section of the code one-by-one. You help yourself by being explicit about which package you use (e.g. dplyr:: or tidyr::).

Exercise 3.7 (Load the data) Load the two data files into the R environment using the appropriate read function. Make sure that your name is included here, so the reviewers can check if your code works. That way your reviewer can find your datafiles.

The data consists of the growth rates of A. thaliana Col-0 when inoculated with either of two plant parasitic nematodes¹. The two nematode species tested are the beet-cyst nematode Heterodera schachtii and the root-knot nematode Meloidogyne incognita. Within your personal folder you find the data of the experiment conducted at a particular infection density (P_i) in nematodes g^-1. The measured trait is the day-by-day growth rate, measured between two consecutive days post infection (dpi). For instance between day 2 and day 3. The dpi within the data file would be 2.5 in that case. This data is given per plant.

Exercise 3.8 (Inspect) Inspect the data using functions (no plot required). Based on the output indicate which bind function you need to use to combine the two objects into one.

We covered the data inspection functions in Chapter 1 and Chapter 2) under the heading Inspecting data.

Exercise 3.9 (Clean; combine) Use a ‘bind’ function to combine the two objects into one.

We covered the bind functions in Chapter 2) when combining the four objects.

Exercise 3.10 (Analyse) Compare growth rates of the A. thaliana plant when infected with either H. schachtii or M. incognita, analyse whether these differ using a test that does not assume a normal distribution. Report at least the p-value.

We do not ask you to check for normality for this assignment, as that dataset overall is not normally distributed, we give you that your subset is also not. Note that in the paper We deal with this by using a non-parametric test as well¹.

Quantification of the tolerance limit of Col-0 to Meloidogyne incognita based on growth rate. Nine-day-old Arabidopsis seedlings were inoculated with 18 different densities (Pi) of M. incognita juveniles (0–100 juveniles g^–1 of dry sand. (A) Green canopy area of plants treated with P_i=0 (blue line) or P_i=5 (aquamarine line) from 0 to 21 dpi. (B) Rate of growth of plants treated with P_i=0 (blue line) or P_i=5 (aquamarine line) from 0 to 21 dpi was calculated using Equation 2 in the manuscript. (C) Rate of growth of plants inoculated with increasing P_i values growing from 5 to 16 dpi. (D) Rate of growth of plants inoculated with increasing P_i values growing from 17 to 21 dpi. (E) Cumulative growth rate of plants inoculated with increasing Pi values. (C–E) Dots represent individual plants. Data were analysed with a Wilcoxon rank sum test. ns=not significant, * P<0.05, ** P<0.01, *** P<0.001 (n=10–12).

Exercise 3.11 (Present) Make a ggplot2:: boxplot showing the two nematode species on the x-axis and the growth rate on the y-axis.

We covered the boxplot function in Chapter 1) in the presentation section. Also, you can find how to in Appendix GGplot cheatsheet.

Now you have analysed the difference in growth rates between the two species at a particular density at a particular day of the experiment. Now it is time to draw conclusions.

Exercise 3.12 (Interpret) What is the biological interpretation of the boxplot and the wilcoxon-test p-value?

Once you have worked through all these excercises related to the assignment, you can render the .html file. Check the file (is everything included?). If you run into errors, the ‘Background Jobs’ window will indicate where the rendering has halted.

How to do code-review

During code-review, we try to establish if the code that is written:

is well-organized
is well-documented
properly used
works

Next to that, we also check whether the conclusions drawn from the analysis and figure are valid.

How to give peer-feedback

You need to review 2 coding assignments of other students within the course. Your teachers will check your feedback. In the first week, you will get an indication from us whether you give sufficient and clear feedback. You need to get a ‘pass’ for the assignments in week 2-4.

What is feedback?

Feedback literally means ‘giving back’. It is not an euphemism for criticism. Rather, feedback can be described as: information regarding the way my message or behaviour (or coding and interpreting the outcome of data-analysis in this case) has been received and interpreted. Feedback stimulates you to think about and implement possibilities to improve.

Feedback is critical for your own improvement. Peer feedback relies on giving and receiving feedback. Receiving feedback allows you to develop your competences and skills. Well-given feedback can help you asses your strengths and weak points, and allows you to address the weak points from the basis of your strengths. However, giving feedback also ask you to be critical and recognize certain aspects of data analysis, which you can implement again in your own process of analyzing data in R.

How to give feedback

When giving feedback it is important to:

Balance your feedback. Mention both positive and constructive aspects (Tops and Tips). It is equally important to know your strengths, as it is to know your weaknesses. Moreover, it is motivating to receive positive feedback.
Be specific. Feedback is not helpful to the receiver when the feedback is not specifically formulated. Comments such as: “I think it is bad code and it needs revision” does not tell the receiver why it is a bad code. Instead, “I think you arrive at your result in more steps then necessary, I suggest you use function x instead of function y and function z” is much more helpful.
Connect concrete suggestions for improvement when providing constructive feedback. This can be quite challenging, and requires you to place yourself in the shoes of the someone else. This is very helpful for the receiver, as they receive help on how to improve.
Provide examples. Link your feedback to specific pieces of code or sentences in the document. In Feedbackfruits you can highlight text and make comments.
Formulate your feedback to what you observe (objectively) in an open way (without judgement or premature conclusions). For instance: “I noticed that you used the same object name for two different datasets, did you do this for a certain reason?” You can also describe the effect it has on you: “For me, it would be more logical if you use readxl::read_xlsx() to load an .xlsx file instead of read.delim().”
Do not generalize. Avoid words like always/never in your feedback e.g. “You always forget to indicate which package a function comes from…”

Well organized code

There are many ways in which you can write well-organized code. What all have in common is consistency. Here, I teach you a way of doing this. These are the rules that I generally follow. I would like to stress that these are guidelines. These guidelines help others to read and understand your code. However, they also help you when you need to re-use or re-run your code.

Check:

Are clear section-level headings used?
are hierarchies indicated with a tab?
from non-base R functions, is the package indicated (e.g. dplyr::)?
do objects have a clear, understandable, name?

Well documented code

When writing code, leave comments. Especially when you’re doing something new / special this is very helpful to you in the future. But also to others that will check or even use your code. It can be very handy to indicate why you use a particular function for instance.

Check:

Are comments included in the code?
Above code chunks, is there an explanation of what is done and why?

Properly used code

This checks whether the functions are used as intended. For instance, you can run a correlation analysis on numbers that are of class factor (check this by using class() on a vector/data frame column). However, correlations should be done using actualy numbers is.numeric() should return TRUE. Another example is whether the correct statistical test was used given the data?

Check:

Are the correct functions / inputs used?

Working code

Check:

can you run the code and get the same output?

The Biological interpretation

Do you agree with the biological interpretation given? Note that all of you have different days and P_i’s to compare. Thus, the conclusions will probably differ amongst yourselves!

Some reading

Yesterday you might have been convinced that a hypothesis is a liability², today you may convince yourself otherwise. There is a nice critique written of the provocative paper you read yesterday. The authors in the paper ‘The data-hypothesis relationship’ will convince you that a hypothesis is not such a bad idea after all³.

--- title: "Your digital notebook" author: "Mark Sterken" date: "2026-02-05" format: html: toc: true toc-depth: 3 code-fold: true code-tools: true link-external-newwindow: true link-external-icon: true editor: source execute: eval: false warning: false message: false --- ## Your digital notebook ### Introduction to day 3 Today we will continue with goal 1 (Built basic knowledge on using R and R studio) and goal 3 (Apply common statistical tools for inspecting and analyzing data). We will also start with goal 4 (Be able to document your code in a clear & concise way). You will find out that we've already covered this by requiring you to write a script. Now we'll add one more layer to it, the notebook. New today, is that you get an assignment. In this assignment, you are expected to go over goal 5 \[Apply the data-science cycle (load, inspect, ~~clean~~, analyse, present)\]. Also, this is not new, we've been doing it for the past two days. As on the previous days, we will be making use of a real data-case. ::: {.callout-tip appearance="simple"} ### Plant growth impairment by nematodes. You already know that plant susceptibility to nematodes is a complex trait, with a lot of variation in susceptibility between individuals of the same species. This variation also exists for the dose-response relation between plants and nematodes. The data you will be using today is derived from a study by Jaap-Jan Willig, who used a camera platform to track the growth of 960 *A. thaliana* plants simultaneously[@Willig2023]. ![The relationship between the inoculation density (P~i~) of *H. schachtii* and green canopy area (relative yield). Nine-day-old Arabidopsis seedlings were inoculated with 20 different densities of *H. schachtii* juveniles (0--100 J2s g^--1^ of dry sand) in 200 ml pots containing 200 g of dry sand. (A) Experimental set up and colour channel decomposition of photos to extract the green leaf surface from the images. (B) Average growth curve of Arabidopsis plants inoculated with different inoculum densities of *H. schachtii* from 0 to 21 dpi. Line fitting was based on a LOESS regression. The red box indicates the data that are used for fitting the Seinhorst yield loss equation (n=10--24). (C) Fitting according to the Seinhorst yield loss equation. Parameter values for Seinhorst's equation for the relationship between initial population density (P~i~) of *H. schachtii* and measured leaf surface area. P~i~ and tolerance limit (TSYLM) are expressed in *H. schachtii* (per g of dry sand) while, the minimal yield (m) is the lowest proportion of the maximum green canopy area (cm^2^) (Ymax) at 21 dpi. The goodness of fit of the model on the data is expressed as the coefficient of determination (R^2^).](Week1_day3_Willig.jpeg){#fig-willig} The goal of this research was to develop a platform that allowed us to track in real-time plant growth under below-ground biotic stress caused by nematode infection. Doing this experiment, we identified a compensation response where, at low levels of infection, plants perform better than in uninfected conditions. ::: ### Set up your R environment For the basics of setting up your environment, you can check [Chapter 1](/week1/day1.qmd). ### Quarto markdown #### The use of a markdown document The last two days you have been working with .qmd files ([Quarto markdown](https://quarto.org/docs/guide/)). This is a flavor of markdown, which is a markup language that relies on simple text to produce a layout of a document. As you have a lab-journal for experimental work in the field, greenhouse, or laboratory, a markdown notebook is useful to have when writing code. The Quarto markdown file you make today for the assignment, should be handed in on [Brightspace](https://brightspace.wur.nl/) today, before the deadline. You hand in the .qmd file. Your reviewer will go through the file and check if it renders and whether your conclusions are correct. ::: {#exr-MakeQmd} #### Make your own markdown file In RStudio go to File \> New File \> Quarto document... Based on the documents provided last two days ([Chapter 1](/week1/day1.qmd) and [Chapter 2](/week1/day2.qmd)), make your own markdown file. Include meta information (title, name, registration number) and headings for: setup of the R environment (working directory, packages), and the five steps in the data analysis cycle (see [Chapter 1](/week1/day1.qmd)). You can choose whether it renders a .html or a .pdf document. For now, settle on .html, this typically renders well and does not require many additional programs. You probably will need [pandoc](https://pandoc.org/). You can find general tips on how to write in markdown in [Appendix QMD notebook cheatsheet](/appendices/qmd-notebooks-cheatsheet.qmd). ::: ::: {.callout-tip appearance="simple"} Typically, when you make a markdown file, you can make errors. RStudio tries to make you aware of these errors by putting a notification at the line numbers. But some errors you only catch when you render the document. Or when you run code line-by-line. Many errors were made by me - and the other teachers - in building the book you are now reading. Do ask us about our frustrations in working with R. ::: ## Analysis of plant-growth data Today we want to cover one more aspect of data analysis, namely including the fit of a simple model into a plot. In this case, we will use some plant-growth data. And to be precise, we will use the linear part of the growth curve. It is perfectly possible to fit a growth-model within R, but we want to start a bit simpler with an actual linear curve. you can find the data for this assignment via the fileshare on [Brightspace](https://brightspace.wur.nl/). You need the general file under day 3, not your personal dataset for the code-review assignment. ::: {#exr-load_robj} ### Loading data into R Load the data using `load()`. We have not used this function before. Use `?load()` to figure out how this function works. ```{r w1d3_load} #| code-fold: false ### here add the code to read the data ``` ::: Note that the name of the object you loaded into memory is not the same as the name of the file you have on your hard disk. The reason is that the filename is inherited from my R-environment. Also note that this can be a major source of annoyance; namely we have been using object name `pheno_data` consistently. If you were to run `load()`, that would also mean you *overwrite* any other file with that name in the R environment. ::: {.callout-note appearance="simple"} Up till now, we have been avoiding talking about many of the programming aspects of working with R. But, grasshoppers, I'm afraid it is due. There are some things you probably noticed by now: any wrong wording, comma's or other 'unintended mistakes' from your side are immediately recognized by R and rewarded with an 'error'. There are also so-called 'warnings'. The difference is that 'error' results in code that does not work (e.g. you will not get the specific calculation/processing executed) and a 'warning' means that you might have done something that was not intended (but you will get output). Next week in [Chapter 5](/week2/debugging_and_testing.qmd) we cover this more thoroughly. The most important step in data-analysis is checking after an operation. Make sure that what you did was what you intended and that what comes back is correct. How often have we, your teachers, not wept when we 'quickly' skipped this and found out that the output we got was incorrect. There are some general measures you can take to reduce risks: 1. Always start with a clean environment. At exit, do not save the environment, but start afresh. This ensures that you *know* which packages you loaded, which files you need to collect, which variables you need to define. 2. When writing a pipe (`%>%`) or any complex set of operations, check each step individually. Was the outcome of your function as intended? 3. Do a pilot. If you run a calculation that takes a long time, run it on a small subset of your data and *thoroughly* check the output. 4. Know what you do, especially with statistics. Always *visualize* the outcome of the analysis if you can. Otherwise it might be that you take a trip into statistical winter-wonder land. 5. Be wary of the class factors. Factors are indexed sets of data. If this indexed set of data is a number, the number on which the calculation is performed is the *index number*, not the actual number you see. In this case, you need to transform the factor to an actual number. You can check if this might be a problem using the function `class()`. ::: ### Inspecting data #### Using functions Inspect the data using the same functions as in [Chapter 2](/week1/day2.qmd). Also, these functions you can find in [Appdix Base R cheatsheet](/appendices/base-r-cheatsheet.qmd). ```{r w1d2_check_the_data, include = TRUE,eval=FALSE} head(pheno_data) tail(pheno_data) summary(pheno_data) length(pheno_data) ncol(pheno_data) nrow(pheno_data) dim(pheno_data) ``` ::: {#exr-headtail} #### Inspecting data Answer these three questions: 1. What does the `head()` function tell you about an object? 2. Similarly, what does the `tail()` function tell you about an object? 3. If you would not know this, what is the command you could execute within the R console to find out? ::: ### Cleaning and preparing the data Not needed here today. We keep this heading in for completeness though. ### Analyzing Now the fun starts. We pick up on our old friend `cor()`. To determine which kind of correlation we need to perform, we need to figure out the statistical distribution of the data. #### Normal distribution Same as before, let's figure out whether the data is normally distributed. If you don't know the two tests anymore, check back at [Chapter 2](/week1/day2.qmd). ```{r w1d2_qqplot_pallida_ASSIGNMENT, include = TRUE,eval=FALSE} #| code-fold: false ###Use the graphical method ###Use the statistical method ``` ::: {#exr-normal_dist} #### Does the data follow a normal distribution? You now conducted two types of analyses that tell you whether the data is normally distributed. What is your assessment, is it normally distributed? ::: #### Correlation analysis Yesterday we conducted an analysis of correlation using the `cor()` function. There is actually another way to conduct this type of analysis, namely using a linear model with `lm()`. The advantage of this method is that we will get a slope and an intercept. We can use the function `summary()` to get the fit of the model. ```{r w1d2_gather_pallida, include = TRUE,eval=FALSE, class.source = 'fold-show'} #| code-fold: false data.test <- pheno_data ### here the linear model is done tested <- lm(value~dpi,data=data.test) ### use summary to get the fit summary(tested) ``` The `summary()` function is a very versatile function, that can generate outputs from many objects. In this case, it presents the analysis from the linear model, which in essence fits a correlation line. The `Estimate` for the intercept of that line (-2.97528) and the slope (0.50663) are reported. Also the fit of the model (the Adjusted R-squared: 0.8274) and of the parameters. For instance, there is a significant correlation between day post infection (dpi) and the leaf area (value); p < 2e-16. ### Presenting #### Scatter plot Now, let's make a proper plot of correlation. Of course, you can find instructions in both [Chapter 1](/week1/day1.qmd) and [Chapter 2](/week1/day2.qmd). Also, check [Appendix GGplot cheatsheet](/appendices/ggplot-cheatsheet.qmd). ```{r w1d2_wilcoxplot_pallida_ASSIGNMENT, include = TRUE,eval=FALSE} #| code-fold: false data.plot <- pheno_data ###I'm getting the statistics here statplot <- broom::tidy(summary(lm(value~dpi,data=data.plot))) p1 <- ggplot(data.plot, aes(x = dpi, y = value)) + geom_point() + geom_smooth(method = "lm") + theme_bw() + annotate("text", x=10, y=max(data.plot$value), label=paste("p =", signif(statplot$p.value[2], 2))) p1 ``` ::: {#exr-interpretation} #### Is the plant-growth correlated with time? What is the *biological interpretation* of the correlation you calculated and observed? 1. What does the intercept represent? 2. What does the slope represent? 3. Is a linear model a good fit for plant growth *in general*? ::: ## Assignment for code-review week 1 ### The assignment Via the fileshare on [Brightspace](https://brightspace.wur.nl/), you can find your dataset for the code review. Everyone of you has their own folder, with their own dataset. This means that the conclusions you draw will be based on your individual dataset. We expect that you hand in a .html file generated using a Quarto Markdown (QMD) file you wrote yourself in R. Within the .html file you provide the code to: 1. Setup of the R environment 2. Load the data 3. Inspect the data 4. 'Clean' the data 5. Analyse the data 6. Present the data Also, you should provide a conclusion about the *biological interpretation* of the data. The exercises below take you through the assignment step-by step. Make sure that for the R steps you include code chunks. If you print figures or execute functions within these blocks, the output will be included in the .html. ::: {.callout-tip appearance="simple"} Do not wait untill you are finished to make the first render. At the very start, when your file is still pristine (e.g without code chunks but after adapting a .qmd file from [Chapter 1](/week1/day1.qmd) or [Chapter 2](/week1/day2.qmd)), do a check on whether it renders correctly. You render the document by hitting the 'Render' button on top of the top-left pane in RStudio. Note that in day 1 and day 2 the parameter `eval` is set to `false` in the yaml header (this is the text at the start of the document between the \-\-\- and \-\-\-). This means that in the .html generated, none of the code blocks are executed. This is a good option to prevent R coding errors from rendering the document. However, if you want to include a figure, you need to set this option to true within that code-chunk `#| eval: true`. If you need more tips in general on QMD, check [Appendix QMD notebook cheatsheet](/appendices/qmd-notebooks-cheatsheet.qmd). ::: ::: {#exr-Assignment_setup} #### Setup of the R environment Set the working directory and install and activate the required packages to run the code within the file. ::: While going through the exercises, you might encounter steps requiring you to install additional packages. Simply add them to this section of the code one-by-one. You help yourself by being explicit about which package you use (e.g. `dplyr::` or `tidyr::`). ::: {#exr-Assignment_load} #### Load the data Load the two data files into the R environment using the appropriate read function. Make sure that your name is included here, so the reviewers can check if your code works. That way your reviewer can find your datafiles. ::: ::: {.callout-tip appearance="simple"} The data consists of the growth rates of *A. thaliana* Col-0 when inoculated with either of two plant parasitic nematodes[@Willig2023]. The two nematode species tested are the beet-cyst nematode *Heterodera schachtii* and the root-knot nematode *Meloidogyne incognita*. Within your personal folder you find the data of the experiment conducted at a particular infection density (P~i~) in nematodes g^-1. The measured trait is the day-by-day growth rate, measured between two consecutive days post infection (dpi). For instance between day 2 and day 3. The dpi within the data file would be 2.5 in that case. This data is given per plant. ::: ::: {#exr-Assignment_inspect} #### Inspect Inspect the data using functions (no plot required). Based on the output indicate which bind function you need to use to combine the two objects into one. ::: ::: {.callout-note appearance="simple"} We covered the data inspection functions in [Chapter 1](/week1/day1.qmd) and [Chapter 2](/week1/day2.qmd)) under the heading Inspecting data. ::: ::: {#exr-Assignment_clean} #### Clean; combine Use a 'bind' function to combine the two objects into one. ::: ::: {.callout-note appearance="simple"} We covered the bind functions in [Chapter 2](/week1/day2.qmd)) when combining the four objects. ::: ::: {#exr-Assignment_analyse} #### Analyse Compare growth rates of the *A. thaliana* plant when infected with either *H. schachtii* or *M. incognita*, analyse whether these differ using a test that does not assume a normal distribution. Report at least the p-value. ::: ::: {.callout-tip appearance="simple"} We do not ask you to check for normality for this assignment, as that dataset overall is not normally distributed, we give you that your subset is also not. Note that in the paper We deal with this by using a non-parametric test as well[@Willig2023]. ![Quantification of the tolerance limit of Col-0 to Meloidogyne incognita based on growth rate. Nine-day-old Arabidopsis seedlings were inoculated with 18 different densities (Pi) of M. incognita juveniles (0–100 juveniles g^–1^ of dry sand. (A) Green canopy area of plants treated with P~i~=0 (blue line) or P~i~=5 (aquamarine line) from 0 to 21 dpi. (B) Rate of growth of plants treated with P~i~=0 (blue line) or P~i~=5 (aquamarine line) from 0 to 21 dpi was calculated using Equation 2 in the manuscript. (C) Rate of growth of plants inoculated with increasing P~i~ values growing from 5 to 16 dpi. (D) Rate of growth of plants inoculated with increasing P~i~ values growing from 17 to 21 dpi. (E) Cumulative growth rate of plants inoculated with increasing Pi values. (C–E) Dots represent individual plants. Data were analysed with a Wilcoxon rank sum test. ns=not significant, \* P<0.05, \*\* P<0.01, \*\*\* P<0.001 (n=10–12).](Week1_day3_WilligAssignment.jpeg) ::: ::: {#exr-Assignment_present} #### Present Make a `ggplot2::` boxplot showing the two nematode species on the x-axis and the growth rate on the y-axis. ::: ::: {.callout-note appearance="simple"} We covered the boxplot function in [Chapter 1](/week1/day1.qmd)) in the presentation section. Also, you can find how to in [Appendix GGplot cheatsheet](/appendices/ggplot-cheatsheet.qmd). ::: Now you have analysed the difference in growth rates between the two species at a particular density at a particular day of the experiment. Now it is time to draw conclusions. ::: {#exr-Assignment_interpret} #### Interpret What is the *biological interpretation* of the boxplot and the wilcoxon-test p-value? ::: Once you have worked through all these excercises related to the assignment, you can render the .html file. Check the file (is everything included?). If you run into errors, the 'Background Jobs' window will indicate where the rendering has halted. ### How to do code-review During code-review, we try to establish if the code that is written: 1. is well-organized 2. is well-documented 3. properly used 4. works Next to that, we also check whether the conclusions drawn from the analysis and figure are valid. #### How to give peer-feedback You need to review 2 coding assignments of other students within the course. Your teachers will check your feedback. In the first week, you will get an indication from us whether you give sufficient and clear feedback. You need to get a ‘pass’ for the assignments in week 2-4. ##### What is feedback? Feedback literally means 'giving back'. It is not an euphemism for criticism. Rather, feedback can be described as: information regarding the way my message or behaviour (or coding and interpreting the outcome of data-analysis in this case) has been received and interpreted. Feedback stimulates you to think about and implement possibilities to improve. Feedback is critical for your own improvement. Peer feedback relies on giving and receiving feedback. Receiving feedback allows you to develop your competences and skills. Well-given feedback can help you asses your strengths and weak points, and allows you to address the weak points from the basis of your strengths. However, giving feedback also ask you to be critical and recognize certain aspects of data analysis, which you can implement again in your own process of analyzing data in R. ##### How to give feedback When giving feedback it is important to: 1. Balance your feedback. Mention both positive and constructive aspects (Tops and Tips). It is equally important to know your strengths, as it is to know your weaknesses. Moreover, it is motivating to receive positive feedback. 2. Be specific. Feedback is not helpful to the receiver when the feedback is not specifically formulated. Comments such as: "I think it is bad code and it needs revision" does not tell the receiver why it is a bad code. Instead, "I think you arrive at your result in more steps then necessary, I suggest you use function x instead of function y and function z" is much more helpful. 3. Connect concrete suggestions for improvement when providing constructive feedback. This can be quite challenging, and requires you to place yourself in the shoes of the someone else. This is very helpful for the receiver, as they receive help on how to improve. 4. Provide examples. Link your feedback to specific pieces of code or sentences in the document. In Feedbackfruits you can highlight text and make comments. 5. Formulate your feedback to what you observe (objectively) in an open way (without judgement or premature conclusions). For instance: "I noticed that you used the same object name for two different datasets, did you do this for a certain reason?" You can also describe the effect it has on you: "For me, it would be more logical if you use readxl::read_xlsx() to load an .xlsx file instead of read.delim()." 6. Do not generalize. Avoid words like always/never in your feedback e.g. "You always forget to indicate which package a function comes from..." #### Well organized code There are many ways in which you can write well-organized code. What all have in common is consistency. Here, I teach you *a way* of doing this. These are the rules that I *generally* follow. I would like to stress that these are guidelines. These guidelines help others to read and understand your code. However, they also help you when you need to re-use or re-run your code. Check: - Are clear section-level headings used? - are hierarchies indicated with a tab? - from non-base R functions, is the package indicated (e.g. `dplyr::`)? - do objects have a clear, understandable, name? #### Well documented code When writing code, leave comments. Especially when you're doing something new / special this is very helpful to you in the future. But also to others that will check or even use your code. It can be very handy to indicate why you use a particular function for instance. Check: - Are comments included in the code? - Above code chunks, is there an explanation of what is done and why? #### Properly used code This checks whether the functions are used as intended. For instance, you can run a correlation analysis on numbers that are of `class` `factor` (check this by using `class()` on a vector/data frame column). However, correlations should be done using actualy numbers `is.numeric()` should return `TRUE`. Another example is whether the correct statistical test was used given the data? Check: - Are the correct functions / inputs used? #### Working code Check: - can you run the code and get the same output? #### The Biological interpretation Do you agree with the biological interpretation given? Note that all of you have different days and P~i~'s to compare. Thus, the conclusions will probably differ amongst yourselves! ### Some reading Yesterday you might have been convinced that a hypothesis is a liability[@Yanai2020], today you may convince yourself otherwise. There is a nice critique written of the provocative paper you read yesterday. The authors in the paper 'The data-hypothesis relationship' will convince you that a hypothesis is not such a bad idea after all[@Felin2021].