Exploratory Data Analysis

Authors

Rens Holmer

Mark Sterken

Harm Nijveen

Peter Bourke

Published

March 5, 2026

Introduction

This online book is the main study material for the course BIF20806: Exploratory Data Analysis in R. This course is taught at Wageningen University and targets second-year students in the BSc program Plant Sciences (BPW). The aim of this book and the course is to provide sufficient information and background so that after finishing, students can work independently on data analysis projects with R, using datasets that are typically encountered in the plant sciences.

Data analysis in the plant sciences

There is no plant science without data, and working with data has always been an essential skill for any plant scientist. Historically, this would not have involved talking about programming languages. However, increasing awareness of reproducibility and the ever-growing volume of data in the plant sciences mean that plant scientists today routinely turn to computers when analysing their data.

One of the most famous experiments in plant science is Gregor Mendel’s work on pea plants in the mid-19th century¹. Mendel carefully crossed plants with different observable traits (such as seed colour or shape) and systematically counted how often each trait appeared in the offspring. These experiments later proved foundational for the development of Mendel’s theories on genetic inheritance.

What makes Mendel’s work particularly relevant here is not the biological theory that later emerged from it, but the way he worked with data. Mendel designed controlled experiments, collected quantitative observations, organised his results in tables, and searched for patterns in those data before proposing any formal explanation. In modern terms, much of this would now be described as exploratory data analysis.

If Mendel were working today, he would likely store his data in a spreadsheet or data frame, visualise trait frequencies with plots, and use simple statistical summaries to explore variation and uncertainty. The tools would be different, but the underlying questions would be the same.
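As a minimal sketch of what that could look like in R, the block below stores two phenotype counts in a data frame, computes their proportions, and draws a bar plot. The data frame name `peas` and its column names are hypothetical, chosen only for illustration; the counts are the F2 counts used in the goodness-of-fit example later in this chapter.

```r
# Hypothetical data frame (names chosen for illustration only).
# Counts are the F2 counts used in the goodness-of-fit example in this chapter.
peas <- data.frame(
  phenotype = c("dominant", "recessive"),
  count = c(705, 224)
)

# Simple summary: proportion of plants showing each phenotype
peas$proportion <- peas$count / sum(peas$count)
print(peas)

# Simple visualisation: bar plot of the counts per phenotype
barplot(peas$count, names.arg = peas$phenotype, ylab = "Number of plants")
```

Changing the numbers in `count` and re-running the block is a quick way to see how the proportions and the plot respond.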

What if Mendel could use R?

In his seminal experiments, Mendel counted inheritance patterns and compared them to a theoretical model. The statistical tools we would use today had not yet been developed, so he mostly judged the difference by eye. These days we would calculate a p-value under a statistical model to help us judge how well the observed counts agree with the theoretical model. If Mendel had used R and had access to modern statistical tools, his approach might have looked something like the example below.

# Observed counts in F2: dominant vs recessive
obs <- c(dominant = 705, recessive = 224)

# Expected Mendelian proportions (3:1)
p <- c(0.75, 0.25)

# Goodness-of-fit test (Mendel did not have this!)
chisq.test(obs, p = p)


    Chi-squared test for given probabilities

data:  obs
X-squared = 0.39074, df = 1, p-value = 0.5319

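From this statistical test we can conclude that the observed counts do not significantly deviate from a 3:1 ratio. The data are therefore consistent with Mendel’s model of inheritance, although a non-significant result on its own does not prove that the model is correct.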
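To see where the test statistic comes from, the calculation can also be sketched by hand. The block below is an illustration rather than part of the original example: it computes the expected counts under the 3:1 model, sums the squared differences between observed and expected counts scaled by the expected counts, and looks up the p-value in a chi-squared distribution with one degree of freedom. The variable names `expected`, `x2`, and `p_value` are chosen here for illustration.

```r
# Observed counts and expected proportions, as in the example above
obs <- c(dominant = 705, recessive = 224)
p <- c(0.75, 0.25)

# Expected counts if the 3:1 model holds
expected <- sum(obs) * p

# Chi-squared statistic: sum of (observed - expected)^2 / expected
x2 <- sum((obs - expected)^2 / expected)

# Upper-tail probability of a chi-squared distribution;
# degrees of freedom = number of categories - 1
p_value <- pchisq(x2, df = length(obs) - 1, lower.tail = FALSE)

print(x2)
print(p_value)
```

The two printed values should agree with the `X-squared` and `p-value` reported by `chisq.test()` above.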

About the course

In this course, we focus on modern tools for asking and answering questions from data. We discuss strategies to acquire, process, explore, visualise, and analyse data for a variety of datasets from the plant sciences. In addition, we work on making data analysis projects reproducible: by using the R programming language, we create reproducible analysis workflows.

Why R?

R is widely used in the plant sciences and related fields because it combines data handling, statistical analysis, and visualisation in a single environment. Unlike point-and-click software, analyses in R are expressed explicitly as code. This makes every step of an analysis transparent and reproducible, and allows analyses to scale from small, simple datasets to large and complex ones. Learning R therefore provides not only practical skills for this course, but also a foundation that students can build on in later courses, internships, and research projects.

This course does not assume prior experience with programming. Instead, we approach R as a tool for thinking with data. Concepts are introduced gradually, starting from basic data manipulation and simple visualisation, and building towards more complex analyses. Throughout the course, the focus remains on understanding the data and the questions being asked, rather than on writing code for its own sake.

About the book

This book is designed to support the learning objectives of the course by providing structured, practice-oriented learning material. Each chapter focuses on a specific set of data analysis skills and introduces these through examples based on datasets commonly encountered in the plant sciences. The book does not aim to be a comprehensive reference for R, but instead concentrates on methods and techniques that are essential for exploratory data analysis.

Throughout the book, code examples are used to demonstrate standard data analysis workflows.

Learning how to program in R

Programming and data analysis are not spectator activities. To develop practical skills, it is essential to actively work through the examples in this book and to apply the demonstrated techniques to new datasets. By writing, running, and modifying code yourself, you will build the proficiency needed to use R effectively and to carry out reproducible data analyses.

Just like the course, the book does not assume any prior knowledge of programming or of working with R. To gradually build up to a level of independent data analysis, the book is structured into four main sections, aligning with the four weeks of the course (Figure 1).

Reading guide

Throughout the book, you will find a mix of examples, stories, theory, and assignments. All the study material for the course is available in the book; the lectures are intended to solidify concepts and provide additional motivation and examples. This means that by completing the material of one section of the book and carefully reviewing the lecture slides, you should be able to successfully complete the graded assignment of the accompanying week.

To provide additional structure to the material in the book, you will find various ‘boxes’: these are intended to either highlight important information, or provide additional background material.