1. Genes, Proteins, Databases, Genome annotation - Introduction to Bioinformatics

📘 This content is part of version: v1.0.0 (Major release)

In this chapter you will learn the basics of molecular biology that are required for understanding bioinformatics approaches. In addition, you will learn common approaches for storing and describing biomolecular data.

Learning outcomes

After studying this chapter you should be able to:

Describe the chemical composition and structural differences between DNA and RNA sequences, and how these sequences store, replicate, and pass genetic information.
Describe how genetic information is transcribed and translated into proteins - the central dogma.
Explain how amino acid physicochemical properties drive protein folding and function, and the effect of substitutions.
Describe the concepts of genome annotation and gene prediction and how they combine computational methods and biological evidence to decode genomic content.
List essential databases and file formats that organize and store biological sequence and annotation data.
Describe how standardized ontologies help structure biological metadata and ensure data interoperability and reusability.

Biological background¶

A large part of bioinformatics deals with the analysis of biological sequences. These sequences originate from organic macromolecules that play important roles in cells. In the first section of this chapter, we describe these macromolecules, their sequences, and the biological processes involved in generating and maintaining their active structures.

As such, this section provides important background material for the entire course. Depending on your background, parts of this section might seem redundant, in which case this section can function as a refresher. Later chapters assume you are familiar with this section.

Nucleic acids¶

Deoxyribonucleic acid (DNA) carries the genetic information of organisms. Ribonucleic acid (RNA) is involved in protein expression and is also the genetic material of some viruses. Thus, these molecules are highly important as the basis of life on Earth. The genome denotes a cell’s entire genetic content, and genomics is the study of genomes.

DNA and RNA are composed of monomers called nucleotides, each of which consists of three components (Figure 1):

A pentose sugar, where the carbon atoms are numbered 1’ to 5’ (read 1’ as “one prime”). The type of pentose distinguishes RNA and DNA: the sugar is deoxyribose in DNA and ribose in RNA. They are similar in structure, but deoxyribose has an H instead of an OH at the 2’ position.
A phosphate group that is attached to the 5’ position of the sugar.
A base that is attached to the 1’ position of the sugar.

Nucleotide overview — Figure 1:The components of a nucleotide. Credits: CC BY 4.0 Clark *et al.* (2018).

The bases can be divided into two categories: purines (with a double ring structure) and pyrimidines (with a single ring structure) (Figure 1). DNA contains A, T, C, and G; whereas RNA contains A, U, C, and G.

The DNA double helix¶

The DNA molecule is a polymer of deoxyribonucleotides and forms a right-handed double helix. The sugar and phosphate are on the outside, forming the helix’s backbone, and the bases are stacked in the interior and bind each other by hydrogen bonds. A pairs with T via two hydrogen bonds and C pairs with G via three hydrogen bonds; these are called complementary bases. This pairing is also called Watson-Crick base pairing, named after James Watson and Francis Crick, who proposed the double-helix structure of DNA.

Figure 2:The DNA structure. Credits: CC BY 3.0 OpenStax College (2013).

The two strands of the helix run in opposite directions, also called anti-parallel, i.e., one goes from 5’ to 3’ and the other from 3’ to 5’ (Figure 2). The nucleotide sequence is typically written in 5’ to 3’ direction. Due to the complementarity, the base sequence of a strand can be deduced from the base sequence of the other strand. The deduced sequence, written in 5’ to 3’ direction, is called the reverse complement. For example, the reverse complement of AAGT is ACTT, where both strands are given in 5’ to 3’ direction.
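The reverse complement can be computed by complementing each base and reading the result backwards. The following is a minimal sketch; the function name `reverse_complement` is chosen here for illustration and it assumes the input contains only A, C, G, and T.

```python
# Watson-Crick pairing for DNA: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    # Reverse first, then complement each base; the result is again 5' to 3'.
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


print(reverse_complement("AAGT"))  # ACTT
```

Applying the function twice returns the original sequence, which is a quick sanity check for any implementation.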

DNA replication¶

As the two DNA strands are only connected via hydrogen bonds, they can be separated relatively easily, for example during DNA replication (Figure 3). The separated strands each serve as a template on which a new complementary strand is synthesized by the enzyme DNA polymerase in 5’ to 3’ direction. This mode of replication is called semiconservative.

The error rate of DNA replication is remarkably low, about one erroneous base in 10⁹ bases. This property preserves the genetic information during cell division, and also over generations. The rare errors that do occur lead to mutations over evolutionary time (Figure 4), as we will see later (Substitutions).

DNA_mutation — Figure 4:A DNA mutation that occurs during replication. Credits: BY-NC-SA 4.0 UC Museum of Paleontology (2020).

RNA, transcription, and splicing¶

During transcription, RNA polymerase reads the template strand (also called noncoding strand) in the 3’ to 5’ direction (Figure 5). This produces an RNA molecule from 5’ to 3’, which is a copy of the coding strand, except that uracil takes the place of thymine. In contrast to DNA, RNA does not form a stable double helix. RNA is mainly single-stranded, but most RNAs show intramolecular base pairing between complementary bases.
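As a small illustration of the relation between the two DNA strands and the RNA product, the sketch below derives the same RNA either from the coding strand or from the template strand. The function names and the example sequence are made up for this illustration.

```python
# Pairing used when RNA is synthesized on a DNA template (A in DNA pairs with U in RNA).
DNA_TO_RNA_PAIRING = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe_coding_strand(coding_5to3):
    """RNA equals the coding strand with T replaced by U (both 5' to 3')."""
    return coding_5to3.upper().replace("T", "U")


def transcribe_template_strand(template_3to5):
    """Read the template 3' to 5' and pair each base to build RNA 5' to 3'."""
    return "".join(DNA_TO_RNA_PAIRING[base] for base in template_3to5.upper())


coding = "ATGGCC"    # coding strand, written 5' to 3'
template = "TACCGG"  # template strand, written 3' to 5'
print(transcribe_coding_strand(coding))      # AUGGCC
print(transcribe_template_strand(template))  # AUGGCC
```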

There are four major types of RNA:

Messenger RNA (mRNA): RNA molecules that will later be translated into proteins and therefore serve as a ‘messenger’ in protein production.
Ribosomal RNA (rRNA): the primary component of ribosomes (the ‘protein factories’ of a cell).
Transfer RNA (tRNA): functions as an ‘adapter molecule’ that serves as the physical link between mRNA and the amino acid sequence of a protein during translation.
MicroRNA (miRNA): non-coding RNA molecules of 21-23 nucleotides involved in RNA silencing and post-transcriptional regulation of gene expression.

Transcription — Figure 5:RNA is produced by transcribing DNA: as such, it is a direct copy of the information contained in the DNA. Where DNA contains thymine (T, indicated in blue), RNA contains uracil (U, indicated in purple). Credits: CC BY-NC 4.0 Ridder *et al.* (2024).

In eukaryotes, precursor mRNA molecules undergo various postprocessing steps to produce mature mRNA molecules. To stabilize the mRNA, the 5’ end of the molecule is capped with a modified guanine nucleotide (more specifically, a 7-methylguanylate) and the 3’ end is extended with a long stretch of adenine nucleotides (known as poly-adenylation). In addition, many eukaryotic mRNA molecules undergo splicing. During RNA splicing, the spliceosome protein complex removes introns: specific non-coding parts of a precursor mRNA molecule that are not used during translation (Figure 6), to create mature mRNA. Most introns are characterized by a GU dinucleotide motif at the 5’ end and an AG dinucleotide motif at the 3’ end.

Splicing — Figure 6:During splicing, introns are removed from precursor mRNA molecules to create mature mRNA. Most introns contain recognition sequences for the spliceosome and produce specific secondary structures that improve splicing efficiency: **(1)** 3’ splice site, **(2)** poly pyrimidine tract, **(3)** branch site, **(4)** 5’ splice site. Credits: CC0 1.0 miguelferig (2011).

Translation¶

During protein translation, ribosomes synthesize polypeptides from messenger RNA (mRNA) (Figure 7). During this process, tRNAs decode the information on the mRNA into amino acids, where a codon consisting of three nucleotides encodes the information for one amino acid.

The genetic code¶

The genetic code shows the correspondence between codons and amino acids (Figure 8). Since 64 possible codons code for 20 different amino acids, the genetic code is degenerate, i.e., most amino acids are specified by more than one codon. Thus, the codons encoding one particular amino acid may differ in one or two of their positions. You can notice in Figure 8 that the third codon position often differs between codons for the same amino acid. As a result of the code degeneracy, the protein sequence can be deduced from the DNA or RNA sequence but not vice versa.

There are three codons that do not encode an amino acid, but instead signal the end of the protein sequence; these are called stop codons. Furthermore, translation generally starts with the start codon AUG, which encodes methionine. More information on how protein information is encoded in genomes can be found in the section on genome annotation.
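The genetic code can be represented as a lookup table from codons to one-letter amino acid codes. The sketch below builds the standard code in the conventional U, C, A, G ordering (stop codons are written as `*`) and translates an mRNA from the first AUG to the first in-frame stop codon. The function name `translate` and the example sequence are illustrative choices; real translation initiation is more involved than "first AUG".

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate from the first AUG up to (not including) the first in-frame stop codon."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("GGAUGGCCUGGUAACC"))  # MAW

# Degeneracy: number of codons per amino acid (e.g. six for leucine, one for methionine).
codons_per_amino_acid = Counter(CODON_TABLE.values())
print(codons_per_amino_acid["L"], codons_per_amino_acid["M"])  # 6 1
```

Because several codons map to the same amino acid, the dictionary cannot be inverted unambiguously, which mirrors the statement that a nucleotide sequence cannot be deduced from a protein sequence.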

The central dogma of molecular biology¶

According to the central dogma of molecular biology, the flow of genetic information is essentially in one direction: from DNA via RNA to proteins (Figure 9). Nevertheless, there are also genes that do not code for proteins, but where functional RNA is the end product. Furthermore, mobile genetic elements and viruses can encode reverse transcriptases (which synthesize DNA from an RNA template) or RNA-dependent RNA polymerases (which replicate RNA).

Proteins¶

Proteins are large, complex macromolecules that play many important roles in the body. They are critical to most of the work done by cells and are required for the structure, function and regulation of the body’s tissues and organs. The basic building blocks of proteins are amino acids.

Amino acids¶

An amino acid contains a central carbon atom (called α-carbon, or C_α) (Figure 10). The α-carbon is bound to an amino group (NH₂), a carboxyl group (COOH), and a hydrogen atom. In addition, each amino acid has a specific side chain (R group).

Structure of an amino acid — Figure 10:The structure of an amino acid. Four elements are connected to the α-carbon: an amino group, a hydrogen atom, a carboxyl group, and a side chain (R group). Credits: CC BY 4.0 Clark *et al.* (2018).

Table 1:Amino acids and their abbreviations and basic properties

Amino acid	Three-letter code	One-letter code	Property
Arginine	Arg	R	Positively charged
Histidine	His	H	Positively charged
Lysine	Lys	K	Positively charged
Aspartic acid	Asp	D	Negatively charged
Glutamic acid	Glu	E	Negatively charged
Serine	Ser	S	Polar uncharged
Threonine	Thr	T	Polar uncharged
Asparagine	Asn	N	Polar uncharged
Glutamine	Gln	Q	Polar uncharged
Alanine	Ala	A	Hydrophobic
Valine	Val	V	Hydrophobic
Isoleucine	Ile	I	Hydrophobic
Leucine	Leu	L	Hydrophobic
Methionine	Met	M	Hydrophobic
Phenylalanine	Phe	F	Hydrophobic and aromatic
Tyrosine	Tyr	Y	Hydrophobic and aromatic
Tryptophan	Trp	W	Hydrophobic and aromatic
Glycine	Gly	G	Special (only H as side chain)
Proline	Pro	P	Special (side chain bound to backbone nitrogen)
Cysteine	Cys	C	Special (forms disulfide bonds)

Some amino acids have non-polar side chains, and these are generally hydrophobic, i.e., water molecules cannot form hydrogen bonds with these molecules. Thus, they can often be found in the interior of proteins together with other hydrophobic amino acids. Aromatic amino acids contain aromatic rings, and often stabilize folded protein structures.

In contrast, the charged and the polar amino acids are hydrophilic, i.e., water molecules can form hydrogen bonds with these molecules. They can often be found on the surface of proteins, or in the interior when they can interact with an oppositely charged amino acid. Positively charged amino acids are also called basic amino acids, and negatively charged amino acids are also called acidic amino acids.

Although amino acids can be classified into these groups based on their properties, some amino acids stand out. The smallest amino acid is glycine, which provides great flexibility due to its small size. In contrast, the side chain of proline is bonded to the backbone nitrogen atom, which makes it very rigid. Finally, a cysteine can form a disulfide bridge with another cysteine.

Protein structure¶

A protein is made up of one or more long, folded chains of amino acids (each called a polypeptide). The 3D structure of a protein is also called its conformation. The protein conformation is described on four levels - primary to quaternary structure (Figure 11).

The structure of a protein is critical for its function. For example, in an enzyme, the active site must be in the correct structure to be able to bind the substrate. Other proteins might bind proteins (and influence their activity) or bind DNA (and regulate gene expression). Additionally, some proteins are secreted from the cell or might function within the cell membrane. Finally, proteins are often modified after protein synthesis (see Translation), called post-translational modification. These modifications can be important for protein function.

Primary structure¶

In a protein, amino acids are connected by covalent bonds, called peptide bonds. A peptide bond connects one amino acid’s carboxyl group and the next amino acid’s amino group (Figure 12). The sequence of amino acids linked by peptide bonds is called the primary structure. The protein sequence is determined by the gene sequence encoding the protein. The continuous chain of atoms along the protein is also called the backbone; it consists of the three backbone atoms of each amino acid (nitrogen, C_α, and the carbonyl carbon).

Each protein has a free amino group on one end, called the N terminus. The other end has a free carboxyl group, called the C terminus.

Secondary structure¶

Secondary structures are local conformations in the protein that are stabilized by hydrogen bonds between backbone atoms. We distinguish the regular helices (i.e., alpha helix - α-helix) and sheet structures (i.e., beta sheet - β-sheet) (Figure 13) and irregular turns.

α-helices are stabilized by hydrogen bonds between the oxygen atom of the backbone C=O group of one amino acid and the hydrogen atom of the backbone N-H group of the amino acid four residues farther along the chain. Every helical turn has 3.6 amino acid residues, and the side chains stick out of the helix.

β-pleated sheets (short: β-sheets) consist of β-strands, where the R groups extend above and below the strands. The strands have a direction determined by the N- and C-terminus of the protein and are usually depicted as an arrow pointing towards the C-terminus. Depending on the direction, strands can align parallel or antiparallel to each other.

Turns are short secondary structure elements that are stabilized by hydrogen bonds between amino acids that are 1 to 5 peptide bonds away. The most common form is the β-turn, which connects antiparallel β-strands.

The peptide bond is very rigid and planar, i.e., it cannot rotate to form the elements of protein structure. However, the N-C_α and the C_α-C bonds can freely rotate, being only limited by the size and properties of the R-groups. The 3D shape of the polypeptide backbone is thus determined by two torsion angles: phi (φ) between N and C_α and psi (ψ) between C_α and C (Figure 14A). Although φ and ψ can rotate in principle, steric hindrance prevents certain combinations of angles, i.e., the bulkiness of the R-groups restricts the possible conformations. Thus, certain combinations of φ and ψ are preferred. We can plot the combinations of φ and ψ in a protein, in a so-called Ramachandran plot (Figure 14B).

The regular secondary structure elements (α-helix and β-sheet) contain consecutive amino acids with similar (φ,ψ) values. These regions are typically highly populated in a Ramachandran plot. Thus, the Ramachandran plot can be used to assess how plausible a predicted protein structure is.
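To make the torsion angles concrete: a torsion (dihedral) angle is defined by four consecutive atoms. For φ these are the carbonyl carbon of the previous residue, N, C_α, and C; for ψ they are N, C_α, C, and the N of the next residue. The sketch below computes such an angle from four 3D coordinates using a common vector formulation. The helper names and the toy coordinates are assumptions for illustration only; they are not taken from a real structure.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees around the p1-p2 bond, in the range -180 to 180."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


# Toy coordinates: the first and last atom lie on the same side of the central bond.
print(abs(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))))
```

Computing φ and ψ for every residue of a structure and plotting the pairs gives the Ramachandran plot described above.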

Phi, psi, and Ramachandran plot — Figure 14:A) The φ and ψ torsion angles of a polypeptide chain. Credits: CC BY-NC 4.0 Ridder *et al.* (2024). B) A typical Ramachandran plot. Red regions mark conformations without any steric hindrance, yellow areas represent conformations that have some steric hindrance, light yellow areas represent conformations that are generally sterically unfavorable, and white areas represent conformations that are not allowed. Credits: Ramachandran plot modified from PROCHECK Laskowski *et al.* (1993).

Tertiary structure¶

The tertiary structure of a protein describes the complete folding of an entire polypeptide chain. In contrast to the secondary structure, the tertiary structure of a protein involves interactions between amino acid side chains, which can occur at short range and at long range (Figure 15). Thus, the chemical properties of the amino acids are very important for the tertiary structure. Different types of interactions stabilize the tertiary structure:

Hydrogen bonds involving polar amino acids.
Ionic bonds between positively and negatively charged amino acids.
Hydrophobic R groups that tend to lie in the protein’s interior, stabilized by hydrophobic interactions.
Disulfide bonds (i.e., covalent bonds between cysteines).

Tertiary structure interactions — Figure 15:Chemical interactions that stabilize the tertiary structure of proteins. Credits: CC BY 4.0 Clark *et al.* (2018).

When studying many different protein structures, various recurring substructures can be observed. These so-called domains are distinct functional and/or structural units in a protein and are typically 50 to 350 amino acids long. Usually, a domain is responsible for a particular function or interaction, contributing to the overall role of a protein. A domain can exist in different contexts with other domains (Figure 16). In a multidomain protein, each domain folds independently of the others.

Domain examples — Figure 16:A) Example of an Src homology 3 (SH3) domain that is involved in protein-protein interaction. SH3 domains occur in a diverse range of proteins with different functions. B) The cytoplasmic protein Nck contains multiple SH3 domains. C) Domain composition of phospholipase D1, which has multiple functional domains that contribute to its overall function. Credits: CC BY 4.0 Sangrador (2023).

Quaternary structure¶

Finally, individual folded polypeptides can interact to form protein complexes, also called quaternary structures. The quaternary structure is stabilized by the same types of interactions as the tertiary structure. The difference is that the amino acids involved belong to different polypeptides.

Many functional proteins are composed of multiple subunits; they are also called oligomers (Figure 17). The subunits can originate from the same protein sequence (called a homomer) or from different sequences (called a heteromer). Proteins consisting of two subunits are also called dimers.

Figure 17:Examples of oligomers. A) Hemoglobin, a heteromer of four subunits (PDB structure 1HV4 colored by chain). Credits: Berman *et al.* (2000); Liang *et al.* (2001); Rose *et al.* (2018). B) UDP-galactose 4-epimerase, a homodimer (PDB structure 1EK5 colored by secondary structure). Credits: Berman *et al.* (2000); Thoden *et al.* (2000); Rose *et al.* (2018).

Substitutions¶

Mutations in the gene sequence can lead to changes in the primary structure of the protein, e.g., a substitution of one amino acid by a different one. Often, such substitutions still lead to highly similar protein structures that perform a similar or even the same function, especially when the exchanged amino acids have similar chemical properties. Nevertheless, single amino acid substitutions can have severe consequences. A prominent example is sickle cell anemia, where a substitution of glutamic acid by valine in hemoglobin β results in a structural change that leads to a distortion in red blood cells (Figure 18).
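A rough first check of whether a substitution exchanges chemically similar amino acids is to compare their property classes from Table 1. The sketch below does exactly that; the function name `is_conservative` is a hypothetical helper, and treating "same class" as "similar" is a deliberate simplification. Glycine, proline, and cysteine are each considered similar only to themselves, because the "special" class groups them for different reasons.

```python
# Property classes taken from Table 1 (one-letter codes).
PROPERTY_CLASSES = {
    "positively charged": "RHK",
    "negatively charged": "DE",
    "polar uncharged": "STNQ",
    "hydrophobic": "AVILM",
    "hydrophobic and aromatic": "FYW",
    "special": "GPC",
}
PROPERTY_OF = {
    amino_acid: name
    for name, members in PROPERTY_CLASSES.items()
    for amino_acid in members
}


def is_conservative(original, substitute):
    """True if both amino acids fall in the same (non-special) property class."""
    if PROPERTY_OF[original] == "special":
        return original == substitute
    return PROPERTY_OF[original] == PROPERTY_OF[substitute]


print(is_conservative("I", "L"))  # True: both hydrophobic
print(is_conservative("E", "V"))  # False: negatively charged to hydrophobic (sickle cell substitution)
```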

Sickle cell anemia — Figure 18:Consequences of a substitution in hemoglobin β resulting in sickle cell anemia. Credits: Rao, A., Tag, A. Ryan, K. and Fletcher, S. Department of Biology, Texas A&M University.

Visualization¶

There are many styles to view protein molecular structures. Some styles focus on detailed chemical structure, others are targeted at the protein surface. For some examples see Figure 19.

Protein representation — Figure 19:Different representations of the PDB structure 5PEP generated with NGL. Credits: Berman *et al.* (2000); Cooper *et al.* (1990); Rose *et al.* (2018).

Genome annotation¶

Annotation of genomes is the process of deciphering what information is encoded in an organism’s DNA. It is an ongoing effort in organisms with known genome sequences. Moreover, genome annotation is a critical step in acquiring biological insights from newly sequenced genomes. Given the large size of any genome, automated procedures are used to identify various genomic elements such as genes, regulatory regions, transposable elements, or other non-coding elements. Each of these bioinformatic procedures typically focuses on identifying one type of element, and as such a complete genome annotation project can be thought of as a pipeline of various procedures. The following section describes the most common steps in genome annotation.

Repeat masking¶

Repeat masking involves the identification and masking (hiding) of repetitive sequences within a genome. It is an essential first step in annotating most genomes because repetitive sequences can pose significant challenges in genome annotation. Masking repeats generally improves:

Accuracy: repetitive elements can be mistakenly annotated as genes or other functional elements, leading to inaccurate predictions and interpretations of the genome.
Computational efficiency: identifying and processing repetitive sequences can be computationally intensive. Masking these repetitive regions once, at the start, reduces the computation time of all downstream analyses.
Biological relevance: repetitive sequences are usually not involved in the coding of proteins of interest. Therefore, focusing on non-repetitive regions is a smart choice in understanding the genes and regulatory elements that drive biological processes.

Most repeat masking workflows work by first compiling (or using a precompiled) ‘repeat library’: a collection of known repetitive elements that have previously been characterized. Subsequently, the genome to be annotated is compared against this repeat library using various computational algorithms, such as (specifically configured versions of) BLAST or RepeatMasker. When a match is found, the corresponding region in the genome is ‘masked’ or annotated as a repetitive element. This means that these regions are excluded from further analysis or labeled as repetitive.
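The idea of masking can be illustrated with a toy example. The sketch below marks every exact occurrence of a library repeat in lowercase (often called soft masking), so that downstream tools can skip or down-weight those positions. This is only an illustration under strong assumptions: the function name `soft_mask` and the sequences are made up, and real tools such as RepeatMasker rely on sequence alignment that tolerates mismatches, not on exact string matching.

```python
def soft_mask(genome, repeat_library):
    """Lowercase every exact occurrence of a library repeat in the genome."""
    genome = genome.upper()
    masked = list(genome)
    for repeat in repeat_library:
        repeat = repeat.upper()
        if not repeat:
            continue
        start = genome.find(repeat)
        while start != -1:
            for i in range(start, start + len(repeat)):
                masked[i] = masked[i].lower()
            # Continue one position further so overlapping matches are also found.
            start = genome.find(repeat, start + 1)
    return "".join(masked)


print(soft_mask("ATGCACACACGT", ["CACACA"]))  # ATGcacacaCGT
```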

Gene prediction¶

The process of finding protein coding genes differs between prokaryotic and eukaryotic genomes. In both cases the aim is to find open reading frames (ORFs): contiguous stretches of nucleotides that encode proteins. More specifically, an ORF starts with a start codon, ends with a stop codon, and its length is a multiple of three (refer to the genetic code in Figure 8). Since RNA splicing (Figure 6) is almost absent in prokaryotic genomes, prokaryotic ORFs can be found directly in the genomic DNA. As a result, simply enumerating all possible ORFs in a genome is a common step in prokaryotic genome annotation. In contrast, ORFs in eukaryotic genomes are found on mature mRNAs. As such, all eukaryotic gene prediction methods take splicing into account, which greatly increases their computational complexity. Both prokaryotic and eukaryotic gene prediction can typically be classified as either evidence based prediction or ab initio prediction; both are explained below.
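Enumerating ORFs as described for prokaryotes can be sketched in a few lines. The example below scans the three reading frames of the given strand and reports each stretch from the first ATG to the next in-frame stop codon. The function name `find_orfs`, the `min_codons` parameter, and the example sequence are illustrative assumptions; a full scan would also process the reverse complement to cover the other strand, and real gene finders apply additional criteria such as minimum length and codon usage.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for ORFs in the three forward reading frames.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, dna[start:end]))
                start = None
    return orfs


print(find_orfs("CCATGAAATAGGG"))  # [(2, 11, 'ATGAAATAG')]
```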

Evidence based prediction¶

This data-driven approach uses existing and newly generated data to get hints on what regions of a genome encode genes. Depending on the type of data, these predictions have more or less predictive power. Some commonly used evidence types are:

RNA-sequencing data: the most direct form of evidence for what regions of the genome are transcribed. As such, RNA-sequencing (often abbreviated to RNA-seq) ‘reads’ often provide the best form of evidence in identifying splice sites in eukaryotes. Note that not all transcribed RNA will be translated into proteins, and that therefore not all RNA-sequencing reads are evidence for protein coding genes. Distinguishing between protein-coding and non-coding RNA is not always trivial.
Homology evidence: Aligning DNA or protein sequences of known genes (from other organisms) is valuable evidence in finding coding regions of the genome. Due to the redundancy in the genetic code, it is not trivial to correctly identify splice sites when aligning protein sequences to a genome. Homology evidence from closely related organisms leads to higher quality predictions than evidence from distantly related organisms.
Whole-genome alignments: this approach uses the annotated genome of a closely related organism to directly identify coding regions in a novel genome. For example: whole-genome alignment of mouse and human genomes reveals that large parts of mouse chromosome 2 are homologous to human chromosome 20. The alignment procedure results in a direct 1-to-1 mapping of mouse and human genome coordinates, and as such annotation coordinates can be transferred between genomes.

Ab initio prediction¶

Ab initio (Latin): from first principles, from the beginning

These methods rely on statistics to learn a predictive model from a known annotated genome. Various forms of ab initio models exist, and although implementation details differ, most follow a similar line of reasoning. For now, we will stick to a high-level description.

All ab initio models scan through a DNA sequence and at each position give a score for a specific type of annotation. In addition, they often take the genomic context of a specific position into account. For example, the probability of a protein-coding annotation on a nucleotide A is high when the next two observed nucleotides are T and G, which together form the start codon ATG (encoding methionine). Most methods also take the predicted annotation of the genomic context into account. For example, the probability that an ATG actually functions as a start codon is much higher if we can also predict an in-frame stop codon. In eukaryotic gene prediction these models become quite complex because they have to include splice sites in all three reading frames.

How exactly a model decides what annotation score to give to which nucleotide is part of the model architecture and parameterization. In all cases, the model parameters are chosen to accurately reproduce a known genome annotation. If sufficient data is used to learn the model parameters, it is assumed that these models can be used to predict annotations on novel genome sequences. Like homology-based prediction, this model-based approach works best for closely related organisms.

In the past, almost all ab initio prediction methods were formulated as hidden Markov models (HMMs) (see Note 1.5). Examples of tools implementing HMM based ab initio prediction are SNAP, GeneMark, and Augustus. With the availability of more high quality data (genome sequences and accompanying annotations), approaches based on deep learning and generative AI have proven to frequently perform better than HMM based approaches.

Note 1.5: Hidden Markov models (HMMs)

Hidden Markov Models (HMMs) are a statistical tool for analyzing sequences. They are widely used in bioinformatics to study biological data such as DNA or protein sequences.

A full technical description is beyond the scope of this book. Instead, we give a simplified introduction.

An HMM predicts hidden labels across a sequence of observations. For example, in genome annotation the genome is observed as a sequence of nucleotides, while coding and non-coding regions are hidden labels. The word hidden refers to these unobserved labels. The word Markov refers to statistical assumptions about dependence between consecutive labels, which make computation efficient.

Formally, an HMM has:

Hidden states: the unobserved labels (e.g. coding or non-coding).
Emission probabilities: the likelihood of observing a symbol (e.g. a nucleotide) given a hidden state.
Transition probabilities: the likelihood of moving from one hidden state to another.

Together, these elements allow questions such as:

Given my current observation (e.g. a nucleotide) and the label of the previous position, what is the most likely label now?

In genome annotation, this might become:

Given that I see a stop codon, and the previous label was coding sequence, what is the current label?

The most likely answer: non-coding (see Figure 20).
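The reasoning above can be sketched with a tiny two-state HMM and the Viterbi algorithm, a standard dynamic-programming method for finding the most likely sequence of hidden states. Everything numeric below is a made-up illustration value, not a trained parameter, and the observation alphabet (codons reduced to `"ATG"`, `"stop"`, or `"other"`) is a simplification chosen for this example. The only property built into the toy model is that the coding state never emits a stop codon.

```python
import math

STATES = ("non-coding", "coding")

# Hypothetical probabilities for illustration only.
START_P = {"non-coding": 0.9, "coding": 0.1}
TRANS_P = {
    "non-coding": {"non-coding": 0.8, "coding": 0.2},
    "coding": {"non-coding": 0.1, "coding": 0.9},
}
EMIT_P = {
    "non-coding": {"ATG": 0.02, "stop": 0.08, "other": 0.90},
    "coding": {"ATG": 0.05, "stop": 0.00, "other": 0.95},
}


def _log(p):
    """Logarithm that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations):
    """Return the most likely sequence of hidden states for the observations."""
    if not observations:
        return []
    scores = {s: _log(START_P[s]) + _log(EMIT_P[s][observations[0]]) for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: scores[p] + _log(TRANS_P[p][s]))
            pointers[s] = best_prev
            new_scores[s] = (scores[best_prev] + _log(TRANS_P[best_prev][s])
                             + _log(EMIT_P[s][obs]))
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


labels = viterbi(["ATG", "other", "stop", "other"])
print(labels[2])  # non-coding: the coding state cannot emit a stop codon
```

Working with log probabilities avoids numerical underflow on long sequences, which is the usual practice for HMMs on genome-scale data.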

Figure 20:A: Graphical representation of a general hidden Markov model. Shaded circles indicate observations, white circles indicate unobserved labellings (hidden states). Black arrows indicate transition probabilities between hidden states, and emission probabilities for observations from hidden states. Note that there are no arrows between observations! This is one of the properties of HMMs that enable efficient computation. B: A (simplified) HMM variant that labels a sequence of DNA codons as either coding or non-coding. Real-world gene prediction HMMs use a more elaborate structure with more hidden states, and six-frame representations of the DNA. Credits: CC BY-NC 4.0 Ridder et al. (2024).

Chapter 2 and Chapter 4 cover various other applications of HMMs in bioinformatics, such as defining and predicting sequence domains, or transmembrane properties of proteins.

Evidence/prediction integration¶

From the previous sections it has become clear that there are several ways of predicting what the genes in a genome look like. Since these various approaches almost never agree exactly in their predictions, a final step in genome annotation is evidence and prediction integration. Typically a weighted consensus approach is used: each individual source of evidence is given a weight representing how much it should influence the final decision, after which a weighted vote decides what the annotation should look like. Typically RNA-seq evidence gets a high weight, and various forms of homology evidence can be weighted depending on how closely related they are to the genome of interest.
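A weighted vote can be sketched per position as follows. This is a simplified illustration: the source names, weights, and labels (`E` for exon, `I` for intron) are hypothetical, and real integration tools combine whole gene models with more elaborate scoring than a per-base vote.

```python
from collections import defaultdict


def weighted_consensus(predictions, weights):
    """Pick, for every position, the label with the highest total weight.

    predictions maps a source name to a list of labels (one per position);
    weights maps the same source names to a numeric weight.
    """
    length = len(next(iter(predictions.values())))
    consensus = []
    for position in range(length):
        votes = defaultdict(float)
        for source, labels in predictions.items():
            votes[labels[position]] += weights[source]
        consensus.append(max(votes, key=votes.get))
    return consensus


# Hypothetical per-base labels from three sources.
predictions = {
    "rna_seq": list("EEEII"),
    "homology": list("EEIII"),
    "ab_initio": list("EIIIE"),
}
weights = {"rna_seq": 3.0, "homology": 1.5, "ab_initio": 1.0}
print("".join(weighted_consensus(predictions, weights)))  # EEEII
```

With equal weights the same input yields `EEIII`, which shows how giving RNA-seq evidence a higher weight changes the outcome at the third position.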

Functional annotation¶

So far, all described steps in the genome annotation process have dealt with what genes look like on a structural level. To gain biological insight, the next step is to assign functional annotations to the predicted genes. This functional annotation step consists of using various sequence alignment and search tools to find sequences with a known function or description and to transfer the information of the known gene to the predicted gene. Several databases of high-quality known functions are often used, which are described in more detail in the next section of this chapter. In Chapter 2 we will learn about approaches for searching these databases efficiently.

Note 1.6: Visualizing gene structure

Gene models: the genomic structure of a gene (often referred to as a gene ‘model’) is typically visualized by a set of lines and rectangles with predefined meaning.

Figure 21:An example gene model. Various visualization conventions can be identified: boxes represent genomic regions that are transcribed. Boxes are exons, lines between boxes are introns. Narrow boxes (sometimes with a lighter color) are untranslated regions (UTRs), wider boxes (sometimes darker colored) are coding sequence regions (CDS). The arrow indicates the direction of transcription. In this example a gene on chromosome 1 with two splice variants is shown, where the first variant has a slightly longer 5’ UTR and an additional CDS exon in between the first and last exons. Credits: CC BY-NC 4.0 Ridder et al. (2024).

Genome browsers facilitate interactive visualization of annotations and evidence alignments on genome sequences. Various implementations exist, but all genome browsers typically provide a linear view of a chromosome that can be scrolled and zoomed. In addition, various annotation ‘tracks’ can often be toggled, to display for instance known gene structures, RNA sequencing alignments, or homologous protein sequence alignments. Most visualization elements can be clicked to open pop-up windows with additional information.

Figure 22:A screenshot of the JBrowse genome browser showing Arabidopsis thaliana chromosome 1 with a gene that has multiple splice variants. Credits: Buels et al. (2016).

Databases¶

Introduction¶

Databases are at the core of bioinformatics. In all analyses, we integrate pre-existing data and we need to access this data. The journal Nucleic Acids Research publishes an entire issue in the beginning of each year on new and updated databases. The list of these databases can also be accessed online.

Computer scientists have developed different kinds of databases. One example is the relational database, which can be queried with SQL (Structured Query Language) and which performs well for data that is processed computationally. Another example is the XML (Extensible Markup Language) database, which stores data in well-structured XML files with a specified format. Nevertheless, most databases for biological sequence data are flat file databases, where the data is saved in structured text files. These files can be manipulated in a text editor without requiring an additional program for database management, and they can be easily exchanged between scientists. On the downside, searching them is slower. This is why they are often indexed, i.e., they are accompanied by an index of keywords, similar to the index at the back of a book.
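To make the idea of indexing a flat file more concrete, the following minimal Python sketch builds an index for a tiny FASTA-like flat file held in memory. The record identifiers, the file contents, and the function names `build_index` and `fetch` are made up for illustration; real databases use more elaborate index formats.

```python
import io


def build_index(handle):
    """Map each record identifier to the byte offset of its header line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            words = line[1:].split()
            if words:
                # The identifier is the first word after ">"
                index[words[0].decode("ascii")] = offset
    return index


def fetch(handle, index, record_id):
    """Jump straight to one record instead of scanning the whole file."""
    handle.seek(index[record_id])
    header = handle.readline().decode("ascii").rstrip()
    sequence_lines = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # the next record starts here
        sequence_lines.append(line.decode("ascii").strip())
    return header, "".join(sequence_lines)


# Toy flat file with two made-up records
flat_file = io.BytesIO(
    b">seq1 first toy record\nACGT\nACGT\n>seq2 second toy record\nGGCC\n"
)
index = build_index(flat_file)
print(index)
print(fetch(flat_file, index, "seq2"))
```

The index is built once by reading the file from start to end. Afterwards, any record can be retrieved with a single jump to its offset, which is what makes indexed flat files practical even when they are large.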

Depending on the kind of data included, we distinguish different kinds of biological databases:

Primary databases contain primary sequence information from experimentally derived data that is directly submitted by the scientists that generated the data.
Secondary databases provide the results of analyses of the information in primary databases.

Each entry in a database has a unique accession number. This number is permanent and provides an unambiguous way to link to the entry. The information that the accession refers to should not change. To still allow updates to an entry, the accession number can contain a version, usually after a dot. For example, NC_003070.9 is the latest version (version 9) for Arabidopsis thaliana chromosome 1 in RefSeq.

Database entries often link to each other via cross links.

GenBank¶

GenBank is a popular primary database for nucleotide sequences and is based at the NCBI (National Center for Biotechnology Information). A GenBank release usually occurs every two months and the most recent release from the 15th of December 2023 contains ~250 million sequences and additionally ~3.7 billion WGS (whole genome shotgun) records. The latter are genome assemblies or genomes that were not yet completed. The complete database is available for download via FTP, but the most convenient way to access individual entries is via the search on the GenBank website (Figure 23).

Since data is directly submitted to GenBank, the information for some loci can be highly redundant. The sequence records are owned by the original submitter and cannot be altered by someone else.

GenBank is part of the INSDC (International Nucleotide Sequence Database Collaboration). The other two member databases are ENA (European Nucleotide Archive) and DDBJ (DNA Data Bank of Japan). The data submitted to any of these databases is exchanged daily, so all three contain essentially the same information.

RefSeq¶

The Reference Sequence (RefSeq) collection is also hosted at NCBI and contains genomic DNA, transcripts, and proteins. The aim of RefSeq is to provide non-redundant, curated data. RefSeq genomes are copies of selected assembled genomes in GenBank. Additionally, transcript and protein records are generated by several processes:

Computation via the eukaryotic or prokaryotic annotation pipeline.
Manual curation.
Transfer of information from annotated genomes in GenBank.

In contrast to GenBank, RefSeq records are owned by NCBI and can be updated to maintain annotation. The current release is 231 from the 11th of July 2025 and contains ~418 million proteins from ~167,000 organisms.

The RefSeq accessions directly provide information on molecule types. For example, NC_ accessions denote complete genomes, NP_ accessions denote proteins in one genome, and WP_ accessions denote proteins in multiple genomes.
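As an illustration of how accession numbers can be processed computationally, the sketch below splits a RefSeq-style accession into its prefix, number, and optional version. The prefix meanings are taken from the text above. The regular expression is a simplifying assumption (two capital letters, an underscore, digits, and an optional version after a dot), and RefSeq defines more prefixes than the three listed here.

```python
import re

# Prefix meanings as described in the text; not a complete list
PREFIX_MEANING = {
    "NC": "complete genome",
    "NP": "protein in one genome",
    "WP": "protein in multiple genomes",
}

# Simplified pattern for illustration: PREFIX_NUMBER with an optional .VERSION
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def parse_accession(accession):
    """Split an accession into prefix, number, version, and molecule type."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not a RefSeq-style accession: {accession!r}")
    prefix, number, version = match.groups()
    return {
        "prefix": prefix,
        "number": number,
        "version": int(version) if version is not None else None,
        "molecule": PREFIX_MEANING.get(prefix, "unknown prefix"),
    }


print(parse_accession("NC_003070.9"))
```

Keeping the version separate from the rest of the accession makes it easy to check whether two identifiers refer to the same entry in different versions.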

UniProt¶

There is lots of information available for proteins, such as sequence information, domains, expression, or 3D structure. The aim of the Universal Protein Resource (UniProt) is to provide a comprehensive resource for proteins and their annotation. UniProt contains three databases (Figure 24):

UniProt Knowledgebase (UniProtKB) - see below.
UniProt Reference Clusters (UniRef) - clusters of protein sequences at 100%, 90%, and 50% identity.
UniProt Archive (UniParc) - non-redundant archive of publicly available protein sequences seen across different databases.

Figure 24: The information flow in UniProt. Credits: CC BY-NC-ND 4.0 Leon & Pastor (2021).

UniProtKB is the central hub for functional information on proteins. For each protein it contains the core data (such as sequence, name, description, taxonomy, citation) and as much annotation information as possible. It contains many cross-references to other databases and is generally a very good starting point to find information on a protein.

UniProtKB consists of two sections:

Swiss-Prot - manually-annotated records with information extracted from literature and curated computational analysis.
TrEMBL - automatically annotated records that are not reviewed.

UniProtKB is updated every 8 weeks. The current release has ~570,000 entries in Swiss-Prot and ~250 million entries in TrEMBL.

Prosite¶

Prosite is a secondary database of protein domains, families, and functional sites. Some regions in protein families are more conserved than others because they are important for the structure or function of the protein. Prosite contains motifs and profiles specific for many protein families or domains. Searching motifs in new proteins can provide a first hint for protein function.

The current release of Prosite from the 18th of June 2025 contains 1311 patterns, 1403 profiles, and 1421 ProRule entries.

A Prosite pattern is typically 10 to 20 amino acids in length. These short patterns are usually located in short well-conserved regions, such as catalytic sites in enzymes or binding sites. A pattern is represented as a regular expression, where positions are separated by hyphens and x denotes any amino acid. Square brackets list the amino acids that are allowed at a position and curly braces list the amino acids that are not allowed. Repetitions are given as the number of repetitions in parentheses. For example, [AC]-x-V-x(4)-{ED} matches sequences that contain the following amino acid sequence: (Alanine or Cysteine)-any-Valine-any-any-any-any-(any but Glutamic acid or Aspartic acid). Note that this representation is qualitative: a sequence either matches a pattern or it does not.
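Because a Prosite pattern is a regular expression in a special notation, it can be translated into the regular expression syntax of a programming language. The sketch below does this in Python for the pattern elements described above only (single amino acids, x, allowed sets, excluded sets, and a fixed repetition count such as x(4)). The function name `prosite_to_regex` and the test sequences are made up for illustration, and other elements of the Prosite syntax are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple Prosite pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Separate an optional repetition count such as "(4)" from the element
        match = re.fullmatch(r"(.+?)(?:\((\d+)\))?", element)
        if match is None:
            raise ValueError(f"empty pattern element in {pattern!r}")
        core, count = match.groups()
        if core == "x":
            regex = "."  # any amino acid
        elif core.startswith("[") and core.endswith("]"):
            regex = core  # allowed amino acids, same notation in Python
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"  # excluded amino acids
        elif len(core) == 1 and core.isupper():
            regex = core  # one specific amino acid
        else:
            raise ValueError(f"unsupported pattern element: {element!r}")
        if count is not None:
            regex += "{" + count + "}"
        parts.append(regex)
    return "".join(parts)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
print(regex)

# Made-up sequences: the first contains a match, the second ends the
# candidate region with E, which the pattern excludes
print(re.search(regex, "GGASVKLMNQGG"))
print(re.search(regex, "GGASVKLMNEGG"))
```

The search either returns a match or nothing at all, which reflects the qualitative nature of patterns.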

Patterns cannot deal with mismatches and are limited to exact matches. Thus, they are not well suited to identify distant homologs. A Prosite profile is more general than a pattern and can also detect poorly conserved domains or families. Profiles characterize protein domains over their entire length and do not just model the conserved parts. They are estimated from multiple sequence alignments, and we will learn more about them in Chapter 2. For now, it is important to know that profiles model matches, insertions, and deletions. Importantly, profiles are quantitative representations: they return a score that indicates how well a sequence fits the profile. A threshold can be applied to keep only the high-scoring profiles for a sequence. In contrast to patterns, a mismatch to a profile can be accepted if the rest of the sequence is highly similar to the profile. Profiles are well suited to model structural properties of a domain.
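The quantitative nature of profiles can be illustrated with a strongly simplified sketch. The Python code below derives a position-specific scoring matrix from a small, made-up, ungapped alignment and scores sequences against it. This is not how Prosite profiles are actually built: insertions and deletions are ignored, a uniform background frequency is assumed, and the pseudocount of 1 is an arbitrary choice.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_scoring_matrix(alignment, pseudocount=1.0):
    """Compute log-odds scores per alignment column (ungapped sequences only)."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for position in range(len(alignment[0])):
        counts = Counter(sequence[position] for sequence in alignment)
        column = {}
        for amino_acid in AMINO_ACIDS:
            frequency = (counts[amino_acid] + pseudocount) / total
            column[amino_acid] = math.log2(frequency / background)
        matrix.append(column)
    return matrix


def score(matrix, sequence):
    """Sum the per-position scores of a sequence as long as the matrix."""
    return sum(column[amino_acid] for column, amino_acid in zip(matrix, sequence))


# Made-up alignment of four short sequences
alignment = ["ACDVK", "ACEVK", "SCDVR", "ACDIK"]
matrix = build_scoring_matrix(alignment)

for candidate in ["ACDVK", "ACEVK", "WWWWW"]:
    print(candidate, round(score(matrix, candidate), 2))
```

A sequence with a single less common residue (ACEVK) still obtains a clearly positive score, whereas an unrelated sequence (WWWWW) scores below zero. A threshold on the score then decides which sequences are reported as matches.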

Notably, profiles cover the structural relationships of domains, but they might also score a sequence highly that lacks important functional residues. To include that information, ProRule contains additional information about Prosite profiles, such as the position of structurally or functionally important amino acids. ProRule is used to guide curated annotation of UniProtKB/Swiss-Prot.

InterPro¶

The Integrated Resource of Protein Families, Domains and Sites (InterPro) integrates 13 member databases (including Prosite and Pfam) into a comprehensive secondary database. Additionally, it provides annotation from other tools, for example to annotate signal peptides and transmembrane regions. By annotating a sequence with all member databases simultaneously, it makes it possible to identify functionally important domains and conserved sites. InterPro can be used to find out which protein family a sequence belongs to, or what its putative function is. Additionally, one InterPro entry can integrate entries from several member databases if they represent the same biological entity, which reduces redundancy. InterPro entries are also linked to Gene Ontology. They are curated before being released.

InterPro is updated every 8 weeks. The current release from the 19th of June 2025 contains ~49,000 entries of different types, which are described below.

As an example, look at the InterPro entry for the type 2 malate dehydrogenase protein family. The entry has a name (malate dehydrogenase, type 2) and accession (IPR010945). The contributing entries in member databases are shown on the right-hand side, with links to the individual member database entries. A descriptive abstract explains what these proteins are and what their function is. A set of GO terms is also provided, which describe the characteristics of the proteins matched by the entry.

You can get the InterPro annotation for a protein by running a new sequence search (Figure 25), or by looking up its UniProt accession (Figure 26).

Figure 25: Search fields on the InterPro home page, showing text search field (A) and the sequence search (B) options, including ‘Advanced options’, where you can limit your search to member databases or sequence features of interest. Selecting the browse tab in the top menu (C) allows access to a browse search (e.g., search for member database signature, InterPro entry type), see also Figure 26. You can also search for a particular domain architecture (D). Credits: Paysan-Lafosse et al. (2022).

Figure 26: Browse the annotated proteins in InterPro and search for a UniProt accession. See the resulting entry in Figure 27. Credits: Paysan-Lafosse et al. (2022).

Figure 27: The result page when looking up UniProt accession A0A076FRI5 in InterPro. You can see the family and domain annotation and on the right the accessions in InterPro and in the member databases. You can click on each of these accessions to get to the entry information. Credits: Paysan-Lafosse et al. (2022).

You may have noticed a colored letter before each InterPro accession, e.g., F before IPR011835 or D before IPR001296 (Figure 27). These icons denote the different InterPro entry types:

(Homologous) Superfamily - a large diverse family, usually with shared protein structure.
Family - a group of proteins sharing a common evolutionary origin, reflected by their related functions and similarities in sequence or structure.
Domain - a distinct functional or structural unit in a protein, usually responsible for a particular function or interaction.
Repeat - typically a short amino acid sequence that is repeated within a protein.
Site - a group of amino acids with certain characteristics that may be important for protein function, e.g., active sites or binding sites.

Figure 28: The icons for the different InterPro entries (homologous superfamily, family, domain, repeat or site). Credits: CC BY-SA 4.0 Mitchell (2020).

Pfam¶

Pfam is an important resource for protein domains. In Pfam, domains are classified according to profiles that are modelled as Hidden Markov models (HMMs). We will learn more on HMMs in Chapter 2. Pfam is now integrated in InterPro. Each Pfam domain can be represented with a logo, where the amino acids occurring more frequently at a particular position are represented as larger letters (Figure 29).

Figure 29: The Pfam logo for PF12924. Credits: Paysan-Lafosse et al. (2022).

File formats¶

There are many different formats for biological data. A format is a set of rules about the contents and organization of the data. You should be familiar with a couple of common data formats in bioinformatics (see Table 2), which you will encounter in the practicals.

Table 2: Examples of common data formats in bioinformatics. Unless explicitly noted these are plain text formats.

File format	Usage	Common extension
FASTA	Nucleotide or amino acid sequences	`.fa`, `.fasta`, `.fna`, `.faa`
GenBank	Sequences, annotations, metadata	`.gb`
Generic Feature Format	Sequence annotations	`.gff`
FASTQ	DNA sequencing data including basecalling quality scores	`.fq`
SAM/BAM (Sequence/Binary Alignment and Map)	Typically alignments between raw sequencing data and a reference, with quality scores	`.sam` (`.bam` for the binary file format)
VCF (Variant Call Format)	(Genetic) variant calls based on sequencing data aligned to a reference	`.vcf` (`.bcf` for the binary file format)
PDB	Protein structure data	`.pdb`
Unstructured text	Miscellaneous	`.txt`

Plain text files¶

Many of the biological data formats are plain text files: they only contain letters, numbers, and symbols, but no formatting, such as font size or colors. Although these file types can have various extensions (e.g. .fasta, .gff, etc.), they only contain plain text. The advantage of plain text files is that they can be opened with any text editor on any computer. Plain text differs from rich text, which can also include formatting. Many bioinformatics programs expect plain text files as input. Thus, when creating them on your computer, take care to save in this format, and not for example as an RTF or Word document.

On a Windows computer, plain text files can for example be created with the Notepad program (Figure 30).

Figure 30: A screenshot of Notepad on Windows. Credits: CC BY-NC 4.0 Ridder et al. (2024).

On a Mac, plain text files can for example be created with the TextEdit program (Figure 31). Take care to set the settings to plain text.

Figure 31: A screenshot of TextEdit on Mac. Credits: CC BY-NC 4.0 Ridder et al. (2024).

Two plain text formats deserve a closer look. A FASTA file stores a DNA or protein sequence (Figure 32). Information on the sequence is found in the header, which starts with > and takes up a single line. The sequence itself follows the header and can span multiple lines. A multi-FASTA file stores multiple sequences.
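The structure of the FASTA format described above is simple enough to read with a few lines of code. The following minimal Python sketch parses a made-up multi-FASTA text into a dictionary that maps each header to its sequence. The function name `parse_fasta` and the example records are for illustration only. In practice, established libraries are normally used for this task.

```python
def parse_fasta(text):
    """Return a dictionary mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines belong to the most recent header
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up multi-FASTA content with two records
example = """>toy_gene_1 an invented DNA sequence
ATGGCC
TTTTAA
>toy_protein_1 an invented protein sequence
MAF
"""

for name, sequence in parse_fasta(example).items():
    print(name, sequence, len(sequence))
```

Note how the sequence of the first record is split over two lines in the file but is joined into one string by the parser.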

The GenBank file format is a popular format to represent genes or genomes. Here you can find an example GenBank record with annotations. Important elements are the Locus, Definition (i.e., the name), and the Organism. Additionally, Features, such as genes and CDSs (coding sequences) are listed.

Binary files¶

Binary files are all files that are not text files. They cannot be read meaningfully in a text editor. Instead, they need special programs to write, open, and interpret them. Examples are Word files (.docx) which can be opened with Word, PDF files (.pdf) which can be opened with Acrobat Reader, or image files (e.g., .png) which can be opened with image viewers.

Binary files are also sometimes used in bioinformatics. Examples include the BAM format, which is a binary version of the SAM format, and the gzip format. Gzip is used for compressing text files without loss of information. For large files, a lot of disk space can be saved this way.
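The effect of gzip compression can be tried out with the gzip module from the Python standard library. The sketch below writes a made-up, highly repetitive FASTA text to a compressed file in a temporary directory, reads it back, and compares the sizes. The file name and the sequence are invented for illustration, and real sequence data will usually compress less well than this repetitive example.

```python
import gzip
import os
import tempfile

# Made-up FASTA content: one header and a very repetitive sequence
text = ">toy_sequence\n" + "ACGT" * 1000 + "\n"

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "toy.fasta.gz")

    # "wt" writes text that is compressed on the fly
    with gzip.open(path, "wt", encoding="ascii") as handle:
        handle.write(text)
    compressed_size = os.path.getsize(path)

    # "rt" decompresses on the fly and returns text again
    with gzip.open(path, "rt", encoding="ascii") as handle:
        restored = handle.read()

original_size = len(text.encode("ascii"))
print("original size in bytes:", original_size)
print("compressed size in bytes:", compressed_size)
print("identical after decompression:", restored == text)
```

The restored text is identical to the original, which shows that no information is lost during compression.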

Ontologies¶

An ontology is a comprehensive and structured vocabulary for a particular domain, such as biology, genetics, or medicine. It defines the various terms used in a domain, along with their meanings and interconnections. As such, ontologies serve as standardized frameworks for organizing and categorizing information in a way that enables effective communication and reasoning among researchers, practitioners, and computer systems. For example, the terms in an ontology can encompass biological entities like genes, proteins, and cells, as well as processes, functions, and interactions that occur within living organisms. Most of the databases mentioned in this chapter use ontologies in some way to describe their data.

Ontologies play a crucial role in bioinformatics because they facilitate:

Standardization and consistency: ontologies provide a common language and consistent framework for researchers and professionals, ensuring that everyone understands and uses terms in the same way.
Interoperability: ontologies facilitate the sharing and integration of data and knowledge across different research groups, institutions, and databases. They enable computer systems to process data more accurately, leading to more meaningful analyses and discoveries.
Scientific reasoning: by organizing information in a logical and structured way, ontologies help researchers generate hypotheses, design experiments, and validate findings more effectively.

Ontologies typically form a hierarchy, where specific terms point to more generic terms. More generally, most ontologies are represented as a graph, where ontology terms are the nodes and relationships between terms are edges. As such, one ontology term may have more than one parent term. A variety of ontologies are frequently used in the life sciences, some of which are discussed in greater detail below.

Gene Ontology¶

The Gene Ontology (GO) is a knowledgebase for the function of genes and gene products (e.g. proteins). It is organised into three different domains covering various aspects:

Molecular Function: molecular-level functions performed by gene products (e.g. proteins), such as ‘catalysis’ or ‘transport’. Most molecular functions can be performed by individual gene products, but some functions are performed by complexes consisting of multiple (possibly differing) gene products. GO molecular functions often include the word “activity” (an amylase enzyme would have the GO molecular function amylase activity).
Cellular Component: the cellular structures (or location relative to them) in which a gene product performs its function. Can be cellular compartments (e.g., mitochondrion) or macromolecular complexes of which they are part (e.g., the ribosome).
Biological Process: the larger biological programs composed of multiple molecular activities, for example DNA repair or signal transduction.

A good example of how ontologies are represented as graphs is the biological process hexose biosynthetic process, which has two parents: hexose metabolic process and monosaccharide biosynthetic process. This reflects that a biosynthetic process is a subtype of metabolic process and that a hexose is a subtype of monosaccharide (Figure 33).
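The graph structure of an ontology can be sketched in Python as a dictionary that maps each term to its parent terms. The example below uses the hexose biosynthetic process and its two parents from the text. The shared grandparent term monosaccharide metabolic process is added here for illustration and should be checked against the current GO release. All edges are treated alike in this sketch, without distinguishing relationship types.

```python
# Each term points to its direct parents; a term may have more than one parent
PARENTS = {
    "hexose biosynthetic process": [
        "hexose metabolic process",
        "monosaccharide biosynthetic process",
    ],
    # The grandparent below is an assumption added for illustration
    "hexose metabolic process": ["monosaccharide metabolic process"],
    "monosaccharide biosynthetic process": ["monosaccharide metabolic process"],
    "monosaccharide metabolic process": [],
}


def ancestors(term, parents=PARENTS):
    """Collect all terms that can be reached by following parent edges."""
    found = set()
    to_visit = list(parents.get(term, []))
    while to_visit:
        current = to_visit.pop()
        if current not in found:
            found.add(current)
            to_visit.extend(parents.get(current, []))
    return found


for name in sorted(ancestors("hexose biosynthetic process")):
    print(name)
```

Because a term can be reached via several paths, the function keeps track of the terms it has already seen, so that the shared grandparent is reported only once. This kind of traversal is why a gene annotated with a specific term is implicitly also annotated with all more general terms.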

Edges between GO terms in the GO hierarchy can represent various relationships between genes and gene products. The four main relationship types used in the gene ontology are ‘is a’, ‘part of’, ‘has part’, and ‘regulates’ (see Figure 34).

Sequence Ontology¶

The Sequence Ontology (SO) describes biological sequence elements such as genes or repeats, along with their features and attributes.

The sequence ontology is organized on four main levels:

Attribute: an attribute describes a certain quality of a given sequence, for example the sequence source (i.e., how it was generated).
Collection: multiple discontiguous sequences together, for example the chromosomes of a complete genome.
Feature: the most general top-level entry that describes any extent of a continuous biological sequence, for example a gene is a region, which in turn is a sequence feature.
Variant: intended to describe genetic variation. The definition of a sequence variant is composed of other entries in the sequence ontology: “A sequence_variant is a non-exact copy of a sequence_feature or genome exhibiting one or more sequence_alterations”.

Figure 34: An extract of the Sequence Ontology hierarchy. Credits: Eilbeck et al. (2005).

Other ontologies¶

Many more ontologies exist and are relevant to biomedical research. The European Bioinformatics Institute (EBI) provides an ontology lookup service that facilitates searching for ontologies. Examples of other ontologies are the plant ontology that describes various anatomical structures in plants, and the human disease ontology.

Practical assignments¶

This practical contains questions and exercises to help you process the study materials of Chapter 1. You have 2 mornings to work your way through the exercises. In a single session you should aim to get about halfway through this guide (i.e., day 1: assignment 1-3, day 2: assignment 4 and project preparation exercise). Use the time indication to make sure that you do not get stuck in one assignment. These practical exercises offer you the best preparation for the project. Especially the project preparation exercise at the end is a good reflection of the level that is required to write a good project report. Make sure that you develop your practical skills now, in order to apply them during the project.

Note, the answers will be made available after the practical!

Exercise 1 (DNA/Genes, 45 minutes)

How do you distinguish a ribose sugar from a deoxyribose?
Which bases are purines?
What is the complementary base of A? C? G? T?
What is the reverse complement of sequence ACGGTGATC?
What is the GC content of sequence ATCGATCGGC?
Which is correct? A nucleotide sequence is written from:
A. 5’ to 3’
B. 3’ to 5’
In a DNA sequence the G stands for:
A. Glycine
B. Guanine
C. Glucose
D. Glutamic acid
Given the coding DNA strand below, write down the non-coding strand, the transcribed sequence, and the resulting chain of amino acids. You may use Figure 3.

Coding strand:         5' ATGGTTTTACTTGAA 3'
Non-coding strand:     ......................
mRNA:                  ......................
Amino acids:           ......................

On your computer, browse to UniProt and search for UniProt ID B3H4Y2.
a. In which organism is this protein found? What is the length of this protein? What is the corresponding gene ID?
b. Write down the first 5 and last 5 amino acids of the protein.
Browse to arabidopsis.org and click on “JBrowse” (Firefox or Chrome recommended). This will take you to a genome browser of the Arabidopsis genome. Search for the gene ID from question 9 (see screenshot above). Under “Help” -> “General” you can find some information to help you understand what you are looking at.
a. You can see that this gene produces two different mRNA transcripts (indicated by .1 and .2) and thus 2 different proteins. How many exons do these transcripts contain? How many introns?
b. Turn on the track “Light grown seedling” under “RNA-seq based evidence”/“Aligned reads”. Do you recognize the splice sites? Are the first two and last two bases of the intron as expected (based on Figure 6)?
c. Save the data for transcript 1. Save a fasta file for the whole transcript and one for each coding sequence (CDS). Create a fasta file on your computer that contains the complete coding sequence of the protein.
d. Is the length of the coding sequence in line with your expectation (based on your findings in question 9a)?
e. Translate the first and last few codons to compare them against the protein sequence (question 9). Do they match?
f. Look upstream of the gene. Can you find the TATA box? How many nucleotides before the start of transcription?
GC content
a. Find a tool on the internet to calculate the GC content of a gene. Which tool did you find? Use it to calculate the GC content for the whole transcript and for the coding sequence that you created in the previous task. What do you observe?
b. Look up the GC content of the chromosome where this gene is located (Hint: Search NCBI Genome for the species). Read about GC content in coding sequences. Which of the information presented here agrees with your analysis?
Why are viruses not represented in the tree of life? Take a look at this site.
Browse to the NCBI taxonomy. Look up the domain and family of the following species:

Species	Domain	Family
Moraxella catarrhalis
Haloarcula quadrata
Loxodonta cyclotis

Exercise 2 (Proteins, 45 minutes)

What is special about the amino acid glycine?
List three hydrophobic amino acids.
Which amino acids are acidic?
alanine glutamine leucine serine
arginine glutamic acid lysine threonine
asparagine glycine methionine tryptophan
aspartic acid histidine phenylalanine tyrosine
cysteine isoleucine proline valine
Which is incorrect?
a. A = Arginine
b. V = Valine
c. Q = Glutamine
d. T = Threonine
In a folded protein, the nonpolar amino acids tend to be:
a. On the inside of the protein
b. At the surface of the protein
c. Randomly distributed
The side chains of amino acids play important roles in the folding and the function of proteins. Below, you can see a short peptide that has been formed by five amino acids (labeled from 1 to 5).
a. Indicate in blue and red the N-terminus and the C-terminus of the peptide, respectively, and highlight all peptide bonds in green.
b. For each of the five amino acids (1-5), give either the name, the three-letter or the one-letter code, depending on the information lacking (for example, for amino acid 1, give the three- and the one-letter code, while for amino acid 2 give the name and the three-letter code).
c. Indicate for each of the amino acids (1 to 5) its physicochemical properties (nonpolar, polar, acidic, basic).
d. Describe in one sentence what specific property the side chain of amino acid 4 has, and why this property is important to form protein structures.

Amino acids and their side chains can interact with other amino acids and form bonds and other interactions. These interactions play important roles in stable folded protein structures. Revisit Figure 15 from the reader. With this information in mind, take another look at the peptide sequence with five amino acids (see question 6). Below this peptide, you will find the backbone of another peptide in which the side chains are only indicated with R (labeled a-e). Look at the 20 amino acids in the amino acid table in the reader. Discuss with your neighbour which of the 20 possible amino acids could be placed as side chains R (a-e) such that they can likely interact, i.e., form bonds or other interactions, with the corresponding amino acids (1-5) in the upper peptide (i.e., a interacts with 1, b with 2, and so on). Indicate for each pair which type of interaction (e.g., hydrogen bonds) occurs between your proposed pair of amino acids.
Proteins fold into compact structures, and this structure is important for their biological activity. In folded proteins (tertiary structure), the secondary structure is often still visible, i.e., helices and beta sheets can still be recognized. Sometimes proteins are not formed by a single structural unit, a so-called domain, but by multiple domains that can be either of the same type or of different types. It can be useful to look at the tertiary structure of proteins with a known fold (determined either experimentally or in silico), e.g., to see where mutations occur in the structure. We will have a look at the protein structure of Gamma B-Crystallin. Go to the website of PDB, which is a resource for protein structures, and search on the main page for Gamma B-Crystallin, with the ID “1AMM”. Click on ‘3D View’ to see a three-dimensional model of the structure. On the bottom right, change the viewer to NGL.
a. Color the structure by Secondary structure (see ‘Color’, under ‘Structure View’). Under ‘Structure View Documentation’ you can find the meaning of each color. How many secondary structure elements (helix, sheet) can you observe in the structure?
b. How many domains does this protein have?
Amino acid quiz: you have now worked extensively with amino acids and you should know the relation between the 1- and 3-letter code, the name of the amino acid and its biochemical properties. To test this knowledge once more, perform this small test by filling in the missing information in the table (do not look at the reader before finalizing the quiz).

#	1-letter	3-letter	Full name	Class
1			Glutamic acid	Nonpolar/Polar/Acidic/Basic
2		Phe		Nonpolar/Polar/Acidic/Basic
3	T			Nonpolar/Polar/Acidic/Basic
4		Pro		Nonpolar/Polar/Acidic/Basic
5			Serine	Nonpolar/Polar/Acidic/Basic
6	K			Nonpolar/Polar/Acidic/Basic
7			Isoleucine	Nonpolar/Polar/Acidic/Basic
8		Asn		Nonpolar/Polar/Acidic/Basic
9			Methionine	Nonpolar/Polar/Acidic/Basic
10	A			Nonpolar/Polar/Acidic/Basic
11	P			Nonpolar/Polar/Acidic/Basic
12		His		Nonpolar/Polar/Acidic/Basic

Exercise 3 (Databases, 45 minutes)

In a web browser, navigate to the Molecular Biology Database Collection of the journal Nucleic Acids Research (NAR). Pick three databases from the list that draw your attention, preferably from different categories, and explore them (approx. 5 min each).
a. What type of data is in there?
b. What would it be used for? Highly specialized or broad applications?
c. How can you search the database?
d. Does it look up-to-date and regularly maintained?
Redundancy
a. What does redundancy in a database mean? Give an example of redundancy in a sequence database.
b. Are the UniProt databases redundant or non-redundant?
c. What is the difference between RefSeq and GenBank in terms of redundancy?
Ontology
a. Describe what an ontology is (use the information in the reader and/or Google to find information).
b. The Gene Ontology is one of the most important ontologies in bioinformatics. Which biological domains are covered in the Gene Ontology?
c. Look up the Arabidopsis protein in UniProt (Accession B3H4Y2). What information do you find about the GO terms associated with this protein?
d. Now look up the famous Arabidopsis gene FRIGIDA (Accession P0DH90). Which GO terms are associated with this gene? In which cellular component is this protein found and which biological process is it involved in?
UniProt
a. Look up the two proteins from Q3 in UniProt again. In which of the sections of UniProt is each of them deposited? Which of the two has a higher annotation quality?
b. How many publications are linked to each of these proteins? Which of these publications contains specific information on the protein (based on the title)?
c. For each protein, look up at least one cross-reference to a database that you know and to a database that you do not yet know. Spend a few minutes to browse the information that you can gain in this way.
d. Calculate the frequency of individual amino acids in both protein sequences using the PIR website. Do you notice something remarkable (Hint: look at the relative abundance of various amino acids)? Can you relate this to information that is present in UniProt (Hint: look at family/domains)?
A hot topic in biological data management is “FAIR” data. What do the letters in FAIR stand for and what do those terms mean?

Exercise 4 (Genome annotation, 120 minutes)

Explain how homology searches can be useful in genome annotation and why it is more complex for eukaryotes than for prokaryotes.
How does RNA-sequencing data help gene prediction? Is RNA-sequencing data on its own sufficient to annotate a genome?
In the UniProt database, look up the Arabidopsis protein with identifier B3H4Y2. The corresponding gene ID is AT1G65484. In a second tab, look up this gene in TAIR JBrowse. The sequence of this gene can be found on Brightspace. In a third tab, go to the NCBI Open Reading Frame Finder, paste in the gene sequence in FASTA format, and hit submit.
a. How many ORFs are found? Are they all in the same reading frame?
b. Do any of the ORFs correspond to the ORF in the annotated gene? Why or why not?
c. When is a simple ORF detection tool useful? When is it insufficient?
Yeast (Saccharomyces cerevisiae) is a well-studied model organism. A lot of information about yeast is stored in the Saccharomyces Genome Database (SGD). You will work on the reference strain S288C. The relatively small genome of yeast gives us the opportunity to explore some online genome annotation tools (within reasonable time) and compare our findings to the high-quality annotation that is available. In the menu on SGD click on Sequence -> Reference Genome -> Genome Snapshot. Under “Features by Type” explore both the graph and the table view.
a. How many genes does the yeast genome contain?
b. How many tRNAs have been annotated on chromosome 3 and how many on the mitochondrial genome?
First you will annotate the mitochondrial genome of yeast, using MITOS2.
c. Use the mitochondrial genome provided on Brightspace, or download the sequence from SGD. What is the length of the mitochondrial (MT) genome? Be creative or use Google to find the answer. Is the MT genome linear or circular?
d. The yeast MT genome does not use the Standard Genetic Code. Which one does it use? Use NCBI Genetic Codes to find the answer. How many codons have a different meaning?
e. Go to the MITOS2 tool. Select a relevant Reference and Genetic Code for yeast and upload the FASTA file. Make sure to change the output to GFF. Hit submit. This annotation can take about 10 minutes, so continue with the next exercises (5 and 6) until the results are ready.
f. Browse the results by clicking around. How many tRNAs are predicted?
g. Look up the first predicted tRNA in the genome browser on SGD. Does the prediction match the known tRNA? To which amino acid does this tRNA correspond? How many GO terms are associated with this tRNA gene?
h. What can you find out about the “giy” gene? In the FAA FASTA text file (download from the menu on the left) you can find the protein sequence produced by the gene. Go to InterPro and use the protein sequence to search for known protein domains/functions using InterProScan. What functional information do you find for this protein? (Running InterProScan can take a few minutes; in the meantime, continue with the next question.)
Next, we will predict protein-coding genes on chromosome 3 using the widely used Augustus ab-initio gene predictor. Augustus outputs its predictions in the GFF3 text-based file format. To be able to make sense of the Augustus output, familiarize yourself with this format here. The Augustus gene predictor has already been trained on many organisms.
a. Upload the sequence of chromosome 3, choose the right Organism and “run AUGUSTUS” (predict genes on both strands). How many genes are predicted? How many of those contain an intron?
b. What can you say about the function, domains, etc. of gene g100? Which databases did you use to find this information?
You have now seen several examples of tools used in genome annotation. Describe structural and functional genome annotation in your own words and give an example of each.
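To make the idea of a "simple ORF detection tool" from this exercise concrete, here is a minimal sketch of a six-frame ORF scan. It is an illustration under stated assumptions, not the algorithm used by the NCBI Open Reading Frame Finder: it assumes the standard genetic code, only ATG as start codon, and a hypothetical `min_codons` threshold. The function names and the toy sequence are invented for this example.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna: str, min_codons: int = 30):
    """Yield (strand, frame, start, end, n_codons) for each ATG...stop ORF.

    start and end are 0-based, end-exclusive positions on the scanned strand
    (for the "-" strand these refer to the reverse complement).
    n_codons counts the start and the stop codon.
    """
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            orf_start = None
            # Walk through the strand codon by codon in this reading frame.
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if orf_start is None and codon == START:
                    orf_start = i  # first ATG opens an ORF
                elif orf_start is not None and codon in STOPS:
                    n_codons = (i + 3 - orf_start) // 3
                    if n_codons >= min_codons:
                        yield strand, frame, orf_start, i + 3, n_codons
                    orf_start = None  # look for the next ORF in this frame


# Hypothetical toy sequence, NOT the AT1G65484 gene sequence.
orfs = list(find_orfs("CCATGAAATAGGG", min_codons=3))
print(orfs)
```

Note that such a scan only looks for an uninterrupted stretch between a start and a stop codon. It knows nothing about introns, so on a eukaryotic genomic sequence the reported ORFs need not correspond to the annotated coding sequence.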

Project Preparation Exercise

We want to obtain insights into members of the ARF gene family in Arabidopsis thaliana. ARF5 (UniProt ID P93024) and IAA5 (UniProt ID P33078) are two well-studied A. thaliana proteins that play a role in auxin-mediated regulation of gene expression. They are therefore chosen here as the starting points for exploring the plant ARF gene family. Perform a small background study on ARF5 and IAA5. Explore the protein sequences, properties (e.g., length, composition, etc.), interaction partners, and functional regions of ARF5 and IAA5. Finally, explore the genes encoding ARF5 and IAA5 in A. thaliana (genomic location, exon structure, expression, etc.).

Describe the following items in a few bullet points each. You may include up to two figures or tables.

Materials & Methods: What did you do? Which data, databases and tools did you use, and why did you choose these? What important settings did you select?
Results: What did you find, what are the main results? Report the relevant data, numbers, tables/figures, and clearly describe your observations.
Discussion & Conclusion: Do the results make sense? Are they according to your expectation or do you see something surprising? What do the results mean, how can you interpret them? Do different tools agree or not? What can you conclude? Make sure to describe the expectations and assumptions underlying your interpretation.

Glossary¶

Annotation: The process of identifying functional elements within a genomic sequence, such as genes, coding regions, and regulatory motifs.
Cell: The basic structural and functional unit of all living organisms.
DNA: DeoxyriboNucleic Acid
Exon: A DNA segment in a gene that encodes part of the mature messenger RNA (mRNA) after intron removal.
Gene: A segment of DNA that encodes functional products, typically proteins.
Genome: The complete set of genes or genetic material present in a cell or organism.
Genome browser: A tool for visually inspecting genomic regions, annotations, and experimental data tracks.
HMM: Hidden Markov Model - a statistical model that represents systems where the states are not directly observable (hidden) but can be inferred from observable data.
Intron: A DNA segment in a gene that is not expressed in the mature messenger RNA (mRNA) product and is removed during the RNA splicing process.
mRNA: Messenger RNA. Mature mRNA is the processed form of the transcript from which the introns have been removed, so that it consists only of exons and is ready for translation into protein.
miRNA: MicroRNAs are small, single-stranded, non-coding RNA molecules of 21–23 nucleotides.
Nucleotide: The basic building block of DNA and RNA, consisting of a base, sugar, and phosphate group.
Protein: A molecule composed of amino acids, encoded by genes, and responsible for cellular structure and function.
RNA: RiboNucleic Acid
rRNA: Ribosomal RNA is a type of non-coding RNA that is a key component of ribosomes, which are essential for protein synthesis in all living cells.
Sequence: The precise order of nucleotides in a DNA or RNA strand.
Splicing: A biological process where non-coding regions (introns) are removed from a precursor messenger RNA (pre-mRNA) transcript, and the coding regions (exons) are joined together to form a mature messenger RNA (mRNA) that can be translated into proteins.
Transcription: The process of copying a DNA sequence into RNA.
Translation: The process of converting RNA sequences into proteins.
tRNA: Transfer RNA is a type of RNA molecule that helps decode messenger RNA (mRNA) sequences into proteins.
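The glossary entries for transcription, splicing, and translation describe three consecutive steps, which can be tied together in a short sketch. The toy gene, its exon coordinates, and all function names below are hypothetical. The sketch assumes the standard genetic code, takes the coding strand as input, and starts translation at the first AUG.

```python
from itertools import product

# Standard genetic code in the usual T, C, A, G ordering ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """Copy the DNA coding strand into RNA (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def splice(pre_mrna: str, exons) -> str:
    """Join the exon segments, given as 0-based (start, end) pairs, end-exclusive."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna: str) -> str:
    """Translate from the first AUG up to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical toy gene: exon 1, an intron (GT...AG), exon 2.
gene = "ATGGCC" + "GTAAGTTTTCAG" + "AAATGA"
exons = [(0, 6), (18, 24)]

pre_mrna = transcribe(gene)
mature_mrna = splice(pre_mrna, exons)
print(mature_mrna)             # AUGGCCAAAUGA
print(translate(mature_mrna))  # MAK
```

Translating `pre_mrna` directly, without splicing, gives a different peptide, which illustrates why intron removal has to precede translation.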

References¶

Clark, M. A., Douglas, M., & Choi, J. (2018). 3.5 Nucleic Acids. In Biology 2e. OpenStax. https://openstax.org/books/biology-2e/pages/3-5-nucleic-acids
OpenStax College. (2013). DNA Nucleotides. https://commons.wikimedia.org/wiki/File:DNA_Nucleotides.jpg
Ball, M. P. (2013). DNA replication split. https://commons.wikimedia.org/wiki/File:DNA_replication_split.svg
Koch, L. (2009). Semiconservative replication. https://commons.wikimedia.org/wiki/File:Semiconservative_replication.png
UC Museum of Paleontology. (2020). The causes of mutations. https://evolution.berkeley.edu/evolution-101/mechanisms-the-processes-of-evolution/the-causes-of-mutations/
Ridder, D. de, Kupczok, A., Holmer, R., Bakker, F., Hooft, J. van der, Risse, J., Navarro, J., & Sardjoe, T. (2024). Self-created figure.
miguelferig. (2011). Intron miguelferig. https://commons.wikimedia.org/wiki/File:Intron_miguelferig.jpg
Greenwood, S. (2018). Genetic Code. https://commons.wikimedia.org/wiki/File:Genetic_Code.png
Squidonius. (2008). Molbio-Header. https://commons.wikimedia.org/wiki/File:Molbio-Header.svg
Clark, M. A., Douglas, M., & Choi, J. (2018). 3.4 Proteins. In Biology 2e. OpenStax. https://openstax.org/books/biology-2e/pages/3-4-proteins
LadyofHats. (2008). Main protein structure levels en. https://commons.wikimedia.org/wiki/File:Main_protein_structure_levels_en.svg
OpenStax College. (n.d.). Protein Structure. https://openstax.org/books/microbiology/pages/7-4-proteins#OSC_Microbio_07_04_secondary
Laskowski, R. A., MacArthur, M. W., Moss, D. S., & Thornton, J. M. (1993). PROCHECK: a program to check the stereochemical quality of protein structures. Journal of Applied Crystallography, 26(2), 283–291. https://doi.org/10.1107/S0021889892009944
Sangrador, A. (2023). What are protein domains? https://www.ebi.ac.uk/training/online/courses/protein-classification-intro-ebi-resources/protein-classification/what-are-protein-domains/
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., & Bourne, P. E. (2000). The Protein Data Bank. Nucleic Acids Research, 28(1), 235–242. https://doi.org/10.1093/nar/28.1.235

