Skip to article frontmatterSkip to article content

1. Genes, Proteins, Databases, Genome annotation

Authors
Affiliations
WUR Bioinformatics
WUR Bioinformatics
WUR Bioinformatics
📘 This content is part of version: v1.0.0 (Major release)

In this chapter your will learn the basics of molecular biology that are required for understanding bioinformatics approaches. In addition you will learn common approaches for storing and describing biomolecular data.

Biological background

A large part of bioinformatics deals with the analysis of biological Sequences. These sequences originate from organic macromolecules that play important roles in cells. In the first section of this chapter, we describe these macromolecules, their sequences, and the biological processes involved in generating their active structures and maintaining these.

As such, this section provides important background material for the entire course. Depending on your background, parts of this section might seem redundant, in which case this section can function as a refresher. Later chapters assume you are familiar with this section.


Nucleic acids

Deoxyribonucleic acid (DNA) carries the genetic information of organisms. Ribonucleic acid (RNA) is involved in the Protein expression and is also the genetic material of some viruses. Thus, these molecules are highly important as the basis of life on Earth. The Genome denotes the Cell’s entire genetic content and genomics is the study of genomes.

DNA and RNA are comprised of monomers called Nucleotides, which are comprised of three components (Figure 1):

Nucleotide overview

Figure 1:The components of a nucleotide. Credits: CC BY 4.0 Clark et al. (2018).

The bases can be divided into two categories: purines (with a double ring structure) and pyrimidines (with a single ring structure) (Figure 1). DNA contains A, T, C, and G; whereas RNA contains A, U, C, and G.


The DNA double helix

The DNA molecule is a polymer of deoxyribonucleotides and forms a right-handed double helix. The sugar and phosphate are on the outside forming the helix’s backbone and the bases are stacked in the interior and bind each other by hydrogen bonds. Thereby A pairs with T via two hydrogen bonds and C pairs with G via three hydrogen bonds, they are complementary bases. These pairings are also called Watson-Crick base-pairing, named after the discoverers of DNA.

DNA structure

Figure 2:The DNA structure. Credits: CC BY 3.0 OpenStax College (2013).

The two strands of the helix run in opposite directions, also called anti-parallel, i.e., one goes from 5’ to 3’ and the other from 3’ to 5’ (Figure 2). The nucleotide sequence is typically written in 5’ to 3’ direction. Due to the complementarity, the base sequence of a strand can be deduced from the base sequence from the other strand. This is called the reverse complement. For example, the reverse complement of AAGT is ACTT, where both strands are given in 5’ to 3’ direction.


DNA replication

As the two DNA strands are only connected via hydrogen bonds, they can be separated relatively easily, for example during DNA replication (Figure 3). The separated strands each serve as a template on which a new complementary strand is synthesized by the enzyme DNA polymerase in 5’ to 3’ direction. This mode of replication is called semiconservative.

Replication

Figure 3:A) The process of DNA replication. Credits: CC0 1.0 Ball (2013). B) Semiconservative DNA replication, where the two copies each contain one original strand and one new strand. Credits: CC0 1.0 modified from Koch (2009).

The error rate of DNA replication is remarkably low, about one erroneous base in 109 bases. This property preserves the genetic information during cell division, and also over generations. It also leads to mutations over evolutionary time (Figure 4), as we will see later (Substitutions).

DNA_mutation

Figure 4:A DNA mutation that occurs during replication. Credits: BY-NC-SA 4.0 UC Museum of Paleontology (2020).


RNA, transcription, and splicing

During Transcription, RNA polymerase reads the template strand (also called noncoding strand) in the 3’ to 5’ direction (Figure 5). This produces an RNA molecule from 5’ to 3’, which is a copy of the coding strand. During transcription thymine is replaced by uracil. In contrast to DNA, RNA does not form a stable double helix. RNA is mainly single stranded, but most RNAs show intramolecular base pairing between complementary bases.

There are four major types of RNA:

Transcription

Figure 5:RNA is produced by transcribing DNA: as such, it is a direct copy of the information contained in the DNA. Where DNA contains thymine (T, indicated in blue), RNA contains uracil (U, indicated in purple). Credits: CC BY-NC 4.0 Ridder et al. (2024).

In eukaryotes, precursor mRNA molecules undergo various postprocessing steps to produce mature mRNA molecules. To stabilize the mRNA, the 5’ end of the molecule is capped with a modified guanine nucleotide (more specifically, a 7-methylguanylate) and the 3’ end is extended with a long stretch of adenine nucleotides (known as poly-adenylation). In addition, many eukaryotic mRNA molecules undergo Splicing. During RNA splicing, the spliceosome protein complex removes introns: specific non-coding parts of an mRNA molecule that are not used during translation (Figure 6), to create mature mRNA. Most introns are characterized by a GU and AG dinucleotide motif in the 5’ and 3’ end respectively.

Splicing

Figure 6:During splicing, introns are removed from precursor mRNA moleculus to create mature mRNA. Most introns contain recognition sequences for the spliceosome and produce specific secondary structures that improve splicing efficiency: (1) 3’ splice site, (2) poly pyrimidine tract, (3) branch site, (4) 5’ splice site’. Credits: CC0 1.0 miguelferig (2011).


Translation

During protein Translation, ribosomes synthesize polypeptides from messenger RNA (mRNA) (Figure 7). During this process tRNAs decode the information on the RNA into amino acids, where a codon consisting of three nucleotides encodes the information for one amino acid.

Translation

Figure 7:The translation process, where ribosomes with tRNA molecules “read” codons on the mRNA using anticodons, which then get translated into their corresponding amino acids. These amino acids are linked together by peptide bonds to form a polypeptide chain. Credits: CC BY-NC 4.0 Ridder et al. (2024).

The genetic code

The genetic code shows the correspondence between codons and amino acids (Figure 8). Since 64 possible codons code for 20 different amino acids, the genetic code is degenerate, i.e., most amino acids are specified by more than one codon. Thus, the codons encoding one particular amino acid may differ in one or two of their positions. You can notice in Figure 8 that the third codon position often differs between codons for the same amino acid. As a result of the code degeneracy, the protein sequence can be deduced from the DNA or RNA sequence but not vice versa.

There are three codons that do not encode for an amino acid, but instead signal the end of the protein sequence, called stop codons. Furthermore, translation generally starts with the start codon AUG encoding methionine. More information of how protein information is encoded in genomes can be found in the section on genome annotation.

The genetic code

Figure 8:The universal genetic code. Note that exceptions to this code exist, for example the vertebrate mitochondrial code. Credits: CC BY 4.0 Greenwood (2018).

The central dogma of molecular biology

According to the central dogma of molecular biology, the flow of genetic information is essentially in one direction: from DNA via RNA to proteins (Figure 9). Nevertheless, there are also genes that do not code for proteins, but where functional RNA is the end product. Furthermore, mobile genetic elements and viruses can encode reverse transcriptases (which can synthesize DNA from an RNA template) or RNA dependent RNA polymerases (which can replicate RNA).

Central dogma

Figure 9:The central dogma of molecular biology. Credits: CC0 1.0 modified from Squidonius (2008).

Proteins

Proteins are large, complex macromolecules that play many important roles in the body. They are critical to most of the work done by cells and are required for the structure, function and regulation of the body’s tissues and organs. The basic building blocks of proteins are amino acids.


Amino acids

An amino acid contains a central carbon atom (called α-carbon, or Cα) (Figure 10). The α-carbon is bound to an amino group (NH2), a carboxyl group (COOH), and a hydrogen atom. In addition, each amino acid has a specific residue (R) group.

Structure of an amino acid

Figure 10:The structure of an amino acid. Four elements are connected to the α-carbon: an amino group, a hydrogen atom, a carboxyl group, and a side chain (R group). Credits: CC BY 4.0 Clark et al. (2018).

Table 1:Amino acids and their abbreviations and basic properties

Amino acidThree-letter codeOne-letter codeProperty
ArginineArgRPositively charged
HistidineHisHPositively charged
LysineLysKPositively charged
Aspartic acidAspDNegatively charged
Glutamic acidGluENegatively charged
SerineSerSPolar uncharged
ThreonineThrTPolar uncharged
AsparagineAsnNPolar uncharged
GlutamineGlnQPolar uncharged
AlanineAlaAHydrophobic
ValineValVHydrophobic
IsoleucineIleIHydrophobic
LeucineLeuLHydrophobic
MethionineMetMHydrophobic
PhenylalaninePheFHydrophobic and aromatic
TyrosineTyrYHydrophobic and aromatic
TrypotophanTrpWHydrophobic and aromatic
GlycineGlyGSpecial (only H as side chain)
ProlineProPSpecial (side chain bound to backbone nitrogen)
CysteineCysCSpecial (forms disulfide bonds)

Some amino acids have non-polar side chains, and these are generally hydrophobic, i.e., water molecules cannot form hydrogen bonds with these molecules. Thus, they can often be found in the interior of proteins together with other hydrophobic amino acids. Aromatic amino acids contain aromatic rings, and often stabilize folded protein structures.

In contrast, the charged and the polar amino acids are hydrophilic, i.e., water molecules can form hydrogen bonds with these molecules. They can often be found on the surface of proteins or in the interior, when they can interact with another oppositely charged amino acid. Positively charged amino acids, are also called basic amino acids and negatively charged amino acids are also called acidic amino acids.

Although amino acids can be classified into these groups based on their properties, some amino acids stand out. The smallest amino acid is glycine, which provides great flexibility due to its small size. In contrast, proline is an amino acid, where the side chain is bonded to the backbone nitrogen atom, which makes it very rigid. Finally, one cysteine amino acid can form a disulfide bridge with another cysteine.


Protein structure

A protein is made up of one or more long, folded chains of amino acids (each called a polypeptide). The 3D structure of a protein is also called its conformation. The protein conformation is described on four levels - primary to quaternary structure (Figure 11).

The four levels of protein structure

Figure 11:The four levels of protein structure. Credits: CC0 1.0 LadyofHats (2008).

The structure of a protein is critical for its function. For example, in an enzyme, the active site must be in the correct structure to be able to bind the substrate. Other proteins might bind proteins (and influence their activity) or bind DNA (and regulate gene expression). Additionally, some proteins are secreted from the cell or might function within the cell membrane. Finally, proteins are often modified after protein synthesis (see Translation), called post-translational modification. These modifications can be important for protein function.


Primary structure

In a protein, amino acids are connected by covalent bonds, called peptide bonds. A peptide bond connects one amino acid’s carboxyl group and the next amino acid’s amino group (Figure 12). The sequence of amino acids linked by peptide bonds is called the primary structure. The protein sequence is determined by the gene sequence encoding the protein. The continuous chain of atoms along the protein is also called the backbone, it consists of the three backbone atoms (nitrogen, Cα, carbon).

Peptide bond

Figure 12:A peptide bond connecting two amino acids. Credits: CC BY 4.0 Clark et al. (2018).

Each protein has a free amino group on one end, called the N terminus. The other end has a free carboxyl group, called the C terminus.


Secondary structure

Secondary structures are local conformations in the protein that are stabilized by hydrogen bonds between backbone atoms. We distinguish the regular helices (i.e., alpha helix - α-helix) and sheet structures (i.e., beta sheet - β-sheet) (Figure 13) and irregular turns.

α-helices are stabilized by hydrogen bonds between the oxygen atom in the C group in one amino acid, and the hydrogen in the N group of the amino acids that is four amino acids farther along the chain. Every helical turn has 3.6 amino acids residues and the side chains stick out of the helix.

β-pleated sheets (short: β-sheets) consist of β-strands, where the R groups extend above and below the strands. The strands have a direction determined by the N- and C-terminus of the protein and are usually depicted as an arrow pointing towards the C-terminus. Depending on the direction, strands can align parallel or antiparallel to each other.

Secondary structure elements

Figure 13:α-helices and β-sheets are stablized by hydrogen bonds (the dotted lines) between the backbone of proteins, i.e., the side chains are not involved. The hydrogen bonds form between the oxygen atom in the C group in one amino acid and the hydrogen in the N group. Credits: CC BY 4.0 modified from OpenStax College (n.d.).

Turns are short secondary structure elements that are stabilized by hydrogen bonds between amino acids that are 1 to 5 peptide bonds away. The most common form are β-turns, which connect antiparallel β-strands.

The peptide bond is very rigid and planar, i.e., it cannot rotate to form the elements of protein structure. However, the N-Cα and the Cα-C bonds can freely rotate, being only limited by the size and properties of the R-groups. The 3D shape of the polypeptide backbone is thus determined by two torsion angles: phi (φ) between N and Cα and psi (ψ) between Cα and C (Figure 14A). Although φ and ψ can rotate in principle, steric hindrance prevents certain combinations of angles, i.e., the bulkiness of the R-groups restricts the possible conformations. Thus, certain combinations of φ and ψ are preferred. We can plot the combinations of φ and ψ in a protein, in a so-called Ramachandran plot (Figure 14B).

The regular secondary structure elements (α-helix and β-sheet) contain consecutive amino acids with similar (φ,ψ) values. These regions are typically highly populated in a Ramachandran plot. Thus, the Ramachandran plot can be used to assess how plausible a predicted protein structure is.

Phi, psi, and Ramachandran plot

Figure 14:A) The φ, and ψ torsion angles of a polypeptide chain. Credits: CC BY-NC 4.0 Ridder et al. (2024). B) A typical Ramachandran plot. The red regions marked do not have any steric hindrance, yellow areas represent conformations that have steric hindrance, light yellow areas represent conformations that are generally sterically unfavorable, and white areas do not have any allowed conformations. Credits: Ramachandran plot modified from PROCHECK Laskowski et al. (1993).


Tertiary structure

The tertiary structure of a protein describes the complete folding of an entire polypeptide chain. In contrast to the secondary structure, the tertiary structure of a protein involves interactions between the amino acid’s side chains that can occur at short-range and long-range (Figure 15). Thus, the chemical properties of the amino acids are very important for the tertiary structure. Different types of interactions stabilize the tertiary structure:

Tertiary structure interactions

Figure 15:Chemical interactions that stabilize the tertiary structure of proteins. Credits: CC BY 4.0 Clark et al. (2018).

When studying many different protein structures, various reoccurring substructures can be observed. These so-called Domains are distinct functional and/or structural units in a protein and are typically 50 to 350 amino acids long. Usually, a domain is responsible for a particular function or interaction, contributing to the overall role of a protein. A domain can exist in different contexts with other domains (Figure 16). In a multidomain protein, each domain folds independently of the others.

Domain examples

Figure 16:A) Example of an Src homology 3 (SH3) domain that is involved in protein-protein interaction. SH3 domains occur in a diverse range of proteins with different functions. B) The cytoplasmic protein Nck contains multiple SH3 domains. C) Domain composition of phospholipase D1, which has multiple functional domains that contribute to its overall function. Credits: CC BY 4.0 Sangrador (2023).


Quaternary structure

Finally, individual folded polypeptides can interact to form protein complexes, also called quaternary structures. The quaternary structure is stabilized by the same types of interactions as the tertiary structure. The difference is that the amino acids involved belong to different polypeptides.

Many functional proteins are composed of multiple subunits, they are also called oligomers (Figure 17). The subunits can originate from the same protein sequence (called a homomer) or from different sequences (called a heteromer). Proteins consisting of two subunits are also called dimer.

Examples of oligomers

Figure 17:Examples of oligomers. A) Myoglobin, a heteromer of four subunits (PDB structure 1HV4 colored by chain). Credits: Berman et al. (2000)Liang et al. (2001)Rose et al. (2018). B) UDP-galactose 4-epimerase, a homodimer (PDB structure 1EK5 colored by secondary structure). Credits: Berman et al. (2000)Thoden et al. (2000)Rose et al. (2018).


Substitutions

Mutations in the gene sequence can lead to changes in the primary structure of the protein, e.g., a substitution of one amino acid by a different one. Often, such substitutions still lead to highly similar protein structures that perform a similar or even the same function, especially when the exchanged amino acids have similar chemical properties. Nevertheless, single amino acid substitutions can have severe consequences. A prominent example is sickle cell anemia, where a substitution of glutamic acid to valine in hemoglobin β results in a structural change that leads to a distortion in red blood cells (Figure 18).

Sickle cell anemia

Figure 18:Consequences of a substitution in hemoglobin β resulting in sickle cell anemia. Credits: Rao, A., Tag, A. Ryan, K. and Fletcher, S. Department of Biology, Texas A&M University.


Visualization

There are many styles to view protein molecular structures. Some styles focus on detailed chemical structure, others are targeted at the protein surface. For some examples see Figure 19.

Protein representation

Figure 19:Different representations of the PDB structure 5PEP generated with NGL. Credits: Berman et al. (2000)Cooper et al. (1990)Rose et al. (2018).


Genome annotation

Annotation of genomes is the process of deciphering what information is encoded in an organism’s DNA. It is an ongoing effort in organisms with known genome sequences. Even moreso, genome annotation is a critical step in acquiring biological insights from newly sequenced genomes. Given the large size of any genome, automated procedures are used to identify various genomic elements such as genes, regulatory regions, transposable elements, or other non-coding elements. Each of these bioinformatic procedures typically focuses on identifying one type of element, and as such a complete genome annotation project can be thought of as a pipeline of various procedures. The following section describes the most common steps in genome annotation.

Repeat masking

Repeat masking involves the identification and masking (hiding) of repetitive sequences within a genome. It is an essential first step in annotating most genomes because repetitive sequences can pose significant challenges in genome annotation. Masking repeats generally improves:

Most repeat masking workflows work by first compiling (or using a precompiled) ‘repeat library’: a collection of known repetitive elements that have previously been characterized. Subsequently, the genome to be annotated is compared against this repeat library using various computational algorithms, such as (specifically configured versions of) BLAST or RepeatMasker. When a match is found, the corresponding region in the genome is ‘masked’ or annotated as a repetitive element. This means that these regions are excluded from further analysis or labeled as repetitive.

Gene prediction

The process of finding protein coding genes differs between prokaryotic and eukaryotic genomes. In both cases the aim is to find open reading frames (ORFs): contiguous stretches of nucleotides that encode proteins. More specifically, an ORF starts with a start codon, ends with a stop codon, and it’s length is a multiple of three (Refer to the genetic code in Figure 8). Since RNA splicing (Figure 6) is almost absent in prokaryotic genomes, prokaryotic ORFs can be found directly in the genomic DNA. As a result, simply enumerating all possible ORFs in a genome is a common step in prokaryotic genome annotation. In contrast, ORFs in eukaryotic genomes are found on mature mRNAs. As such, all eukaryotic gene prediction methods take splicing into account, thereby greatly increasing their computational complexity. Both prokaryotic and eukaryotic gene prediction typically can be classified as either evidence based prediction or ab initio prediction, both will be explained below.

Evidence based prediction

This data-driven approach uses existing and newly generated data to get hints on what regions of a genome encode genes. Depending on the type of data, these predictions have more or less predictive power. Some commonly used evidence types are:

Ab initio prediction

Ab initio (latin): from first principles, from the beginning

These methods rely on statistics to learn a predictive model from a known annotated genome. Various forms of ab initio models exist, and whereas implementation details differ, most follow a similar line of reasoning. For now, we will stick to a high level description. All ab initio models scan through a DNA sequence and at each position give a score for a specific type of annotation. In addition, they often take the genomic context of a specific position into account. For example, the probability of a protein-coding annotation on a nucleotide A is high when the next two observed nucleotides are T and G, producing the ATG start-codon methionine. In addition, most methods also take the predicted annotation of the genomic context into account. For example: the probibility that ATG actually codes for a start codon is much higher if we can also predict an in-frame stop codon. In eukaryotic genome prediction these models become quite complex because they have to include splice sites in all three reading frames. How exactly a model decides what annotation score to give to which nucleotide is part of the model architecture and parameterization. In all cases, the model parameters are chosen to accurately reproduce a known genome annotation. If sufficient data is used to learn the model parameters, it is assumed that these models can be used to predict annotations on novel genome sequences. Like homology-based prediction, this model-based approach works best for closely related organisms. In the past, almost all ab initio prediction methods were formulated as hidden Markov models (HMMs) (see Note 1.5). Examples of tools implementing HMM based ab initio prediction are SNAP, GeneMark, and Augustus. With the availability of more high quality data (genome sequences and accompanying annotations), approaches based on deep learning and generative AI have proven to frequently perform better than HMM based approaches.

Evidence/prediction integration

From the previous sections it has now become clear there are several ways of predicting what the genes in a genome look like. Since these various approaches almost never agree exactly in their predictions, a final step in genome annotation is evidence and prediction integration. Typically a weighted consensus approach is used: each individual source of evidence is given a weight representing how much it should influence the final decision, after which a majority vote decides what the annotation should look like. Typically RNA-seq evidence gets a high weight, and various forms of homology evidence can be weighted depending on how closely related they are to the genome of interest.

Functional annotation

So far, all described steps in the genome annotation process have dealt with what genes look like on a structural level. To gain biological insight, the next step is to assign functional annotations to the predicted genes. This functional annotation step consists of using various sequence alignment and search tools to find sequences with a known function/description and to transfer the information of the known gene to the predicted gene. Several databases of high-quality known functions are often used, which are described in more detail in the next section of this chapter. In Chapter 2 we will learn about approaches how to search these databases efficiently.


Databases

Introduction

Databases are at the core of bioinformatics. In all analyses, we integrate pre-existing data and we need to access this data. The journal Nucleic Acids Research publishes an entire issue in the beginning of each year on new and updated databases. The list of these databases can also be accessed online.

Computer scientists have developed different kinds of databases. One example are relational databases, which can be queried by SQL (structured query language) and which perform well for data that is processed computationally. Another example are XML (extended markup language) databases which store data in specified well-structured XML files. Nevertheless, most databases for biological sequence data use flat file databases, where the data is saved in structured text files. This data can be manipulated in a text editor without requiring an additional program for database management, and they can be easily exchanged between scientists. On the downside, searching them has a lower performance. This is why they are often indexed, i.e., they contain an index of keywords, similar to a glossary in a book.

Depending on the kind of data included, we distinguish different kinds of biological databases:

Each entry in a database has a unique accession number. This number is permanent and provides an unambiguous way to link to the entry. The information that the accession refers to should not change. To still allow updates to an entry, the accession number can contain a version, usually after a dot. For example, NC_003070.9 is the latest version (version 9) for Arabidopsis thaliana chromosome 1 in RefSeq.

Database entries often link to each other via cross links.


GenBank

GenBank is a popular primary database for nucleotide sequences and is based at the NCBI (National Center for Biotechnology Information). A GenBank release usually occurs every two months and the most recent release from the 15th of December 2023 contains ~250 million sequences and additionally ~3.7 billion WGS (whole genome shotgun) records. The latter are genome assemblies or genomes that were not yet completed. The complete database is available for download via FTP, but the most convenient way to access individual entries is via the search on the GenBank website (Figure 23).

Genbank website

Figure 23:A screenshot of the GenBank website. Credits: Benson et al. (2012).

Since data is directly submitted to GenBank, the information for some loci can be highly redundant. The sequence records are owned by the original submitter and cannot be altered by someone else.

Genbank is part of the INSDC (International Nucleotide Sequence Database Collaboration). The other two member databases are ENA (European Nucleotide Archive) and DDBJ (DNA Data Bank of Japan). The data submitted to either database is exchanged daily, so all databases contain essentially the same information.


RefSeq

The Reference Sequence (RefSeq) collection is also hosted at NCBI and contains genomic DNA, transcripts, and proteins. The aim of RefSeq is to provide non-redundant, curated data. RefSeq genomes are copies of selected assembled genomes in GenBank. Additionally, transcript and protein records are generated by several processes:

The RefSeq accessions directly provide information on molecule types. For example, NC_ accessions denote complete genomes, NP_ accessions denote proteins in one genome, and WP_ accessions denote proteins in multiple genomes.


UniProt

There is lots of information available for proteins, such as sequence information, domains, expression, or 3D structure. The aim of the Universal Protein Resource (UniProt) is to provide a comprehensive resource for proteins and their annotation. UniProt contains three databases (Figure 24):

Overview of UniProt

Figure 24:The information flow in Uniprot. Credits: CC BY-NC-ND 4.0 Leon & Pastor (2021).

UniProtKB is the central hub for functional information on proteins. For each protein it contains the core data (such as sequence, name, description, taxonomy, citation) and as much annotation information as possible. It contains many cross-references to other databases and is generally a very good starting point to find information on a protein.

UniProtKB consists of two sections:

UniProtKB is updated every 8 weeks. The current release has ~570,000 entries in Swiss-Prot and ~250 million entries in TrEMBL.


Prosite

Prosite is a secondary database of protein domains, families, and functional sites. Some regions in protein families are more conserved than others because they are important for the structure or function of the protein. Prosite contains motifs and profiles specific for many protein families or domains. Searching motifs in new proteins can provide a first hint for protein function.

The current release of Prosite from the 18th of June 2025 contains 1311 patterns, 1403 profiles, and 1421 ProRule entries.

A Prosite pattern is typically 10 to 20 amino acids in length. These short patterns are usually located in short well-conserved regions, such as catalytic sites in enzymes or binding sites. A pattern is represented as a regular expression, where amino acids are separated by hyphens and x denotes any letter. Repetitions can also be given as the number of repetitions in brackets. For example, [AC]-x-V-x(4)-{ED} matches sequences that contain the following amino acid sequence: (Alanine or Cysteine)-any-Valine-any-any-any-any-(any but Glutamic acid or Aspartic acid). Note that this representation is qualitative, a sequence either matches a pattern or it does not.

Patterns cannot deal with mismatches and are limited to exact matches to the pattern. Thus, they are not well suited to identify distant homologs. A Prosite profile is more general than a pattern and can also detect poorly conserved domains or families. They characterize protein domains over their entire length and do not just model the conserved parts. Profiles are estimated from multiple sequence alignments and we learn more about them in Chapter 2. For now, it is important to know that profiles model matches, insertions, and deletions. Importantly, profiles are quantitative representations, they will return a score how well the sequence fits to the profile. A threshold can be applied to get high-scoring profiles for a sequence. In contrast to patterns, a mismatch to a profile can be accepted if the rest of the sequence is highly similar to the profile. Profiles are well suited to model structure properties of a domain.

Notably, profiles cover the structural relationships of domains, but they might also score a sequence highly that lacks important functional residues. To include that information, ProRule contains additional information about Prosite profiles, such as the position of structurally or functionally important amino acids. ProRule is used to guide curated annotation of UniProtKB/Swiss-Prot.

InterPro

The Integrated Resource of Protein Families, Domains and Sites (InterPro) integrates 13 member databases (including Prosite and Pfam) into a comprehensive secondary database. Additionally, it provides annotation from other tools, for example to annotate signal peptides and transmembrane regions. It allows to identify functionally important domains and conserved sites in a sequence by simultaneously annotating it using the member databases. Interpro can be used to find out which protein family a sequence belongs to, or what its putative function is. Additionally, one InterPro entry can integrate entries from the member databases, if they represent the same biological entity, reducing redundancy. InterPro entries are also linked to Gene Ontology. They are curated before being released.

InterPro is updated every 8 weeks. The current release from the 19th of June 2025 contains ~49,000 entries, which represent different types:

As an example, look at the InterPro entry for the type 2 malate dehydrogenase protein family. The entry has a name (malate dehydrogenase, type 2) and accession (IPR010945). The contributing entries in member databases are shown on the right-hand side, with links to the individual member database entries. A descriptive abstract explains what these proteins are and what their function is. A set of GO terms is also provided, which describe the characteristics of the proteins matched by the entry.

You can get the InterPro annotation for a protein by running a new sequence search (Figure 25), or by by looking up its UniProt accession (Figure 26).

Browsing interpro

Figure 26:Browse the annotated proteins in Interpro and search for a UniProt accession. See resulting entry in (Figure 27). Credits: Paysan-Lafosse et al. (2022).

A UniProt entry in Interpro

Figure 27:The result page when looking up UniProt accession A0A076FRI5 in InterPro. You can see the family and domain annotation and on the right the accessions in InterPro and in the member databases. You can click on each of these accessions to get to the entry information. Credits: Paysan-Lafosse et al. (2022).

You may have noticed a colored letter before each InterPro accession, e.g., F before IPR011835 or D before IPR001296 (Figure 27). These icons denote the different InterPro entry types:

Interpro types

Figure 28:The icons for the different InterPro entries (homologous superfamily, family, domain, repeat or site). Credits: CC BY-SA 4.0 Mitchell (2020).

Pfam

Pfam is an important resource for protein domains. In Pfam, domains are classified according to profiles that are modelled as Hidden Markov models (HMMs). We will learn more on HMMs in Chapter 2. Pfam is now integrated in InterPro. Each Pfam domain can be represented with a logo, where the amino acids occurring more frequently at a particular position are represented as larger letters (Figure 29).

Pfam profile

Figure 29:The Pfam logo for PF12924. Credits: Paysan-Lafosse et al. (2022).


File formats

There are many different formats for biological data. A format is a set of rules about the contents and organization of the data. You should be familiar with a couple of common data formats in bioinformatics (See Table 2), which you will experience in the practicals.

Table 2:Examples of common data formats in bioinformatics. Unless explicitly noted these are plain text formats.

File formatUsageCommon extension
FASTANucleotide or amino acid sequences.fa, .fasta, .fna, .faa
GenbankSequences, annotations, metadata.gb
Generic Feature FormatSequence annotations.gff
FASTQDNA sequencing data including basecalling quality scores.fq
SAM/BAM (Sequence/Binary Alignment and Map)Typically alignments between raw sequencing data and a reference, with quality scores.sam (.bam for the binary file format)
VCF (Variant Call Format)(Genetic) variant calls based on sequencing data aligned to a reference.vcf (.bcf for the binary file format)
PDBProtein structure data.pdb
Unstructured textMiscellaneous.txt
Plain text files

Many of the biological data formats are plain text files: they only contain letters, numbers, and symbols, but no formatting, such as font size or colors. Whereas these filetypes can have various different extensions (e.g. .fasta, .gff, etc.), they only contain plain text. The advantage of plain text files is that they can be opened with any text editor on any computer. Plain text differs from rich text format, where the latter can also include formatting. Many bioinformatics programs expect plain text files as input. Thus, when creating them on your computer, take care to save in this format, and not for example in rtf or word.

On a Windows computer, plain text files can for example be created with the Notepad program (Figure 30).

Notepad editor

Figure 30:A screenshot of Notepad on Windows. Credits: CC BY-NC 4.0 Ridder et al. (2024).

On a Mac, plain text files can for example be created with the TextEdit program (Figure 31). Take care to set the settings to plain text.

TextEdit editor

Figure 31:A screenshot of TextEdit on Mac. Credits: CC BY-NC 4.0 Ridder et al. (2024).

There are some important file formats in bioinformatics. A fasta file stores a DNA or protein sequence (Figure 32). Information on the sequence is found in the header (starting with >), which is on one line and the sequence can go over multiple lines. A multi-fasta file stores multiple sequences.

Fasta file

Figure 32:A sequence in fasta format. Credits: CC BY-NC 4.0 Ridder et al. (2024).

The GenBank file format is a popular format to represent genes or genomes. Here you can find an example GenBank record with annotations. Important elements are the Locus, Definition (i.e., the name), and the Organism. Additionally, Features, such as genes and CDSs (coding sequences) are listed.

Binary files

Binary files are all the files that are not text files, they cannot be opened in a text editor. Instead, they need special programs to write and to open and interpret them. Examples are word files (.docx) which can be opened with Word, pdf files (.pdf) which can be opened with Acrobat Reader, or image files (e.g., .png) which can be opened with image viewers.

Binary files are are also sometimes used in bioinformatics. Examples include the bam format, which is a binary version of the sam format or the gzip format. Gzip is used for compressing text files without the loss of information. For large files, lots of disk space can be saved this way.


Ontologies

An ontology is a comprehensive and structured vocabulary for a particular domain, such as biology, genetics, or medicine. It defines the various terms used in a domain, along with their meanings and interconnections. As such, ontologies serve as standardized frameworks for organizing and categorizing information in a way that enables effective communication and reasoning among researchers, practitioners, and computer systems. For example, the terms in an ontology can encompass biological entities like genes, proteins, and cells, as well as processes, functions, and interactions that occur within living organisms. Most of the databases mentioned mentioned in this chapter use ontologies in some way to describe their data.

Ontologies play a crucial role in bioinformatics because they facilitate:

  1. Standardization and consistency: ontologies provide a common language and consistent framework for researchers and professionals, ensuring that everyone understands and uses terms in the same way.
  2. Interoperability: ontologies facilitate the sharing and integration of data and knowledge across different research groups, institutions, and databases. They enable computer systems to process data more accurately, leading to more meaningful analyses and discoveries.
  3. Scientific reasoning: by organizing information in a logical and structured way, ontologies help researchers generate hypotheses, design experiments, and validate findings more effectively.

Ontologies typically form a hierarchy, where specific terms point to more generic terms. More generally, most ontologies are represented as a graph, where ontology terms are the nodes and relationships between terms are edges. As such, one ontology term may have more than one parent term. A variety of ontologies are frequently used in the life sciences, some of which are discussed in greater detail below.

Gene Ontology

The Gene Ontology (GO) is a knowledgebase for the function of genes and gene products (e.g. proteins). It is organised into three different domains covering various aspects:

A good example of how ontologies are represented as graphs is the biological process hexose biosynthetic process, which has two parents: hexose metabolic process and monosaccharide biosynthetic process. This reflects that biosynthetic process is a subtype of metabolic process and a hexose is a subtype of monosaccharide. (Figure 33).

Edges between GO terms in the GO hierarchy can represent various relationships between genes and gene products. The four main relationship types used in the gene ontology are ‘is a’, ‘part of’, ‘has part’, and ‘regulates’ (see Figure 34).

Go hierarchy

Figure 33:An extract of the Gene Ontology hierarchy. Credits: Binns et al. (2009)

Sequence Ontology

The Sequence Ontology (SO) describes biological sequence elements such as genes or repeats, along with their features and attributes.

The sequence ontology is organized on four main levels:

SO hierarchy example

Figure 34:An extract of the Sequence Ontology hierarchy. Credits: Eilbeck et al. (2005).

Other ontologies

Many more ontologies exist and are relevant to biomedical research. The European Bioinformatics Institure (EBI) provides an ontology lookup service that facilitates searching for ontologies. Examples of other ontologies are the plant ontology that describes various anatomical structures in plants, and the human disease ontology.


Practical assignments

This practical contains questions and exercises to help you process the study materials of Chapter 1. You have 2 mornings to work your way through the exercises. In a single session you should aim to get about halfway through this guide (i.e., day 1: assignment 1-3, day 2: assignment 4 and project preparation exercise). Use the time indication to make sure that you do not get stuck in one assignment. These practical exercises offer you the best preparation for the project. Especially the project preparation exercise at the end is a good reflection of the level that is required to write a good project report. Make sure that you develop your practical skills now, in order to apply them during the project.

Note, the answers will be made available after the practical!

Glossary

Annotation
The process of identifying functional elements within a genomic sequence, such as genes, coding regions, and regulatory motifs.
Cell
The basic structural and functional unit of all living organisms.
DNA
DeoxyriboNucleic Acid
Exon
A DNA segment in a gene that encodes part of the mature messenger RNA (mRNA) after intron removal.
Gene
A segment of DNA that encodes functional products, typically proteins.
Genome
The complete set of genes or genetic material present in a cell or organism.
Genome browser
A tool for visually inspecting genomic regions, annotations, and experimental data tracks.
HMM
Hidden Markov Model - a statistical model that represents systems where the states are not directly observable (hidden) but can be inferred from observable data.
Intron
A DNA segment in a gene that is not expressed in the mature messenger RNA (mRNA) product and is removed during the RNA splicing process.
mRNA
Mature mRNA, or mature messenger RNA, is a processed form of RNA that has had its introns removed and consists only of exons, making it ready for translation into proteins.
miRNA
MicroRNA are small, single-stranded, non-coding RNA molecules containing 21–23 nucleotides.
Nucleotide
The basic building block of DNA and RNA, consisting of a base, sugar, and phosphate group.
Protein
A molecule composed of amino acids, encoded by genes, and responsible for cellular structure and function.
RNA
RiboNucleic Acid
rRNA
Ribosomal RNA is a type of non-coding RNA that is a key component of ribosomes, which are essential for protein synthesis in all living cells.
Sequence
The precise order of nucleotides in a DNA or RNA strand.
Splicing
A biological process where non-coding regions (introns) are removed from a precursor messenger RNA (pre-mRNA) transcript, and the coding regions (exons) are joined together to form a mature messenger RNA (mRNA) that can be translated into proteins.
Transcription
The process of copying a DNA sequence into RNA.
Translation
The process of converting RNA sequences into proteins.
tRNA
Transfer RNA, is a type of RNA molecule that helps decode messenger RNA (mRNA) sequences into proteins.
References
  1. Clark, M. A., Douglas, M., & Choi, J. (2018). 3.5 Nucleic Acids. In Biology 2e. OpenStax. https://openstax.org/books/biology-2e/pages/3-5-nucleic-acids
  2. OpenStax College. (2013). DNA Nucleotides. https://commons.wikimedia.org/wiki/File:DNA_Nucleotides.jpg
  3. Ball, M. P. (2013). DNA replication split. https://commons.wikimedia.org/wiki/File:DNA_replication_split.svg
  4. Koch, L. (2009). Semiconservative replication. https://commons.wikimedia.org/wiki/File:Semiconservative_replication.png
  5. UC Museum of Paleontology. (2020). The causes of mutations. https://evolution.berkeley.edu/evolution-101/mechanisms-the-processes-of-evolution/the-causes-of-mutations/
  6. Ridder, D. de, Kupczok, A., Holmer, R., Bakker, F., Hooft, J. van der, Risse, J., Navarro, J., & Sardjoe, T. (2024). Self-created figure.
  7. miguelferig. (2011). Intron miguelferig. https://commons.wikimedia.org/wiki/File:Intron_miguelferig.jpg
  8. Greenwood, S. (2018). Genetic Code. https://commons.wikimedia.org/wiki/File:Genetic_Code.png
  9. Squidonius. (2008). Molbio-Header. https://commons.wikimedia.org/wiki/File:Molbio-Header.svg
  10. Clark, M. A., Douglas, M., & Choi, J. (2018). 3.4 Proteins. In Biology 2e. OpenStax. https://openstax.org/books/biology-2e/pages/3-4-proteins
  11. LadyofHats. (2008). Main protein structure levels en. https://commons.wikimedia.org/wiki/File:Main_protein_structure_levels_en.svg
  12. OpenStax College. (n.d.). Protein Structure. https://openstax.org/books/microbiology/pages/7-4-proteins#OSC_Microbio_07_04_secondary
  13. Laskowski, R. A., MacArthur, M. W., Moss, D. S., & Thornton, J. M. (1993). PROCHECK: a program to check the stereochemical quality of protein structures. Journal of Applied Crystallography, 26(2), 283–291. https://doi.org/10.1107/S0021889892009944
  14. Sangrador, A. (2023). What are protein domains? https://www.ebi.ac.uk/training/online/courses/protein-classification-intro-ebi-resources/protein-classification/what-are-protein-domains/
  15. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., & Bourne, P. E. (2000). The Protein Data Bank. In Nucleic Acids Research (Vol. 28, pp. 235–242). 10.1093/nar/28.1.235