Genes and Gene Prediction

Gene structure | protein-coding genes

In eukaryotes, which comprise animals, plants, fungi (including yeasts) and protists of various origin, protein-coding genes usually consist of the genomic region coding for a protein as well as 5' and 3' regions where the promoter and termination signals are located, respectively. In most cases, the coding part is interrupted by introns. The organisation of all coding and non-coding parts is termed the gene structure. The individual parts have different biological roles and functions, which are defined and described in the following.

exon

Exons are those parts of the gene sequence that remain in the mRNA after RNA splicing of the precursor mRNA (pre-mRNA) sequence. Not all exon regions encode protein.

intron

Introns are those parts of the gene sequence that are removed from the precursor mRNA (pre-mRNA) sequence by RNA splicing.

start codon

The start codon defines the beginning of the coding sequence. In most cases the start codon is ATG (AUG in the mRNA), decoded by a so-called initiator tRNA, a variant of the common Met-tRNA(CAU). However, other codons have been found as start codons as well, such as ACG, CUG, AUU and AGG (given in mRNA notation), and while these are most often also decoded by the initiator tRNA, protein translation might in rare cases even be started by other tRNAs.

stop codon

Stop codons terminate the coding sequence. They are not matched by tRNAs but by release factors. There are three stop codons in the standard genetic code:

  • TAG    amber
  • TAA    ochre
  • TGA    opal

coding sequence | CDS

A coding sequence (CDS) is the part of a DNA sequence that is translated into protein. A full-length CDS consists of all exon regions that code for protein. As defined here, it includes the start codon and excludes the stop codon.

Note: Conventions differ between annotation formats. The stop codon is included in the CDS of GenBank annotations and of GFF3 files, but it is excluded from the CDS in GTF files.

open reading frame | ORF

An open reading frame (ORF) is the protein-coding region of a gene and comprises an uninterrupted stretch of nucleotides from a start codon to the next stop codon in the same reading frame. As long as the start codon and translation have not been confirmed, any ORF has to be considered a hypothesis.

In bacteria, where only few genes are interrupted by introns, an ORF is almost always identical to the CDS. Therefore, in the context of prokaryotes, ORF and CDS are often used interchangeably. In eukaryotes, where coding regions are interrupted by introns, the ORF concept only makes sense for some protists and yeasts, which have few or no introns. In all other eukaryotes, potential ORFs extend into intron regions, contain intron regions or miss the start codon. Thus, all these potential ORFs are immediately disproven.

not an open reading frame

A stretch of nucleotides that merely runs from one stop codon to the next stop codon is not an open reading frame. Stretches of nucleotides that can be ruled out as ORFs (e.g. because they clearly miss the start codon or contain intronic regions) are not ORFs either.
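
The ORF definition above can be illustrated with a minimal Python sketch. It is not a gene predictor: it only scans the three forward reading frames of a DNA string for stretches that run from a start codon to the next in-frame stop codon. The function name `find_orfs`, its parameters and the choice to report only the first start codon of each open stretch are assumptions made for this illustration. Stop-to-stop stretches without a start codon are not reported, and every hit remains a hypothesis until translation is confirmed.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}  # amber, ochre, opal (standard genetic code)


def find_orfs(seq, start_codons=("ATG",), min_codons=1):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is 0-based; end is exclusive and includes the stop codon.
    min_codons counts the codons from the start codon up to, but not
    including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = None  # position of the first start codon in the open stretch
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if orf_start is None and codon in start_codons:
                orf_start = pos
            elif orf_start is not None and codon in STOP_CODONS:
                if (pos - orf_start) // 3 >= min_codons:
                    orfs.append((orf_start, pos + 3, frame))
                orf_start = None  # a stop codon closes the stretch
    return orfs


if __name__ == "__main__":
    example = "CCATGAAATAGTTTTAACCC"
    for start, end, frame in find_orfs(example):
        print(frame, start, end, example[start:end])
```

The reverse strand is not scanned here; one way to cover it would be to run the same function on the reverse complement of the sequence.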

5'UTR and 3'UTR

The exon regions upstream (5') and downstream (3') of the coding exon regions are termed untranslated regions (UTRs): 5'UTRs (also termed leader sequences) and 3'UTRs (also termed trailer sequences), respectively. Because UTRs are exon regions, they can be interrupted by introns or be alternatively spliced like any other exon region.

Common File Formats

Sequences and gene structures

fasta

The fasta file format is a plain-text format for storing biological sequences (nucleotide or protein).

>A18572.1 Tubulin gene
TTTGCATGCTGTCCAACACGACGCGATCGCTGAAGCTTGGGCTCGTTTGGATATAAGTTTGACCTTATGT
ATGCCAAGCGTGCATTCGTCCACTGGTGAGTGTTCTTTCGACATCATCTTTTTCATTTGCAGTTGTTCTG
CATATACAACATTTTATGAAAGTCAGATATACTGTTCAGGTATGTCGGAGAGGGAATGGAGGAAGGAGAG
TTCAGTGAGGNACGTGAAGATCTCCCCGGGCTGCAGGAATTCGATATCAAGCTTATCGATACCGT
>NM_001272455.2 Drosophila melanogaster hexokinase A, transcript variant C (Hex-A), mRNA
ATGCCATTTGTGGACCCCTCAGCGTCGCACATATACACGCCATATCTCCAACCATGCCGCCCCAAAACCG
ATTTTCAGTTTTTGACCTTCAAGGTTCCTTCACGGCGAGCGAGAGAGAGAGAGAGTGAGAGTGTGAGCGC
GGCTCGCAGTCGCGGTCGGCGCCGGCAGCGCAGCAGCAGCGACAGCGAAAAGTGAGCACCAGTCGGCGAG
TGAA

The example above clearly presents the features of the format:

  • The first line of each sequence starts with the character > immediately followed (no space) by a description of the sequence. The description can contain any character and can be of any length.
  • The lines until the next sequence header or the end of the file contain the sequence. The sequence can be represented by any character (except >) and characters can be in uppercase or lowercase. 
  • The file format does not define a limit for the line length, but sequences are usually wrapped at a maximum of 80 characters per line.
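
The features listed above translate into a very small reader. The following Python sketch is an illustration only; the function name `parse_fasta` is an assumption, and established libraries cover many more edge cases. It treats every line starting with > as a header and joins all following lines, up to the next header or the end of the input, into one sequence.

```python
def parse_fasta(lines):
    """Yield (description, sequence) tuples from an iterable of text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []  # description without the leading ">"
        elif header is not None:
            chunks.append(line.strip())  # sequence lines may be wrapped
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Invented toy records, only to show the call
    example = """>seq1 first toy record
ACGTACGT
acgt
>seq2 second toy record
MKV
"""
    for description, sequence in parse_fasta(example.splitlines()):
        print(description, len(sequence))
```

To read from disk, an open file handle can be passed directly, since it iterates over lines: `parse_fasta(open("example.fasta"))`, where the filename is a placeholder.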

The fasta format does not have a standard filename extension. The most commonly used extensions are:

  • .fasta  .fas  .fa    generic extensions
  • .fna                 often used to specify fasta files that contain only nucleotide sequences
  • .faa                 often used to specify fasta files that contain only protein sequences

GFF | GTF

GFF (General Feature Format, formerly known as Gene Feature Finding) and GTF (Gene Transfer Format) are file formats that hold information about gene and genome annotation. GFF was initially designed to define gene structures with respect to sequences, which could either be included in the file or be present in an accompanying file. Both formats in their current versions require data in nine tab-delimited columns (not space-delimited).

The initial nine-column GFF format was very flexible and did not restrict the entries in the feature column or the descriptions in the attribute (formerly called [group]) column. The missing specifications for most of the columns and for the order of the rows resulted in several different flavours of the file type. All these flavours are more restrictive, allowing e.g. only certain feature types in column three or requiring certain keys in the attribute column. These restrictions would not be problematic if file parsers and validation tools did not enforce them. Because they do, GFF/GTF files need to be adjusted to the intended reading tool. In addition to these flavours, many software tools output files termed GFF whose only common characteristic with the GFF/GTF file types described here is the nine columns.

Because clear file type specifications were missing from the beginning, most of the available descriptions are vague in many aspects, which adds to the confusion. For example, the <frame> is often described as being restricted to "0", "1" or "2", while the example GFF/GTF on the same site shows the frame represented by a "dot" (for unknown or not applicable). Thus, restrictions are in practice not set by the format but by the respective parser.
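
As an illustration of the nine tab-delimited columns and of the "dot" placeholder, the following Python sketch splits a single GFF/GTF line into named fields. The names `Feature` and `parse_gff_line` are assumptions for this example, the field names follow the column names of the GFF3 specification, and the attribute column is deliberately kept as a raw string because its syntax differs between GFF3 and GTF. A real parser would have to match the flavour expected by the downstream tool.

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int             # 1-based, inclusive
    end: int               # 1-based, inclusive
    score: Optional[float]
    strand: str
    phase: Optional[int]   # the <frame> column; None if given as "."
    attributes: str        # raw ninth column, not interpreted here


def parse_gff_line(line):
    """Parse one GFF/GTF line; return None for blank and comment lines."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-delimited columns, found {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attributes = cols
    return Feature(
        seqid, source, ftype, int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if phase == "." else int(phase),
        attributes,
    )


if __name__ == "__main__":
    # Invented example line, only to show the call
    line = "chr1\texample\tCDS\t100\t250\t.\t+\t0\tID=cds1;Parent=mRNA1"
    print(parse_gff_line(line))
```

A space-delimited line raises a ValueError in this sketch, which mirrors the requirement that the columns are separated by tabs.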

The most current description of the GFF3 format can be found here (checked 16-Feb-2023):
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

The description of the GTF2.2 format is given here (checked 16-Feb-2023):
https://mblab.wustl.edu/GTF22.html

A history of the GFF/GTF file formats including detailed descriptions of the various changes between versions can be found here (checked 16-Feb-2023):
https://github.com/NBISweden/GAAS/blob/master/annotation/knowledge/gxf.md

Genomes and Genome Assemblies

Evaluating genome assemblies

Preferably, a genome assembly consists of as few contigs as possible (ideally just the number of chromosomes and organellar genomes) while still being as complete as possible. The genome assembly process is a balance between longer contigs and higher per-base coverage.

contigs

Contigs are contiguous DNA sequences. Usually, these are built by aligning short DNA sequences and assembling overlapping sequences into longer, continuous sequences, the contigs. However, available genome assemblies often contain various numbers of single DNA reads because these reads also represent the genome. In these datasets, single reads are also termed contigs, although they could not be aligned to other reads/contigs.

scaffolds | supercontigs | ultracontigs

Scaffolds (also termed supercontigs) consist of contigs put in correct order and orientation. The gaps between contigs are filled by a set number of "N" nucleotides or, if the distance between contigs is known, by a number of "N"s that reflects the estimated distance. The order and orientation of contigs can be obtained from paired-end sequencing, sequencing of bacterial artificial chromosomes (BACs) and optical mapping. In the earlier days of genome sequencing, supercontigs ordered and oriented into even larger supercontigs (but not the size of chromosomes) were sometimes termed ultracontigs. Today's techniques allow for the construction of chromosome-scale genome assemblies, and the term ultracontig has therefore fallen out of use.
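
Because gaps in scaffolds are written as runs of "N", the contigs of a scaffold can be recovered by splitting at these runs. The Python sketch below is an illustration under assumptions: the function name `split_scaffold` and the `min_gap` parameter are invented for this example, and treating every run of at least `min_gap` "N"s as a gap is a simplification, since short runs of "N" can also be ambiguous bases within a contig.

```python
import re


def split_scaffold(scaffold, min_gap=1):
    """Split a scaffold at runs of N/n of at least min_gap characters.

    Returns (start, end, sequence) tuples with 0-based start and
    exclusive end, in scaffold order.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    contigs, pos = [], 0
    for match in gap.finditer(scaffold):
        if match.start() > pos:
            contigs.append((pos, match.start(), scaffold[pos:match.start()]))
        pos = match.end()
    if pos < len(scaffold):
        contigs.append((pos, len(scaffold), scaffold[pos:]))
    return contigs


if __name__ == "__main__":
    toy = "ACGTNNNNNGGCCNTT"  # invented toy scaffold
    print(split_scaffold(toy))             # every N run is a gap
    print(split_scaffold(toy, min_gap=5))  # only runs of 5 or more N are gaps
```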

N50 / L50

The N50/L50 are metrics to describe the quality of a genome assembly. If all contigs are sorted by length from longest to shortest and the lengths are added up contig by contig, the

  • N50    is the number of contigs that need to be added up so that their summed length is > 50% of the total genome assembly length, and
  • L50    is the length of the contig whose addition makes the summed contig length exceed 50% of the total genome assembly length.

Often, numbers for further thresholds are given, such as N80/L80 and N90/L90. Because all these numbers can be tweaked at least slightly by applying minimum contig length cut-offs or by adjusting the stringency for aligning reads, it is best to inspect the trend of contig lengths and numbers over the entire genome assembly.

We are aware that the N50 and L50 are often switched. In many publications and tools the assignment is the other way round, with the N50 denoting the length of the contig and the L50 referring to the number of contigs. Also, the descriptions of how to compute N50/L50 vary between publications and webpages. On our webpages, we use these metrics as defined above.

If the genome size is known, often used metrics are NG50 and LG50, which are identical to N50 and L50 except that the computation is not related to the total genome assembly length but to the (real) genome size, which is determined or estimated by e.g. biophysical techniques.
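
The computation can be sketched in a few lines of Python. The sketch follows the definitions used on this page (N50 = number of contigs, L50 = contig length), which is the reverse of the assignment used by many tools, and it uses the strict "> 50%" criterion given above. The function name `n50_l50` and its parameters are assumptions for this illustration; passing a `genome_size` gives the NG50/LG50 variant.

```python
def n50_l50(contig_lengths, fraction=0.5, genome_size=None):
    """Return (N50, L50) as defined on this page.

    N50: number of contigs, sorted from longest to shortest, needed so
         that their summed length exceeds `fraction` of the reference.
    L50: length of the contig whose addition crosses that threshold.
    The reference is the total assembly length, or `genome_size` if given
    (NG50/LG50). Returns None if the threshold is never exceeded.
    """
    lengths = sorted(contig_lengths, reverse=True)
    reference = genome_size if genome_size is not None else sum(lengths)
    threshold = fraction * reference
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running > threshold:
            return count, length
    return None


if __name__ == "__main__":
    lengths = [80, 70, 50, 40, 30, 20, 10]    # invented toy assembly, total 300
    print(n50_l50(lengths))                   # N50/L50
    print(n50_l50(lengths, genome_size=400))  # NG50/LG50
```

In the toy assembly the two longest contigs sum to exactly 50% of the total length, so under the strict criterion a third contig is needed. Definitions that use "at least 50%" instead would stop one contig earlier.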

GC-content

The GC-content is simply the proportion of guanine and cytosine nucleotides with respect to the genome sequence. The GC-content differs widely across eukaryotes, with most Apicomplexa and many Microsporidia having GC-contents of <30%, and many Chlorophyta and Basidiomycota having GC-contents of >60%. Within a genome there is also high variability, with coding regions having a markedly higher proportion of GC nucleotides compared to introns and intergenic regions. Because of these characteristics, the GC-content is one indicator of potential sequence contamination and an indicator for genes and coding regions. For the latter reason it is one of the essential parameters in gene prediction.
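
A minimal Python sketch of the computation follows. The function names `gc_content` and `gc_windows` are assumptions for this illustration. One design choice should be noted: "N" and other ambiguity characters are left out of the denominator here, so that gaps in scaffolds do not lower the value; other tools may count them differently. The window function gives a simple view of the variability within a sequence.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def gc_windows(seq, window=1000, step=None):
    """Return (window_start, gc_content) for windows along seq.

    Windows do not overlap unless step is smaller than window.
    A sequence shorter than one window is reported as a single window.
    """
    step = step or window
    last_start = max(len(seq) - window + 1, 1)
    return [(i, gc_content(seq[i:i + window])) for i in range(0, last_start, step)]


if __name__ == "__main__":
    toy = "GGGGAAAA"  # invented toy sequence
    print(gc_content(toy))
    print(gc_windows(toy, window=4))
```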

Relations Between Genes/Proteins

Gene/protein origin and similarity

Functional annotation of genes/proteins is based on similarity. Similarity can be based on genome localisation, e.g. the order of genes in a genomic region, and/or on the similarity between DNA/protein sequences. Similarity is a vague term. Because of the low complexity of the sequences (alphabets of four and twenty characters in uninterrupted strings) compared to language, sequence similarity is a continuum. In addition, most eukaryotic proteins are multi-domain proteins. During functional annotation, information is often transferred from one sequence to another based on the longest domain, which usually gives the best-scoring hit, irrespective of potential differences in the remaining sequence regions.

Homologs and analogs

By definition, homologs have a clear common evolutionary origin, whereas analogs have similar biological functions but no common origin. As for most definitions in biology, there is a large grey zone. The uncertainty comes from similarity being a continuum and from similarity being the major determinant for comparing genes, proteins and derived data such as domain architecture and protein structure. For example, the bacterial FtsZ and eukaryotic tubulins show similarity at the protein structure level, but only low sequence similarity.

Paralogs and orthologs

Orthology and paralogy describe the relation of genes/proteins if duplicates or multiple copies exist in genomes. Genes/proteins in different species are orthologous to each other, while the duplicates within a species are termed paralogs. The naming becomes confusing to many readers as soon as the gene/protein history involves many duplications and gene losses. Ensembl defines a "between_species_paralog", which is a contradiction in terms under the definition given here (a paralog describes duplicates within a species) and adds to the confusion. As long as a clear functional relationship between genes/proteins and a relation across species cannot be revealed in such complex multi-duplication scenarios, the terms paralog and ortholog should only be used for unambiguous relations, while homolog should be used for all others.

Synteny

Synteny (from the Greek; syn = together, taenia = ribbon) is the conserved gene content and order in chromosomal regions of related species. The original meaning of the term, which only denoted the presence of two or more loci on the same chromosome, dates to a pre-genomics era when genes were located to chromosomes without the advantage of whole-genome mapping technologies. For plants, colinearity could be shown for long chromosomal regions at the genetic and at the gene level in spite of large variations in genome size. In vertebrates, synteny exists not only between closely related species but also over very long evolutionary timescales. Synteny might exist not only between different species but also between chromosomes of the same species.