Genomes and Genome Assemblies
Evaluating genome assemblies
Ideally, a genome assembly consists of as few contigs as possible (at best, one per chromosome and organellar genome) while also being as complete as possible. The genome assembly process is a trade-off between longer contigs and higher per-base coverage.
Contigs are contiguous DNA sequences. Usually, these are built by aligning short DNA sequences (reads) and assembling overlapping sequences into longer, continuous sequences, the contigs. However, available genome assemblies often also contain varying numbers of single DNA reads, because these reads also represent part of the genome. In these datasets, single reads are also termed contigs, although they could not be aligned to other reads or contigs.
scaffolds | supercontigs | ultracontigs
Scaffolds (also termed supercontigs) consist of contigs put in the correct order and orientation. The gaps between contigs are filled with a fixed number of "N" nucleotides or, if the distance between contigs is known, with a number of "N"s that reflects the estimated distance. The order and orientation of contigs can be obtained from paired-end sequencing, sequencing of bacterial artificial chromosomes (BACs), and optical mapping. In the earlier days of genome sequencing, supercontigs that were ordered and oriented into even larger supercontigs (but not the size of chromosomes) were sometimes termed ultracontigs. Today's techniques allow for the construction of chromosome-scale genome assemblies, and therefore the term ultracontig has fallen out of use.
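The way a scaffold is put together from ordered and oriented contigs can be illustrated with a minimal sketch. The function names, the "+"/"-" strand notation and the placeholder gap of 100 "N"s are assumptions made for this example only; they are not taken from any particular assembler.

```python
# Translation table for complementing DNA (upper and lower case, N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def build_scaffold(oriented_contigs, gap_sizes=None, unknown_gap=100):
    """Join contigs into one scaffold sequence.

    oriented_contigs: list of (sequence, strand) in scaffold order,
                      strand is "+" (as is) or "-" (reverse complement).
    gap_sizes:        one entry per gap; an int if the distance is known,
                      None if it is unknown.
    unknown_gap:      hypothetical fixed number of Ns used for unknown gaps.
    """
    if not oriented_contigs:
        raise ValueError("at least one contig is required")
    if gap_sizes is None:
        gap_sizes = [None] * (len(oriented_contigs) - 1)
    if len(gap_sizes) != len(oriented_contigs) - 1:
        raise ValueError("need exactly one gap size per pair of neighbours")

    parts = []
    for index, (sequence, strand) in enumerate(oriented_contigs):
        if strand not in ("+", "-"):
            raise ValueError("strand must be '+' or '-'")
        parts.append(sequence if strand == "+" else reverse_complement(sequence))
        if index < len(gap_sizes):  # no gap after the last contig
            gap = gap_sizes[index]
            parts.append("N" * (unknown_gap if gap is None else gap))
    return "".join(parts)


contigs = [("ACGT", "+"), ("AAAC", "-")]
print(build_scaffold(contigs, unknown_gap=5))   # ACGTNNNNNGTTT
print(build_scaffold(contigs, gap_sizes=[2]))   # ACGTNNGTTT
```

In the first call the gap size is unknown and is filled with the fixed placeholder; in the second call the known distance of two bases is used instead.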
N50 / L50
N50 and L50 are metrics that describe the quality of a genome assembly. If all contigs are sorted by length from longest to shortest and the lengths are added up contig by contig, the
- N50 is the number of contigs that need to be added up so that their summed length exceeds 50% of the total genome assembly length, and
- L50 is the length of the contig whose addition makes the summed length exceed 50% of the total genome assembly length.
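The two definitions above can be sketched in a few lines of Python. The function name `n50_l50` and the example lengths are made up for illustration, and the code follows the convention used on this page (N50 is a count, L50 is a length).

```python
def n50_l50(contig_lengths, percent=50):
    """Return (N, L) as defined on this page.

    N is the number of contigs needed, and L is the length of the contig
    whose addition pushes the running sum above `percent` % of the total
    assembly length.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(contig_lengths, reverse=True)  # longest first
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids rounding issues; strictly greater,
        # matching the "> 50%" wording of the definition.
        if running * 100 > percent * total:
            return count, length
    # Only reached if the threshold can never be exceeded (e.g. percent >= 100).
    return len(ordered), ordered[-1]


lengths = [100, 80, 60, 40, 20]  # made-up contig lengths, total 300
n50, l50 = n50_l50(lengths)
print(n50, l50)  # 2 80
```

Here the two longest contigs sum to 180, which exceeds half of 300, so N50 is 2 and L50 is 80. Sources that use the opposite naming would report the same pair as L50 = 2 and N50 = 80.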
Often, values for further thresholds are given as well, such as N80/L80 and N90/L90. Because all these numbers can be shifted at least slightly by applying minimum contig length cut-offs or by adjusting the accuracy thresholds for aligning reads, it is best to inspect the trend of contig lengths and numbers over the entire genome assembly.
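How strongly a minimum contig length cut-off can shift these numbers is easiest to see with a toy example. The sketch below uses hypothetical names (`count_and_length`, `summarize`, `min_length`) and invented contig lengths; it follows this page's convention that the N value is a count and the L value is a length.

```python
def count_and_length(contig_lengths, percent):
    """Return (number of contigs, length of the threshold-crossing contig)."""
    ordered = sorted(contig_lengths, reverse=True)  # longest first
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running * 100 > percent * total:  # strictly above the threshold
            return count, length
    return len(ordered), ordered[-1]


def summarize(contig_lengths, min_length=0, percents=(50, 80, 90)):
    """Compute the (N, L) pair for several thresholds after an optional cut-off."""
    kept = [length for length in contig_lengths if length >= min_length]
    if not kept:
        raise ValueError("no contigs left after applying the cut-off")
    return {percent: count_and_length(kept, percent) for percent in percents}


# Invented assembly: three long contigs plus 120 very short ones.
lengths = [600, 300, 100] + [10] * 120

for cutoff in (0, 50):
    print("min_length =", cutoff)
    for percent, (n_value, l_value) in summarize(lengths, min_length=cutoff).items():
        print(f"  N{percent} = {n_value}, L{percent} = {l_value}")
```

Without a cut-off, the many short contigs make up more than half of the assembly, so N50 is 14 and L50 is 10. Dropping contigs shorter than 50 changes this to N50 = 1 and L50 = 600, although the underlying data are the same.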
We are aware that the N50 and L50 are often switched, e.g. with the N50 denoting the length of the contig and the L50 referring to the number of contigs. Also, the descriptions of how to compute N50/L50 vary between publications and webpages. On our webpages, we use these metrics as defined above.
If the genome size is known, NG50 and LG50 are often used. These are identical to N50 and L50 except that the computation refers not to the total genome assembly length but to the (real) genome size, which is determined or estimated by, e.g., biophysical techniques.
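A minimal sketch of NG50/LG50, again in this page's convention (NG50 as a count, LG50 as a length), differs from the N50/L50 computation only in the threshold. The function name `ng50_lg50` and the numbers are illustrative assumptions; returning `None` when the assembly covers too little of the genome is a design choice of this sketch.

```python
def ng50_lg50(contig_lengths, genome_size, percent=50):
    """Return (NG, LG) relative to a known or estimated genome size.

    Returns None if the assembly is too short to exceed `percent` % of
    the genome size, in which case the metric is undefined.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    ordered = sorted(contig_lengths, reverse=True)  # longest first
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Threshold is based on the genome size, not on the assembly length.
        if running * 100 > percent * genome_size:
            return count, length
    return None


lengths = [100, 80, 60, 40, 20]                 # made-up assembly, total 300
print(ng50_lg50(lengths, genome_size=400))      # (3, 60)
print(ng50_lg50(lengths, genome_size=1000))     # None
```

With an estimated genome size of 400, three contigs are needed to exceed 200 bases, one more than for the N50 of the same assembly.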
GC-content
The GC-content is simply the proportion of guanine and cytosine nucleotides in the genome sequence. The GC-content differs widely across eukaryotes, with most Apicomplexa and many Microsporidia having GC-contents of <30%, and many Chlorophyta and Basidiomycota having GC-contents of >60%. Within a genome there is also high variability, with coding regions having a markedly higher proportion of GC nucleotides than introns and intergenic regions. Because of these characteristics, the GC-content is one indicator of potential sequence contamination and an indicator of genes and coding regions. For the latter reason, it is one of the essential parameters in gene prediction.
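As a rough illustration, the GC-content of a sequence and its variation along the sequence can be computed as follows. The function names and the window size are hypothetical, and the sketch assumes that "N"s and other ambiguity codes are left out of the denominator, which is a common choice but not the only possible one.

```python
def gc_content(sequence):
    """Return the GC fraction of a DNA sequence, or None if it has no A/C/G/T.

    Assumption: only unambiguous bases are counted, so gap characters
    such as N do not dilute the result.
    """
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    at = sequence.count("A") + sequence.count("T")
    if gc + at == 0:
        return None
    return gc / (gc + at)


def gc_windows(sequence, window=1000):
    """Return the GC fraction of consecutive, non-overlapping windows."""
    return [
        gc_content(sequence[start:start + window])
        for start in range(0, len(sequence), window)
    ]


toy = "GGCCATAT"                      # invented sequence
print(gc_content(toy))                # 0.5
print(gc_windows(toy, window=4))      # [1.0, 0.0]
```

The overall value of 0.5 hides the fact that the two halves differ completely, which is why a per-window view can be more informative than a single number when looking for unusual regions.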