The founding team of GOENOMICS has been researching and developing genome annotation software in academia for more than 35 years. Recognizing the limitations of current approaches, we have joined forces to develop new software. If you have questions about annotating a particular genomic region, contact us and we will provide you with a detailed assessment.
Configure the genome annotation according to your needs
Your advantages
Order only the annotations you need: select from 7 annotation packages.
Considerably higher quality of gene prediction by a new annotation approach.
Annotation of genomes at all assembly levels.
Short delivery time
Genome annotation is more than the prediction of protein-coding genes
Genome annotation means that each nucleotide in a genome sequence is assigned a function. Current genome annotations focus on the prediction of protein-coding genes. Untranslated regions (5' and 3' UTRs), non-coding regions, transposons and even RNA genes are rarely annotated. GOENOMICS provides all-encompassing genome annotations, but you can also select only the annotations you need. By considering all types of genes and regions simultaneously, we significantly reduce false-positive predictions in which a feature of one type is mistaken for another type.
Annotation process
Our work starts with the customer's genome assembly and optional supporting RNA-Seq and/or Iso-Seq data.
Depending on the packages ordered, we deliver the annotation results as gff3, fasta and csv/xlsx files (described below).
Package: Protein-coding genes
Protein-coding gene reconstruction focuses on identifying and annotating the locations of protein-coding genes within a DNA sequence. In the vast landscape of genomic information, the accurate prediction of structural genes is crucial for unraveling the functional elements that govern an organism's biology.
Structural gene prediction takes into account various features, such as open reading frames (ORFs), splice sites, start and stop codons, and consensus sequences, to distinguish coding regions from non-coding ones.
We use a newly developed, proprietary algorithm for homology-based gene reconstruction. To adapt the homology model, we created thousands of genome annotations internally and improved them iteratively.
annotation in GFF3 format
Gene types are denoted as Note in the gff3 attributes of the gene entries. Genes without any issue are termed "protein-coding" and "non-coding" depending on the presence of CDS regions. Genes containing in-frame stop codons or frameshifts are termed "potential pseudogene".
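As a minimal sketch of how the gene types described above could be read from a delivered file, the following Python snippet counts gene entries by their Note attribute. The example lines, the source column value and the function names are illustrative assumptions, not part of the actual deliverable.

```python
from collections import Counter
from urllib.parse import unquote


def parse_attributes(column9):
    """Parse the GFF3 attributes column (key=value pairs separated by ';')."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            # GFF3 allows percent-encoding of reserved characters in values.
            attrs[key.strip()] = unquote(value.strip())
    return attrs


def count_gene_types(gff3_lines):
    """Count gene features by the value of their Note attribute."""
    counts = Counter()
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # only gene entries carry the gene type
        counts[parse_attributes(cols[8]).get("Note", "unknown")] += 1
    return counts


# Hypothetical example lines; IDs and coordinates are made up.
example = [
    "##gff-version 3",
    "chr1\tGOENOMICS\tgene\t100\t900\t.\t+\t.\tID=g1;Note=protein-coding",
    "chr1\tGOENOMICS\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1",
    "chr1\tGOENOMICS\tgene\t1500\t2400\t.\t-\t.\tID=g2;Note=potential pseudogene",
    "chr1\tGOENOMICS\tgene\t3000\t3300\t.\t+\t.\tID=g3;Note=non-coding",
]
gene_type_counts = count_gene_types(example)
print(dict(gene_type_counts))
```

With a real file, the lines would come from `open("annotation.gff3")` instead of the in-memory list; the file name is a placeholder.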
amino acid and CDS sequences in fasta format
The translations of the protein-coding genes are given without the terminal stop. The protein sequences of potential pseudogenes are provided in a separate file. The latter may contain internal stop codons and the letter X at possible frameshift positions.
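The conventions above can be checked with a few lines of Python. This is a sketch under stated assumptions: it assumes that internal stop codons are written as "*" in the pseudogene file (the text does not name the symbol), and the function names and example sequences are hypothetical.

```python
def read_fasta(lines):
    """Read FASTA records into a dict of {first word of header: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def sequence_issues(protein):
    """Report stop symbols and X characters with 1-based positions.

    Assumption: stops are encoded as '*'; X marks possible frameshifts.
    """
    return {
        "terminal_stop": protein.endswith("*"),
        "internal_stops": [i + 1 for i, aa in enumerate(protein[:-1]) if aa == "*"],
        "frameshift_marks": [i + 1 for i, aa in enumerate(protein) if aa == "X"],
    }


# Hypothetical records: one clean translation, one potential pseudogene.
fasta_lines = [
    ">g1.t1 example protein",
    "MKTAYIAK",
    "QRQISFVK",
    ">g2.t1 example potential pseudogene",
    "MKT*LLXGA",
]
proteins = read_fasta(fasta_lines)
clean = sequence_issues(proteins["g1.t1"])
pseudo = sequence_issues(proteins["g2.t1"])
print(clean)
print(pseudo)
```

A translation from the protein-coding file should report no terminal stop, no internal stops and no X positions; anything else would point to a record that belongs in the pseudogene file.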
Package: Functional annotation
Functional gene annotation involves assigning biological functions to the genes discovered within a genome. Among others, we assign:
domain architecture
Domain profiles from Pfam, CDD, ProSitePattern and others.
protein/gene names
Protein naming strongly depends on the reference database used and on the quality and completeness of the query sequence. Moreover, naming subfamily members within protein families is error-prone. Because subfamily designations can be misleading, we additionally assign protein family names.
GO terms
GO terms are derived and combined from gene names and domain profiles.
EC numbers
EC numbers (Enzyme Commission numbers) are a numerical classification scheme for enzymes, based on the chemical reactions they catalyze.
annotation added to structural annotation in GFF3 and fasta files
Gene names from best-hit homologs are added to the GFF3 and fasta files, including the organism and accession number of the homolog.
functional annotation in csv/Excel format
The table lists the functions assigned to protein-coding genes: the closest known homolog with its SwissProt accession, species name and taxonomy, a standardized majority-consensus protein name, EC numbers, GO terms and protein domains.
Package: RNA genes
tRNA annotation types include
cognate
anticodon matches the tRNA isotype; no pseudogenes
non-cognate isotype
anticodon does not match the tRNA isotype
pseudo
score of the tRNA-sequence match too low
undetermined
anticodon contains "N"
truncated
tRNA incomplete; sequence missing at one or both ends
suppressor
anticodon matches one of the stop codons
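The annotation types listed above can be expressed as a simple decision function. The following Python sketch is illustrative only: the order in which the rules are checked, the score threshold `min_score`, the use of the standard genetic code without wobble rules, and all function and parameter names are assumptions, not a description of the actual pipeline.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def classify_trna(anticodon, isotype, score, complete=True, min_score=50.0):
    """Assign one of the annotation types to a predicted tRNA.

    anticodon: three-letter anticodon (DNA or RNA alphabet)
    isotype:   one-letter amino acid predicted from the tRNA body
    score:     score of the tRNA sequence match (hypothetical scale)
    """
    anticodon = anticodon.upper().replace("U", "T")
    if not complete:
        return "truncated"
    if "N" in anticodon:
        return "undetermined"
    if score < min_score:
        return "pseudo"
    # The codon read by the anticodon, assuming strict Watson-Crick pairing.
    amino_acid = CODON_TABLE.get(reverse_complement(anticodon))
    if amino_acid is None:
        return "undetermined"
    if amino_acid == "*":
        return "suppressor"
    if amino_acid == isotype:
        return "cognate"
    return "non-cognate isotype"


print(classify_trna("CAT", "M", 72.5))
print(classify_trna("TTA", "L", 65.0))
print(classify_trna("GAA", "L", 65.0))
```

In this sketch the anticodon CAT pairs with the codon ATG (methionine), TTA pairs with the stop codon TAA, and GAA pairs with TTC (phenylalanine), which does not match the stated isotype L.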
Predicted RNA gene families include ribosomal RNA, spliceosomal RNA, RNase P, RNase MRP, telomerase and snoRNA U3 genes
annotation in GFF3 format
annotations for tRNA, ribosomal RNA, spliceosomal RNA, telomerase, RNase P, RNase MRP and U3 snoRNA genes.
Package: UTRs / Coverage
The exon regions upstream (5') and downstream (3') of the coding exon regions are called untranslated regions (UTRs): 5' UTRs (also known as leader sequences) and 3' UTRs (also known as trailer sequences), respectively. Like all other exon regions, UTRs can be interrupted by introns or be alternatively spliced.
UTRs are determined by mapping RNA-Seq data or by prediction. Predictions are made from profiles trained on UTRs derived from RNA-Seq data or from profiles of closely related species.
Coverage by RNA-Seq data can be defined in two ways: i) Horizontal coverage, i.e. the support of the gene model from the 5' to the 3' end. In this case, the total length of the exons is defined as 100% and the coverage is the percentage of the mRNA covered by the RNA-Seq data. ii) Vertical coverage, i.e. the support of each nucleotide by RNA-Seq reads.
Here we use the vertical coverage to indicate whether a feature is supported by experimental data. High values indicate strong support, low values indicate weak support. Note: The values must not be used for differential expression analysis!
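The two coverage definitions can be illustrated with a small Python sketch. It makes several simplifying assumptions that are not taken from the text: reads are treated as contiguous intervals (no spliced alignments), coordinates are 1-based and inclusive, and vertical coverage is summarised per feature as the mean read depth. All function names are hypothetical.

```python
def per_base_depth(exons, reads):
    """Count, for every exon position, how many reads overlap it.

    exons, reads: lists of (start, end) tuples, 1-based and inclusive.
    """
    depth = {}
    for start, end in exons:
        for pos in range(start, end + 1):
            depth[pos] = 0
    for read_start, read_end in reads:
        for pos in range(read_start, read_end + 1):
            if pos in depth:  # ignore read bases outside the exons
                depth[pos] += 1
    return depth


def horizontal_coverage(depth):
    """Percentage of exon bases supported by at least one read."""
    covered = sum(1 for d in depth.values() if d > 0)
    return 100.0 * covered / len(depth)


def vertical_coverage(depth):
    """Mean number of reads per exon base (assumed summary statistic)."""
    return sum(depth.values()) / len(depth)


# Hypothetical gene model with two exons (20 bases in total) and three reads.
exons = [(1, 10), (21, 30)]
reads = [(1, 10), (1, 5), (21, 25)]
depth = per_base_depth(exons, reads)
print(horizontal_coverage(depth))
print(vertical_coverage(depth))
```

In this example 15 of the 20 exon bases are covered by at least one read, so the horizontal coverage is 75%, while the 20 aligned read bases spread over 20 exon bases give a mean depth of 1.0.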
annotation in GFF3 format
UTRs are added to the gff3 file as additional features to each gene if they can be determined or predicted. Coverage is added as an attribute to mRNA and exon features.
Package: Transposons
Due to their similarity to genes and gene elements, retrotransposons are incorrectly predicted to a considerable extent by common gene prediction programmes. Depending on the genome, 5 to 10 % of the genome annotations in public databases consist of transposons and/or contain parts of transposons in their gene structure.
We use a novel approach to unambiguously identify transposons and transposon fragments. By annotating all types of genome features (protein-coding, non-coding and RNA genes, pseudogenes and transposons), we ensure that one feature is not mispredicted as another.
annotation in GFF3 format
Transposons are provided as a separate gff3 file.
Package: 5 Genes
You select 5 genes for final annotation, which we analyse manually. This includes the confirmation of exons by RNA-Seq data or comparative genomics and the annotation of the most important alternative splice variants.
The full functional annotation as well as RNA-Seq evidence, if available, are presented as separate sections in the annotation report.
annotation in GFF3 format
Manual annotations are provided as a separate gff3 file.
Package: Enhanced analysis report
The enhanced analysis report adds further biology-based analyses to the standard report, including many interactive charts and plots. Biology-based analyses include, for example, the mapping of tRNA genes onto the genetic code, a detailed splice-pattern analysis, a detailed functional annotation analysis and many more.
assembly evaluation
N50 plot, A50 plot, GC content per contig plot
genome annotation metrics
gene stats, exon/intron length statistics
distribution of genes on strands
genome covered by genes/features
codon usage frequency
exon phase distribution
annotation quality indicators
intron type distribution (3n/3n+1/3n+2)
analysis of introns with respect to in-frame stop codons
splice site pattern distribution
only in combination with UTR package
distribution of genes with UTRs
genes covered by RNA-Seq
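Two of the metrics listed above are simple enough to sketch in Python. This is an illustration under assumptions: the N50 follows the common definition (the length of the contig at which the cumulative length of the contigs, sorted from longest to shortest, reaches half of the assembly size), and the intron types 3n/3n+1/3n+2 are interpreted as intron length modulo 3. The function names and example values are hypothetical.

```python
from collections import Counter


def n50(contig_lengths):
    """Length of the contig at which half of the assembly size is reached."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


def intron_type_distribution(intron_lengths):
    """Count introns by length modulo 3 (assumed meaning of 3n/3n+1/3n+2)."""
    labels = {0: "3n", 1: "3n+1", 2: "3n+2"}
    counts = Counter(labels[length % 3] for length in intron_lengths)
    return {label: counts.get(label, 0) for label in labels.values()}


# Hypothetical contig and intron lengths in base pairs.
contigs = [100, 80, 50, 30, 20]
introns = [90, 91, 92, 300, 61]
print(n50(contigs))
print(intron_type_distribution(introns))
```

For the example contigs the assembly size is 280, and the two longest contigs (100 + 80) already exceed half of it, so the N50 is 80.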