Repertoire-wide gene structure analyses: a case study comparing automatically predicted and manually annotated gene models

Abstract. Background: The location and modular structure of eukaryotic protein-coding genes in genomic sequences can be automatically predicted by gene annotation algorithms. These predictions are often used for comparative studies on gene structure, gene repertoires, and genome evolution. However, automatic annotation algorithms do not yet correctly identify all genes within a genome, and manual annotation is often necessary to obtain accurate gene models and gene sets. As manual annotation is time-consuming, only a fraction of the gene models in a genome is typically manually annotated, and this fraction often differs between species. To assess the impact of manual annotation efforts on genome-wide analyses of gene structural properties, we compared the structural properties of protein-coding genes in seven diverse insect species sequenced by the i5k initiative. Results: Our results show that the subset of genes chosen for manual annotation by a research community (3.5–7% of gene models) may have structural properties (e.g., lengths and exon counts) that are not necessarily representative of a species’ gene set as a whole. Nonetheless, the structural properties of automatically generated gene models are only altered marginally (if at all) through manual annotation. Major correlative trends, for example a negative correlation between genome size and exonic proportion, can be inferred from the automatically predicted and the manually annotated gene models alike. Conversely, some previously reported trends did not appear in either the automatically predicted or the manually annotated gene sets, pointing towards insect-specific gene structural peculiarities. Conclusions: In our analysis of gene structural properties, automatically predicted gene models proved to be sufficiently reliable to recover the same gene-repertoire-wide correlative trends that we found when focusing on manually annotated gene models only. We acknowledge that analyses on the individual gene level clearly benefit from manual curation. However, as genome sequencing and annotation projects often differ in the extent of their manual annotation and curation efforts, our results indicate that comparative studies analyzing gene structural properties in these genomes can nonetheless be justifiable and informative.

The genome sequence of the European peacock butterfly, Aglais io (Linnaeus, 1758)

Wellcome Open Research ◽

10.12688/wellcomeopenres.17204.1 ◽

2021 ◽

Vol 6 ◽

pp. 258

Author(s):

Konrad Lohse ◽

Alexander Mackintosh ◽

Roger Vila ◽

...

Keyword(s):

Genome Sequence ◽

Genome Assembly ◽

Sex Chromosome ◽

Gene Annotation ◽

Protein Coding ◽

Individual Male ◽

Protein Coding Genes ◽

A Genome ◽

Inachis Io

We present a genome assembly from an individual male Aglais io (also known as Inachis io and Nymphalis io) (the European peacock; Arthropoda; Insecta; Lepidoptera; Nymphalidae). The genome sequence is 384 megabases in span. The majority (99.91%) of the assembly is scaffolded into 31 chromosomal pseudomolecules, with the Z sex chromosome assembled. Gene annotation of this assembly on Ensembl has identified 11,420 protein coding genes.

The Absence of Universally-Conserved Protein-coding Genes

10.1101/842633 ◽

2019 ◽

Author(s):

Change Laura Tan

Keyword(s):

Public Access ◽

Orphan Genes ◽

Protein Coding ◽

Great Opportunity ◽

Protein Coding Genes ◽

Phylogenetic Profiles ◽

Gene Count ◽

A Genome ◽

Wide Scale ◽

Specific Species

Abstract. Public access to thousands of completely sequenced and annotated genomes provides a great opportunity to address the relationships of different organisms, at the molecular level and on a genome-wide scale. By comparing the phylogenetic profiles of all protein-coding genes in 317 model species described in the OrthoInspector3.0 database, we found that approximately 29.8% of the total protein-coding genes were orphan genes (genes unique to a specific species) while < 0.01% were universal genes (genes with homologs in each of the 317 species analyzed). When weighted by potential birth event, the orphan genes comprised 82% of the total, while the universal genes accounted for less than 0.00008%. Strikingly, as the number of analyzed genomes increased, the sum total of universal and nearly-universal genes plateaued while that of orphan and nearly-orphan genes grew continuously. When the set of compared species was expanded to include 3863 bacteria, 711 eukaryotes, and 179 archaea, not one of the universal genes remained. The results speak to a previously unappreciated degree of genetic biodiversity, which we propose to quantify using the birth-event-weighted gene count method.
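
The orphan/universal distinction above can be illustrated with a minimal Python sketch. It assumes each gene's phylogenetic profile is already available as the set of species in which the gene (or a homolog) is found; the function name `classify_profiles`, the gene labels, and the toy species names are hypothetical and are not taken from the study or from OrthoInspector. The birth-event weighting mentioned in the abstract is not modeled here.

```python
from collections import Counter


def classify_profiles(profiles, n_species):
    """Count orphan, universal, and intermediate genes.

    profiles  -- dict mapping a gene id to the set of species in which
                 the gene or a homolog is present (hypothetical input format)
    n_species -- total number of species compared
    """
    counts = Counter(orphan=0, universal=0, intermediate=0)
    for gene, species in profiles.items():
        if not species:
            raise ValueError(f"gene {gene!r} has an empty profile")
        if len(species) == 1:
            # Present in a single species only: an orphan gene.
            counts["orphan"] += 1
        elif len(species) == n_species:
            # Homologs found in every compared species: a universal gene.
            counts["universal"] += 1
        else:
            counts["intermediate"] += 1
    return counts


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    toy = {
        "geneA": {"sp1"},
        "geneB": {"sp1", "sp2", "sp3"},
        "geneC": {"sp1", "sp2"},
        "geneD": {"sp3"},
    }
    counts = classify_profiles(toy, n_species=3)
    total = sum(counts.values())
    for label, n in counts.items():
        print(f"{label}: {n} ({100.0 * n / total:.1f}%)")
```

Under this definition, adding species to the comparison can only keep or shrink the universal class, which is consistent with the abstract's observation that no universal gene remained once the species set was enlarged.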

Draft Genome Sequence of a Novel Bacterium, Pseudomonas sp. Strain MR 02, Capable of Pyomelanin Production, Isolated from the Mahananda River at Siliguri, West Bengal, India

Genome Announcements ◽

10.1128/genomea.01443-17 ◽

2018 ◽

Vol 6 (3) ◽

pp. e01443-17 ◽

Cited By ~ 1

Author(s):

Vivek Kumar Ranjan ◽

Tilak Saha ◽

Shriparna Mukherjee ◽

Ranadhir Chakraborty

Keyword(s):

Genome Sequence ◽

West Bengal ◽

Draft Genome ◽

Homogentisic Acid ◽

Draft Genome Sequence ◽

Gene Length ◽

Protein Coding ◽

Protein Coding Genes ◽

Novel Bacterium ◽

A Genome

Abstract. The draft genome sequence of a novel strain, Pseudomonas sp. MR 02, a pyomelanin-producing bacterium isolated from the Mahananda River at Siliguri, West Bengal, India, is reported here. This strain has a genome size of 5.94 Mb, with an overall G+C content of 62.6%. The draft genome reports 5,799 genes (mean gene length, 923 bp), among which 5,503 are protein-coding genes, including the genes required for the catabolism of tyrosine or phenylalanine for the characteristic production of homogentisic acid (HGA). Excess HGA, on excretion, auto-oxidizes and polymerizes to form pyomelanin.

The genome sequence of the Glanville fritillary, Melitaea cinxia (Linnaeus, 1758)

Wellcome Open Research ◽

10.12688/wellcomeopenres.17283.1 ◽

2021 ◽

Vol 6 ◽

pp. 266

Author(s):

Roger Vila ◽

Alex Hayward ◽

Konrad Lohse ◽

Charlotte Wright ◽

...

Keyword(s):

Genome Sequence ◽

Genome Assembly ◽

Sex Chromosome ◽

Gene Annotation ◽

Melitaea Cinxia ◽

Protein Coding ◽

Individual Male ◽

Protein Coding Genes ◽

A Genome

We present a genome assembly from an individual male Melitaea cinxia (the Glanville fritillary; Arthropoda; Insecta; Lepidoptera; Nymphalidae). The genome sequence is 499 megabases in span. The complete assembly is scaffolded into 31 chromosomal pseudomolecules, with the Z sex chromosome assembled. Gene annotation of this assembly on Ensembl has identified 13,666 protein coding genes.

Draft Genome Assembly and Annotation of Red Raspberry Rubus Idaeus

10.1101/546135 ◽

2019 ◽

Cited By ~ 4

Author(s):

Haley Wight ◽

Junhui Zhou ◽

Muzi Li ◽

Sridhar Hannenhalli ◽

Stephen M. Mount ◽

...

Keyword(s):

De Novo ◽

Draft Genome ◽

Rubus Idaeus ◽

Slow Process ◽

Red Raspberry ◽

Protein Coding ◽

Draft Genome Assembly ◽

Protein Coding Genes ◽

A Genome ◽

Exceptional Value

Abstract. The red raspberry, Rubus idaeus, is widely distributed in all temperate regions of Europe, Asia, and North America and is a major commercial fruit valued for its taste and its high antioxidant and vitamin content. However, Rubus breeding is a long and slow process hampered by limited genomic and molecular resources. Genomic resources such as a complete genome sequence and transcriptome will be of exceptional value to improve research and breeding of this high-value crop. Using a hybrid sequence assembly approach including data from both long and short sequence reads, we present the first assembly of the Rubus idaeus genome (Joan J. variety). The de novo assembled genome consists of 2,145 scaffolds with a genome completeness of 95.3% and an N50 score of 638 kb. Leveraging a linkage map, we anchored 80.1% of the genome onto seven chromosomes. Using over 1 billion paired-end RNAseq reads, we annotated 35,566 protein-coding genes with a transcriptome completeness score of 97.2%. The Rubus idaeus genome provides an important new resource for researchers and breeders.
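
For readers unfamiliar with the N50 statistic quoted above, the following is a minimal Python sketch of the usual definition: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembly length. The function name `n50` and the example lengths are illustrative assumptions; the abstract does not state which tool the authors used to compute their value.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length
    of the scaffold at which the running total first reaches at least half
    of the summed length of all scaffolds.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Toy scaffold lengths in bp, invented for illustration only.
    print(n50([80, 70, 50, 40, 30, 20]))
```

Because N50 depends only on the length distribution, it says nothing about correctness of the assembly, which is why the abstract reports it alongside completeness scores.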

Pandoravirus celtis illustrates the microevolution processes at work in the giant Pandoraviridae genomes

10.1101/500207 ◽

2018 ◽

Cited By ~ 1

Author(s):

Matthieu Legendre ◽

Jean-Marie Alempic ◽

Nadège Philippe ◽

Audrey Lartigue ◽

Sandra Jeudy ◽

...

Keyword(s):

De Novo ◽

Gene Repertoire ◽

Protein Coding ◽

Genomic Changes ◽

Coding Regions ◽

Protein Coding Genes ◽

Intergenic Regions ◽

Mere Existence ◽

Increasing Functions ◽

Similar Gene

Abstract. With genomes of up to 2.7 Mb propagated in µm-long oblong particles and initially predicted to encode more than 2000 proteins, members of the Pandoraviridae family display the most extreme features of the known viral world. The mere existence of such giant viruses raises fundamental questions about their origin and the processes governing their evolution. A previous analysis of six newly available isolates, independently confirmed by a study including 3 others, established that the Pandoraviridae pan-genome is open, meaning that each new strain exhibits protein-coding genes not previously identified in other family members. With an average increment of about 60 proteins, the gene repertoire shows no sign of reaching a limit and remains largely coding for proteins without recognizable homologs in other viruses or cells (ORFans). To explain these results, we proposed that most new protein-coding genes were created de novo, from pre-existing non-coding regions of the G+C rich pandoravirus genomes. The comparison of the gene content of a new isolate, P. celtis, closely related (96% identical genome) to the previously described P. quercus is now used to test this hypothesis by studying genomic changes in a microevolution range. Our results confirm that the differences between these two similar gene contents mostly consist of protein-coding genes without known homologs (ORFans), with statistical signatures close to that of intergenic regions. These newborn proteins are under slight negative selection, perhaps to maintain stable folds and prevent protein aggregation pending the eventual emergence of fitness-increasing functions. Our study also unraveled several insertion events mediated by a transposase of the hAT family, 3 copies of which are found in P. celtis and are presumably active. Members of the Pandoraviridae are presently the first viruses known to encode this type of transposase.

The genome sequence of the heath fritillary, Melitaea athalia (Rottemburg, 1775)

Wellcome Open Research ◽

10.12688/wellcomeopenres.17280.1 ◽

2021 ◽

Vol 6 ◽

pp. 304

Author(s):

Alex Hayward ◽

Roger Vila ◽

Dominik R. Laetsch ◽

Konrad Lohse ◽

Tobias Baril ◽

...

Keyword(s):

Genome Sequence ◽

Genome Assembly ◽

Sex Chromosome ◽

Gene Annotation ◽

Protein Coding ◽

Individual Female ◽

Protein Coding Genes ◽

A Genome

We present a genome assembly from an individual female Melitaea athalia (also known as Mellicta athalia; the heath fritillary; Arthropoda; Insecta; Lepidoptera; Nymphalidae). The genome sequence is 610 megabases in span. In total, 99.98% of the assembly is scaffolded into 32 chromosomal pseudomolecules, with the W and Z sex chromosomes assembled. Gene annotation of this assembly on Ensembl has identified 12,824 protein coding genes.

Chromosome-level genome assembly of a butterflyfish, Chelmon rostratus

10.1101/719187 ◽

2019 ◽

Author(s):

Xiaoyun Huang ◽

Yue Song ◽

Suyu Zhang ◽

A Yunga ◽

Mengqi Zhang ◽

...

Keyword(s):

Molecular Mechanisms ◽

Repetitive Sequences ◽

Ecological Environment ◽

Protein Coding ◽

Protein Coding Genes ◽

A Genome ◽

Genome Information ◽

Adaptation Evolution ◽

Core Genes ◽

Chromosome Level

Abstract. Chelmon rostratus (Teleostei, Perciformes, Chaetodontidae) is the copperband butterflyfish. It is an ornamental fish, and genome information for this species might help in understanding the genome evolution of Chaetodontidae and the adaptation and evolution of coral reef fish. In this study, using stLFR co-barcoded read data, we assembled a genome of 638.70 Mb in size with contig and scaffold N50 sizes of 294.41 kb and 2.61 Mb, respectively. 94.40% of scaffold sequences were assigned to 24 chromosomes using Hi-C data, and BUSCO analysis showed that 97.3% (2,579) of core genes were found in our assembly. Up to 21.47% of the genome was found to be repetitive sequences, and 21,375 protein-coding genes were annotated. Among these annotated protein-coding genes, 20,163 (94.33%) proteins were assigned possible functions. As the first genome for the family Chaetodontidae, these data should help further the understanding and exploration of symbiosis with coral in the marine ecological environment, and of the genomic innovations and molecular mechanisms contributing to the species' unique morphology and physiological features.

Whole Genome Sequencing of Sunflower Root-Associated Bacillus cereus

Evolutionary Bioinformatics ◽

10.1177/11769343211038948 ◽

2021 ◽

Vol 17 ◽

pp. 117693432110389

Author(s):

Olubukola Oluranti Babalola ◽

Bartholomew Saanu Adeleke ◽

Ayansina Segun Ayangbenro

Keyword(s):

Metabolic Pathways ◽

Endophytic Bacteria ◽

Read Count ◽

Bacillus Species ◽

Whole Genome ◽

Organic Substrates ◽

Protein Coding ◽

Protein Coding Genes ◽

A Genome ◽

Microbe Interactions

In recent times, diverse agriculturally important endophytic bacteria colonizing the plant endosphere have been identified. Harnessing the potential of Bacillus species from sunflower could reveal their biotechnological and agricultural importance. Here, we present genomic insights into B. cereus T4S isolated from sunflower sourced from Lichtenburg, South Africa. Genome analysis revealed a sequence read count of 7 255 762, a genome size of 5 945 881 bp, and a G + C content of 34.8%. The genome contains protein-coding genes involved in various metabolic pathways. The detection of genes involved in the metabolism of organic substrates and chemotaxis could enhance plant-microbe interactions in the synthesis of biological products with biotechnological and agricultural importance.
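
The G + C content reported above is the percentage of guanine and cytosine bases among the unambiguous bases of the genome. A minimal Python sketch of that calculation follows; the function name `gc_content` and the example sequences are illustrative assumptions, and ambiguous symbols such as N are simply ignored here, which may differ from the convention of the tool the authors used.

```python
def gc_content(sequence):
    """Return the G+C percentage of a DNA sequence.

    Only unambiguous bases (A, C, G, T) are counted; any other symbol,
    such as N, is left out of both numerator and denominator.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / (gc + at)


if __name__ == "__main__":
    # Short toy sequence, invented for illustration only.
    print(f"{gc_content('ATGCGCNNAT'):.1f}%")
```

For a whole genome, the same function can be applied to the concatenation of all contigs, or the per-contig counts can be summed before dividing.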

Genome Sequence of Gordonia Phage Yvonnetastic

Genome Announcements ◽

10.1128/genomea.00594-16 ◽

2016 ◽

Vol 4 (4) ◽

Cited By ~ 1

Author(s):

Welkin H. Pope ◽

Anshika Bandyopadhyay ◽

Meghan L. Carlton ◽

Meghan T. Kane ◽

Niyati J. Panchal ◽

...

Keyword(s):

Genome Sequence ◽

Sequence Similarity ◽

tRNA Genes ◽

Protein Coding ◽

Protein Coding Genes ◽

A Genome

Gordonia bacteriophage Yvonnetastic was isolated from soil in Pittsburgh, PA, using Gordonia terrae 3612 as a host. Yvonnetastic has siphoviral morphology and a genome of 98,136 bp, with 198 predicted protein-coding genes and five tRNA genes. Yvonnetastic does not share substantial sequence similarity with other sequenced bacteriophage genomes.
