OrthoFiller: utilising data from multiple species to improve the completeness of genome annotations

2017 ◽  
Author(s):  
Michael P. Dunne ◽  
Steven Kelly

Abstract Background: Complete and accurate annotation of sequenced genomes is of paramount importance to their utility and analysis. Differences in gene prediction pipelines mean that genome sequences for a species can differ considerably in the quality and quantity of their predicted genes. Furthermore, genes that are present in genome sequences sometimes fail to be detected by computational gene prediction methods. Erroneously unannotated genes can lead to oversights and inaccurate assertions in biological investigations, especially for smaller-scale genome projects, which rely heavily on computational prediction. Results: Here we present OrthoFiller, a tool designed to address the problem of finding and adding such missing genes to genome annotations. OrthoFiller leverages information from multiple related species to identify those genes whose existence can be verified through comparison with known gene families, but which have not been predicted. By simulating missing gene annotations in real sequence datasets from both plants and fungi, we demonstrate the accuracy and utility of OrthoFiller for finding missing genes and improving genome annotation. Furthermore, we show that applying OrthoFiller to existing “complete” genome annotations can identify and correct substantial numbers of erroneously missing genes in these two sets of species. Conclusions: We show that significant improvements in the completeness of genome annotations can be made by leveraging information from multiple species.

2019 ◽  
Author(s):  
Alex Trouern-Trend ◽  
Taylor Falk ◽  
Sumaira Zaman ◽  
Madison Caballero ◽  
David B. Neale ◽  
...  

ABSTRACT Juglans (walnuts), the most speciose genus in the walnut family (Juglandaceae), represents most of the family’s commercially valuable fruit and wood-producing trees and includes several species used as rootstock in agriculture for their resistance to various abiotic and biotic stressors. We present the full structural and functional genome annotations of six Juglans species and one outgroup within Juglandaceae (Juglans regia, J. cathayensis, J. hindsii, J. microcarpa, J. nigra, J. sigillata and Pterocarya stenoptera), produced using the BRAKER2 semi-unsupervised gene prediction pipeline and additional in-house developed tools. For each annotation, gene predictors were trained using 19 tissue-specific J. regia transcriptomes aligned to the genomes. Additional functional evidence and filters were applied to multiexonic and monoexonic putative genes to yield between 27,000 and 44,000 high-confidence gene models per species. Comparison of gene models to the BUSCO embryophyta dataset suggested that, on average, genome annotation completeness was 89.6%. We utilized these high-quality annotations to assess gene family evolution within Juglans and among Juglans and selected Eurosid species, which revealed significant contractions in several gene families in J. hindsii, including disease resistance-related Wall-associated Kinase (WAK) and Catharanthus roseus Receptor-like Kinase (CrRLK1L) and others involved in abiotic stress response. Finally, we confirmed an ancient whole genome duplication that took place in a common ancestor of Juglandaceae using site substitution comparative analysis. SIGNIFICANCE: High-quality full genome annotations for six species of walnut (Juglans) and a wingnut (Pterocarya) outgroup were constructed using semi-unsupervised gene prediction followed by gene model filtering and functional characterization. These annotations represent the most comprehensive set for any hardwood genus to date. Comparative analyses based on the gene models uncovered rapid evolution in multiple gene families related to disease response and a whole genome duplication in a Juglandaceae common ancestor.


2013 ◽  
Vol 2013 ◽  
pp. 1-11 ◽  
Author(s):  
Tyler Alioto ◽  
Ernesto Picardi ◽  
Roderic Guigó ◽  
Graziano Pesole

New genomes are being sequenced at an increasingly rapid rate, far outpacing the rate at which manual gene annotation can be performed. Automated genome annotation is thus necessitated by this growth in genome projects; however, full-fledged annotation systems are usually home-grown and customized to a particular genome. There is thus a renewed need for accurate ab initio gene prediction methods. However, it is apparent that fully ab initio methods fall short of the required level of sensitivity and specificity for a quality annotation. Evidence in the form of expressed sequences gives the single biggest improvement in accuracy when used to inform gene predictions. Here, we present a lightweight pipeline for first-pass gene prediction on newly sequenced genomes. The two main components are ASPic, a program that derives highly accurate, albeit not necessarily complete, EST-based transcript annotations from EST alignments, and GeneID, a standard gene prediction program, which we have modified to take intron annotations as evidence. The introns output by ASPic CDS predictions are given to GeneID to constrain the exon-chaining process and produce predictions consistent with the underlying EST alignments. The pipeline was successfully tested on the entire C. elegans genome and the 44 ENCODE human pilot regions.


2016 ◽  
Vol 113 (48) ◽  
pp. 13815-13820 ◽  
Author(s):  
Mi Ok Lee ◽  
Susanne Bornelöv ◽  
Leif Andersson ◽  
Susan J. Lamont ◽  
Junfeng Chen ◽  
...  

Defensins constitute an evolutionarily conserved family of cationic antimicrobial peptides that play a key role in host innate immune responses to infection. Defensin genes generally reside in complex genomic regions that are prone to structural variation, and defensin genes exhibit extensive copy number variation in humans and in other species. Copy number variation of defensin genes was examined in inbred lines of Leghorn and Fayoumi chickens, and a duplication of defensin7 was discovered in the Fayoumi breed. Analysis of junction sequences confirmed the occurrence of a simple tandem duplication of defensin7 with sequence identity at the junction, suggesting nonallelic homologous recombination between defensin7 and defensin6. The duplication event generated two chimeric promoters that are best explained by gene conversion followed by homologous recombination. Expression of defensin7 was not elevated in animals with two genes despite both genes being transcribed in the tissues examined. Computational prediction of promoter regions revealed the presence of several putative transcription factor binding sites generated by the duplication event. These data provide insight into the evolution and possible function of large gene families and, specifically, the defensins.


Author(s):  
Almut Heinken ◽  
Stefanía Magnúsdóttir ◽  
Ronan M T Fleming ◽  
Ines Thiele

Abstract Motivation: Manual curation of genome-scale reconstructions is laborious, yet existing automated curation tools do not typically take species-specific experimental and curated genomic data into account. Results: We developed DEMETER, a COBRA Toolbox extension that enables the efficient, simultaneous refinement of thousands of draft genome-scale reconstructions, while ensuring adherence to the quality standards in the field, agreement with available experimental data, and refinement of pathways based on manually refined genome annotations. Availability: DEMETER and tutorials are freely available at https://github.com/opencobra.


2022 ◽  
Author(s):  
Caroline M. Weisman ◽  
Andrew M. Murray ◽  
Sean R Eddy

Comparisons of genomes of different species are used to identify lineage-specific genes, those genes that appear unique to one species or clade. Lineage-specific genes are often thought to represent genetic novelty that underlies unique adaptations. Identification of these genes depends not only on genome sequences, but also on inferred gene annotations. Comparative analyses typically use available genomes that have been annotated using different methods, increasing the risk that orthologous DNA sequences may be erroneously annotated as a gene in one species but not another, appearing lineage-specific as a result. To evaluate the impact of such 'annotation heterogeneity', we identified four clades of species with sequenced genomes with more than one publicly available gene annotation, allowing us to compare the number of lineage-specific genes inferred when differing annotation methods are used to those resulting when annotation method is uniform across the clade. In these case studies, annotation heterogeneity increases the apparent number of lineage-specific genes by up to 15-fold, suggesting that annotation heterogeneity is a substantial source of potential artifact.


2018 ◽  
Author(s):  
Madison Caballero ◽  
Jill Wegrzyn

Abstract Published genome annotations are filled with erroneous gene models that represent issues associated with frame, start site identification, splice sites, and related structural features. The source of these inconsistencies can often be traced to translated text file formats designed to describe long read alignments and predicted gene structures. The majority of gene prediction frameworks do not provide downstream filtering to remove problematic gene annotations, nor do they represent these annotations in a format consistent with current file standards. In addition, these frameworks lack consideration for functional attributes, such as the presence or absence of protein domains, which can be used for gene model validation. To provide oversight to the increasing number of published genome annotations, we present gFACs as a software package to filter, analyze, and convert predicted gene models and alignments. gFACs operates across a wide range of alignment, analysis, and gene prediction software inputs with a flexible framework for defining gene models with reliable structural and functional attributes. gFACs supports common downstream applications, including genome browsers, and generates extensive details on the filtering process, including distributions that can be visualized to further assess the proposed gene space.


2018 ◽  
Author(s):  
A. K. M. Azad

Abstract Cell-cell communication via pathway cross-talks within a single species has been studied in silico recently to decipher various disease phenotypes. However, computational prediction of pathway cross-talks among multiple species in a data-driven manner is yet to be explored. In this article, I present XTalkiiS (Cross-talks between inter-/intra species pathways), a tool to automatically predict pathway cross-talks from data-driven models of pathway networks, both within the same organism (intra-species) and between two organisms (inter-species). XTalkiiS starts by retrieving and listing up-to-date pathway information for all the species available in the KEGG database using RESTful APIs (exploiting KEGG web services) and an in-house built web crawler. I hypothesize that data-driven network models can be built by simultaneously quantifying co-expression of pathway components (i.e. genes/proteins) in matched samples in multiple organisms. Next, XTalkiiS loads a data-driven pathway network and applies a novel cross-talk modelling approach to determine interactions among known KEGG pathways in selected organisms. XTalkiiS paves the way for novel insights into the mechanisms by which pathways from two species (ideally host-parasite) may interact and contribute to phenotypes of interest, such as malaria. XTalkiiS is open source at https://github.com/Akmazad/XTalkiiS and its binary files are freely available for download from https://sourceforge.net/projects/xtalkiis/.

