gFACs: Filtering, Analysis, and Conversion to Unify Genome Annotations Across Alignment and Gene Prediction Frameworks

2018 ◽  
Author(s):  
Madison Caballero ◽  
Jill Wegrzyn

Abstract Published genome annotations are filled with erroneous gene models that represent issues associated with frame, start site identification, splice sites, and related structural features. The source of these inconsistencies can often be traced to translated text file formats designed to describe long read alignments and predicted gene structures. The majority of gene prediction frameworks do not provide downstream filtering to remove problematic gene annotations, nor do they represent these annotations in a format consistent with current file standards. In addition, these frameworks lack consideration for functional attributes, such as the presence or absence of protein domains, which can be used for gene model validation. To provide oversight to the increasing number of published genome annotations, we present gFACs as a software package to filter, analyze, and convert predicted gene models and alignments. gFACs operates across a wide range of alignment, analysis, and gene prediction software inputs with a flexible framework for defining gene models with reliable structural and functional attributes. gFACs supports common downstream applications, including genome browsers, and generates extensive details on the filtering process, including distributions that can be visualized to further assess the proposed gene space.

2016 ◽  
Author(s):  
Brent S. Pedersen ◽  
Ryan M. Layer ◽  
Aaron R. Quinlan

Abstract Background: The integration of genome annotations and reference databases is critical to the identification of genetic variants that may be of interest in studies of disease or other traits. However, comprehensive variant annotation with diverse file formats is difficult with existing methods. Results: We have developed vcfanno as a flexible toolset that simplifies the annotation of genetic variants in VCF format. Vcfanno can extract and summarize multiple attributes from one or more annotation files and append the resulting annotations to the INFO field of the original VCF file. Vcfanno also integrates the Lua scripting language so that users can easily develop custom annotations and metrics. By leveraging a new parallel “chromosome sweeping” algorithm, it enables rapid annotation of both whole-exome and whole-genome datasets. We demonstrate this performance by annotating over 85.3 million variants in less than 17 minutes (>85,000 variants per second) with 50 attributes from 17 commonly used genome annotation resources. Conclusions: Vcfanno is a flexible software package that provides researchers with the ability to annotate genetic variation with a wide range of datasets and reference databases in diverse genomic formats. Availability: The vcfanno source code is available at https://github.com/brentp/vcfanno under the MIT license, and platform-specific binaries are available at https://github.com/brentp/vcfanno/releases. Detailed documentation is available at http://brentp.github.io/vcfanno/, and the code underlying the analyses presented can be found at https://github.com/brentp/vcfanno/tree/master/scripts/paper.


2020 ◽  
Author(s):  
Nicolas J Wheeler ◽  
Paul M. Airs ◽  
Mostafa Zamanian

Abstract Filarial nematodes (Filarioidea) cause substantial disease burden to humans and animals around the world. Recently there has been a coordinated global effort to generate and curate genomic data from nematode species of medical and veterinary importance. This has resulted in two chromosome-level assemblies (Brugia malayi and Onchocerca volvulus) and 10 additional draft genomes from Filarioidea. These reference assemblies facilitate comparative genomics to explore basic helminth biology and prioritize new drug and vaccine targets. While the continual improvement of genome contiguity and completeness advances these goals, experimental functional annotation of genes is often hindered by poor gene models. Short-read RNA sequencing data and expressed sequence tags, in cooperation with ab initio prediction algorithms, are employed for gene prediction, but these can result in missing clade-specific genes, fragmented models, imperfect mapping of gene ends, and lack of isoform resolution. Long-read RNA sequencing can overcome these drawbacks and greatly improve gene model quality. Here, we present Iso-Seq data for B. malayi and Dirofilaria immitis, etiological agents of lymphatic filariasis and canine heartworm disease, respectively. These data cover approximately half of the known coding genomes and substantially improve gene models by extending untranslated regions, cataloging novel splice junctions from novel isoforms, and correcting mispredicted junctions. Furthermore, we validated computationally predicted operons, identified new operons, and merged fragmented gene models. We carried out analyses of poly(A) tails in both species, leading to the identification of non-canonical poly(A) signals. Finally, we prioritized and assessed known and putative anthelmintic targets, correcting or validating gene models for molecular cloning and target-based antiparasitic screening efforts.
Overall, these data significantly improve the catalog of gene models for two important parasites, and they demonstrate how long-read RNA sequencing should be prioritized for future improvement of parasitic nematode genome assemblies.


2019 ◽  
Author(s):  
Alex Trouern-Trend ◽  
Taylor Falk ◽  
Sumaira Zaman ◽  
Madison Caballero ◽  
David B. Neale ◽  
...  

Abstract Juglans (walnuts), the most speciose genus in the walnut family (Juglandaceae), represents most of the family’s commercially valuable fruit and wood-producing trees and includes several species used as rootstock in agriculture for their resistance to various abiotic and biotic stressors. We present the full structural and functional genome annotations of six Juglans species and one outgroup within Juglandaceae (Juglans regia, J. cathayensis, J. hindsii, J. microcarpa, J. nigra, J. sigillata and Pterocarya stenoptera) produced using the BRAKER2 semi-unsupervised gene prediction pipeline and additional in-house developed tools. For each annotation, gene predictors were trained using 19 tissue-specific J. regia transcriptomes aligned to the genomes. Additional functional evidence and filters were applied to multiexonic and monoexonic putative genes to yield between 27,000 and 44,000 high-confidence gene models per species. Comparison of gene models to the BUSCO embryophyta dataset suggested that, on average, genome annotation completeness was 89.6%. We utilized these high-quality annotations to assess gene family evolution within Juglans and among Juglans and selected Eurosid species, which revealed significant contractions in several gene families in J. hindsii, including disease resistance-related Wall-associated Kinase (WAK) and Catharanthus roseus Receptor-like Kinase (CrRLK1L) and others involved in abiotic stress response. Finally, we confirmed an ancient whole genome duplication that took place in a common ancestor of Juglandaceae using site substitution comparative analysis. Significance: High-quality full genome annotations for six species of walnut (Juglans) and a wingnut (Pterocarya) outgroup were constructed using semi-unsupervised gene prediction followed by gene model filtering and functional characterization. These annotations represent the most comprehensive set for any hardwood genus to date.
Comparative analyses based on the gene models uncovered rapid evolution in multiple gene families related to disease-response and a whole genome duplication in a Juglandaceae common ancestor.


2020 ◽  
Vol 14 (11) ◽  
pp. e0008869
Author(s):  
Nicolas J Wheeler ◽  
Paul M. Airs ◽  
Mostafa Zamanian

Filarial parasitic nematodes (Filarioidea) cause substantial disease burden to humans and animals around the world. Recently there has been a coordinated global effort to generate, annotate, and curate genomic data from nematode species of medical and veterinary importance. This has resulted in two chromosome-level assemblies (Brugia malayi and Onchocerca volvulus) and 11 additional draft genomes from Filarioidea. These reference assemblies facilitate comparative genomics to explore basic helminth biology and prioritize new drug and vaccine targets. While the continual improvement of genome contiguity and completeness advances these goals, experimental functional annotation of genes is often hindered by poor gene models. Short-read RNA sequencing data and expressed sequence tags, in cooperation with ab initio prediction algorithms, are employed for gene prediction, but these can result in missing clade-specific genes, fragmented models, imperfect mapping of gene ends, and lack of isoform resolution. Long-read RNA sequencing can overcome these drawbacks and greatly improve gene model quality. Here, we present Iso-Seq data for B. malayi and Dirofilaria immitis, etiological agents of lymphatic filariasis and canine heartworm disease, respectively. These data cover approximately half of the known coding genomes and substantially improve gene models by extending untranslated regions, cataloging novel splice junctions from novel isoforms, and correcting mispredicted junctions. Furthermore, we validated computationally predicted operons, manually curated new operons, and merged fragmented gene models. We carried out analyses of poly(A) tails in both species, leading to the identification of non-canonical poly(A) signals. Finally, we prioritized and assessed known and putative anthelmintic targets, correcting or validating gene models for molecular cloning and target-based anthelmintic screening efforts. 
Overall, these data significantly improve the catalog of gene models for two important parasites, and they demonstrate how long-read RNA sequencing should be prioritized for ongoing improvement of parasitic nematode genome assemblies.


2020 ◽  
Author(s):  
Richard Kuo ◽  
Yuanyuan Cheng ◽  
Runxuan Zhang ◽  
John W.S. Brown ◽  
Jacqueline Smith ◽  
...  

Abstract Background: The human transcriptome annotation is regarded as one of the most complete of any eukaryotic species. However, limitations in sequencing technologies have biased the annotation toward multi-exonic protein coding genes. Accurate high-throughput long read transcript sequencing can now provide stronger evidence for genes that were previously either undetectable or impossible to differentiate from sequencing noise, such as rare transcripts, mono-exonic, and non-coding genes. Results: We analyzed Sequel II Iso-Seq sequencing data of the Universal Human Reference RNA (UHRR) using the Transcriptome Annotation by Modular Algorithms (TAMA) software. We found that the convention of using mapping identity to measure error correction performance does not reflect actual gain in accuracy of predicted transcript models. In addition, inter-read error correction leads to thousands of erroneous gene models. Using genome assembly based error correction and gene feature evidence, we identified thousands of potentially functional novel genes. Conclusions: The standard of using inter-read error correction for long read RNA sequencing data could be responsible for genome annotations with thousands of biologically inaccurate gene models. More than half of all real genes in the human genome may still be missing in current public annotations. We require better methods for differentiating sequencing noise from real genes in long read RNA sequencing data.


2019 ◽  
Vol 26 (10) ◽  
pp. 743-750 ◽  
Author(s):  
Remya Radha ◽  
Sathyanarayana N. Gummadi

Background: pH is one of the decisive macromolecular properties of proteins that significantly affects enzyme structure, stability and reaction rate. Change in pH may protonate or deprotonate the side group of amino acid residues in the protein, thereby resulting in changes in chemical and structural features. Hence studies on the kinetics of enzyme deactivation by pH are important for assessing the bio-functionality of industrial enzymes. L-asparaginase is one such important enzyme that has potent applications in cancer therapy and the food industry. Objective: The objective of the study is to understand and analyze the influence of pH on deactivation and stability of Vibrio cholerae L-asparaginase. Methods: Kinetic studies were conducted to analyze the effect of pH on stability and deactivation of Vibrio cholerae L-asparaginase. Circular Dichroism (CD) and Differential Scanning Calorimetry (DSC) studies have been carried out to understand the pH-dependent conformational changes in the secondary structure of V. cholerae L-asparaginase. Results: The enzyme was found to be least stable at extreme acidic conditions (pH < 4.5) and exhibited a gradual increase in melting temperature from 40 to 81 °C within the pH range of 4.0 to 7.0. Thermodynamic properties of the protein were estimated, and at pH 7.0 the protein exhibited ΔG37 of 26.31 kcal mol⁻¹, ΔH of 204.27 kcal mol⁻¹ and ΔS of 574.06 cal mol⁻¹ K⁻¹. Conclusion: The stability and thermodynamic analysis revealed that V. cholerae L-asparaginase was highly stable over a wide range of pH, with the highest stability in the pH range of 5.0–7.0.


2020 ◽  
Vol 09 ◽  
Author(s):  
Minita Ojha ◽  
R. K. Bansal

Background: During the last two decades, the horizon of research in the field of Nitrogen Heterocyclic Carbenes (NHC) has widened remarkably. NHCs have emerged as ubiquitous species having applications in a broad range of fields, including organocatalysis and organometallic chemistry. NHC-induced non-asymmetric catalysis has turned out to be a really fruitful area of research in recent years. Methods: By manipulating structural features and selecting appropriate substituent groups, it has been possible to control the kinetic and thermodynamic stability of a wide range of NHCs, which can be tolerant to a variety of functional groups and can be used under mild conditions. NHCs are produced by different methods, such as deprotonation of N-alkylheterocyclic salts, transmetallation, decarboxylation and electrochemical reduction. Results: NHCs have been used successfully as catalysts for a wide range of reactions, making a large number of building blocks and other useful compounds accessible. Some of these reactions are: benzoin condensation, Stetter reaction, Michael reaction, esterification, activation of esters, activation of isocyanides, polymerization, different cycloaddition reactions, isomerization, etc. The present review includes all these examples published during the last 10 years, i.e. from 2010 till date. Conclusion: NHCs have emerged as versatile and powerful organocatalysts in synthetic organic chemistry. They provide a synthetic strategy that does not burden the environment with metal pollutants and thus fits within Green Chemistry.


2021 ◽  
Vol 11 (4) ◽  
Author(s):  
Yury A Barbitoff ◽  
Andrew G Matveenko ◽  
Anton B Matiiv ◽  
Evgeniia M Maksiutenko ◽  
Svetlana E Moskalenko ◽  
...  

Abstract Thousands of yeast genomes have been sequenced with both traditional and long-read technologies, and multiple observations about modes of genome evolution for both wild and laboratory strains have been drawn from these sequences. In our study, we applied Oxford Nanopore and Illumina technologies to assemble complete genomes of two widely used members of a distinct laboratory yeast lineage, the Peterhof Genetic Collection (PGC), and investigate the structural features of these genomes including transposable element content, copy number alterations, and structural rearrangements. We identified numerous notable structural differences between genomes of PGC strains and the reference S288C strain. We discovered a substantial enrichment of mid-length insertions and deletions within repetitive coding sequences, such as in the SCH9 gene or the NUP100 gene, with possible impact of these variants on protein amyloidogenicity. High contiguity of the final assemblies allowed us to trace back the history of reciprocal unbalanced translocations between chromosomes I, VIII, IX, XI, and XVI of the PGC strains. We show that formation of hybrid alleles of the FLO genes during such chromosomal rearrangements is likely responsible for the lack of invasive growth of yeast strains. Taken together, our results highlight important features of laboratory yeast strain evolution using the power of long-read sequencing.


Materials ◽  
2021 ◽  
Vol 14 (6) ◽  
pp. 1486
Author(s):  
Eugene B. Caldona ◽  
Ernesto I. Borrego ◽  
Ketki E. Shelar ◽  
Karl M. Mukeba ◽  
Dennis W. Smith

Many desirable characteristics of polymers arise from the method of polymerization and structural features of their repeat units, which typically are responsible for the polymer’s performance at the cost of processability. While linear alternatives are popular, polymers composed of cyclic repeat units across their backbones have generally been shown to exhibit higher optical transparency, lower water absorption, and higher glass transition temperatures. These specifically include polymers built with either substituted alicyclic structures or aromatic rings, or both. In this review article, we highlight two useful ring-forming polymer groups, perfluorocyclobutyl (PFCB) aryl ether polymers and ortho-diynylarene- (ODA) based thermosets, both demonstrating outstanding thermal stability, chemical resistance, mechanical integrity, and improved processability. Different synthetic routes (with emphasis on ring-forming polymerization) and properties for these polymers are discussed, followed by their relevant applications across a wide range of areas.


2021 ◽  
Vol 11 (2) ◽  
Author(s):  
Suzanne V Saenko ◽  
Dick S J Groenenberg ◽  
Angus Davison ◽  
Menno Schilthuizen

Abstract Studies on the shell color and banding polymorphism of the grove snail Cepaea nemoralis and the sister taxon Cepaea hortensis have provided compelling evidence for the fundamental role of natural selection in promoting and maintaining intraspecific variation. More recently, Cepaea has been the focus of citizen science projects on shell color evolution in relation to climate change and urbanization. C. nemoralis is particularly useful for studies on the genetics of shell polymorphism and the evolution of “supergenes,” as well as evo-devo studies of shell biomineralization, because it is relatively easily maintained in captivity. However, an absence of genomic resources for C. nemoralis has generally hindered detailed genetic and molecular investigations. We therefore generated ∼23× coverage long-read data for the ∼3.5 Gb genome, and produced a draft assembly composed of 28,537 contigs with the N50 length of 333 kb. Genome completeness, estimated by BUSCO using the metazoa dataset, was 91%. Repetitive regions cover over 77% of the genome. A total of 43,519 protein-coding genes were predicted in the assembled genome, and 97.3% of these were functionally annotated from either sequence homology or protein signature searches. This first assembled and annotated genome sequence for a helicoid snail, a large group that includes edible species, agricultural pests, and parasite hosts, will be a core resource for identifying the loci that determine the shell polymorphism, as well as in a wide range of analyses in evolutionary and developmental biology, and snail biology in general.

