Balrog: A universal protein model for prokaryotic gene prediction

2021 ◽  
Vol 17 (2) ◽  
pp. e1008727
Author(s):  
Markus J. Sommer ◽  
Steven L. Salzberg

Low-cost, high-throughput sequencing has led to an enormous increase in the number of sequenced microbial genomes, with well over 100,000 genomes in public archives today. Automatic genome annotation tools are integral to understanding these organisms, yet older gene finding methods must be retrained on each new genome. We have developed a universal model of prokaryotic genes by fitting a temporal convolutional network to amino-acid sequences from a large, diverse set of microbial genomes. We incorporated the new model into a gene finding system, Balrog (Bacterial Annotation by Learned Representation Of Genes), which does not require genome-specific training and which matches or outperforms other state-of-the-art gene finding tools. Balrog is freely available under the MIT license at https://github.com/salzberg-lab/Balrog.

2020 ◽  
Author(s):  
Markus J. Sommer ◽  
Steven L. Salzberg

Author summary
Annotating the protein-coding genes in a newly sequenced prokaryotic genome is a critical part of describing their biological function. Relative to eukaryotic genomes, prokaryotic genomes are small and structurally simple, with 90% of their DNA typically devoted to protein-coding genes. Current computational gene finding tools are therefore able to achieve close to 99% sensitivity to known genes using species-specific gene models.
Though highly sensitive at finding known genes, all current prokaryotic gene finders also predict large numbers of additional genes, which are labelled as “hypothetical protein” in GenBank and other annotation databases. Many hypothetical gene predictions likely represent true protein-coding sequence, but it is not known how many of them represent false positives. Additionally, all current gene finding tools must be trained specifically for each genome as a preliminary step in order to achieve high sensitivity. This requirement limits their ability to detect genes in fragmented sequences commonly seen in metagenomic samples.
We took a data-driven approach to prokaryotic gene finding, relying on the large and diverse collection of already-sequenced genomes. By training a single, universal model of bacterial genes on protein sequences from many different species, we were able to match the sensitivity of current gene finders while reducing the overall number of gene predictions. Our model does not need to be refit on any new genome. Balrog (Bacterial Annotation by Learned Representation of Genes) represents a fundamentally different yet effective method for prokaryotic gene finding.


Author(s):  
Romesh Kumar Salgotra ◽  
Rafiq Ahmad Bhat ◽  
Deyue Yu ◽  
Javaid Akhter Bhat

Abstract: Over the past two decades, advances in next-generation sequencing (NGS) platforms have led to the identification of numerous genes/QTLs at high resolution for potential use in crop improvement. The genomic resources generated through these high-throughput sequencing techniques have been used efficiently to screen for genes of interest, particularly for numerous types of plant stresses and quality traits. Subsequently, the identified markers linked to a particular trait have been used in marker-assisted backcross breeding (MABB) activities. These markers are also being used to catalogue food crops for the detection of adulteration to improve the quality of food. As technologies advance, genomic resources continue to yield new markers; however, to use these markers efficiently in crop breeding, high-throughput techniques (HTT) such as multiplex PCR and capillary electrophoresis (CE) can be exploited. Robustness, ease of operation, good reproducibility and low cost are the main advantages of multiplex PCR and CE. CE is capable of separating and characterizing proteins with simplicity, speed and small sample requirements. Given the vast data generated through NGS techniques and the development of numerous markers, there is a need to use these resources efficiently in crop improvement programmes. In summary, this review describes the use of molecular markers in the screening of resistance genes in breeding programmes and the detection of adulteration in food crops using high-throughput techniques.


2021 ◽  
Author(s):  
Jiaqi Li ◽  
Lei Wei ◽  
Xianglin Zhang ◽  
Wei Zhang ◽  
Haochen Wang ◽  
...  

Abstract: Detecting cancer signals in cell-free DNA (cfDNA) high-throughput sequencing data is emerging as a novel non-invasive cancer detection method. Due to the high cost of sequencing, it is crucial to make robust and precise predictions with low-depth cfDNA sequencing data. Here we propose a novel approach named DISMIR, which can provide ultrasensitive and robust cancer detection by integrating DNA sequence and methylation information in plasma cfDNA whole genome bisulfite sequencing (WGBS) data. DISMIR introduces a new feature termed the “switching region” to define cancer-specific differentially methylated regions, which can enrich the cancer-related signal at read resolution. DISMIR applies a deep learning model to predict the source of every single read based on its DNA sequence and methylation state, and then predicts the risk that the plasma donor is suffering from cancer. DISMIR exhibited high accuracy and robustness in hepatocellular carcinoma detection from plasma cfDNA WGBS data, even at ultra-low sequencing depths. Analysis showed that DISMIR tends to be insensitive to alterations in the methylation states of single CpG sites, which suggests that DISMIR could resist the technical noise of WGBS. All these results show that DISMIR has the potential to be a precise and robust method for low-cost early cancer detection.


2020 ◽  
Author(s):  
Jacob Bien ◽  
Xiaohan Yan ◽  
Léo Simpson ◽  
Christian L. Müller

Abstract: Modern high-throughput sequencing technologies provide low-cost microbiome survey data across all habitats of life at unprecedented scale. At the most granular level, the primary data consist of sparse counts of amplicon sequence variants or operational taxonomic units that are associated with taxonomic and phylogenetic group information. In this contribution, we leverage the hierarchical structure of amplicon data and propose a data-driven, parameter-free, and scalable tree-guided aggregation framework to associate microbial subcompositions with response variables of interest. The excess number of zero or low count measurements at the read level forces traditional microbiome data analysis workflows to remove rare sequencing variants or group them by a fixed taxonomic rank, such as genus or phylum, or by phylogenetic similarity. By contrast, our framework, which we call trac (tree-aggregation of compositional data), learns data-adaptive taxon aggregation levels for predictive modeling, making user-defined aggregation obsolete while simultaneously integrating seamlessly into the compositional data analysis framework. We illustrate the versatility of our framework in the context of large-scale regression problems in human-gut, soil, and marine microbial ecosystems. We posit that the inferred aggregation levels provide highly interpretable taxon groupings that can help microbial ecologists gain insights into the structure and functioning of the underlying ecosystem of interest.
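The fixed-rank grouping that the abstract attributes to traditional workflows can be sketched in a few lines of Python. This is not the trac method itself, which learns the aggregation levels from the data; it only shows the user-defined aggregation step that trac aims to replace. The function name `aggregate_by_rank`, the lineage tuples, and the counts are all hypothetical toy values.

```python
from collections import defaultdict


def aggregate_by_rank(counts, lineages, depth):
    """Sum per-variant counts over the first `depth` levels of each lineage.

    counts:   dict mapping a variant ID to its read count
    lineages: dict mapping a variant ID to a tuple of taxonomic names,
              ordered from the broadest rank to the narrowest
    depth:    number of leading ranks to keep (the fixed, user-chosen level)
    """
    totals = defaultdict(int)
    for variant, count in counts.items():
        # Variants sharing the same lineage prefix fall into the same group.
        totals[lineages[variant][:depth]] += count
    return dict(totals)


# Toy data (hypothetical): three variants with kingdom/phylum/genus lineages.
counts = {"asv1": 5, "asv2": 3, "asv3": 2}
lineages = {
    "asv1": ("Bacteria", "Firmicutes", "Lactobacillus"),
    "asv2": ("Bacteria", "Firmicutes", "Bacillus"),
    "asv3": ("Bacteria", "Proteobacteria", "Escherichia"),
}
phylum_totals = aggregate_by_rank(counts, lineages, depth=2)
```

Choosing `depth` by hand is exactly the decision the abstract describes as being made data-adaptively in trac.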


2021 ◽  
Vol 18 (1) ◽  
Author(s):  
Jian Zeng ◽  
Yan Wang ◽  
Ju Zhang ◽  
Shixing Yang ◽  
Wen Zhang

Abstract: Members of the family Inoviridae (inoviruses) are characterized by their unique filamentous morphology and infection cycle. The viral genome of an inovirus is able to integrate into the host genome and continuously release virions without lysing the host, establishing chronic infection. A large number of inoviruses have been obtained from microbial genomes and metagenomes recently, but putative novel inoviruses remain to be identified. Here, using viral metagenomics, we identified four novel inoviruses from cloacal swab samples of wild and breeding birds. The circular genomes of these four inoviruses are 6732 to 7709 nt in length with 51.4% to 56.5% GC content and encode 9 to 13 open reading frames. The zonula occludens toxin gene, implicated in the virulence of pathogenic host bacteria, was identified in all four inoviruses and shared the highest amino acid sequence identity (<37.3%) with other reference strains belonging to different genera of the family Inoviridae and among themselves. Phylogenetic analysis indicated that all four inoviruses were genetically distant from other strains belonging to the family Inoviridae and formed an independent clade. According to genetic distance-based criteria, the four inoviruses identified in the present study belong to four novel putative genera in the family Inoviridae.
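GC content, as reported for the genomes above, is conventionally the percentage of G and C bases among all bases in a sequence. A minimal Python sketch follows; the function name `gc_content` and the toy sequence are hypothetical (not one of the four genomes), and the choice to ignore ambiguous bases such as N is an assumption rather than something the abstract states.

```python
def gc_content(sequence):
    """Return the percentage of G and C among the A/C/G/T bases of a sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        # Assumption: a sequence with no unambiguous bases has no defined GC content.
        raise ValueError("sequence contains no A, C, G, or T bases")
    return 100.0 * gc / total


# Toy sequence (hypothetical): 4 of 8 bases are G or C.
toy_percent = gc_content("GGCCAATT")
```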


2018 ◽  
Vol 64 (10) ◽  
pp. 761-773 ◽  
Author(s):  
Joost T.P. Verhoeven ◽  
Marta Canuti ◽  
Hannah J. Munro ◽  
Suzanne C. Dufour ◽  
Andrew S. Lang

High-throughput sequencing (HTS) technologies are becoming increasingly important within microbiology research, but aspects of library preparation, such as high cost per sample or strict input requirements, make HTS difficult to implement in some niche applications and for research groups on a budget. To address these needs, we developed ViDiT, a customizable, PCR-based, extremely low-cost (less than US$5 per sample), and versatile library preparation method, and CACTUS, an analysis pipeline designed to rely on cloud computing power to generate high-quality data from ViDiT-based experiments without the need for expensive servers. We demonstrate here the versatility and utility of these methods within three fields of microbiology: virus discovery, amplicon-based viral genome sequencing, and microbiome profiling. ViDiT–CACTUS allowed the identification of viral fragments from 25 different viral families from 36 oropharyngeal–cloacal swabs collected from wild birds, the sequencing of three almost complete genomes of avian influenza A viruses (>90% coverage), and the characterization and functional profiling of the complete microbial diversity (bacteria, archaea, viruses) within a deep-sea carnivorous sponge. ViDiT–CACTUS demonstrated its validity in a wide range of microbiology applications, and its simplicity and modularity make it easy to implement in any molecular biology laboratory for a variety of research goals.


2019 ◽  
Vol 35 (17) ◽  
pp. 2932-2940 ◽  
Author(s):  
Subrata Saha ◽  
Jethro Johnson ◽  
Soumitra Pal ◽  
George M Weinstock ◽  
Sanguthevar Rajasekaran

Abstract
Motivation: Metagenomics is the study of genetic material sampled directly from natural habitats. It has the potential to reveal previously hidden diversity of microscopic life, largely due to the existence of highly parallel and low-cost next-generation sequencing technology. Conventional approaches align metagenomic reads onto known reference genomes to identify microbes in the sample. Since such a collection of reference genomes is very large, the approach often needs high-end computing machines with large memory, which are often not available to researchers. Alternative approaches follow an alignment-free methodology where the presence of a microbe is predicted using information about the unique k-mers present in the microbial genomes. However, such approaches suffer from high false-positive rates because the value of k must be traded off against computational resources. In this article, we propose a highly efficient metagenomic sequence classification (MSC) algorithm that is a hybrid of both approaches. Instead of aligning reads to the full genomes, MSC aligns reads onto a set of carefully chosen, shorter and highly discriminating model sequences built from the unique k-mers of each of the reference sequences.
Results: Microbiome researchers are generally interested in two objectives of a taxonomic classifier: (i) to detect prevalence, i.e. the taxa present in a sample, and (ii) to estimate their relative abundances. MSC is primarily designed to detect prevalence, and experimental results show that MSC is indeed more effective and efficient than other state-of-the-art algorithms in terms of accuracy, memory and runtime. Moreover, MSC outputs an approximate estimate of the abundances.
Availability and implementation: The implementations are freely available for non-commercial purposes. They can be downloaded from https://drive.google.com/open?id=1XirkAamkQ3ltWvI1W1igYQFusp9DHtVl.
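The notion of k-mers that are unique to one reference, which both the alignment-free approaches and MSC's model sequences rely on, can be illustrated with a short Python sketch. It is not the MSC implementation: it does not build model sequences or align reads, and simply counts exact matches against each reference's unique k-mers. The function names, the toy references, and the small value of k are all hypothetical.

```python
from collections import defaultdict


def unique_kmers(references, k):
    """For each reference name, return the set of k-mers found in no other reference."""
    owners = defaultdict(set)  # k-mer -> names of the references that contain it
    for name, seq in references.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(name)
    unique = {name: set() for name in references}
    for kmer, names in owners.items():
        if len(names) == 1:  # seen in exactly one reference
            unique[next(iter(names))].add(kmer)
    return unique


def count_unique_hits(read, unique, k):
    """Count, per reference, the read's k-mers that are unique to that reference."""
    hits = defaultdict(int)
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        for name, kmers in unique.items():
            if kmer in kmers:
                hits[name] += 1
    return dict(hits)


# Toy references (hypothetical); "CGT" occurs in both, so it is unique to neither.
references = {"A": "AAACGT", "B": "CGTTTT"}
unique = unique_kmers(references, k=3)
hits = count_unique_hits("AAACG", unique, k=3)
```

A small k makes more k-mers shared between references and so less discriminating, while a larger k increases the number of distinct k-mers to store, which is the trade-off the abstract mentions.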


2019 ◽  
Vol 75 (2) ◽  
pp. 296-299 ◽  
Author(s):  
Mark van der Linden ◽  
Rafael Mamede ◽  
Natascha Levina ◽  
Peter Helwig ◽  
Pedro Vila-Cerqueira ◽  
...  

Abstract
Objectives: Streptococcus agalactiae [group B streptococci (GBS)] have been considered uniformly susceptible to penicillin. However, increasing reports from Asia and North America are documenting penicillin-non-susceptible GBS (PRGBS) with mutations in pbp genes. Here we report, to the best of our knowledge, the first two PRGBS isolates recovered in Europe (AC-13238-1 and AC-13238-2), isolated from the same patient.
Methods: Two different colony morphologies of GBS were noted from a surgical abscess drainage sample. Both were serotyped and antimicrobial susceptibility testing was performed by different methodologies. High-throughput sequencing was done to compare the isolates at the genomic level, to identify their capsular type and ST, to evaluate mutations in the pbp genes and to compare the isolates with the genomes of other PRGBS isolates sharing the same serotype and ST.
Results: Isolates AC-13238-1 and AC-13238-2 presented MICs above the EUCAST and CLSI breakpoints for penicillin susceptibility. Both shared the capsular type Ia operon and ST23. Genomic analysis uncovered differences between the two isolates in seven genes, including altered pbp genes. Deduced amino acid sequences revealed critical substitutions in PBP2X in both isolates. Comparison with serotype Ia clonal complex 23 PRGBS from the USA reinforced the similarity between AC-13238-1 and AC-13238-2, and their divergence from the US strains.
Conclusions: Our results support the in-host evolution of β-lactam-resistant GBS, with two PRGBS variants being isolated from one patient.


Planta Medica ◽  
2019 ◽  
Vol 85 (14/15) ◽  
pp. 1168-1176
Author(s):  
Yingfang Wang ◽  
Mengyuan Peng ◽  
Yanlin Chen ◽  
Wenjuan Wang ◽  
Zhihua He ◽  
...  

Abstract Panax ginseng has been widely and effectively used as medicine for thousands of years. However, only limited studies have been conducted to date on ginseng miRNAs. In the present study, we collected 3 ginseng samples from the Changbai Mountain in China. Small RNA libraries were constructed and sequenced on the Illumina HiSeq platform. Sequencing analyses identified 3798 miRNAs, including 298 known miRNAs and 3500 potentially novel miRNAs. The miR166, miR159, and miR396 families were among the most highly expressed miRNAs in all libraries. The results of miRNA expression analyses were validated by qRT-PCR. Target gene prediction through computational and pathway annotation analyses revealed that the primary pathways were related to plant development, including metabolic processes and single-organism processes. It has been reported that plant miRNAs might be one of the hidden bioactive ingredients in medicinal plants. Based on the combined use of RNAhybrid, Miranda, and TargetScan software, a total of 50,992 potential human genes were predicted as the putative targets of 2868 miRNAs. Interestingly, the enriched KEGG pathways were associated with some human diseases, especially cancer, immune system diseases, and neurological disorders, and this could support the clinical use of ginseng. However, the human targets of ginseng miRNAs should be confirmed by further experimental validation. Our results provided valuable insight into ginseng miRNAs and the putative roles of these miRNAs.


2020 ◽  
Vol 6 (17) ◽  
pp. eaay9093 ◽  
Author(s):  
Hidetaka Tanno ◽  
Jonathan R. McDaniel ◽  
Christopher A. Stevens ◽  
William N. Voss ◽  
Jie Li ◽  
...  

Natively paired sequencing (NPS) of B cell receptors [variable heavy (VH) and light (VL)] and T cell receptors (TCRb and TCRa) is essential for the understanding of adaptive immunity in health and disease. Despite many recent technical advances, determining the VH:VL or TCRb:a repertoire with high accuracy and throughput remains challenging. We discovered that the recently engineered xenopolymerase, RTX, is exceptionally resistant to cell lysate inhibition in single-cell emulsion droplets. We capitalized on the characteristics of this enzyme to develop a simple, rapid, and inexpensive in-droplet overlap extension reverse transcription polymerase chain reaction method for NPS not requiring microfluidics or other specialized equipment. Using this technique, we obtained high yields (5000 to >20,000 per sample) of paired VH:VL or TCRb:a clonotypes at low cost. As a demonstration, we performed NPS on peripheral blood plasmablasts and T follicular helper cells following seasonal influenza vaccination and discovered high-affinity influenza-specific antibodies and TCRb:a.

