Gene Finding: Latest Research Papers

To address a need for improved tools for annotation and comparative genomics of bacteriophage genomes, we developed multiPhATE2. As an extension of multiPhATE, a functional annotation code released previously, multiPhATE2 performs gene finding using multiple algorithms, compares the results of the algorithms, performs functional annotation of coding sequences, and incorporates additional search algorithms and databases to extend the search space of the original code. MultiPhATE2 performs gene matching among sets of closely related bacteriophage genomes and uses multiprocessing to speed up computations. MultiPhATE2 can be restarted at multiple points within the workflow, allowing the user to examine intermediate results and adjust the subsequent computations accordingly. In addition, multiPhATE2 accommodates custom gene calls and sequence databases, which adds further flexibility. MultiPhATE2 was implemented in Python 3.7 and runs as a command-line program under Linux or macOS. Full documentation is provided as a README file and a Wiki website.
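As a rough illustration of what comparing the output of several gene callers can involve, the Python sketch below groups calls by strand and stop coordinate and notes where the predicted starts disagree. The caller names, the coordinates, and the `compare_calls` helper are hypothetical and are not taken from multiPhATE2.

```python
from collections import defaultdict

# Hypothetical gene calls per caller: (strand, start, stop), 1-based.
# "start" is the 5' end of the gene, so start > stop on the "-" strand.
calls = {
    "caller_a": [("+", 100, 400), ("-", 900, 500), ("+", 1200, 1500)],
    "caller_b": [("+", 130, 400), ("-", 900, 500)],
    "caller_c": [("+", 100, 400), ("+", 1200, 1500), ("+", 2000, 2300)],
}


def compare_calls(calls_by_caller):
    """Group calls that share a strand and stop coordinate, then report
    which callers found each gene and whether their starts agree."""
    grouped = defaultdict(dict)
    for caller, genes in calls_by_caller.items():
        for strand, start, stop in genes:
            grouped[(strand, stop)][caller] = start

    report = []
    # Sort by stop coordinate so the report follows genome order.
    for (strand, stop), starts in sorted(grouped.items(), key=lambda kv: kv[0][1]):
        report.append({
            "strand": strand,
            "stop": stop,
            "callers": sorted(starts),
            "starts_agree": len(set(starts.values())) == 1,
        })
    return report


report = compare_calls(calls)
for row in report:
    print(row)
```

Matching on the stop coordinate is a common simplification, since callers that find the same gene usually agree on the stop and differ mainly in the start.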

Balrog: A universal protein model for prokaryotic gene prediction

PLoS Computational Biology ◽ 10.1371/journal.pcbi.1008727 ◽ 2021 ◽ Vol 17 (2) ◽ pp. e1008727

Author(s): Markus J. Sommer ◽ Steven L. Salzberg

Keyword(s): High Throughput Sequencing ◽ Gene Prediction ◽ Low Cost ◽ Amino Acid Sequences ◽ Gene Finding ◽ Convolutional Network ◽ Microbial Genomes ◽ Protein Model ◽ Public Archives ◽ Universal Protein

Low-cost, high-throughput sequencing has led to an enormous increase in the number of sequenced microbial genomes, with well over 100,000 genomes in public archives today. Automatic genome annotation tools are integral to understanding these organisms, yet older gene finding methods must be retrained on each new genome. We have developed a universal model of prokaryotic genes by fitting a temporal convolutional network to amino-acid sequences from a large, diverse set of microbial genomes. We incorporated the new model into a gene finding system, Balrog (Bacterial Annotation by Learned Representation Of Genes), which does not require genome-specific training and which matches or outperforms other state-of-the-art gene finding tools. Balrog is freely available under the MIT license at https://github.com/salzberg-lab/Balrog.
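A protein-level gene model needs candidate amino-acid sequences to score. The Python sketch below shows one simple way to obtain them: enumerate ATG-to-stop open reading frames on the forward strand and translate them with the standard genetic code. It is an illustration under stated assumptions, not Balrog's pipeline; the `find_orfs` helper, the `min_codons` threshold, and the example sequence are hypothetical, and the reverse strand and alternative start codons are ignored for brevity.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def find_orfs(seq, min_codons=3):
    """Yield (start, end, protein) for ATG-to-stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive; the span includes the stop
    codon, while the protein string does not.
    """
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame ATG opens the ORF
            elif start is not None and CODON_TABLE.get(codon) == "*":
                # Unknown codons (e.g. containing N) translate to "X".
                protein = "".join(
                    CODON_TABLE.get(seq[j:j + 3], "X") for j in range(start, i, 3)
                )
                if len(protein) >= min_codons:
                    yield start, i + 3, protein
                start = None


orfs = list(find_orfs("CCATGAAAGGGTTTTAACC"))
print(orfs)
```

In a real gene finder, each translated candidate would then be scored by the trained model, and overlapping candidates would be resolved into a final gene set.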

The power of next-generation sequencing and machine learning for causal gene finding and prediction of phenotypes.

10.1079/9781789249095.0041 ◽ 2021 ◽ pp. 401-410

Author(s): Anna S. Sowa ◽ Lisa Dussling ◽ Jörg Hagmann ◽ Sebastian J. Schultheiss

Keyword(s): Machine Learning ◽ Next Generation Sequencing ◽ Single Step ◽ Gene Finding ◽ Next Generation ◽ Causal Gene ◽ Efficient Manner ◽ Next Generation Sequencing (NGS) ◽ NGS Data ◽ Generation Sequencing

The wide application of next-generation sequencing (NGS) has facilitated and accelerated causal gene finding and breeding in the plant sciences. A wide variety of techniques and computational strategies is available, and these need to be tailored to the species, the genetic architecture of the trait of interest, the breeding system, and the available resources. With these NGS methods, the typical computational steps of marker discovery, genetic mapping, and identification of causal mutations can be achieved in a single step in a cost- and time-efficient manner. Rather than focusing on a few high-impact genetic variants that explain phenotypes, increased computational power allows phenotypes to be modelled from genome-wide molecular markers, an approach known as genomic selection (GS). Based solely on this genotype information, modern GS approaches can accurately predict breeding values for a given trait (the average effects of alleles over all loci that are anticipated to be transferred from the parent to the progeny), given a large training population of genotyped and phenotyped individuals (Crossa et al., 2017). Once trained, the model offers great reductions in breeding time and costs. We advocate improving conventional GS methods by applying advanced techniques based on machine learning (ML) and outline how this approach can also be used for causal gene finding. Beyond the genetic causes of agronomically important traits, epigenetic mechanisms such as DNA methylation play a crucial role in shaping phenotypes and can become interesting targets in breeding pipelines. We highlight an ML approach shown to detect functional methylation changes sensitively from NGS data. We give an overview of commonly applied strategies and provide practical considerations for choosing and performing NGS-based gene finding and NGS-assisted breeding.

StartLink+: Prediction of Gene Starts in Prokaryotic Genomes by an Algorithm Integrating Independent Sources of Evidence

10.1101/2020.10.25.352625 ◽ 2020

Author(s): Karl Gemayel ◽ Alexandre Lomsadze ◽ Mark Borodovsky

Keyword(s): Ab Initio ◽ Sequence Alignment ◽ Multiple Sequence Alignment ◽ Large Set ◽ Gene Finding ◽ Multiple Sequence ◽ Gene Start ◽ Prokaryotic Genomes

Ab initio gene finding algorithms have been shown to make sufficiently accurate predictions in prokaryotic genomes. Nonetheless, for as many as 15-25% of genes per genome, the gene start predictions may differ even among the supposedly most accurate tools. To address this discrepancy, we have introduced StartLink+, an approach combining ab initio and multiple sequence alignment based methods. StartLink+ makes predictions for a majority of genes per genome (73% on average); in tests on sets of genes with experimentally verified starts, the accuracy of StartLink+ was shown to be 98-99%. When StartLink+ predictions made for a large set of prokaryotic genomes were compared with the database annotations, we observed that on average the gene start annotations deviated from the predictions for ~5% of genes in AT-rich genomes and for 10-15% of genes in GC-rich genomes.
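One simple way to integrate two independent sources of evidence is to report a gene start only when both sources agree, which would also explain why predictions cover a subset of genes rather than all of them. The Python sketch below assumes this agreement rule; the abstract does not spell out the exact procedure, and the `consensus_starts` helper, gene IDs, and coordinates are all made up for illustration.

```python
def consensus_starts(ab_initio, alignment_based):
    """Keep a gene start only when both independent predictors agree.

    Both inputs map a gene ID to a predicted start coordinate. Genes that
    are missing from either input, or whose starts differ, are left out.
    """
    shared = ab_initio.keys() & alignment_based.keys()
    return {
        gene: ab_initio[gene]
        for gene in sorted(shared)
        if ab_initio[gene] == alignment_based[gene]
    }


# Hypothetical predictions from an ab initio tool and an alignment-based tool.
ab_initio = {"g1": 100, "g2": 250, "g3": 400, "g4": 610}
alignment_based = {"g1": 100, "g2": 262, "g3": 400}

agreed = consensus_starts(ab_initio, alignment_based)
coverage = len(agreed) / len(ab_initio)
print(agreed, f"coverage={coverage:.0%}")
```

The design trades coverage for precision: disagreements are withheld rather than resolved, so the reported starts are fewer but each is backed by two lines of evidence.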

Balrog: A universal protein model for prokaryotic gene prediction

10.1101/2020.09.06.285304 ◽ 2020

Author(s): Markus J. Sommer ◽ Steven L. Salzberg

Keyword(s): High Throughput Sequencing ◽ Prokaryotic Genome ◽ Amino Acid Sequences ◽ Specific Gene ◽ Universal Model ◽ Gene Finding ◽ Convolutional Network ◽ Protein Coding ◽ Microbial Genomes ◽ Protein Coding Genes

Low-cost, high-throughput sequencing has led to an enormous increase in the number of sequenced microbial genomes, with well over 100,000 genomes in public archives today. Automatic genome annotation tools are integral to understanding these organisms, yet older gene finding methods must be retrained on each new genome. We have developed a universal model of prokaryotic genes by fitting a temporal convolutional network to amino-acid sequences from a large, diverse set of microbial genomes. We incorporated the new model into a gene finding system, Balrog (Bacterial Annotation by Learned Representation Of Genes), which does not require genome-specific training and which matches or outperforms other state-of-the-art gene finding tools. Balrog is freely available under the MIT license at https://github.com/salzberg-lab/Balrog.

Author summary:

Annotating the protein-coding genes in a newly sequenced prokaryotic genome is a critical part of describing its biological function. Relative to eukaryotic genomes, prokaryotic genomes are small and structurally simple, with 90% of their DNA typically devoted to protein-coding genes. Current computational gene finding tools are therefore able to achieve close to 99% sensitivity to known genes using species-specific gene models.

Though highly sensitive at finding known genes, all current prokaryotic gene finders also predict large numbers of additional genes, which are labelled as “hypothetical protein” in GenBank and other annotation databases. Many hypothetical gene predictions likely represent true protein-coding sequence, but it is not known how many of them are false positives. Additionally, all current gene finding tools must be trained specifically for each genome as a preliminary step in order to achieve high sensitivity. This requirement limits their ability to detect genes in the fragmented sequences commonly seen in metagenomic samples.

We took a data-driven approach to prokaryotic gene finding, relying on the large and diverse collection of already-sequenced genomes. By training a single, universal model of bacterial genes on protein sequences from many different species, we were able to match the sensitivity of current gene finders while reducing the overall number of gene predictions. Our model does not need to be refit on any new genome. Balrog (Bacterial Annotation by Learned Representation of Genes) represents a fundamentally different yet effective method for prokaryotic gene finding.

AssessORF: combining evolutionary conservation and proteomics to assess prokaryotic gene predictions

Bioinformatics ◽ 10.1093/bioinformatics/btz714 ◽ 2019

Author(s): Deepank R Korandla ◽ Jacob M Wozniak ◽ Anaamika Campeau ◽ David J Gonzalez ◽ Erik S Wright

Keyword(s): R Package ◽ Evolutionary Conservation ◽ Supplementary Information ◽ Bioconductor Package ◽ Gene Finding ◽ Proteomics Data ◽ Protein Coding ◽ New Approach ◽ Protein Coding Genes ◽ Clear Winner

Motivation: A core task of genomics is to identify the boundaries of protein-coding genes, which may cover over 90% of a prokaryote's genome. Several programs are available for gene finding, yet it is currently unclear how well these programs perform and whether any offers superior accuracy. This is in part because there is no universal benchmark for gene finding and, therefore, most developers select their own benchmarking strategy.

Results: Here, we introduce AssessORF, a new approach for benchmarking prokaryotic gene predictions based on evidence from proteomics data and the evolutionary conservation of start and stop codons. We applied AssessORF to compare gene predictions offered by GenBank, GeneMarkS-2, Glimmer and Prodigal on genomes spanning the prokaryotic tree of life. Gene predictions were 88–95% in agreement with the available evidence, with Glimmer performing the worst and no clear winner among the rest. All programs were biased towards selecting start codons that were upstream of the actual start. Given these findings, there remains considerable room for improvement, especially in the detection of correct start sites.

Availability and implementation: AssessORF is available as an R package via the Bioconductor package repository.

Supplementary information: Supplementary data are available at Bioinformatics online.
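The upstream bias reported above can be made concrete with a small Python sketch that labels each predicted start as exact, upstream, or downstream of an evidence-supported start, taking the strand into account. This is not AssessORF's implementation (AssessORF is an R package); the `classify_start` helper and the coordinates are hypothetical.

```python
from collections import Counter


def classify_start(strand, predicted, reference):
    """Label a predicted start relative to a reference start of the same gene."""
    if predicted == reference:
        return "exact"
    # Upstream means a smaller coordinate on "+" and a larger one on "-".
    if strand == "+":
        return "upstream" if predicted < reference else "downstream"
    return "upstream" if predicted > reference else "downstream"


# Hypothetical (strand, predicted start, evidence-supported start) triples.
pairs = [
    ("+", 100, 100),
    ("+", 88, 100),
    ("-", 930, 900),
    ("+", 115, 100),
    ("-", 900, 900),
]

tally = Counter(classify_start(*pair) for pair in pairs)
print(tally)
```

A tally skewed towards "upstream" across many genes would correspond to the kind of bias the abstract describes.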
