PHANOTATE: a novel approach to gene identification in phage genomes

Katelyn McNair; Carol Zhou; Elizabeth A Dinsdale; Brian Souza; Robert A Edwards

doi:10.1093/bioinformatics/btz265

PHANOTATE: a novel approach to gene identification in phage genomes

Bioinformatics ◽

10.1093/bioinformatics/btz265 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4537-4542 ◽

Cited By ~ 24

Author(s):

Katelyn McNair ◽

Carol Zhou ◽

Elizabeth A Dinsdale ◽

Brian Souza ◽

Robert A Edwards

Keyword(s):

Gene Prediction ◽

Optimal Path ◽

Genome Structure ◽

Weighted Graph ◽

Open Reading Frames ◽

Supplementary Information ◽

Functional Protein ◽

Protein Database ◽

Protein Coding ◽

Novel Approach

Abstract Motivation Currently there are no tools specifically designed for annotating genes in phages. Several tools are available that have been adapted to run on phage genomes, but due to their underlying design, they are unable to capture the full complexity of phage genomes. Phages have adapted their genomes to be extremely compact, having adjacent genes that overlap and genes completely inside of other longer genes. This non-delineated genome structure makes it difficult for gene prediction using the currently available gene annotators. Here we present PHANOTATE, a novel method for gene calling specifically designed for phage genomes. Although the compact nature of genes in phages is a problem for current gene annotators, we exploit this property by treating a phage genome as a network of paths: where open reading frames are favorable, and overlaps and gaps are less favorable, but still possible. We represent this network of connections as a weighted graph, and use dynamic programming to find the optimal path. Results We compare PHANOTATE to other gene callers by annotating a set of 2133 complete phage genomes from GenBank, using PHANOTATE and the three most popular gene callers. We found that the four programs agree on 82% of the total predicted genes, with PHANOTATE predicting more genes than the other three. We searched for these extra genes in both GenBank’s non-redundant protein database and all of the metagenomes in the sequence read archive, and found that they are present at levels that suggest that these are functional protein-coding genes. Availability and implementation https://github.com/deprekate/PHANOTATE Supplementary information Supplementary data are available at Bioinformatics online.
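
The weighted-graph idea in the abstract above can be illustrated with a small sketch. This is not PHANOTATE's implementation: the ORF weights, the `gap_penalty` and `overlap_penalty` values, and the function names are all hypothetical, and the sketch ignores strands and genes nested entirely inside other genes. It only shows how dynamic programming can pick a lowest-cost chain of ORFs when ORFs are favorable (negative weight) and gaps or overlaps carry a cost.

```python
from typing import List, Tuple

# (start, end, weight); a more negative weight means a more favorable ORF.
Orf = Tuple[int, int, float]


def connection_cost(prev_end: int, next_start: int,
                    gap_penalty: float = 0.01,
                    overlap_penalty: float = 0.05) -> float:
    """Hypothetical cost of joining two ORFs: both gaps and overlaps are penalized."""
    distance = next_start - prev_end
    if distance >= 0:
        return gap_penalty * distance      # gap between the two ORFs
    return overlap_penalty * (-distance)   # the two ORFs overlap


def best_path(orfs: List[Orf]) -> Tuple[float, List[Orf]]:
    """Return the minimum total weight and the chain of ORFs that achieves it."""
    orfs = sorted(orfs, key=lambda o: (o[0], o[1]))
    n = len(orfs)
    if n == 0:
        return 0.0, []
    best = [0.0] * n   # best[i]: lowest cost of any chain ending at ORF i
    prev = [-1] * n    # back-pointers for path reconstruction
    for i, (start, end, weight) in enumerate(orfs):
        best[i] = weight  # chain consisting of this ORF alone
        for j in range(i):
            _, p_end, _ = orfs[j]
            if p_end >= end:
                continue  # simplification: skip ORFs that do not end before this one
            candidate = best[j] + connection_cost(p_end, start) + weight
            if candidate < best[i]:
                best[i] = candidate
                prev[i] = j
    i = min(range(n), key=lambda k: best[k])
    total = best[i]
    path = []
    while i != -1:
        path.append(orfs[i])
        i = prev[i]
    path.reverse()
    return total, path


if __name__ == "__main__":
    candidates = [(0, 300, -5.0), (290, 600, -4.0), (100, 200, -0.1), (650, 900, -3.0)]
    total, path = best_path(candidates)
    print(total, path)
```

In this toy input the short, weakly scored ORF at 100-200 is left out, while the 10-base overlap between the first two ORFs is tolerated because both are strongly favorable.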

THEA: A novel approach to gene identification in phage genomes

10.1101/265983 ◽

2018 ◽

Cited By ~ 3

Author(s):

Katelyn McNair ◽

Carol Zhou ◽

Brian Souza ◽

Robert A. Edwards

Keyword(s):

Gene Prediction ◽

Optimal Path ◽

Genome Structure ◽

Weighted Graph ◽

Open Reading Frames ◽

Functional Protein ◽

Protein Database ◽

Protein Coding ◽

Novel Approach ◽

Novel Method

Abstract Motivation Currently there are no tools specifically designed for annotating genes in phages. Several tools are available that have been adapted to run on phage genomes, but due to their underlying design they are unable to capture the full complexity of phage genomes. Phages have adapted their genomes to be extremely compact, having adjacent genes that overlap, and genes completely inside of other longer genes. This non-delineated genome structure makes it difficult for gene prediction using the currently available gene annotators. Here we present THEA (The Algorithm), a novel method for gene calling specifically designed for phage genomes. While the compact nature of genes in phages is a problem for current gene annotators, we exploit this property by treating a phage genome as a network of paths: where open reading frames are favorable, and overlaps and gaps are less favorable, but still possible. We represent this network of connections as a weighted graph, and use graph theory to find the optimal path. Results We compare THEA to other gene callers by annotating a set of 2,133 complete phage genomes from GenBank, using THEA and the three most popular gene callers. We found that the four programs agree on 82% of the total predicted genes, with THEA predicting significantly more genes than the other three. We searched for these extra genes in both GenBank’s non-redundant protein database and sequence read archive, and found that they are present at levels that suggest that these are functional protein coding genes. Availability and Implementation The source code and all files can be found at: https://github.com/deprekate/THEA Contact Katelyn McNair

Identification of Proteins Associated with Murine Cytomegalovirus Virions

Journal of Virology ◽

10.1128/jvi.78.20.11187-11197.2004 ◽

2004 ◽

Vol 78 (20) ◽

pp. 11187-11197 ◽

Cited By ~ 105

Author(s):

Lisa M. Kattenhorn ◽

Ryan Mills ◽

Markus Wagner ◽

Alexandre Lomsadze ◽

Vsevolod Makeev ◽

...

Keyword(s):

Gene Prediction ◽

Polyacrylamide Gel Electrophoresis ◽

Sodium Dodecyl ◽

Open Reading Frames ◽

Murine Cytomegalovirus ◽

Prediction Algorithm ◽

Sequencing Analysis ◽

Protein Coding ◽

Coding Potential ◽

Reading Frames

ABSTRACT Proteins associated with the murine cytomegalovirus (MCMV) viral particle were identified by a combined approach of proteomic and genomic methods. Purified MCMV virions were dissociated by complete denaturation and subjected to either separation by sodium dodecyl sulfate-polyacrylamide gel electrophoresis and in-gel digestion or treated directly by in-solution tryptic digestion. Peptides were separated by nanoflow liquid chromatography and analyzed by tandem mass spectrometry (LC-MS/MS). The MS/MS spectra obtained were searched against a database of MCMV open reading frames (ORFs) predicted to be protein coding by an MCMV-specific version of the gene prediction algorithm GeneMarkS. We identified 38 proteins from the capsid, tegument, glycoprotein, replication, and immunomodulatory protein families, as well as 20 genes of unknown function. Observed irregularities in coding potential suggested possible sequence errors in the 3′-proximal ends of m20 and M31. These errors were experimentally confirmed by sequencing analysis. The MS data further indicated the presence of peptides derived from the unannotated ORFs ORFc225441-226898 (m166.5) and ORF105932-106072. Immunoblot experiments confirmed expression of m166.5 during viral infection.

ArtiFuse—computational validation of fusion gene detection tools without relying on simulated reads

Bioinformatics ◽

10.1093/bioinformatics/btz613 ◽

2019 ◽

Author(s):

Patrick Sorn ◽

Christoph Holtsträter ◽

Martin Löwer ◽

Ugur Sahin ◽

David Weber

Keyword(s):

Fusion Gene ◽

Gene Prediction ◽

Supplementary Information ◽

Fusion Genes ◽

Rna Seq ◽

High Coverage ◽

Prediction Tools ◽

Novel Approach ◽

Tool Performance ◽

Transcriptional Variants

Abstract Motivation Gene fusions are an important class of transcriptional variants that can influence cancer development and can be predicted from RNA sequencing (RNA-seq) data by multiple existing tools. However, the real-world performance of these tools is unclear due to the lack of known positive and negative events, especially with regard to fusion genes in individual samples. Often simulated reads are used, but these cannot account for all technical biases in RNA-seq data generated from real samples. Results Here, we present ArtiFuse, a novel approach that simulates fusion genes by sequence modification to the genomic reference, and therefore can be applied to any RNA-seq dataset without the need for any simulated reads. We demonstrate our approach on eight RNA-seq datasets for three fusion gene prediction tools: average recall values peak for all three tools between 0.4 and 0.56 for high-quality and high-coverage datasets. As ArtiFuse affords total control over involved genes and breakpoint position, we also assessed performance with regard to gene-related properties, showing a drop in recall value for low-expressed genes in high-coverage samples and genes with co-expressed paralogues. Overall tool performance assessed from ArtiFusions is lower compared to previously reported estimates on simulated reads. Due to the use of real RNA-seq datasets, we believe that ArtiFuse provides a more realistic benchmark that can be used to develop more accurate fusion gene prediction tools for application in clinical settings. Availability and implementation ArtiFuse is implemented in Python. The source code and documentation are available at https://github.com/TRON-Bioinformatics/ArtiFusion. Supplementary information Supplementary data are available at Bioinformatics online.

Utilizing Amino Acid Composition and Entropy of Potential Open Reading Frames to Identify Protein-Coding Genes

Microorganisms ◽

10.3390/microorganisms9010129 ◽

2021 ◽

Vol 9 (1) ◽

pp. 129

Author(s):

Katelyn McNair ◽

Carol L. Ecale Zhou ◽

Brian Souza ◽

Stephanie Malfatti ◽

Robert A. Edwards

Keyword(s):

Amino Acid ◽

Gene Prediction ◽

Training Model ◽

Entropy Density ◽

Open Reading Frames ◽

Initial Training ◽

Training Set ◽

Protein Coding ◽

Protein Coding Genes ◽

Reading Frames

One of the main steps in gene-finding in prokaryotes is determining which open reading frames encode a protein, and which occur by chance alone. There are many different methods to differentiate the two; the most prevalent approach is using shared homology with a database of known genes. This method presents many pitfalls, most notably the catch that you only find genes that you have seen before. The four most popular prokaryotic gene-prediction programs (GeneMark, Glimmer, Prodigal, PHANOTATE) all use a protein-coding training model to predict protein-coding genes, with the latter three allowing for the training model to be created ab initio from the input genome. Different methods are available for creating the training model, and to increase the accuracy of such tools, we present here GOODORFS, a method for identifying protein-coding genes within a set of all possible open reading frames (ORFs). Our workflow begins with taking the amino acid frequencies of each ORF, calculating an entropy density profile (EDP), using KMeans to cluster the EDPs, and then selecting the cluster with the lowest variation as the coding ORFs. To test the efficacy of our method, we ran GOODORFS on 14,179 annotated phage genomes, and compared our results to the initial training-set creation step of four other similar methods (Glimmer, MED2, PHANOTATE, Prodigal). We found that GOODORFS was the most accurate (0.94) and had the best F1-score (0.85), while Glimmer had the highest precision (0.92) and PHANOTATE had the highest recall (0.96).
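
As an illustration of the first two workflow steps named in the abstract above (amino acid frequencies, then an entropy density profile), the following sketch computes an EDP for one translated ORF. It assumes the commonly used definition in which each component is the residue's Shannon-entropy term divided by the total entropy; the function name and the handling of edge cases are hypothetical and not taken from GOODORFS. The KMeans clustering and cluster-selection steps are not shown.

```python
import math
from collections import Counter
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, in a fixed order


def entropy_density_profile(protein: str) -> List[float]:
    """Return a 20-component EDP for a protein sequence.

    Assumed definition: s_i = -(p_i * ln p_i) / H, where p_i is the frequency of
    residue i and H = -sum(p_i * ln p_i) is the Shannon entropy of the sequence.
    """
    counts = Counter(aa for aa in protein.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    freqs = [counts[aa] / total for aa in AMINO_ACIDS]
    terms = [-p * math.log(p) if p > 0 else 0.0 for p in freqs]
    entropy = sum(terms)
    if entropy == 0.0:
        # Only one residue type is present; the profile is undefined, so return zeros.
        return [0.0] * len(AMINO_ACIDS)
    return [t / entropy for t in terms]


if __name__ == "__main__":
    profile = entropy_density_profile("MKVLAAGIVG")
    print([round(x, 3) for x in profile])
```

Because each component is a share of the total entropy, the components of a profile sum to 1, which makes profiles from ORFs of different lengths comparable as input vectors for a clustering step.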

CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats

Database ◽

10.1093/database/baaa088 ◽

2020 ◽

Vol 2020 ◽

Author(s):

Alejandro Rubio ◽

Pablo Mier ◽

Miguel A Andrade-Navarro ◽

Andrés Garzón ◽

Juan Jiménez ◽

...

Keyword(s):

Gene Prediction ◽

Biological Sequences ◽

Protein Database ◽

Computational Tools ◽

Protein Coding ◽

Homology Searching ◽

Secondary Analyses ◽

Source Of Error ◽

Gene Finder ◽

Erroneous Data

Abstract The genomics era is resulting in the generation of a plethora of biological sequences that are usually stored in public databases. There are many computational tools that facilitate the annotation of these sequences, but sometimes they produce mistakes that enter the databases and can be propagated when erroneous data are used for secondary analyses, such as gene prediction or homology searching. While developing a computational gene finder based on protein-coding sequences, we discovered that the reference UniProtKB protein database is contaminated with some spurious sequences translated from DNA containing clustered regularly interspaced short palindromic repeats. We therefore encourage developers of prokaryotic computational gene finders and protein database curators to consider this source of error.

Annotating high-impact 5’untranslated region variants with the UTRannotator

10.1101/2020.06.03.132266 ◽

2020 ◽

Cited By ~ 1

Author(s):

Xiaolei Zhang ◽

Matthew Wakeling ◽

James Ware ◽

Nicola Whiffin

Keyword(s):

Open Reading Frames ◽

Supplementary Information ◽

Untranslated Regions ◽

Protein Coding ◽

Pathogenic Variants ◽

Uncertain Significance ◽

Upstream Open Reading Frames ◽

The Impact ◽

Reading Frames

Abstract Summary Current tools to annotate the predicted effect of genetic variants are heavily biased towards protein-coding sequence. Variants outside of these regions may have a large impact on protein expression and/or structure and can lead to disease, but this effect can be challenging to predict. Consequently, these variants are poorly annotated using standard tools. We have developed a plugin to the Ensembl Variant Effect Predictor, the UTRannotator, that annotates variants in 5’untranslated regions (5’UTR) that create or disrupt upstream open reading frames (uORFs). We investigate the utility of this tool using the ClinVar database, providing an annotation for 30.8% of all 5’UTR (likely) pathogenic variants, and highlighting 31 variants of uncertain significance as candidates for further follow-up. We will continue to update the UTRannotator as we gain new knowledge on the impact of variants in UTRs. Availability and implementation UTRannotator is freely available on Github: https://github.com/ImperialCardioGenetics/UTRannotator Supplementary information Supplementary data are available at bioRxiv.

Using AnABlast for intergenic sORF prediction in the Caenorhabditis elegans genome

Bioinformatics ◽

10.1093/bioinformatics/btaa608 ◽

2020 ◽

Vol 36 (19) ◽

pp. 4827-4832

Author(s):

C S Casimiro-Soriguer ◽

M M Rigual ◽

A M Brokate-Llanos ◽

M J Muñoz ◽

A Garzón ◽

...

Keyword(s):

Caenorhabditis Elegans ◽

Genome Sequence ◽

Dna Sequences ◽

Sequence Similarity ◽

Open Reading Frames ◽

Supplementary Information ◽

Protein Coding ◽

Reading Frame ◽

A Genome ◽

Caenorhabditis Elegans Genome

Abstract Motivation Short bioactive peptides encoded by small open reading frames (sORFs) play important roles in eukaryotes. Bioinformatics prediction of ORFs is an early step in a genome sequence analysis, but sORFs encoding short peptides, often using non-AUG initiation codons, are not easily discriminated from false ORFs occurring by chance. Results AnABlast is a computational tool designed to highlight putative protein-coding regions in genomic DNA sequences. This protein-coding finder is independent of ORF length and reading frame shifts, thus making AnABlast a potentially useful tool to predict sORFs. Using this algorithm, here, we report the identification of 82 putative new intergenic sORFs in the Caenorhabditis elegans genome. Sequence similarity, motif presence, expression data and RNA interference experiments support that the underlined sORFs likely encode functional peptides, encouraging the use of AnABlast as a new approach for the accurate prediction of intergenic sORFs in annotated eukaryotic genomes. Availability and implementation AnABlast is freely available at http://www.bioinfocabd.upo.es/ab/. The C. elegans genome browser with AnABlast results, annotated genes and all data used in this study is available at http://www.bioinfocabd.upo.es/celegans. Supplementary information Supplementary data are available at Bioinformatics online.

Annotating high-impact 5′untranslated region variants with the UTRannotator

Bioinformatics ◽

10.1093/bioinformatics/btaa783 ◽

2020 ◽

Author(s):

Xiaolei Zhang ◽

Matthew Wakeling ◽

James Ware ◽

Nicola Whiffin

Keyword(s):

Open Reading Frames ◽

Supplementary Information ◽

Untranslated Regions ◽

Protein Coding ◽

Pathogenic Variants ◽

Uncertain Significance ◽

Upstream Open Reading Frames ◽

The Impact ◽

Reading Frames

Abstract Summary Current tools to annotate the predicted effect of genetic variants are heavily biased towards protein-coding sequence. Variants outside of these regions may have a large impact on protein expression and/or structure and can lead to disease, but this effect can be challenging to predict. Consequently, these variants are poorly annotated using standard tools. We have developed a plugin to the Ensembl Variant Effect Predictor, the UTRannotator, that annotates variants in 5′untranslated regions (5′UTR) that create or disrupt upstream open reading frames. We investigate the utility of this tool using the ClinVar database, providing an annotation for 31.9% of all 5′UTR (likely) pathogenic variants, and highlighting 31 variants of uncertain significance as candidates for further follow-up. We will continue to update the UTRannotator as we gain new knowledge on the impact of variants in UTRs. Availability and implementation UTRannotator is freely available on Github: https://github.com/ImperialCardioGenetics/UTRannotator. Supplementary information Supplementary data are available at Bioinformatics online.
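
To make concrete what it means for a variant to "create or disrupt" an upstream open reading frame, here is a deliberately simplified sketch. It is not the UTRannotator's logic (that tool is a plugin to the Ensembl Variant Effect Predictor); the function names are hypothetical, positions are 0-based, and the sketch only looks at upstream ATG start codons, ignoring stop codons, reading frame relative to the main coding sequence, and Kozak context.

```python
from typing import Dict, List, Set


def atg_positions(seq: str) -> Set[int]:
    """Return the 0-based positions of every ATG in the sequence."""
    return {i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"}


def uatg_changes(utr: str, pos: int, alt: str) -> Dict[str, List[int]]:
    """Compare upstream ATGs before and after a single-nucleotide substitution.

    utr: 5'UTR sequence on the transcript strand (hypothetical input format)
    pos: 0-based position of the substituted base
    alt: the alternate base
    """
    ref_seq = utr.upper()
    if not 0 <= pos < len(ref_seq) or len(alt) != 1:
        raise ValueError("expected a single-base substitution inside the UTR")
    alt_seq = ref_seq[:pos] + alt.upper() + ref_seq[pos + 1:]
    before = atg_positions(ref_seq)
    after = atg_positions(alt_seq)
    return {
        "gained": sorted(after - before),  # new upstream ATGs created by the variant
        "lost": sorted(before - after),    # existing upstream ATGs removed by the variant
    }


if __name__ == "__main__":
    print(uatg_changes("GGCACGTTCC", 4, "T"))  # ACG -> ATG creates a start codon
    print(uatg_changes("GGATGCC", 3, "C"))     # ATG -> ACG removes a start codon
```

A fuller treatment would also track where each upstream reading frame terminates, since the consequence of a new start codon depends on whether its frame ends before, or runs into, the main coding sequence.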

An integrative strategy to identify the entire protein coding potential of prokaryotic genomes by proteogenomics

10.1101/153213 ◽

2017 ◽

Cited By ~ 1

Author(s):

Ulrich Omasits ◽

Adithi R. Varadarajan ◽

Michael Schmid ◽

Sandra Goetze ◽

Damianos Melidis ◽

...

Keyword(s):

Gene Prediction ◽

Bartonella Henselae ◽

Prokaryotic Genome ◽

Gc Content ◽

Laboratory Strain ◽

Open Reading Frames ◽

General Applicability ◽

Protein Coding ◽

Prokaryotic Genomes ◽

Coding Potential

Abstract Accurate annotation of all protein-coding sequences (CDSs) is an essential prerequisite to fully exploit the rapidly growing repertoire of completely sequenced prokaryotic genomes. However, large discrepancies among the number of CDSs annotated by different resources, missed functional short open reading frames (sORFs), and overprediction of spurious ORFs represent serious limitations. Our strategy towards accurate and complete genome annotation consolidates CDSs from multiple reference annotation resources, ab initio gene prediction algorithms and in silico ORFs in an integrated proteogenomics database (iPtgxDB) that covers the entire protein-coding potential of a prokaryotic genome. By extending the PeptideClassifier concept of unambiguous peptides for prokaryotes, close to 95% of the identifiable peptides imply one distinct protein, largely simplifying downstream analysis. Searching a comprehensive Bartonella henselae proteomics dataset against such an iPtgxDB allowed us to unambiguously identify novel ORFs uniquely predicted by each resource, including lipoproteins, differentially expressed and membrane-localized proteins, novel start sites and wrongly annotated pseudogenes. Most novelties were confirmed by targeted, parallel reaction monitoring mass spectrometry, including unique ORFs and variants identified in a re-sequenced laboratory strain that are not present in its reference genome. We demonstrate the general applicability of our strategy for genomes with varying GC content and distinct taxonomic origin, and release iPtgxDBs for B. henselae, Bradyrhizobium diazoefficiens and Escherichia coli as well as the software to generate such proteogenomics search databases for any prokaryote.

MiTPeptideDB: a proteogenomic resource for the discovery of novel peptides

Bioinformatics ◽

10.1093/bioinformatics/btz530 ◽

2019 ◽

Vol 36 (1) ◽

pp. 205-211 ◽

Cited By ~ 1

Author(s):

Elizabeth Guruceaga ◽

Alba Garin-Muga ◽

Victor Segura

Keyword(s):

Open Reading Frames ◽

Supplementary Information ◽

Rna Seq ◽

Protein Coding ◽

Clinical Biomarkers ◽

Computational Performance ◽

Novel Transcripts ◽

Peptide Detectability ◽

Small Open Reading Frames

Abstract Motivation The principal lines of research in MS/MS-based proteomics have been directed toward the molecular characterization of proteins, including their biological functions and their implications in human diseases. Recent advances in this field have also allowed the first attempts to apply these techniques to clinical practice. Nowadays, the main progress in computational proteomics is based on the integration of genomic, transcriptomic and proteomic experimental data, which is known as proteogenomics. This methodology has been especially useful for the discovery of new clinical biomarkers, small open reading frames and microproteins, although their validation is still challenging. Results We detected novel peptides following a proteogenomic workflow based on the MiTranscriptome human assembly and shotgun experiments. The annotation approach generated three custom databases with the corresponding peptides of known and novel transcripts of both protein-coding genes and non-coding genes. In addition, we used a peptide detectability filter to improve the computational performance of the proteomic searches, the statistical analysis and the robustness of the results. These innovative additional filters are especially relevant when noisy next-generation sequencing experiments are used to generate the databases. This resource, MiTPeptideDB, was validated using 43 cell lines for which RNA-Seq experiments and shotgun experiments were available. Availability and implementation MiTPeptideDB is available at http://bit.ly/MiTPeptideDB. Supplementary information Supplementary data are available at Bioinformatics online.
