CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats

Database ◽

10.1093/database/baaa088 ◽

2020 ◽

Vol 2020 ◽

Author(s):

Alejandro Rubio ◽

Pablo Mier ◽

Miguel A Andrade-Navarro ◽

Andrés Garzón ◽

Juan Jiménez ◽

...

Keyword(s):

Gene Prediction ◽

Biological Sequences ◽

Protein Database ◽

Computational Tools ◽

Protein Coding ◽

Homology Searching ◽

Secondary Analyses ◽

Source Of Error ◽

Gene Finder ◽

Erroneous Data

Abstract The genomics era is resulting in the generation of a plethora of biological sequences that are usually stored in public databases. There are many computational tools that facilitate the annotation of these sequences, but sometimes they produce mistakes that enter the databases and can be propagated when erroneous data are used for secondary analyses, such as gene prediction or homology searching. While developing a computational gene finder based on protein-coding sequences, we discovered that the reference UniProtKB protein database is contaminated with some spurious sequences translated from DNA containing clustered regularly interspaced short palindromic repeats. We therefore encourage developers of prokaryotic computational gene finders and protein database curators to consider this source of error.
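As a rough illustration of how such spurious entries might be flagged, the sketch below scans a protein sequence for a short peptide that recurs several times without overlapping itself, which is the pattern a translated CRISPR array would tend to leave. This is not the authors' procedure: the function name `spaced_repeats`, the k-mer length and the copy threshold are all assumptions chosen for the example.

```python
from collections import defaultdict


def spaced_repeats(protein, k=8, min_copies=3):
    """Return {kmer: [start positions]} for k-mers that occur at least
    min_copies times in `protein` without overlapping themselves.

    Hypothetical heuristic for illustration only; k and min_copies are
    arbitrary assumptions, not values taken from the paper.
    """
    positions = defaultdict(list)
    for i in range(len(protein) - k + 1):
        kmer = protein[i:i + k]
        starts = positions[kmer]
        # Record this occurrence only if it does not overlap the previous one.
        if not starts or i - starts[-1] >= k:
            starts.append(i)
    return {kmer: starts for kmer, starts in positions.items()
            if len(starts) >= min_copies}


# Toy sequence: one 8-residue "repeat" separated by distinct 5-residue "spacers".
repeat = "VFPARAGM"
toy = "M" + repeat + "KLTQS" + repeat + "DDEWA" + repeat + "HNCYP" + repeat
hits = spaced_repeats(toy)
```

A real screen would also need to tolerate imperfect repeats and check that the copies are regularly spaced, which this sketch does not attempt.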

PHANOTATE: a novel approach to gene identification in phage genomes

Bioinformatics ◽

10.1093/bioinformatics/btz265 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4537-4542 ◽

Cited By ~ 24

Author(s):

Katelyn McNair ◽

Carol Zhou ◽

Elizabeth A Dinsdale ◽

Brian Souza ◽

Robert A Edwards

Keyword(s):

Gene Prediction ◽

Optimal Path ◽

Genome Structure ◽

Weighted Graph ◽

Open Reading Frames ◽

Supplementary Information ◽

Functional Protein ◽

Protein Database ◽

Protein Coding ◽

Novel Approach

Abstract Motivation Currently there are no tools specifically designed for annotating genes in phages. Several tools are available that have been adapted to run on phage genomes, but due to their underlying design, they are unable to capture the full complexity of phage genomes. Phages have adapted their genomes to be extremely compact, having adjacent genes that overlap and genes completely inside of other longer genes. This non-delineated genome structure makes gene prediction difficult for the currently available gene annotators. Here we present PHANOTATE, a novel method for gene calling specifically designed for phage genomes. Although the compact nature of genes in phages is a problem for current gene annotators, we exploit this property by treating a phage genome as a network of paths: where open reading frames are favorable, and overlaps and gaps are less favorable, but still possible. We represent this network of connections as a weighted graph, and use dynamic programming to find the optimal path. Results We compare PHANOTATE to other gene callers by annotating a set of 2133 complete phage genomes from GenBank, using PHANOTATE and the three most popular gene callers. We found that the four programs agree on 82% of the total predicted genes, with PHANOTATE predicting more genes than the other three. We searched for these extra genes in both GenBank’s non-redundant protein database and all of the metagenomes in the sequence read archive, and found that they are present at levels that suggest that these are functional protein-coding genes. Availability and implementation https://github.com/deprekate/PHANOTATE Supplementary information Supplementary data are available at Bioinformatics online.
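The weighted-graph idea in this abstract can be sketched as a small dynamic program. The code below is a simplified illustration, not PHANOTATE's implementation: it assumes a single strand, takes ORF scores as given, ignores genes nested inside other genes, and uses made-up linear penalties for gaps and overlaps. The names `best_orf_path`, `gap_penalty` and `overlap_penalty` are hypothetical.

```python
from typing import List, Tuple

# (start, end, score); a higher score means the ORF looks more gene-like.
Orf = Tuple[int, int, float]


def best_orf_path(orfs: List[Orf],
                  gap_penalty: float = 0.01,
                  overlap_penalty: float = 0.05) -> Tuple[float, List[Orf]]:
    """Find the lowest-cost chain of ORFs along one strand.

    Each ORF contributes -score (favorable); each connection costs a
    penalty proportional to the gap or overlap between consecutive ORFs.
    All weights here are illustrative assumptions.
    """
    orfs = sorted(orfs, key=lambda o: (o[1], o[0]))  # order by end coordinate
    n = len(orfs)
    if n == 0:
        return 0.0, []
    cost = [0.0] * n   # cost[j]: best cost of any chain ending at ORF j
    prev = [-1] * n    # back-pointers for path reconstruction
    for j, (sj, ej, wj) in enumerate(orfs):
        best, arg = 0.0, -1  # option: start a new chain at ORF j
        for i in range(j):
            si, ei, _ = orfs[i]
            if ei >= ej or si >= sj:
                continue  # require strictly forward movement (no nesting)
            d = sj - ei
            edge = gap_penalty * d if d >= 0 else overlap_penalty * (-d)
            if cost[i] + edge < best:
                best, arg = cost[i] + edge, i
        cost[j] = best - wj
        prev[j] = arg
    end = min(range(n), key=cost.__getitem__)
    path = []
    k = end
    while k != -1:
        path.append(orfs[k])
        k = prev[k]
    return cost[end], path[::-1]


candidates = [(0, 300, 3.0), (290, 600, 2.5), (100, 250, 0.5), (650, 900, 2.0)]
total, path = best_orf_path(candidates)
```

In this toy input the chain accepts a 10-base overlap between the first two ORFs because the penalty is smaller than the score gained, which mirrors the abstract's point that overlaps are less favorable but still possible.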

THEA: A novel approach to gene identification in phage genomes

10.1101/265983 ◽

2018 ◽

Cited By ~ 3

Author(s):

Katelyn McNair ◽

Carol Zhou ◽

Brian Souza ◽

Robert A. Edwards

Keyword(s):

Gene Prediction ◽

Optimal Path ◽

Genome Structure ◽

Weighted Graph ◽

Open Reading Frames ◽

Functional Protein ◽

Protein Database ◽

Protein Coding ◽

Novel Approach ◽

Novel Method

Abstract Motivation Currently there are no tools specifically designed for annotating genes in phages. Several tools are available that have been adapted to run on phage genomes, but due to their underlying design they are unable to capture the full complexity of phage genomes. Phages have adapted their genomes to be extremely compact, having adjacent genes that overlap, and genes completely inside of other longer genes. This non-delineated genome structure makes it difficult for gene prediction using the currently available gene annotators. Here we present THEA (The Algorithm), a novel method for gene calling specifically designed for phage genomes. While the compact nature of genes in phages is a problem for current gene annotators, we exploit this property by treating a phage genome as a network of paths: where open reading frames are favorable, and overlaps and gaps are less favorable, but still possible. We represent this network of connections as a weighted graph, and use graph theory to find the optimal path. Results We compare THEA to other gene callers by annotating a set of 2,133 complete phage genomes from GenBank, using THEA and the three most popular gene callers. We found that the four programs agree on 82% of the total predicted genes, with THEA predicting significantly more genes than the other three. We searched for these extra genes in both GenBank’s non-redundant protein database and sequence read archive, and found that they are present at levels that suggest that these are functional protein coding genes. Availability and Implementation The source code and all files can be found at: https://github.com/deprekate/THEA Contact Katelyn McNair: [email protected]

BRAKER2: Automatic Eukaryotic Genome Annotation with GeneMark-EP+ and AUGUSTUS Supported by a Protein Database

10.1101/2020.08.10.245134 ◽

2020 ◽

Cited By ~ 5

Author(s):

Tomáš Brůna ◽

Katharina J. Hoff ◽

Alexandre Lomsadze ◽

Mario Stanke ◽

Mark Borodovsky

Keyword(s):

Genome Annotation ◽

Prediction Accuracy ◽

Gene Prediction ◽

Accurate Method ◽

Eukaryotic Genome ◽

Structural Annotation ◽

Protein Database ◽

Protein Coding ◽

Annotation Pipeline ◽

The One

Abstract Full automation of gene prediction has become an important bioinformatics task since the advent of next generation sequencing. The eukaryotic genome annotation pipeline BRAKER1 had combined self-training GeneMark-ET with AUGUSTUS to generate genes’ coordinates with support of transcriptomic data. Here, we introduce BRAKER2, a pipeline with GeneMark-EP+ and AUGUSTUS externally supported by cross-species protein sequences aligned to the genome. Among the challenges addressed in the development of the new pipeline was generation of reliable hints to the locations of protein-coding exon boundaries from likely homologous but evolutionarily distant proteins. Under equal conditions, the gene prediction accuracy of BRAKER2 was shown to be higher than the one of MAKER2, yet another genome annotation pipeline. Also, in comparison with BRAKER1 supported by a large volume of transcript data, BRAKER2 could produce a better gene prediction accuracy if the evolutionary distances to the reference species in the protein database were rather small. All over, our tests demonstrated that fully automatic BRAKER2 is a fast and accurate method for structural annotation of novel eukaryotic genomes.

Gene prediction by multiple syntenic alignment

Journal of Integrative Bioinformatics ◽

10.1515/jib-2005-13 ◽

2005 ◽

Vol 2 (1) ◽

pp. 38-47

Author(s):

Said S. Adi ◽

Carlos E. Ferreira

Keyword(s):

Computer Program ◽

Genomic Dna ◽

Gene Prediction ◽

Genomic Sequences ◽

Prediction Problem ◽

Huge Amount ◽

Protein Coding ◽

Coding Regions ◽

Conserved Regions ◽

Similarity Information

Summary Given the increasing number of available genomic sequences, one now faces the task of identifying their functional parts, like the protein coding regions. The gene prediction problem can be addressed in several ways. One of the most promising methods makes use of similarity information between the genomic DNA and previously annotated sequences (proteins, cDNAs and ESTs). Recently, given the huge amount of newly sequenced genomes, new similarity-based methods are being successfully applied in the task of gene prediction. The so-called comparative-based methods lie in the similarities shared by regions of two evolutionary related genomic sequences. Despite the number of different gene prediction approaches in the literature, this problem remains challenging. In this paper we present a new comparative-based approach to the gene prediction problem. It is based on a syntenic alignment of three or more genomic sequences. With syntenic alignment we mean an alignment that is constructed taking into account the fact that the involved sequences include conserved regions intervened by unconserved ones. We have implemented the proposed algorithm in a computer program and confirm the validity of the approach on a benchmark including triples of human, mouse and rat genomic sequences.

GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa026 ◽

2020 ◽

Vol 2 (2) ◽

Cited By ~ 3

Author(s):

Tomáš Brůna ◽

Alexandre Lomsadze ◽

Mark Borodovsky

Keyword(s):

Prediction Accuracy ◽

Gene Prediction ◽

External Information ◽

Model Parameters ◽

Protein Database ◽

Sequencing Project ◽

Automated Method ◽

Eukaryotic Gene ◽

Estimation Of Model Parameters ◽

Eukaryotic Genomes

Abstract We have made several steps toward creating a fast and accurate algorithm for gene prediction in eukaryotic genomes. First, we introduced an automated method for efficient ab initio gene finding, GeneMark-ES, with parameters trained in iterative unsupervised mode. Next, in GeneMark-ET we proposed a method of integration of unsupervised training with information on intron positions revealed by mapping short RNA reads. Now we describe GeneMark-EP, a tool that utilizes another source of external information, a protein database, readily available prior to the start of a sequencing project. A new specialized pipeline, ProtHint, initiates massive protein mapping to genome and extracts hints to splice sites and translation start and stop sites of potential genes. GeneMark-EP uses the hints to improve estimation of model parameters as well as to adjust coordinates of predicted genes if they disagree with the most reliable hints (the -EP+ mode). Tests of GeneMark-EP and -EP+ demonstrated improvements in gene prediction accuracy in comparison with GeneMark-ES, while the GeneMark-EP+ showed higher accuracy than GeneMark-ET. We have observed that the most pronounced improvements in gene prediction accuracy happened in large eukaryotic genomes.

Identification of Proteins Associated with Murine Cytomegalovirus Virions

Journal of Virology ◽

10.1128/jvi.78.20.11187-11197.2004 ◽

2004 ◽

Vol 78 (20) ◽

pp. 11187-11197 ◽

Cited By ~ 105

Author(s):

Lisa M. Kattenhorn ◽

Ryan Mills ◽

Markus Wagner ◽

Alexandre Lomsadze ◽

Vsevolod Makeev ◽

...

Keyword(s):

Gene Prediction ◽

Polyacrylamide Gel Electrophoresis ◽

Sodium Dodecyl ◽

Open Reading Frames ◽

Murine Cytomegalovirus ◽

Prediction Algorithm ◽

Sequencing Analysis ◽

Protein Coding ◽

Coding Potential ◽

Reading Frames

ABSTRACT Proteins associated with the murine cytomegalovirus (MCMV) viral particle were identified by a combined approach of proteomic and genomic methods. Purified MCMV virions were dissociated by complete denaturation and subjected to either separation by sodium dodecyl sulfate-polyacrylamide gel electrophoresis and in-gel digestion or treated directly by in-solution tryptic digestion. Peptides were separated by nanoflow liquid chromatography and analyzed by tandem mass spectrometry (LC-MS/MS). The MS/MS spectra obtained were searched against a database of MCMV open reading frames (ORFs) predicted to be protein coding by an MCMV-specific version of the gene prediction algorithm GeneMarkS. We identified 38 proteins from the capsid, tegument, glycoprotein, replication, and immunomodulatory protein families, as well as 20 genes of unknown function. Observed irregularities in coding potential suggested possible sequence errors in the 3′-proximal ends of m20 and M31. These errors were experimentally confirmed by sequencing analysis. The MS data further indicated the presence of peptides derived from the unannotated ORFs ORFc225441-226898 (m166.5) and ORF105932-106072. Immunoblot experiments confirmed expression of m166.5 during viral infection.

TSEBRA: transcript selector for BRAKER

BMC Bioinformatics ◽

10.1186/s12859-021-04482-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Lars Gabriel ◽

Katharina J. Hoff ◽

Tomáš Brůna ◽

Mark Borodovsky ◽

Mario Stanke

Keyword(s):

Statistical Models ◽

Gene Prediction ◽

Software Tool ◽

Genome Project ◽

Rna Seq ◽

Protein Coding ◽

Homologous Protein ◽

Protein Coding Genes ◽

Overlapping Transcripts ◽

Eukaryotic Genomes

Abstract Background BRAKER is a suite of automatic pipelines, BRAKER1 and BRAKER2, for the accurate annotation of protein-coding genes in eukaryotic genomes. Each pipeline trains statistical models of protein-coding genes based on provided evidence and then predicts protein-coding genes in genomic sequences using both the extrinsic evidence and statistical models. For training and prediction, BRAKER1 and BRAKER2 incorporate complementary extrinsic evidence: BRAKER1 uses only RNA-seq data while BRAKER2 uses only a database of cross-species proteins. The BRAKER suite has so far not been able to reliably exceed the accuracy of BRAKER1 and BRAKER2 when incorporating both types of evidence simultaneously. Currently, for a novel genome project where both RNA-seq and protein data are available, the best option is to run both pipelines independently and to pick the one output that is likely better. Therefore, one or another type of the extrinsic evidence would remain unexploited. Results We present TSEBRA, a software that selects gene predictions (transcripts) from the sets generated by BRAKER1 and BRAKER2. TSEBRA uses a set of rules to compare scores of overlapping transcripts based on their support by RNA-seq and homologous protein evidence. We show in computational experiments on genomes of 11 species that TSEBRA achieves higher accuracy than either BRAKER1 or BRAKER2 running alone and that TSEBRA compares favorably with the combiner tool EVidenceModeler. Conclusion TSEBRA is an easy-to-use and fast software tool. It can be used in concert with the BRAKER pipeline to generate a gene prediction set supported by both RNA-seq and homologous protein evidence.

DARTS: An Algorithm for Domain-Associated Retrotransposon Search in Genome Assemblies

Genes ◽

10.3390/genes13010009 ◽

2021 ◽

Vol 13 (1) ◽

pp. 9

Author(s):

Mikhail Biryukov ◽

Kirill Ustyantsev

Keyword(s):

Phylogenetic Analyses ◽

Amino Acid Sequences ◽

Protein Domain ◽

Computational Tools ◽

Protein Coding ◽

Domain Composition ◽

Repeat Masking ◽

Genome Assemblies ◽

Eukaryotic Genomes ◽

Higher Sensitivity

Retrotransposons comprise a substantial fraction of eukaryotic genomes, reaching the highest proportions in plants. Therefore, identification and annotation of retrotransposons is an important task in studying the regulation and evolution of plant genomes. The majority of computational tools for mining transposable elements (TEs) are designed for subsequent genome repeat masking, often leaving aside the element lineage classification and its protein domain composition. Additionally, studies focused on the diversity and evolution of a particular group of retrotransposons often require substantial customization efforts from researchers to adapt existing software to their needs. Here, we developed a computational pipeline to mine sequences of protein-coding retrotransposons based on the sequences of their conserved protein domains—DARTS (Domain-Associated Retrotransposon Search). Using the most abundant group of TEs in plants—long terminal repeat (LTR) retrotransposons (LTR-RTs)—we show that DARTS has radically higher sensitivity for LTR-RT identification compared to the widely accepted tool LTRharvest. DARTS can be easily customized for specific user needs. As a result, DARTS returns a set of structurally annotated nucleotide and amino acid sequences which can be readily used in subsequent comparative and phylogenetic analyses. DARTS may facilitate researchers interested in the discovery and detailed analysis of the diversity and evolution of retrotransposons, LTR-RTs, and other protein-coding TEs.

Conserved novel ORFs in the mitochondrial genome of the ctenophore Beroe forskalii

PeerJ ◽

10.7717/peerj.8356 ◽

2020 ◽

Vol 8 ◽

pp. e8356

Author(s):

Darrin T. Schultz ◽

Jordan M. Eizenga ◽

Russell B. Corbett-Detig ◽

Warren R. Francis ◽

Lynne M. Christianson ◽

...

Keyword(s):

Mitochondrial Genome ◽

Hypothesis Test ◽

Open Reading Frames ◽

Mitochondrial Genomes ◽

Intergenic Sequence ◽

Computational Tools ◽

Protein Coding ◽

Bayesian Hypothesis Test ◽

And Function ◽

Reading Frames

To date, five ctenophore species’ mitochondrial genomes have been sequenced, and each contains open reading frames (ORFs) that if translated have no identifiable orthologs. ORFs with no identifiable orthologs are called unidentified reading frames (URFs). If truly protein-coding, ctenophore mitochondrial URFs represent a little understood path in early-diverging metazoan mitochondrial evolution and metabolism. We sequenced and annotated the mitochondrial genomes of three individuals of the beroid ctenophore Beroe forskalii and found that in addition to sharing the same canonical mitochondrial genes as other ctenophores, the B. forskalii mitochondrial genome contains two URFs. These URFs are conserved among the three individuals but not found in other sequenced species. We developed computational tools called pauvre and cuttlery to determine the likelihood that URFs are protein coding. There is evidence that the two URFs are under negative selection, and a novel Bayesian hypothesis test of trinucleotide frequency shows that the URFs are more similar to known coding genes than noncoding intergenic sequence. Protein structure and function prediction of all ctenophore URFs suggests that they all code for transmembrane transport proteins. These findings, along with the presence of URFs in other sequenced ctenophore mitochondrial genomes, suggest that ctenophores may have uncharacterized transmembrane proteins present in their mitochondria.
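To make the trinucleotide-frequency comparison concrete, here is a minimal sketch that scores a candidate reading frame by a log-likelihood ratio between codon frequencies estimated from known coding sequence and from noncoding sequence. It is a plain smoothed log-odds score, not the Bayesian hypothesis test described in the abstract and not the cuttlery implementation; the function names and the pseudocount of 1 are assumptions.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_counts(seqs):
    """Count in-frame trinucleotides (frame 0) over a list of DNA strings."""
    counts = Counter()
    for s in seqs:
        for i in range(0, len(s) - 2, 3):
            codon = s[i:i + 3]
            if set(codon) <= set("ACGT"):  # skip ambiguous bases
                counts[codon] += 1
    return counts


def log_probs(seqs, pseudocount=1.0):
    """Smoothed log frequency of each of the 64 trinucleotides."""
    counts = codon_counts(seqs)
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: math.log((counts[c] + pseudocount) / total) for c in CODONS}


def log_odds_coding(urf, coding_seqs, noncoding_seqs):
    """Positive values mean the URF resembles the coding set more closely."""
    lp_coding = log_probs(coding_seqs)
    lp_noncoding = log_probs(noncoding_seqs)
    return sum(n * (lp_coding[c] - lp_noncoding[c])
               for c, n in codon_counts([urf]).items())


# Toy training data, invented purely for the example.
coding = ["ATGGCTGCTGCTGCTAAA" * 3]
noncoding = ["TTTTTTATATATTTTTTA" * 3]
score_like_coding = log_odds_coding("GCTGCTGCTGCT", coding, noncoding)
score_like_noncoding = log_odds_coding("TTTTTTTTT", coding, noncoding)
```

With realistic training sets the score would be computed from annotated mitochondrial genes and intergenic regions of the same genome, and its significance would still need a proper statistical test.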

Draft Genome Assembly of the Sheep Scab Mite, Psoroptes ovis

Genome Announcements ◽

10.1128/genomea.00265-18 ◽

2018 ◽

Vol 6 (16) ◽

pp. e00265-18 ◽

Cited By ~ 8

Author(s):

Stewart T. G. Burgess ◽

Kathryn Bartley ◽

Edward J. Marr ◽

Harry W. Wright ◽

Robert J. Weaver ◽

...

Keyword(s):

Genome Assembly ◽

De Novo ◽

Gene Prediction ◽

Draft Genome ◽

Psoroptes Ovis ◽

Protein Coding ◽

Content Type ◽

Sheep Scab ◽

Draft Genome Assembly ◽

Intense Pruritus

ABSTRACT Sheep scab, caused by infestation with Psoroptes ovis, is highly contagious, results in intense pruritus, and represents a major welfare and economic concern. Here, we report the first draft genome assembly and gene prediction of P. ovis based on PacBio de novo sequencing. The ∼63.2-Mb genome encodes 12,041 protein-coding genes.