ORFograph: search for novel insecticidal protein genes in genomic and metagenomic assembly graphs

Microbiome ◽  
2021 ◽  
Vol 9 (1) ◽  
Author(s):  
Tatiana Dvorkina ◽  
Anton Bankevich ◽  
Alexei Sorokin ◽  
Fan Yang ◽  
Boahemaa Adu-Oppong ◽  
...  

Abstract Background Since the prolonged use of insecticidal proteins has led to toxin resistance, it is important to search for novel insecticidal protein genes (IPGs) that are effective in controlling resistant insect populations. IPGs are usually encoded in the genomes of entomopathogenic bacteria, especially in large plasmids in strains of the ubiquitous soil bacterium Bacillus thuringiensis (Bt). Since there are often multiple similar IPGs encoded by such plasmids, their assemblies are typically fragmented and many IPGs are scattered through multiple contigs. As a result, existing gene prediction tools (that analyze individual contigs) typically predict partial rather than complete IPGs, making it difficult to conduct downstream IPG engineering efforts in agricultural genomics. Methods Although it is difficult to assemble IPGs in a single contig, the structure of the genome assembly graph often provides clues on how to combine multiple contigs into segments encoding a single IPG. Results We describe ORFograph, a pipeline for predicting IPGs in assembly graphs, benchmark it on (meta)genomic datasets, and discover nearly a hundred novel IPGs. This work shows that graph-aware gene prediction tools enable the discovery of a greater diversity of IPGs from (meta)genomes. Conclusions We demonstrated that analysis of the assembly graphs reveals novel candidate IPGs. ORFograph identified both already known genes “hidden” in assembly graphs and potential novel IPGs that evaded existing tools for IPG identification. As ORFograph is fast, one could imagine a pipeline that processes many (meta)genomic assembly graphs to identify many more novel IPGs for phenotypic testing, including genes that were previously inaccessible to traditional gene-finding methods. While here we demonstrated the results of ORFograph only for IPGs, the proposed approach can be generalized to any class of genes.
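
The Methods paragraph above says that the assembly graph can indicate how contigs combine into a complete gene. The following Python sketch illustrates that general idea only; it is not ORFograph's algorithm. It enumerates short paths in a toy graph, concatenates the edge sequences, and keeps open reading frames that start in the first contig of a path and end in the last. The names `find_orfs`, `paths` and `orfs_spanning_paths`, the toy sequences, and the simplification that adjacent edges do not overlap are all assumptions made for illustration (edges in real de Bruijn assembly graphs overlap by k-mers, and both strands would need to be scanned).

```python
from typing import Dict, Iterator, List, Set, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) of forward-strand ORFs (ATG ... stop), end exclusive."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


def paths(graph: Dict[str, List[str]], edge: str, max_edges: int) -> Iterator[List[str]]:
    """Yield every path of at most max_edges edges that starts at `edge`."""
    yield [edge]
    if max_edges > 1:
        for nxt in graph.get(edge, []):
            for rest in paths(graph, nxt, max_edges - 1):
                yield [edge] + rest


def orfs_spanning_paths(seqs: Dict[str, str], graph: Dict[str, List[str]],
                        max_edges: int = 3) -> Set[Tuple[Tuple[str, ...], str]]:
    """Collect ORFs that begin in the first edge of a path and end in the last."""
    found = set()
    for first in seqs:
        for path in paths(graph, first, max_edges):
            if len(path) < 2:
                continue  # single-contig ORFs are what per-contig tools already find
            joined = "".join(seqs[e] for e in path)
            first_len = len(seqs[path[0]])
            last_start = len(joined) - len(seqs[path[-1]])
            for start, end in find_orfs(joined):
                if start < first_len and end > last_start:
                    found.add((tuple(path), joined[start:end]))
    return found


# Hypothetical toy graph: contig c1 holds the start codon, c2 is shared,
# and c3 / c4 are alternative ends, mimicking two similar genes.
seqs = {"c1": "CCATGAAA", "c2": "GGGCCC", "c3": "TTTTAAGG", "c4": "GGTTGAC"}
graph = {"c1": ["c2"], "c2": ["c3", "c4"]}
result = orfs_spanning_paths(seqs, graph)
for path, orf in sorted(result):
    print(" -> ".join(path), orf)
```

In this toy example no single contig contains a complete ORF, but the two paths through the branch at c2 each yield one, which is the situation the abstract describes for plasmids carrying several similar IPGs.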

2019 ◽  
Vol 20 (S15) ◽  
Author(s):  
Prapaporn Techa-Angkoon ◽  
Kevin L. Childs ◽  
Yanni Sun

Abstract Background Gene prediction is a key step in genome annotation. Ab initio gene prediction enables gene annotation of new genomes regardless of availability of homologous sequences. There exist a number of ab initio gene prediction tools and they have been widely used for gene annotation for various species. However, existing tools are not optimized for identifying genes with highly variable GC content. In addition, some genes in grass genomes exhibit a sharp 5′-3′ decreasing GC content gradient, which is not carefully modeled by available gene prediction tools. Thus, there is still room to improve the sensitivity and accuracy of predicting genes with GC gradients. Results In this work, we designed and implemented a new hidden Markov model (HMM)-based ab initio gene prediction tool, which is optimized for finding genes with highly variable GC contents, such as the genes with negative GC gradients in grass genomes. We tested the tool on three datasets from Arabidopsis thaliana and Oryza sativa. The results showed that our tool can identify genes missed by existing tools due to the highly variable GC contents. Conclusions GPRED-GC can effectively predict genes with highly variable GC contents without manual intervention. It provides a useful complementary tool to existing ones such as Augustus for more sensitive gene discovery. The source code is freely available at https://sourceforge.net/projects/gpred-gc/.
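
To make the notion of a 5′-3′ decreasing GC gradient concrete, here is a small Python sketch that measures GC content in consecutive windows along a sequence and fits a least-squares slope to the window values. It is an illustration only and has nothing to do with the HMM used by GPRED-GC; the function names, the window size and the toy sequence are assumptions.

```python
from typing import List


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def windowed_gc(seq: str, window: int, step: int) -> List[float]:
    """GC fraction of each full window, read 5' to 3'."""
    return [gc_fraction(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


def gc_slope(values: List[float]) -> float:
    """Least-squares slope of GC fraction against window index."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


# Hypothetical toy gene: GC-rich at the 5' end, AT-rich at the 3' end.
toy_gene = "GCGCGCGCGC" + "GCATGCATGC" + "ATATATATAT"
profile = windowed_gc(toy_gene, window=10, step=10)
slope = gc_slope(profile)
print(profile, slope)
```

A negative slope indicates that GC content falls from the 5′ end towards the 3′ end, which is the pattern the abstract reports for some grass genes.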


2016 ◽  
Vol 80 (3) ◽  
pp. iii-iii ◽  
Author(s):  
Maissa Chakroun ◽  
Núria Banyuls ◽  
Yolanda Bel ◽  
Baltasar Escriche ◽  
Juan Ferré

2019 ◽  
Author(s):  
Patrick Sorn ◽  
Christoph Holtsträter ◽  
Martin Löwer ◽  
Ugur Sahin ◽  
David Weber

Abstract Motivation Gene fusions are an important class of transcriptional variants that can influence cancer development and can be predicted from RNA sequencing (RNA-seq) data by multiple existing tools. However, the real-world performance of these tools is unclear due to the lack of known positive and negative events, especially with regard to fusion genes in individual samples. Often simulated reads are used, but these cannot account for all technical biases in RNA-seq data generated from real samples. Results Here, we present ArtiFuse, a novel approach that simulates fusion genes by sequence modification to the genomic reference, and therefore, can be applied to any RNA-seq dataset without the need for any simulated reads. We demonstrate our approach on eight RNA-seq datasets for three fusion gene prediction tools: average recall values peak for all three tools between 0.4 and 0.56 for high-quality and high-coverage datasets. As ArtiFuse affords total control over involved genes and breakpoint position, we also assessed performance with regard to gene-related properties, showing a drop in recall for low-expressed genes in high-coverage samples and for genes with co-expressed paralogues. Overall tool performance assessed from ArtiFusions is lower compared to previously reported estimates on simulated reads. Due to the use of real RNA-seq datasets, we believe that ArtiFuse provides a more realistic benchmark that can be used to develop more accurate fusion gene prediction tools for application in clinical settings. Availability and implementation ArtiFuse is implemented in Python. The source code and documentation are available at https://github.com/TRON-Bioinformatics/ArtiFusion. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Chander Jyoti ◽  
Sandeep Saini ◽  
Varinder Kumar ◽  
Kajal Abrol ◽  
Kanchan Pandey ◽  
...  

1995 ◽  
Vol 41 (9) ◽  
pp. 792-799 ◽  
Author(s):  
George W. Sundin ◽  
Dave E. Monks ◽  
Carol L. Bender

The distribution of the strA–strB streptomycin-resistance (Smr) genes associated with Tn5393 was examined in bacteria isolated from the phylloplane and soil of ornamental pear and tomato. Two ornamental pear nurseries received previous foliar applications of streptomycin, whereas the tomato fields had no prior exposure to streptomycin bactericides. Although the recovery of culturable Smr bacteria was generally higher from soil, the highest occurrence of Smr was observed in phylloplane bacteria of an ornamental pear nursery that received 15 annual applications of streptomycin during the previous 2 years. Twenty-two and 12% of 143 Gram-negative phylloplane and 163 Gram-negative soil isolates, respectively, contained sequences that hybridized to probes specific for the strA–strB Smr genes and for the transposase and resolvase genes of Tn5393. These sequences were located on large plasmids (>60 kb) in 74% of the isolates. The 77 Smr Gram-positive bacteria isolated in the present study showed no homology to the Tn5393-derived probes. Although the repeated use of a single antibiotic in clinical situations is known to favor the development of strains with resistance to other antibiotics, we found no evidence that intensive streptomycin usage in agricultural habitats favors the development of resistance to tetracycline, an antibiotic also registered for disease control on plants. The detection of Tn5393 in bacteria with no prior exposure to streptomycin suggests that this transposon is indigenous to both phylloplane and soil microbial communities. Key words: streptomycin, tetracycline, antibiotic resistance, phylloplane, transposon.


2021 ◽  
Author(s):  
Nicholas J. Dimonaco ◽  
Wayne Aubrey ◽  
Kim Kenobi ◽  
Amanda Clare ◽  
Christopher J. Creevey

Motivation: The biases in Open Reading Frame (ORF) prediction tools, which have been based on historic genomic annotations from model organisms, impact our understanding of novel genomes and metagenomes. This hinders the discovery of new genomic information as it results in predictions being biased towards existing knowledge. To date users have lacked a systematic and replicable approach to identify the strengths and weaknesses of any ORF prediction tool and allow them to choose the right tool for their analysis. Results: We present an evaluation framework ("ORForise") based on a comprehensive set of 12 primary and 60 secondary metrics that facilitate the assessment of the performance of ORF prediction tools. This makes it possible to identify which tool performs better for specific use-cases. We use this to assess 15 ab initio and model-based tools representing those most widely used (historically and currently) to generate the knowledge in genomic databases. We find that the performance of any tool is dependent on the genome being analysed, and no individual tool ranked as the most accurate across all genomes or metrics analysed. Even the top-ranked tools produced conflicting gene collections which could not be resolved by aggregation. The ORForise evaluation framework provides users with a replicable, data-led approach to make informed tool choices for novel genome annotations and for refining historical annotations.


2020 ◽  
Author(s):  
Shan Sun ◽  
Roshonda B. Jones ◽  
Anthony A. Fodor

Abstract Background: Despite recent decreases in the cost of sequencing, shotgun metagenome sequencing remains more expensive than 16S rRNA amplicon sequencing. Methods have been developed to predict the functional profiles of microbial communities based on their taxonomic composition. In this study, we evaluated the performance of three commonly used metagenome prediction tools (PICRUSt, PICRUSt2 and Tax4Fun) by comparing the significance of the differential abundance of predicted functional gene profiles to those from shotgun metagenome sequencing across different environments. Results: We selected 7 datasets of human, non-human animal and environmental (soil) samples that have publicly available 16S rRNA and shotgun metagenome sequences. As we would expect based on previous literature, strong Spearman correlations were observed between predicted gene compositions and gene relative abundance measured with shotgun metagenome sequencing. However, these strong correlations were preserved even when the abundances of genes were permuted across samples. This suggests that a simple correlation coefficient is a highly unreliable measure for the performance of metagenome prediction tools. As an alternative, we compared the performance of genes predicted with PICRUSt, PICRUSt2 and Tax4Fun to sequenced metagenome genes in inference models associated with metadata within each dataset. With this approach, we found reasonable performance for human datasets, with the metagenome prediction tools performing better for inference on genes related to “house-keeping” functions. However, their performance degraded sharply outside of human datasets when used for inference. Conclusion: We conclude that the utility of PICRUSt, PICRUSt2 and Tax4Fun for inference with the default database is likely limited outside of human samples and that development of tools for gene prediction specific to different non-human and environmental samples is warranted.
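
The permutation result in this abstract can be reproduced on toy numbers. The Python sketch below is an illustration under stated assumptions, not the authors' analysis: the abundance table is invented, the "predictor" is a monotone transform of the measured values, and the permutation is a fixed per-gene rotation across samples. It computes a per-sample Spearman correlation across genes, first with matched samples and then after the measured abundances of each gene have been shuffled across samples.

```python
import math
from typing import List


def ranks(values: List[float]) -> List[int]:
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation computed as the Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


def column(matrix: List[List[float]], s: int) -> List[float]:
    """Abundance of every gene in sample s."""
    return [row[s] for row in matrix]


# Hypothetical table: rows are genes, columns are samples. Gene-level means
# differ far more than the sample-to-sample variation within a gene.
measured = [
    [100, 110, 90],
    [50, 40, 60],
    [45, 55, 35],
    [10, 12, 8],
    [1, 2, 3],
]
n_samples = 3

# An idealised predictor: a monotone transform of the measured values.
predicted = [[2 * v + 1 for v in row] for row in measured]

# Shuffle each gene's measured values across samples (fixed rotation by 1 or 2).
permuted = [[row[(s + 1 + g % 2) % n_samples] for s in range(n_samples)]
            for g, row in enumerate(measured)]

matched_rho = [spearman(column(predicted, s), column(measured, s))
               for s in range(n_samples)]
permuted_rho = [spearman(column(predicted, s), column(permuted, s))
                for s in range(n_samples)]
print(matched_rho, permuted_rho)
```

In this toy table the permuted correlations stay at 0.9 or above even though the sample labels no longer match, because the ranking of genes is dominated by differences between genes that are shared by every sample. That is the reason the abstract gives for treating a simple correlation coefficient as an unreliable performance measure.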


PeerJ ◽  
2020 ◽  
Vol 8 ◽  
pp. e9762
Author(s):  
Andres Benavides ◽  
Friman Sanchez ◽  
Juan F. Alzate ◽  
Felipe Cabarcas

Background A prime objective in metagenomics is to classify DNA sequence fragments into taxonomic units. It usually requires several stages: read quality control, de novo assembly, contig annotation, gene prediction, etc. These stages need very efficient programs because of the large number of reads produced by these projects. Furthermore, the complexity of metagenomes requires efficient and automatic tools that orchestrate the different stages. Method DATMA is a pipeline for fast metagenomic analysis that orchestrates the following: sequencing quality control, 16S rRNA-identification, reads binning, de novo assembly and evaluation, gene prediction, and taxonomic annotation. Its distributed computing model can use multiple computing resources to reduce the analysis time. Results We used a controlled experiment to show DATMA functionality, and two pre-annotated metagenomes to compare its accuracy and speed against other metagenomic frameworks. Then, with DATMA we recovered a draft genome of a novel Anaerolineaceae from a biosolid metagenome. Conclusions DATMA is a bioinformatics tool that automatically analyzes complex metagenomes. It is faster than similar tools and, in some cases, it can extract genomes that the other tools do not. DATMA is freely available at https://github.com/andvides/DATMA.


1986 ◽  
Vol 51 (1) ◽  
pp. 44-51 ◽  
Author(s):  
David J. Hardman ◽  
Peter C. Gowland ◽  
J. Howard Slater
