Whokaryote: distinguishing eukaryotic and prokaryotic contigs in metagenomes based on gene structure

2021 ◽  
Author(s):  
Lotte J U Pronk ◽  
Marnix H Medema

Metagenomics has become a prominent technology to study the functional potential of all organisms in a microbial community. Most studies focus on the bacterial content of these communities, while ignoring eukaryotic microbes. Indeed, many metagenomics analysis pipelines silently assume that all contigs in a metagenome are prokaryotic. However, because of marked differences in gene structure, prokaryotic gene prediction tools fail to accurately predict eukaryotic genes. Here, we developed a classifier that distinguishes eukaryotic from prokaryotic contigs based on foundational differences between these taxa in gene structure. We first developed a random forest classifier that uses intergenic distance, gene density and gene length as the most important features. We show that, with an estimated accuracy of 97%, this classifier with principled features grounded in biology can perform almost as well as the classifiers EukRep and Tiara, which use k-mer frequencies as features. By re-training our classifier with Tiara predictions as an additional feature, weaknesses of both types of classifiers are compensated; the result is an enhanced classifier that outperforms all individual classifiers, with an F1-score of 1.00 on precision, recall and accuracy for both eukaryotes and prokaryotes, while still being fast. In a reanalysis of metagenome data from a disease-suppressive plant endosphere microbial community, we show how using Whokaryote to select contigs for eukaryotic gene prediction facilitates the discovery of several biosynthetic gene clusters that were missed in the original study. Our enhanced classifier, which we call 'Whokaryote', is wrapped in an easily installable package and is freely available from https://git.wageningenur.nl/lotte.pronk/whokaryote.
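To make the three gene-structure features named in the abstract concrete, the following is a minimal sketch of how they could be computed for a single contig from predicted gene coordinates. It is not the Whokaryote implementation: the function name `gene_structure_features`, the input format (1-based inclusive start/end pairs), the definition of gene density as genes per kilobase, and the choice to count overlapping genes as a gap of zero are all assumptions made for illustration.

```python
def gene_structure_features(genes, contig_length):
    """Summarise gene structure on one contig.

    genes: list of (start, end) tuples, 1-based inclusive (assumed format).
    contig_length: contig length in base pairs.
    Returns a dict with three illustrative features.
    """
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    if not genes:
        return {
            "gene_density": 0.0,
            "mean_gene_length": 0.0,
            "mean_intergenic_distance": 0.0,
        }

    ordered = sorted(genes)

    # Gene length in base pairs, inclusive of both end coordinates.
    lengths = [end - start + 1 for start, end in ordered]

    # Distance between the end of one gene and the start of the next.
    # Overlapping genes are counted as a gap of 0 (an assumption).
    gaps = [
        max(0, next_start - prev_end - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]

    return {
        # Genes per kilobase of contig (one possible density definition).
        "gene_density": len(ordered) / contig_length * 1000,
        "mean_gene_length": sum(lengths) / len(lengths),
        "mean_intergenic_distance": sum(gaps) / len(gaps) if gaps else 0.0,
    }


features = gene_structure_features([(1, 100), (201, 300), (401, 1000)], 1000)
```

Per-contig feature dictionaries of this kind could then be passed as rows to any random forest implementation; which additional features and hyperparameters the authors used is not stated in the abstract.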

2020 ◽  
Vol 117 (24) ◽  
pp. 13800-13809 ◽  
Author(s):  
Hans-Wilhelm Nützmann ◽  
Daniel Doerr ◽  
América Ramírez-Colmenero ◽  
Jesús Emiliano Sotelo-Fonseca ◽  
Eva Wegel ◽  
...  

While colocalization within a bacterial operon enables coexpression of the constituent genes, the mechanistic logic of clustering of nonhomologous monocistronic genes in eukaryotes is not immediately obvious. Biosynthetic gene clusters that encode pathways for specialized metabolites are an exception to the classical eukaryote rule of random gene location and provide paradigmatic exemplars with which to understand eukaryotic cluster dynamics and regulation. Here, using 3C, Hi-C, and Capture Hi-C (CHi-C) organ-specific chromosome conformation capture techniques along with high-resolution microscopy, we investigate how chromosome topology relates to transcriptional activity of clustered biosynthetic pathway genes in Arabidopsis thaliana. Our analyses reveal that biosynthetic gene clusters are embedded in local hot spots of 3D contacts that segregate cluster regions from the surrounding chromosome environment. The spatial conformation of these cluster-associated domains differs between transcriptionally active and silenced clusters. We further show that silenced clusters associate with heterochromatic chromosomal domains toward the periphery of the nucleus, while transcriptionally active clusters relocate away from the nuclear periphery. Examination of chromosome structure at unrelated clusters in maize, rice, and tomato indicates that integration of clustered pathway genes into distinct topological domains is a common feature in plant genomes. Our results shed light on the potential mechanisms that constrain coexpression within clusters of nonhomologous eukaryotic genes and suggest that gene clustering in the one-dimensional chromosome is accompanied by compartmentalization of the 3D chromosome.


2019 ◽  
Author(s):  
Nicolas Scalzitti ◽  
Anne Jeannin-Girardon ◽  
Pierre Collet ◽  
Olivier Poch ◽  
Julie Dawn Thompson

Abstract Background: The draft genome assemblies produced by new sequencing technologies present important challenges for automatic gene prediction pipelines, leading to less accurate gene models. New benchmark methods are needed to evaluate the accuracy of gene prediction methods in the face of incomplete genome assemblies, low genome coverage and quality, complex gene structures, or a lack of suitable sequences for evidence-based annotations. Results: We describe the construction of a new benchmark, called G3PO (benchmark for Gene and Protein Prediction PrOgrams), designed to represent many of the typical challenges faced by current genome annotation projects. The benchmark is based on a carefully validated and curated set of real eukaryotic genes from 147 phylogenetically disperse organisms, and a number of test sets are defined to evaluate the effects of different features, including genome sequence quality, gene structure complexity, protein length, etc. We used the benchmark to perform an independent comparative analysis of the most widely used ab initio gene prediction programs and identified the main strengths and weaknesses of the programs. More importantly, we highlight a number of features that could be exploited in order to improve the accuracy of current prediction tools. Conclusions: The experiments showed that ab initio gene structure prediction is a very challenging task, which should be further investigated. We believe that the baseline results associated with the complex gene test sets in G3PO provide useful guidelines for future studies.


Microbiome ◽  
2021 ◽  
Vol 9 (1) ◽  
Author(s):  
Lei Liu ◽  
Yulin Wang ◽  
Yu Yang ◽  
Depeng Wang ◽  
Suk Hang Cheng ◽  
...  

Abstract Background: Long-read sequencing has shown its tremendous potential to address genome assembly challenges, e.g., achieving the first telomere-to-telomere assembly of a gapless human chromosome. However, many issues remain unresolved when leveraging error-prone long reads to characterize high-complexity metagenomes, for instance, complete/high-quality genome reconstruction from highly complex systems. Results: Here, we developed an iterative haplotype-resolved hierarchical clustering-based hybrid assembly (HCBHA) approach that capitalizes on a hybrid (error-prone long reads and high-accuracy short reads) sequencing strategy to reconstruct (near-) complete genomes from highly complex metagenomes. Using the HCBHA approach, we first phase short and long reads from the highly complex metagenomic dataset into different candidate bacterial haplotypes, then perform hybrid assembly of each bacterial genome individually. We reconstructed 557 metagenome-assembled genomes (MAGs) with an average N50 of 574 Kb from a deeply sequenced, highly complex activated sludge (AS) metagenome. These high-contiguity MAGs contained 14 closed genomes and 111 high-quality (HQ) MAGs including full-length rRNA operons, which accounted for 61.1% of the microbial community. Leveraging the near-complete genomes, we also profiled the metabolic potential of the AS microbiome and identified 2153 biosynthetic gene clusters (BGCs) encoded within the recovered AS MAGs. Conclusion: Our results established the feasibility of an iterative haplotype-resolved HCBHA approach to reconstruct (near-) complete genomes from highly complex ecosystems, providing new insights into “complete metagenomics”. The retrieved high-contiguity MAGs illustrated that various biosynthetic gene clusters (BGCs) were harbored in the AS microbiome. The high diversity of BGCs highlights the potential to discover new natural products biosynthesized by the AS microbial community, aside from the traditional function (e.g., organic carbon and nitrogen removal) in wastewater treatment.
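The abstract reports contiguity as an average N50 across MAGs. For readers unfamiliar with the metric, here is a minimal sketch of the standard N50 definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name `n50` and the example lengths are illustrative only and are not taken from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in base pairs)."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")

    # Walk from the longest contig down, accumulating length until
    # at least half of the total assembly length is covered.
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


example = n50([2, 2, 2, 3, 3, 4, 8, 8])
```

An "average N50" as quoted in the abstract would presumably be obtained by computing this value separately for each MAG and then averaging over MAGs, although the abstract does not spell out the procedure.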


Author(s):  
Patrick Videau ◽  
Kaitlyn Wells ◽  
Arun Singh ◽  
Jessie Eiting ◽  
Philip Proteau ◽  
...  

Cyanobacteria are prolific producers of natural products and genome mining has shown that many orphan biosynthetic gene clusters can be found in sequenced cyanobacterial genomes. New tools and methodologies are required to investigate these biosynthetic gene clusters and here we present the use of Anabaena sp. strain PCC 7120 as a host for combinatorial biosynthesis of natural products using the indolactam natural products (lyngbyatoxin A, pendolmycin, and teleocidin B-4) as a test case. We were able to successfully produce all three compounds using codon optimized genes from Actinobacteria. We also introduce a new plasmid backbone based on the native Anabaena 7120 plasmid pCC7120ζ and show that production of teleocidin B-4 can be accomplished using a two-plasmid system, which can be introduced by co-conjugation.


eLife ◽  
2015 ◽  
Vol 4 ◽  
Author(s):  
Zachary Charlop-Powers ◽  
Jeremy G Owen ◽  
Boojala Vijay B Reddy ◽  
Melinda A Ternei ◽  
Denise O Guimarães ◽  
...  

Recent bacterial (meta)genome sequencing efforts suggest the existence of an enormous untapped reservoir of natural-product-encoding biosynthetic gene clusters in the environment. Here we use the pyro-sequencing of PCR amplicons derived from both nonribosomal peptide adenylation domains and polyketide ketosynthase domains to compare biosynthetic diversity in soil microbiomes from around the globe. We see large differences in domain populations from all except the most proximal and biome-similar samples, suggesting that most microbiomes will encode largely distinct collections of bacterial secondary metabolites. Our data indicate a correlation between two factors, geographic distance and biome-type, and the biosynthetic diversity found in soil environments. By assigning reads to known gene clusters we identify hotspots of biomedically relevant biosynthetic diversity. These observations not only provide new insights into the natural world, they also provide a road map for guiding future natural products discovery efforts.


2021 ◽  
Author(s):  
Xuhua Mo ◽  
Tobias A. M. Gulder

Over 30 biosynthetic gene clusters for natural tetramates have been identified. This highlight reviews, for the first time, the biosynthetic strategies for formation of the tetramic acid unit, discussing the individual molecular mechanisms in detail.

