CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes

Author(s):  
Donovan H Parks ◽  
Michael Imelfort ◽  
Connor T Skennerton ◽  
Philip Hugenholtz ◽  
Gene W Tyson

Large-scale recovery of genomes from isolates, single cells, and metagenomic data has been made possible by advances in computational methods and substantial reductions in sequencing costs. While this increasing breadth of draft genomes is providing key information regarding the evolutionary and functional diversity of microbial life, it has become impractical to finish all available reference genomes. Making robust biological inferences from draft genomes requires accurate estimates of their completeness and contamination. Current methods for assessing genome quality are ad hoc and generally make use of a limited number of ‘marker’ genes conserved across all bacterial or archaeal genomes. Here we introduce CheckM, an automated method for assessing the quality of a genome using a broader set of marker genes specific to the position of a genome within a reference genome tree and information about the collocation of these genes. We demonstrate the effectiveness of CheckM using synthetic data and a wide range of isolate, single cell and metagenome derived genomes. CheckM is shown to provide accurate estimates of genome completeness and contamination, and to outperform existing approaches. Using CheckM, we identify a diverse range of errors currently impacting publicly available isolate genomes and demonstrate that genomes obtained from single cells and metagenomic data vary substantially in quality. In order to facilitate the use of draft genomes, we propose an objective measure of genome quality that can be used to select genomes suitable for specific gene- and genome-centric analyses of microbial communities.

CheckM is open source software available at http://ecogenomics.github.io/CheckM.
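
The CheckM abstract describes estimating completeness and contamination from marker genes grouped by collocation, but it does not give the formulas. The following is a minimal sketch of one way such estimates could be computed, assuming that each collocated marker set contributes equally, that a marker counts as present if found at least once, and that each extra copy of a marker counts toward contamination. The function name `estimate_quality` and its inputs are hypothetical and are not taken from the CheckM software.

```python
def estimate_quality(marker_sets, observed_counts):
    """Return (completeness %, contamination %) for one genome.

    marker_sets: list of sets of marker gene IDs; each set stands for a
        group of collocated markers (hypothetical input format).
    observed_counts: dict mapping marker gene ID -> copies found in the genome.
    """
    if not marker_sets:
        raise ValueError("at least one marker set is required")

    completeness = 0.0
    contamination = 0.0
    for marker_set in marker_sets:
        # Fraction of this set's markers seen at least once.
        present = sum(1 for gene in marker_set if observed_counts.get(gene, 0) >= 1)
        # Extra copies beyond the first are treated as contamination (assumption).
        extra = sum(max(observed_counts.get(gene, 0) - 1, 0) for gene in marker_set)
        completeness += present / len(marker_set)
        contamination += extra / len(marker_set)

    n_sets = len(marker_sets)
    return 100.0 * completeness / n_sets, 100.0 * contamination / n_sets
```

Averaging per set rather than per gene means that a cluster of collocated markers lost or duplicated together shifts the estimate less than the same number of independent markers would.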


2020 ◽  
Vol 36 (10) ◽  
pp. 3011-3017 ◽  
Author(s):  
Olga Mineeva ◽  
Mateo Rojas-Carulla ◽  
Ruth E Ley ◽  
Bernhard Schölkopf ◽  
Nicholas D Youngblut

Abstract. Motivation: Methodological advances in metagenome assembly are rapidly increasing the number of published metagenome assemblies. However, identifying misassemblies is challenging due to a lack of closely related reference genomes that can act as pseudo ground truth. Existing reference-free methods are no longer maintained, can make strong assumptions that may not hold across a diversity of research projects, and have not been validated on large-scale metagenome assemblies. Results: We present DeepMAsED, a deep learning approach for identifying misassembled contigs without the need for reference genomes. Moreover, we provide an in silico pipeline for generating large-scale, realistic metagenome assemblies for comprehensive model training and testing. DeepMAsED accuracy substantially exceeds the state of the art when applied to large and complex metagenome assemblies. Our model estimates a 1% contig misassembly rate in two recent large-scale metagenome assembly publications. Conclusions: DeepMAsED accurately identifies misassemblies in metagenome-assembled contigs from a broad diversity of bacteria and archaea without the need for reference genomes or strong modeling assumptions. Running DeepMAsED is straightforward, as is re-training the model with our dataset generation pipeline. Therefore, DeepMAsED is a flexible misassembly classifier that can be applied to a wide range of metagenome assembly projects. Availability and implementation: DeepMAsED is available from GitHub at https://github.com/leylabmpi/DeepMAsED. Supplementary information: Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 7 (6) ◽  
pp. 161 ◽  
Author(s):  
Ming-Hsin Tsai ◽  
Yen-Yi Liu ◽  
Von-Wun Soo ◽  
Chih-Chieh Chen

Microbial diversity has always presented taxonomic challenges. With the popularity of next-generation sequencing technology, more unculturable bacteria have been sequenced, facilitating the discovery of additional new species and complicating current microbial classification. The major challenge is to assign appropriate taxonomic names. Hence, assessing the consistency between taxonomy and genomic relatedness is critical. We proposed a genome comparison approach and applied it in a large-scale survey to investigate the distribution of genomic differences among microorganisms. The approach applies a genome-wide criterion, the homologous coverage ratio (HCR), to describe the homology between species. The survey included 7861 microbial genomes, excluding plasmids, and 1220 pairs of genera exhibited ambiguous classification. In this study, we also compared the performance of HCR and average nucleotide identity (ANI). The results indicated that HCR and ANI analyses yield comparable results, but a few examples suggested that HCR has a superior clustering effect. In addition, we used the Genome Taxonomy Database (GTDB), the gold standard for taxonomy, to validate our analysis. The GTDB offers 120 ubiquitous single-copy proteins as marker genes for species classification. We determined that analysis based on the GTDB still leaves classification boundaries blurred between some genera and that the marker gene-based approach has limitations. Although the choice of marker genes has been quite rigorous, bias in marker gene selection remains unavoidable. Therefore, methods based on genomic alignment should be considered for species classification in order to avoid this bias. On the basis of our observations of microbial diversity, microbial classification should be re-examined using genome-wide comparisons.


2018 ◽  
Vol 2018 ◽  
pp. 1-10 ◽  
Author(s):  
Samira Melki ◽  
Moncef Gueddari

The production of phosphoric acid by the Tunisian Chemical Group in Sfax, Tunisia, has degraded the groundwater quality of the Sfax-Agareb aquifer, mainly through the infiltration of phosphogypsum leachates. Spatiotemporal monitoring of groundwater quality was carried out by bimonthly sampling between October 2013 and October 2014. The samples collected were subjected to measurements of physicochemical parameters and to analysis of major elements, orthophosphates, fluorine, trace metals, and stable isotopes (¹⁸O, ²H). The results show that phosphogypsum leachate infiltration has a major effect on the downstream part of the aquifer, where the highest values of conductivity, SO₄²⁻, ortho-P, and F⁻ and the lowest pH were recorded. In addition, the results indicate that the phosphogypsum leachates contained much higher amounts of Cr, Cd, Zn, Cu, Fe, and Al than the groundwater. The spatiotemporal variation of conductivity and of the concentrations of major elements is linked to phosphogypsum leachate infiltration as well as to a wide range of factors such as natural recharge conditions and water residence time. The ¹⁸O and ²H contents showed that the water of the Sfax-Agareb aquifer originates from recent rainfall and has undergone large-scale evaporation.


2018 ◽  
Vol 175 ◽  
pp. 03001
Author(s):  
Han Yang ◽  
Chen Kerui ◽  
Li Yang ◽  
Qu Bao

In the twenty-first century, China has vigorously promoted research on and construction of AC and DC transmission technology in order to ensure the optimal allocation of energy resources on a large scale [1]. In the construction of AC UHV transmission lines, the welding quality of the tower and its stiffening plates, which form the load-bearing and tension-carrying welded structure, plays an important role in the overall quality of the steel structure. In the past, semi-automatic carbon dioxide welding with solid-core wire often suffered from weld spatter that was difficult to clean up and from low welding efficiency. Semi-automatic CO2 flux-cored arc welding (FCAW) tolerates a wide range of current and voltage and offers a high melting speed, which makes it significant for improving the process. This paper describes the application of the technology in practical engineering and develops a basic training strategy for welding technicians working on grid steel structures. It also presents V-groove plate butt welding with FCAW as a typical welding project, in the hope that this welding process will continue to spread.


2019 ◽  
Author(s):  
Salim Bougarn ◽  
Sabri Boughorbel ◽  
Damien Chaussabel ◽  
Nico Marr

Abstract. Primary immunodeficiencies (PIDs) are a heterogeneous group of inherited disorders, frequently caused by loss-of-function and less commonly by gain-of-function mutations, which can result in susceptibility to a broad or a very narrow range of infections but also in inflammatory, allergic or malignant diseases. Owing to the wide range in clinical manifestations and variability in penetrance and expressivity, there is an urgent need to better understand the underlying molecular, cellular and immunological phenotypes in PID patients in order to improve clinical diagnosis and management. Here we have compiled a manually curated collection of public transcriptome datasets mainly obtained from human whole blood, peripheral blood mononuclear cells (PBMCs) or fibroblasts of patients with PIDs and of control subjects for subsequent meta-analysis, query and interpretation. A total of nineteen (19) datasets derived from studies of PID patients were identified and retrieved from the NCBI Gene Expression Omnibus (GEO) database and loaded in GXB, a custom web application designed for interactive query and visualization of integrated large-scale data. The dataset collection includes samples from well characterized PID patients that were stimulated ex vivo under a variety of conditions to assess the molecular consequences of the underlying, naturally occurring gene defects on a genome-wide scale. Multiple sample groupings and rank lists were generated to facilitate comparisons of the transcriptional responses between different PID patients and control subjects. The GXB tool enables browsing of a single transcript across studies, thereby providing new perspectives on the role of a given molecule across biological systems and PID patients. This dataset collection is available at: http://pid.gxbsidra.org/dm3/geneBrowser/list.


Author(s):  
С.И. Носков ◽  
М.П. Базилевский ◽  
Ю.А. Трофимов ◽  
А. Буяннэмэх

The article addresses the problem of developing an efficiency function (an aggregated criterion, or convolution of criteria) for the sections of the Ulan Bator Railway (UBZhD) that contains specially weighted partial characteristics of how well these sections function. The problem is solved using the information and computing technology (ICT) for multi-criteria assessment of the functioning efficiency of complex socio-economic and technical systems, developed at Irkutsk State Transport University. At the model level, the ICT expresses this efficiency as a single number (for example, a percentage), which opens up broad opportunities for managing such systems: in particular, it enables large-scale multifactor comparative analysis of the activities of homogeneous organizational and other structures and supports decisions of many kinds on that basis. An efficiency function for the UBZhD sections has been constructed that includes the following weighted partial indicators: loading, static load, unloading, dispatch of wagons, passenger transport, idle time of wagons with one processing operation, idle time of local wagons, idle time of transit wagons with processing, and idle time of transit wagons without processing. Based on this function, the efficiency of each section, scaled to one hundred percent, is calculated. All preference indicators are ordered by decreasing importance. Such information, generated annually, can be very useful to UBZhD management in making a wide range of managerial decisions, including personnel decisions. Similar work could be carried out for RAO Russian Railways.
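
The abstract does not state the form of the efficiency function. As an illustration only, the sketch below assumes a linear weighted convolution: each indicator is min-max normalized across sections, indicators for which smaller values are better (such as idle time) are inverted, and the weighted sum is scaled to 100. The function name, the normalization, and all weights are hypothetical and are not taken from the article.

```python
def efficiency_scores(sections, weights, lower_is_better=()):
    """Score each section on a 0-100 scale with a weighted sum of indicators.

    sections: dict mapping section name -> dict of indicator name -> value.
    weights: dict mapping indicator name -> non-negative weight (hypothetical).
    lower_is_better: indicator names for which smaller values are preferable.
    """
    names = list(sections)
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    scores = {name: 0.0 for name in names}
    for indicator, weight in weights.items():
        values = [sections[name][indicator] for name in names]
        lo, hi = min(values), max(values)
        for name in names:
            if hi == lo:
                # All sections are equal on this indicator: give full credit.
                norm = 1.0
            else:
                norm = (sections[name][indicator] - lo) / (hi - lo)
                if indicator in lower_is_better:
                    norm = 1.0 - norm
            scores[name] += (weight / total_weight) * norm

    return {name: 100.0 * score for name, score in scores.items()}
```

With weights sorted in descending order, the largest weight marks the indicator that contributes most to a section's score, which corresponds to the ordering by importance mentioned in the abstract.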


2018 ◽  
Author(s):  
Nikos Konstantinides ◽  
Katarina Kapuralin ◽  
Chaimaa Fadil ◽  
Luendreo Barboza ◽  
Rahul Satija ◽  
...  

Summary. Transcription factors regulate the molecular, morphological, and physiological characters of neurons and generate their impressive cell type diversity. To gain insight into general principles that govern how transcription factors regulate cell type diversity, we used large-scale single-cell mRNA sequencing to characterize the extensive cellular diversity in the Drosophila optic lobes. We sequenced 55,000 single optic lobe neurons and glia and assigned them to 52 clusters of transcriptionally distinct single cells. We validated the clustering and annotated many of the clusters using RNA sequencing of characterized FACS-sorted single cell types, as well as marker genes specific to given clusters. To identify transcription factors responsible for inducing specific terminal differentiation features, we used machine-learning to generate a ‘random forest’ model. The predictive power of the model was confirmed by showing that two transcription factors expressed specifically in cholinergic (apterous) and glutamatergic (traffic-jam) neurons are necessary for the expression of ChAT and VGlut in many, but not all, cholinergic or glutamatergic neurons, respectively. We used a transcriptome-wide approach to show that the same terminal characters, including but not restricted to neurotransmitter identity, can be regulated by different transcription factors in different cell types, arguing for extensive phenotypic convergence. Our data provide a deep understanding of the developmental and functional specification of a complex brain structure.


2019 ◽  
Author(s):  
Gaëtan Benoit ◽  
Mahendra Mariadassou ◽  
Stéphane Robin ◽  
Sophie Schbath ◽  
Pierre Peterlongo ◽  
...  

Abstract. Motivation: De novo comparative metagenomics is one of the most straightforward ways to analyze large sets of metagenomic data. The latest methods use the fraction of shared k-mers to estimate genomic similarity between read sets. However, those methods, while extremely efficient, are still limited by computational needs for practical usage outside of large computing facilities. Results: We present SimkaMin, a quick comparative metagenomics tool with low disk and memory footprints, thanks to an efficient data subsampling scheme used to estimate Bray-Curtis and Jaccard dissimilarities. One billion metagenomic reads can be analyzed in <3 min, with tiny memory (1.09 GB) and disk (≈0.3 GB) requirements and without altering the quality of the downstream comparative analyses, making SimkaMin a tool perfectly tailored for very large-scale metagenomic projects. Availability and implementation: https://github.com/GATB/simka. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
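
To make the quantities named in the SimkaMin abstract concrete, the sketch below computes Jaccard and Bray-Curtis dissimilarities between two read sets from their k-mer counts, with an optional hash-based subsample applied identically to both sets. The subsampling rule (keep a k-mer when its hash is divisible by `rate`) is an assumption chosen for brevity and is not claimed to be the scheme SimkaMin uses; all function names are hypothetical.

```python
import hashlib
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer of length k occurring in a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def subsample(counts, rate):
    """Keep k-mers whose hash is divisible by `rate` (illustrative rule only).

    The hash is deterministic, so the same k-mers are kept in every sample.
    """
    def keep(kmer):
        digest = hashlib.sha1(kmer.encode("ascii")).digest()
        return int.from_bytes(digest[:8], "big") % rate == 0

    return Counter({kmer: n for kmer, n in counts.items() if keep(kmer)})


def jaccard_dissimilarity(a, b):
    """1 - |shared distinct k-mers| / |all distinct k-mers| (presence/absence)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def bray_curtis_dissimilarity(a, b):
    """1 - 2 * sum(min counts) / (total counts in a + total counts in b)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    return 1.0 - 2.0 * shared / total
```

Jaccard uses only presence or absence of k-mers, while Bray-Curtis also accounts for their abundances, so the two can rank the same pair of samples differently.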

