MSPminer: abundance-based reconstitution of microbial pan-genomes from shotgun metagenomic data

2017 ◽  
Author(s):  
Florian Plaza Oñate ◽  
Emmanuelle Le Chatelier ◽  
Mathieu Almeida ◽  
Alessandra C. L. Cervino ◽  
Franck Gauthier ◽  
...  

Abstract Motivation Analysis toolkits for shotgun metagenomic data achieve strain-level characterization of complex microbial communities by capturing intra-species gene content variation. Yet, these tools are hampered by the extent of reference genomes that are far from covering all microbial variability, as many species are still not sequenced or have only few strains available. Binning co-abundant genes obtained from de novo assembly is a powerful reference-free technique to discover and reconstitute gene repertoire of microbial species. While current methods accurately identify species core parts, they miss many accessory genes or split them into small gene groups that remain unassociated to core clusters. Results We introduce MSPminer, a computationally efficient software tool that reconstitutes Metagenomic Species Pan-genomes (MSPs) by binning co-abundant genes across metagenomic samples. MSPminer relies on a new robust measure of proportionality coupled with an empirical classifier to group and distinguish not only species core genes but accessory genes also. Applied to a large scale metagenomic dataset, MSPminer successfully delineates in a few hours the gene repertoires of 1 661 microbial species with similar specificity and higher sensitivity than existing tools. The taxonomic annotation of MSPs reveals microorganisms hitherto unknown and brings coherence in the nomenclature of the species of the human gut microbiota. The provided MSPs can be readily used for taxonomic profiling and biomarkers discovery in human gut metagenomic samples. In addition, MSPminer can be applied on gene count tables from other ecosystems to perform similar analyses. Availability The binary is freely available for non-commercial users at enterome.fr/site/downloads/ Contact: [email protected] Supplementary information Available in the file named Supplementary Information.pdf
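The abstract does not define MSPminer's proportionality measure, so the following is only a rough illustration of what "co-abundant" can mean for two genes counted across samples. It assumes a deliberately simple, hypothetical rule: estimate the ratio between the two count vectors as the median of per-sample ratios, then report the fraction of samples that agree with that ratio within a tolerance. The function name, the `tolerance` parameter and the rule itself are assumptions, not the tool's method.

```python
from statistics import median


def proportionality(x, y, tolerance=0.25):
    """Estimate a ratio alpha such that y is roughly alpha * x.

    x and y are per-sample counts of two genes (same sample order).
    Returns (alpha, agreement), where agreement is the fraction of
    co-detected samples whose count in y lies within `tolerance`
    (relative) of alpha * x. This is a hypothetical illustration only.
    """
    # Use only samples in which both genes are detected.
    pairs = [(a, b) for a, b in zip(x, y) if a > 0 and b > 0]
    if not pairs:
        return None, 0.0
    # The median of ratios is less sensitive to outlier samples than the mean.
    alpha = median(b / a for a, b in pairs)
    within = sum(
        1 for a, b in pairs if abs(b - alpha * a) <= tolerance * alpha * a
    )
    return alpha, within / len(pairs)


gene_a = [10, 20, 0, 40, 5]
gene_b = [20, 41, 0, 80, 50]  # the last sample breaks the proportionality
alpha, agreement = proportionality(gene_a, gene_b)
```

In this toy input the two genes keep a ratio close to 2 in three of the four samples where both are detected, so a threshold on `agreement` would decide whether they are binned together.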

2019 ◽  
Author(s):  
Gaëtan Benoit ◽  
Mahendra Mariadassou ◽  
Stéphane Robin ◽  
Sophie Schbath ◽  
Pierre Peterlongo ◽  
...  

Abstract Motivation De novo comparative metagenomics is one of the most straightforward ways to analyze large sets of metagenomic data. Latest methods use the fraction of shared k-mers to estimate genomic similarity between read sets. However, those methods, while extremely efficient, are still limited by computational needs for practical usage outside of large computing facilities. Results We present SimkaMin, a quick comparative metagenomics tool with low disk and memory footprints, thanks to an efficient data subsampling scheme used to estimate Bray-Curtis and Jaccard dissimilarities. One billion metagenomic reads can be analyzed in <3 min, with tiny memory (1.09 GB) and disk (≈0.3 GB) requirements and without altering the quality of the downstream comparative analyses, making SimkaMin a tool perfectly tailored for very large-scale metagenomic projects. Availability and implementation https://github.com/GATB/simka. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
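As a minimal sketch of the idea of estimating k-mer based dissimilarities from a subsample, the Python below counts k-mers in two read sets, keeps only the k-mers whose hash falls under a threshold, and computes Jaccard and Bray-Curtis dissimilarities on what remains. It is not SimkaMin's implementation: the hash-threshold subsampling rule, the function names and the `keep_fraction` parameter are assumptions made for illustration, and reverse complements are ignored.

```python
import hashlib
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def subsample(counts, keep_fraction):
    """Keep k-mers whose 64-bit hash is below keep_fraction * 2**64.

    The decision depends only on the k-mer itself, so the same k-mers
    are retained in every sample and the samples stay comparable.
    """
    limit = int(keep_fraction * 2**64)
    kept = {}
    for kmer, count in counts.items():
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        if int.from_bytes(digest, "big") < limit:
            kept[kmer] = count
    return kept


def jaccard_dissimilarity(a, b):
    """1 - |A ∩ B| / |A ∪ B| on the sets of distinct k-mers."""
    union = set(a) | set(b)
    if not union:
        return 0.0
    return 1.0 - len(set(a) & set(b)) / len(union)


def bray_curtis_dissimilarity(a, b):
    """1 - 2 * sum(min(a_i, b_i)) / (sum(a) + sum(b)) on k-mer counts."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    return 1.0 - 2.0 * shared / total


sample_a = kmer_counts(["ACGTAC"], k=3)
sample_b = kmer_counts(["ACGTTT"], k=3)
jac = jaccard_dissimilarity(sample_a, sample_b)
bc = bray_curtis_dissimilarity(sample_a, sample_b)
```

With real data one would call `subsample` on each sample with the same `keep_fraction` before computing the dissimilarities, trading a small loss of precision for a much smaller set of k-mers to hold in memory.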


2018 ◽  
Vol 35 (3) ◽  
pp. 380-388 ◽  
Author(s):  
Wei Zheng ◽  
Qi Mao ◽  
Robert J Genco ◽  
Jean Wactawski-Wende ◽  
Michael Buck ◽  
...  

Abstract Motivation The rapid development of sequencing technology has led to an explosive accumulation of genomic data. Clustering is often the first step to be performed in sequence analysis. However, existing methods scale poorly with respect to the unprecedented growth of input data size. As high-performance computing systems are becoming widely accessible, it is highly desired that a clustering method can easily scale to handle large-scale sequence datasets by leveraging the power of parallel computing. Results In this paper, we introduce SLAD (Separation via Landmark-based Active Divisive clustering), a generic computational framework that can be used to parallelize various de novo operational taxonomic unit (OTU) picking methods and comes with theoretical guarantees on both accuracy and efficiency. The proposed framework was implemented on Apache Spark, which allows for easy and efficient utilization of parallel computing resources. Experiments performed on various datasets demonstrated that SLAD can significantly speed up a number of popular de novo OTU picking methods and meanwhile maintains the same level of accuracy. In particular, the experiment on the Earth Microbiome Project dataset (∼2.2B reads, 437 GB) demonstrated the excellent scalability of the proposed method. Availability and implementation Open-source software for the proposed method is freely available at https://www.acsu.buffalo.edu/~yijunsun/lab/SLAD.html. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Author(s):  
Hamid Mohamadi ◽  
Justin Chu ◽  
Lauren Coombe ◽  
Rene Warren ◽  
Inanc Birol

Abstract Motivation Repeat elements such as satellites, transposons, high number of gene copies, and segmental duplications are abundant in eukaryotic genomes. They often induce many local alignments, complicating sequence assembly and comparisons between genomes and analysis of large-scale duplications and rearrangements. Hence, identification and classification of repeats is a fundamental step in many genomics applications and their downstream analysis tools. Results In this work, we present an efficient streaming algorithm and software tool, ntHits, for de novo repeat identification based on the statistical analysis of the k-mer content profile of large-scale DNA sequencing data. In the proposed algorithm, we first obtain the k-mer coverage histograms of input datasets using the ntCard algorithm, an efficient streaming algorithm for estimating the k-mer coverage histograms. From the obtained k-mer coverage histogram, the repetitive k-mers would present a long tail to the distribution of k-mer coverage profile. Experimental results show that ntHits can efficiently and accurately identify the repeat content in large-scale DNA sequencing data. For example, ntHits accurately identifies the repeat k-mers in the white spruce sequencing data set with 96× sequencing coverage in about 12 hours and using less than 150 GB of memory, while using the exact methods for reporting the repeated k-mers takes several days and terabytes of memory and disk space. Availability ntHits is written in C++ and is released under the MIT License. It is freely available at https://github.com/bcgsc/[email protected]
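The link between a k-mer coverage histogram and its repetitive tail can be illustrated with a small exact-counting sketch in Python. This is not the ntHits or ntCard algorithm, both of which are streaming and approximate; here the counts are exact, and the cutoff rule (a fixed multiple of the most common multiplicity above 1) is a hypothetical choice, as are all function names and the `factor` parameter.

```python
from collections import Counter


def count_kmers(reads, k):
    """Exact k-mer counts (memory-hungry; real tools avoid this)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def coverage_histogram(counts):
    """Map each multiplicity to the number of distinct k-mers having it."""
    return Counter(counts.values())


def repeat_threshold(histogram, factor=3):
    """Hypothetical cutoff: `factor` times the modal multiplicity.

    Multiplicity 1 is skipped when locating the mode because singleton
    k-mers are dominated by sequencing errors in real data.
    """
    candidates = {m: n for m, n in histogram.items() if m > 1}
    if not candidates:
        return None
    # On ties, prefer the smaller multiplicity.
    mode = max(candidates, key=lambda m: (candidates[m], -m))
    return factor * mode


def repeat_kmers(counts, threshold):
    """K-mers in the long tail of the histogram, at or above the cutoff."""
    return {kmer for kmer, c in counts.items() if c >= threshold}


# Toy counts: most k-mers sit near 10x coverage, one is far out in the tail.
toy_counts = {"AAA": 1, "AAC": 10, "ACG": 10, "CGT": 11, "GTT": 10, "TTT": 95}
hist = coverage_histogram(toy_counts)
cutoff = repeat_threshold(hist)
repeats = repeat_kmers(toy_counts, cutoff)
```

In the toy input the mode is 10, so the cutoff is 30 and only the k-mer seen 95 times is reported as repetitive.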


2019 ◽  
Vol 35 (19) ◽  
pp. 3861-3863 ◽  
Author(s):  
Chuang Li ◽  
Kenli Li ◽  
Tao Chen ◽  
Yunping Zhu ◽  
Qiang He

Abstract Summary Tandem mass spectrometry based database searching is a widely acknowledged and adopted method that identifies peptide sequences in shotgun proteomics. However, database searching is extremely computationally expensive and can take days or even weeks to process a large spectra dataset. To address this critical issue, this paper presents SW-Tandem, a new tool for large-scale peptide sequencing. SW-Tandem parallelizes the spectrum dot product scoring algorithm and leverages the advantages of Sunway TaihuLight, the No. 1 supercomputer in the world in 2017. Sunway TaihuLight is powered by the brand new many-core SW26010 processors and provides a peak computation performance greater than 100 PFlops. To fully utilize Sunway TaihuLight's capacity, SW-Tandem employs three mechanisms to accelerate large-scale peptide identification: memory-access optimizations, double buffering and vectorization. The results of experiments conducted on multiple datasets demonstrate the performance of SW-Tandem against three state-of-the-art tools for peptide identification, including X!! Tandem, MR-Tandem and MSFragger. In addition, it shows high scalability in the experiments on extremely large datasets sized up to 12 GB. Availability and implementation SW-Tandem is an open source software tool implemented in C++. The source code and the parameter settings are available at https://github.com/Logic09/SW-Tandem. Supplementary information Supplementary data are available at Bioinformatics online.
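For readers unfamiliar with spectrum dot product scoring, a minimal serial sketch in Python follows. It bins the peaks of an observed and a theoretical spectrum by m/z and sums the products of intensities in matching bins. It does not reproduce SW-Tandem's parallel or vectorized implementation; the function names and the `bin_width` parameter are assumptions chosen for illustration.

```python
def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into m/z bins of the given width.

    `peaks` is an iterable of (mz, intensity) pairs.
    """
    binned = {}
    for mz, intensity in peaks:
        index = int(mz / bin_width)
        binned[index] = binned.get(index, 0.0) + intensity
    return binned


def dot_product_score(observed, theoretical, bin_width=1.0):
    """Dot product of two binned spectra over their shared bins."""
    obs = bin_spectrum(observed, bin_width)
    theo = bin_spectrum(theoretical, bin_width)
    return sum(
        intensity * theo[index]
        for index, intensity in obs.items()
        if index in theo
    )


observed_peaks = [(100.2, 5.0), (200.7, 3.0), (300.1, 1.0)]
theoretical_peaks = [(100.4, 1.0), (200.9, 1.0), (400.0, 1.0)]
score = dot_product_score(observed_peaks, theoretical_peaks)
```

Each candidate peptide yields one theoretical spectrum, and scoring a spectrum against many candidates repeats this inner loop independently, which is the property that makes the computation amenable to parallelization.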


2020 ◽  
Vol 36 (9) ◽  
pp. 2862-2871
Author(s):  
Chiung-Ting Wu ◽  
Yizhi Wang ◽  
Yinxue Wang ◽  
Timothy Ebbels ◽  
Ibrahim Karaman ◽  
...  

Abstract Motivation Liquid chromatography–mass spectrometry (LC-MS) is a standard method for proteomics and metabolomics analysis of biological samples. Unfortunately, it suffers from various changes in the retention times (RT) of the same compound in different samples, and these must be subsequently corrected (aligned) during data processing. Classic alignment methods such as in the popular XCMS package often assume a single time-warping function for each sample. Thus, the potentially varying RT drift for compounds with different masses in a sample is neglected in these methods. Moreover, the systematic change in RT drift across run order is often not considered by alignment algorithms. Therefore, these methods cannot effectively correct all misalignments. For a large-scale experiment involving many samples, the existence of misalignment becomes inevitable and concerning. Results Here, we describe an integrated reference-free profile alignment method, neighbor-wise compound-specific Graphical Time Warping (ncGTW), that can detect misaligned features and align profiles by leveraging expected RT drift structures and compound-specific warping functions. Specifically, ncGTW uses individualized warping functions for different compounds and assigns constraint edges on warping functions of neighboring samples. Validated with both realistic synthetic data and internal quality control samples, ncGTW applied to two large-scale metabolomics LC-MS datasets identifies many misaligned features and successfully realigns them. These features would otherwise be discarded or uncorrected using existing methods. The ncGTW software tool is developed currently as a plug-in to detect and realign misaligned features present in standard XCMS output. Availability and implementation An R package of ncGTW is freely available at Bioconductor and https://github.com/ChiungTingWu/ncGTW. A detailed user’s manual and a vignette are provided within the package. 
Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 36 (19) ◽  
pp. 4960-4962
Author(s):  
Lauren Marazzi ◽  
Andrew Gainer-Dewar ◽  
Paola Vera-Licona

Abstract Summary OCSANA+ is a Cytoscape app for identifying nodes to drive the system toward a desired long-term behavior, prioritizing combinations of interventions in large-scale complex networks, and estimating the effects of node perturbations in signaling networks, all based on the analysis of the network’s structure. OCSANA+ includes an update to optimal combinations of interventions from network analysis software tool with cutting-edge and rigorously tested algorithms, together with recently developed structure-based control algorithms for non-linear systems and an algorithm for estimating signal flow. All these algorithms are based on the network’s topology. OCSANA+ is implemented as a Cytoscape app to enable a user interface for running analyses and visualizing results. Availability and implementation OCSANA+ app and its tutorial can be downloaded from the Cytoscape App Store or https://veraliconaresearchgroup.github.io/OCSANA-Plus/. The source code and computations are available in https://github.com/VeraLiconaResearchGroup/OCSANA-Plus_SourceCode. Supplementary information Supplementary data are available at Bioinformatics online.


Nature ◽  
2021 ◽  
Author(s):  
Marsha C. Wibowo ◽  
Zhen Yang ◽  
Maxime Borry ◽  
Alexander Hübner ◽  
Kun D. Huang ◽  
...  

Abstract Loss of gut microbial diversity [1–6] in industrial populations is associated with chronic diseases [7], underscoring the importance of studying our ancestral gut microbiome. However, relatively little is known about the composition of pre-industrial gut microbiomes. Here we performed a large-scale de novo assembly of microbial genomes from palaeofaeces. From eight authenticated human palaeofaeces samples (1,000–2,000 years old) with well-preserved DNA from southwestern USA and Mexico, we reconstructed 498 medium- and high-quality microbial genomes. Among the 181 genomes with the strongest evidence of being ancient and of human gut origin, 39% represent previously undescribed species-level genome bins. Tip dating suggests an approximate diversification timeline for the key human symbiont Methanobrevibacter smithii. In comparison to 789 present-day human gut microbiome samples from eight countries, the palaeofaeces samples are more similar to non-industrialized than industrialized human gut microbiomes. Functional profiling of the palaeofaeces samples reveals a markedly lower abundance of antibiotic-resistance and mucin-degrading genes, as well as enrichment of mobile genetic elements relative to industrial gut microbiomes. This study facilitates the discovery and characterization of previously undescribed gut microorganisms from ancient microbiomes and the investigation of the evolutionary history of the human gut microbiota through genome reconstruction from palaeofaeces.


2021 ◽  
Author(s):  
Menglei Shuai ◽  
Guoqing Zhang ◽  
Fang-fang Zeng ◽  
Yuanqing Fu ◽  
Xinxiu Liang ◽  
...  

Objective: To investigate the association between human gut antibiotic resistome and the progression of type 2 diabetes (T2D) in a large cohort study. Design: The present study included 1210 participants from the Guangzhou Nutrition and Health Study. We depicted the landscape of human gut antibiotic resistome with shotgun metagenomic data and examined its association with T2D and cardiometabolic traits. The co-occurrence network analysis was used to explore the associations between T2D-related ARGs and gut microbial species. We also examined the associations between gut antibiotic resistome features and fecal metabolome. Results: There was a significant overall shift in gut antibiotic resistome structure among healthy, prediabetes and T2D groups (p = 0.004). We found that larger ARG diversity was associated with a higher risk of T2D (all p < 0.05). Based on the association found between resistomes and T2D, we developed diabetes ARG score and demonstrated its creative use as a new predictor for T2D progression manifested by the change of insulin resistance. Further network analysis showed the co-occurrence association between T2D-related ARGs and gut microbial species, which indicated the potential bacterial hosts of these ARG biomarkers. Resistome-metabolome co-analysis suggests a potential link of ARGs with fecal metabolites, which may reflect the host-microbial metabolic adaptation. Conclusion: Our data depict the landscape of gut antibiotic resistome diversity and composition and uncover a close relationship between human gut antibiotic resistome and T2D progression.


2018 ◽  
Vol 35 (7) ◽  
pp. 1249-1251 ◽  
Author(s):  
Kai Li ◽  
Marc Vaudel ◽  
Bing Zhang ◽  
Yan Ren ◽  
Bo Wen

Abstract Summary Data visualization plays critical roles in proteomics studies, ranging from quality control of MS/MS data to validation of peptide identification results. Herein, we present PDV, an integrative proteomics data viewer that can be used to visualize a wide range of proteomics data, including database search results, de novo sequencing results, proteogenomics files, MS/MS data in mzML/mzXML format and data from public proteomics repositories. PDV is a lightweight visualization tool that enables intuitive and fast exploration of diverse, large-scale proteomics datasets on standard desktop computers in both graphical user interface and command line modes. Availability and implementation PDV software and the user manual are freely available at http://pdv.zhang-lab.org. The source code is available at https://github.com/wenbostar/PDV and is released under the GPL-3 license. Supplementary information Supplementary data are available at Bioinformatics online.


2021 ◽  
Vol 12 ◽  
Author(s):  
Huaihai Chen ◽  
Kayan Ma ◽  
Yu Huang ◽  
Zhiyuan Yao ◽  
Chengjin Chu

Anthropogenic disturbances and global climate change are causing large-scale biodiversity loss and threatening ecosystem functions. However, due to the lack of knowledge on microbial species loss, our understanding of how functional profiles of soil microbes respond to diversity decline is still limited. Here, we evaluated the biotic homogenization of global soil metagenomic data to examine whether microbial functional structure is resilient to significant diversity reduction. Our results showed that although biodiversity loss caused a decrease in taxonomic species by 72%, the changes in the relative abundance of diverse functional categories were limited. The stability of functional structures associated with microbial species richness decline in terrestrial systems suggests a decoupling of taxonomy and function. The changes in functional profile with biodiversity loss were function-specific, with broad-scale metabolism functions decreasing and typical nutrient-cycling functions increasing. Our results imply high levels of microbial physiological versatility in the face of significant biodiversity decline, which, however, does not necessarily mean that a loss in total functional abundance, such as microbial activity, can be overlooked against the background of unprecedented species extinction.

