Taxonomic classification method for metagenomics based on core protein families with Core-Kaiju

2020 ◽  
Vol 48 (16) ◽  
pp. e93-e93
Author(s):  
Anna Tovo ◽  
Peter Menzel ◽  
Anders Krogh ◽  
Marco Cosentino Lagomarsino ◽  
Samir Suweis

Abstract Characterizing species diversity and composition of bacteria hosted by biota is revolutionizing our understanding of the role of symbiotic interactions in ecosystems. Determining microbiome diversity implies the assignment of individual reads to taxa by comparison to reference databases. Although computational methods aimed at identifying microbial taxa are available, it is well known that inferences using different methods can vary widely depending on various biases. In this study, we first apply and compare different bioinformatics methods based on 16S ribosomal RNA gene and shotgun sequencing to three mock communities of bacteria, of which the compositions are known. We show that none of these methods can infer both the true number of taxa and their abundances. We thus propose a novel approach, named Core-Kaiju, which combines the power of shotgun metagenomics data with a more focused marker gene classification method similar to 16S, but based on emergent statistics of core protein domain families. We then test the proposed method on various mock communities and show that Core-Kaiju reliably predicts both number of taxa and abundances. Finally, we apply our method to human gut samples, showing how Core-Kaiju may give more accurate ecological characterization and a fresh view on real microbiomes.

2020 ◽  
Author(s):  
Anna Tovo ◽  
Peter Menzel ◽  
Anders Krogh ◽  
Marco Cosentino Lagomarsino ◽  
Samir Suweis

ABSTRACT Characterizing species diversity and composition of bacteria hosted by biota is revolutionizing our understanding of the role of symbiotic interactions in ecosystems. However, determining microbiome diversity implies the classification of taxa composition within the sampled community, which is often done via the assignment of individual reads to taxa by comparison to reference databases. Although computational methods aimed at identifying microbial taxa are available, it is well known that inferences using different methods can vary widely depending on various biases. In this study, we first apply and compare different bioinformatics methods based on 16S ribosomal RNA gene and whole genome shotgun sequencing for taxonomic classification to three small mock communities of bacteria, of which the compositions are known. We show that none of these methods can infer both the true number of taxa and their abundances. We thus propose a novel approach, named Core-Kaiju, which combines the power of shotgun metagenomics data with a more focused marker gene classification method similar to 16S, but based on emergent statistics of core protein domain families. We then test the proposed method on the three small mock communities and also on medium- and highly complex mock community datasets taken from the Critical Assessment of Metagenome Interpretation challenge. We show that Core-Kaiju reliably predicts both number of taxa and abundance of the analysed mock bacterial communities. Finally, we apply our method to human gut samples, showing how Core-Kaiju may give more accurate ecological characterization and a fresh view on real microbiomes.


2019 ◽  
Author(s):  
Michael R. McLaren ◽  
Amy D. Willis ◽  
Benjamin J. Callahan

Abstract Measurements of biological communities by marker-gene and metagenomic sequencing are biased: The measured relative abundances of taxa or their genes are systematically distorted from their true values because each step in the experimental workflow preferentially detects some taxa over others. Bias can lead to qualitatively incorrect conclusions and makes measurements from different protocols quantitatively incomparable. A rigorous understanding of bias is therefore essential. Here we propose, test, and apply a simple mathematical model of how bias distorts marker-gene and metagenomics measurements: Bias multiplies the true relative abundances within each sample by taxon- and protocol-specific factors that describe the different efficiencies with which taxa are detected by the workflow. Critically, these factors are consistent across samples with different compositions, allowing bias to be estimated and corrected. We validate this model in 16S rRNA gene and shotgun metagenomics data from bacterial communities with defined compositions. We use it to reason about the effects of bias on downstream statistical analyses, finding that analyses based on taxon ratios are less sensitive to bias than analyses based on taxon proportions. Finally, we demonstrate how this model can be used to quantify bias from samples of defined composition, partition bias into steps such as DNA extraction and PCR amplification, and correct biased measurements. Our model improves on previous models by providing a better fit to experimental data and by providing a composition-independent approach to analyzing, measuring, and correcting bias.
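
The multiplicative model described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `measure` and `correct`, the three-taxon communities, and the efficiency values are all hypothetical, chosen only to show that a fixed set of per-taxon factors distorts every taxon ratio by the same fold change in each sample, while the error in a taxon's proportion depends on the sample's composition.

```python
import numpy as np

def measure(true_abundance, efficiency):
    """Multiply true relative abundances by per-taxon efficiencies, then renormalize."""
    distorted = np.asarray(true_abundance, dtype=float) * np.asarray(efficiency, dtype=float)
    return distorted / distorted.sum()

def correct(observed, efficiency):
    """Undo the bias by dividing by the (known or estimated) efficiencies, then renormalize."""
    restored = np.asarray(observed, dtype=float) / np.asarray(efficiency, dtype=float)
    return restored / restored.sum()

# Hypothetical per-taxon detection efficiencies for one protocol (assumed values).
efficiency = np.array([1.0, 4.0, 0.5])

# Two hypothetical samples with different true compositions (each sums to 1).
sample_a = np.array([0.2, 0.3, 0.5])
sample_b = np.array([0.6, 0.1, 0.3])

observed_a = measure(sample_a, efficiency)
observed_b = measure(sample_b, efficiency)

# Fold error in the ratio taxon1/taxon0: identical in both samples,
# equal to efficiency[1] / efficiency[0].
ratio_error_a = (observed_a[1] / observed_a[0]) / (sample_a[1] / sample_a[0])
ratio_error_b = (observed_b[1] / observed_b[0]) / (sample_b[1] / sample_b[0])

# Fold error in the proportion of taxon0: differs between the samples.
proportion_error_a = observed_a[0] / sample_a[0]
proportion_error_b = observed_b[0] / sample_b[0]

print(ratio_error_a, ratio_error_b)
print(proportion_error_a, proportion_error_b)
print(correct(observed_a, efficiency))
```

Because the factors are assumed constant across samples, dividing the observed proportions by them and renormalizing recovers the true composition. In practice the efficiencies would have to be estimated from samples of defined composition, as the abstract notes.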


2020 ◽  
Author(s):  
Po-E Li ◽  
Joseph A. Russell ◽  
David Yarmosh ◽  
Alan G. Shteyman ◽  
Kyle Parker ◽  
...  

ABSTRACT Metagenomics is emerging as an important tool in biosurveillance, public health, and clinical applications. However, ease-of-use for execution and data analysis remains a barrier-of-entry to the adoption of metagenomics in applied health and forensics settings. In addition, these venues often have more stringent requirements for reporting, accuracy, and precision than the traditional ecological research role of the technology. Here, we present PanGIA (Pan-Genomics for Infectious Agents), a novel bioinformatics analysis platform for hosting, processing, analyzing, and reporting shotgun metagenomics data of complex samples suspected of containing one or more pathogens. PanGIA was developed to address gaps that often preclude clinicians, medical technicians, forensics personnel, or other non-expert end-users from the routine application of metagenomics for pathogen identification. Though primarily designed to detect pathogenic microorganisms within clinical and environmental metagenomics data, PanGIA also serves as an analytical framework for microbial community profiling and comparative metagenomics. To provide statistical confidence in PanGIA’s taxonomic assignments, the system provides two independent estimations of probability for species and strain level detection. First, PanGIA integrates coverage data with ‘uniqueness’ information mapped across each reference genome for a stand-alone determination of confidence for each query sequence at each taxonomy level. Second, if a negative-control sample is provided, PanGIA compares this sample with a corresponding experimental unknown sample and determines a measure of confidence associated with ‘detection above background’.
An integrated graphical user interface allows interactive interrogation and enables users to summarize multiple sample results by confidence score, normalized read abundance, reference genome linear coverage, depth-of-coverage, RPKM, and other metrics to detect specific organisms-of-interest. Comparison testing of the PanGIA algorithm against a number of recent k-mer, read-mapping, and marker-gene based taxonomy classifiers across various real-world datasets with spiked targets shows superior mean positive predictive value, sensitivity, and specificity. PanGIA can process a five million paired-end read dataset in under 1 hour on commodity computational hardware. The source code and documentation are publicly available at https://github.com/LANL-Bioinformatics/PanGIA or https://github.com/mriglobal/PanGIA. The database for PanGIA can be downloaded from ftp://bioinformatics.mriglobal.org/. The full GUI-based PanGIA analysis environment is available in a Docker container and can be installed from https://hub.docker.com/r/poeli/pangia/.
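
Of the metrics named in this abstract, RPKM (reads per kilobase per million mapped reads) has a widely used conventional definition, sketched below in Python. The abstract does not say how PanGIA computes it, so the formula, the function name `rpkm`, and the example numbers are assumptions for illustration only.

```python
def rpkm(mapped_reads, reference_length_bp, total_reads):
    """Conventional RPKM: reads per kilobase of reference per million reads.

    Assumed definition, not taken from PanGIA:
        mapped_reads / (reference_length_bp / 1e3) / (total_reads / 1e6)
    """
    if reference_length_bp <= 0 or total_reads <= 0:
        raise ValueError("reference length and total reads must be positive")
    return mapped_reads * 1e9 / (reference_length_bp * total_reads)

# Hypothetical example: 500 reads hit a 2 Mbp genome in a 5-million-read dataset.
value = rpkm(mapped_reads=500, reference_length_bp=2_000_000, total_reads=5_000_000)
print(value)
```

Normalizing by both reference length and sequencing depth is what makes such a value comparable across genomes of different sizes and across samples sequenced to different depths.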


eLife ◽  
2019 ◽  
Vol 8 ◽  
Author(s):  
Michael R McLaren ◽  
Amy D Willis ◽  
Benjamin J Callahan

Marker-gene and metagenomic sequencing have profoundly expanded our ability to measure biological communities. But the measurements they provide differ from the truth, often dramatically, because these experiments are biased toward detecting some taxa over others. This experimental bias makes the taxon or gene abundances measured by different protocols quantitatively incomparable and can lead to spurious biological conclusions. We propose a mathematical model for how bias distorts community measurements based on the properties of real experiments. We validate this model with 16S rRNA gene and shotgun metagenomics data from defined bacterial communities. Our model better fits the experimental data despite being simpler than previous models. We illustrate how our model can be used to evaluate protocols, to understand the effect of bias on downstream statistical analyses, and to measure and correct bias given suitable calibration controls. These results illuminate new avenues toward truly quantitative and reproducible metagenomics measurements.


2018 ◽  
Author(s):  
Simone Marini ◽  
Francesca Vitali ◽  
Sara Rampazzi ◽  
Andrea Demartini ◽  
Tatsuya Akutsu

Abstract Motivation Protein cleavage is an important cellular event, involved in a myriad of processes, from apoptosis to immune response. Bioinformatics provides in silico tools, such as machine learning-based models, to guide target discovery. State-of-the-art models have a scope limited to specific protease families (such as Caspases), and do not explicitly include biological or medical knowledge (such as the hierarchical protein domain similarity, or gene-gene interactions). To fill this gap, we present a novel approach for protease target prediction based on data integration. Results By representing protease-protein target information in the form of relational matrices, we design a model that: (a) is general, i.e., not limited to a single protease family; and (b) leverages the available knowledge, managing extremely sparse data from heterogeneous data sources, including primary sequence, pathways, domains, and interactions from nine databases. When compared to other algorithms on test data, our approach provides better performance even for models specifically focusing on a single protease family. Availability https://gitlab.com/smarini/MaDDA/ (Matlab code and utilized data.)


2020 ◽  
Vol 36 (13) ◽  
pp. 4088-4090 ◽  
Author(s):  
Benjamin Hillmann ◽  
Gabriel A Al-Ghalith ◽  
Robin R Shields-Cutler ◽  
Qiyun Zhu ◽  
Rob Knight ◽  
...  

Abstract Summary The software pipeline SHOGUN profiles known taxonomic and gene abundances of short-read shotgun metagenomics sequencing data. The pipeline is scalable, modular and flexible. Data analysis and transformation steps can be run individually or together in an automated workflow. Users can easily create new reference databases and can select one of three DNA alignment tools, ranging from ultra-fast low-RAM k-mer-based database search to fully exhaustive gapped DNA alignment, to best fit their analysis needs and computational resources. The pipeline includes an implementation of a published method for taxonomy assignment disambiguation with empirical Bayesian redistribution. The software is installable via the conda resource management framework, has plugins for the QIIME2 and QIITA packages and produces both taxonomy and gene abundance profile tables with a single command, thus promoting convenient and reproducible metagenomics research. Availability and implementation https://github.com/knights-lab/SHOGUN.


2020 ◽  
Vol 9 (3) ◽  
Author(s):  
Anke Stüken ◽  
Thomas H. A. Haverkamp

We announce five shotgun metagenomics data sets from two Norwegian premise plumbing systems. The samples were shotgun sequenced on two lanes of an Illumina HiSeq 3000 instrument (THRUplex chemistry, 151 bp, paired-end reads), providing an extensive resource for sequence analyses of tap water and biofilm microbial communities.


1991 ◽  
Vol 11 (2) ◽  
pp. 611-619 ◽  
Author(s):  
J T Olesen ◽  
J D Fikes ◽  
L Guarente

The fission yeast Schizosaccharomyces pombe is immensely diverged from budding yeast (Saccharomyces cerevisiae) on an evolutionary time scale. We have used a fission yeast library to clone a homolog of S. cerevisiae HAP2, which along with HAP3 and HAP4 forms a transcriptional activation complex that binds to the CCAAT box. The S. pombe homolog php2 (S. pombe HAP2) was obtained by functional complementation in an S. cerevisiae hap2 mutant and retains the ability to associate with HAP3 and HAP4. We have previously demonstrated that the HAP2 subunit of the CCAAT-binding transcriptional activation complex from S. cerevisiae contains a 65-amino-acid "essential core" structure that is divisible into subunit association and DNA recognition domains. Here we show that Php2 contains a 60-amino-acid block that is 82% identical to this core. The remainder of the 334-amino-acid protein is completely without homology to HAP2. The function of php2 in S. pombe was investigated by disrupting the gene. Strikingly, like HAP2 in S. cerevisiae, the S. pombe gene is specifically involved in mitochondrial function. This contrasts to the situation in mammals, in which the homologous CCAAT-binding complex is a global transcriptional activator.


2016 ◽  
Vol 82 (24) ◽  
pp. 7217-7226 ◽  
Author(s):  
D. Lee Taylor ◽  
William A. Walters ◽  
Niall J. Lennon ◽  
James Bochicchio ◽  
Andrew Krohn ◽  
...  

ABSTRACT While high-throughput sequencing methods are revolutionizing fungal ecology, recovering accurate estimates of species richness and abundance has proven elusive. We sought to design internal transcribed spacer (ITS) primers and an Illumina protocol that would maximize coverage of the kingdom Fungi while minimizing nontarget eukaryotes. We inspected alignments of the 5.8S and large subunit (LSU) ribosomal genes and evaluated potential primers using PrimerProspector. We tested the resulting primers using tiered-abundance mock communities and five previously characterized soil samples. We recovered operational taxonomic units (OTUs) belonging to all 8 members in both mock communities, despite DNA abundances spanning 3 orders of magnitude. The expected and observed read counts were strongly correlated (r = 0.94 to 0.97). However, several taxa were consistently over- or underrepresented, likely due to variation in rRNA gene copy numbers. The Illumina data resulted in clustering of soil samples identical to that obtained with Sanger sequence clone library data using different primers. Furthermore, the two methods produced distance matrices with a Mantel correlation of 0.92. Nonfungal sequences comprised less than 0.5% of the soil data set, with most attributable to vascular plants. Our results suggest that high-throughput methods can produce fairly accurate estimates of fungal abundances in complex communities. Further improvements might be achieved through corrections for rRNA copy number and utilization of standardized mock communities. IMPORTANCE Fungi play numerous important roles in the environment. Improvements in sequencing methods are providing revolutionary insights into fungal biodiversity, yet accurate estimates of the number of fungal species (i.e., richness) and their relative abundances in an environmental sample (e.g., soil, roots, water, etc.) remain difficult to obtain.
We present improved methods for high-throughput Illumina sequencing of the species-diagnostic fungal ribosomal marker gene that improve the accuracy of richness and abundance estimates. The improvements include new PCR primers and library preparation, validation using a known mock community, and bioinformatic parameter tuning.


2001 ◽  
Vol 82 (9) ◽  
pp. 2235-2241 ◽  
Author(s):  
Hyun Jin Kwun ◽  
Eun Young Jung ◽  
Ji Young Ahn ◽  
Mi Nam Lee ◽  
Kyung Lib Jang

Hepatitis C virus (HCV) NS3 protein is known to affect normal cellular functions, such as cell proliferation and cell death, and to be involved, either directly or indirectly, in HCV hepatocarcinogenesis. In this study, we demonstrated that NS3 protein could specifically repress the promoter activity of p21 in a dose-dependent manner. The effect was not cell type-specific and was synergistic when combined with HCV core protein. Repression of the p21 promoter by NS3 was almost completely lost when p53 binding sites present on the p21 promoter were removed. Furthermore, p53 binding sites were sufficient to confer a strong NS3 responsiveness to a heterologous promoter, suggesting that NS3 represses the transcription of p21 by modulating the activity of p53. Although the NS3 protein domain required for the majority of p21 repression was located on the protease domain, the proteinase activity itself does not seem to be necessary for repression. Both transcription and protein stability of p53 were unaffected by NS3, suggesting that NS3 might repress transcription of p21 by inhibiting the regulatory activity of p53 via protein–protein interaction(s). Finally, the growth rate of NS3-expressing cell lines was at least twice as fast as that of the parent NIH 3T3 cells, indicating that the repression of p21 is actually reflected by the stimulation of cell growth.

