Quantifying and cataloguing unknown sequences within human microbiomes

Abstract Advances in genome sequencing technologies and lower costs have enabled the exploration of a multitude of known and novel environments and microbiomes. This has led to an exponential growth in the raw sequence data that is deposited in online repositories. Metagenomic and metatranscriptomic data sets are typically analysed with regard to a specific biological question. However, it is widely acknowledged that these data sets contain a proportion of sequences that bear no similarity to any currently known biological sequence, and this so-called ‘dark matter’ is often excluded from downstream analyses. In this study, a systematic framework was developed to assemble, identify, and measure the proportion of unknown sequences present in distinct human microbiomes. This framework was applied to forty distinct studies, comprising 963 samples, and covering ten different human microbiomes including fecal, oral, lung, skin and circulatory system microbiomes. The framework was used to determine the proportion of taxonomically unknown sequences present within samples, and to compare such sequences both within and across assembled metagenomes. We found that whilst the human microbiome is one of the most extensively studied, on average 2% of assembled sequences have not yet been taxonomically defined. However, this proportion varied extensively among different microbiomes and was as high as 25% for skin and oral microbiomes, which have more interactions with the environment. The publicly available data sets used have not previously been systematically mined to quantify and compare such dark matter. Typically, these unknown sequences are found in several microbiomes and potentially belong to unidentified novel microbes that we interact with on a daily basis. A cross-study comparison led to the identification of similar unknown sequences in different samples and/or microbiomes. From the taxonomically unknown sequences discovered in this study, a rate of taxonomic characterisation of 1.64% of unknown sequences per month was calculated. Additionally, the approach led to the discovery of several potentially novel viral genomes that bear no similarity to sequences in the public databases. Both our computational framework and the novel unknown sequences produced are publicly available for future cross-referencing.
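As a rough illustration of the measurement the abstract describes, the proportion of taxonomically unknown sequences per microbiome can be computed from a table of assembled contigs and their taxonomic assignments. The sketch below is not the authors' framework: the function name `unknown_proportion`, the `(microbiome, taxon)` record layout, and the toy records are all assumptions made for illustration, with `None` standing in for "no taxonomic assignment".

```python
from collections import defaultdict

def unknown_proportion(contigs):
    """Return {microbiome: fraction of contigs with no taxonomic assignment}.

    `contigs` is an iterable of (microbiome, taxon) pairs, where taxon is
    None when no similarity to a known sequence was found (hypothetical layout).
    """
    totals = defaultdict(int)
    unknown = defaultdict(int)
    for microbiome, taxon in contigs:
        totals[microbiome] += 1
        if taxon is None:
            unknown[microbiome] += 1
    return {m: unknown[m] / totals[m] for m in totals}

# Toy records, invented purely to exercise the function.
toy_contigs = [
    ("skin", "Staphylococcus"), ("skin", None),
    ("skin", "Cutibacterium"), ("skin", "Malassezia"),
    ("fecal", "Bacteroides"), ("fecal", "Escherichia"),
]
proportions = unknown_proportion(toy_contigs)
```

Keeping the counts per microbiome, rather than pooling all contigs, matters because the abstract reports that the unknown fraction differs widely between body sites.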

A Need for Improved Cellulase Identification from Metagenomic Sequence Data

Applied and Environmental Microbiology ◽

10.1128/aem.01928-20 ◽

2020 ◽

Vol 87 (1) ◽

Author(s):

Rebecca Co ◽

Laura A. Hug

Keyword(s):

Sequence Data ◽

Industrial Applications ◽

Data Sets ◽

Metagenomic Sequence ◽

Environmental Sequence ◽

Sequencing Technologies ◽

Current Classification ◽

Applied Microbiology ◽

Environmental Surveys ◽

Metagenomic Sequence Data

ABSTRACT Improved sequencing technologies and the maturation of metagenomic approaches allow the identification of gene variants with potential industrial applications, including cellulases. Cellulase identification from metagenomic environmental surveys is complicated by inconsistent nomenclature and multiple categorization systems. Here, we summarize the current classification and nomenclature systems, with recommendations for improvements to these systems. Addressing the issues described will strengthen the annotation of cellulose-active enzymes from environmental sequence data sets—a rapidly growing resource in environmental and applied microbiology.

MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization

Briefings in Bioinformatics ◽

10.1093/bib/bbx108 ◽

2017 ◽

Vol 20 (4) ◽

pp. 1160-1166 ◽

Cited By ~ 989

Author(s):

Kazutaka Katoh ◽

John Rozewicki ◽

Kazunori D Yamada

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Sequence Data ◽

Large Data ◽

Relevant Information ◽

Data Sets ◽

Online Service ◽

Multiple Sequence ◽

Biologically Relevant ◽

Sequencing Technologies

Abstract This article describes several features of the MAFFT online service for multiple sequence alignment (MSA). As a result of recent advances in sequencing technologies, huge numbers of biological sequences are available and the need for MSAs with large numbers of sequences is increasing. To extract biologically relevant information from such data, sophistication of algorithms is necessary but not sufficient. Intuitive and interactive tools that let experimental biologists handle large data semiautomatically are becoming important. We are developing MAFFT in these two directions. Here, we explain (i) the Web interface for recently developed options for large data and (ii) interactive usage to refine sequence data sets and MSAs.

A hybrid pipeline for reconstruction and analysis of viral genomes at multi-organ level

GigaScience ◽

10.1093/gigascience/giaa086 ◽

2020 ◽

Vol 9 (8) ◽

Cited By ~ 4

Author(s):

Diogo Pratas ◽

Mari Toppinen ◽

Lari Pyöriä ◽

Klaus Hedman ◽

Antti Sajantila ◽

...

Keyword(s):

De Novo ◽

Sequence Data ◽

Ex Vivo ◽

Genomic Variation ◽

Nucleotide Polymorphisms ◽

Sensitive Data ◽

Viral Genomes ◽

Sequencing Technologies ◽

Multiple Organs ◽

Research Perspectives

Abstract

Background: Advances in sequencing technologies have enabled the characterization of multiple microbial and host genomes, opening new frontiers of knowledge while kindling novel applications and research perspectives. Among these is the investigation of the viral communities residing in the human body and their impact on health and disease. To this end, the study of samples from multiple tissues is critical, yet the complexity of such analysis calls for a dedicated pipeline. We provide an automatic and efficient pipeline for identification, assembly, and analysis of viral genomes that combines the DNA sequence data from multiple organs. TRACESPipe relies on cooperation among 3 modalities: compression-based prediction, sequence alignment, and de novo assembly. The pipeline is ultra-fast and additionally provides secure transmission and storage of sensitive data.

Findings: TRACESPipe performed outstandingly when tested on synthetic and ex vivo datasets, identifying and reconstructing all the viral genomes, including those with high levels of single-nucleotide polymorphisms. It also detected minimal levels of genomic variation between different organs.

Conclusions: TRACESPipe’s unique ability to simultaneously process and analyze samples from different sources enables the evaluation of within-host variability. This opens up the possibility to investigate viral tissue tropism, evolution, fitness, and disease associations. Moreover, additional features such as DNA damage estimation and mitochondrial DNA reconstruction and analysis, as well as exogenous-source controls, expand the utility of this pipeline to other fields such as forensics and ancient DNA studies. TRACESPipe is released under GPLv3 and is available for free download at https://github.com/viromelab/tracespipe.

High Performance Pattern Matching on Heterogeneous Platform

Journal of Integrative Bioinformatics ◽

10.1515/jib-2014-253 ◽

2014 ◽

Vol 11 (3) ◽

pp. 88-98 ◽

Cited By ~ 1

Author(s):

Shima Soroushnia ◽

Masoud Daneshtalab ◽

Juha Plosila ◽

Tapio Pahikkala ◽

Pasi Liljeberg

Keyword(s):

Pattern Matching ◽

High Performance ◽

Sequence Data ◽

Good Choice ◽

Data Sets ◽

Biological Sequence ◽

Computational Molecular Biology ◽

Heterogeneous Architectures ◽

Protein Sequence Data ◽

Gpu Architecture

Summary Pattern discovery is one of the fundamental tasks in bioinformatics, and pattern recognition is a powerful technique for searching sequence patterns in biological sequence databases. Fast, high-performance algorithms are in high demand in many applications in bioinformatics and computational molecular biology, since the significant increase in the number of DNA and protein sequences expands the need to raise the performance of pattern matching algorithms. For this purpose, heterogeneous architectures can be a good choice due to their potential for high performance and energy efficiency. In this paper we present an efficient implementation, on a heterogeneous CPU/GPU architecture, of two algorithms: Aho-Corasick (AC), a well-known exact pattern matching algorithm with linear complexity, and Parallel Failureless Aho-Corasick (PFAC), a massively parallelized version of AC without failure transitions. We progressively redesigned the algorithms and data structures to fit the GPU architecture. Our results on different protein sequence data sets show that the new implementation runs 15 times faster than the original implementation of the PFAC algorithm.
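To make the two algorithms named in the summary concrete, here is a minimal CPU-only sketch in Python. It is not the paper's CPU/GPU implementation: the function names and the toy protein string are assumptions for illustration, duplicate patterns are assumed absent, and the failureless variant is emulated sequentially, with each loop iteration standing in for what would be one GPU thread per start position.

```python
from collections import deque

def build_trie(patterns):
    """Build the keyword trie: goto[state] maps a character to the next state."""
    goto, terminal = [{}], [None]
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                terminal.append(None)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        terminal[state] = pat  # pattern that ends exactly at this state
    return goto, terminal

def search_ac(text, patterns):
    """Classic Aho-Corasick: one pass over the text using failure links."""
    goto, terminal = build_trie(patterns)
    fail = [0] * len(goto)
    out = [[t] if t is not None else [] for t in terminal]
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:  # breadth-first, so shallower states are finished first
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit shorter suffix matches
    hits, state = [], 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits.extend((i - len(pat) + 1, pat) for pat in out[state])
    return hits

def search_failureless(text, patterns):
    """Failureless variant: an independent trie walk from every start position."""
    goto, terminal = build_trie(patterns)
    hits = []
    for start in range(len(text)):  # each start would be one thread on a GPU
        state = 0
        for pos in range(start, len(text)):
            state = goto[state].get(text[pos])
            if state is None:  # mismatch: this walk simply stops
                break
            if terminal[state] is not None:
                hits.append((start, terminal[state]))
    return hits

# Toy protein fragment and patterns, invented for illustration.
toy_text = "MKVLAAGKVL"
toy_patterns = ["KVL", "VLA", "AG", "LAAG"]
ac_hits = sorted(search_ac(toy_text, toy_patterns))
pfac_hits = sorted(search_failureless(toy_text, toy_patterns))
```

The design trade-off is visible in the two search functions: the failure-link version reads each text character once but carries state from one position to the next, whereas the failureless version repeats work across start positions in exchange for walks that are fully independent, which is what makes it amenable to massive parallelism.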

mtDNAcombine: tools to combine sequences from multiple studies

BMC Bioinformatics ◽

10.1186/s12859-021-04048-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Eleanor F. Miller ◽

Andrea Manica

Keyword(s):

Sequence Data ◽

Data Extraction ◽

Bayesian Skyline Plot ◽

Model Organisms ◽

Data Sets ◽

Data Handling ◽

Online Database ◽

Genetic Studies ◽

Wide Range ◽

Existing Data

Abstract

Background: Today an unprecedented amount of genetic sequence data is stored in publicly available repositories. For decades now, mitochondrial DNA (mtDNA) has been the workhorse of genetic studies, and as a result, there is a large volume of mtDNA data available in these repositories for a wide range of species. Indeed, whilst whole genome sequencing is an exciting prospect for the future, for most non-model organisms classical markers such as mtDNA remain widely used. By compiling existing data from multiple original studies, it is possible to build powerful new datasets capable of exploring many questions in ecology, evolution and conservation biology. One key question that these data can help inform is what happened in a species’ demographic past. However, compiling data in this manner is not trivial: there are many complexities associated with data extraction, data quality and data handling.

Results: Here we present the mtDNAcombine package, a collection of tools developed to manage some of the major decisions associated with handling multi-study sequence data, with a particular focus on preparing sequence data for Bayesian skyline plot demographic reconstructions.

Conclusions: There is now more genetic information available than ever before, and large meta-data sets offer great opportunities to explore new and exciting avenues of research. However, compiling multi-study datasets remains a technically challenging prospect. The mtDNAcombine package provides a pipeline to streamline the process of downloading, curating, and analysing sequence data, guiding the process of compiling data sets from the online database GenBank.

Fundamental resource trade-offs for encoded distributed optimization

Information and Inference: A Journal of the IMA ◽

10.1093/imaiai/iaaa026 ◽

2020 ◽

Author(s):

A Salman Avestimehr ◽

Seyed Mohammadreza Mousavi Kalan ◽

Mahdi Soltanolkotabi

Keyword(s):

Computational Time ◽

Massive Data ◽

Data Sets ◽

Massive Data Sets ◽

Computational Framework ◽

Data Set ◽

Trade Offs ◽

Major Bottleneck ◽

Computing Environments ◽

Analyze Data

Abstract Dealing with the sheer size and complexity of today’s massive data sets requires computational platforms that can analyze data in a parallelized and distributed fashion. A major bottleneck that arises in such modern distributed computing environments is that some of the worker nodes may run slow. These nodes, known as stragglers, can significantly slow down computation, as the slowest node may dictate the overall computational time. A recent computational framework, called encoded optimization, creates redundancy in the data to mitigate the effect of stragglers. In this paper, we develop a novel mathematical understanding of this framework, demonstrating its effectiveness in much broader settings than was previously understood. We also analyze the convergence behavior of iterative encoded optimization algorithms, allowing us to characterize fundamental trade-offs between convergence rate, size of data set, accuracy, computational load (or data redundancy) and straggler tolerance in this framework.
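The straggler bottleneck described in the abstract can be illustrated with a small simulation. The sketch below does not implement encoded optimization itself; it only compares the per-iteration waiting time when a master waits for every worker with the time when it proceeds after the fastest `wait_for` workers have reported, which is the behaviour that redundancy in the data is meant to make safe. The worker count, straggler probability, slowdown factor, and timing distribution are all hypothetical values chosen for illustration.

```python
import random

def iteration_time(worker_times, wait_for):
    """Time until the `wait_for` fastest workers have reported back."""
    return sorted(worker_times)[wait_for - 1]

def simulate(n_workers=10, wait_for=8, n_iters=1000,
             straggler_prob=0.1, slowdown=10.0, seed=0):
    """Average iteration time when waiting for all workers vs. the fastest few.

    Every parameter here is an illustrative assumption, not a value from the paper.
    """
    rng = random.Random(seed)
    total_all = total_subset = 0.0
    for _ in range(n_iters):
        times = []
        for _ in range(n_workers):
            base = rng.uniform(0.9, 1.1)  # nominal compute time
            if rng.random() < straggler_prob:
                base *= slowdown  # this worker straggles in this iteration
            times.append(base)
        total_all += iteration_time(times, n_workers)    # slowest node dictates
        total_subset += iteration_time(times, wait_for)  # ignore the slowest few
    return total_all / n_iters, total_subset / n_iters

mean_wait_all, mean_wait_subset = simulate()
```

Proceeding with a subset is only valid if the discarded results can be compensated for, which is where the trade-off between data redundancy, accuracy, and convergence rate mentioned in the abstract comes in.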

Determinants of adenine-mutagenesis in diversity-generating retroelements

Nucleic Acids Research ◽

10.1093/nar/gkaa1240 ◽

2020 ◽

Author(s):

Sumit Handa ◽

Andres Reyna ◽

Timothy Wiryaman ◽

Partho Ghosh

Keyword(s):

Amino Acids ◽

Dark Matter ◽

Reverse Transcription ◽

Genetic Information ◽

Human Microbiome ◽

Protein Sequences ◽

Catalytic Efficiency ◽

Natural World ◽

In Vitro System

Abstract Diversity-generating retroelements (DGRs) vary protein sequences to the greatest extent known in the natural world. These elements are encoded by constituents of the human microbiome and the microbial ‘dark matter’. Variation occurs through adenine-mutagenesis, in which genetic information in RNA is reverse transcribed faithfully to cDNA for all template bases but adenine. We investigated the determinants of adenine-mutagenesis in the prototypical Bordetella bacteriophage DGR through an in vitro system composed of the reverse transcriptase bRT, Avd protein, and a specific RNA. We found that the catalytic efficiency for correct incorporation during reverse transcription by the bRT-Avd complex was strikingly low for all template bases, with the lowest occurring for adenine. Misincorporation across a template adenine was only somewhat lower in efficiency than correct incorporation. We found that the C6, but not the N1 or C2, purine substituent was a key determinant of adenine-mutagenesis. bRT-Avd was insensitive to the C6 amine of adenine but recognized the C6 carbonyl of guanine. We also identified two bRT amino acids predicted to nonspecifically contact incoming dNTPs, R74 and I181, as promoters of adenine-mutagenesis. Our results suggest that the overall low catalytic efficiency of bRT-Avd is intimately tied to its ability to carry out adenine-mutagenesis.

Modeling the Process of Event Sequence Data Generated for Working Condition Diagnosis

Mathematical Problems in Engineering ◽

10.1155/2015/693450 ◽

2015 ◽

Vol 2015 ◽

pp. 1-13

Author(s):

Jianwei Ding ◽

Yingbo Liu ◽

Li Zhang ◽

Jianmin Wang

Keyword(s):

Working Condition ◽

Sequence Data ◽

A Priori ◽

Real Data ◽

Data Sets ◽

Main Task ◽

Event Sequence ◽

Telemetry Data ◽

Condition Monitoring Systems ◽

Condition Diagnosis

Condition monitoring systems are widely used to monitor the working condition of equipment, generating a vast amount and variety of telemetry data in the process. The main task of surveillance is to analyze these routinely collected telemetry data in order to assess the working condition of the equipment. However, with the rapid increase in the volume of telemetry data, it is a nontrivial task to analyze all the telemetry data to understand the working condition of the equipment without any a priori knowledge. In this paper, we propose a probabilistic generative model called the working condition model (WCM), which is capable of simulating the process by which event sequence data are generated and of depicting the working condition of equipment at runtime. With the help of WCM, we are able to analyze how event sequence data behave in different working modes and to detect the working mode of an event sequence (working condition diagnosis). Furthermore, we have applied WCM to illustrative applications such as automated detection of anomalous event sequences during the runtime of equipment. Our experimental results on real data sets demonstrate the effectiveness of the model.

Sequencing and Computational Approaches to Identification and Characterization of Microbial Organisms

Biomedical Engineering and Computational Biology ◽

10.4137/becb.s10886 ◽

2013 ◽

Vol 5 ◽

pp. BECB.S10886 ◽

Cited By ~ 2

Author(s):

Brijesh Singh Yadav ◽

Venkateswarlu Ronda ◽

Dinesh P. Vashista ◽

Bhaskar Sharma

Keyword(s):

Sequence Data ◽

Microbial Interactions ◽

Microbial Pathogens ◽

Nucleotide Sequence Data ◽

Computational Approaches ◽

Microbial Detection ◽

Sequencing Technologies ◽

Sequencing Platforms ◽

Identification And Characterization

Recent advances in sequencing technologies and computational approaches are propelling scientists ever closer to a complete understanding of human-microbial interactions. Powerful sequencing platforms are rapidly producing huge amounts of nucleotide sequence data, which are compiled into huge databases. These sequence data can be retrieved, assembled, and analyzed for identification of microbial pathogens and diagnosis of diseases. In this article, we present a commentary on how metagenomics, combined with microarray and new sequencing techniques, is helping microbial detection and characterization.

Numerous uncharacterized and highly divergent microbes which colonize humans are revealed by circulating cell-free DNA

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1707009114 ◽

2017 ◽

Vol 114 (36) ◽

pp. 9623-9628 ◽

Cited By ~ 69

Author(s):

Mark Kowarsky ◽

Joan Camunas-Soler ◽

Michael Kertesz ◽

Iwijn De Vlaminck ◽

Winston Koh ◽

...

Keyword(s):

Human Body ◽

Sequence Data ◽

Human Microbiome ◽

Pcr Amplification ◽

The Body ◽

Direct Pcr ◽

Cell Free Dna ◽

Free Dna ◽

Novel Taxa ◽

Circulating Cell Free Dna

Blood circulates throughout the human body and contains molecules drawn from virtually every tissue, including the microbes and viruses which colonize the body. Through massive shotgun sequencing of circulating cell-free DNA from the blood, we identified hundreds of new bacteria and viruses which represent previously unidentified members of the human microbiome. Analyzing cumulative sequence data from 1,351 blood samples collected from 188 patients enabled us to assemble 7,190 contiguous regions (contigs) larger than 1 kbp, of which 3,761 are novel with little or no sequence homology in any existing databases. The vast majority of these novel contigs possess coding sequences, and we have validated their existence both by finding their presence in independent experiments and by performing direct PCR amplification. When their nearest neighbors are located in the tree of life, many of the organisms represent entirely novel taxa, showing that microbial diversity within the human body is substantially broader than previously appreciated.
