Quantifying and cataloguing unknown sequences within human microbiomes
AbstractAdvances in genome sequencing technologies and lower costs have enabled the exploration of a multitude of known and novel environments and microbiomes. This has led to an exponential growth in the raw sequence data that is deposited in online repositories. Metagenomic and metatranscriptomic data sets are typically analysed with regards to a specific biological question. However, it is widely acknowledged that these data sets are comprised of a proportion of sequences that bear no similarity to any currently known biological sequence, and this so-called ‘dark matter’ is often excluded from downstream analyses. In this study, a systematic framework was developed to assemble, identify, and measure the proportion of unknown sequences present in distinct human microbiomes. This framework was applied to forty distinct studies, comprising 963 samples, and covering ten different human microbiomes including fecal, oral, lung, skin and circulatory system microbiomes. The framework was used to determine the proportion of taxonomically unknown sequences present within samples, and to compare such sequences both within and across assembled metagenomes. We found that whilst the human microbiome is one of the most extensively studied, on average 2% of assembled sequences have not yet been taxonomically defined. However, this proportion varied extensively among different microbiomes and was as high as 25% for skin and oral microbiomes that have more interactions with the environment. The publicly available data sets used have not previously been systematically mined to quantify and compare such dark matter. Typically, these unknown sequences are found in several microbiomes and potentially belong to unidentified novel microbes that we interact with on a daily basis. A cross-study comparison led to the identification of similar unknown sequences in different samples and/or microbiomes. A rate of taxonomic characterisation of 1.64% of unknown sequences being characterised per month was calculated from these taxonomically unknown sequences discovered in this study. Additionally, the approach led to the discovery of several potentially novel viral genomes that bear no similarity to sequences in the public databases. Both our computational framework and the novel unknown sequences produced are publicly available for future cross-referencing.