Handling of spurious sequences affects the outcome of high-throughput 16S rRNA gene amplicon profiling
Abstract

Background: 16S rRNA gene amplicon sequencing is a very popular approach for studying microbiomes. However, standards for sample and data processing vary, and some basic concepts, such as the occurrence of spurious sequences, have not been investigated comprehensively. The present study addresses this gap.

Methods: Using defined communities of bacteria in vitro and in vivo, we searched for sequences not matching the expected species (i.e., spurious taxa) and determined a threshold of occurrence relevant for adequate data analysis. The origin of spurious taxa was then investigated via large-scale amplicon queries. We also assessed the impact of varying sequence filtering stringency on diversity readouts in human fecal and peat soil communities.

Results: 16S rRNA gene amplicon data processing based on Operational Taxonomic Unit (OTU) clustering and singleton removal, a commonly used approach that discards any taxa represented by only one sequence across all samples, delivered approx. 50% (mock communities) to 80% (gnotobiotic mice) spurious taxa on average. This spurious fraction of taxa was lower with amplicon sequence variant (ASV) analysis but varied depending on the gene region targeted and the barcoding system used. A relative abundance of 0.25% was identified as a threshold: excluding taxa below it prevents the analysis of spurious taxa to a large extent. Most spurious taxa (approx. 70%) detected in simplified communities occurred in samples multiplexed in the same sequencing run and were present in only one of ten runs. Compared with singleton filtering, use of the 0.25% relative abundance threshold decreased the coefficient of variation of richness by 38% for the same six human fecal samples across seven sequencing runs. The output of beta-diversity analyses of human fecal communities was markedly affected by both the filtering strategy and the type of phylogenetic distances used for comparing samples. Importantly, major findings were confirmed using data generated in a second sequencing facility.

Conclusions: Handling of artifact sequences during bioinformatic processing of 16S rRNA gene amplicon data requires careful attention to avoid generating misleading findings. A relative abundance threshold of 0.25% is more appropriate than singleton removal, although study-specific analysis strategies are mandatory. We propose the concept of effective richness, which will help to compare results across studies.
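
The filtering step summarized above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study. It assumes that each sample is a plain mapping from taxon name to read count, that the 0.25% cutoff is applied per sample, and that effective richness means the number of taxa at or above that cutoff; the abstract does not spell out any of these points. The function names and the example counts are hypothetical.

```python
from statistics import mean, stdev

# 0.25% relative abundance, the threshold reported in the abstract.
THRESHOLD = 0.0025


def filter_by_relative_abundance(counts, threshold=THRESHOLD):
    """Keep taxa whose within-sample relative abundance is >= threshold.

    `counts` maps taxon name -> read count for a single sample
    (an assumed data layout, chosen for illustration).
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: n for taxon, n in counts.items() if n / total >= threshold}


def effective_richness(counts, threshold=THRESHOLD):
    """Number of taxa retained by the threshold filter (assumed definition)."""
    return len(filter_by_relative_abundance(counts, threshold))


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


# Hypothetical sample with 10,000 reads in total.
sample = {"OTU_1": 6000, "OTU_2": 3950, "OTU_3": 30, "OTU_4": 20}

# OTU_4 sits at 0.20% and is dropped; OTU_3 sits at 0.30% and is kept.
kept = filter_by_relative_abundance(sample)
richness = effective_richness(sample)
```

In this toy sample every taxon has more than one read, so singleton removal would retain all four, whereas the relative abundance filter retains three. Calling `coefficient_of_variation` on the richness values of one sample sequenced in several runs gives the kind of between-run variability measure referred to in the Results.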