Minimizing spurious features in 16S rRNA gene amplicon sequencing
16S rRNA gene amplicon sequencing is a widely used high-throughput method for taxonomic inference in microbial communities. Many data analysis pipelines have been developed to reflect the real taxonomy more accurately and thereby better guide downstream identification, isolation and mechanistic studies. Although these pipelines adopt rigorous quality filtration steps, tests with well-designed mock and simulated data sets showed that they still produced widely divergent numbers of spurious features, owing to "pseudo sequences" artificially generated during PCR and sequencing. These pseudo sequences were present at low abundance, and a weighted re-sampling test showed them to be unreliable. To minimize their influence on the characterization of taxonomy, we propose a two-step approach: an abundance filtering (AF) step followed by an AF-based OTU picking and remapping (AOR) step. The approach efficiently reduces spurious OTUs, sequences and oligotyping features, and improves Matthews correlation coefficient (MCC) values in OTU clustering. It can be easily integrated with widely used 16S rRNA sequencing data analysis pipelines, making the number of OTUs and the alpha and beta diversities obtained from different pipelines more consistent with the real structure of microbial communities.
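As an illustration of the abundance filtering (AF) idea only, the following minimal Python sketch separates unique sequences into an abundant set and a low-abundance set by relative abundance. The function name `abundance_filter`, the parameter `min_rel_abundance` and its default value are hypothetical and are not taken from this work; the subsequent OTU picking and remapping (AOR) step is not shown.

```python
from collections import Counter


def abundance_filter(reads, min_rel_abundance=0.0001):
    """Split unique sequences by relative abundance.

    Hypothetical sketch: the threshold value is an assumption chosen for
    illustration, not a value reported by the authors.
    Returns (abundant, low_abundance) as dicts mapping sequence -> count.
    """
    counts = Counter(reads)          # dereplicate: unique sequence -> count
    total = sum(counts.values())     # total number of reads
    abundant, low_abundance = {}, {}
    for seq, n in counts.items():
        if n / total >= min_rel_abundance:
            abundant[seq] = n        # candidates for OTU picking
        else:
            low_abundance[seq] = n   # held back rather than clustered directly
    return abundant, low_abundance


# Toy example: one dominant sequence and two singletons.
reads = ["ACGT"] * 98 + ["ACGA", "TCGT"]
abundant, low_abundance = abundance_filter(reads, min_rel_abundance=0.05)
print(abundant)
print(low_abundance)
```

In this toy input the two singleton sequences fall below the assumed 5% threshold and are set aside, while the dominant sequence is retained. Keeping the low-abundance set separate, rather than discarding it, leaves room for a later remapping step.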