Differential richness inference for 16S rRNA marker gene surveys
Individual and environmental health outcomes are frequently linked to changes in the diversity of associated microbial communities, which makes deriving health indicators based on microbiome diversity measures essential. Microbiome data generated by high-throughput 16S rRNA marker gene surveys are appealing for this purpose, but 16S surveys also generate a plethora of spurious microbial taxa. When this artificial inflation in the observed number of taxa (i.e., richness, a diversity measure) is ignored, we find that changes in the abundance of detected taxa confound current methods for inferring differences in richness. Here we argue that evidence from our own experiments, theory-guided exploratory data analyses, and the existing literature supports the conclusion that most sub-genus discoveries are spurious artifacts of clustering 16S sequencing reads. Based on this finding, we model a 16S survey's systematic patterns of sub-genus taxa generation as a function of genus abundance to derive a robust control for false taxa accumulation. Such controls unlock classical regression approaches for highly flexible differential richness inference at various levels of the surveyed microbial assemblage, from sample groups to specific taxa collections. The proposed methodology for differential richness inference is available through an R package, Prokounter.
Package availability: https://github.com/mskb01/prokounter
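
To make the idea of an abundance-based control concrete, the following is a minimal sketch in Python. It is not the Prokounter implementation or its API. All function names, the power-law trend, and the toy numbers are assumptions chosen for illustration, and each toy sample is reduced to a single genus. The sketch fits a log-log trend of sub-genus taxa counts against genus abundance, then uses the fitted trend as an offset when comparing log richness between two sample groups.

```python
import math
from statistics import mean


def fit_taxa_trend(genus_abundance, n_taxa):
    """Least-squares fit of log(n_taxa) = intercept + slope * log(abundance).

    Assumption for this sketch: taxa generation follows a power law in
    genus abundance. The source does not specify the functional form.
    """
    x = [math.log(a) for a in genus_abundance]
    y = [math.log(t) for t in n_taxa]
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def group_difference(values, group):
    """Difference in group means (group 1 minus group 0).

    This equals the OLS coefficient of a binary group indicator.
    """
    g1 = [v for v, g in zip(values, group) if g == 1]
    g0 = [v for v, g in zip(values, group) if g == 0]
    return mean(g1) - mean(g0)


# Hypothetical toy data: taxa counts are driven entirely by genus abundance
# (n_taxa = 2 * sqrt(abundance)), and group 1 samples are simply more abundant.
abundance = [100, 400, 900, 1600, 2500, 3600]
n_taxa = [20, 40, 60, 80, 100, 120]
group = [0, 0, 0, 1, 1, 1]

intercept, slope = fit_taxa_trend(abundance, n_taxa)

# Control: the log taxa count expected from abundance alone.
control = [intercept + slope * math.log(a) for a in abundance]
log_richness = [math.log(t) for t in n_taxa]

# Naive comparison ignores abundance and reports a positive group effect.
naive_effect = group_difference(log_richness, group)

# Adjusted comparison subtracts the control as an offset; here the apparent
# group effect vanishes because abundance explains all of the richness.
adjusted_effect = group_difference(
    [lr - c for lr, c in zip(log_richness, control)], group
)

print(f"naive group effect:    {naive_effect:.3f}")
print(f"adjusted group effect: {adjusted_effect:.3f}")
```

Using the fitted trend as a fixed offset is one simple design choice. Entering it as a covariate in a regression model is another, and the sketch does not claim either is the form used by the package.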