Estimating and comparing microbial diversity in the presence of sequencing errors
Estimating and comparing microbial diversity is statistically challenging because of limited sampling and because sequencing errors can inflate low-frequency counts, producing spurious singletons. The inflated singleton count seriously distorts statistical analysis and inference about microbial diversity. Previous statistical approaches to sequencing errors generally require parametric assumptions about the sampling model or about the functional form of the frequency counts, and different parametric assumptions may lead to drastically different diversity estimates. We focus on nonparametric methods, which remain valid regardless of the parametric assumptions and can be used to compare diversity across communities. We develop here, for the first time, a nonparametric estimator of the true singleton count to replace the spurious singleton count. Our estimator is expressed in terms of the frequency counts of doubletons, tripletons and quadrupletons. To quantify microbial diversity, we adopt Hill numbers (effective numbers of taxa) under a nonparametric framework. Hill numbers, parameterized by an order q that determines the measure's emphasis on rare or common species, include taxa richness (q=0), Shannon diversity (q=1), and Simpson diversity (q=2). Based on the estimated singleton count and the original non-singleton frequency counts, two statistical approaches are developed to compare microbial diversity among multiple communities: (1) a non-asymptotic approach based on standardizing sample size or sample completeness via seamless rarefaction and extrapolation sampling curves of Hill numbers; and (2) an asymptotic approach based on a continuous diversity (Hill number) profile, which depicts the estimated asymptotes of diversities as a function of order q.
Replacing the spurious singleton count with our estimated count largely removes the positive bias that spurious singletons introduce into the diversity estimates of both approaches and allows fair comparison across microbial communities, as illustrated by applying our method to sequencing data from viral metagenomes.
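
To make the role of the order q and of the frequency counts concrete, the following minimal Python sketch computes empirical (plug-in) Hill numbers and frequency counts from a vector of observed taxon abundances. It uses only the standard definition of Hill numbers applied to the observed sample; it is not the estimator developed in this work, it applies no singleton correction, and the function names hill_number and frequency_counts are hypothetical names chosen for illustration.

```python
import math
from collections import Counter


def hill_number(abundances, q):
    """Empirical (plug-in) Hill number of order q from observed counts.

    Illustration only: this is the standard definition evaluated on the
    sample itself, with no correction for undetected taxa or for
    spurious singletons.
    """
    counts = [n for n in abundances if n > 0]
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one positive abundance is required")
    p = [n / total for n in counts]
    if math.isclose(q, 1.0):
        # q = 1 is defined as a limit: the exponential of Shannon entropy.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))


def frequency_counts(abundances):
    """Return {k: f_k}, where f_k is the number of taxa observed exactly k times.

    f_1 is the singleton count, f_2 the doubleton count, and so on.
    """
    return dict(Counter(n for n in abundances if n > 0))


if __name__ == "__main__":
    sample = [10, 5, 2, 1, 1, 1]  # made-up abundances, for illustration only
    print("frequency counts:", frequency_counts(sample))
    for q in (0, 1, 2):
        print(f"q={q}: {hill_number(sample, q):.3f}")
```

In this made-up sample, the three taxa seen only once all count toward richness (q=0) but contribute progressively less as q increases, which is why inflated singleton counts bias low-order diversity estimates the most.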