Estimating and comparing microbial diversity in the presence of sequencing errors

Mapping Intimacies ◽

10.7287/peerj.preprints.1353 ◽

2015 ◽

Author(s):

Chun-Huo Chiu ◽

Anne Chao

Keyword(s):

Microbial Diversity ◽

Common Species ◽

Sequencing Data ◽

Asymptotic Approach ◽

Sequencing Errors ◽

Statistical Approaches ◽

Hill Numbers ◽

Continuous Diversity ◽

Diversity Estimates ◽

Frequency Counts

Estimating and comparing microbial diversity are statistically challenging due to limited sampling and possible sequencing errors for low-frequency counts, producing spurious singletons. The inflated singleton count seriously affects statistical analysis and inferences about microbial diversity. Previous statistical approaches to tackle the sequencing errors generally require different parametric assumptions about the sampling model or about the functional form of frequency counts. Different parametric assumptions may lead to drastically different diversity estimates. We focus on nonparametric methods which are universally valid for all parametric assumptions and can be used to compare diversity across communities. We develop here for the first time a nonparametric estimator of the true singleton count to replace the spurious singleton count. Our estimator of the true singleton count is in terms of the frequency counts of doubletons, tripletons and quadrupletons. To quantify microbial diversity, we adopt the measure of Hill numbers (effective number of taxa) under a nonparametric framework. Hill numbers, parameterized by an order q that determines the measures’ emphasis on rare or common species, include taxa richness (q=0), Shannon diversity (q=1), and Simpson diversity (q=2). Based on the estimated singleton count and the original non-singleton frequency counts, two statistical approaches are developed to compare microbial diversity for multiple communities. (1) A non-asymptotic approach based on standardizing sample size or sample completeness via seamless rarefaction and extrapolation sampling curves of Hill numbers. (2) An asymptotic approach based on a continuous diversity (Hill number) profile which depicts the estimated asymptotes of diversities as a function of order q. 
Replacing the spurious singleton count by our estimated count, we can greatly remove the positive biases associated with diversity estimates due to spurious singletons in the two approaches and make fair comparison across microbial communities, as illustrated in applying our method to analyze sequencing data from viral metagenomes.


Estimating and comparing microbial diversity in the presence of sequencing errors

10.7287/peerj.preprints.1353v1 ◽

2015 ◽

Author(s):

Chun-Huo Chiu ◽

Anne Chao

Keyword(s):

Microbial Diversity ◽

Common Species ◽

Sequencing Data ◽

Asymptotic Approach ◽

Sequencing Errors ◽

Statistical Approaches ◽

Hill Numbers ◽

Continuous Diversity ◽

Diversity Estimates ◽

Frequency Counts


Estimating and comparing microbial diversity in the presence of sequencing errors

PeerJ ◽

10.7717/peerj.1634 ◽

2016 ◽

Vol 4 ◽

pp. e1634 ◽

Cited By ~ 34

Author(s):

Chun-Huo Chiu ◽

Anne Chao

Keyword(s):

Microbial Diversity ◽

Finite Sample ◽

Sequencing Data ◽

Asymptotic Approach ◽

Sequencing Errors ◽

Statistical Approaches ◽

Hill Numbers ◽

Diversity Estimates ◽

Frequency Counts ◽

Hill Number

Estimating and comparing microbial diversity are statistically challenging due to limited sampling and possible sequencing errors for low-frequency counts, producing spurious singletons. The inflated singleton count seriously affects statistical analysis and inferences about microbial diversity. Previous statistical approaches to tackle the sequencing errors generally require different parametric assumptions about the sampling model or about the functional form of frequency counts. Different parametric assumptions may lead to drastically different diversity estimates. We focus on nonparametric methods which are universally valid for all parametric assumptions and can be used to compare diversity across communities. We develop here a nonparametric estimator of the true singleton count to replace the spurious singleton count in all methods/approaches. Our estimator of the true singleton count is in terms of the frequency counts of doubletons, tripletons and quadrupletons, provided these three frequency counts are reliable. To quantify microbial alpha diversity for an individual community, we adopt the measure of Hill numbers (effective number of taxa) under a nonparametric framework. Hill numbers, parameterized by an order q that determines the measures’ emphasis on rare or common species, include taxa richness (q = 0), Shannon diversity (q = 1, the exponential of Shannon entropy), and Simpson diversity (q = 2, the inverse of Simpson index). A diversity profile which depicts the Hill number as a function of order q conveys all information contained in a taxa abundance distribution. Based on the estimated singleton count and the original non-singleton frequency counts, two statistical approaches (non-asymptotic and asymptotic) are developed to compare microbial diversity for multiple communities. (1) A non-asymptotic approach refers to the comparison of estimated diversities of standardized samples with a common finite sample size or sample completeness.
This approach aims to compare diversity estimates for equally-large or equally-complete samples; it is based on the seamless rarefaction and extrapolation sampling curves of Hill numbers, specifically for q = 0, 1 and 2. (2) An asymptotic approach refers to the comparison of the estimated asymptotic diversity profiles. That is, this approach compares the estimated profiles for complete samples or samples whose size tends to be sufficiently large. It is based on statistical estimation of the true Hill number of any order q ≥ 0. In the two approaches, replacing the spurious singleton count by our estimated count, we can greatly remove the positive biases associated with diversity estimates due to spurious singletons and also make fair comparisons across microbial communities, as illustrated in our simulation results and in applying our method to analyze sequencing data from viral metagenomes.
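
The Hill numbers named in the abstract can be illustrated with a minimal sketch. It computes only the plain empirical (plug-in) Hill number from observed counts using the standard textbook definition; it does not implement the authors' singleton correction, their rarefaction/extrapolation curves, or their asymptotic estimators. The function name `hill_number` and the example counts are hypothetical.

```python
import math

def hill_number(counts, q):
    """Empirical Hill number (effective number of taxa) of order q.

    counts: observed abundance of each taxon (zeros are ignored).
    This is the plug-in value from the sample, not an asymptotic estimate.
    """
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    if abs(q - 1.0) < 1e-12:
        # q = 1 is defined as a limit: the exponential of Shannon entropy.
        return math.exp(-sum(p * math.log(p) for p in props))
    # General case: (sum of p_i^q) raised to 1 / (1 - q).
    return sum(p ** q for p in props) ** (1.0 / (1.0 - q))

# Made-up abundances, for illustration only.
counts = [50, 30, 10, 5, 3, 1, 1]
profile = {q: hill_number(counts, q) for q in (0, 1, 2)}
```

For q = 0 the value equals the number of observed taxa, so every spurious singleton adds one full unit; higher orders weight common taxa more heavily and are therefore less sensitive to rare, possibly erroneous, counts.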


Increased yields of duplex sequencing data by a series of quality control tools

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab002 ◽

2021 ◽

Vol 3 (1) ◽

Author(s):

Gundula Povysil ◽

Monika Heinzl ◽

Renato Salazar ◽

Nicholas Stoler ◽

Anton Nekrutenko ◽

...

Keyword(s):

Low Frequency ◽

Variant Calling ◽

Data Loss ◽

Sequencing Data ◽

Bioinformatics Pipeline ◽

Consensus Sequences ◽

Sequencing Errors ◽

Data Output ◽

Reverse Strand ◽

Duplex Sequencing

Abstract Duplex sequencing is currently the most reliable method to identify ultra-low frequency DNA variants by grouping sequence reads derived from the same DNA molecule into families with information on the forward and reverse strand. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics toolset that analyses the tag and family composition in order to understand data loss and to implement modifications that maximize the data output for variant calling. Specifically, our tools show that tags contain polymerase chain reaction and sequencing errors that contribute to data loss and lower DCS yields. Our tools also identified chimeras, which likely reflect barcode collisions. Finally, we also developed a tool that re-examines variant calls from raw reads and provides summary data that categorize the confidence level of a variant call using a tier-based system. With this tool, we can include reads without a family and check the reliability of the call, which substantially increases the sequencing depth for variant calling, a particularly important advantage for low-input samples or low-coverage regions.


PSVII-9 Influence of lactose and milk oligosaccharides in whey permeate on jejunal mucosa-associated microbiota in nursery pigs during 7 to 11 kg BW

Journal of Animal Science ◽

10.1093/jas/skab235.732 ◽

2021 ◽

Vol 99 (Supplement_3) ◽

pp. 407-407

Author(s):

Ki Beom Jang ◽

Sung Woo Kim

Keyword(s):

Alpha Diversity ◽

Jejunal Mucosa ◽

16S Rdna Sequencing ◽

Sequencing Data ◽

Milk Oligosaccharides ◽

Whey Permeate ◽

Nursery Pigs ◽

Rdna Sequencing ◽

Acid Producing Bacteria ◽

Diversity Estimates

Abstract This study aimed to evaluate supplemental effects of milk carbohydrates in whey permeate on jejunal mucosa-associated microbiota in nursery pigs during 7 to 11 kg BW. A total of 720 pigs at 7.5 kg BW were allotted to 6 treatments (6 pens/treatment and 20 pigs/pen). Treatments were 6 levels of whey permeate supplementation (0, 3.75, 7.50, 11.25, 15.00, and 18.75%) and fed to pigs for 11 d. On d 11, 36 pigs representing median BW of each pen were euthanized to collect the jejunal mucosa to evaluate microbiota in the jejunum by 16S rDNA sequencing. Data were analyzed using contrasts in MIXED procedure of SAS. Whey permeate contained 76.3% lactose and 0.4% milk oligosaccharides. Increasing whey permeate supplementation from 0 to 18.75% did not affect the alpha-diversity estimates of microbiota. Whey permeate supplementation tended to decrease (P = 0.073, 1.59 to 1.22) Firmicutes:Bacteroidetes compared with no addition of whey permeate. Increasing whey permeate supplementation tended to linearly increase Bifidobacteriaceae (P = 0.089, 0.73 to 1.11), decrease Enterobacteriaceae (P = 0.091, 1.04 to 0.52), decrease Streptococcaceae (P = 0.094, 1.50 to 0.71), and caused quadratic changes (P < 0.05) on Lactobacillaceae (maximum: 9.14% at 12.91% whey permeate). Increasing whey permeate supplementation caused a quadratic change (P < 0.05) on Lactobacillus_Salivarius (maximum: 0.92% at 7.35% whey permeate) and tended to cause quadratic changes on Lactobacillus_Rogosae (P = 0.083; maximum: 0.53% at 8.45% whey permeate) and Lactobacillus_Mucosae (P = 0.092; maximum: 0.70% at 6.98% whey permeate). In conclusion, supplementation of whey permeate as sources of lactose and milk oligosaccharides at a range from 7 to 13% seems to be beneficial to nursery pigs by increasing the abundance of lactic acid-producing bacteria in the jejunal mucosa.


Dramatic differences in gut bacterial densities help to explain the relationship between diet and habitat in rainforest ants

10.1101/114512 ◽

2017 ◽

Cited By ~ 4

Author(s):

Jon G Sanders ◽

Piotr Lukasik ◽

Megan E Frederickson ◽

Jacob A Russell ◽

Ryuichi Koga ◽

...

Keyword(s):

16S Rrna ◽

Microbial Diversity ◽

Tropical Rainforest ◽

Amplicon Sequencing ◽

Sequencing Data ◽

Lowland Tropical Forest ◽

16S Rrna Amplicon Sequencing ◽

Microbial Symbionts ◽

Microbial Symbiosis ◽

Diversity Profiles

Abstract Abundance is a key parameter in microbial ecology, and important to estimates of potential metabolite flux, impacts of dispersal, and sensitivity of samples to technical biases such as laboratory contamination. However, modern amplicon-based sequencing techniques by themselves typically provide no information about the absolute abundance of microbes. Here, we use fluorescence microscopy and quantitative PCR as independent estimates of microbial abundance to test the hypothesis that microbial symbionts have enabled ants to dominate tropical rainforest canopies by facilitating herbivorous diets, and compare these methods to microbial diversity profiles from 16S rRNA amplicon sequencing. Through a systematic survey of ants from a lowland tropical forest, we show that the density of gut microbiota varies across several orders of magnitude among ant lineages, with median individuals from many genera only marginally above detection limits. Supporting the hypothesis that microbial symbiosis is important to dominance in the canopy, we find that the abundance of gut bacteria is positively correlated with stable isotope proxies of herbivory among canopy-dwelling ants, but not among ground-dwelling ants. Notably, these broad findings are much more evident in the quantitative data than in the 16S rRNA sequencing data. Our results help to resolve a longstanding question in tropical rainforest ecology, and have broad implications for the interpretation of sequence-based surveys of microbial diversity.


K-mer counting with low memory consumption enables fast clustering of single-cell sequencing data without read alignment

10.1101/723833 ◽

2019 ◽

Author(s):

Christina Huan Shi ◽

Kevin Y. Yip

Keyword(s):

Single Cell ◽

State Of The Art ◽

Rna Seq ◽

Sequencing Data ◽

Memory Consumption ◽

Analysis Pipeline ◽

Cell Clusters ◽

Single Cell Sequencing ◽

Sequencing Errors ◽

Full Analysis

Abstract K-mer counting has many applications in sequencing data processing and analysis. However, sequencing errors can produce many false k-mers that substantially increase the memory requirement during counting. We propose a fast k-mer counting method, CQF-deNoise, which has a novel component for dynamically identifying and removing false k-mers while preserving counting accuracy. Compared with four state-of-the-art k-mer counting methods, CQF-deNoise consumed 49-76% less memory than the second best method, but still ran competitively fast. The k-mer counts from CQF-deNoise produced cell clusters from single-cell RNA-seq data highly consistent with CellRanger but required only 5% of the running time at the same memory consumption, suggesting that CQF-deNoise can be used for a preview of cell clusters for an early detection of potential data problems, before running a much more time-consuming full analysis pipeline.
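
The underlying problem, that sequencing errors create many low-count false k-mers, can be illustrated with a minimal sketch. It is not CQF-deNoise: exact dictionary counting and a fixed count threshold applied after counting stand in for the method's dynamic removal, and the names `count_kmers`, `drop_rare_kmers` and `min_count` are hypothetical.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring across all reads (exact counting)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def drop_rare_kmers(counts, min_count=2):
    """Keep only k-mers seen at least min_count times.

    Assumption for illustration: k-mers below the threshold are treated as
    likely products of sequencing errors. The threshold is arbitrary.
    """
    return {kmer: c for kmer, c in counts.items() if c >= min_count}

# Toy reads; the second read ends in a simulated error (A instead of T).
reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
counts = count_kmers(reads, k=4)
solid = drop_rare_kmers(counts, min_count=2)
```

In this toy input the single erroneous base produces one extra k-mer that occurs once, which is the kind of entry that inflates memory use when every distinct k-mer must be stored.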


Variant calling and quality control of large-scale human genome sequencing data

Emerging Topics in Life Sciences ◽

10.1042/etls20190007 ◽

2019 ◽

Vol 3 (4) ◽

pp. 399-409 ◽

Cited By ~ 1

Author(s):

Brandon Jew ◽

Jae Hoon Sul

Keyword(s):

Quality Control ◽

Genome Sequencing ◽

Genetic Variants ◽

Large Scale ◽

Variant Calling ◽

Sequencing Data ◽

Computational Approaches ◽

Sequencing Errors ◽

Human Genome Sequencing ◽

Number Of Individuals

Abstract Next-generation sequencing has allowed genetic studies to collect genome sequencing data from a large number of individuals. However, raw sequencing data are not usually interpretable due to fragmentation of the genome and technical biases; therefore, analysis of these data requires many computational approaches. First, for each sequenced individual, sequencing data are aligned and further processed to account for technical biases. Then, variant calling is performed to obtain information on the positions of genetic variants and their corresponding genotypes. Quality control (QC) is applied to identify individuals and genetic variants with sequencing errors. These procedures are necessary to generate accurate variant calls from sequencing data, and many computational approaches have been developed for these tasks. This review will focus on current widely used approaches for variant calling and QC.


Early life home microbiome and hyperactivity/inattention in school-age children

Scientific Reports ◽

10.1038/s41598-019-53527-1 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 2

Author(s):

Lidia Casas ◽

Anne M. Karvonen ◽

Pirkka V. Kirjavainen ◽

Martin Täubel ◽

Heidi Hyytiäinen ◽

...

Keyword(s):

Microbial Diversity ◽

Early Life ◽

Fungal Diversity ◽

Illumina Miseq ◽

Shannon Index ◽

School Age Children ◽

Sequencing Data ◽

Logistic Regression Models ◽

Diversity Measures ◽

Early Life Exposure

Abstract This study evaluates the association between indoor microbial diversity early in life and hyperactivity/inattention symptoms in children at ages 10 and 15 years. A random sample enriched with subjects with hyperactivity/inattention at age 15 years was selected from the German LISA birth cohort. Bedroom floor dust was collected at age 3 months and 4 bacterial and fungal diversity measures [number of observed operational taxonomic units (OTUs), Chao1, Shannon and Simpson indices] were calculated from Illumina MiSeq sequencing data. Hyperactivity/inattention was based on the Strengths and Difficulties Questionnaire at ages 10 and 15 (cut-off ≥7). Adjusted associations between 4 diversity measures in tertiles and hyperactivity/inattention were investigated with weighted and survey logistic regression models. We included 226 individuals with information on microbial diversity and hyperactivity/inattention. Early life bacterial diversity was inversely associated with hyperactivity/inattention at age 10 [bacterial OTUs (medium vs low: aOR = 0.4, 95%CI = (0.2–0.8)) and Chao1 (medium vs low: 0.3 (0.1–0.5); high vs low: 0.3 (0.2–0.6)], whereas fungal diversity was directly associated [Chao1 (high vs low: 2.1 (1.1–4.0)), Shannon (medium vs low: 2.8 (1.3–5.8)), and Simpson (medium vs low: 4.7 (2.4–9.3))]. At age 15, only Shannon index was significantly associated with hyperactivity/inattention [bacteria (medium vs low: 2.3 (1.2–4.2); fungi (high vs low: 0.5 (0.3–0.9))]. In conclusion, early life exposure to microbial diversity may play a role in the psychobehavioural development. We observe heterogeneity in the direction of the associations encouraging further longitudinal studies to deepen our understanding of the characteristics of the microbial community underlying the observed associations.
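
Two of the diversity measures listed in the abstract, the number of observed OTUs and Chao1, can be sketched from a vector of OTU counts. This uses the commonly cited Chao1 formula and is not the study's actual pipeline, which the abstract does not describe; the function names and the example counts are hypothetical.

```python
def observed_otus(counts):
    """Number of OTUs with at least one read."""
    return sum(1 for c in counts if c > 0)

def chao1(counts):
    """Chao1 richness estimate from OTU read counts.

    f1 = number of singletons (OTUs seen once),
    f2 = number of doubletons (OTUs seen twice).
    Falls back to the bias-corrected form when there are no doubletons.
    """
    s_obs = observed_otus(counts)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    if f2 > 0:
        return s_obs + f1 * f1 / (2.0 * f2)
    return s_obs + f1 * (f1 - 1) / 2.0

# Made-up OTU counts: 8 observed OTUs, 4 singletons, 2 doubletons.
otu_counts = [10, 5, 2, 2, 1, 1, 1, 1]
richness = observed_otus(otu_counts)
estimate = chao1(otu_counts)
```

Because the singleton count enters the formula squared, Chao1 is sensitive to how many OTUs are observed exactly once.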


Accurate determination of node and arc multiplicities in de bruijn graphs using conditional random fields

BMC Bioinformatics ◽

10.1186/s12859-020-03740-x ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Aranka Steyaert ◽

Pieter Audenaert ◽

Jan Fostier

Keyword(s):

Genomic Sequence ◽

Conditional Random Field ◽

Accurate Determination ◽

Next Generation Sequencing Data ◽

De Bruijn Graph ◽

Sequencing Data ◽

De Bruijn Graphs ◽

Sequencing Errors ◽

Expectation Maximisation ◽

De Bruijn

Abstract Background De Bruijn graphs are key data structures for the analysis of next-generation sequencing data. They efficiently represent the overlap between reads and hence, also the underlying genome sequence. However, sequencing errors and repeated subsequences render the identification of the true underlying sequence difficult. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each k-mer (resp. k+1-mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and presence of sequencing errors. Multiplicities of nodes/arcs in the de Bruijn graph are reflected in their coverage, however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node/arc multiplicities base their decisions solely on the information in nodes and arcs individually, under-utilising the information present in the sequencing data. Results To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner. Conclusions We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectation-maximisation parameter estimation. True k-mers can be distinguished from erroneous k-mers with a higher F1 score than existing methods. A C++11 implementation is available at https://github.com/biointec/detox under the GNU AGPL v3.0 license.
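
The relationship between coverage and multiplicity described in the abstract can be illustrated with a minimal sketch. It implements only the naive per-node approach that the abstract contrasts against, where each node or arc is judged from its own coverage; the CRF model is not reproduced. The names `de_bruijn_coverage`, `naive_multiplicity` and `mean_cov_per_copy` are hypothetical, and reverse complements are ignored for simplicity.

```python
from collections import Counter

def de_bruijn_coverage(reads, k):
    """Coverage of nodes (k-mers) and arcs ((k+1)-mers) implied by the reads."""
    node_cov = Counter()
    arc_cov = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            node_cov[read[i:i + k]] += 1
        for i in range(len(read) - k):
            # An arc joins two consecutive k-mers, i.e. one (k+1)-mer.
            arc_cov[read[i:i + k + 1]] += 1
    return node_cov, arc_cov

def naive_multiplicity(coverage, mean_cov_per_copy):
    """Round coverage to the nearest multiple of the per-copy coverage.

    Assumption for illustration: mean_cov_per_copy is known in advance.
    A result of 0 would mark the node or arc as a likely sequencing error.
    """
    return round(coverage / mean_cov_per_copy)

# Toy reads, for illustration only.
reads = ["ACGTAC", "CGTACG"]
node_cov, arc_cov = de_bruijn_coverage(reads, k=3)
```

With real data, coverage varies around the per-copy mean, so rounding each node in isolation can assign neighbouring nodes inconsistent multiplicities; that ambiguity is what the abstract says the CRF model addresses by using surrounding nodes and arcs.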
