Debiasing FracMinHash and deriving confidence intervals for mutation rates across a wide range of evolutionary distances

2022 ◽  
Author(s):  
Mahmudur Rahman Hera ◽  
N Tessa Pierce-Ward ◽  
David Koslicki

Sketching methods offer computational biologists scalable techniques to analyze data sets that continue to grow in size. MinHash is one such technique that has enjoyed recent broad application. However, traditional MinHash has previously been shown to perform poorly when applied to sets of very dissimilar sizes. FracMinHash was recently introduced as a modification of MinHash to compensate for this lack of performance when set sizes differ. While experimental evidence has been encouraging, FracMinHash has not yet been analyzed from a theoretical perspective. In this paper, we perform such an analysis and prove that while FracMinHash is not unbiased, this bias is easily corrected. Next, we detail how a simple mutation model interacts with FracMinHash and derive confidence intervals for evolutionary mutation distances between pairs of sequences, as well as hypothesis tests for FracMinHash. We find that FracMinHash estimates the containment of a genome in a large metagenome more accurately and more precisely than traditional MinHash, and the confidence interval performs significantly better in estimating mutation distances. A Python-based implementation of the theorems we derive is freely available at https://github.com/KoslickiLab/mutation-rate-ci-calculator. The results presented in this paper can be reproduced using the code at https://github.com/KoslickiLab/ScaledMinHash-reproducibles.
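To make the sketching idea concrete, the short Python sketch below illustrates FracMinHash-style sampling with a scale factor and a containment estimate divided by the bias-correction factor discussed in the paper. It is only a simplified illustration under assumptions (a toy hash mapped to [0, 1), k-mers as plain substrings), not the authors' implementation or the exact estimator from their theorems.

```python
# Illustrative sketch only, not the authors' implementation: FracMinHash-style
# sampling with a scale factor, plus a containment estimate divided by the
# bias-correction factor 1 - (1 - s)^|A| described in the paper.
import hashlib

def _hash01(kmer: str) -> float:
    """Map a k-mer to a pseudo-uniform value in [0, 1)."""
    h = int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")
    return h / 2**64

def frac_minhash(seq: str, k: int = 21, scale: float = 0.001) -> set:
    """Keep exactly those k-mers whose hash value falls below the scale factor."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return {km for km in kmers if _hash01(km) < scale}

def debiased_containment(sketch_a: set, sketch_b: set,
                         n_distinct_kmers_a: int, scale: float) -> float:
    """Estimate the containment of A in B from the two sketches."""
    if not sketch_a:
        return float("nan")
    naive = len(sketch_a & sketch_b) / len(sketch_a)
    correction = 1.0 - (1.0 - scale) ** n_distinct_kmers_a  # P(sketch of A is non-empty)
    return naive / correction
```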

2020 ◽  
Author(s):  
Matthias Flor ◽  
Michael Weiß ◽  
Thomas Selhorst ◽  
Christine Müller-Graf ◽  
Matthias Greiner

Abstract Background: Various methods exist for statistical inference about a prevalence that consider misclassifications due to an imperfect diagnostic test. However, traditional methods are known to suffer from truncation of the prevalence estimate and the confidence intervals constructed around the point estimate, as well as from under-performance of the confidence intervals' coverage. Methods: In this study, we used simulated data sets to validate a Bayesian prevalence estimation method and compare its performance to frequentist methods, i.e., the Rogan-Gladen estimate for prevalence (RGE) in combination with several methods of confidence interval construction. Our performance measures are (i) error distribution of the point estimate against the simulated true prevalence and (ii) coverage and length of the confidence interval, or credible interval in the case of the Bayesian method. Results: Across all data sets, the Bayesian point estimate and the RGE produced similar error distributions with slight advantages of the former over the latter. In addition, the Bayesian estimate did not suffer from the RGE's truncation problem at zero or unity. With respect to coverage performance of the confidence and credible intervals, all of the traditional frequentist methods exhibited strong under-coverage, whereas the Bayesian credible interval as well as a newly developed frequentist method by Lang and Reiczigel performed as desired, with the Bayesian method having a very slight advantage in terms of interval length. Conclusion: The Bayesian prevalence estimation method should be preferred over traditional frequentist methods. An acceptable alternative is to combine the Rogan-Gladen point estimate with the Lang-Reiczigel confidence interval.
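As a concrete companion to the abstract, the Python sketch below computes the Rogan-Gladen point estimate and a minimal grid-based Bayesian posterior for the true prevalence. It assumes sensitivity and specificity are known exactly and uses a uniform prior, so it is only a simplified stand-in for the Bayesian model validated in the study; the numbers in the example call are made up.

```python
# Illustrative sketch, not the authors' code: the Rogan-Gladen point estimate
# and a minimal grid-based Bayesian posterior for the true prevalence, assuming
# sensitivity (se) and specificity (sp) are known exactly (a simplification of
# the model validated in the paper).
import numpy as np

def rogan_gladen(n_pos: int, n: int, se: float, sp: float) -> float:
    """RGE: back-correct the apparent prevalence for test misclassification."""
    ap = n_pos / n                      # apparent prevalence
    est = (ap + sp - 1.0) / (se + sp - 1.0)
    return min(max(est, 0.0), 1.0)      # truncation at 0 and 1 criticized above

def bayes_prevalence(n_pos: int, n: int, se: float, sp: float,
                     grid_size: int = 10_001, cred: float = 0.95):
    """Grid posterior for prevalence under a uniform prior; returns mean and CrI."""
    pi = np.linspace(0.0, 1.0, grid_size)
    ap = pi * se + (1.0 - pi) * (1.0 - sp)          # P(test positive | pi)
    log_lik = n_pos * np.log(ap) + (n - n_pos) * np.log1p(-ap)
    post = np.exp(log_lik - log_lik.max())
    post /= post.sum()
    cdf = np.cumsum(post)
    lo = pi[np.searchsorted(cdf, (1 - cred) / 2)]
    hi = pi[np.searchsorted(cdf, 1 - (1 - cred) / 2)]
    return float((pi * post).sum()), (float(lo), float(hi))

# Hypothetical example: 35 positives out of 500 with 90% sensitivity, 95% specificity.
print(rogan_gladen(35, 500, 0.90, 0.95))
print(bayes_prevalence(35, 500, 0.90, 0.95))
```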


2021 ◽  
Author(s):  
Julius M. Pfadt ◽  
Don van den Bergh ◽  
Morten Moshagen

The reliability of a multidimensional test instrument is commonly estimated using coefficients ωt (total) and ωh (hierarchical) based on a factor model approach. However, point estimates for the coefficients are rarely accompanied by uncertainty estimates. In this study, we compare several methods to obtain confidence intervals for the two coefficients: bootstrap and normal-theory intervals. In addition, we adapt methodology from Bayesian structural equation modeling to develop Bayesian versions of coefficients ωt and ωh by sampling from a second-order factor model. Results from a comprehensive simulation study show that the bootstrap standard error confidence interval, the bootstrap standard error log-transformed confidence interval, the Wald confidence interval, and the Bayesian credible interval perform well across a wide range of conditions. This study provides researchers with more information about the ωt and ωh confidence intervals they wish to report in their research. Moreover, it introduces ωt and ωh credible intervals that are easy to use and come with all the benefits of Bayesian parameter estimation.
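For readers unfamiliar with the bootstrap standard-error interval compared in the study, here is a minimal Python sketch of that construction for a generic reliability statistic. Coefficient alpha is used purely as a convenient stand-in statistic; the paper's ωt and ωh require fitting a second-order factor model, which is omitted here.

```python
# Minimal sketch (not the authors' code) of the bootstrap standard-error
# confidence interval discussed above, written for a generic reliability
# statistic. Coefficient alpha serves only as a stand-in for the
# factor-model-based omega coefficients studied in the paper.
import numpy as np

def coefficient_alpha(data: np.ndarray) -> float:
    """Cronbach's alpha for an n x k item-score matrix (stand-in statistic)."""
    k = data.shape[1]
    item_var = data.var(axis=0, ddof=1).sum()
    total_var = data.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_var / total_var)

def bootstrap_se_ci(data, statistic, n_boot=2000, level_z=1.96, seed=1):
    """Resample rows, estimate the bootstrap SE, and build a normal-theory CI."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    est = statistic(data)
    boots = np.array([statistic(data[rng.integers(0, n, n)]) for _ in range(n_boot)])
    se = boots.std(ddof=1)
    return est, (est - level_z * se, est + level_z * se)
```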


2014 ◽  
Author(s):  
Hua Chen ◽  
Jody Hey ◽  
Montgomery Slatkin

Recent positive selection can increase the frequency of an advantageous mutant rapidly enough that a relatively long ancestral haplotype remains intact around it. We present a hidden Markov model (HMM) to identify such haplotype structures. Using the HMM-identified haplotype structures, a population genetic model for the extent of ancestral haplotypes is then adopted to infer the selection intensity and the allele age. Simulations show that this method can detect selection under a wide range of conditions and has higher power than the existing frequency spectrum-based method. In addition, it provides good estimates of the selection coefficients and allele ages under strong selection. The method analyzes large data sets in a reasonable amount of running time. It is applied to HapMap III data for a genome scan and identifies a list of candidate regions putatively under recent positive selection. It is also applied to several genes known to be under recent positive selection, including the LCT, KITLG and TYRP1 genes in Northern Europeans and OCA2 in East Asians, to estimate their allele ages and selection coefficients.
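As a generic illustration of the HMM machinery mentioned above (not the paper's model), the sketch below runs Viterbi decoding on a two-state chain, where one state stands for the ancestral core-haplotype background and the other for the non-ancestral background. All probabilities and the match/mismatch observations are hypothetical placeholders.

```python
# Generic two-state Viterbi decoder, included only to illustrate the kind of
# HMM segmentation described above (ancestral vs. non-ancestral haplotype
# background at successive markers). States, emissions, and transition rates
# are hypothetical placeholders, not the paper's parameterization.
import numpy as np

def viterbi(obs, log_trans, log_emit, log_start):
    """Most likely state path for integer observations under a small HMM."""
    n_states, n_obs = log_trans.shape[0], len(obs)
    dp = np.full((n_obs, n_states), -np.inf)
    back = np.zeros((n_obs, n_states), dtype=int)
    dp[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, n_obs):
        for s in range(n_states):
            scores = dp[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(scores))
            dp[t, s] = scores[back[t, s]] + log_emit[s, obs[t]]
    path = [int(np.argmax(dp[-1]))]
    for t in range(n_obs - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]  # 0 = ancestral background, 1 = non-ancestral, by convention here

# Hypothetical data: 1 = marker matches the core haplotype, 0 = mismatch.
obs = [1, 1, 1, 0, 1, 0, 0, 0]
log_start = np.log([0.5, 0.5])
log_trans = np.log([[0.95, 0.05],   # ancestral background tends to persist along the chromosome
                    [0.05, 0.95]])
log_emit = np.log([[0.10, 0.90],    # ancestral state mostly emits matches
                   [0.60, 0.40]])   # non-ancestral background emits more mismatches
print(viterbi(obs, log_trans, log_emit, log_start))
```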


2020 ◽  
Author(s):  
Matthias Flor ◽  
Michael Weiß ◽  
Thomas Selhorst ◽  
Christine Müller-Graf ◽  
Matthias Greiner

Abstract Background: Various methods exist for statistical inference about a prevalence that consider misclassifications due to an imperfect diagnostic test. However, traditional methods are known to suffer from censoring of the prevalence estimate and the confidence intervals constructed around the point estimate, as well as from under-performance of the confidence intervals' coverage. Methods: In this study, we used simulated data sets to validate a Bayesian prevalence estimation method and compare its performance to frequentist methods, i.e., the Rogan-Gladen estimate for prevalence (RGE) in combination with several methods of confidence interval construction. Our performance measures are (i) bias of the point estimate against the simulated true prevalence and (ii) coverage and length of the confidence interval, or credible interval in the case of the Bayesian method. Results: Across all data sets, the Bayesian point estimate and the RGE produced similar bias distributions with slight advantages of the former over the latter. In addition, the Bayesian estimate did not suffer from the RGE's censoring problem at zero or unity. With respect to coverage performance of the confidence and credible intervals, all of the traditional frequentist methods exhibited strong under-coverage, whereas the Bayesian credible interval as well as a newly developed frequentist method by Lang and Reiczigel performed as desired, with the Bayesian method having a very slight advantage in terms of interval length. Conclusion: The Bayesian prevalence estimation method should be preferred over traditional frequentist methods. An acceptable alternative is to combine the Rogan-Gladen point estimate with the Lang-Reiczigel confidence interval.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Eleanor F. Miller ◽  
Andrea Manica

Abstract Background Today an unprecedented amount of genetic sequence data is stored in publicly available repositories. For decades now, mitochondrial DNA (mtDNA) has been the workhorse of genetic studies, and as a result, there is a large volume of mtDNA data available in these repositories for a wide range of species. Indeed, whilst whole genome sequencing is an exciting prospect for the future, for most non-model organisms, classical markers such as mtDNA remain widely used. By compiling existing data from multiple original studies, it is possible to build powerful new datasets capable of exploring many questions in ecology, evolution and conservation biology. One key question that these data can help inform is what happened in a species’ demographic past. However, compiling data in this manner is not trivial; there are many complexities associated with data extraction, data quality and data handling. Results Here we present the mtDNAcombine package, a collection of tools developed to manage some of the major decisions associated with handling multi-study sequence data, with a particular focus on preparing sequence data for Bayesian skyline plot demographic reconstructions. Conclusions There is now more genetic information available than ever before, and large meta-data sets offer great opportunities to explore new and exciting avenues of research. However, compiling multi-study datasets remains a technically challenging prospect. The mtDNAcombine package provides a pipeline to streamline the process of downloading, curating, and analysing sequence data, guiding the process of compiling data sets from the online database GenBank.
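mtDNAcombine itself is an R package; purely as an illustration of the GenBank download step that such a pipeline automates, here is a small Biopython sketch. The species, gene term, contact email address, and record limit are placeholders.

```python
# Not the mtDNAcombine API (an R package); a minimal Biopython illustration of
# the GenBank retrieval step such a pipeline automates. Search term, email
# address, and gene name are placeholders.
from Bio import Entrez, SeqIO

Entrez.email = "your.name@example.org"  # NCBI asks for a contact address

# Find cytochrome b accessions for a hypothetical species of interest.
search = Entrez.esearch(db="nucleotide",
                        term="Erithacus rubecula[Organism] AND cytb[Gene]",
                        retmax=200)
ids = Entrez.read(search)["IdList"]
search.close()

# Download the matching GenBank records and inspect basic per-sequence details.
handle = Entrez.efetch(db="nucleotide", id=",".join(ids),
                       rettype="gb", retmode="text")
records = list(SeqIO.parse(handle, "genbank"))
handle.close()

for rec in records[:5]:
    print(rec.id, len(rec.seq), rec.annotations.get("organism"))
```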


Genetics ◽  
1998 ◽  
Vol 148 (1) ◽  
pp. 525-535
Author(s):  
Claude M Lebreton ◽  
Peter M Visscher

Abstract Several nonparametric bootstrap methods are tested to obtain better confidence intervals for the quantitative trait loci (QTL) positions, i.e., with minimal width and unbiased coverage probability. Two selective resampling schemes are proposed as a means of conditioning the bootstrap on the number of genetic factors in our model inferred from the original data. The selection is based on criteria related to the estimated number of genetic factors, and only the retained bootstrapped samples will contribute a value to the empirically estimated distribution of the QTL position estimate. These schemes are compared with a nonselective scheme across a range of simple configurations of one QTL on a one-chromosome genome. In particular, the effect of the chromosome length and the relative position of the QTL are examined for a given experimental power, which determines the confidence interval size. With the test protocol used, it appears that the selective resampling schemes are either unbiased or least biased when the QTL is situated near the middle of the chromosome. When the QTL is closer to one end, the likelihood curve of its position along the chromosome becomes truncated, and the nonselective scheme then performs better inasmuch as the percentage of estimated confidence intervals that actually contain the real QTL's position is closer to expectation. The nonselective method, however, produces larger confidence intervals. Hence, we advocate use of the selective methods, regardless of the QTL position along the chromosome (to reduce confidence interval sizes), but we leave the problem open as to how the method should be altered to take into account the bias of the original estimate of the QTL's position.
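A schematic Python sketch of the selective resampling idea described above: bootstrap samples are kept only if they satisfy a selection criterion, and the retained position estimates form a percentile interval. The estimator and the selection check are placeholder functions, not the authors' QTL mapping procedure.

```python
# Schematic sketch of a selective bootstrap percentile interval for a QTL
# position; `estimate_qtl_position` and `single_qtl_inferred` are hypothetical
# placeholders for a QTL mapping routine and the selection criterion.
import numpy as np

def selective_bootstrap_ci(phenotypes, genotypes, estimate_qtl_position,
                           single_qtl_inferred, n_boot=1000, level=0.95,
                           max_draws=20000, seed=7):
    """Percentile CI built only from resamples that pass the selection criterion."""
    rng = np.random.default_rng(seed)
    n, kept, draws = len(phenotypes), [], 0
    while len(kept) < n_boot and draws < max_draws:
        draws += 1
        idx = rng.integers(0, n, n)               # resample individuals with replacement
        y, g = phenotypes[idx], genotypes[idx]
        if not single_qtl_inferred(y, g):         # discard resamples failing the criterion
            continue
        kept.append(estimate_qtl_position(y, g))  # e.g. position in cM along the chromosome
    lo, hi = np.quantile(kept, [(1 - level) / 2, 1 - (1 - level) / 2])
    return float(lo), float(hi)
```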


Author(s):  
Thomas J Littlejohns ◽  
Amanda Y Chong ◽  
Naomi E Allen ◽  
Matthew Arnold ◽  
Kathryn E Bradbury ◽  
...  

ABSTRACT Background The number of gluten-free diet followers without celiac disease (CD) is increasing. However, little is known about the characteristics of these individuals. Objectives We address this issue by investigating a wide range of genetic and phenotypic characteristics in association with following a gluten-free diet. Methods The cross-sectional association between lifestyle and health-related characteristics and following a gluten-free diet was investigated in 124,447 women and men aged 40–69 y from the population-based UK Biobank study. A genome-wide association study (GWAS) of following a gluten-free diet was performed. Results A total of 1776 (1.4%) participants reported following a gluten-free diet. Gluten-free diet followers were more likely to be women, nonwhite, highly educated, living in more socioeconomically deprived areas, former smokers, have lost weight in the past year, have poorer self-reported health, and have made dietary changes as a result of illness. Conversely, these individuals were less likely to consume alcohol daily, be overweight or obese, have hypertension, or use cholesterol-lowering medication. Participants with hospital inpatient diagnosed blood and immune mechanism disorders (OR: 1.62; 95% CI: 1.18, 2.21) and non-CD digestive system diseases (OR: 1.58; 95% CI: 1.42, 1.77) were more likely to follow a gluten-free diet. The GWAS demonstrated that no genetic variants were associated with being a gluten-free diet follower. Conclusions Gluten-free diet followers have a better cardiovascular risk profile than non-gluten-free diet followers but poorer self-reported health and a higher prevalence of blood and immune disorders and digestive conditions. Reasons for following a gluten-free diet warrant further investigation.


Author(s):  
A Salman Avestimehr ◽  
Seyed Mohammadreza Mousavi Kalan ◽  
Mahdi Soltanolkotabi

Abstract Dealing with the sheer size and complexity of today’s massive data sets requires computational platforms that can analyze data in a parallelized and distributed fashion. A major bottleneck that arises in such modern distributed computing environments is that some of the worker nodes may run slow. These nodes, a.k.a. stragglers, can significantly slow down computation, as the slowest node may dictate the overall computational time. A recent computational framework, called encoded optimization, creates redundancy in the data to mitigate the effect of stragglers. In this paper, we develop a novel mathematical understanding of this framework, demonstrating its effectiveness in much broader settings than was previously understood. We also analyze the convergence behavior of iterative encoded optimization algorithms, allowing us to characterize fundamental trade-offs between convergence rate, size of data set, accuracy, computational load (or data redundancy) and straggler toleration in this framework.
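To give a feel for the setting, the toy numpy simulation below (not the algorithms analyzed in the paper) encodes a least-squares problem with a redundant random matrix, splits the encoded rows across workers, and takes each gradient step using only the workers that respond, imitating straggler toleration. The problem sizes, the encoding matrix, the step size, and the straggler pattern are all arbitrary illustrative choices.

```python
# Toy simulation (not the paper's algorithms) of encoded optimization for least
# squares: data are multiplied by a redundant random encoding matrix, split
# across workers, and each gradient step ignores a few simulated stragglers.
import numpy as np

rng = np.random.default_rng(0)
n, d = 600, 20
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.01 * rng.normal(size=n)

n_workers, redundancy = 10, 2.0
m = int(redundancy * n)                       # encoded rows exceed original rows
S = rng.normal(size=(m, n)) / np.sqrt(m)      # random encoding matrix, E[S^T S] = I
A_enc, b_enc = S @ A, S @ b
blocks = np.array_split(np.arange(m), n_workers)

x = np.zeros(d)
step = 0.5
for it in range(300):
    # Pretend the slowest 3 workers straggle this round; ignore their results.
    fast = rng.choice(n_workers, size=n_workers - 3, replace=False)
    grad = np.zeros(d)
    for w in fast:
        rows = blocks[w]
        grad += A_enc[rows].T @ (A_enc[rows] @ x - b_enc[rows])
    x -= step * grad * (n_workers / len(fast)) / n   # rescale for the missing blocks
print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```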


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yance Feng ◽  
Lei M. Li

Abstract Background Normalization of RNA-seq data aims at identifying biological expression differentiation between samples by removing the effects of unwanted confounding factors. Explicitly or implicitly, the justification of normalization requires a set of housekeeping genes. However, the existence of housekeeping genes common to a very large collection of samples, especially under a wide range of conditions, is questionable. Results We propose to carry out pairwise normalization with respect to multiple references, selected from representative samples. Then the pairwise intermediates are integrated based on a linear model that adjusts the reference effects. Motivated by the notion of housekeeping genes and their statistical counterparts, we adopt the robust least trimmed squares regression in pairwise normalization. The proposed method (MUREN) is compared with other existing tools on some standard data sets. The assessment of normalization quality emphasizes preserving possible asymmetric differentiation, whose biological significance is exemplified by single-cell data of the cell cycle. MUREN is implemented as an R package. The code, under license GPL-3, is available on the github platform: github.com/hippo-yf/MUREN and on the conda platform: anaconda.org/hippo-yf/r-muren. Conclusions MUREN performs RNA-seq normalization using a two-step statistical regression induced from a general principle. We propose using the densities of pairwise differentiations to evaluate the goodness of normalization. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biological asymmetric differentiation. Moreover, by robustly integrating pre-normalized counts with respect to multiple references, MUREN is immune to individual outlier samples.
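The sketch below is a simplified Python illustration of the pairwise step described above (not the MUREN R package): one sample's log counts are regressed on a reference sample's log counts with a bare-bones least trimmed squares fit, and the fit is then used to map the sample onto the reference scale. The concentration-step LTS routine is a minimal stand-in for a production robust regression.

```python
# Simplified sketch, not the MUREN package: pairwise normalization against a
# reference sample via a basic least trimmed squares (LTS) line fit on the log
# scale. The concentration-step LTS is a bare-bones stand-in.
import numpy as np

def lts_fit(x, y, keep_frac=0.75, n_iter=30, seed=0):
    """Least trimmed squares line y ~ a + b*x via simple concentration steps."""
    rng = np.random.default_rng(seed)
    n = len(x)
    h = max(2, int(keep_frac * n))
    idx = rng.choice(n, size=h, replace=False)      # random starting subset
    for _ in range(n_iter):
        b, a = np.polyfit(x[idx], y[idx], 1)        # OLS on the current subset
        resid = np.abs(y - (a + b * x))
        new_idx = np.argsort(resid)[:h]             # keep the h best-fitting genes
        if set(new_idx) == set(idx):
            break
        idx = new_idx
    return a, b

def pairwise_normalize(counts, ref_counts):
    """Normalize one sample against a reference on the log scale."""
    log_ref = np.log1p(ref_counts.astype(float))
    log_s = np.log1p(counts.astype(float))
    a, b = lts_fit(log_ref, log_s)
    # Map the sample onto the reference scale and return to the count scale.
    return np.expm1((log_s - a) / b)
```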

