Powerful and robust non-parametric association testing for microbiome data via a zero-inflated quantile approach (ZINQ)

Microbiome ◽  
2021 ◽  
Vol 9 (1) ◽  
Author(s):  
Wodan Ling ◽  
Ni Zhao ◽  
Anna M. Plantinga ◽  
Lenore J. Launer ◽  
Anthony A. Fodor ◽  
...  

Abstract
Background: Identification of bacterial taxa associated with diseases, exposures, and other variables of interest offers a more comprehensive understanding of the role of microbes in many conditions. However, despite considerable research into statistical methods for association testing with microbiome data, generally applicable approaches remain elusive. Classical tests often do not accommodate the realities of microbiome data, leading to power loss. Approaches tailored for microbiome data depend heavily on the normalization strategies used to handle differential read depth and other data characteristics, and they often have unacceptably high false positive rates, generally due to unsatisfied distributional assumptions. On the other hand, many non-parametric tests suffer from loss of power and may also present difficulties in adjusting for potential covariates. Most extant approaches also fail in the presence of heterogeneous effects. The field needs new non-parametric approaches that are tailored to microbiome data, robust to distributional assumptions, and powerful under heterogeneous effects, while permitting adjustment for covariates.

Methods: As an alternative to existing approaches, we propose a zero-inflated quantile approach (ZINQ), which uses a two-part quantile regression model to accommodate the zero inflation in microbiome data. For a given taxon, ZINQ consists of a valid test in logistic regression to model the zero counts, followed by a series of quantile rank-score based tests on multiple quantiles of the non-zero part with adjustment for the zero inflation. As a regression- and quantile-based approach, the method is non-parametric and robust to irregular distributions, while allowing for covariate adjustment. Since no distributional assumptions are made, ZINQ can be applied to data processed under any normalization strategy.

Results: Thorough simulations based on real data across a range of scenarios, together with applications to real data sets, show that ZINQ often has equivalent or higher power than existing tests while offering better control of false positives.

Conclusions: We present ZINQ, a quantile-based association test between microbiota and dichotomous or quantitative clinical variables, providing a powerful and robust alternative for current microbiome differential abundance analysis.
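To make the two-part idea concrete, here is a minimal Python sketch of a ZINQ-style analysis for a single taxon: a logistic regression on presence/absence, quantile regressions on the non-zero counts at several quantiles, and a simple Cauchy combination of the resulting p-values. The simulated data, covariate layout, and the combination rule are illustrative assumptions; the published ZINQ test uses rank-score statistics with an explicit zero-inflation adjustment, so this is a sketch of the structure, not the authors' implementation.

```python
# Two-part, quantile-based association sketch (not the ZINQ package).
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n = 200
x = rng.binomial(1, 0.5, n)                                  # dichotomous clinical variable
Z = rng.normal(size=(n, 2))                                  # adjustment covariates
y = rng.negative_binomial(2, 0.3, n) * rng.binomial(1, 0.6, n)  # zero-inflated counts

design = sm.add_constant(np.column_stack([x, Z]))

# Part 1: logistic regression on presence/absence of the taxon.
present = (y > 0).astype(int)
logit_p = sm.Logit(present, design).fit(disp=0).pvalues[1]

# Part 2: quantile regression on the non-zero counts at several quantiles.
nz = y > 0
quant_ps = []
for tau in (0.25, 0.5, 0.75):
    fit = sm.QuantReg(y[nz], design[nz]).fit(q=tau)
    quant_ps.append(fit.pvalues[1])

# Combine the component p-values; a Cauchy combination is one simple choice.
all_p = np.array([logit_p] + quant_ps)
cauchy_stat = np.mean(np.tan((0.5 - all_p) * np.pi))
combined_p = stats.cauchy.sf(cauchy_stat)
print(f"zero-part p = {logit_p:.3f}, combined p = {combined_p:.3f}")
```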

Biostatistics ◽  
2018 ◽  
Vol 20 (4) ◽  
pp. 615-631 ◽  
Author(s):  
Ekaterina Smirnova ◽  
Snehalata Huzurbazar ◽  
Farhad Jafari

Summary The human microbiota composition is associated with a number of diseases including obesity, inflammatory bowel disease, and bacterial vaginosis. Thus, microbiome research has the potential to reshape clinical and therapeutic approaches. However, raw microbiome count data require careful pre-processing steps that take into account both the sparsity of counts and the large number of taxa being measured. Filtering is defined as removing taxa that are present in a small number of samples and have small counts in the samples where they are observed. Despite progress in the number and quality of filtering approaches, there is no consensus on filtering standards and quality assessment. This can adversely affect downstream analyses and the reproducibility of results across platforms and software. We introduce PERFect, a novel permutation filtering approach designed to address two unsolved problems in microbiome data processing: (i) defining and quantifying the loss incurred by threshold-based filtering and (ii) introducing and evaluating a permutation test for filtering loss that provides a measure of excessive filtering. Methods are assessed on three “mock experiment” data sets, where the true taxa compositions are known, and are applied to two publicly available real microbiome data sets. The method correctly removes contaminant taxa in the “mock” data sets and quantifies and visualizes the corresponding filtering loss, providing a uniform, data-driven filtering criterion for real microbiome data sets. In the real data analyses, PERFect tends to remove more taxa than existing approaches; this likely happens because the method is based on an explicit loss function, uses statistically principled testing, and takes into account correlation between taxa. The PERFect software is freely available at https://github.com/katiasmirn/PERFect.
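As a point of reference for what a filtering-loss criterion quantifies, the following Python sketch applies a naive prevalence filter and measures how much of the squared Frobenius norm of the taxa cross-product matrix X'X is discarded. The filter, the 10% threshold, and the loss proxy are illustrative assumptions only; they are not the PERFect definition of filtering loss or its permutation test.

```python
# Naive taxon filtering with a simple loss proxy (not the PERFect algorithm).
import numpy as np

rng = np.random.default_rng(1)
# X is a samples-by-taxa count matrix; sparse-ish simulated counts for illustration.
X = rng.poisson(lam=rng.gamma(0.3, 5.0, size=50), size=(100, 50)).astype(float)

prevalence = (X > 0).mean(axis=0)
keep = prevalence >= 0.10            # keep taxa present in at least 10% of samples

# Loss proxy: share of the squared Frobenius norm of X'X carried by the removed taxa.
full = np.linalg.norm(X.T @ X, "fro") ** 2
reduced = np.linalg.norm(X[:, keep].T @ X[:, keep], "fro") ** 2
loss = 1.0 - reduced / full
print(f"kept {keep.sum()} of {X.shape[1]} taxa, loss proxy = {loss:.4f}")
```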


2019 ◽  
Vol 34 (2) ◽  
pp. 67-85
Author(s):  
S. Nair Rohini ◽  
E. I. Abdul Sathar

Abstract Recently, G. Rajesh, E. I. Abdul-Sathar and S. Nair Rohini [G. Rajesh, E. I. Abdul-Sathar and S. Nair Rohini, On dynamic weighted survival entropy of order α, Comm. Statist. Theory Methods 46 2017, 5, 2139–2150] proposed a measure of uncertainty based on the survival function, called the weighted survival entropy of order α. They also introduced a dynamic form of the measure, called the dynamic weighted survival entropy of order α, and studied its properties in the context of reliability modeling. In this paper, we extend these measures to the bivariate setup and study their properties. We also consider the problem of extending the same measure to conditionally specified models. Empirical and non-parametric estimators are suggested for the proposed measure under the conditionally specified model, and the performance of the proposed estimators is illustrated using simulated and real data sets.
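For readers unfamiliar with these measures, the display below sketches the general form of a weighted survival entropy of order α and its dynamic (residual) version. The exact normalization and notation in the cited 2017 paper may differ, so this should be read as an illustrative form rather than the authors' definition.

```latex
% Illustrative forms only; notation and normalization may differ from the cited paper.
\[
  \mathcal{E}^{w}_{\alpha}(X) \;=\; \frac{1}{1-\alpha}\,
  \log\!\int_{0}^{\infty} x\,\bar{F}^{\alpha}(x)\,dx ,
  \qquad \alpha>0,\ \alpha\neq 1,
\]
\[
  \mathcal{E}^{w}_{\alpha}(X;t) \;=\; \frac{1}{1-\alpha}\,
  \log\!\int_{t}^{\infty} x\left[\frac{\bar{F}(x)}{\bar{F}(t)}\right]^{\alpha} dx ,
\]
% where \bar{F} is the survival function; the bivariate extension studied in the
% paper replaces \bar{F} by a joint survival function.
```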


2021 ◽  
Author(s):  
Wodan Ling ◽  
Ni Zhao ◽  
Anju Lulla ◽  
Anna M. Plantinga ◽  
Weijia Fu ◽  
...  

Batch effects in microbiome data arise from differential processing of specimens and can lead to spurious findings and obscure true signals. Most existing strategies for mitigating batch effects rely on approaches designed for genomic analysis and do not address the zero inflation and over-dispersion characteristic of microbiome data. Strategies tailored for microbiome data are restricted to association testing and do not support other analytic goals such as visualization. We develop the Conditional Quantile Regression (ConQuR) approach to remove microbiome batch effects using a two-part quantile regression model. It is a fundamental advance in the field: it is the first comprehensive method that accommodates the complex distributions of microbial read counts, and it generates batch-removed, zero-inflated read counts that can be used in, and benefit, all usual downstream analyses. We apply ConQuR to real microbiome data sets and demonstrate its state-of-the-art performance in removing batch effects while preserving or even amplifying the signals of interest.
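The core idea of quantile-based batch correction can be illustrated in a few lines: estimate each observation's conditional quantile level given batch and covariates, then read off the value at that same quantile level with the batch reset to a reference. The Python sketch below does this for one taxon without zero inflation, on simulated data; it is a conceptual illustration only, not the ConQuR implementation, which uses a two-part model to handle the zero inflation described above.

```python
# Conceptual quantile-regression batch correction for a single taxon (illustration only).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
batch = rng.binomial(1, 0.5, n)                    # two processing batches
covar = rng.normal(size=n)                         # a clinical covariate to preserve
y = rng.poisson(np.exp(1.0 + 0.8 * batch + 0.3 * covar)).astype(float)

X = sm.add_constant(np.column_stack([batch, covar]))
X_ref = X.copy()
X_ref[:, 1] = 0                                    # counterfactual: reference batch

taus = np.linspace(0.05, 0.95, 19)
fits = [sm.QuantReg(y, X).fit(q=t) for t in taus]
fitted = np.column_stack([f.predict(X) for f in fits])
fitted_ref = np.column_stack([f.predict(X_ref) for f in fits])
fitted = np.maximum.accumulate(fitted, axis=1)     # crude fix for quantile crossing
fitted_ref = np.maximum.accumulate(fitted_ref, axis=1)

# Locate each observation's conditional quantile level, then read off the value
# at the same level under the reference batch.
ge = fitted >= y[:, None]
idx = np.where(ge.any(axis=1), ge.argmax(axis=1), len(taus) - 1)
y_corrected = np.clip(np.rint(fitted_ref[np.arange(n), idx]), 0, None)

print("batch means before:", y[batch == 0].mean(), y[batch == 1].mean())
print("batch means after: ", y_corrected[batch == 0].mean(), y_corrected[batch == 1].mean())
```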


2021 ◽  
Author(s):  
Jakob Raymaekers ◽  
Peter J. Rousseeuw

Abstract Many real data sets contain numerical features (variables) whose distribution is far from normal (Gaussian). Instead, their distribution is often skewed. In order to handle such data it is customary to preprocess the variables to make them more normal. The Box–Cox and Yeo–Johnson transformations are well-known tools for this. However, the standard maximum likelihood estimator of their transformation parameter is highly sensitive to outliers, and will often try to move outliers inward at the expense of the normality of the central part of the data. We propose a modification of these transformations as well as an estimator of the transformation parameter that is robust to outliers, so the transformed data can be approximately normal in the center and a few outliers may deviate from it. It compares favorably to existing techniques in an extensive simulation study and on real data.
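The contrast between a standard and an outlier-resistant fitting criterion can be sketched quickly. Below, the Yeo–Johnson parameter is estimated once by ordinary maximum likelihood (via scipy) and once by a crude trimmed normality score on the transformed data. The trimmed criterion, the grid, and the robust standardization are illustrative assumptions and not the estimator proposed in the paper.

```python
# Standard ML vs. a crude trimmed-score estimate of the Yeo-Johnson parameter.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = np.concatenate([rng.lognormal(0.0, 0.5, 200), [50.0, 60.0, 80.0]])  # skewed data plus outliers

_, lam_ml = stats.yeojohnson(x)          # standard maximum likelihood (outlier-sensitive)

def trimmed_score(lam, data, trim=0.10):
    # Normality score of the transformed data; the Jacobian term is omitted,
    # so this is a rough surrogate, not a full likelihood.
    z = stats.yeojohnson(data, lmbda=lam)
    z = (z - np.median(z)) / (stats.iqr(z) / 1.349)     # robust standardization
    logdens = stats.norm.logpdf(z)
    k = int(len(logdens) * (1.0 - trim))
    return np.sort(logdens)[-k:].sum()                   # keep the best-fitting 90%

grid = np.linspace(-2.0, 2.0, 161)
lam_rob = grid[np.argmax([trimmed_score(l, x) for l in grid])]
print(f"ML lambda = {lam_ml:.2f}, trimmed-score lambda = {lam_rob:.2f}")
```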


Entropy ◽  
2020 ◽  
Vol 23 (1) ◽  
pp. 62
Author(s):  
Zhengwei Liu ◽  
Fukang Zhu

Thinning operators play an important role in the analysis of integer-valued autoregressive models, and the most widely used is binomial thinning. Inspired by the theory of extended Pascal triangles, a new thinning operator named extended binomial thinning is introduced, which includes binomial thinning as a special case. Compared to the binomial thinning operator, the extended binomial thinning operator has two parameters and is more flexible in modeling. Based on the proposed operator, a new integer-valued autoregressive model is introduced, which can accurately and flexibly capture the dispersion features of count time series. Two-step conditional least squares (CLS) estimation is investigated for the innovation-free case, and conditional maximum likelihood estimation is also discussed. The asymptotic properties of the two-step CLS estimator are also obtained. Finally, three overdispersed or underdispersed real data sets are considered to illustrate the superior performance of the proposed model.
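For orientation, the sketch below simulates the classical INAR(1) model driven by binomial thinning, the operator that the proposed extended-binomial operator generalizes; the two-parameter extended operator itself is not reproduced here, and the Poisson innovations are an illustrative choice.

```python
# Classical binomial-thinning INAR(1): X_t = alpha o X_{t-1} + eps_t,
# where alpha o X is Binomial(X, alpha) and eps_t are Poisson innovations.
import numpy as np

rng = np.random.default_rng(4)

def simulate_inar1(n, alpha=0.5, lam=2.0, x0=0):
    x = np.empty(n, dtype=int)
    prev = x0
    for t in range(n):
        survivors = rng.binomial(prev, alpha)     # binomial thinning of X_{t-1}
        prev = survivors + rng.poisson(lam)       # plus new innovations
        x[t] = prev
    return x

x = simulate_inar1(500)
print("sample mean:", x.mean(), " stationary mean lam/(1-alpha):", 2.0 / (1 - 0.5))
print("sample variance:", x.var())
```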


Econometrics ◽  
2021 ◽  
Vol 9 (1) ◽  
pp. 10
Author(s):  
Šárka Hudecová ◽  
Marie Hušková ◽  
Simos G. Meintanis

This article considers goodness-of-fit tests for bivariate INAR and bivariate Poisson autoregression models. The test statistics are based on an L2-type distance between two estimators of the probability generating function of the observations: one entirely nonparametric and the other semiparametric, computed under the corresponding null hypothesis. The asymptotic distribution of the proposed test statistics is derived both under the null hypotheses and under alternatives, and consistency is proved. The case of testing bivariate generalized Poisson autoregression and the extension of the methods to dimensions higher than two are also discussed. The finite-sample performance of a parametric bootstrap version of the tests is illustrated via a series of Monte Carlo experiments. The article concludes with applications to real data sets and discussion.
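The structure of an L2-type PGF test can be shown in a simplified univariate setting: compare the empirical probability generating function with the PGF implied by a fitted null model, integrate the squared difference, and calibrate by parametric bootstrap. The plain Poisson null, the integration grid, and the bootstrap size in the Python sketch below are illustrative assumptions; the paper's bivariate INAR and Poisson-autoregression statistics are more involved.

```python
# Univariate L2-type PGF goodness-of-fit sketch with a parametric bootstrap.
import numpy as np

rng = np.random.default_rng(5)

def pgf_stat(x, grid=np.linspace(0, 1, 101)):
    emp = np.mean(grid[:, None] ** x[None, :], axis=1)     # empirical PGF on the grid
    fit = np.exp(x.mean() * (grid - 1.0))                   # Poisson PGF with lambda = sample mean
    return len(x) * np.trapz((emp - fit) ** 2, grid)        # n * integrated squared difference

x = rng.poisson(3.0, 200)                                   # data generated under the null
stat = pgf_stat(x)
boot = np.array([pgf_stat(rng.poisson(x.mean(), len(x))) for _ in range(500)])
print("statistic:", round(stat, 4), " bootstrap p-value:", np.mean(boot >= stat))
```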


Information ◽  
2021 ◽  
Vol 12 (5) ◽  
pp. 202
Author(s):  
Louai Alarabi ◽  
Saleh Basalamah ◽  
Abdeltawab Hendawi ◽  
Mohammed Abdalla

The rapid spread of infectious diseases is a major public health problem. Recent developments in fighting these diseases have heightened the need for a contact tracing process. Contact tracing can be considered an ideal method for controlling the transmission of infectious diseases. Contact tracing leads to diagnostic testing, treatment or self-isolation for suspected cases, and treatment for infected persons, which ultimately limits the spread of disease. This paper proposes a technique named TraceAll that traces all contacts exposed to an infected patient and produces a list of these contacts to be considered potentially infected patients. Initially, it treats the infected patient as the querying user and starts to fetch the contacts exposed to him or her. Secondly, it obtains all trajectories belonging to objects that moved near the querying user. Next, it examines these trajectories, taking into account social distance and exposure period, to determine whether these objects may have become infected. Experimental evaluation of the proposed technique on real data sets illustrates the effectiveness of this solution. Comparative experiments confirm that TraceAll outperforms baseline methods by 40% in the efficiency of answering contact tracing queries.
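The basic contact-detection step (flagging objects that stayed within a social-distance threshold of the querying user for at least a minimum exposure period) can be sketched as follows. The column names, thresholds, and toy trajectories are illustrative assumptions, not TraceAll's data model or indexing strategy.

```python
# Toy contact-tracing query over timestamped trajectories (illustration only).
import pandas as pd

DIST_THRESHOLD = 2.0      # metres
MIN_EXPOSURE   = 3        # number of consecutive time steps

traj = pd.DataFrame({
    "obj": ["q", "q", "q", "q", "a", "a", "a", "a", "b", "b", "b", "b"],
    "t":   [0, 1, 2, 3] * 3,
    "x":   [0, 1, 2, 3,  0.5, 1.2, 2.1, 3.2,  10, 11, 12, 13],
    "y":   [0, 0, 0, 0,  0.3, 0.4, 0.2, 0.1,   0,  0,  0,  0],
})

query = traj[traj.obj == "q"].set_index("t")[["x", "y"]]     # the infected (querying) user
others = traj[traj.obj != "q"]

contacts = []
for obj, g in others.groupby("obj"):
    g = g.set_index("t")
    d = ((g[["x", "y"]] - query) ** 2).sum(axis=1) ** 0.5    # distance at shared timestamps
    close = (d <= DIST_THRESHOLD).astype(int)
    # longest run of consecutive "close" time steps
    run = (close.groupby((close != close.shift()).cumsum()).cumsum() * close).max()
    if run >= MIN_EXPOSURE:
        contacts.append(obj)

print("potentially exposed:", contacts)
```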


Symmetry ◽  
2021 ◽  
Vol 13 (3) ◽  
pp. 474
Author(s):  
Abdulhakim A. Al-Babtain ◽  
Ibrahim Elbatal ◽  
Hazem Al-Mofleh ◽  
Ahmed M. Gemeay ◽  
Ahmed Z. Afify ◽  
...  

In this paper, we introduce a new flexible generator of continuous distributions called the transmuted Burr X-G (TBX-G) family to extend and increase the flexibility of the Burr X generator. The general statistical properties of the TBX-G family are derived. One special sub-model, the TBX-exponential distribution, is studied in detail. We discuss eight approaches to estimating the TBX-exponential parameters, and numerical simulations are conducted to compare the suggested approaches based on partial and overall ranks. Based on our study, the Anderson–Darling estimators are recommended for estimating the TBX-exponential parameters. Using two skewed real data sets from the engineering sciences, we illustrate the importance and flexibility of the TBX-exponential model compared with other existing competing distributions.
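To illustrate the class of estimators recommended here, the Python sketch below computes a minimum Anderson–Darling distance estimate for a plain exponential model; plugging in the TBX-exponential CDF from the paper (not reproduced here) would follow the same pattern. The simulated data and the optimization bounds are illustrative assumptions.

```python
# Minimum Anderson-Darling distance estimation for an exponential model (sketch).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(6)
x = np.sort(rng.exponential(scale=2.0, size=150))   # true rate = 0.5
n = len(x)
i = np.arange(1, n + 1)

def ad_distance(rate):
    # Anderson-Darling statistic of the fitted exponential CDF against the sample.
    F = 1.0 - np.exp(-rate * x)
    F = np.clip(F, 1e-12, 1 - 1e-12)
    return -n - np.mean((2 * i - 1) * (np.log(F) + np.log(1 - F[::-1])))

res = minimize_scalar(ad_distance, bounds=(1e-3, 10.0), method="bounded")
print("AD estimate of the rate:", round(res.x, 3), " (true rate 0.5)")
```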


Stats ◽  
2021 ◽  
Vol 4 (1) ◽  
pp. 28-45
Author(s):  
Vasili B.V. Nagarjuna ◽  
R. Vishnu Vardhan ◽  
Christophe Chesneau

In this paper, a new five-parameter distribution is proposed using the functionalities of the Kumaraswamy generalized family of distributions and the features of the power Lomax distribution. It is named the Kumaraswamy generalized power Lomax distribution. In a first approach, we derive its main probability and reliability functions, with a visualization of its modeling behavior under different parameter combinations. A prime quality of the distribution is that its hazard rate function is very flexible: it possesses decreasing, increasing and inverted (upside-down) bathtub shapes, and decreasing-increasing-decreasing shapes are also observed. Some important characteristics of the Kumaraswamy generalized power Lomax distribution are derived, including moments, entropy measures and order statistics. The second approach is statistical. The maximum likelihood estimates of the parameters are described, and a brief simulation study shows their effectiveness. Two real data sets are used to show how the proposed distribution can be applied in practice; parameter estimates are obtained and fitting comparisons are performed with other well-established Lomax-based distributions. The Kumaraswamy generalized power Lomax distribution turns out to fit best, capturing fine details in the structure of the data considered.
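For readers unfamiliar with the construction, the display below sketches how a Kumaraswamy generalized family is built from a baseline CDF and how a power Lomax baseline yields a five-parameter model. The parameterization shown is a common one and may differ from the paper's exact notation.

```latex
% Kumaraswamy-G generator applied to a baseline CDF G (common parameterization):
\[
  F_{\mathrm{KwG}}(x) \;=\; 1 - \bigl[\,1 - G(x)^{a}\,\bigr]^{b}, \qquad a, b > 0 .
\]
% Taking G to be a power Lomax CDF, for example
\[
  G(x) \;=\; 1 - \Bigl(1 + \tfrac{x^{\beta}}{\lambda}\Bigr)^{-\alpha},
  \qquad x > 0,\ \ \alpha, \beta, \lambda > 0,
\]
% gives a five-parameter Kumaraswamy generalized power Lomax distribution.
```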

