zero inflation

2021 ◽  
Vol 2 (1) ◽  
pp. 43-61
Author(s):  
Aanchal Malhotra ◽  
Samarendra Das ◽  
Shesh N. Rai

Single-cell RNA-sequencing (scRNA-seq) technology provides an excellent platform for measuring the expression profiles of genes in heterogeneous cell populations. Multiple tools for the analysis of scRNA-seq data have been developed over the years. These tools require complicated commands and steps that are not easy for genome researchers and experimental biologists to follow. Therefore, we describe a step-by-step workflow for processing and analyzing scRNA-seq unique molecular identifier (UMI) data from human lung adenocarcinoma cell lines. Using a real data example, we demonstrate the basic analyses, including quality check, mapping, and quantification of transcript abundance, to obtain UMI count data. Further, we performed basic statistical analyses, such as zero-inflation, differential expression, and clustering analyses, on the obtained count data. We studied the effects of the excess zeros present in scRNA-seq data on the downstream analyses. Our findings indicate that the zero-inflation associated with UMI data had no or minimal role in clustering, while it had a significant effect on identifying differentially expressed genes. We also provide insight into the comparative performance of differential expression analysis tools based on zero-inflated negative binomial and negative binomial models on scRNA-seq data. The sensitivity analysis reinforced our findings in that the negative binomial model-based tool did not provide an accurate and efficient way to analyze the scRNA-seq data. This study provides a set of guidelines that help users handle and analyze real scRNA-seq data more easily.
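A rough sense of what a zero-inflation check on count data can look like is sketched below. This is not the workflow or tooling used in the study: it is a simple heuristic that fits a negative binomial by the method of moments and compares the observed fraction of zeros with the fraction that fit would predict. The function name `zero_fraction_check` and the toy counts are illustrative assumptions.

```python
import math
import statistics


def zero_fraction_check(counts):
    """Return (observed_zero_fraction, expected_zero_fraction).

    The expected fraction comes from a negative binomial fitted by the
    method of moments (an assumption of this sketch). When the sample
    variance does not exceed the mean, a Poisson is used instead.
    """
    n = len(counts)
    observed = sum(1 for c in counts if c == 0) / n
    mean = statistics.mean(counts)
    if mean == 0:
        return observed, 1.0  # all counts are zero
    var = statistics.variance(counts) if n > 1 else 0.0
    if var <= mean:
        expected = math.exp(-mean)  # Poisson P(Y = 0)
    else:
        size = mean ** 2 / (var - mean)  # NB dispersion ("size") parameter
        expected = (size / (size + mean)) ** size  # NB P(Y = 0)
    return observed, expected


# Hypothetical counts for one gene across 100 cells.
toy_counts = [0] * 80 + [10] * 20
obs, exp = zero_fraction_check(toy_counts)
print(f"observed zeros: {obs:.2f}, expected under NB: {exp:.2f}")
```

An observed fraction well above the expected one suggests excess zeros, although a moment-based fit absorbs part of the excess into the dispersion estimate, so the comparison is conservative.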


Paleobiology ◽  
2021 ◽  
pp. 1-18
Author(s):  
Jansen A. Smith ◽  
John C. Handley ◽  
Gregory P. Dietl

The effects of overdispersion and zero inflation (e.g., poor model fits) can result in misinterpretation in studies using count data. These effects have not been evaluated in paleoecological studies of predation and are further complicated by preservational bias and time averaging. We develop a hierarchical Bayesian framework to account for uncertainty from overdispersion and zero inflation in estimates of specimen and predation trace counts. We demonstrate its application using published data on drilling predators and their prey in time-averaged death assemblages from the Great Barrier Reef, Australia. Our results indicate that predation frequencies are underestimated when zero inflation is not considered, and this effect is likely compounded by removal of individuals and predation traces via preservational bias. Time averaging likely reduces zero inflation via accumulation of rare taxa and events; however, it increases the uncertainty in comparisons between assemblages by introducing variability in sampling effort. That is, time-averaged count data carry an analytical cost, which manifests as broader confidence regions. Ecological inferences in paleoecology can be strengthened by accounting for the uncertainty inherent to paleoecological count data and the sampling processes by which they are generated.


Microbiome ◽  
2021 ◽  
Vol 9 (1) ◽  
Author(s):  
Wodan Ling ◽  
Ni Zhao ◽  
Anna M. Plantinga ◽  
Lenore J. Launer ◽  
Anthony A. Fodor ◽  
...  

Background. Identification of bacterial taxa associated with diseases, exposures, and other variables of interest offers a more comprehensive understanding of the role of microbes in many conditions. However, despite considerable research in statistical methods for association testing with microbiome data, approaches that are generally applicable remain elusive. Classical tests often do not accommodate the realities of microbiome data, leading to power loss. Approaches tailored for microbiome data depend highly upon the normalization strategies used to handle differential read depth and other data characteristics, and they often have unacceptably high false positive rates, generally due to unsatisfied distributional assumptions. On the other hand, many non-parametric tests suffer from loss of power and may also present difficulties in adjusting for potential covariates. Most extant approaches also fail in the presence of heterogeneous effects. The field needs new non-parametric approaches that are tailored to microbiome data, robust to distributional assumptions, and powerful under heterogeneous effects, while permitting adjustment for covariates. Methods. As an alternative to existing approaches, we propose a zero-inflated quantile approach (ZINQ), which uses a two-part quantile regression model to accommodate the zero inflation in microbiome data. For a given taxon, ZINQ consists of a valid test in logistic regression to model the zero counts, followed by a series of quantile rank-score based tests on multiple quantiles of the non-zero part with adjustment for the zero inflation. As a regression and quantile-based approach, the method is non-parametric and robust to irregular distributions, while providing an allowance for covariate adjustment. Since no distributional assumptions are made, ZINQ can be applied to data that has been processed under any normalization strategy. Results. Thorough simulations based on real data across a range of scenarios and application to real data sets show that ZINQ often has equivalent or higher power compared to existing tests even as it offers better control of false positives. Conclusions. We present ZINQ, a quantile-based association test between microbiota and dichotomous or quantitative clinical variables, providing a powerful and robust alternative for the current microbiome differential abundance analysis.


2021 ◽  
pp. 001316442110289
Author(s):  
Sooyong Lee ◽  
Suhwa Han ◽  
Seung W. Choi

Response data containing an excessive number of zeros are referred to as zero-inflated data. When differential item functioning (DIF) detection is of interest, zero-inflation can attenuate DIF effects in the total sample and lead to underdetection of DIF items. The current study presents a DIF detection procedure for response data with excess zeros due to the existence of unobserved heterogeneous subgroups. The suggested procedure utilizes the factor mixture modeling (FMM) with MIMIC (multiple-indicator multiple-cause) to address the compromised DIF detection power via the estimation of latent classes. A Monte Carlo simulation was conducted to evaluate the suggested procedure in comparison to the well-known likelihood ratio (LR) DIF test. Our simulation study results indicated the superiority of FMM over the LR DIF test in terms of detection power and illustrated the importance of accounting for latent heterogeneity in zero-inflated data. The empirical data analysis results further supported the use of FMM by flagging additional DIF items over and above the LR test.


2021 ◽  
Author(s):  
Thomas Thorne

Single cell RNA-seq data exhibit large numbers of zero count values, which we demonstrate can, for a subset of transcripts, be better modelled by a zero inflated negative binomial distribution. We develop a novel Dirichlet process mixture model which employs both a mixture at the cell level to model multiple cell types, and a mixture of single cell RNA-seq counts at the transcript level to model the transcript specific zero-inflation of counts. It is shown that this approach outperforms previous approaches that applied multinomial distributions to model single cell RNA-seq counts, and also performs better than or comparably to existing top performing methods. By taking a Bayesian approach we are able to build interpretable models of expression within clusters, and to quantify uncertainty in cluster assignments. Applied to a publicly available data set of single cell RNA-seq counts of multiple cell types from the mouse cortex and hippocampus, we demonstrate how our approach can be used to distinguish sub-populations of cells as clusters in the data, and to identify gene sets that are indicative of membership of a sub-population.
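For readers unfamiliar with the zero inflated negative binomial distribution mentioned above, the following minimal sketch shows its probability mass function in the common mean/dispersion parameterization. It illustrates only the distribution, not the Dirichlet process mixture model of the paper; the function names and the parameter values in the usage lines are assumptions chosen for illustration.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial pmf with mean mu > 0 and dispersion size > 0."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def zinb_pmf(k, mu, size, pi):
    """Zero-inflated NB: with probability pi the count is a structural zero,
    otherwise it is drawn from NB(mu, size)."""
    base = (1.0 - pi) * nb_pmf(k, mu, size)
    return pi + base if k == 0 else base


# Illustrative (hypothetical) parameters: mean 2, dispersion 1.5, 30% extra zeros.
p0_nb = nb_pmf(0, mu=2.0, size=1.5)
p0_zinb = zinb_pmf(0, mu=2.0, size=1.5, pi=0.3)
print(f"P(Y=0) under NB: {p0_nb:.3f}, under ZINB: {p0_zinb:.3f}")
```

Setting `pi` to 0 recovers the plain negative binomial, which is why the two models can be compared transcript by transcript.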


2021 ◽  
Author(s):  
Ziyue Wu ◽  
Seth A. Berkowitz ◽  
Patrick J. Heagerty ◽  
David Benkeser

Objective. To improve the estimation of healthcare expenditures by introducing a novel method that is well-suited to situations where data exhibit strong skewness and zero-inflation. Data Sources. Simulations, and two sources of real-world data: the 2016-2017 Medical Expenditure Panel Survey (MEPS) and the Back Pain Outcomes using Longitudinal Data (BOLD) datasets. Study Design. The super learner is an ensemble machine learning approach that can combine several algorithms to improve estimation. We propose a two-stage super learner that is well suited for use with healthcare expenditure data by separately estimating the probability of any healthcare expenditure and the mean amount of healthcare expenditure conditional on having healthcare expenditures. These estimates can be combined to yield a single estimate of expenditures for each observation. The method can flexibly incorporate a range of individual estimation approaches for each stage of estimation, including both regression-based approaches and machine learning algorithms such as random forests. We compare the performance of the two-stage super learner with a one-stage super learner, and with multiple individual algorithms for estimation of healthcare cost under a broad range of data settings in simulated and real data. The predictive performance was compared using Mean Squared Error and R². Data collection/Extraction methods. MEPS data include only adults and exclude observations with missingness, BOLD data include observations without missingness. Principal Findings. Our results indicate that the two-stage super learner has a better performance compared with a one-stage super learner and individual algorithms, for healthcare cost estimation under a wide variety of settings in simulations and empirical analyses. The improvement of the two-stage super learner over the one-stage super learner was particularly evident in settings when zero-inflation is high. Conclusions. The two-stage super learner provides researchers an effective approach for healthcare cost analyses in environments where they cannot know the best single algorithm a priori. Keywords. Semicontinuous data, two-part models, zero-inflation, super learning, healthcare expenditure.
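The two-stage idea described above (estimate the probability of any expenditure, estimate the mean expenditure among those with any, then multiply) can be sketched with the simplest possible learners. The code below is not the authors' super learner: it substitutes per-group sample proportions and means for the ensemble at each stage, and the function names, group labels, and cost values are hypothetical.

```python
from collections import defaultdict


def fit_two_part(groups, costs):
    """Fit a toy two-part model: for each group, estimate
    P(cost > 0) and E[cost | cost > 0] from sample frequencies."""
    n = defaultdict(int)
    n_pos = defaultdict(int)
    sum_pos = defaultdict(float)
    for g, y in zip(groups, costs):
        n[g] += 1
        if y > 0:
            n_pos[g] += 1
            sum_pos[g] += y
    model = {}
    for g in n:
        p_any = n_pos[g] / n[g]  # stage 1: probability of any cost
        mean_pos = sum_pos[g] / n_pos[g] if n_pos[g] else 0.0  # stage 2
        model[g] = (p_any, mean_pos)
    return model


def predict_two_part(model, group):
    """Combine both stages: E[cost] = P(cost > 0) * E[cost | cost > 0]."""
    p_any, mean_pos = model[group]
    return p_any * mean_pos


# Hypothetical data: two patient groups with many zero-cost observations.
groups = ["a", "a", "a", "a", "b", "b"]
costs = [0, 0, 100, 300, 0, 50]
model = fit_two_part(groups, costs)
print(predict_two_part(model, "a"), predict_two_part(model, "b"))
```

In a fuller implementation, each stage would be replaced by a fitted learner (for example a classifier for stage 1 and a regression on the positive costs for stage 2), with the final combination step unchanged.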

