structural zeros
Recently Published Documents


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
David Moriña ◽  
Pedro Puig ◽  
Albert Navarro

Abstract Background Zero-inflated models generally aim to address the problem that arises when two different sources generate the zero values observed in a distribution. In practice, this happens because the population studied actually consists of two subpopulations: one in which the value zero occurs by default (structural zero) and another in which it is circumstantial (sample zero). Methods This work proposes a new methodology to fit zero-inflated Bernoulli data from a Bayesian approach, able to distinguish between two potential sources of zeros (structural and non-structural). Results The performance of the proposed methodology has been evaluated through a comprehensive simulation study, and it has been compiled as an R package freely available to the community. Its usage is illustrated by means of a real example from the field of occupational health, the phenomenon of sickness presenteeism, in which it is reasonable to think that some individuals will never be at risk of suffering it because they have not been sick in the period of study (structural zeros). Without separating structural and non-structural zeros, one would be jointly studying the general health status and presenteeism itself, and would therefore obtain potentially biased estimates, since the phenomenon is implicitly underestimated by diluting it into the general health status. Conclusions The proposed methodology is able to distinguish two different sources of zeros (structural and non-structural) in dichotomous data, with or without covariates, in a Bayesian framework, and has been made available to any interested researcher in the form of the bayesZIB R package (https://cran.r-project.org/package=bayesZIB).
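
The dilution effect described in this abstract can be illustrated with a small simulation. The following is a minimal Python sketch of a zero-inflated Bernoulli mixture without covariates; it is not the bayesZIB implementation, and the names `omega` (probability of being at risk), `p` (probability of a 1 among those at risk), `simulate_zib`, and `prob_structural_given_zero` are assumptions introduced here for illustration only.

```python
import random


def simulate_zib(n, omega, p, seed=0):
    """Draw n zero-inflated Bernoulli observations (illustrative only).

    omega: probability that a unit is at risk (hypothetical parameter name);
           units not at risk always yield a structural zero.
    p:     probability of observing a 1 among at-risk units.
    """
    rng = random.Random(seed)
    ys = []
    for _ in range(n):
        at_risk = rng.random() < omega
        # A 1 is only possible for at-risk units; everything else is a zero.
        ys.append(1 if (at_risk and rng.random() < p) else 0)
    return ys


def prob_structural_given_zero(omega, p):
    """P(structural | y = 0) by Bayes' rule under the mixture above."""
    structural = 1.0 - omega          # never at risk
    sampling = omega * (1.0 - p)      # at risk, but a zero was observed
    return structural / (structural + sampling)


if __name__ == "__main__":
    ys = simulate_zib(20000, omega=0.6, p=0.5)
    naive = sum(ys) / len(ys)
    # The naive proportion targets omega * p, not p, so it understates p.
    print("naive proportion of ones:", naive)
    print("P(structural | zero):", prob_structural_given_zero(0.6, 0.5))
```

Under these assumed values the naive proportion of ones estimates `omega * p` rather than `p`, which is the underestimation the abstract refers to when structural and non-structural zeros are not separated.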


2021 ◽  
Author(s):  
Shili Lin ◽  
Qing Xie

Motivation: Single-cell Hi-C techniques make it possible to study cell-to-cell variability in genomic features. However, excess zeros are commonly seen in single-cell Hi-C (scHi-C) data, making scHi-C matrices extremely sparse and bringing extra difficulties to downstream analysis. The observed zeros are a combination of two events: structural zeros, for which the loci never interact due to underlying biological mechanisms, and dropouts or sampling zeros, where the two loci interact but are not captured due to insufficient sequencing depth. Although quality improvement approaches have been proposed as an intermediate step for analyzing scHi-C data, little has been done to address these two types of zeros. We believe that differentiating between structural zeros and dropouts would benefit downstream analysis such as clustering. Results: We propose scHiCSRS, a self-representation smoothing method that improves the data quality, and a Gaussian mixture model that identifies structural zeros among observed zeros. scHiCSRS not only takes spatial dependencies of a scHi-C 2D data structure into account but also borrows information from similar single cells. Through an extensive set of simulation studies, we demonstrate the ability of scHiCSRS to identify structural zeros with high sensitivity and to accurately impute dropout values in sampling zeros. Downstream analysis of three real datasets shows that data improved by scHiCSRS yield more accurate clustering of cells than simply using observed data or data improved by several comparison methods.


2021 ◽  
Author(s):  
Qing Xie ◽  
Chenggong Han ◽  
Victor Jin ◽  
Shili Lin

Single cell Hi-C techniques enable one to study cell to cell variability in chromatin interactions. However, single cell Hi-C (scHi-C) data suffer severely from sparsity, that is, the existence of excess zeros due to insufficient sequencing depth. Complicating things further is the fact that not all zeros are created equal: some are due to loci truly not interacting because of the underlying biological mechanism (structural zeros), whereas others are indeed due to insufficient sequencing depth (sampling zeros), especially for loci that interact infrequently. Differentiating between structural zeros and sampling zeros is important since correct inference would improve downstream analyses such as clustering and discovery of subtypes. Nevertheless, distinguishing between these two types of zeros has received little attention in the single cell Hi-C literature, where the issue of sparsity has been addressed mainly as a data quality improvement problem. To fill this gap, in this paper, we propose HiCImpute, a Bayesian hierarchy model that goes beyond data quality improvement by also identifying observed zeros that are in fact structural zeros. HiCImpute takes spatial dependencies of the scHi-C 2D data structure into account while also borrowing information from similar single cells and bulk data, when such are available. Through an extensive set of analyses of synthetic and real data, we demonstrate the ability of HiCImpute to identify structural zeros with high sensitivity and to accurately impute dropout values in sampling zeros. Downstream analyses using data improved by HiCImpute yielded much more accurate clustering of cell types compared to using observed data or data improved by several comparison methods. Most significantly, HiCImpute-improved data has led to the identification of subtypes within each of the excitatory neuronal cells of L4 and L5 in the prefrontal cortex.
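
Why sampling zeros are hardest to separate from structural zeros for infrequently interacting loci can be seen with a toy calculation. The sketch below is not HiCImpute; it swaps in a plain zero-inflated Poisson model, and the names `pi` (prior probability of a structural zero), `lam` (expected count when the loci do interact), and `posterior_structural` are hypothetical and introduced here only for illustration.

```python
import math


def posterior_structural(pi, lam):
    """P(structural zero | observed count = 0) under a zero-inflated Poisson.

    pi:  assumed prior probability that the pair never interacts.
    lam: assumed Poisson mean of the count when the pair does interact.
    """
    # A sampling zero needs the pair to interact AND the Poisson draw to be 0.
    sampling_zero = (1.0 - pi) * math.exp(-lam)
    return pi / (pi + sampling_zero)


if __name__ == "__main__":
    # Compare weakly and strongly interacting pairs at the same prior.
    for lam in (0.1, 1.0, 5.0):
        print(lam, posterior_structural(0.3, lam))
```

When `lam` is small, an observed zero carries little information and the posterior stays close to the prior `pi`; when `lam` is large, a zero is unlikely to be a dropout and the posterior moves toward 1. This is one reason methods in this area borrow information from neighbouring positions and similar cells.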


2021 ◽  
Vol 12 (3) ◽  
pp. 439-460
Author(s):  
Hua He ◽  
Paul Crits-Christoph ◽  
Robert Gallop ◽  
Wan Tang ◽  
Ding-Geng (Din) Chen ◽  
...  

2021 ◽  
Vol 11 ◽  
Author(s):  
Rebecca A. Deek ◽  
Hongzhe Li

The human microbiome consists of a community of microbes in varying abundances and has been shown to be associated with many diseases. An important first step in many microbiome studies is to identify possible distinct microbial communities in a given data set and to identify the important bacterial taxa that characterize these communities. The data from typical microbiome studies are high-dimensional count data with excessive zeros due to both absence of species (structural zeros) and low sequencing depth or dropout (sampling zeros). Although methods have been developed for identifying the microbial communities based on mixture models of counts, these methods do not account for the excessive zeros observed in the data and do not differentiate structural from sampling zeros. In this paper, we introduce a zero-inflated Latent Dirichlet Allocation model (zinLDA) for sparse count data observed in microbiome studies. zinLDA builds on the flexible Latent Dirichlet Allocation model and allows for zero inflation in observed counts. We develop an efficient Markov chain Monte Carlo (MCMC) sampling procedure to fit the model. Results from our simulations show that zinLDA provides better fits to the data and is able to separate structural zeros from sampling zeros. We apply zinLDA to the data set from the American Gut Project and identify microbial communities characterized by different bacterial genera.


2020 ◽  
Vol 36 (4) ◽  
pp. 803-825
Author(s):  
Marco Fortini

Abstract Record linkage addresses the problem of identifying pairs of records that come from different sources and refer to the same unit of interest. Fellegi and Sunter propose an optimal statistical test to assign the match status to the candidate pairs, in which the needed parameters are obtained through the EM algorithm applied directly to the set of candidate pairs, without recourse to training data. However, this procedure has quadratic complexity as the two lists to be matched grow. In addition, a large bias in the EM-estimated parameters is also produced in this case, so the problem is tackled by reducing the set of candidate pairs through filtering methods such as blocking. Unfortunately, the probability that excluded pairs are actually true matches cannot be assessed through such methods. The present work proposes an efficient approach in which the comparisons of records between lists are minimised, while the EM estimates are modified by modelling tables with structural zeros in order to obtain unbiased estimates of the parameters. The improvement achieved by the suggested method is shown by means of simulations and an application based on real data.
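
As background for the test mentioned in this abstract, the following minimal Python sketch computes a Fellegi-Sunter style match weight for one candidate pair under the usual conditional-independence assumption. It does not implement the paper's structural-zero correction or the EM step; the function name `match_weight` and the example `m` and `u` values are assumptions chosen for illustration, standing in for parameters that would normally be estimated.

```python
import math


def match_weight(agreements, m, u):
    """Log2 likelihood-ratio weight for one candidate record pair.

    agreements: list of booleans, one per compared field.
    m: per-field P(agree | pair is a true match)      (assumed known here).
    u: per-field P(agree | pair is a true non-match)  (assumed known here).
    """
    total = 0.0
    for agrees, m_k, u_k in zip(agreements, m, u):
        if agrees:
            total += math.log2(m_k / u_k)
        else:
            total += math.log2((1.0 - m_k) / (1.0 - u_k))
    return total


def n_candidate_pairs(n_a, n_b):
    """Without blocking, every record in one list is compared to every record in the other."""
    return n_a * n_b


if __name__ == "__main__":
    m = [0.9, 0.8]   # hypothetical values for two comparison fields
    u = [0.1, 0.2]
    print(match_weight([True, True], m, u))
    print(match_weight([False, False], m, u))
    print(n_candidate_pairs(10_000, 10_000))
```

Pairs with a high weight are declared matches and pairs with a low weight non-matches. The pair count grows with the product of the two list sizes, which is the quadratic cost that motivates blocking and the reduced-comparison approach described above.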


Author(s):  
Chenggong Han ◽  
Qing Xie ◽  
Shili Lin

Abstract The prevalence of dropout events is a serious problem for single-cell Hi-C (scHiC) data due to insufficient sequencing depth and data coverage, which brings difficulties in downstream studies such as clustering and structural analysis. Complicating things further is the fact that dropouts are confounded with structural zeros due to underlying properties, leading to observed zeros being a mixture of both types of events. Although a great deal of progress has been made in imputing dropout events for single cell RNA-sequencing (RNA-seq) data, little has been done in identifying structural zeros and imputing dropouts for scHiC data. In this paper, we adapted several methods from the single-cell RNA-seq literature for inference on observed zeros in scHiC data and evaluated their effectiveness. Through an extensive simulation study and real data analysis, we have shown that a couple of the adapted single-cell RNA-seq algorithms can be powerful for correctly identifying structural zeros and accurately imputing dropout values. Downstream analysis using the imputed values showed considerable improvement for clustering cells of the same types together over clustering results before imputation.


2019 ◽  
Vol 10 (7) ◽  
pp. 949-959 ◽  
Author(s):  
Anabel Blasco‐Moreno ◽  
Marta Pérez‐Casany ◽  
Pedro Puig ◽  
Maria Morante ◽  
Eva Castells

2018 ◽  
Vol 7 (4) ◽  
pp. 498-519 ◽  
Author(s):  
Olanrewaju Akande ◽  
Andrés Barrientos ◽  
Jerome P Reiter

Abstract Multivariate categorical data nested within households often include reported values that fail edit constraints—for example, a participating household reports a child’s age as older than his biological parent’s age—and have missing values. Generally, agencies prefer datasets to be free from erroneous or missing values before analyzing them or disseminating them to secondary data users. We present a model-based engine for editing and imputation of household data based on a Bayesian hierarchical model that includes (i) a nested data Dirichlet process mixture of products of multinomial distributions as the model for the true latent values of the data, truncated to allow only households that satisfy all edit constraints, (ii) a model for the location of errors, and (iii) a reporting model for the observed responses in error. The approach propagates uncertainty due to unknown locations of errors and missing values, generates plausible datasets that satisfy all edit constraints, and can preserve multivariate relationships within and across individuals in the same household. We illustrate the approach using data from the 2012 American Community Survey.

