Pearson Residuals
Recently Published Documents

Total documents: 35 (five years: 16)
H-index: 8 (five years: 3)

2021 ◽  
Author(s):  
Lauren L Hsu ◽  
Aedin C Culhane

Effective dimension reduction is an essential step in the analysis of single cell RNA-seq (scRNAseq) count data, which are high-dimensional, sparse, and noisy. Principal component analysis (PCA) is widely used in analytical pipelines, and since PCA requires continuous data, it is often coupled with log-transformation in scRNAseq applications. However, log-transformation of scRNAseq counts distorts the data and can obscure meaningful variation. We describe correspondence analysis (CA) for dimension reduction of scRNAseq data, which is a performant alternative to PCA. Designed for use with counts, CA is based on the decomposition of a chi-squared residual matrix and does not require log-transformation of scRNAseq counts. We extend beyond standard CA (decomposition of Pearson residuals computed on the contingency table) and propose variations of CA, including an alternative chi-squared statistic, that address overdispersion and high sparsity in scRNAseq data. The performance of five variations of CA and standard CA is benchmarked on 10 datasets and compared to glmPCA. The CA variations are fast and scalable, and they outperform standard CA and glmPCA, computing embeddings with better or comparable clustering accuracy in 8 out of 9 datasets. Of the variations we considered, CA using the Freeman-Tukey chi-squared residual was most performant overall on scRNAseq data. Our analyses also showed that variance-stabilizing transformations applied in conjunction with standard CA (using Pearson residuals), as well as power deflation smoothing, both improve performance in downstream clustering tasks compared to standard CA alone. CA has further advantages, including visual illustration of associations between genes and cell populations in a 'CA biplot' and easy extension to multi-table analysis, enabling integrative dimension reduction. We introduce corralm, a CA-based method for multi-table batch integration of scRNAseq data in a shared latent space, and we propose a new approach for assessing batch integration. We implement CA for scRNAseq in the corral R/Bioconductor package (https://www.bioconductor.org/packages/corral), which interfaces directly with widely used single cell classes in Bioconductor, allowing for easy integration into scRNAseq pipelines.
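For readers unfamiliar with the mechanics, the sketch below illustrates the standard CA step referenced above: form the Pearson (chi-squared) residual matrix of the counts and take its SVD to obtain row embeddings. The Freeman-Tukey residual shown is one common form, sqrt(O) + sqrt(O+1) - sqrt(4E+1); the exact variant implemented in corral may differ, and the toy matrix is illustrative only.

```python
import numpy as np

def ca_pearson_svd(counts, n_components=2):
    """Standard CA: SVD of the Pearson (chi-squared) residual matrix of a count table."""
    N = np.asarray(counts, dtype=float)     # assumes no all-zero rows or columns
    P = N / N.sum()                          # correspondence matrix
    r = P.sum(axis=1, keepdims=True)         # row masses
    c = P.sum(axis=0, keepdims=True)         # column masses
    E = r @ c                                # expected proportions under independence
    S = (P - E) / np.sqrt(E)                 # Pearson (standardized) residuals
    U, d, Vt = np.linalg.svd(S, full_matrices=False)
    return (U[:, :n_components] * d[:n_components]) / np.sqrt(r)  # principal row coordinates

def freeman_tukey_residuals(counts):
    """One common Freeman-Tukey residual; the variant used in corral may differ."""
    N = np.asarray(counts, dtype=float)
    E = N.sum(axis=1, keepdims=True) * N.sum(axis=0, keepdims=True) / N.sum()
    return np.sqrt(N) + np.sqrt(N + 1) - np.sqrt(4 * E + 1)

X = np.random.default_rng(1).poisson(2, size=(6, 5))   # toy cells x genes counts
print(ca_pearson_svd(X).round(3))
print(freeman_tukey_residuals(X).round(3))
```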


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Sandra García-Bustos ◽  
Nadia Cárdenas-Escobar ◽  
Ana Debón ◽  
César Pincay

Purpose: The study aims to design a control chart based on an exponentially weighted moving average (EWMA) chart of Pearson residuals from a negative binomial regression model, in order to detect possible anomalies in mortality data.
Design/methodology/approach: To evaluate the performance of the proposed chart, the authors considered official historical records of child deaths in Ecuador. A negative binomial regression model was fitted to the data, and a chart of the Pearson residuals was designed. The parameters of the chart, as well as its performance under changes in the mean number of deaths, were obtained by simulation.
Findings: When the chart was plotted, outliers were detected in child deaths in the years 1990–1995, 2001–2006, and 2013–2015, which could indicate underreporting or excessive growth in mortality. In the performance analysis, the value λ = 0.05 gave the fastest detection of changes in the mean number of deaths.
Originality/value: The proposed charts perform better than EWMA charts of deviance residuals, with the notable advantage that Pearson residuals are much easier to interpret and calculate. Finally, the authors point out that although this paper applies control charts only to Ecuadorian infant mortality, the methodology can be used for mortality in any geographical area or to detect outbreaks of infectious diseases.
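As a rough illustration of the chart construction (not the authors' calibrated design): an EWMA statistic is run over the Pearson residual series, with time-varying limits. Here lambda = 0.05 mirrors the value reported as fastest above, while the limit multiplier L is arbitrary, since the paper obtains the chart parameters by simulation.

```python
import numpy as np

def ewma_chart(pearson_resid, lam=0.05, L=2.7):
    """EWMA statistic and time-varying control limits for a Pearson residual sequence."""
    r = np.asarray(pearson_resid, dtype=float)
    sigma = r.std(ddof=1)
    z = np.empty_like(r)
    z[0] = lam * r[0]                        # chart started at the in-control target 0
    for t in range(1, len(r)):
        z[t] = lam * r[t] + (1 - lam) * z[t - 1]
    t = np.arange(1, len(r) + 1)
    width = L * sigma * np.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * t)))
    return z, -width, width

resid = np.random.default_rng(2).normal(size=50)   # stand-in for NB Pearson residuals
z, lcl, ucl = ewma_chart(resid)
print(np.where((z < lcl) | (z > ucl))[0])          # indices of out-of-control signals
```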


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Jan Lause ◽  
Philipp Berens ◽  
Dmitry Kobak

Abstract
Background: Standard preprocessing of single-cell RNA-seq UMI data includes normalization by sequencing depth to remove technical variability, and a nonlinear transformation to stabilize the variance across genes with different expression levels. Instead, two recent papers propose to use statistical count models for these tasks: Hafemeister and Satija (Genome Biol 20:296, 2019) recommend using Pearson residuals from negative binomial regression, while Townes et al. (Genome Biol 20:295, 2019) recommend fitting a generalized PCA model. Here, we investigate the connection between these approaches theoretically and empirically, and compare their effects on downstream processing.
Results: We show that the model of Hafemeister and Satija produces noisy parameter estimates because it is overspecified, which is why the original paper employs post hoc smoothing. When specified more parsimoniously, it has a simple analytic solution equivalent to the rank-one Poisson GLM-PCA of Townes et al. Further, our analysis indicates that per-gene overdispersion estimates in Hafemeister and Satija are biased, and that the data are in fact consistent with the overdispersion parameter being independent of gene expression. We then use negative control data without biological variability to estimate the technical overdispersion of UMI counts, and find that across several different experimental protocols the data are close to Poisson and suggest very moderate overdispersion. Finally, we perform a benchmark comparing Pearson residuals, variance-stabilizing transformations, and GLM-PCA on scRNA-seq datasets with known ground truth.
Conclusions: We demonstrate that analytic Pearson residuals strongly outperform other methods for identifying biologically variable genes, and capture more of the biologically meaningful variation when used for dimensionality reduction.
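A minimal numpy sketch of analytic Pearson residuals as summarized above, assuming the mean is the product of cell depth and gene fraction, the overdispersion is held fixed (theta = 100 here), and residuals are clipped to plus or minus the square root of the number of cells; the toy matrix is illustrative only.

```python
import numpy as np

def analytic_pearson_residuals(counts, theta=100.0, clip=True):
    """Pearson residuals of a depth-times-gene-fraction NB model for a cells x genes matrix."""
    counts = np.asarray(counts, dtype=float)        # assumes no all-zero genes
    total = counts.sum()
    mu = counts.sum(axis=1, keepdims=True) * counts.sum(axis=0, keepdims=True) / total
    r = (counts - mu) / np.sqrt(mu + mu ** 2 / theta)
    if clip:
        n_cells = counts.shape[0]
        r = np.clip(r, -np.sqrt(n_cells), np.sqrt(n_cells))
    return r

# toy example: 5 cells x 4 genes of simulated UMI-like counts
X = np.random.default_rng(0).poisson(3, size=(5, 4))
print(analytic_pearson_residuals(X).round(2))
```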


PLoS ONE ◽  
2021 ◽  
Vol 16 (8) ◽  
pp. e0254479
Author(s):  
Ta-Chien Chan ◽  
Jia-Hong Tang ◽  
Cheng-Yu Hsieh ◽  
Kevin J. Chen ◽  
Tsan-Hua Yu ◽  
...  

Background: Sentinel physician surveillance in communities has played an important role in detecting early signs of epidemics. The traditional approach is for primary care physicians to voluntarily and actively report diseases to the health department on a weekly basis. However, this is labor-intensive work, and the spatio-temporal resolution of the surveillance data is coarse. In this study, we built a clinic-based enhanced sentinel surveillance system named "Sentinel plus", designed for sentinel clinics and community hospitals to monitor 23 kinds of syndromic groups in Taipei City, Taiwan. The definitions of these syndromic groups were based on ICD-10 diagnoses from physicians.
Methods: Daily ICD-10 counts of two syndromic groups, ILI and EV-like syndromes, in Taipei City were extracted from Sentinel plus. A negative binomial regression model coupled with lag structure functions was used to examine the short-term association between ICD counts and meteorological variables. After fitting the negative binomial regression model, the residuals were rescaled to Pearson residuals. We then monitored these daily standardized Pearson residuals for any aberrations from July 2018 to October 2019.
Results: Daily average temperature was significantly negatively associated with the number of ILI syndromes, while ozone and PM2.5 concentrations were significantly positively associated with ILI syndromes. In addition, daily minimum temperature and the ozone and PM2.5 concentrations were significantly negatively associated with EV-like syndromes. The aberrational signals detected from clinics for ILI and EV-like syndromes appeared earlier than the epidemic period based on outpatient surveillance as defined by the Taiwan CDC.
Conclusions: This system not only provides warning signals to the local health department for managing risks but also reminds medical practitioners to be vigilant toward susceptible patients. The near real-time surveillance can help decision makers evaluate their policies on a timely basis.
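A schematic of the residual-monitoring step described in the Methods, using statsmodels and entirely synthetic data; the covariate names (lagged temperature, ozone, PM2.5) and the |residual| > 3 flagging rule are placeholders for the paper's lag structure and aberration-detection procedure.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical stand-in for daily ILI counts with lagged meteorological covariates.
rng = np.random.default_rng(3)
n = 200
temp_lag = rng.normal(22, 4, n)        # lagged daily mean temperature
ozone_lag = rng.normal(30, 8, n)       # lagged ozone
pm25_lag = rng.normal(18, 6, n)        # lagged PM2.5
mu = np.exp(2.0 - 0.04 * temp_lag + 0.01 * ozone_lag + 0.01 * pm25_lag)
ili = rng.negative_binomial(n=5, p=5 / (5 + mu))   # NB counts with mean mu

X = sm.add_constant(np.column_stack([temp_lag, ozone_lag, pm25_lag]))
fit = sm.GLM(ili, X, family=sm.families.NegativeBinomial(alpha=0.2)).fit()

# Pearson residuals from the fitted NB model; flagging large values stands in
# for the paper's aberration-detection step on standardized residuals.
r = fit.resid_pearson
print(np.where(np.abs(r) > 3)[0])
```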


METRON ◽  
2021 ◽  
Author(s):  
Giovanni Saraceno ◽  
Claudio Agostinelli ◽  
Luca Greco

Abstract
A weighted likelihood technique for robust estimation of multivariate wrapped distributions of data points scattered on a p-dimensional torus is proposed. The occurrence of outliers in the sample at hand can badly compromise inference based on standard techniques such as the maximum likelihood method. Therefore, such model inadequacies need to be handled in the fitting process by a robust technique that effectively downweights observations not following the assumed model. Furthermore, the use of a robust method can help in situations with hidden and unexpected substructures in the data. Here, it is suggested to build a set of data-dependent weights based on the Pearson residuals and to solve the corresponding weighted likelihood estimating equations. In particular, robust estimation is carried out by using a Classification EM algorithm whose M-step is enhanced by the computation of weights based on the current parameter values. The finite-sample behavior of the proposed method is investigated through a Monte Carlo study and real data examples.
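A minimal sketch of how Pearson-residual-based weights can be formed in the weighted likelihood framework, shown in one dimension with a wrapped normal model; the kernel density estimate ignores wrapping and the Hellinger-type residual adjustment function is just one common choice, so this is only an approximation of the authors' multivariate procedure.

```python
import numpy as np
from scipy.stats import gaussian_kde

def wrapped_normal_pdf(theta, mu, sigma, k_max=10):
    """Density of a wrapped normal on [-pi, pi), truncating the wrapping sum."""
    ks = np.arange(-k_max, k_max + 1)
    z = (theta[:, None] - mu + 2 * np.pi * ks) / sigma
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (sigma * np.sqrt(2 * np.pi))

def wl_weights(theta, mu, sigma):
    """Pearson residuals delta = f_hat/m_theta - 1 and Hellinger-type RAF weights."""
    f_hat = gaussian_kde(theta)(theta)           # rough data density (ignores wrapping)
    m_fit = wrapped_normal_pdf(theta, mu, sigma)
    delta = f_hat / m_fit - 1.0
    A = 2.0 * (np.sqrt(delta + 1.0) - 1.0)       # Hellinger residual adjustment function
    return np.clip((A + 1.0) / (delta + 1.0), 0.0, 1.0)

# toy 1-D illustration: wrapped-normal sample with a few contaminating points
rng = np.random.default_rng(0)
angles = np.mod(rng.normal(0.5, 0.3, 200) + np.pi, 2 * np.pi) - np.pi
angles[:10] = rng.uniform(-np.pi, np.pi, 10)     # contamination
w = wl_weights(angles, mu=0.5, sigma=0.3)
print(w[:10].round(2))   # points far from the bulk should receive small weights
```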


2021 ◽  
Vol 1 (2) ◽  
pp. 117-132
Author(s):  
Absai Chakaipa ◽  
◽  
Vitalis Basera ◽  
Phamella Dube ◽  
◽  
...  

Purpose: This research aimed to apply log-linear modelling to associations between multiple response categorical variables (MRCVs) on urban agriculture, and to extend the data analysis of Basera, Chakaipa, & Dube (2020) on the impetus of urban agriculture on open spaces of Mutare City. Research methodology: The research data were obtained from households and farmers in urban and peri-urban Mutare City (inclusive of plots in the Weirmouth Park and Fern Valley areas) in December 2020. A total of one hundred and fifteen (115) household farmers were surveyed. Results: Simultaneous Pairwise Marginal Independence (SPMI) tests revealed the presence of associations. Log-linear tests revealed a perfect fit, based on small standardized Pearson residuals, and a strong positive association, based on observed and model-predicted odds ratios, between field agricultural activities and the use of herbicides. Log-linear models and further heterogeneity tests revealed only partial fit or near lack of fit for other pairs of MRCVs, with a strong negative association between municipal vacant places and field agricultural activities. Limitations: The research could not fit log-linear models of association among three or more MRCVs because the objects exceeded 2 GB of memory in both the MI.test() function for SPMI tests and the genloglin regressions. Contribution: The study contributes to urban agriculture planning, especially the enactment of urban agriculture laws, the establishment of one-stop agricultural business centers housing farm input supply and farm produce shops, and the determination of fitting support that can be rendered to urban farmers. Keywords: Multiple Response Categorical Variables (MRCV), Association, Urban agriculture
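A small numpy sketch of the kind of evidence cited here, on a hypothetical 2x2 cross-classification of field agricultural activity by herbicide use: standardized (adjusted) Pearson residuals under the independence log-linear model and the observed odds ratio. The counts are invented for illustration, and the MRCV-specific corrections of the genloglin approach are not reproduced.

```python
import numpy as np

# Hypothetical 2x2 table: field agricultural activity (rows) by herbicide use (columns).
table = np.array([[40, 15],
                  [10, 50]], dtype=float)

n = table.sum()
row = table.sum(axis=1, keepdims=True)
col = table.sum(axis=0, keepdims=True)
expected = row @ col / n                           # independence log-linear model

pearson = (table - expected) / np.sqrt(expected)   # Pearson residuals
# Haberman's adjusted (standardized) residuals for a two-way table
adjusted = pearson / np.sqrt((1 - row / n) @ (1 - col / n))

odds_ratio = (table[0, 0] * table[1, 1]) / (table[0, 1] * table[1, 0])
print(adjusted.round(2), round(odds_ratio, 2))
```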


Entropy ◽  
2021 ◽  
Vol 23 (1) ◽  
pp. 107
Author(s):  
Elisavet M. Sofikitou ◽  
Ray Liu ◽  
Huipei Wang ◽  
Marianthi Markatou

Pearson residuals aid the task of identifying model misspecification because they compare the model estimated from the data with the model assumed under the null hypothesis. We present different formulations of the Pearson residual system that account for the measurement scale of the data and study their properties. We further concentrate on the case of mixed-scale data, that is, data measured on both categorical and interval scales. We study the asymptotic properties and the robustness of minimum disparity estimators obtained in the case of mixed-scale data and exemplify the performance of the methods via simulation.
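A minimal sketch, under simple placeholder models, of Pearson residuals delta = (data-based estimate)/(model) - 1 computed separately on a categorical and an interval-scale variable; the paper's actual mixed-scale formulations and disparity machinery are more involved.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(4)

# categorical component: observed relative frequencies vs. model probabilities
cats = rng.choice(3, size=300, p=[0.5, 0.3, 0.2])
freq = np.bincount(cats, minlength=3) / len(cats)
model_p = np.array([0.45, 0.35, 0.20])             # hypothesized multinomial model
delta_cat = freq / model_p - 1.0

# interval-scale component: kernel density of the data vs. a fitted normal density
x = rng.normal(0, 1, 300)
delta_cont = gaussian_kde(x)(x) / norm.pdf(x, loc=x.mean(), scale=x.std(ddof=1)) - 1.0

print(delta_cat.round(3), np.abs(delta_cont).max().round(3))
```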


2020 ◽  
Author(s):  
Jan Lause ◽  
Philipp Berens ◽  
Dmitry Kobak

Abstract
Standard preprocessing of single-cell RNA-seq UMI data includes normalization by sequencing depth to remove technical variability, and a nonlinear transformation to stabilize the variance across genes with different expression levels. Instead, two recent papers propose to use statistical count models for these tasks: Hafemeister and Satija (2019) recommend using Pearson residuals from negative binomial regression, while Townes et al. (2019) recommend fitting a generalized PCA model. Here, we investigate the connection between these approaches theoretically and empirically, and compare their effects on downstream processing. We show that the model of Hafemeister and Satija (2019) produces noisy parameter estimates because it is overspecified (which is why the original paper employs post-hoc regularization). When specified more parsimoniously, it has a simple analytic solution equivalent to the rank-one Poisson GLM-PCA of Townes et al. (2019). Further, our analysis indicates that per-gene overdispersion estimates in Hafemeister and Satija (2019) are biased, and that the data analyzed in that paper are in fact consistent with a constant overdispersion parameter across genes. We then use negative control data without biological variability to estimate the technical overdispersion of UMI counts, and find that across several different experimental protocols, the data suggest very moderate overdispersion. Finally, we argue that analytic Pearson residuals (or, equivalently, rank-one GLM-PCA or negative binomial regression after regularization) strongly outperform standard preprocessing for identifying biologically variable genes, and capture more biologically meaningful variation when used for dimensionality reduction, compared to other methods.


Author(s):  
Muhammad Amin ◽  
Tahir Mahmood ◽  
Summera Kinat

Control charts are commonly applied for monitoring and controlling the performance of a manufacturing process. Usually, control charts are designed based on the main quality characteristic variable. However, numerous other variables are often highly associated with the main variable. Therefore, generalized linear model (GLM)-based control charts are used, which account for the relationship between variables while monitoring abrupt changes in the process mean. This study develops Phase II GLM-based memory-type control charts using the deviance residuals (DR) and Pearson residuals (PR) of the inverse Gaussian (IG) regression model. For evaluation, a simulation study is designed, and the performance of the proposed control charts is compared with that of the counterpart memoryless control charts and data-based control charts (excluding the effect of the covariate) in terms of run length properties. Based on the simulation study, it is concluded that the exponentially weighted moving average (EWMA)-type control charts have better detection ability than Shewhart and cumulative sum (CUSUM)-type control charts under small and/or moderate shift sizes. Moreover, it is shown that utilizing the covariate may lead to useful conclusions. Finally, the proposed monitoring methods are applied to a dataset from the yarn manufacturing industry to highlight the importance of the proposed control charts.
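A schematic of the residuals underlying the PR- and DR-based charts, using statsmodels with synthetic data: fit an inverse Gaussian GLM (a log link is assumed here), extract Pearson and deviance residuals, and run a simple CUSUM over them; the reference value k and the data are illustrative, not the paper's simulation design.

```python
import numpy as np
import statsmodels.api as sm

def cusum(resid, k=0.5):
    """One-sided upper/lower CUSUM statistics on a residual sequence."""
    up, lo = np.zeros(len(resid)), np.zeros(len(resid))
    for t in range(1, len(resid)):
        up[t] = max(0.0, up[t - 1] + resid[t] - k)
        lo[t] = min(0.0, lo[t - 1] + resid[t] + k)
    return up, lo

# Hypothetical yarn-strength-like response with one covariate; names are illustrative.
rng = np.random.default_rng(5)
x = rng.uniform(1, 5, 150)
y = rng.wald(mean=2 + 0.5 * x, scale=8)            # inverse Gaussian responses

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.InverseGaussian(sm.families.links.Log())).fit()

pr, dr = fit.resid_pearson, fit.resid_deviance     # PR- and DR-based charts use these
print(cusum(pr)[0].max().round(2), cusum(dr)[0].max().round(2))
```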

