A Partial Correlation Screening Approach for Controlling the False Positive Rate in Sparse Gaussian Graphical Models

2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Ginette Lafit ◽  
Francis Tuerlinckx ◽  
Inez Myin-Germeys ◽  
Eva Ceulemans

Abstract Gaussian Graphical Models (GGMs) are extensively used in many research areas, such as genomics, proteomics, neuroimaging, and psychology, to study the partial correlation structure of a set of variables. This structure is visualized by drawing an undirected network, in which the variables constitute the nodes and the partial correlations the edges. In many applications, it makes sense to impose sparsity (i.e., some of the partial correlations are forced to zero) because sparsity is theoretically meaningful and/or because it improves the predictive accuracy of the fitted model. However, as we will show by means of extensive simulations, state-of-the-art estimation approaches for imposing sparsity on GGMs, such as the Graphical lasso, ℓ1 regularized nodewise regression, and joint sparse regression, fall short because they often yield too many false positives (i.e., partial correlations that are not properly set to zero). In this paper we present a new estimation approach that allows better control of the false positive rate. Our approach consists of two steps: First, we estimate an undirected network using one of the three state-of-the-art estimation approaches. Second, we try to detect the false positives by flagging the partial correlations that are smaller in absolute value than a given threshold, which is determined through cross-validation; the flagged correlations are set to zero. Applying this new approach to the same simulated data shows that it indeed performs better. We also illustrate our approach by using it to estimate (1) a gene regulatory network for breast cancer data, (2) a symptom network of patients with a diagnosis within the nonaffective psychotic spectrum and (3) a symptom network of patients with PTSD.
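The two-step idea translates naturally into code. Below is a minimal Python sketch (not the authors' implementation): step one uses scikit-learn's GraphicalLassoCV as the sparse estimator, and step two picks the screening threshold by log-likelihood on a held-out split, a simplified stand-in for the cross-validation procedure described in the abstract; the threshold grid and split size are arbitrary choices.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV, empirical_covariance, log_likelihood
from sklearn.model_selection import train_test_split


def precision_to_partial(theta):
    """Turn a precision matrix into a partial correlation matrix."""
    d = np.sqrt(np.diag(theta))
    pcor = -theta / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor


def screen_partial_correlations(X, thresholds=np.linspace(0.0, 0.3, 31), seed=0):
    """Step 1: fit a sparse GGM with the graphical lasso.
    Step 2: zero out partial correlations whose absolute value falls below a
    threshold chosen by held-out log-likelihood (a stand-in for the paper's
    cross-validation scheme)."""
    X_train, X_val = train_test_split(X, test_size=0.3, random_state=seed)
    model = GraphicalLassoCV().fit(X_train)
    pcor = precision_to_partial(model.precision_)
    d = np.sqrt(np.diag(model.precision_))
    emp_cov_val = empirical_covariance(X_val)

    best_thr, best_ll = 0.0, -np.inf
    for thr in thresholds:
        pcor_thr = np.where(np.abs(pcor) < thr, 0.0, pcor)
        np.fill_diagonal(pcor_thr, 1.0)
        # Map the thresholded partial correlations back to a precision matrix
        # (a crude reconstruction; positive definiteness is not enforced here).
        theta_thr = -pcor_thr * np.outer(d, d)
        np.fill_diagonal(theta_thr, d ** 2)
        ll = log_likelihood(emp_cov_val, theta_thr)
        if ll > best_ll:
            best_thr, best_ll = thr, ll
    return best_thr, np.where(np.abs(pcor) < best_thr, 0.0, pcor)
```

For a data matrix X with observations in rows, screen_partial_correlations(X) returns the selected threshold and the screened partial correlation matrix.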

2018 ◽  
Author(s):  
Donald Ray Williams ◽  
Philippe Rast

Gaussian graphical models are an increasingly popular technique in psychology to characterize relationships among observed variables. These relationships are encoded in the precision matrix (the inverse covariance matrix); standardizing its off-diagonal elements and reversing their sign yields the corresponding partial correlations, which describe pairwise dependencies in which the effects of all other variables have been controlled for. In order to estimate the precision matrix, the graphical lasso (glasso) has emerged as the default estimation method, which uses ℓ1-based regularization. Glasso was developed and optimized for high-dimensional settings where the number of variables (p) exceeds the number of observations (n), which are uncommon in psychological applications. Here we propose to go “back to the basics”, wherein the precision matrix is first estimated with non-regularized maximum likelihood and then Fisher Z-transformed confidence intervals are used to determine non-zero relationships. We first show the exact correspondence between the confidence level and specificity, which follows because 1 − specificity is the false positive rate (i.e., alpha). With simulations in low-dimensional settings (p << n), we then demonstrate superior performance compared to glasso for determining conditional relationships, as well as lower frequentist risk measured with various loss functions. Further, our results indicate that glasso is inconsistent for the purpose of model selection, whereas the proposed method converged on the true model with a probability that approached 100%. We end by discussing implications for estimating Gaussian graphical models in psychology.
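A minimal Python sketch of the described "back to the basics" recipe, assuming n > p: invert the sample covariance, Fisher z-transform the partial correlations, and keep only the edges whose confidence interval excludes zero. This is an illustration of the idea, not the authors' own implementation.

```python
import numpy as np
from scipy import stats


def fisher_z_network(X, alpha=0.05):
    """Non-regularized partial correlations, keeping only edges whose
    Fisher z confidence interval excludes zero."""
    n, p = X.shape                                   # requires n > p
    theta = np.linalg.inv(np.cov(X, rowvar=False))   # inverse sample covariance; partial
    d = np.sqrt(np.diag(theta))                      # correlations are scale-invariant
    pcor = -theta / np.outer(d, d)
    np.fill_diagonal(pcor, 0.0)                      # the diagonal is not an edge

    z = np.arctanh(pcor)                             # Fisher z-transform
    se = 1.0 / np.sqrt(n - (p - 2) - 3)              # each edge controls for the other p - 2 variables
    crit = stats.norm.ppf(1 - alpha / 2)
    adjacency = np.abs(z) > crit * se                # CI excludes zero -> keep the edge
    return np.where(adjacency, pcor, 0.0), adjacency
```

Here 1 − alpha is the targeted specificity, mirroring the confidence-level/specificity correspondence noted in the abstract.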


2020 ◽  
Author(s):  
Victor Bernal ◽  
Rainer Bischoff ◽  
Peter Horvatovich ◽  
Victor Guryev ◽  
Marco Grzegorczyk

Abstract Background: In systems biology, it is important to reconstruct regulatory networks from quantitative molecular profiles. Gaussian graphical models (GGMs) are one of the most popular methods to this end. A GGM consists of nodes (representing the transcripts, metabolites or proteins) inter-connected by edges (reflecting their partial correlations). Learning the edges from quantitative molecular profiles is statistically challenging, as there are usually fewer samples than nodes (the ‘high-dimensional problem’). Shrinkage methods address this issue by learning a regularized GGM. However, it is an open question how the shrinkage affects the final result and its interpretation. Results: We show that the shrinkage biases the partial correlations in a non-linear way. This bias not only changes the magnitudes of the partial correlations but also affects their order. Furthermore, it makes networks obtained from different experiments incomparable and hinders their biological interpretation. We propose a method, referred to as the ‘un-shrunk’ partial correlation, which corrects for this non-linear bias. Unlike traditional methods, which use a fixed shrinkage value, the new approach provides partial correlations that are closer to the actual (population) values and that are easier to interpret. We apply the ‘un-shrunk’ method to two gene expression datasets from Escherichia coli and Mus musculus. Conclusions: GGMs are popular undirected graphical models based on partial correlations. The application of GGMs to reconstruct regulatory networks is commonly performed using shrinkage to overcome the ‘high-dimensional problem’. Besides its advantages, we have identified that the shrinkage introduces a non-linear bias in the partial correlations. Ignoring this type of effect caused by the shrinkage can obscure the interpretation of the network and impede the validation of earlier reported results.
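The non-linear effect of shrinkage is easy to see numerically. The sketch below (illustrative only, not the authors' 'un-shrunk' method) computes partial correlations from a shrunk covariance matrix S(λ) = λT + (1 − λ)S for several shrinkage values λ; the diagonal target T, the simulated data, and the λ grid are arbitrary choices. Both the magnitudes and the ranking of the strongest edges can shift with λ, which is the bias the paper addresses.

```python
import numpy as np


def partial_correlations(cov):
    """Partial correlations obtained by standardizing the inverse covariance."""
    theta = np.linalg.inv(cov)
    d = np.sqrt(np.diag(theta))
    pcor = -theta / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor


rng = np.random.default_rng(1)
p, n = 20, 15                                   # fewer samples than variables ('high-dimensional problem')
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))
S = np.cov(X, rowvar=False)                     # singular, so some shrinkage is needed
target = np.diag(np.diag(S))                    # a common diagonal shrinkage target

for lam in (0.1, 0.3, 0.6):
    S_shrunk = lam * target + (1 - lam) * S     # shrunk covariance estimate
    pcor = partial_correlations(S_shrunk)
    strongest = np.argsort(-np.abs(pcor[np.triu_indices(p, k=1)]))[:5]
    print(f"lambda={lam}: top-5 edges by |partial correlation|: {strongest}")
```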


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Victor Bernal ◽  
Rainer Bischoff ◽  
Peter Horvatovich ◽  
Victor Guryev ◽  
Marco Grzegorczyk

Abstract Background In systems biology, it is important to reconstruct regulatory networks from quantitative molecular profiles. Gaussian graphical models (GGMs) are one of the most popular methods to this end. A GGM consists of nodes (representing the transcripts, metabolites or proteins) inter-connected by edges (reflecting their partial correlations). Learning the edges from quantitative molecular profiles is statistically challenging, as there are usually fewer samples than nodes (the ‘high-dimensional problem’). Shrinkage methods address this issue by learning a regularized GGM. However, it remains an open question how the shrinkage affects the final result and its interpretation. Results We show that the shrinkage biases the partial correlations in a non-linear way. This bias not only changes the magnitudes of the partial correlations but also affects their order. Furthermore, it makes networks obtained from different experiments incomparable and hinders their biological interpretation. We propose a method, referred to as ‘un-shrinking’ the partial correlation, which corrects for this non-linear bias. Unlike traditional methods, which use a fixed shrinkage value, the new approach provides partial correlations that are closer to the actual (population) values and that are easier to interpret. This is demonstrated on two gene expression datasets from Escherichia coli and Mus musculus. Conclusions GGMs are popular undirected graphical models based on partial correlations. The application of GGMs to reconstruct regulatory networks is commonly performed using shrinkage to overcome the ‘high-dimensional problem’. Besides its advantages, we have identified that the shrinkage introduces a non-linear bias in the partial correlations. Ignoring this type of effect caused by the shrinkage can obscure the interpretation of the network and impede the validation of earlier reported results.


2014 ◽  
Vol 644-650 ◽  
pp. 3338-3341 ◽  
Author(s):  
Guang Feng Guo

During the 30-year development of intrusion detection systems, problems such as a high false-positive rate have continually plagued users. Therefore, an ontology- and context-verification-based intrusion detection model (OCVIDM) was put forward to connect attack signature descriptions with their context effectively. The OCVIDM establishes a knowledge base for the intrusion detection ontology that serves as the core of an efficient platform for filtering false alerts, enabling automatic validation of alarms and autonomous judgment of real attacks, so as to filter out non-relevant alerts and reduce false positives.


2020 ◽  
Vol 30 (12) ◽  
pp. 1851-1855
Author(s):  
Sruti Rao ◽  
M. B. Goens ◽  
Orrin B. Myers ◽  
Emilie A. Sebesta

Abstract Aim: To determine the false-positive rate of pulse oximetry screening at moderate altitude, presumed to be elevated compared with sea-level values, and to assess the change in false-positive rate over time. Methods: We retrospectively analysed 3548 infants in the newborn nursery in Albuquerque, New Mexico (elevation 5400 ft), from July 2012 to October 2013. Universal pulse oximetry screening guidelines were employed after 24 hours of life but before discharge. Newborn babies between 36 and 36 6/7 weeks of gestation weighing >2 kg and babies >37 weeks weighing >1.7 kg were included in the study. Log-binomial regression was used to assess the change in the probability of false positives over time. Results: Of the 3548 patients analysed, there was one true positive with a posteriorly malaligned ventricular septal defect and an interrupted aortic arch. Among the 93 false positives, the mean pre- and post-ductal saturations were lower, 92% and 90%, respectively. The false-positive rate was 3.5% before April 2013 and decreased to 1.5% thereafter. There was a significant decrease in the false-positive rate (p = 0.003, slope coefficient = −0.082, standard error of coefficient = 0.023), with the relative risk of a false positive decreasing by a factor of 0.92 (95% CI 0.88–0.97) per month. Conclusion: This is the first study in Albuquerque, New Mexico, reporting a high false-positive rate of 1.5% at moderate altitude at the end of the study, in comparison to the false-positive rate of 0.035% at sea level. Implementation of the nationally recommended universal pulse oximetry screening was associated with a high false-positive rate in the initial period, thought to result from the combination of a learning curve and altitude. After the initial decline, the rate remained steadily elevated above sea-level values, indicating the dominant effect of moderate altitude.
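As a purely illustrative check of the reported effect size (not part of the study), the monthly relative risk and its confidence interval follow from exponentiating the log-binomial slope and its standard error:

```python
import math

slope, se = -0.082, 0.023                       # reported slope and standard error (per month)
rr = math.exp(slope)                            # monthly relative risk of a false positive
ci = (math.exp(slope - 1.96 * se), math.exp(slope + 1.96 * se))
print(f"RR per month = {rr:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
# ≈ 0.92 (0.88–0.96), consistent with the reported 0.92 (95% CI 0.88–0.97)
```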


1981 ◽  
Vol 74 (1) ◽  
pp. 41-43 ◽  
Author(s):  
I G Barrison ◽  
E R Littlewood ◽  
J Primavesi ◽  
A Sharples ◽  
I T Gilmore ◽  
...  

Stools have been tested for occult gastrointestinal bleeding in 278 outpatients and 170 hospital inpatients using the Haemoccult and Haemastix methods. Seventeen outpatients (6.1%) and 42 inpatients (24.7%) were positive with the Haemoccult technique. Thirty-three outpatients (11.9%) and 93 inpatients (54.7%) were positive with the Haemastix test. Following investigation of the Haemoccult-positive patients, only 2 cases (3.4%) were considered false positives. However, the false positive rate with Haemastix was 22.9% which is unacceptable in a screening test. Haemoccult may be useful as a screening test for asymptomatic general practice patients, but a test of greater sensitivity is needed for hospital patients.


2018 ◽  
pp. 1-10
Author(s):  
Luke T. Lavallée ◽  
Rodney H. Breau ◽  
Dean Fergusson ◽  
Cynthia Walsh ◽  
Carl van Walraven

Purpose Administrative health data can be a valuable resource for health research. Because these data are not collected for research purposes, it is imperative that the accuracy of codes used to identify patients, exposures, and outcomes is measured. Patients and Methods Code sensitivity was determined by identifying a cohort of men with histologically confirmed prostate cancer in the Ontario Cancer Registry and linking them to the Ontario Health Insurance Plan (OHIP) to determine whether a prostate biopsy code had been claimed. Code specificity was estimated using a random sample of patients at The Ottawa Hospital for whom a prostate biopsy code was submitted to OHIP. A simulation model, which varied the code false-positive rate, true-negative rate, and proportion of code positives in the population, was created to determine specificity under a range of combinations of these parameters. Results Between 1991 and 2012, 97,369 of 148,669 men with histologically confirmed prostate cancer in the Ontario Cancer Registry had a prostate biopsy code in OHIP within 1 week of their diagnosis (code sensitivity, 86.0%). This increased significantly over time (63.8% in 1991 to 87.9% in 2012). The false-positive rate of the code for index prostate biopsies was 1.9%. The simulation model found that the code specificity exceeded 95% for first prostate biopsy but was lower for secondary biopsies because of more false positives. False positives primarily were related to placement of fiducial markers for patients who received radiotherapy. Conclusion Administrative data in Ontario can accurately identify men who receive a prostate biopsy. The code is less accurate for secondary biopsy procedures and their sequelae.
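The abstract does not spell out the simulation model, but the quantities it varies can be combined in a straightforward way. The sketch below is one plausible reading, assuming the 'false-positive rate' is the fraction of code positives that are false and the 'true-negative rate' is the fraction of code negatives that are truly negative; the parameter grid is illustrative only, not the authors' grid.

```python
import itertools


def specificity(fp_rate, tn_rate, prop_code_pos):
    """Specificity = TN / (TN + FP), with counts expressed per unit of population.

    fp_rate       -- fraction of code positives that are false positives
    tn_rate       -- fraction of code negatives that are truly negative
    prop_code_pos -- proportion of the population with a positive biopsy code
    """
    fp = fp_rate * prop_code_pos
    tn = tn_rate * (1 - prop_code_pos)
    return tn / (tn + fp)


# Sweep an illustrative grid of parameter values.
for fp_rate, tn_rate, prop in itertools.product((0.019, 0.05), (0.90, 0.99), (0.02, 0.10)):
    spec = specificity(fp_rate, tn_rate, prop)
    print(f"fp_rate={fp_rate:.3f} tn_rate={tn_rate:.2f} prop_pos={prop:.2f} -> specificity={spec:.3f}")
```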


2014 ◽  
Vol 687-691 ◽  
pp. 2611-2617
Author(s):  
Hong Hai Zhou ◽  
Pei Bin Liu ◽  
Zhi Hao Jin

In this paper, a new method for network troubleshooting, named DRNFD, is put forward, in which an “abnormal degree” is defined by a vector of probability and belief functions for a privileged process. A new formula based on Dempster's rule of combination is presented to decrease false positives. DRNFD can effectively reduce the false positive rate and the non-response rate, and can be applied to real-time fault diagnosis. An operational prototype system demonstrates its feasibility and its effectiveness for real-time fault diagnosis.
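The paper's modified formula is not given in the abstract; as background, the sketch below shows the classical Dempster rule of combination that it builds on, applied to a toy intrusion-detection example in which contextual evidence down-weights a signature-based alert. The frame, mass values, and interpretation are illustrative assumptions, not taken from the paper.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Classical Dempster rule of combination for two mass functions.

    Each mass function maps frozenset hypotheses to masses that sum to 1."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb                  # mass assigned to contradictory evidence
    if conflict >= 1.0:
        raise ValueError("total conflict: the two pieces of evidence cannot be combined")
    return {h: w / (1.0 - conflict) for h, w in combined.items()}


ATTACK, NORMAL = frozenset({"attack"}), frozenset({"normal"})
EITHER = ATTACK | NORMAL                         # mass on the whole frame = ignorance
m_signature = {ATTACK: 0.6, EITHER: 0.4}         # an alert: signature matched, but uncertain
m_context = {NORMAL: 0.7, EITHER: 0.3}           # context check: behaviour looks benign
print(dempster_combine(m_signature, m_context))  # belief in "attack" drops, suggesting a false alert
```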


Author(s):  
Pamela Reinagel

Abstract After an experiment has been completed and analyzed, a trend may be observed that is “not quite significant”. Sometimes in this situation, researchers incrementally grow their sample size N in an effort to achieve statistical significance. This is especially tempting in situations when samples are very costly or time-consuming to collect, such that collecting an entirely new sample larger than N (the statistically sanctioned alternative) would be prohibitive. Such post-hoc sampling or “N-hacking” is condemned, however, because it leads to an excess of false positive results. Here, Monte Carlo simulations are used to show why and how incremental sampling causes false positives, but also to challenge the claim that it necessarily produces alarmingly high false positive rates. In a parameter regime that would be representative of practice in many research fields, simulations show that the inflation of the false positive rate is modest and easily bounded. But the effect on false positive rate is only half the story. What many researchers really want to know is the effect N-hacking would have on the likelihood that a positive result is a real effect that will be replicable: the positive predictive value (PPV). This question has not been considered in the reproducibility literature. The answer depends on the effect size and the prior probability of an effect. Although in practice these values are not known, simulations show that for a wide range of values, the PPV of results obtained by N-hacking is in fact higher than that of non-incremented experiments of the same sample size and statistical power. This is because the increase in false positives is more than offset by the increase in true positives. Therefore in many situations, adding a few samples to shore up a nearly-significant result is in fact statistically beneficial. In conclusion, if samples are added after an initial hypothesis test this should be disclosed, and if a p value is reported it should be corrected. But, contrary to widespread belief, collecting additional samples to resolve a borderline p value is not invalid, and can confer previously unappreciated advantages for efficiency and positive predictive value.
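A minimal Monte Carlo sketch of the kind of incremental-sampling procedure being simulated (not the paper's code): every experiment is run under a true null, and samples are added only when the p-value lands in an 'almost significant' window. The group size, window, and number of added samples are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)


def n_hacked_test(n0=20, n_add=4, max_rounds=5, alpha=0.05, promising=0.10, effect=0.0):
    """One simulated experiment: start with n0 per group; whenever the p-value is
    'almost significant' (alpha < p < promising), add n_add observations per group
    and retest, up to max_rounds times."""
    a = list(rng.normal(effect, 1.0, n0))
    b = list(rng.normal(0.0, 1.0, n0))
    for _ in range(max_rounds):
        p = stats.ttest_ind(a, b).pvalue
        if p <= alpha or p >= promising:
            return p <= alpha
        a.extend(rng.normal(effect, 1.0, n_add))
        b.extend(rng.normal(0.0, 1.0, n_add))
    return stats.ttest_ind(a, b).pvalue <= alpha


reps = 20_000
fpr = np.mean([n_hacked_test(effect=0.0) for _ in range(reps)])   # the null is true in every experiment
print(f"realized false positive rate under incremental sampling: {fpr:.3f} (nominal alpha = 0.05)")
```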


2015 ◽  
Author(s):  
David M Rocke ◽  
Luyao Ruan ◽  
Yilun Zhang ◽  
J. Jared Gossett ◽  
Blythe Durbin-Johnson ◽  
...  

Motivation: An important property of a valid method for testing for differential expression is that the false positive rate should at least roughly correspond to the p-value cutoff, so that if 10,000 genes are tested at a p-value cutoff of 10⁻⁴, and if all the null hypotheses are true, then there should be only about 1 gene declared to be significantly differentially expressed. We tested this by resampling from existing RNA-Seq data sets and also by matched negative binomial simulations. Results: Methods we examined, which rely strongly on a negative binomial model, such as edgeR, DESeq, and DESeq2, show large numbers of false positives in both the resampled real-data case and in the simulated negative binomial case. This also occurs with a negative binomial generalized linear model function in R. Methods that use only the variance function, such as limma-voom, do not show excessive false positives, as is also the case with a variance stabilizing transformation followed by linear model analysis with limma. The excess false positives are likely caused by apparently small biases in estimation of negative binomial dispersion and, perhaps surprisingly, occur mostly when the mean and/or the dispersion is high, rather than for low-count genes.
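A small Python stand-in for the kind of all-null calibration check described here, using statsmodels' NB2 regression rather than the R packages examined in the paper; the number of genes, samples per group, mean, and dispersion are arbitrary assumptions. A calibrated test should flag roughly n_genes × cutoff genes.

```python
import warnings

import numpy as np
import statsmodels.api as sm

warnings.filterwarnings("ignore")                # silence convergence warnings from tiny per-gene fits
rng = np.random.default_rng(42)

n_genes, n_per_group, cutoff = 2000, 5, 1e-4
group = np.repeat([0, 1], n_per_group)
X = sm.add_constant(group)                       # intercept + group indicator

false_positives = 0
for _ in range(n_genes):
    mean, dispersion = 200.0, 0.2                # moderately high mean and dispersion
    r = 1.0 / dispersion                         # negative binomial 'size' parameter
    prob = r / (r + mean)
    # Identical negative binomial distributions in both groups: every null is true.
    y = rng.negative_binomial(r, prob, size=2 * n_per_group)
    try:
        res = sm.NegativeBinomial(y, X).fit(disp=0)
        if res.pvalues[1] < cutoff:              # Wald p-value of the group coefficient
            false_positives += 1
    except Exception:
        pass                                     # skip the occasional non-converging fit

expected = n_genes * cutoff
print(f"{false_positives} genes called significant at p < {cutoff} "
      f"(about {expected:.1f} expected from a calibrated test)")
```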

