Why Overfitting is Not (Usually) a Problem in Partial Correlation Networks

2020 ◽  
Author(s):  
Donald Ray Williams ◽  
Josue E. Rodriguez

Network psychometrics is undergoing a time of methodological reflection. In part, this was spurred by the revelation that l1-regularization does not reduce spurious associations in partial correlation networks. In this work, we address another motivation for the widespread use of regularized estimation: the thought that it is needed to mitigate overfitting. We first clarify important aspects of overfitting and the bias-variance tradeoff that are especially relevant for the network literature, where the number of nodes or items in a psychometric scale is not large compared to the number of observations (i.e., a low p/n ratio). This revealed that bias and especially variance are most problematic at p/n ratios rarely encountered. We then introduce a nonregularized method, based on classical hypothesis testing, that fulfills two desiderata: (1) reducing or controlling the false positive rate and (2) quelling concerns of overfitting by providing accurate predictions. These were the primary motivations for initially adopting the graphical lasso (glasso). In several simulation studies, our nonregularized method provided more than competitive predictive performance, and, in many cases, outperformed glasso. It appears to be nonregularized, as opposed to regularized estimation, that best satisfies these desiderata. We then provide insights into using our methodology. Here we discuss the multiple comparisons problem in relation to prediction: stringent alpha levels, resulting in a sparse network, can deteriorate predictive accuracy. We end by emphasizing key advantages of our approach that make it ideal for both inference and prediction in network analysis.
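For concreteness, here is a minimal sketch of the general idea, assuming the nonregularized estimator amounts to Fisher-z significance tests on the sample partial correlations; it is an illustration of the technique, not the authors' implementation.

```python
# Minimal sketch: nonregularized partial-correlation network via classical
# hypothesis testing (Fisher z). Illustrative; not the authors' code.
import numpy as np
from scipy import stats

def partial_correlations(X):
    """Partial correlations from the inverse sample covariance (precision) matrix."""
    prec = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.diag(prec))
    pcor = -prec / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def threshold_network(X, alpha=0.05):
    """Keep edges whose Fisher-z test rejects rho_ij = 0 at level alpha."""
    n, p = X.shape
    pcor = partial_correlations(X)
    off = pcor.copy()
    np.fill_diagonal(off, 0.0)                 # avoid arctanh(1) on the diagonal
    z = np.arctanh(off)                        # Fisher z-transform
    se = 1.0 / np.sqrt(n - p - 1)              # SE when conditioning on p - 2 variables
    pvals = 2 * stats.norm.sf(np.abs(z) / se)
    keep = (pvals < alpha) & ~np.eye(p, dtype=bool)
    return pcor * keep

X = np.random.default_rng(0).normal(size=(400, 10))   # toy data with a low p/n ratio
network = threshold_network(X)
```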

2018 ◽  
Author(s):  
Donald Ray Williams ◽  
Mijke Rhemtulla ◽  
Anna Wysocki ◽  
Philippe Rast

An important goal for psychological science is developing methods to characterize relationships between variables. The customary approach uses structural equation models to connect latent factors to a number of observed measurements. More recently, regularized partial correlation networks have been proposed as an alternative approach for characterizing relationships among variables through covariances in the precision matrix. While the graphical lasso (glasso) method has emerged as the default network estimation method, it was optimized in fields outside of psychology with very different needs, such as high-dimensional data where the number of variables (p) exceeds the number of observations (n). In this paper, we describe the glasso method in the context of the fields where it was developed, and then we demonstrate that the advantages of regularization diminish in settings where psychological networks are often fitted (p ≪ n). We first show that improved properties of the precision matrix, such as eigenvalue estimation, and predictive accuracy with cross-validation are not always appreciable. We then introduce non-regularized methods based on multiple regression, after which we characterize performance with extensive simulations. Our results demonstrate that the non-regularized methods consistently outperform glasso with respect to limiting false positives, and they provide more consistent performance across sparsity levels, sample composition (p/n), and partial correlation size. We end by reviewing recent findings in the statistics literature suggesting that alternative methods often perform better than glasso, and by suggesting areas for future research in psychology.
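As an illustration of the node-wise multiple-regression idea, the sketch below regresses each node on all remaining nodes and combines the two weights of an edge into a partial correlation; details of the published estimators may differ.

```python
# Hedged sketch: partial correlations via node-wise multiple regression.
import numpy as np

def pcor_via_regression(X):
    X = X - X.mean(axis=0)                       # center, so no intercept is needed
    n, p = X.shape
    B = np.zeros((p, p))
    for i in range(p):
        others = [j for j in range(p) if j != i]
        coef, *_ = np.linalg.lstsq(X[:, others], X[:, i], rcond=None)
        B[i, others] = coef                      # OLS of node i on the other p - 1 nodes
    # pcor_ij = sign(b_ij) * sqrt(b_ij * b_ji); the clip guards tiny negative products
    pcor = np.sign(B) * np.sqrt(np.clip(B * B.T, 0.0, None))
    np.fill_diagonal(pcor, 1.0)
    return pcor
```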


Agronomy ◽  
2021 ◽  
Vol 11 (4) ◽  
pp. 761
Author(s):  
Daniel Bravo ◽  
Clara Leon-Moreno ◽  
Carlos Alberto Martínez ◽  
Viviana Marcela Varón-Ramírez ◽  
Gustavo Alfonso Araujo-Carrillo ◽  
...  

This study represents the first nationwide survey of the distribution of Cd content in cacao-growing soils in Colombia. The soil Cd distribution was analyzed using a cold/hotspot model. Moreover, both descriptive and predictive analytical tools were used to assess the key factors regulating Cd concentration, considering Cd content and eight soil variables in the cacao systems. A critical discussion is provided for four main cacao-growing districts. Our results suggest that a model using all the variables will always outperform one using Zn alone. The analyzed variables showed adequate predictive performance; nonetheless, that performance must be improved before a prediction method can be deployed nationwide. Results from the fitted graphical models showed that the largest associations (as measured by the partial correlation coefficients) were those between Cd and Zn. Ca had the second-largest partial correlation with Cd, and its predictive performance ranked second. Interestingly, there was high variability in the factors correlated with Cd in cacao-growing soils at the national level. Therefore, this study constitutes a baseline for forthcoming studies in the country and should be reinforced with an analysis of cadmium content in cacao beans.


2021 ◽  
pp. 1-10
Author(s):  
I. Krug ◽  
J. Linardon ◽  
C. Greenwood ◽  
G. Youssef ◽  
J. Treasure ◽  
...  

Background: Despite a wide range of proposed risk factors and theoretical models, prediction of eating disorder (ED) onset remains poor. This study undertook the first comparison of two machine learning (ML) approaches [penalised logistic regression (LASSO) and prediction rule ensembles (PREs)] with conventional logistic regression (LR) models to enhance prediction of ED onset and differential ED diagnoses from a range of putative risk factors. Method: Data were part of a European project and comprised 1402 participants: 642 ED patients [52% with anorexia nervosa (AN) and 40% with bulimia nervosa (BN)] and 760 controls. The Cross-Cultural Risk Factor Questionnaire, which retrospectively assesses a range of sociocultural and psychological ED risk factors occurring before the age of 12 years (46 predictors in total), was used. Results: All three statistical approaches had satisfactory model accuracy, with an average area under the curve (AUC) of 86% for predicting ED onset and 70% for predicting AN v. BN. Predictive performance was greatest for the two regression methods (LR and LASSO), although the PRE technique relied on fewer predictors with comparable accuracy. The individual risk factors differed depending on the outcome classification (EDs v. non-EDs and AN v. BN). Conclusions: Even though conventional LR performed comparably to the ML approaches in terms of predictive accuracy, the ML methods produced more parsimonious predictive models. ML approaches offer a viable way to modify screening practices for ED risk that balance accuracy against participant burden.
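A minimal sketch of the LR-versus-LASSO comparison with scikit-learn, run on synthetic stand-in data shaped like the study sample (1402 participants, 46 predictors); the settings are illustrative assumptions, not the study's configuration.

```python
# Hedged sketch: conventional logistic regression vs. L1-penalised (LASSO)
# logistic regression on synthetic stand-in data, scored by AUC and sparsity.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data shaped like the study: 1402 participants, 46 predictors
X, y = make_classification(n_samples=1402, n_features=46, n_informative=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
scale = StandardScaler().fit(X_tr)
X_tr, X_te = scale.transform(X_tr), scale.transform(X_te)

lr = LogisticRegression(penalty=None, max_iter=5000).fit(X_tr, y_tr)   # conventional LR
lasso = LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=20, cv=5).fit(X_tr, y_tr)

for name, m in (("LR", lr), ("LASSO", lasso)):
    auc = roc_auc_score(y_te, m.predict_proba(X_te)[:, 1])
    print(name, "AUC:", round(auc, 3), "| nonzero coefficients:", int((m.coef_ != 0).sum()))
```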


2004 ◽  
Vol 1 (1) ◽  
pp. 131-142
Author(s):  
Ljupčo Todorovski ◽  
Sašo Džeroski ◽  
Peter Ljubič

Both equation discovery and regression methods aim at inducing models of numerical data. While equation discovery methods are usually evaluated in terms of the comprehensibility of the induced model, the emphasis in evaluating regression methods is on their predictive accuracy. In this paper, we present Ciper, an efficient method for the discovery of polynomial equations, and empirically evaluate its predictive performance on standard regression tasks. The evaluation shows that polynomials compare favorably to the linear and piecewise regression models induced by existing state-of-the-art regression methods, in terms of both degree of fit and complexity.
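To make the comparison concrete, here is a generic sketch (not the Ciper algorithm itself) contrasting a linear model with a degree-2 polynomial model on data containing a genuinely polynomial signal:

```python
# Generic sketch: linear vs. polynomial regression on the same data,
# scored by out-of-sample fit. Not the Ciper equation-discovery method.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = 1.5 * X[:, 0] ** 2 - X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
linear = LinearRegression().fit(X_tr, y_tr)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_tr, y_tr)

print("linear R^2:", round(r2_score(y_te, linear.predict(X_te)), 3))
print("degree-2 polynomial R^2:", round(r2_score(y_te, poly.predict(X_te)), 3))
```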


2019 ◽  
Author(s):  
Donald Salami ◽  
Carla Alexandra Sousa ◽  
Maria do Rosário Oliveira Martins ◽  
César Capinha

The geographical spread of dengue is a global public health concern. This is largely mediated by the importation of dengue from endemic to non-endemic areas via the increasing connectivity of the global air transport network. The dynamic nature and intrinsic heterogeneity of the air transport network make it challenging to predict dengue importation. Here, we explore the capabilities of state-of-the-art machine learning algorithms to predict dengue importation. We trained four machine learning classifier algorithms, using six years of historical dengue importation data for 21 countries in Europe, connectivity indices mediating importation, and air transport network centrality measures. Predictive performance of the classifiers was evaluated using the area under the receiver operating characteristic curve, sensitivity, and specificity measures. Finally, we applied practical model-agnostic methods to provide an in-depth explanation of our optimal model's predictions on a global and local scale. Our best performing model achieved high predictive accuracy, with an area under the receiver operating characteristic curve score of 0.94 and a maximized sensitivity score of 0.88. The predictor variables identified as most important were the source country's dengue incidence rate, population size, and volume of air passengers. Network centrality measures, describing the positioning of European countries within the air travel network, were also influential to the predictions. We demonstrated the high predictive performance of a machine learning model in predicting dengue importation and the utility of model-agnostic methods to offer a comprehensive understanding of the reasons behind the predictions. Similar approaches can be utilized in the development of an operational early warning surveillance system for dengue importation.
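A hedged sketch of the evaluation protocol described above (AUC plus sensitivity and specificity at a chosen probability threshold), using a placeholder classifier and simulated data rather than the study's model:

```python
# Hedged sketch: classifier evaluation with AUC, sensitivity, and specificity.
# The classifier and data are placeholders, not the study's model or data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.8], random_state=0)  # imbalanced classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
print("AUC:", round(roc_auc_score(y_te, proba), 3))

threshold = 0.35  # lowered below 0.5 to favour sensitivity, an assumed tuning choice
tn, fp, fn, tp = confusion_matrix(y_te, proba >= threshold).ravel()
print("sensitivity:", round(tp / (tp + fn), 3), "specificity:", round(tn / (tn + fp), 3))
```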


2021 ◽  
Vol 10 ◽  
Author(s):  
Yuan Li ◽  
Xiaolan Zhang ◽  
Yan Gao ◽  
Chunliang Shang ◽  
Bo Yu ◽  
...  

Background: High grade serous ovarian cancer (HGSOC) is the most common subtype of ovarian cancer. Although platinum-based chemotherapy has been the cornerstone of HGSOC treatment, nearly 25% of patients have an interval of less than 6 months since the last platinum chemotherapy, referred to as platinum resistance. Currently, no precise tools to predict platinum resistance have been developed. Methods: Ninety-nine HGSOC patients who had completed cytoreductive surgery and platinum-based chemotherapy at Peking University Third Hospital from 2018 to 2019 were enrolled. Whole-genome sequencing (WGS) and whole-exome sequencing (WES) were performed on the collected tumor tissue samples to establish a platinum-resistance predictor in a discovery cohort of 57 patients, which was further validated in another 42 HGSOC patients. Results: A high prevalence of alterations in the DNA damage repair (DDR) pathway, including BRCA1/2, was identified in both the platinum-sensitive and platinum-resistant HGSOC patients. Compared with the resistant subgroup, there was a trend toward a higher prevalence of homologous recombination deficiency (HRD) in the platinum-sensitive subgroup (78.95% vs. 47.37%, p=0.0646). Based on the HRD score, microhomology insertions and deletions (MHID), copy number change load, duplication load of 1–100 kb, single nucleotide variant load, and eight other mutational signatures, a combined predictor of platinum resistance, termed DRDscore, was established. DRDscore outperformed previously reported biomarkers in predicting platinum sensitivity, with a predictive accuracy of 0.860 at a threshold of 0.7584. The predictive performance of DRDscore was validated in an independent cohort of 42 HGSOC patients with a sensitivity of 90.9%. Conclusions: A multi-genomic signature-based analysis enabled the prediction of initial platinum resistance in advanced HGSOC patients, which may serve as a novel assessment of platinum resistance, provide therapeutic guidance, and merit further validation.


2015 ◽  
Vol 2015 ◽  
pp. 1-7 ◽  
Author(s):  
Hai-Hui Huang ◽  
Yong Liang ◽  
Xiao-Ying Liu

Identifying biomarkers and signaling pathways is a critical step in genomic studies, in which regularization is a widely used feature selection approach. However, most regularizers are based on the L1-norm, and their results are often insufficiently sparse and interpretable as well as asymptotically biased, especially in genomic research. Recently, a large amount of molecular interaction information about disease-related biological processes has become available through various databases covering many aspects of biological systems. In this paper, we use an enhanced L1/2 penalized solver for a network-constrained logistic regression model, called the enhanced L1/2 net, in which the predictors are based on gene-expression data combined with biological network knowledge. Extensive simulation studies showed that our proposed approach outperforms L1 regularization, the old L1/2 penalized solver, and the Elastic net in terms of classification accuracy and stability. Furthermore, we applied our method to lung cancer data analysis and found that it achieves higher predictive accuracy than L1 regularization, the old L1/2 penalized solver, and the Elastic net, while selecting fewer but informative biomarkers and pathways.
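The enhanced L1/2 solver has no off-the-shelf implementation to show here; as a point of reference, the sketch below reproduces only the two named baselines (L1 regularization and the Elastic net) on synthetic high-dimensional data:

```python
# Baselines only: L1 and Elastic net logistic regression on synthetic
# gene-expression-like data (many features, few samples). The enhanced
# L1/2 net itself is not implemented here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100, n_features=2000, n_informative=20, random_state=0)

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
enet = LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5, C=0.1,
                          max_iter=5000)

for name, model in (("L1", l1), ("Elastic net", enet)):
    acc = cross_val_score(model, X, y, cv=5).mean()
    model.fit(X, y)
    print(name, "CV accuracy:", round(acc, 3),
          "| selected features:", int((model.coef_ != 0).sum()))
```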


2021 ◽  
Vol 11 (2) ◽  
pp. 31-50
Author(s):  
S.L. Artemenkov

Network modeling, which has emerged in recent years, can be successfully applied to the analysis of relationships between measurable psychological variables. In this context, psychological variables are understood as directly affecting each other, rather than as consequences of a latent construct. The article describes regularization methods that can be used to efficiently estimate a sparse and interpretable network structure based on partial correlations of psychological indicators. An overview of the glasso regularization procedure with EBIC model selection for estimating an ordered sparse network of partial correlations is presented. Issues in performing this analysis in R with both normal and non-normal data distributions are considered, taking into account the influence of the hyperparameter that is set manually by the researcher. The approach is also of interest as a way to visualize possible causal connections between variables. This review fills a gap stemming from the lack of an accessible description of this approach in Russian; the approach is still uncommon in Russia and at the same time promising.
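The workflow described is R-based (e.g., the EBICglasso routine in the qgraph package); as a rough Python analogue, the sketch below estimates a sparse partial-correlation network with scikit-learn's graphical lasso, selecting the penalty by cross-validation rather than EBIC:

```python
# Rough Python analogue of the glasso workflow: penalty chosen by
# cross-validation (GraphicalLassoCV), not by EBIC as in the article.
import numpy as np
from sklearn.covariance import GraphicalLassoCV
from sklearn.datasets import make_sparse_spd_matrix

rng = np.random.default_rng(0)
prec_true = make_sparse_spd_matrix(10, alpha=0.9, random_state=0)  # sparse true precision
X = rng.multivariate_normal(np.zeros(10), np.linalg.inv(prec_true), size=500)

model = GraphicalLassoCV().fit(X)
prec = model.precision_
d = np.sqrt(np.diag(prec))
pcor = -prec / np.outer(d, d)          # partial correlations from the precision matrix
np.fill_diagonal(pcor, 1.0)
print("estimated edges:", int((np.abs(np.triu(pcor, 1)) > 1e-6).sum()))
```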


2019 ◽  
Vol 28 (8) ◽  
pp. 645-656 ◽  
Author(s):  
Cathy Geeson ◽  
Li Wei ◽  
Bryony Dean Franklin

Background: Medicines optimisation is a key role for hospital pharmacists, but with ever-increasing demands on services, there is a need to increase efficiency while maintaining patient safety. Objective: To develop a prediction tool, the Medicines Optimisation Assessment Tool (MOAT), to target patients most in need of pharmacists' input in hospital. Methods: Patients from adult medical wards at two UK hospitals were prospectively included in this cohort study. Data on medication-related problems (MRPs) were collected by pharmacists at the study sites as part of their routine daily clinical assessments. Data on potential risk factors, such as the number of comorbidities and use of 'high-risk' medicines, were collected retrospectively. Multivariable logistic regression modelling was used to determine the relationship between risk factors and the study outcome: preventable MRPs that were at least moderate in severity. The model was internally validated and a simplified electronic scoring system developed. Results: Among 1503 eligible admissions, 610 (40.6%) experienced the study outcome. Eighteen risk factors were preselected for MOAT development, with 11 variables retained in the final model. The MOAT demonstrated fair predictive performance (concordance index 0.66) and good calibration. Two clinically relevant decision thresholds (ie, the minimum predicted risk probabilities to justify pharmacists' input) were selected, with sensitivities of 90% and 66% (specificities 30% and 61%); these equate to positive predictive values of 47% and 54%, respectively. Decision curve analysis suggests that the MOAT has potential value in clinical practice in guiding decision-making. Conclusion: The MOAT has the potential to predict those patients most at risk of moderate or severe preventable MRPs, experienced by 41% of admissions. External validation is now required to establish predictive accuracy in a new group of patients.
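To make the threshold discussion concrete, a minimal sketch of how a decision threshold on predicted risk translates into sensitivity, specificity, and positive predictive value; the numbers below are simulated, not MOAT data:

```python
# Hedged sketch: sensitivity, specificity, and PPV at candidate decision
# thresholds. Outcomes and predicted risks are simulated placeholders.
import numpy as np

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.4, size=1500)                    # ~40% experience the outcome
risk = np.clip(0.4 + 0.25 * (y - 0.4) + rng.normal(0, 0.2, 1500), 0, 1)  # noisy risk

for threshold in (0.3, 0.5):                           # two candidate decision thresholds
    pred = risk >= threshold
    tp, fp = np.sum(pred & (y == 1)), np.sum(pred & (y == 0))
    fn, tn = np.sum(~pred & (y == 1)), np.sum(~pred & (y == 0))
    sens, spec, ppv = tp / (tp + fn), tn / (tn + fp), tp / (tp + fp)
    print(f"t={threshold}: sensitivity={sens:.2f} specificity={spec:.2f} PPV={ppv:.2f}")
```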


2020 ◽  
Vol 7 ◽  
Author(s):  
Bin Zhang ◽  
Qin Liu ◽  
Xiao Zhang ◽  
Shuyi Liu ◽  
Weiqi Chen ◽  
...  

Aim: Early detection of coronavirus disease 2019 (COVID-19) patients who are likely to develop worse outcomes is of great importance; it may help select patients at risk of rapid deterioration who require high-level monitoring and more aggressive treatment. We aimed to develop and validate a nomogram for predicting 30-day poor outcome of patients with COVID-19. Methods: The prediction model was developed in a primary cohort consisting of 233 patients with laboratory-confirmed COVID-19, with data collected from January 3 to March 20, 2020. We identified and integrated significant prognostic factors for 30-day poor outcome to construct a nomogram. The model was subjected to internal validation and to external validation with two separate cohorts of 110 and 118 cases, respectively. The performance of the nomogram was assessed with respect to its predictive accuracy, discriminative ability, and clinical usefulness. Results: In the primary cohort, the mean age of patients was 55.4 years and 129 (55.4%) were male. Prognostic factors contained in the clinical nomogram were age, lactic dehydrogenase, aspartate aminotransferase, prothrombin time, serum creatinine, serum sodium, fasting blood glucose, and D-dimer. The model was externally validated in two cohorts, achieving AUCs of 0.946 and 0.878, sensitivities of 100% and 79%, and specificities of 76.5% and 83.8%, respectively. Although adding the CT score to the clinical nomogram (clinical-CT nomogram) did not yield better predictive performance, decision curve analysis showed that the clinical-CT nomogram provided better clinical utility than the clinical nomogram. Conclusions: We established and validated a nomogram that can provide an individual prediction of 30-day poor outcome for COVID-19 patients. This practical prognostic model may help clinicians in decision making and reduce mortality.
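A hedged sketch of the decision-curve analysis mentioned above, using the standard net-benefit formula NB = TP/n - (FP/n) * pt/(1 - pt) over a range of threshold probabilities pt, on simulated placeholder predictions:

```python
# Hedged sketch: net benefit for decision-curve analysis. Inputs are
# simulated placeholders, not the study's cohorts.
import numpy as np

def net_benefit(y, proba, pt):
    """Net benefit at threshold probability pt: TP/n - FP/n * pt/(1 - pt)."""
    pred = proba >= pt
    n = len(y)
    tp = np.sum(pred & (y == 1))
    fp = np.sum(pred & (y == 0))
    return tp / n - fp / n * pt / (1 - pt)

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=500)
proba = np.clip(y * 0.3 + rng.beta(2, 4, size=500), 0, 1)   # toy predicted risks

for pt in (0.1, 0.2, 0.3):
    print(f"pt={pt}: net benefit={net_benefit(y, proba, pt):.3f}")
```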

