Why Overfitting is Not (Usually) a Problem in Partial Correlation Networks

2020 ◽  
Author(s):  
Donald Ray Williams ◽  
Josue E. Rodriguez

Network psychometrics is undergoing a time of methodological reflection. In part, this was spurred by the revelation that l1-regularization does not reduce spurious associations in partial correlation networks. In this work, we address another motivation for the widespread use of regularized estimation: the thought that it is needed to mitigate overfitting. We first clarify important aspects of overfitting and the bias-variance tradeoff that are especially relevant for the network literature, where the number of nodes or items in a psychometric scale is not large compared to the number of observations (i.e., a low p/n ratio). This revealed that bias and especially variance are most problematic at p/n ratios rarely encountered. We then introduce a nonregularized method, based on classical hypothesis testing, that fulfills two desiderata: (1) reducing or controlling the false positive rate and (2) quelling concerns of overfitting by providing accurate predictions. These were the primary motivations for initially adopting the graphical lasso (glasso). In several simulation studies, our nonregularized method provided more than competitive predictive performance, and, in many cases, outperformed glasso. It appears to be nonregularized, as opposed to regularized estimation, that best satisfies these desiderata. We then provide insights into using our methodology. Here we discuss the multiple comparisons problem in relation to prediction: stringent alpha levels, resulting in a sparse network, can deteriorate predictive accuracy. We end by emphasizing key advantages of our approach that make it ideal for both inference and prediction in network analysis.
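For concreteness, here is a minimal sketch of the general idea, assuming the nonregularized estimator amounts to Fisher-z significance tests on the sample partial correlations; it is an illustration of the technique, not the authors' implementation.

```python
# Minimal sketch: nonregularized partial-correlation network via classical
# hypothesis testing (Fisher z). Illustrative; not the authors' code.
import numpy as np
from scipy import stats

def partial_correlations(X):
    """Partial correlations from the inverse sample covariance (precision) matrix."""
    prec = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.diag(prec))
    pcor = -prec / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def threshold_network(X, alpha=0.05):
    """Keep edges whose Fisher-z test rejects rho_ij = 0 at level alpha."""
    n, p = X.shape
    pcor = partial_correlations(X)
    off = pcor.copy()
    np.fill_diagonal(off, 0.0)                 # avoid arctanh(1) on the diagonal
    z = np.arctanh(off)                        # Fisher z-transform
    se = 1.0 / np.sqrt(n - p - 1)              # SE when conditioning on p - 2 variables
    pvals = 2 * stats.norm.sf(np.abs(z) / se)
    keep = (pvals < alpha) & ~np.eye(p, dtype=bool)
    return pcor * keep

X = np.random.default_rng(0).normal(size=(400, 10))   # toy data with a low p/n ratio
network = threshold_network(X)
```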

2018 ◽  
Author(s):  
Donald Ray Williams ◽  
Mijke Rhemtulla ◽  
Anna Wysocki ◽  
Philippe Rast

An important goal for psychological science is developing methods to characterize relationships between variables. The customary approach uses structural equation models to connect latent factors to a number of observed measurements. More recently, regularized partial correlation networks have been proposed as an alternative approach for characterizing relationships among variables through covariances in the precision matrix. While the graphical lasso (glasso) method has emerged as the default network estimation method, it was optimized in fields outside of psychology with very different needs, such as high-dimensional data where the number of variables (p) exceeds the number of observations (n). In this paper, we describe the glasso method in the context of the fields where it was developed, and then we demonstrate that the advantages of regularization diminish in settings where psychological networks are often fitted (p ≪ n). We first show that improved properties of the precision matrix, such as eigenvalue estimation, and predictive accuracy with cross-validation are not always appreciable. We then introduce non-regularized methods based on multiple regression, after which we characterize performance with extensive simulations. Our results demonstrate that the non-regularized methods consistently outperform glasso with respect to limiting false positives, and they provide more consistent performance across sparsity levels, sample composition (p/n), and partial correlation size. We end by reviewing recent findings in the statistics literature suggesting that alternative methods often perform better than glasso, and by suggesting areas for future research in psychology.
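As an illustration of the node-wise multiple-regression idea, the sketch below regresses each node on all remaining nodes and combines the two weights of an edge into a partial correlation; details of the published estimators may differ.

```python
# Hedged sketch: partial correlations via node-wise multiple regression.
import numpy as np

def pcor_via_regression(X):
    X = X - X.mean(axis=0)                       # center, so no intercept is needed
    n, p = X.shape
    B = np.zeros((p, p))
    for i in range(p):
        others = [j for j in range(p) if j != i]
        coef, *_ = np.linalg.lstsq(X[:, others], X[:, i], rcond=None)
        B[i, others] = coef                      # OLS of node i on the other p - 1 nodes
    # pcor_ij = sign(b_ij) * sqrt(b_ij * b_ji); the clip guards tiny negative products
    pcor = np.sign(B) * np.sqrt(np.clip(B * B.T, 0.0, None))
    np.fill_diagonal(pcor, 1.0)
    return pcor
```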


Agronomy ◽  
2021 ◽  
Vol 11 (4) ◽  
pp. 761
Author(s):  
Daniel Bravo ◽  
Clara Leon-Moreno ◽  
Carlos Alberto Martínez ◽  
Viviana Marcela Varón-Ramírez ◽  
Gustavo Alfonso Araujo-Carrillo ◽  
...  

This study represents the first nationwide survey of the distribution of Cd content in cacao-growing soils in Colombia. The soil Cd distribution was analyzed using a cold/hotspot model. Moreover, both descriptive and predictive analytical tools were used to assess the key factors regulating Cd concentration, considering Cd content and eight soil variables in the cacao systems. A critical discussion is provided for four main cacao-growing districts. Our results suggest that a model using all the variables will always outperform one using Zn alone. The analyzed variables showed adequate predictive performance; nonetheless, that performance must be improved before a prediction method can be deployed nationwide. Results from the fitted graphical models showed that the largest associations (as measured by the partial correlation coefficients) were those between Cd and Zn. Ca had the second-largest partial correlation with Cd, and its predictive performance ranked second. Interestingly, there was high variability in the factors correlated with Cd in cacao-growing soils at the national level. Therefore, this study constitutes a baseline for forthcoming studies in the country and should be reinforced with an analysis of cadmium content in cacao beans.


2021 ◽  
pp. 1-10
Author(s):  
I. Krug ◽  
J. Linardon ◽  
C. Greenwood ◽  
G. Youssef ◽  
J. Treasure ◽  
...  

Background: Despite a wide range of proposed risk factors and theoretical models, prediction of eating disorder (ED) onset remains poor. This study undertook the first comparison of two machine learning (ML) approaches [penalised logistic regression (LASSO) and prediction rule ensembles (PREs)] with conventional logistic regression (LR) models to enhance prediction of ED onset and differential ED diagnoses from a range of putative risk factors. Method: Data were part of a European project and comprised 1402 participants: 642 ED patients [52% with anorexia nervosa (AN) and 40% with bulimia nervosa (BN)] and 760 controls. The Cross-Cultural Risk Factor Questionnaire, which retrospectively assesses a range of sociocultural and psychological ED risk factors occurring before the age of 12 years (46 predictors in total), was used. Results: All three statistical approaches had satisfactory model accuracy, with an average area under the curve (AUC) of 86% for predicting ED onset and 70% for predicting AN v. BN. Predictive performance was greatest for the two regression methods (LR and LASSO), although the PRE technique relied on fewer predictors with comparable accuracy. The individual risk factors differed depending on the outcome classification (EDs v. non-EDs and AN v. BN). Conclusions: Even though conventional LR performed comparably to the ML approaches in terms of predictive accuracy, the ML methods produced more parsimonious predictive models. ML approaches offer a viable way to modify screening practices for ED risk that balance accuracy against participant burden.
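A minimal sketch of the LR-versus-LASSO comparison with scikit-learn, run on synthetic stand-in data shaped like the study sample (1402 participants, 46 predictors); the settings are illustrative assumptions, not the study's configuration.

```python
# Hedged sketch: conventional logistic regression vs. L1-penalised (LASSO)
# logistic regression on synthetic stand-in data, scored by AUC and sparsity.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data shaped like the study: 1402 participants, 46 predictors
X, y = make_classification(n_samples=1402, n_features=46, n_informative=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
scale = StandardScaler().fit(X_tr)
X_tr, X_te = scale.transform(X_tr), scale.transform(X_te)

lr = LogisticRegression(penalty=None, max_iter=5000).fit(X_tr, y_tr)   # conventional LR
lasso = LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=20, cv=5).fit(X_tr, y_tr)

for name, m in (("LR", lr), ("LASSO", lasso)):
    auc = roc_auc_score(y_te, m.predict_proba(X_te)[:, 1])
    print(name, "AUC:", round(auc, 3), "| nonzero coefficients:", int((m.coef_ != 0).sum()))
```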


2004 ◽  
Vol 1 (1) ◽  
pp. 131-142
Author(s):  
Ljupčo Todorovski ◽  
Sašo Džeroski ◽  
Peter Ljubič

Both equation discovery and regression methods aim at inducing models of numerical data. While equation discovery methods are usually evaluated in terms of the comprehensibility of the induced model, the emphasis in evaluating regression methods is on their predictive accuracy. In this paper, we present Ciper, an efficient method for the discovery of polynomial equations, and empirically evaluate its predictive performance on standard regression tasks. The evaluation shows that polynomials compare favorably to the linear and piecewise regression models induced by existing state-of-the-art regression methods, in terms of both degree of fit and complexity.
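To make the comparison concrete, here is a generic sketch (not the Ciper algorithm itself) contrasting a linear model with a degree-2 polynomial model on data containing a genuinely polynomial signal:

```python
# Generic sketch: linear vs. polynomial regression on the same data,
# scored by out-of-sample fit. Not the Ciper equation-discovery method.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 2))
y = 1.5 * X[:, 0] ** 2 - X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
linear = LinearRegression().fit(X_tr, y_tr)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_tr, y_tr)

print("linear R^2:", round(r2_score(y_te, linear.predict(X_te)), 3))
print("degree-2 polynomial R^2:", round(r2_score(y_te, poly.predict(X_te)), 3))
```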


2019 ◽  
Author(s):  
Donald Salami ◽  
Carla Alexandra Sousa ◽  
Maria do Rosário Oliveira Martins ◽  
César Capinha

The geographical spread of dengue is a global public health concern. This is largely mediated by the importation of dengue from endemic to non-endemic areas via the increasing connectivity of the global air transport network. The dynamic nature and intrinsic heterogeneity of the air transport network make it challenging to predict dengue importation. Here, we explore the capabilities of state-of-the-art machine learning algorithms to predict dengue importation. We trained four machine learning classifier algorithms, using six years of historical dengue importation data for 21 countries in Europe, connectivity indices mediating importation, and air transport network centrality measures. Predictive performance of the classifiers was evaluated using the area under the receiver operating characteristic curve, sensitivity, and specificity measures. Finally, we applied practical model-agnostic methods to provide an in-depth explanation of our optimal model's predictions on a global and local scale. Our best performing model achieved high predictive accuracy, with an area under the receiver operating characteristic curve score of 0.94 and a maximized sensitivity score of 0.88. The predictor variables identified as most important were the source country's dengue incidence rate, population size, and volume of air passengers. Network centrality measures, describing the positioning of European countries within the air travel network, were also influential to the predictions. We demonstrated the high predictive performance of a machine learning model in predicting dengue importation and the utility of model-agnostic methods to offer a comprehensive understanding of the reasons behind the predictions. Similar approaches can be utilized in the development of an operational early warning surveillance system for dengue importation.
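A hedged sketch of the evaluation protocol described above (AUC plus sensitivity and specificity at a chosen probability threshold), using a placeholder classifier and simulated data rather than the study's model:

```python
# Hedged sketch: classifier evaluation with AUC, sensitivity, and specificity.
# The classifier and data are placeholders, not the study's model or data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.8], random_state=0)  # imbalanced classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
print("AUC:", round(roc_auc_score(y_te, proba), 3))

threshold = 0.35  # lowered below 0.5 to favour sensitivity, an assumed tuning choice
tn, fp, fn, tp = confusion_matrix(y_te, proba >= threshold).ravel()
print("sensitivity:", round(tp / (tp + fn), 3), "specificity:", round(tn / (tn + fp), 3))
```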


2021 ◽  
Vol 10 ◽  
Author(s):  
Yuan Li ◽  
Xiaolan Zhang ◽  
Yan Gao ◽  
Chunliang Shang ◽  
Bo Yu ◽  
...  

Background: High grade serous ovarian cancer (HGSOC) is the most common subtype of ovarian cancer. Although platinum-based chemotherapy has been the cornerstone of HGSOC treatment, nearly 25% of patients have an interval of less than 6 months since the last platinum chemotherapy, referred to as platinum resistance. Currently, no precise tools to predict platinum resistance have been developed. Methods: Ninety-nine HGSOC patients who had completed cytoreductive surgery and platinum-based chemotherapy at Peking University Third Hospital from 2018 to 2019 were enrolled. Whole-genome sequencing (WGS) and whole-exome sequencing (WES) were performed on the collected tumor tissue samples to establish a platinum-resistance predictor in a discovery cohort of 57 patients, which was further validated in another 42 HGSOC patients. Results: A high prevalence of alterations in the DNA damage repair (DDR) pathway, including BRCA1/2, was identified in both the platinum-sensitive and platinum-resistant HGSOC patients. Compared with the resistant subgroup, there was a trend toward a higher prevalence of homologous recombination deficiency (HRD) in the platinum-sensitive subgroup (78.95% vs. 47.37%, p=0.0646). Based on the HRD score, microhomology insertions and deletions (MHID), copy number change load, duplication load of 1–100 kb, single nucleotide variant load, and eight other mutational signatures, a combined predictor of platinum resistance, termed DRDscore, was established. DRDscore outperformed previously reported biomarkers in predicting platinum sensitivity, with a predictive accuracy of 0.860 at a threshold of 0.7584. The predictive performance of DRDscore was validated in an independent cohort of 42 HGSOC patients with a sensitivity of 90.9%. Conclusions: A multi-genomic signature-based analysis enabled the prediction of initial platinum resistance in advanced HGSOC patients, which may serve as a novel assessment of platinum resistance, provide therapeutic guidance, and merit further validation.


2015 ◽  
Vol 2015 ◽  
pp. 1-7 ◽  
Author(s):  
Hai-Hui Huang ◽  
Yong Liang ◽  
Xiao-Ying Liu

Identifying biomarkers and signaling pathways is a critical step in genomic studies, in which regularization is a widely used feature selection approach. However, most regularizers are based on the L1-norm, and their results are often insufficiently sparse and interpretable as well as asymptotically biased, especially in genomic research. Recently, a large amount of molecular interaction information about disease-related biological processes has become available through various databases covering many aspects of biological systems. In this paper, we use an enhanced L1/2 penalized solver for a network-constrained logistic regression model, called the enhanced L1/2 net, in which the predictors are based on gene-expression data combined with biological network knowledge. Extensive simulation studies showed that our proposed approach outperforms L1 regularization, the old L1/2 penalized solver, and the Elastic net in terms of classification accuracy and stability. Furthermore, we applied our method to lung cancer data analysis and found that it achieves higher predictive accuracy than L1 regularization, the old L1/2 penalized solver, and the Elastic net, while selecting fewer but informative biomarkers and pathways.
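The enhanced L1/2 solver has no off-the-shelf implementation to show here; as a point of reference, the sketch below reproduces only the two named baselines (L1 regularization and the Elastic net) on synthetic high-dimensional data:

```python
# Baselines only: L1 and Elastic net logistic regression on synthetic
# gene-expression-like data (many features, few samples). The enhanced
# L1/2 net itself is not implemented here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100, n_features=2000, n_informative=20, random_state=0)

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
enet = LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5, C=0.1,
                          max_iter=5000)

for name, model in (("L1", l1), ("Elastic net", enet)):
    acc = cross_val_score(model, X, y, cv=5).mean()
    model.fit(X, y)
    print(name, "CV accuracy:", round(acc, 3),
          "| selected features:", int((model.coef_ != 0).sum()))
```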


2021 ◽  
Vol 11 (2) ◽  
pp. 31-50
Author(s):  
S.L. Artemenkov

Network modeling, which has emerged in recent years, can be successfully applied to the analysis of relationships between measurable psychological variables. In this context, psychological variables are understood as directly affecting each other, rather than as consequences of a latent construct. The article describes regularization methods that can be used to efficiently estimate a sparse and interpretable network structure based on partial correlations of psychological indicators. An overview of the glasso regularization procedure with EBIC model selection for estimating an ordered sparse network of partial correlations is presented. Issues in performing this analysis in R with both normal and non-normal data distributions are considered, taking into account the influence of the hyperparameter that is set manually by the researcher. The approach is also of interest as a way to visualize possible causal connections between variables. This review fills a gap stemming from the lack of an accessible description of this approach in Russian; the approach is still uncommon in Russia and at the same time promising.
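The workflow described is R-based (e.g., the EBICglasso routine in the qgraph package); as a rough Python analogue, the sketch below estimates a sparse partial-correlation network with scikit-learn's graphical lasso, selecting the penalty by cross-validation rather than EBIC:

```python
# Rough Python analogue of the glasso workflow: penalty chosen by
# cross-validation (GraphicalLassoCV), not by EBIC as in the article.
import numpy as np
from sklearn.covariance import GraphicalLassoCV
from sklearn.datasets import make_sparse_spd_matrix

rng = np.random.default_rng(0)
prec_true = make_sparse_spd_matrix(10, alpha=0.9, random_state=0)  # sparse true precision
X = rng.multivariate_normal(np.zeros(10), np.linalg.inv(prec_true), size=500)

model = GraphicalLassoCV().fit(X)
prec = model.precision_
d = np.sqrt(np.diag(prec))
pcor = -prec / np.outer(d, d)          # partial correlations from the precision matrix
np.fill_diagonal(pcor, 1.0)
print("estimated edges:", int((np.abs(np.triu(pcor, 1)) > 1e-6).sum()))
```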


2019 ◽  
Vol 28 (8) ◽  
pp. 645-656 ◽  
Author(s):  
Cathy Geeson ◽  
Li Wei ◽  
Bryony Dean Franklin

Background: Medicines optimisation is a key role for hospital pharmacists, but with ever-increasing demands on services, there is a need to increase efficiency while maintaining patient safety. Objective: To develop a prediction tool, the Medicines Optimisation Assessment Tool (MOAT), to target patients most in need of pharmacists' input in hospital. Methods: Patients from adult medical wards at two UK hospitals were prospectively included in this cohort study. Data on medication-related problems (MRPs) were collected by pharmacists at the study sites as part of their routine daily clinical assessments. Data on potential risk factors, such as the number of comorbidities and use of 'high-risk' medicines, were collected retrospectively. Multivariable logistic regression modelling was used to determine the relationship between risk factors and the study outcome: preventable MRPs that were at least moderate in severity. The model was internally validated and a simplified electronic scoring system developed. Results: Among 1503 eligible admissions, 610 (40.6%) experienced the study outcome. Eighteen risk factors were preselected for MOAT development, with 11 variables retained in the final model. The MOAT demonstrated fair predictive performance (concordance index 0.66) and good calibration. Two clinically relevant decision thresholds (ie, the minimum predicted risk probabilities to justify pharmacists' input) were selected, with sensitivities of 90% and 66% (specificities 30% and 61%); these equate to positive predictive values of 47% and 54%, respectively. Decision curve analysis suggests that the MOAT has potential value in clinical practice in guiding decision-making. Conclusion: The MOAT has the potential to predict those patients most at risk of moderate or severe preventable MRPs, experienced by 41% of admissions. External validation is now required to establish predictive accuracy in a new group of patients.
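To make the threshold discussion concrete, a minimal sketch of how a decision threshold on predicted risk translates into sensitivity, specificity, and positive predictive value; the numbers below are simulated, not MOAT data:

```python
# Hedged sketch: sensitivity, specificity, and PPV at candidate decision
# thresholds. Outcomes and predicted risks are simulated placeholders.
import numpy as np

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.4, size=1500)                    # ~40% experience the outcome
risk = np.clip(0.4 + 0.25 * (y - 0.4) + rng.normal(0, 0.2, 1500), 0, 1)  # noisy risk

for threshold in (0.3, 0.5):                           # two candidate decision thresholds
    pred = risk >= threshold
    tp, fp = np.sum(pred & (y == 1)), np.sum(pred & (y == 0))
    fn, tn = np.sum(~pred & (y == 1)), np.sum(~pred & (y == 0))
    sens, spec, ppv = tp / (tp + fn), tn / (tn + fp), tp / (tp + fp)
    print(f"t={threshold}: sensitivity={sens:.2f} specificity={spec:.2f} PPV={ppv:.2f}")
```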


2020 ◽  
Vol 7 ◽  
Author(s):  
Bin Zhang ◽  
Qin Liu ◽  
Xiao Zhang ◽  
Shuyi Liu ◽  
Weiqi Chen ◽  
...  

Aim: Early detection of coronavirus disease 2019 (COVID-19) patients who are likely to develop worse outcomes is of great importance; it may help select patients at risk of rapid deterioration who require high-level monitoring and more aggressive treatment. We aimed to develop and validate a nomogram for predicting 30-day poor outcome of patients with COVID-19. Methods: The prediction model was developed in a primary cohort consisting of 233 patients with laboratory-confirmed COVID-19, with data collected from January 3 to March 20, 2020. We identified and integrated significant prognostic factors for 30-day poor outcome to construct a nomogram. The model was subjected to internal validation and to external validation with two separate cohorts of 110 and 118 cases, respectively. The performance of the nomogram was assessed with respect to its predictive accuracy, discriminative ability, and clinical usefulness. Results: In the primary cohort, the mean age of patients was 55.4 years and 129 (55.4%) were male. Prognostic factors contained in the clinical nomogram were age, lactic dehydrogenase, aspartate aminotransferase, prothrombin time, serum creatinine, serum sodium, fasting blood glucose, and D-dimer. The model was externally validated in two cohorts, achieving AUCs of 0.946 and 0.878, sensitivities of 100% and 79%, and specificities of 76.5% and 83.8%, respectively. Although adding the CT score to the clinical nomogram (clinical-CT nomogram) did not yield better predictive performance, decision curve analysis showed that the clinical-CT nomogram provided better clinical utility than the clinical nomogram. Conclusions: We established and validated a nomogram that can provide an individual prediction of 30-day poor outcome for COVID-19 patients. This practical prognostic model may help clinicians in decision making and reduce mortality.
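A hedged sketch of the decision-curve analysis mentioned above, using the standard net-benefit formula NB = TP/n - (FP/n) * pt/(1 - pt) over a range of threshold probabilities pt, on simulated placeholder predictions:

```python
# Hedged sketch: net benefit for decision-curve analysis. Inputs are
# simulated placeholders, not the study's cohorts.
import numpy as np

def net_benefit(y, proba, pt):
    """Net benefit at threshold probability pt: TP/n - FP/n * pt/(1 - pt)."""
    pred = proba >= pt
    n = len(y)
    tp = np.sum(pred & (y == 1))
    fp = np.sum(pred & (y == 0))
    return tp / n - fp / n * pt / (1 - pt)

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=500)
proba = np.clip(y * 0.3 + rng.beta(2, 4, size=500), 0, 1)   # toy predicted risks

for pt in (0.1, 0.2, 0.3):
    print(f"pt={pt}: net benefit={net_benefit(y, proba, pt):.3f}")
```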

