Balancing Complex Signals for Robust Predictive Modeling

Robust predictive modeling is the process of creating, validating, and testing models to obtain better prediction outcomes. Datasets usually contain outliers whose trend deviates from that of most data points. Conventionally, outliers are removed from the training dataset during preprocessing before building predictive models. Such models, however, may have poor predictive performance on unseen testing data that involves outliers. In modern machine learning, outliers are regarded as complex signals because of their significant role, and their removal from the training dataset is discouraged. Models trained in modern regimes are interpolated (overtrained) by increasing their complexity to treat outliers locally. However, such models become inefficient as they require more training due to the inclusion of outliers, and this also compromises the models’ accuracy. This work proposes a novel complex signal balancing technique that may be used during preprocessing to incorporate the maximum number of complex signals (outliers) in the training dataset. The proposed approach determines the optimal value for the maximum possible inclusion of complex signals for training with the highest performance of the model in terms of accuracy, time, and complexity. The experimental results show that models trained after preprocessing with the proposed technique achieve higher predictive accuracy with improved execution time and lower complexity than traditional predictive modeling.

A proof-of-concept study applying machine learning methods to putative risk factors for eating disorders: results from the multi-centre European project on healthy eating

Psychological Medicine ◽

10.1017/s003329172100489x ◽

2021 ◽

pp. 1-10

Author(s):

I. Krug ◽

J. Linardon ◽

C. Greenwood ◽

G. Youssef ◽

J. Treasure ◽

...

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Logistic Regression ◽

Predictive Accuracy ◽

Area Under The Curve ◽

Prediction Rule ◽

Predictive Performance ◽

Individual Risk ◽

European Project ◽

Wide Range

Abstract Background Despite a wide range of proposed risk factors and theoretical models, prediction of eating disorder (ED) onset remains poor. This study undertook the first comparison of two machine learning (ML) approaches [penalised logistic regression (LASSO), and prediction rule ensembles (PREs)] to conventional logistic regression (LR) models to enhance prediction of ED onset and differential ED diagnoses from a range of putative risk factors. Method Data were part of a European Project and comprised 1402 participants, 642 ED patients [52% with anorexia nervosa (AN) and 40% with bulimia nervosa (BN)] and 760 controls. The Cross-Cultural Risk Factor Questionnaire, which assesses retrospectively a range of sociocultural and psychological ED risk factors occurring before the age of 12 years (46 predictors in total), was used. Results All three statistical approaches had satisfactory model accuracy, with an average area under the curve (AUC) of 86% for predicting ED onset and 70% for predicting AN v. BN. Predictive performance was greatest for the two regression methods (LR and LASSO), although the PRE technique relied on fewer predictors with comparable accuracy. The individual risk factors differed depending on the outcome classification (EDs v. non-EDs and AN v. BN). Conclusions Even though the conventional LR performed comparably to the ML approaches in terms of predictive accuracy, the ML methods produced more parsimonious predictive models. ML approaches offer a viable way to modify screening practices for ED risk that balance accuracy against participant burden.
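
An AUC such as the 86% reported above can be read as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control. The sketch below illustrates that reading on made-up scores; the function name `auc` and the toy values are illustrative assumptions and are not taken from the study.

```python
def auc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the usual rank-based AUC.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Toy predicted risks (hypothetical, not data from the study).
cases = [0.9, 0.8, 0.4]
controls = [0.7, 0.3, 0.2, 0.1]
print(auc(cases, controls))
```

The pairwise form is quadratic in the sample size, so it suits small illustrations; a rank-based implementation would be preferable on a dataset of 1402 participants.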

Discovery of polynomial equations for regression

Advances in Methodology and Statistics ◽

10.51936/uogl8142 ◽

2004 ◽

Vol 1 (1) ◽

pp. 131-142

Author(s):

Ljupčo Todorovski ◽

Sašo Džeroski ◽

Peter Ljubič

Keyword(s):

Efficient Method ◽

Regression Models ◽

Predictive Accuracy ◽

State Of The Art ◽

Numerical Data ◽

Predictive Performance ◽

Polynomial Equations ◽

Regression Methods ◽

Piecewise Regression ◽

Standard Regression

Both equation discovery and regression methods aim at inducing models of numerical data. While the equation discovery methods are usually evaluated in terms of comprehensibility of the induced model, the emphasis of the regression methods evaluation is on their predictive accuracy. In this paper, we present Ciper, an efficient method for discovery of polynomial equations and empirically evaluate its predictive performance on standard regression tasks. The evaluation shows that polynomials compare favorably to linear and piecewise regression models, induced by the existing state-of-the-art regression methods, in terms of degree of fit and complexity.

Predicting dengue importation into Europe, using machine learning and model-agnostic methods

10.1101/19013383 ◽

2019 ◽

Author(s):

Donald Salami ◽

Carla Alexandra Sousa ◽

Maria do Rosário Oliveira Martins ◽

César Capinha

Keyword(s):

Machine Learning ◽

Operating Characteristic ◽

Predictive Accuracy ◽

Predictive Performance ◽

Machine Learning Algorithms ◽

Transport Network ◽

Air Transport ◽

Health Concern ◽

Centrality Measures ◽

Network Centrality

ABSTRACT The geographical spread of dengue is a global public health concern. This is largely mediated by the importation of dengue from endemic to non-endemic areas via the increasing connectivity of the global air transport network. The dynamic nature and intrinsic heterogeneity of the air transport network make it challenging to predict dengue importation. Here, we explore the capabilities of state-of-the-art machine learning algorithms to predict dengue importation. We trained four machine learning classifier algorithms, using 6 years of historical dengue importation data for 21 countries in Europe, connectivity indices mediating importation, and air transport network centrality measures. Predictive performance for the classifiers was evaluated using the area under the receiver operating characteristic curve, sensitivity, and specificity measures. Finally, we applied practical model-agnostic methods to provide an in-depth explanation of our optimal model’s predictions on a global and local scale. Our best performing model achieved high predictive accuracy, with an area under the receiver operating characteristic score of 0.94 and a maximized sensitivity score of 0.88. The predictor variables identified as most important were the source country’s dengue incidence rate, population size, and volume of air passengers. Network centrality measures, describing the positioning of European countries within the air travel network, were also influential to the predictions. We demonstrated the high predictive performance of a machine learning model in predicting dengue importation and the utility of the model-agnostic methods to offer a comprehensive understanding of the reasons behind the predictions. Similar approaches can be utilized in the development of an operational early warning surveillance system for dengue importation.

Machine learning-based prediction system for rainfall-induced landslides in Benguet First Engineering District

10.31219/osf.io/csx6r ◽

2019 ◽

Author(s):

Zanya Reubenne D. Omadlao ◽

Nica Magdalena A. Tuguinay ◽

Ricarido Maglaqui Saturay

Keyword(s):

Machine Learning ◽

Daily Rainfall ◽

Predictive Performance ◽

Data Sets ◽

Prediction System ◽

True Positive ◽

Rainfall Thresholds ◽

Cumulative Rainfall ◽

Testing Data ◽

Positive Rate

A machine learning-based prediction system for rainfall-induced landslides in Benguet First Engineering District is proposed to address the landslide risk due to the climate and topography of Benguet province. It is intended to improve the decision support system for road management with regard to landslides, as implemented by the Department of Public Works and Highways Benguet First District Engineering Office. Supervised classification was applied to daily rainfall and landslide data for the Benguet First Engineering District covering the years 2014 to 2018 using scikit-learn. Various forms of cumulative rainfall values were used to predict landslide occurrence for a given day. Following typical machine learning workflows, the rainfall-landslide data set was divided into training and testing data sets. Machine learning algorithms such as K-Nearest Neighbors, Gaussian Naïve Bayes, Support Vector Machine, Logistic Regression, Random Forest, Decision Tree, and AdaBoost were trained using the training data sets, and the trained models were used to make predictions based on the testing data sets. Predictive performance of the models on the testing data sets was compared using true positive rates, false positive rates, and the area under the Receiver Operating Characteristic Curve. Predictive performance of these models was then compared to the 1-day cumulative rainfall thresholds commonly used for landslide predictions. Among the machine learning models evaluated, Gaussian Naïve Bayes has the best performance, with mean false positive rate, true positive rate, and area under the curve scores of 7%, 76%, and 84% respectively. It also performs better than the 1-day cumulative rainfall thresholds. This research demonstrates the potential of machine learning for identifying temporal patterns in rainfall-induced landslides using minimal data input: daily rainfall from a single synoptic station and highway maintenance records. Such an approach may be tested and applied to similar problems in the field of disaster risk reduction and management.
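
The cumulative rainfall features and the threshold baseline described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the rainfall series, the landslide labels, the 3-day window, and the 100 mm threshold are all hypothetical values chosen for the example.

```python
def cumulative_rainfall(daily_mm, window):
    """Rainfall summed over the current day and the preceding window-1 days."""
    totals = []
    for i in range(len(daily_mm)):
        start = max(0, i - window + 1)  # shorter window at the start of the record
        totals.append(sum(daily_mm[start:i + 1]))
    return totals


def rates(predicted, actual):
    """Return (true positive rate, false positive rate) for boolean sequences."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    positives = sum(actual)
    negatives = len(actual) - positives
    return tp / positives, fp / negatives


# Hypothetical daily rainfall (mm) and landslide occurrence per day.
rain = [0, 5, 60, 80, 10, 0, 0, 120, 30, 0]
landslide = [False, False, False, True, True, False, False, True, False, False]

three_day = cumulative_rainfall(rain, 3)
alerts = [total >= 100 for total in three_day]  # assumed threshold rule
tpr, fpr = rates(alerts, landslide)
print(three_day, tpr, fpr)
```

In a classifier-based setup, several such windows would typically be computed and passed as feature columns to a model, with the fixed threshold rule kept as the baseline for comparison.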

Does Tail Label Help for Large-Scale Multi-Label Learning

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/395 ◽

2018 ◽

Cited By ~ 5

Author(s):

Tong Wei ◽

Yu-Feng Li

Keyword(s):

Large Scale ◽

Performance Metrics ◽

Learning Algorithm ◽

Predictive Performance ◽

Low Complexity ◽

Tail Distribution ◽

Prediction Time ◽

Unseen Data ◽

Model Size ◽

Fast Prediction

Large-scale multi-label learning annotates relevant labels for unseen data from a huge number of candidate labels. It is well known that in large-scale multi-label learning, labels exhibit a long tail distribution in which a significant fraction of labels are tail labels. Nonetheless, how tail labels affect the performance metrics in large-scale multi-label learning has not been explicitly quantified. In this paper, we show that whether labels are randomly missing or misclassified, tail labels have much less impact than common labels on commonly used performance metrics (Top-$k$ precision and nDCG@$k$). Based on this observation, we develop a low-complexity large-scale multi-label learning algorithm that facilitates fast prediction and compact models by trimming tail labels adaptively. Experiments clearly verify that both the prediction time and the model size are significantly reduced without sacrificing much predictive performance for state-of-the-art approaches.
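
To make the idea of trimming tail labels concrete, the sketch below drops every label that occurs fewer than a fixed number of times. This fixed-frequency cutoff is a simplified stand-in and is not the adaptive trimming procedure of the paper; the function name `trim_tail_labels`, the `min_count` parameter, and the toy label sets are assumptions made for illustration.

```python
from collections import Counter


def trim_tail_labels(label_sets, min_count):
    """Remove labels that appear in fewer than min_count instances.

    Returns the trimmed label lists and the set of labels that were kept.
    """
    counts = Counter(label for labels in label_sets for label in set(labels))
    kept = {label for label, count in counts.items() if count >= min_count}
    trimmed = [sorted(set(labels) & kept) for labels in label_sets]
    return trimmed, kept


# Hypothetical multi-label annotations, one list per instance.
data = [
    ["news", "sports"],
    ["news"],
    ["news", "chess"],
    ["sports", "curling"],
    ["news", "sports"],
]
trimmed, kept = trim_tail_labels(data, min_count=2)
print(sorted(kept), trimmed)
```

With fewer candidate labels, a one-model-per-label learner stores and evaluates fewer models, which is the source of the reductions in model size and prediction time that the abstract reports.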

Development of a Genomic Signatures-Based Predictor of Initial Platinum-Resistance in Advanced High-Grade Serous Ovarian Cancer Patients

Frontiers in Oncology ◽

10.3389/fonc.2020.625866 ◽

2021 ◽

Vol 10 ◽

Author(s):

Yuan Li ◽

Xiaolan Zhang ◽

Yan Gao ◽

Chunliang Shang ◽

Bo Yu ◽

...

Keyword(s):

Ovarian Cancer ◽

Predictive Accuracy ◽

Predictive Performance ◽

Platinum Resistance ◽

High Grade ◽

Serous Ovarian Cancer ◽

Single Nucleotide Variants ◽

Tissue Samples ◽

Platinum Sensitive ◽

Platinum Based Chemotherapy

Background High grade serous ovarian cancer (HGSOC) is the most common subtype of ovarian cancer. Although platinum-based chemotherapy has been the cornerstone of HGSOC treatment, nearly 25% of patients have an interval of less than 6 months since the last platinum chemotherapy, referred to as platinum resistance. Currently, no precise tools to predict platinum resistance have been developed. Methods Ninety-nine HGSOC patients, who had finished cytoreductive surgery and platinum-based chemotherapy in Peking University Third Hospital from 2018 to 2019, were enrolled. Whole-genome sequencing (WGS) and whole-exome sequencing (WES) were performed on the collected tumor tissue samples to establish a platinum-resistance predictor in a discovery cohort of 57 patients, which was further validated in another 42 HGSOC patients. Results A high prevalence of alterations in the DNA damage repair (DDR) pathway, including BRCA1/2, was identified in both the platinum-sensitive and platinum-resistant HGSOC patients. Compared with the resistant subgroup, there was a trend of higher prevalence of homologous recombination deficiency (HRD) in the platinum-sensitive subgroup (78.95% vs. 47.37%, p=0.0646). Based on the HRD score, microhomology insertions and deletions (MHID), copy number changes load, duplication load of 1–100 kb, single nucleotide variants load, and eight other mutational signatures, a combined predictor of platinum resistance, named DRDscore, was established. DRDscore outperformed the previously reported biomarkers in predicting platinum sensitivity, with a predictive accuracy of 0.860 at a threshold of 0.7584. The predictive performance of DRDscore was validated in an independent cohort of 42 HGSOC patients with a sensitivity of 90.9%. Conclusions A multi-genomic signature-based analysis enabled the prediction of initial platinum resistance in advanced HGSOC patients, which may serve as a novel assessment of platinum resistance, provide therapeutic guidance, and merit further validation.

Multimodal signalling in the North American barn swallow: a phenotype network approach

Proceedings of The Royal Society B Biological Sciences ◽

10.1098/rspb.2015.1574 ◽

2015 ◽

Vol 282 (1816) ◽

pp. 20151574 ◽

Cited By ~ 33

Author(s):

Matthew R. Wilkins ◽

Daizaburo Shizuka ◽

Maxwell B. Joseph ◽

Joanna K. Hubbard ◽

Rebecca J. Safran

Keyword(s):

Mate Choice ◽

North American ◽

Communication Systems ◽

Animal Communication ◽

Complex Signal ◽

Network Approach ◽

Barn Swallow ◽

Complex Signals ◽

The North ◽

Phenotype Network

Complex signals, involving multiple components within and across modalities, are common in animal communication. However, decomposing complex signals into traits and their interactions remains a fundamental challenge for studies of phenotype evolution. We apply a novel phenotype network approach for studying complex signal evolution in the North American barn swallow (Hirundo rustica erythrogaster). We integrate model testing with correlation-based phenotype networks to infer the contributions of female mate choice and male–male competition to the evolution of barn swallow communication. Overall, the best predictors of mate choice were distinct from those for competition, while moderate functional overlap suggests males and females use some of the same traits to assess potential mates and rivals. We interpret model results in the context of a network of traits, and suggest this approach allows researchers a more nuanced view of trait clustering patterns that informs new hypotheses about the evolution of communication systems.

Detection of misinformation on garlic and COVID-19 in Twitter: A machine learning-based approach (Preprint)

10.2196/preprints.33056 ◽

2021 ◽

Author(s):

Myeong Gyu Kim ◽

Jae Hyun Kim ◽

Kyungim Kim

Keyword(s):

Machine Learning ◽

Social Media ◽

Latent Dirichlet Allocation ◽

Predictive Performance ◽

Machine Learning Algorithms ◽

Training Dataset ◽

Polynomial Kernel ◽

Support Vector ◽

Accurate Information ◽

Probability Number

BACKGROUND Garlic-related misinformation is prevalent whenever a virus outbreak occurs. Again, with the outbreak of coronavirus disease 2019 (COVID-19), garlic-related misinformation is spreading through social media sites, including Twitter. Machine learning-based approaches can be used to detect misinformation in large volumes of tweets. OBJECTIVE This study aimed to develop machine learning algorithms for detecting misinformation on garlic and COVID-19 in Twitter. METHODS This study used 5,929 original tweets mentioning garlic and COVID-19. Tweets were manually labeled as misinformation, accurate information, and others. We tested the following algorithms: k-nearest neighbors; random forest; support vector machine (SVM) with linear, radial, and polynomial kernels; and neural network. Features for machine learning included user-based features (verified account, user type, number of followers, and follower rate) and text-based features (uniform resource locator, negation, sentiment score, Latent Dirichlet Allocation topic probability, number of retweets, and number of favorites). The model with the highest accuracy in the training dataset (70% of the overall dataset) was tested using a test dataset (30% of the overall dataset). Predictive performance was measured using overall accuracy, sensitivity, specificity, and balanced accuracy. RESULTS The SVM with the polynomial kernel showed the highest accuracy of 0.670. The model also showed a balanced accuracy of 0.757, sensitivity of 0.819, and specificity of 0.696 for misinformation. Important features in the misinformation and accurate information classes included topic 4 (common myths), topic 13 (garlic-specific myths), number of followers, topic 11 (misinformation on social media), and follower rate. Topic 3 (cooking recipes) was the most important feature in the others class. CONCLUSIONS Our SVM model showed good performance in detecting misinformation. The results of our study will help detect misinformation related to garlic and COVID-19. It could also be applied to prevent misinformation related to dietary supplements in the event of a future outbreak of a disease other than COVID-19.
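
For a single class, balanced accuracy is the mean of sensitivity and specificity, which lets the figures reported for the misinformation class be checked against each other. The sketch below does that arithmetic; the function name is an illustrative assumption, and the inputs are the rounded values from the abstract.

```python
def balanced_accuracy(sensitivity, specificity):
    """Mean of sensitivity and specificity for one class treated as positive."""
    return (sensitivity + specificity) / 2


# Rounded values reported for the misinformation class.
value = balanced_accuracy(0.819, 0.696)
print(round(value, 4))
```

The result is about 0.7575, consistent with the reported 0.757 once rounding of the published sensitivity and specificity is taken into account.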

Machine Learning Readmission Risk Modeling: A Pediatric Case Study

BioMed Research International ◽

10.1155/2019/8532892 ◽

2019 ◽

Vol 2019 ◽

pp. 1-9 ◽

Cited By ~ 3

Author(s):

Patricio Wolff ◽

Manuel Graña ◽

Sebastián A. Ríos ◽

Maria Begoña Yarza

Keyword(s):

Machine Learning ◽

Multilayer Perceptron ◽

Naive Bayes ◽

Class Imbalance ◽

Predictive Performance ◽

Naïve Bayes ◽

Distribution Model ◽

Training Dataset ◽

Support Vector ◽

Pediatric Hospital

Background. Hospital readmission prediction in pediatric hospitals has received little attention. Studies have focused on readmission frequency analysis stratified by disease and demographic/geographic characteristics, but there are no predictive modeling approaches, which may be useful to identify preventable readmissions that constitute a major portion of the cost attributed to readmissions. Objective. To assess the all-cause readmission predictive performance achieved by machine learning techniques in the emergency department of a pediatric hospital in Santiago, Chile. Materials. An all-cause admissions dataset was collected over six consecutive years in a pediatric hospital in Santiago, Chile. The variables collected are the same as those used to determine the administrative cost of the child’s treatment. Methods. Retrospective predictive analysis of 30-day readmission was formulated as a binary classification problem. We report classification results achieved with various model building approaches after data curation and preprocessing for correction of class imbalance. We compute repeated cross-validation (RCV) with a decreasing number of folds to assess performance and sensitivity to the effect of imbalance in the test set and to training set size. Results. The increase in recall due to SMOTE class imbalance correction is large and statistically significant. The Naive Bayes (NB) approach achieves the best AUC (0.65); however, the shallow multilayer perceptron has the best PPV and f-score (5.6 and 10.2, resp.). The NB and support vector machines (SVM) give comparable results if we consider AUC, PPV, and f-score ranking for all RCV experiments. The high recall of the deep multilayer perceptron is due to a high false positive ratio. There is no detectable effect of the number of folds in the RCV on the predictive performance of the algorithms. Conclusions. We recommend the use of Naive Bayes (NB) with a Gaussian distribution model as the most robust modeling approach for pediatric readmission prediction, achieving the best results across all training dataset sizes. The results show that the approach could be applied to detect preventable readmissions.
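
SMOTE corrects class imbalance by creating synthetic minority-class samples on the line segment between a minority point and one of its nearest minority neighbours. The sketch below shows that core step in plain Python. It is a simplified illustration, not the implementation used in the study; the function signature, the choice of `k`, the seed, and the toy points are assumptions.

```python
import random


def smote(minority, n_synthetic, k=2, seed=0):
    """Generate synthetic minority points by interpolating toward near neighbours.

    minority: list of equal-length numeric tuples (at least two points).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        idx = rng.randrange(len(minority))
        base = minority[idx]
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != idx]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class samples with two features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_synthetic=6)
print(new_points)
```

Oversampling of this kind is normally applied to the training folds only, so that the test folds keep the original class balance when recall and PPV are measured.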

Development and performance evaluation of the Medicines Optimisation Assessment Tool (MOAT): a prognostic model to target hospital pharmacists’ input to prevent medication-related problems

BMJ Quality & Safety ◽

10.1136/bmjqs-2018-008335 ◽

2019 ◽

Vol 28 (8) ◽

pp. 645-656 ◽

Cited By ~ 8

Author(s):

Cathy Geeson ◽

Li Wei ◽

Bryony Dean Franklin

Keyword(s):

Risk Factors ◽

Assessment Tool ◽

Predictive Accuracy ◽

External Validation ◽

Predictive Performance ◽

Study Data ◽

Predictive Values ◽

Hospital Pharmacists ◽

Potential Risk Factors ◽

Medicines Optimisation

Background Medicines optimisation is a key role for hospital pharmacists, but with ever-increasing demands on services, there is a need to increase efficiency while maintaining patient safety. Objective To develop a prediction tool, the Medicines Optimisation Assessment Tool (MOAT), to target patients most in need of pharmacists’ input in hospital. Methods Patients from adult medical wards at two UK hospitals were prospectively included into this cohort study. Data on medication-related problems (MRPs) were collected by pharmacists at the study sites as part of their routine daily clinical assessments. Data on potential risk factors, such as number of comorbidities and use of ‘high-risk’ medicines, were collected retrospectively. Multivariable logistic regression modelling was used to determine the relationship between risk factors and the study outcome: preventable MRPs that were at least moderate in severity. The model was internally validated and a simplified electronic scoring system developed. Results Among 1503 eligible admissions, 610 (40.6%) experienced the study outcome. Eighteen risk factors were preselected for MOAT development, with 11 variables retained in the final model. The MOAT demonstrated fair predictive performance (concordance index 0.66) and good calibration. Two clinically relevant decision thresholds (ie, the minimum predicted risk probabilities to justify pharmacists’ input) were selected, with sensitivities of 90% and 66% (specificity 30% and 61%); these equate to positive predictive values of 47% and 54%, respectively. Decision curve analysis suggests that the MOAT has potential value in clinical practice in guiding decision-making. Conclusion The MOAT has potential to predict those patients most at risk of moderate or severe preventable MRPs, experienced by 41% of admissions. External validation is now required to establish predictive accuracy in a new group of patients.
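
The positive predictive values quoted above follow from the reported sensitivities, specificities, and outcome prevalence (610 of 1503 admissions) through Bayes' rule: PPV = sens × prev / (sens × prev + (1 − spec) × (1 − prev)). The sketch below reproduces that arithmetic; the function name `ppv` is an illustrative assumption, and the inputs are the rounded figures from the abstract.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


prevalence = 610 / 1503  # admissions experiencing the study outcome

for sens, spec in [(0.90, 0.30), (0.66, 0.61)]:
    print(f"sens={sens:.2f} spec={spec:.2f} PPV={ppv(sens, spec, prevalence):.2f}")
```

Both thresholds give values that round to the reported 47% and 54%, which shows how raising specificity from 30% to 61% buys only a modest gain in PPV at this prevalence.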
