Balancing Complex Signals for Robust Predictive Modeling

Sensors ◽  
2021 ◽  
Vol 21 (24) ◽  
pp. 8465
Author(s):  
Fazal Aman ◽  
Azhar Rauf ◽  
Rahman Ali ◽  
Jamil Hussain ◽  
Ibrar Ahmed

Robust predictive modeling is the process of creating, validating, and testing models to obtain better prediction outcomes. Datasets usually contain outliers whose trend deviates from that of most data points. Conventionally, outliers are removed from the training dataset during preprocessing before building predictive models. Such models, however, may have poor predictive performance on unseen testing data that involve outliers. In modern machine learning, outliers are regarded as complex signals because of their significant role, and removing them from the training dataset is not recommended. Models trained in modern regimes are interpolated (overtrained) by increasing their complexity to treat outliers locally. However, such models become inefficient because the inclusion of outliers requires more training, and this also compromises the models’ accuracy. This work proposes a novel complex signal balancing technique that may be used during preprocessing to incorporate the maximum number of complex signals (outliers) in the training dataset. The proposed approach determines the optimal value for the maximum possible inclusion of complex signals for training with the highest performance of the model in terms of accuracy, time, and complexity. The experimental results show that models trained after preprocessing with the proposed technique achieve higher predictive accuracy with improved execution time and lower complexity compared to traditional predictive modeling.

2021 ◽  
pp. 1-10
Author(s):  
I. Krug ◽  
J. Linardon ◽  
C. Greenwood ◽  
G. Youssef ◽  
J. Treasure ◽  
...  

Background. Despite a wide range of proposed risk factors and theoretical models, prediction of eating disorder (ED) onset remains poor. This study undertook the first comparison of two machine learning (ML) approaches [penalised logistic regression (LASSO), and prediction rule ensembles (PREs)] to conventional logistic regression (LR) models to enhance prediction of ED onset and differential ED diagnoses from a range of putative risk factors. Method. Data were part of a European Project and comprised 1402 participants, 642 ED patients [52% with anorexia nervosa (AN) and 40% with bulimia nervosa (BN)] and 760 controls. The Cross-Cultural Risk Factor Questionnaire, which assesses retrospectively a range of sociocultural and psychological ED risk factors occurring before the age of 12 years (46 predictors in total), was used. Results. All three statistical approaches had satisfactory model accuracy, with an average area under the curve (AUC) of 86% for predicting ED onset and 70% for predicting AN v. BN. Predictive performance was greatest for the two regression methods (LR and LASSO), although the PRE technique relied on fewer predictors with comparable accuracy. The individual risk factors differed depending on the outcome classification (EDs v. non-EDs and AN v. BN). Conclusions. Even though the conventional LR performed comparably to the ML approaches in terms of predictive accuracy, the ML methods produced more parsimonious predictive models. ML approaches offer a viable way to modify screening practices for ED risk that balance accuracy against participant burden.


2004 ◽  
Vol 1 (1) ◽  
pp. 131-142
Author(s):  
Ljupčo Todorovski ◽  
Sašo Džeroski ◽  
Peter Ljubič

Both equation discovery and regression methods aim at inducing models of numerical data. While equation discovery methods are usually evaluated in terms of the comprehensibility of the induced model, the evaluation of regression methods emphasizes their predictive accuracy. In this paper, we present Ciper, an efficient method for the discovery of polynomial equations, and empirically evaluate its predictive performance on standard regression tasks. The evaluation shows that polynomials compare favorably to linear and piecewise regression models, induced by the existing state-of-the-art regression methods, in terms of degree of fit and complexity.


2019 ◽  
Author(s):  
Donald Salami ◽  
Carla Alexandra Sousa ◽  
Maria do Rosário Oliveira Martins ◽  
César Capinha

The geographical spread of dengue is a global public health concern. This is largely mediated by the importation of dengue from endemic to non-endemic areas via the increasing connectivity of the global air transport network. The dynamic nature and intrinsic heterogeneity of the air transport network make it challenging to predict dengue importation. Here, we explore the capabilities of state-of-the-art machine learning algorithms to predict dengue importation. We trained four machine learning classifier algorithms, using 6 years of historical dengue importation data for 21 countries in Europe, together with connectivity indices mediating importation and air transport network centrality measures. Predictive performance for the classifiers was evaluated using the area under the receiver operating characteristic curve, sensitivity, and specificity measures. Finally, we applied practical model-agnostic methods to provide an in-depth explanation of our optimal model’s predictions on a global and local scale. Our best performing model achieved high predictive accuracy, with an area under the receiver operating characteristic score of 0.94 and a maximized sensitivity score of 0.88. The predictor variables identified as most important were the source country’s dengue incidence rate, population size, and volume of air passengers. Network centrality measures, describing the positioning of European countries within the air travel network, were also influential to the predictions. We demonstrated the high predictive performance of a machine learning model in predicting dengue importation and the utility of the model-agnostic methods to offer a comprehensive understanding of the reasons behind the predictions. Similar approaches can be utilized in the development of an operational early warning surveillance system for dengue importation.


2019 ◽  
Author(s):  
Zanya Reubenne D. Omadlao ◽  
Nica Magdalena A. Tuguinay ◽  
Ricarido Maglaqui Saturay

A machine learning-based prediction system for rainfall-induced landslides in Benguet First Engineering District is proposed to address the landslide risk due to the climate and topography of Benguet province. It is intended to improve the decision support system for road management with regard to landslides, as implemented by the Department of Public Works and Highways Benguet First District Engineering Office. Supervised classification was applied to daily rainfall and landslide data for the Benguet First Engineering District covering the years 2014 to 2018 using scikit-learn. Various forms of cumulative rainfall values were used to predict landslide occurrence for a given day. Following typical machine learning workflows, the rainfall-landslide data set was divided into training and testing data sets. Machine learning algorithms such as K-Nearest Neighbors, Gaussian Naïve Bayes, Support Vector Machine, Logistic Regression, Random Forest, Decision Tree, and AdaBoost were trained using the training data sets, and the trained models were used to make predictions based on the testing data sets. Predictive performance of the models on the testing data sets was compared using true positive rates, false positive rates, and the area under the Receiver Operating Characteristic Curve. Predictive performance of these models was then compared to 1-day cumulative rainfall thresholds commonly used for landslide predictions. Among the machine learning models evaluated, Gaussian Naïve Bayes has the best performance, with mean false positive rate, true positive rate, and area under the curve scores of 7%, 76%, and 84%, respectively. It also performs better than the 1-day cumulative rainfall thresholds. This research demonstrates the potential of machine learning for identifying temporal patterns in rainfall-induced landslides using minimal data input: daily rainfall from a single synoptic station and highway maintenance records. Such an approach may be tested and applied to similar problems in the field of disaster risk reduction and management.
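
The abstract does not say which cumulative rainfall windows were used, so the following is only a minimal sketch of how such features could be derived from a daily rainfall series. The window lengths (1, 3, and 7 days), the function names, and the sample values are all assumptions made for illustration.

```python
from typing import List, Sequence


def cumulative_rainfall(daily_mm: Sequence[float], window: int) -> List[float]:
    """Rolling sum over the last `window` days, including the current day.

    Days near the start of the series sum whatever history is available.
    """
    totals: List[float] = []
    running = 0.0
    for i, value in enumerate(daily_mm):
        running += value
        if i >= window:
            # Drop the day that just fell out of the window.
            running -= daily_mm[i - window]
        totals.append(running)
    return totals


def build_features(daily_mm: Sequence[float],
                   windows: Sequence[int] = (1, 3, 7)) -> List[List[float]]:
    """One feature row per day, one column per (assumed) window length."""
    columns = [cumulative_rainfall(daily_mm, w) for w in windows]
    return [list(row) for row in zip(*columns)]


# Hypothetical daily rainfall in millimetres, not data from the study.
rain = [0.0, 12.0, 30.0, 5.0, 0.0]
features = build_features(rain)
print(features[-1])  # 1-day, 3-day, and 7-day totals for the last day
```

Each row could then be paired with a binary landslide label for that day and passed to any of the classifiers listed in the abstract.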


Author(s):  
Tong Wei ◽  
Yu-Feng Li

Large-scale multi-label learning annotates relevant labels for unseen data from a huge number of candidate labels. It is well known that in large-scale multi-label learning, labels exhibit a long-tail distribution in which a significant fraction of labels are tail labels. Nonetheless, how tail labels affect the performance metrics in large-scale multi-label learning has not been explicitly quantified. In this paper, we show that whether labels are randomly missing or misclassified, tail labels have much less impact than common labels on commonly used performance metrics (Top-$k$ precision and nDCG@$k$). Based on this observation, we develop a low-complexity large-scale multi-label learning algorithm with the goal of facilitating fast prediction and compact models by trimming tail labels adaptively. Experiments clearly verify that both the prediction time and the model size are significantly reduced without sacrificing much predictive performance for state-of-the-art approaches.
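
For readers unfamiliar with the two metrics named above, here is a minimal sketch of Top-$k$ precision and nDCG@$k$ for a single test instance. It assumes binary relevance and the normalization commonly used in the large-scale multi-label literature (the ideal DCG is taken over min(k, number of relevant labels) positions). The function names and the example labels are hypothetical, and this is not the authors' code.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of the top-k predicted labels that are truly relevant."""
    hits = sum(1 for label in ranked[:k] if label in relevant)
    return hits / k


def ndcg_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """Binary-relevance nDCG@k with a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, label in enumerate(ranked[:k])
        if label in relevant
    )
    ideal_hits = min(k, len(relevant))
    ideal_dcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical label ids: predictions ranked by score, and the true label set.
ranked_labels = [3, 7, 1, 9, 4]
true_labels = {3, 1, 8}

p3 = precision_at_k(ranked_labels, true_labels, 3)
n3 = ndcg_at_k(ranked_labels, true_labels, 3)
print(p3, n3)
```

Both scores count only the top $k$ positions of the ranking and are typically averaged over test instances.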


2021 ◽  
Vol 10 ◽  
Author(s):  
Yuan Li ◽  
Xiaolan Zhang ◽  
Yan Gao ◽  
Chunliang Shang ◽  
Bo Yu ◽  
...  

Background. High-grade serous ovarian cancer (HGSOC) is the most common subtype of ovarian cancer. Although platinum-based chemotherapy has been the cornerstone of HGSOC treatment, nearly 25% of patients have an interval of less than 6 months since their last platinum chemotherapy, which is referred to as platinum resistance. Currently, no precise tools to predict platinum resistance have been developed. Methods. Ninety-nine HGSOC patients who completed cytoreductive surgery and platinum-based chemotherapy at Peking University Third Hospital from 2018 to 2019 were enrolled. Whole-genome sequencing (WGS) and whole-exome sequencing (WES) were performed on the collected tumor tissue samples to establish a platinum-resistance predictor in a discovery cohort of 57 patients, which was further validated in another 42 HGSOC patients. Results. A high prevalence of alterations in the DNA damage repair (DDR) pathway, including BRCA1/2, was identified in both the platinum-sensitive and platinum-resistant HGSOC patients. Compared with the resistant subgroup, there was a trend toward a higher prevalence of homologous recombination deficiency (HRD) in the platinum-sensitive subgroup (78.95% vs. 47.37%, p=0.0646). Based on the HRD score, microhomology insertions and deletions (MHID), copy number changes load, duplication load of 1–100 kb, single nucleotide variants load, and eight other mutational signatures, a combined predictor of platinum resistance, named DRDscore, was established. DRDscore outperformed previously reported biomarkers in predicting platinum sensitivity, with a predictive accuracy of 0.860 at a threshold of 0.7584. The predictive performance of DRDscore was validated in an independent cohort of 42 HGSOC patients with a sensitivity of 90.9%. Conclusions. A multi-genomic signature-based analysis enabled the prediction of initial platinum resistance in advanced HGSOC patients, which may serve as a novel assessment of platinum resistance, provide therapeutic guidance, and merit further validation.


2015 ◽  
Vol 282 (1816) ◽  
pp. 20151574 ◽  
Author(s):  
Matthew R. Wilkins ◽  
Daizaburo Shizuka ◽  
Maxwell B. Joseph ◽  
Joanna K. Hubbard ◽  
Rebecca J. Safran

Complex signals, involving multiple components within and across modalities, are common in animal communication. However, decomposing complex signals into traits and their interactions remains a fundamental challenge for studies of phenotype evolution. We apply a novel phenotype network approach for studying complex signal evolution in the North American barn swallow (Hirundo rustica erythrogaster). We integrate model testing with correlation-based phenotype networks to infer the contributions of female mate choice and male–male competition to the evolution of barn swallow communication. Overall, the best predictors of mate choice were distinct from those for competition, while moderate functional overlap suggests males and females use some of the same traits to assess potential mates and rivals. We interpret model results in the context of a network of traits, and suggest this approach allows researchers a more nuanced view of trait clustering patterns that informs new hypotheses about the evolution of communication systems.


2021 ◽  
Author(s):  
Myeong Gyu Kim ◽  
Jae Hyun Kim ◽  
Kyungim Kim

BACKGROUND Garlic-related misinformation is prevalent whenever a virus outbreak occurs. Again, with the outbreak of coronavirus disease 2019 (COVID-19), garlic-related misinformation is spreading through social media sites, including Twitter. Machine learning-based approaches can be used to detect misinformation in large volumes of tweets. OBJECTIVE This study aimed to develop machine learning algorithms for detecting misinformation on garlic and COVID-19 on Twitter. METHODS This study used 5,929 original tweets mentioning garlic and COVID-19. Tweets were manually labeled as misinformation, accurate information, or others. We tested the following algorithms: k-nearest neighbors; random forest; support vector machine (SVM) with linear, radial, and polynomial kernels; and neural network. Features for machine learning included user-based features (verified account, user type, number of followers, and follower rate) and text-based features (uniform resource locator, negation, sentiment score, Latent Dirichlet Allocation topic probability, number of retweets, and number of favorites). The model with the highest accuracy in the training dataset (70% of the overall dataset) was tested using a test dataset (30% of the overall dataset). Predictive performance was measured using overall accuracy, sensitivity, specificity, and balanced accuracy. RESULTS The SVM model with the polynomial kernel showed the highest accuracy, 0.670. The model also showed a balanced accuracy of 0.757, sensitivity of 0.819, and specificity of 0.696 for misinformation. Important features in the misinformation and accurate information classes included topic 4 (common myths), topic 13 (garlic-specific myths), number of followers, topic 11 (misinformation on social media), and follower rate. Topic 3 (cooking recipes) was the most important feature in the others class. CONCLUSIONS Our SVM model showed good performance in detecting misinformation. The results of our study will help detect misinformation related to garlic and COVID-19. It could also be applied to prevent misinformation related to dietary supplements in the event of a future outbreak of a disease other than COVID-19.
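
As a quick check on how the reported per-class figures relate, the sketch below assumes the usual one-vs-rest definition of balanced accuracy as the mean of sensitivity and specificity. The abstract does not state the definition explicitly, so this is an assumption, and the function name is illustrative.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of sensitivity and specificity for one class (one-vs-rest)."""
    return (sensitivity + specificity) / 2.0


# Values reported for the misinformation class in the abstract above.
reported_sensitivity = 0.819
reported_specificity = 0.696

ba = balanced_accuracy(reported_sensitivity, reported_specificity)
print(ba)
```

The result is 0.7575, which agrees with the reported balanced accuracy of 0.757 up to rounding of the published inputs.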


2019 ◽  
Vol 2019 ◽  
pp. 1-9 ◽  
Author(s):  
Patricio Wolff ◽  
Manuel Graña ◽  
Sebastián A. Ríos ◽  
Maria Begoña Yarza

Background. Hospital readmission prediction in pediatric hospitals has received little attention. Studies have focused on readmission frequency analysis stratified by disease and demographic/geographic characteristics, but there are no predictive modeling approaches, which may be useful to identify preventable readmissions that constitute a major portion of the cost attributed to readmissions. Objective. To assess the all-cause readmission predictive performance achieved by machine learning techniques in the emergency department of a pediatric hospital in Santiago, Chile. Materials. An all-cause admissions dataset was collected over six consecutive years in a pediatric hospital in Santiago, Chile. The variables collected are the same as those used to determine the administrative cost of the child’s treatment. Methods. Retrospective predictive analysis of 30-day readmission was formulated as a binary classification problem. We report classification results achieved with various model building approaches after data curation and preprocessing for correction of class imbalance. We compute repeated cross-validation (RCV) with a decreasing number of folds to assess performance and sensitivity to the effect of imbalance in the test set and to training set size. Results. The increase in recall due to SMOTE class imbalance correction is large and statistically significant. The Naive Bayes (NB) approach achieves the best AUC (0.65); however, the shallow multilayer perceptron has the best PPV and f-score (5.6 and 10.2, resp.). The NB and support vector machines (SVM) give comparable results if we consider AUC, PPV, and f-score ranking for all RCV experiments. The high recall of the deep multilayer perceptron is due to a high false positive ratio. There is no detectable effect of the number of folds in the RCV on the predictive performance of the algorithms. Conclusions. We recommend the use of Naive Bayes (NB) with a Gaussian distribution model as the most robust modeling approach for pediatric readmission prediction, achieving the best results across all training dataset sizes. The results show that the approach could be applied to detect preventable readmissions.


2019 ◽  
Vol 28 (8) ◽  
pp. 645-656 ◽  
Author(s):  
Cathy Geeson ◽  
Li Wei ◽  
Bryony Dean Franklin

Background. Medicines optimisation is a key role for hospital pharmacists, but with ever-increasing demands on services, there is a need to increase efficiency while maintaining patient safety. Objective. To develop a prediction tool, the Medicines Optimisation Assessment Tool (MOAT), to target patients most in need of pharmacists’ input in hospital. Methods. Patients from adult medical wards at two UK hospitals were prospectively included into this cohort study. Data on medication-related problems (MRPs) were collected by pharmacists at the study sites as part of their routine daily clinical assessments. Data on potential risk factors, such as number of comorbidities and use of ‘high-risk’ medicines, were collected retrospectively. Multivariable logistic regression modelling was used to determine the relationship between risk factors and the study outcome: preventable MRPs that were at least moderate in severity. The model was internally validated and a simplified electronic scoring system developed. Results. Among 1503 eligible admissions, 610 (40.6%) experienced the study outcome. Eighteen risk factors were preselected for MOAT development, with 11 variables retained in the final model. The MOAT demonstrated fair predictive performance (concordance index 0.66) and good calibration. Two clinically relevant decision thresholds (i.e., the minimum predicted risk probabilities to justify pharmacists’ input) were selected, with sensitivities of 90% and 66% (specificity 30% and 61%); these equate to positive predictive values of 47% and 54%, respectively. Decision curve analysis suggests that the MOAT has potential value in clinical practice in guiding decision-making. Conclusion. The MOAT has potential to predict those patients most at risk of moderate or severe preventable MRPs, experienced by 41% of admissions. External validation is now required to establish predictive accuracy in a new group of patients.
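
The step from sensitivity and specificity to positive predictive value follows from Bayes' rule once the outcome prevalence is known. The sketch below reproduces that arithmetic using only figures from the abstract, with 610 of 1503 admissions taken as the prevalence. The function name is illustrative, and this is not the authors' code.

```python
def positive_predictive_value(sensitivity: float,
                              specificity: float,
                              prevalence: float) -> float:
    """PPV = true positives / (true positives + false positives)."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Prevalence of the study outcome among eligible admissions.
prevalence = 610 / 1503

# The two decision thresholds reported in the abstract.
ppv_sensitive = positive_predictive_value(0.90, 0.30, prevalence)
ppv_specific = positive_predictive_value(0.66, 0.61, prevalence)

print(round(ppv_sensitive, 2), round(ppv_specific, 2))
```

Rounded to two decimals, the results are 0.47 and 0.54, which match the positive predictive values of 47% and 54% reported for the two thresholds.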

