Benchmarking missing-values approaches for predictive models on health databases v2

Author(s):  
Alexandre Perez-Lebel ◽  
Gaël Varoquaux ◽  
Marine Le Morvan ◽  
Julie Josse ◽  
Jean-Baptiste Poline

BACKGROUND As databases grow larger, it becomes harder to fully control their collection, and they frequently come with missing values: incomplete observations. These large databases are well suited to training machine-learning models, for instance for forecasting or to extract biomarkers in biomedical settings. Such predictive approaches can use discriminative (rather than generative) modeling, and thus open the door to new missing-values strategies. Yet existing empirical evaluations of strategies to handle missing values have focused on inferential statistics. RESULTS Here we conduct a systematic benchmark of missing-values strategies in predictive models with a focus on large health databases: four electronic health record datasets, a population brain-imaging database, a health survey, and two intensive-care datasets. Using gradient-boosted trees, we compare native support for missing values with simple and state-of-the-art imputation prior to learning. We investigate prediction accuracy and computational time. For prediction after imputation, we find that adding an indicator to express which values have been imputed is important, suggesting that the data are missing not at random. Elaborate missing-values imputation can improve prediction compared with simple strategies but requires longer computational time on large data. Learning trees that model missing values (with the missing-incorporated-attribute approach) leads to robust, fast, and well-performing predictive modeling. CONCLUSIONS Native support for missing values in supervised machine learning predicts better than state-of-the-art imputation with much less computational cost. When using imputation, it is important to add indicator columns expressing which values have been imputed.
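A minimal sketch (not the authors' benchmark code) of two of the strategies compared here, using scikit-learn on synthetic data: gradient-boosted trees with native NaN support versus mean imputation augmented with missingness-indicator columns. The dataset, missingness mechanism, and hyperparameters are illustrative assumptions.

```python
# Contrast native missing-value support in gradient-boosted trees with
# "impute + indicator" on a synthetic dataset with 20% missing entries.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X[rng.rand(*X.shape) < 0.2] = np.nan  # inject 20% missing values

# (a) trees that handle NaN natively during split finding
native = HistGradientBoostingClassifier(random_state=0)

# (b) mean imputation plus indicator columns marking which entries were imputed
imputed = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),
    HistGradientBoostingClassifier(random_state=0),
)

for name, model in [("native NaN support", native), ("impute + indicator", imputed)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUC = {scores.mean():.3f}")
```

The indicator columns give the downstream model access to the missingness pattern itself, which is what the abstract's recommendation about indicators points to when data are missing not at random.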

2020 ◽  
pp. 1-21 ◽  
Author(s):  
Clément Dalloux ◽  
Vincent Claveau ◽  
Natalia Grabar ◽  
Lucas Emanuel Silva Oliveira ◽  
Claudia Maria Cabral Moro ◽  
...  

Abstract Automatic detection of negated content is often a prerequisite in information extraction systems in various domains. This task is especially important in the biomedical domain, where negation plays a prominent role. In this work, two main contributions are proposed. First, we work with languages that have been poorly addressed up to now: Brazilian Portuguese and French. We thus developed new corpora for these two languages, manually annotated to mark up the negation cues and their scope. Second, we propose automatic methods based on supervised machine learning for the detection of negation cues and of their scopes. The methods prove to be robust in both languages (Brazilian Portuguese and French) and in cross-domain (general and biomedical language) contexts. The approach is also validated on English data from the state of the art: it yields very good results and outperforms other existing approaches. In addition, the application is accessible and usable online. We expect that these contributions (new annotated corpora, an application accessible online, and cross-domain robustness) will improve the reproducibility of the results and the robustness of NLP applications.
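The abstract does not detail the learning setup, so the following is only a toy illustration of supervised, token-level negation-cue detection with hand-crafted contextual features; the sentences, labels, and features are invented stand-ins, not the authors' annotated Portuguese or French corpora.

```python
# Token-level negation-cue classification with simple contextual features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training sentences with per-token cue labels (1 = negation cue).
sentences = [
    (["no", "evidence", "of", "fracture"], [1, 0, 0, 0]),
    (["patient", "denies", "chest", "pain"], [0, 1, 0, 0]),
    (["fever", "is", "present"], [0, 0, 0]),
]

def token_features(tokens, i):
    # Current word plus its immediate left/right context.
    return {
        "word": tokens[i].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

X = [token_features(toks, i) for toks, labels in sentences for i in range(len(toks))]
y = [lab for _, labels in sentences for lab in labels]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.predict([token_features(["no", "cough"], 0)]))  # expect a cue label
```

Scope resolution would typically be handled as a second, sequence-labeling step over the tokens around each detected cue.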


The World Health Organization's (WHO) 2018 report on diabetes states that the number of diabetic cases increased from 108 million in 1980 to 422 million. The fact sheet shows a major increase in diabetes prevalence among adults (aged 18 years and over), from 4.7% to 8.5%. Major health hazards caused by diabetes include kidney failure, heart disease, blindness, stroke, and lower-limb amputation. This article applies supervised machine learning algorithms to the Pima Indian Diabetes dataset to explore, through predictive models, various patterns of the risks involved. Predictive model construction is based on the supervised machine learning algorithms Naïve Bayes, Decision Tree, Random Forest, Gradient Boosted Tree, and Tree Ensemble. Further, analytical patterns of these predictive models are presented in terms of various performance measures, including accuracy, precision, recall, and F-measure.
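A minimal sketch of the model comparison described above, assuming the OpenML copy of the Pima Indians Diabetes data (name="diabetes"); the article's exact preprocessing and its Tree Ensemble learner are not reproduced here.

```python
# Compare several classifiers on the Pima dataset with accuracy, precision,
# recall, and F1, estimated by 5-fold cross-validation.
from sklearn.datasets import fetch_openml
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = fetch_openml(name="diabetes", version=1, as_frame=True, return_X_y=True)
y = (y == "tested_positive").astype(int)  # 1 = diabetic

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosted Tree": GradientBoostingClassifier(random_state=0),
}
scoring = ["accuracy", "precision", "recall", "f1"]

for name, model in models.items():
    cv = cross_validate(model, X, y, cv=5, scoring=scoring)
    summary = ", ".join(f"{m}={cv['test_' + m].mean():.3f}" for m in scoring)
    print(f"{name}: {summary}")
```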


2021 ◽  
Author(s):  
Xavier Dupont

BACKGROUND As of October 2020, the COVID-19 death toll had reached over one million, with 38 million confirmed cases globally. This pandemic is shaking the foundations of economies and reminding us of the fragility of our system. Epidemics have affected societies since biblical times, but the recent acceleration in science and technology, as well as global cooperation, has given scientists and mathematicians new resources they can use to anticipate how a pandemic will spread through mathematical modelling. Compartmental modelling techniques, such as the SIR model, have been well established for more than a century and have proven efficient and reliable in helping governments decide which strategies to use to fight pandemics. OBJECTIVE To provide a state-of-the-art report on predictive models and technology. METHODS Field research and interviews. RESULTS More recently, digitalisation and rapid progress in fields such as machine learning, IoT, and big data have brought new perspectives to predictive models, improving their ability to predict how a pandemic will unfold and, therefore, which actions should be taken to eradicate the disease. This report first reviews how pandemic modelling works. CONCLUSIONS It then discusses the benefits and limitations of those models before outlining how new initiatives in several fields of technology are being used to fight the virus that causes COVID-19.
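For reference, a minimal SIR sketch in the standard textbook form (dS/dt = -βSI/N, dI/dt = βSI/N - γI, dR/dt = γI); the rates, population, and initial conditions below are illustrative assumptions, not calibrated to COVID-19 data.

```python
# Integrate the classic SIR compartmental model and report the epidemic peak.
import numpy as np
from scipy.integrate import solve_ivp

N = 1_000_000           # population size
beta, gamma = 0.3, 0.1  # transmission and recovery rates (basic reproduction number = beta/gamma = 3)
I0, R_init = 100, 0
S0 = N - I0 - R_init

def sir(t, y):
    S, I, R = y
    return [-beta * S * I / N, beta * S * I / N - gamma * I, gamma * I]

sol = solve_ivp(sir, (0, 180), [S0, I0, R_init], t_eval=np.linspace(0, 180, 181))
peak_day = sol.t[np.argmax(sol.y[1])]
print(f"Peak infections ~{sol.y[1].max():,.0f} around day {peak_day:.0f}")
```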


Sensors ◽  
2020 ◽  
Vol 20 (16) ◽  
pp. 4540
Author(s):  
Kieran Rendall ◽  
Antonia Nisioti ◽  
Alexios Mylonas

Phishing is one of the most common threats that users face while browsing the web. In the current threat landscape, a targeted phishing attack (i.e., spear phishing) often constitutes the first action of a threat actor during an intrusion campaign. To tackle this threat, many data-driven approaches have been proposed, which mostly rely on the use of supervised machine learning under a single-layer approach. However, such approaches are resource-demanding and, thus, their deployment in production environments is infeasible. Moreover, most previous works utilise a feature set that can be easily tampered with by adversaries. In this paper, we investigate the use of a multi-layered detection framework in which a potential phishing domain is classified multiple times by models using different feature sets. In our work, an additional classification takes place only when the initial one scores below a predefined confidence level, which is set by the system owner. We demonstrate our approach by implementing a two-layered detection system, which uses supervised machine learning to identify phishing attacks. We evaluate our system with a dataset consisting of active phishing attacks and find that its performance is comparable to the state of the art.
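A minimal sketch of the layered-escalation idea (not the authors' system): a cheap first-layer model classifies a domain on its own unless its confidence falls below an owner-defined threshold, in which case a second model with a richer feature set is consulted. The data, feature split, and threshold here are synthetic assumptions.

```python
# Two-layer classification with confidence-based escalation to layer 2.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: layer 1 sees a cheap subset of features, layer 2 sees all of them.
X, y = make_classification(n_samples=3000, n_features=30, random_state=0)
X1, X2 = X[:, :10], X
X1_tr, X1_te, X2_tr, X2_te, y_tr, y_te = train_test_split(X1, X2, y, random_state=0)

layer1 = LogisticRegression(max_iter=1000).fit(X1_tr, y_tr)
layer2 = RandomForestClassifier(random_state=0).fit(X2_tr, y_tr)

THRESHOLD = 0.9  # confidence level set by the system owner
proba1 = layer1.predict_proba(X1_te)
confident = proba1.max(axis=1) >= THRESHOLD

# Keep layer 1's decision when confident, otherwise defer to layer 2.
pred = np.where(confident, proba1.argmax(axis=1), layer2.predict(X2_te))
print(f"{(~confident).mean():.1%} of domains escalated to layer 2, "
      f"accuracy = {(pred == y_te).mean():.3f}")
```

Only the uncertain fraction pays the cost of the heavier second-layer model, which is the resource argument made in the abstract.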


2020 ◽  
Author(s):  
Hossein Foroozand ◽  
Steven V. Weijs

Machine learning is a fast-growing branch of data-driven modelling, and its main objective is to use computational methods to predict outcomes more accurately without being explicitly programmed. In this field, a way to improve model predictions is to use a large collection of models (called an ensemble) instead of a single one. Each model is then trained on a slightly different sample of the original data, and their predictions are averaged. This is called bootstrap aggregating, or bagging, and is widely applied. A recurring question in previous work is how to choose the ensemble size, that is, the number of resampled training data sets used to tune the model weights. The computational cost of ensemble-based methods scales with the size of the ensemble, but excessively reducing the ensemble size comes at the cost of reduced predictive performance. The choice of ensemble size has often been determined by the size of the input data and the available computational power, which can become a limiting factor for larger datasets and the training of complex models. In this research, our hypothesis is that if an ensemble of artificial neural network (ANN) models, or of any other machine learning technique, uses only the most informative ensemble members for training rather than all bootstrapped ensemble members, the computational time could be reduced substantially without negatively affecting the performance of the simulation.
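To make the bagging baseline concrete, here is a minimal sketch of bootstrap aggregating with a small ensemble of neural networks; it averages all members' predictions and does not implement the member-selection idea hypothesised above. The data, ensemble size, and network size are illustrative.

```python
# Bagging by hand: train each ANN on a bootstrap resample, average predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.RandomState(0)
n_members = 10
preds = []
for m in range(n_members):
    idx = rng.randint(0, len(X_tr), size=len(X_tr))  # bootstrap resample
    ann = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=m)
    ann.fit(X_tr[idx], y_tr[idx])
    preds.append(ann.predict(X_te))

ensemble_pred = np.mean(preds, axis=0)  # cost grows linearly with n_members
rmse = np.sqrt(np.mean((ensemble_pred - y_te) ** 2))
print(f"Bagged ensemble of {n_members} ANNs, test RMSE = {rmse:.2f}")
```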


Missing data raise major issues for quantitative analysis of large databases. Because of them, inferences drawn from the data can be biased, more data can be damaged, error rates can increase, and imputation becomes harder to carry out. The prediction of disguised missing data in large data sets is another major problem in real-time operation. Machine learning (ML) techniques can be coupled with classification of the measurements to improve the accuracy of predicted values, and they help overcome the various challenges posed by missing data. Recent work addresses the prediction of misclassification using a supervised ML approach, predicting an output for an unseen input from a limited number of parameters in a data set; as the number of parameters grows, accuracy tends to decrease. This article presents COBACO, an effective supervised machine learning technique, and describes several strategies for classifying predictive techniques for missing-data analysis with efficient supervised machine learning. The proposed predictive technique, COBACO, generated more precise and accurate results than the other predictive approaches. Experimental results obtained on both real and synthetic data sets show that the proposed approach offers valuable and promising insight into the problem of predicting missing information.
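The abstract does not specify COBACO's algorithm, so the following is only a generic sketch of supervised, model-based imputation, in which each incomplete column is predicted from the observed columns (here via scikit-learn's IterativeImputer); it is not the authors' method.

```python
# Model-based imputation: predict missing entries from the observed columns
# and measure the reconstruction error on the entries that were hidden.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.RandomState(0)
X, _ = make_regression(n_samples=500, n_features=6, random_state=0)
X_missing = X.copy()
X_missing[rng.rand(*X.shape) < 0.15] = np.nan  # hide 15% of the entries

imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X_missing)

mask = np.isnan(X_missing)
rmse = np.sqrt(np.mean((X_filled[mask] - X[mask]) ** 2))
print(f"RMSE on the imputed entries: {rmse:.3f}")
```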


PLoS ONE ◽  
2021 ◽  
Vol 16 (7) ◽  
pp. e0253789
Author(s):  
Magdalyn E. Elkin ◽  
Xingquan Zhu

As of March 30, 2021, over 5,193 COVID-19 clinical trials had been registered through ClinicalTrials.gov. Among them, 191 trials were terminated, suspended, or withdrawn (indicating cessation of the study), while 909 trials had been completed. In this study, we propose to study the underlying factors of COVID-19 trial completion vs. cessation, and we design predictive models to accurately predict whether a COVID-19 trial will complete or cease in the future. We collect 4,441 COVID-19 trials from ClinicalTrials.gov to build a testbed, and we design four types of features to characterize clinical trial administration, eligibility, study information, criteria, drug types, and study keywords, as well as embedding features commonly used in state-of-the-art machine learning. Our study shows that drug features and study keywords are the most informative features, but all four types of features are essential for accurate trial prediction. Using predictive models, our approach achieves more than 0.87 AUC (area under the curve) and 0.81 balanced accuracy in correctly predicting COVID-19 clinical trial completion vs. cessation. Our research shows that computational methods can deliver effective features to understand the difference between completed and ceased COVID-19 trials. In addition, such models can predict COVID-19 trial status with satisfactory accuracy and help stakeholders better plan trials and minimize costs.
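A minimal sketch (synthetic stand-in features rather than the ClinicalTrials.gov testbed) of the reported evaluation: a classifier trained on concatenated feature groups and scored with ROC AUC and balanced accuracy, the two metrics quoted above.

```python
# Train a classifier on imbalanced trial-status labels and report AUC and
# balanced accuracy on a held-out split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Pretend columns come from four feature groups (administration, eligibility,
# drug types, study keywords); 1 = completed, 0 = ceased.
X, y = make_classification(n_samples=4441, n_features=40, weights=[0.2, 0.8],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
print(f"AUC = {roc_auc_score(y_te, proba):.3f}, "
      f"balanced accuracy = {balanced_accuracy_score(y_te, clf.predict(X_te)):.3f}")
```

Balanced accuracy is the natural companion to AUC here because ceased trials are a small minority class.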


Energies ◽  
2020 ◽  
Vol 13 (13) ◽  
pp. 3340 ◽  
Author(s):  
Ehsan Harirchian ◽  
Tom Lahmer ◽  
Vandana Kumari ◽  
Kirti Jadhav

Economic losses from earthquakes tend to hit the national economy considerably; therefore, models capable of estimating the vulnerability and losses from future earthquakes are highly consequential for emergency planners aiming at risk mitigation. This demands mass prioritization filtering of structures to identify vulnerable buildings for retrofitting purposes. Applying advanced structural analysis to each building to study its earthquake response is impractical due to complex calculations, long computational times, and exorbitant cost. This demonstrates the need for a fast, reliable, and rapid method, commonly known as Rapid Visual Screening (RVS). The method serves as a preliminary screening platform, using an optimum number of seismic parameters of the structure and predefined output damage states. In this study, the efficacy of applying Machine Learning (ML) to damage prediction through a Support Vector Machine (SVM) model as the damage classification technique has been investigated. The developed model was trained and examined on damage data from the 1999 Düzce earthquake in Turkey, where the building data consist of 22 performance modifiers processed with supervised machine learning.
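A minimal sketch of the SVM-based damage classification described above, using a synthetic stand-in for the 22 Düzce performance modifiers and generic damage-state labels; the study's actual data, class definitions, and kernel tuning are not reproduced.

```python
# SVM classification of buildings into damage states from 22 input features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 22 features per building, 3 illustrative damage states as labels.
X, y = make_classification(n_samples=500, n_features=22, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_tr, y_tr)
print(f"Held-out accuracy: {svm.score(X_te, y_te):.3f}")
```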


Author(s):  
Liye Zhang ◽  
W. M. Kim Roddis

A method combining machine learning and regression analysis to automatically and intelligently update predictive models used in the Kansas Department of Transportation’s (KDOT’s) internal management system is presented. The predictive models used by KDOT consist of planning factors (mathematical functions) and base quantities (constants). The duration of a functional unit (defined as a subactivity) is determined by the product of a planning factor and its base quantity. The availability of a large data base on projects executed over the past decade provided the opportunity to develop an automated process updating predictive models based on extracting information from historical data through machine learning. To perform the entire task of updating the predictive models, the learning process consists of three stages. The first stage derives the numerical relationship between the duration of a functional unit and the project attributes recorded in the data base. The second stage finds the functional units with similar behavior—that is, identifies functional units that can be described by the same shared planning factor scaled in terms of their own base quantities. The third stage generates new planning factors and base quantities. A system called PFactor built on the basis of the three-stage learning process shows good performance in updating KDOT’s predictive models.
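A hypothetical sketch of the core relationship the abstract describes: a functional unit's duration is the product of a planning factor and a base quantity, so an updated planning factor can be re-estimated from historical records by regression through the origin. The data and variable names are illustrative, not KDOT's.

```python
# Re-estimate a planning factor from historical durations via
# duration = planning_factor * base_quantity (no intercept).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
base_quantity = rng.uniform(10, 200, size=300).reshape(-1, 1)   # e.g. project size
true_factor = 0.75                                              # days per unit
duration = true_factor * base_quantity.ravel() + rng.normal(0, 5, size=300)

model = LinearRegression(fit_intercept=False)  # duration = factor * quantity
model.fit(base_quantity, duration)
print(f"Re-estimated planning factor: {model.coef_[0]:.3f} days per unit")
```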

