Applying Economic Measures to Lapse Risk Management with Machine Learning Approaches

Astin Bulletin ◽

10.1017/asb.2021.10 ◽

2021 ◽

pp. 1-33

Author(s):

Stéphane Loisel ◽

Pierrick Piette ◽

Cheng-Hsien Jason Tsai

Keyword(s):

Machine Learning ◽

Risk Management ◽

Regression Tree ◽

Classification Problem ◽

Point Of View ◽

Classification And Regression Tree ◽

Gradient Boosting ◽

Support Vector ◽

Learning Approaches ◽

Extreme Gradient Boosting

Abstract: Modeling policyholders’ lapse behaviors is important to a life insurer, since lapses affect pricing, reserving, profitability, liquidity, risk management, and the solvency of the insurer. In this paper, we apply two machine learning methods to lapse modeling. Then, we evaluate the performance of these two methods along with two popular statistical methods by means of statistical accuracy and a profitability measure. Moreover, we adopt an innovative point of view on the lapse prediction problem that comes from churn management. We transform the classification problem into a regression question and then perform optimization, which is new to lapse risk management. We apply the aforementioned four methods to a large real-world insurance dataset. The results show that Extreme Gradient Boosting (XGBoost) and support vector machine outperform logistic regression (LR) and classification and regression tree with respect to statistical accuracy, while LR performs as well as XGBoost in terms of retention gains. This highlights the importance of a proper validation metric when comparing different methods. The optimization after the transformation brings out significant and consistent increases in economic gains. Therefore, the insurer should conduct optimization on its economic objective to achieve optimal lapse management.
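The contrast between statistical accuracy and retention gains can be made concrete with a toy calculation. The sketch below is not the authors' procedure. It assumes a hypothetical gain model in which every contacted policyholder costs `offer_cost`, and a contacted policyholder who would have lapsed is retained with probability `accept_rate`, preserving `saved_value`. All function names, parameters, and numbers are illustrative assumptions.

```python
def retention_gain(probs, lapsed, threshold,
                   saved_value=100.0, offer_cost=10.0, accept_rate=0.5):
    """Toy economic gain of contacting everyone with predicted lapse prob >= threshold.

    Hypothetical gain model (an assumption, not taken from the paper):
    each contact costs `offer_cost`; a contacted true lapser yields an
    expected benefit of `accept_rate * saved_value`.
    """
    gain = 0.0
    for p, y in zip(probs, lapsed):
        if p >= threshold:
            gain -= offer_cost                      # cost of the retention offer
            if y == 1:
                gain += accept_rate * saved_value   # expected value of a saved policy
    return gain


def best_threshold(probs, lapsed, candidates):
    """Return the candidate threshold with the highest toy gain."""
    return max(candidates, key=lambda t: retention_gain(probs, lapsed, t))


# Illustrative data: predicted lapse probabilities and observed lapses (1 = lapsed).
probs = [0.9, 0.8, 0.3, 0.1]
lapsed = [1, 0, 1, 0]
chosen = best_threshold(probs, lapsed, [0.05, 0.2, 0.5, 0.85])
```

On this toy data the gain-maximizing threshold is 0.2 rather than the conventional 0.5, which illustrates why the choice of validation metric can change which decision rule looks best.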


Estimating the Optimal Dexketoprofen Pharmaceutical Formulation with Machine Learning Methods and Statistical Approaches

Healthcare Informatics Research ◽

10.4258/hir.2021.27.4.279 ◽

2021 ◽

Vol 27 (4) ◽

pp. 279-286

Author(s):

Atakan Başkor ◽

Yağmur Pirinçci Tok ◽

Burcu Mesut ◽

Yıldız Özsoy ◽

Tamer Uçar

Keyword(s):

Machine Learning ◽

Regression Tree ◽

Cost Effective ◽

Pharmaceutical Formulation ◽

Classification And Regression Tree ◽

Gradient Boosting ◽

Support Vector ◽

Disintegration Time ◽

Pharmaceutical Dosage Form ◽

Extreme Gradient Boosting

Objectives: Orally disintegrating tablets (ODTs) can be utilized without any drinking water; this feature makes ODTs easy to use and suitable for specific groups of patients. Oral administration of drugs is the most commonly used route, and tablets constitute the most preferable pharmaceutical dosage form. However, the preparation of ODTs is costly and requires long trials, which creates obstacles for dosage trials. The aim of this study was to identify the most appropriate formulation using machine learning (ML) models of ODT dexketoprofen formulations, with the goal of providing a cost-effective and time-reducing solution. Methods: This research utilized nonlinear regression models, including the k-nearest neighbors (k-NN), support vector regression (SVR), classification and regression tree (CART), bootstrap aggregating (bagging), random forest (RF), gradient boosting machine (GBM), and extreme gradient boosting (XGBoost) methods, as well as the t-test, to predict the quantity of various components in the dexketoprofen formulation within fixed criteria. Results: All the models were developed with Python libraries. The performance of the ML models was evaluated with R² values and the root mean square error (RMSE). The GBM algorithm gave the best results, with R² and RMSE values of 0.99 and 2.88 for hardness, 0.92 and 0.02 for friability, and 0.97 and 10.09 for disintegration time, respectively. Conclusions: In this study, we developed a computational approach to estimate the optimal pharmaceutical formulation of dexketoprofen. The results were evaluated by an expert, and it was found that they complied with Food and Drug Administration criteria.
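The two evaluation metrics named in the Results can be computed as in the following minimal sketch. It uses the standard definitions of the coefficient of determination and the root mean square error; the function name and sample values are illustrative assumptions, not data from the study.

```python
import math


def r2_and_rmse(y_true, y_pred):
    """Return (R^2, RMSE) for paired observed and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: squared error of the model predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviation of observations from their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r2, rmse


# Illustrative values only (not taken from the paper).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2, rmse = r2_and_rmse(observed, predicted)
```

R² is unitless while RMSE carries the units of the target, which is why the two numbers reported for each property sit on very different scales.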


Exploring the Mechanism of Crashes with Autonomous Vehicles Using Machine Learning

Mathematical Problems in Engineering ◽

10.1155/2021/5524356 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Hengrui Chen ◽

Hong Chen ◽

Ruiyu Zhou ◽

Zhizhen Liu ◽

Xiaoke Sun

Keyword(s):

Machine Learning ◽

Autonomous Vehicles ◽

Classification And Regression Tree ◽

Gradient Boosting ◽

Support Vector ◽

Crash Severity ◽

Apriori Algorithm ◽

Driving Mode ◽

Extreme Gradient Boosting ◽

The Impact

The safety issue has become a critical obstacle that cannot be ignored in the marketization of autonomous vehicles (AVs). The objective of this study is to explore the mechanism of AV-involved crashes and analyze the impact of each feature on crash severity. We use the Apriori algorithm to explore the causal relationships among multiple factors and thereby the mechanism of crashes. We use various machine learning models, including support vector machine (SVM), classification and regression tree (CART), and eXtreme Gradient Boosting (XGBoost), to analyze the crash severity. In addition, we apply Shapley Additive Explanations (SHAP) to interpret the importance of each factor. The results indicate that XGBoost obtains the best result (recall = 75%; G-mean = 67.82%). Both XGBoost and the Apriori algorithm effectively provided meaningful insights about AV-involved crash characteristics and their relationships. Among all these features, vehicle damage, weather conditions, accident location, and driving mode are the most critical. We found that most rear-end crashes are conventional vehicles bumping into the rear of AVs. Drivers should be extremely cautious when driving in fog, snow, and insufficient light. Drivers should also be careful when driving near intersections, especially in the autonomous driving mode.
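Recall and G-mean, the two scores reported above, can be computed as in this short sketch. It assumes the common binary definition of G-mean as the geometric mean of sensitivity and specificity; the abstract does not state which variant the study used, and the labels below are illustrative only.

```python
import math


def recall_and_gmean(y_true, y_pred):
    """Return (recall, G-mean) for binary labels, where 1 is the positive class.

    Assumes the binary definition G-mean = sqrt(sensitivity * specificity).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # recall on positives
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # recall on negatives
    return sensitivity, math.sqrt(sensitivity * specificity)


# Illustrative labels only (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
recall, gmean = recall_and_gmean(y_true, y_pred)
```

Because G-mean multiplies the per-class recalls, a model that ignores the minority class scores zero, which makes it a useful companion to recall on imbalanced severity data.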


Combining Binary and Post-Classification Change Analysis of Augmented ALOS Backscatter for Identifying Subtle Land Cover Changes

Remote Sensing ◽

10.3390/rs11010100 ◽

2019 ◽

Vol 11 (1) ◽

pp. 100 ◽

Cited By ~ 7

Author(s):

Dyah R. Panuju ◽

David J. Paull ◽

Bambang H. Trisasongko

Keyword(s):

Random Forest ◽

Classification Accuracy ◽

Regression Tree ◽

Google Earth ◽

Variable Importance ◽

Classification And Regression Tree ◽

Gradient Boosting ◽

Support Vector ◽

Change Analysis ◽

Extreme Gradient Boosting

This research aims to detect subtle changes by combining binary change analysis, the Iteratively Reweighted Multivariate Alteration Detection (IRMAD), over dual polarimetric Advanced Land Observing Satellite (ALOS) backscatter with augmented data for post-classification change analysis. The accuracy of change detection was iteratively evaluated based on thresholds composed of mean and a range constant of standard deviation. Four datasets were examined for post-classification change analysis including the dual polarimetric backscatter as the benchmark and its augmented data with indices, entropy alpha decomposition and selected texture features. Variable importance was then evaluated to build a best subset model employing seven classifiers, including Bagged Classification and Regression Tree (CAB), Extreme Learning Machine Neural Network (ENN), Bagged Multivariate Adaptive Regression Spline (MAB), Regularised Random Forest (RFG), Original Random Forest (RFO), Support Vector Machine (SVM), and Extreme Gradient Boosting Tree (XGB). The best accuracy was 98.8%, which resulted from thresholding MAD variate-2 with constants at 1.7. The highest improvement of classification accuracy was obtained by amending the grey level co-occurrence matrix (GLCM) texture. The identification of variable importance (VI) confirmed that selected GLCM textures (mean and variance of HH or HV) were equally superior, while the contribution of index and decomposition were negligible. The best model produced similar classification accuracy at about 90% for both years 2007 and 2010. Tree-based algorithms including RFO, RFG and XGB were more robust than SVM and ENN. Subtle changes indicated by binary change analysis were somewhat hidden in post-classification analysis. Reclassification by combining all important variables and adding five classes to include subtle changes assisted by Google Earth yielded an accuracy of 82%.


Performance of Statistical and Machine Learning-Based Methods for Predicting Biogeographical Patterns of Fungal Productivity in Forest Ecosystems

10.21203/rs.3.rs-122045/v1 ◽

2020 ◽

Author(s):

Albert Morera ◽

Juan Martínez de Aragón ◽

José Antonio Bonet ◽

Jingjing Liang ◽

Sergio de-Miguel

Keyword(s):

Machine Learning ◽

Random Forest ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Learning Approaches ◽

Learning Models ◽

Extreme Gradient Boosting ◽

Machine Learning Models ◽

Modelling Approaches

Abstract Background: The prediction of biogeographical patterns from a large number of driving factors with complex interactions, correlations and non-linear dependences requires advanced analytical methods and modelling tools. This study compares different statistical and machine learning models for predicting fungal productivity biogeographical patterns as a case study for the thorough assessment of the performance of alternative modelling approaches to provide accurate and ecologically-consistent predictions. Methods: We evaluated and compared the performance of two statistical modelling techniques, namely, generalized linear mixed models and geographically weighted regression, and four machine learning models, namely, random forest, extreme gradient boosting, support vector machine and deep learning to predict fungal productivity. We used a systematic methodology based on substitution, random, spatial and climatic blocking combined with principal component analysis, together with an evaluation of the ecological consistency of spatially-explicit model predictions. Results: Fungal productivity predictions were sensitive to the modelling approach and complexity. Moreover, the importance assigned to different predictors varied between machine learning modelling approaches. Decision tree-based models increased prediction accuracy by ~7% compared to other machine learning approaches and by more than 25% compared to statistical ones, and resulted in higher ecological consistency at the landscape level. Conclusions: Whereas a large number of predictors are often used in machine learning algorithms, in this study we show that proper variable selection is crucial to create robust models for extrapolation in biophysically differentiated areas. When dealing with spatial-temporal data in the analysis of biogeographical patterns, climatic blocking is postulated as a highly informative technique to be used in cross-validation to assess the prediction error over larger scales. Random forest was the best approach for prediction both in sampling-like environments as well as in extrapolation beyond the spatial and climatic range of the modelling data.


Importance of GWAS risk loci and clinical data in predicting asthma using machine-learning approaches

10.21203/rs.3.rs-21271/v1 ◽

2020 ◽

Author(s):

Si-Qiao Liang ◽

Jian-Xiong Long ◽

Jingmin Deng ◽

Xuan Wei ◽

Mei-Ling Yang ◽

...

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Clinical Data ◽

Genome Wide Association Study ◽

Prediction Models ◽

Area Under The Curve ◽

Gradient Boosting ◽

Support Vector ◽

Learning Approaches ◽

Extreme Gradient Boosting

Abstract: Asthma is a serious immune-mediated respiratory airway disease. Its pathological processes involve genetics and the environment, but they remain unclear. To understand the risk factors of asthma, we combined genome-wide association study (GWAS) risk loci and clinical data in predicting asthma using machine-learning approaches. A case–control study with 123 asthma patients and 100 healthy controls was conducted in the Zhuang population in Guangxi. GWAS risk loci were detected using polymerase chain reaction, and clinical data were collected. Machine-learning approaches (e.g., extreme gradient boosting [XGBoost], decision tree, support vector machine, and random forest algorithms) were used to identify the major factors that contributed to asthma. A total of 14 GWAS risk loci with clinical data were analyzed using 10 repetitions of 10-fold cross-validation for all machine-learning models. Using GWAS risk loci or clinical data alone, the best performances were area under the curve (AUC) values of 64.3% and 71.4%, respectively. Combining GWAS risk loci and clinical data, XGBoost established the best model with an AUC of 79.7%, indicating that the combination of genetics and clinical data can enable improved performance. We then sorted the importance of features and found that the top six risk factors for predicting asthma were rs3117098, rs7775228, family history, rs2305480, rs4833095, and body mass index. Asthma-prediction models based on GWAS risk loci and clinical data can accurately predict asthma and thus provide insights into the disease pathogenesis of asthma. Further research is required to evaluate more genetic markers and clinical data and predict asthma risk.
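The "10 times 10-fold cross-validation" scheme can be sketched as an index generator. This is a minimal, unstratified illustration under stated assumptions: the function name, the seed, and the round-robin fold assignment are hypothetical, and the abstract does not say whether the study stratified folds by case status. The sample count of 223 is the 123 patients plus 100 controls reported above.

```python
import random


def repeated_kfold(n_samples, n_splits=10, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation.

    Each repeat reshuffles the sample indices, then deals them into
    `n_splits` folds; every fold serves once as the test set.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        for k in range(n_splits):
            test = indices[k::n_splits]          # every n_splits-th shuffled index
            test_set = set(test)
            train = [i for i in indices if i not in test_set]
            yield train, test


# 123 patients + 100 controls, as reported in the abstract.
splits = list(repeated_kfold(223))
```

With 223 samples this produces 100 train/test splits, each holding out 22 or 23 samples; a model's score would then be averaged over all 100 test folds.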


Performance of statistical and machine learning-based methods for predicting biogeographical patterns of fungal productivity in forest ecosystems

Forest Ecosystems ◽

10.1186/s40663-021-00297-w ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Albert Morera ◽

Juan Martínez de Aragón ◽

José Antonio Bonet ◽

Jingjing Liang ◽

Sergio de-Miguel

Keyword(s):

Machine Learning ◽

Decision Tree ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Learning Approaches ◽

Modeling Tools ◽

Modeling Approaches ◽

Extreme Gradient Boosting ◽

Modeling Data

Abstract Background: The prediction of biogeographical patterns from a large number of driving factors with complex interactions, correlations and non-linear dependences requires advanced analytical methods and modeling tools. This study compares different statistical and machine learning-based models for predicting fungal productivity biogeographical patterns as a case study for the thorough assessment of the performance of alternative modeling approaches to provide accurate and ecologically-consistent predictions. Methods: We evaluated and compared the performance of two statistical modeling techniques, namely, generalized linear mixed models and geographically weighted regression, and four techniques based on different machine learning algorithms, namely, random forest, extreme gradient boosting, support vector machine and artificial neural network to predict fungal productivity. Model evaluation was conducted using a systematic methodology combining random, spatial and environmental blocking together with the assessment of the ecological consistency of spatially-explicit model predictions according to scientific knowledge. Results: Fungal productivity predictions were sensitive to the modeling approach and the number of predictors used. Moreover, the importance assigned to different predictors varied between machine learning modeling approaches. Decision tree-based models increased prediction accuracy by more than 10% compared to other machine learning approaches, and by more than 20% compared to statistical models, and resulted in higher ecological consistency of the predicted biogeographical patterns of fungal productivity. Conclusions: Decision tree-based models were the best approach for prediction both in sampling-like environments as well as in extrapolation beyond the spatial and climatic range of the modeling data. In this study, we show that proper variable selection is crucial to create robust models for extrapolation in biophysically differentiated areas. This allows for reducing the dimensions of the ecosystem space described by the predictors of the models, resulting in higher similarity between the modeling data and the environmental conditions over the whole study area. When dealing with spatial-temporal data in the analysis of biogeographical patterns, environmental blocking is postulated as a highly informative technique to be used in cross-validation to assess the prediction error over larger scales.


In silico Prediction of Inhibitory Constant of Thrombin Inhibitors Using Machine Learning

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207322666181220130232 ◽

2019 ◽

Vol 21 (9) ◽

pp. 662-669 ◽

Cited By ~ 1

Author(s):

Junnan Zhao ◽

Lu Zhu ◽

Weineng Zhou ◽

Lingfeng Yin ◽

Yuchen Wang ◽

...

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Regression Tree ◽

Large Data ◽

Thrombin Inhibitors ◽

Coagulation Cascade ◽

Gradient Boosting ◽

Support Vector ◽

Data Set ◽

Descriptor Selection

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade, which is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict Ki values of thrombin inhibitors based on a large data set by using machine learning methods. Because it can find non-intuitive regularities in high-dimensional datasets, machine learning can be used to build effective predictive models. A total of 6554 descriptors for each compound were collected and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four different methods including multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM) were implemented to build prediction models with these selected descriptors. Results: The SVM model was the best one among these methods with R²=0.84, MSE=0.55 for the training set and R²=0.83, MSE=0.56 for the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.


Machine learning models to identify low adherence to influenza vaccination among Korean adults with cardiovascular disease

BMC Cardiovascular Disorders ◽

10.1186/s12872-021-01925-7 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Moojung Kim ◽

Young Jae Kim ◽

Sung Jin Park ◽

Kwang Gi Kim ◽

Pyung Chun Oh ◽

...

Keyword(s):

Machine Learning ◽

Cardiovascular Disease ◽

Influenza Vaccination ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Support Vector ◽

Age Group ◽

Learning Models ◽

Extreme Gradient Boosting ◽

Machine Learning Models

Abstract Background: Annual influenza vaccination is an important public health measure to prevent influenza infections and is strongly recommended for cardiovascular disease (CVD) patients, especially in the current coronavirus disease 2019 (COVID-19) pandemic. The aim of this study is to develop a machine learning model to identify Korean adult CVD patients with low adherence to influenza vaccination. Methods: Adults with CVD (n = 815) from a nationally representative dataset of the Fifth Korea National Health and Nutrition Examination Survey (KNHANES V) were analyzed. Among these adults, 500 (61.4%) had answered "yes" to whether they had received seasonal influenza vaccinations in the past 12 months. The classification process was performed using the logistic regression (LR), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGB) machine learning techniques. Because the Ministry of Health and Welfare in Korea offers free influenza immunization for the elderly, separate models were developed for the < 65 and ≥ 65 age groups. Results: The accuracy of machine learning models using 16 variables as predictors of low influenza vaccination adherence was compared; for the ≥ 65 age group, XGB (84.7%) and RF (84.7%) have the best accuracies, followed by LR (82.7%) and SVM (77.6%). For the < 65 age group, SVM has the best accuracy (68.4%), followed by RF (64.9%), LR (63.2%), and XGB (61.4%). Conclusions: The machine learning models show comparable performance in classifying adult CVD patients with low adherence to influenza vaccination.


Explainable machine learning can outperform Cox regression predictions and provide insights in breast cancer survival

Scientific Reports ◽

10.1038/s41598-021-86327-7 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Arturo Moncada-Torres ◽

Marissa C. van Maaren ◽

Mathijs P. Hendriks ◽

Sabine Siesling ◽

Gijs Geleijnse

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Explicit Knowledge ◽

Cox Regression ◽

Metastatic Breast ◽

Gradient Boosting ◽

Support Vector ◽

Netherlands Cancer Registry ◽

Extreme Gradient Boosting ◽

The Impact

Abstract: Cox Proportional Hazards (CPH) analysis is the standard for survival analysis in oncology. Recently, several machine learning (ML) techniques have been adapted for this task. Although they have been shown to yield results at least as good as classical methods, they are often disregarded because of their lack of transparency and little to no explainability, which are key for their adoption in clinical settings. In this paper, we used data from the Netherlands Cancer Registry of 36,658 non-metastatic breast cancer patients to compare the performance of CPH with ML techniques (Random Survival Forests, Survival Support Vector Machines, and Extreme Gradient Boosting [XGB]) in predicting survival using the c-index. We demonstrated that in our dataset, ML-based models can perform at least as well as the classical CPH regression (c-index ∼ 0.63), and in the case of XGB even better (c-index ∼ 0.73). Furthermore, we used Shapley Additive Explanation (SHAP) values to explain the models’ predictions. We concluded that the difference in performance can be attributed to XGB’s ability to model nonlinearities and complex interactions. We also investigated the impact of specific features on the models’ predictions as well as their corresponding insights. Lastly, we showed that explainable ML can generate explicit knowledge of how models make their predictions, which is crucial in increasing the trust and adoption of innovative ML techniques in oncology and healthcare overall.
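The c-index used to compare the models can be illustrated with a small pure-Python sketch of Harrell's concordance index. This is an assumption about the variant: the abstract does not specify how ties or censoring were handled, so the version below simply skips pairs with tied times and counts tied risk scores as half-concordant. The function name and data are illustrative.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's c-index for right-censored survival data.

    A pair (i, j) is comparable when subject i had an observed event
    (events[i] == 1) strictly before time j. The pair is concordant when
    the earlier-failing subject was given the higher risk score.
    Pairs with tied times are skipped (a simplifying assumption).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Illustrative data only: follow-up times, event flags (0 = censored), risks.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [0.9, 0.7, 0.2, 0.4])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported scores of roughly 0.63 and 0.73 in context.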


Machine learning as a successful approach for predicting complex spatio–temporal patterns in animal species abundance

Animal Biodiversity and Conservation ◽

10.32800/abc.2021.44.0289 ◽

2021 ◽

pp. 289-301

Author(s):

B. Martín ◽

J. González–Arias ◽

J. A. Vicente–Vírseda

Keyword(s):

Machine Learning ◽

Random Forest ◽

Animal Species ◽

Temporal Patterns ◽

Additive Models ◽

Gradient Boosting ◽

Support Vector ◽

Stochastic Gradient Boosting ◽

Extreme Gradient Boosting ◽

Spatio Temporal

Our aim was to identify an optimal analytical approach for accurately predicting complex spatio–temporal patterns in animal species distribution. We compared the performance of eight modelling techniques (generalized additive models, regression trees, bagged CART, k–nearest neighbors, stochastic gradient boosting, support vector machines, neural network, and random forest, an enhanced form of bootstrap), and also performed extreme gradient boosting, an enhanced form of gradient boosting, to predict spatial patterns in abundance of migrating Balearic shearwaters based on data gathered within eBird. Derived from open–source datasets, proxies of frontal systems and ocean productivity domains that have been previously used to characterize the oceanographic habitats of seabirds were quantified, and then used as predictors in the models. The random forest model showed the best performance according to the parameters assessed (RMSE value and R²). The correlation between observed and predicted abundance with this model was also considerably high. This study shows that the combination of machine learning techniques and massive data provided by open data sources is a useful approach for identifying the long–term spatial–temporal distribution of species at regional spatial scales.
