Machine Learning Approaches for Auto Insurance Big Data

Risks ◽  
2021 ◽  
Vol 9 (2) ◽  
pp. 42 ◽  
Author(s):  
Mohamed Hanafy ◽  
Ruixing Ming

The growing trend in the number and severity of auto insurance claims creates a need for new methods to handle these claims efficiently. Machine learning (ML) is one such method. As car insurers aim to improve their customer service, they have begun adopting ML to interpret and understand their data more efficiently, and thereby to serve customers better through a clearer understanding of their needs. This study considers how automotive insurance providers incorporate machine learning in their companies, and explores how ML models can be applied to insurance big data. We utilize various ML methods, such as logistic regression, XGBoost, random forest, decision trees, naïve Bayes, and K-NN, to predict claim occurrence, and we evaluate and compare these models' performances. The results showed that RF performed better than the other methods, with accuracy, kappa, and AUC values of 0.8677, 0.7117, and 0.840, respectively.
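The comparison the authors describe can be reproduced in outline with scikit-learn and xgboost. The sketch below is illustrative only, not the authors' pipeline: a synthetic dataset stands in for the (unavailable) claims data, and it reports the same three metrics (accuracy, Cohen's kappa, ROC AUC).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, cohen_kappa_score, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

# Synthetic stand-in for policyholder features and a binary claim-occurrence label.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "xgboost": XGBClassifier(eval_metric="logloss"),
    "random_forest": RandomForestClassifier(n_estimators=500, random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(n_neighbors=15),
}

# Fit each model and report the three metrics used in the study.
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    prob = model.predict_proba(X_te)[:, 1]
    print(f"{name:14s} acc={accuracy_score(y_te, pred):.4f} "
          f"kappa={cohen_kappa_score(y_te, pred):.4f} "
          f"auc={roc_auc_score(y_te, prob):.4f}")
```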

2021 ◽  
Author(s):  
Chris J. Kennedy ◽  
Dustin G. Mark ◽  
Jie Huang ◽  
Mark J. van der Laan ◽  
Alan E. Hubbard ◽  
...  

Background: Chest pain is the second leading reason for emergency department (ED) visits and is commonly identified as a leading driver of low-value health care. Accurate identification of patients at low risk of major adverse cardiac events (MACE) is important to improve resource allocation and reduce over-treatment. Objectives: We sought to assess machine learning (ML) methods and electronic health record (EHR) covariate collection for MACE prediction. We aimed to maximize the pool of low-risk patients who are accurately predicted to have less than 0.5% MACE risk and may be eligible for reduced testing. Population Studied: 116,764 adult patients presenting with chest pain in the ED and evaluated for potential acute coronary syndrome (ACS). The 60-day MACE rate was 1.9%. Methods: We evaluated ML algorithms (lasso, splines, random forest, extreme gradient boosting, Bayesian additive regression trees) and SuperLearner stacked ensembling. We tuned ML hyperparameters through nested ensembling, and imputed missing values with generalized low-rank models (GLRM). We benchmarked performance against key biomarkers, validated clinical risk scores, decision trees, and logistic regression. We explained the models through variable importance ranking and accumulated local effect visualization. Results: The best discrimination (area under the precision-recall [PR-AUC] and receiver operating characteristic [ROC-AUC] curves) was provided by SuperLearner ensembling (0.148, 0.867), followed by random forest (0.146, 0.862). Logistic regression (0.120, 0.842) and decision trees (0.094, 0.805) exhibited worse discrimination, as did risk scores [HEART (0.064, 0.765), EDACS (0.046, 0.733)] and biomarkers [serum troponin level (0.064, 0.708), electrocardiography (0.047, 0.686)]. The ensemble's risk estimates were miscalibrated by 0.2 percentage points. The ensemble accurately identified 50% of patients to be below a 0.5% 60-day MACE risk threshold. The most important predictors were age, peak troponin, HEART score, EDACS score, and electrocardiogram. GLRM imputation achieved a 90% reduction in root mean-squared error compared to median-mode imputation. Conclusion: Use of ML algorithms, combined with broad predictor sets, improved MACE risk prediction compared to simpler alternatives, while providing calibrated predictions and interpretability. Standard risk scores may neglect important health information available in other characteristics and combined in nuanced ways via ML.
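SuperLearner-style stacked ensembling can be approximated with scikit-learn's StackingClassifier, which likewise combines cross-validated base-model predictions through a meta-learner. The sketch below is a simplified analogue, not the study's implementation: synthetic data stands in for the EHR cohort, and the GLRM imputation and nested hyperparameter tuning steps are omitted.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score

# Synthetic stand-in: a rare outcome (~2%) mimicking the 60-day MACE rate.
X, y = make_classification(n_samples=20000, n_features=30, weights=[0.98], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

base = [
    ("lasso", LogisticRegression(penalty="l1", solver="liblinear", C=0.1)),
    ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
    ("gbm", GradientBoostingClassifier(random_state=0)),
]
# The meta-learner combines out-of-fold base-model predictions --
# the same idea as SuperLearner's weighted combination of learners.
ens = StackingClassifier(estimators=base,
                         final_estimator=LogisticRegression(max_iter=1000), cv=5)
ens.fit(X_tr, y_tr)

risk = ens.predict_proba(X_te)[:, 1]
print("PR-AUC :", average_precision_score(y_te, risk))
print("ROC-AUC:", roc_auc_score(y_te, risk))
# Share of patients below the 0.5% risk threshold used in the study.
print("share below 0.5% risk:", (risk < 0.005).mean())
```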


2021 ◽  
Author(s):  
Jessica Röhner ◽  
Philipp Thoss ◽  
Astrid Schütz

Research has shown that even experts cannot detect faking above chance, but recent studies have suggested that machine learning may help in this endeavor. However, faking differs between faking conditions, previous efforts have not taken these differences into account, and faking indices have yet to be integrated into such approaches. We reanalyzed seven data sets (N = 1,039) with various faking conditions (high and low scores, different constructs, naïve and informed faking, faking with and without practice, different measures [self-reports vs. implicit association tests; IATs]). We investigated the extent to which and how machine learning classifiers could detect faking under these conditions and compared different input data (response patterns, scores, faking indices) and different classifiers (logistic regression, random forest, XGBoost). We also explored the features that classifiers used for detection. Our results show that machine learning has the potential to detect faking, but detection success varies between conditions from chance levels to 100%. There were differences in detection (e.g., detecting low-score faking was better than detecting high-score faking). For self-reports, response patterns and scores were comparable with regard to faking detection, whereas for IATs, faking indices and response patterns were superior to scores. Logistic regression and random forest worked about equally well and outperformed XGBoost. In most cases, classifiers used more than one feature (faking occurred over different pathways), and the features varied in their relevance. Our research supports the assumption of different faking processes and explains why detecting faking is a complex endeavor.
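The study's comparison of input representations and classifiers amounts to a grid of (feature set × classifier) cross-validated fits. The sketch below is purely schematic, with random placeholder arrays in place of the actual response patterns, scale scores, and faking indices; only the overall structure of the comparison is illustrated.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 1039                      # sample size reported in the abstract
y = rng.integers(0, 2, n)     # placeholder honest (0) vs. faking (1) label

# Placeholder feature sets; in the study these were item-level response
# patterns, scale scores, and established faking indices.
feature_sets = {
    "response_patterns": rng.normal(size=(n, 60)),
    "scores": rng.normal(size=(n, 5)),
    "faking_indices": rng.normal(size=(n, 3)),
}
classifiers = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "xgboost": XGBClassifier(eval_metric="logloss"),
}

# Cross-validated AUC for every (feature set, classifier) combination.
for f_name, X in feature_sets.items():
    for c_name, clf in classifiers.items():
        auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{f_name:18s} {c_name:14s} cv AUC={auc:.3f}")
```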


Author(s):  
Krishna Kumar Mohbey

In any industry, attrition is a big problem, whether it is employee attrition in an organization or customer attrition on an e-commerce site. If we can accurately predict which customers or employees will leave their current company or organization, the employer can save considerable time, effort, and cost, and can hire or acquire substitutes in advance so that ongoing work is not disrupted. In this chapter, a comparative analysis of various machine learning approaches, namely naïve Bayes, SVM, decision tree, random forest, and logistic regression, is presented. The presented results help identify employees who are likely to leave in the near future. Experimental results reveal that the logistic regression approach reaches up to 86% accuracy, outperforming the other machine learning approaches.
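A minimal logistic-regression attrition sketch, assuming a typical HR table with mixed numeric and categorical columns; the column names and toy rows here are hypothetical, not from the chapter's dataset:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical HR attrition table; real data would come from the organization.
df = pd.DataFrame({
    "age": [29, 41, 35, 50, 23, 38],
    "monthly_income": [3200, 7400, 5100, 9800, 2800, 6100],
    "department": ["sales", "r&d", "r&d", "hr", "sales", "r&d"],
    "overtime": ["yes", "no", "yes", "no", "yes", "no"],
    "attrition": [1, 0, 1, 0, 1, 0],
})
X, y = df.drop(columns="attrition"), df["attrition"]

# Scale numeric columns, one-hot encode categorical ones, then fit LR.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["age", "monthly_income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["department", "overtime"]),
])
model = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33,
                                          stratify=y, random_state=0)
model.fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, model.predict(X_te)))
```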


Author(s):  
Joshua J. Levy ◽  
A. James O’Malley

Background: Machine learning approaches have become increasingly popular modeling techniques that rely on data-driven heuristics to arrive at their solutions. Recent comparisons between these algorithms and traditional statistical modeling techniques have largely ignored the advantage the former gain from their built-in model-building search algorithms. This has led to the alignment of statistical and machine learning approaches with different types of problems and to the under-development of procedures that combine their attributes. In this context, we hoped to understand the domains of applicability of each approach and to identify areas where a marriage between the two is warranted. We then sought to develop a hybrid statistical-machine learning procedure with the best attributes of each. Methods: We present three simple examples to illustrate when to use each modeling approach and posit a general framework for combining them into an enhanced logistic regression model-building procedure that aids interpretation. We study 556 benchmark machine learning datasets to uncover when machine learning techniques outperformed rudimentary logistic regression models and so are potentially well-equipped to enhance them. We illustrate a software package, InteractionTransformer, which embeds logistic regression with advanced model-building capacity by using machine learning algorithms to extract candidate interaction features from a random forest model for inclusion in the model. Finally, we apply our enhanced logistic regression analysis to two real-world biomedical examples, one where predictors vary linearly with the outcome and another with extensive second-order interactions. Results: Preliminary statistical analysis demonstrated that across the 556 benchmark datasets, the random forest approach significantly outperformed the logistic regression approach. We found a statistically significant increase in predictive performance when using hybrid procedures, and greater clarity in the association of the acquired terms with the outcome compared to directly interpreting the random forest output. Conclusions: When a random forest model is closer to the true model, hybrid statistical-machine learning procedures can substantially enhance the performance of statistical procedures in an automated manner while preserving easy interpretation of the results. Such hybrid methods may help facilitate widespread adoption of machine learning techniques in the biomedical setting.
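The core idea, using a tree ensemble to nominate candidate interaction terms and then including them in an interpretable logistic regression, can be sketched as below. This is a simplified stand-in for the workflow, not the InteractionTransformer package's actual API: here candidate interactions are pairwise products of the features the forest ranks most important.

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=4000, n_features=10, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Step 1: a random forest nominates the most influential features.
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
top = np.argsort(rf.feature_importances_)[::-1][:4]

# Step 2: pairwise products of those features become candidate interaction
# terms appended to the design matrix of a plain logistic regression.
def add_interactions(X):
    terms = [(X[:, i] * X[:, j])[:, None] for i, j in combinations(top, 2)]
    return np.hstack([X] + terms)

base = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
hybrid = LogisticRegression(max_iter=2000).fit(add_interactions(X_tr), y_tr)

print("plain LR AUC :", roc_auc_score(y_te, base.predict_proba(X_te)[:, 1]))
print("hybrid LR AUC:", roc_auc_score(
    y_te, hybrid.predict_proba(add_interactions(X_te))[:, 1]))
```

The hybrid model remains a logistic regression, so each interaction term keeps an interpretable coefficient, which is the point the paper makes about preserving easy interpretation.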


Author(s):  
Oyelakin A. M ◽  
Alimi O. M ◽  
Mustapha I. O ◽  
Ajiboye I. K

Phishing attacks have been used in different ways to harvest the confidential information of unsuspecting internet users. To stem the tide of phishing-based attacks, several machine learning techniques have been proposed in the past. However, few studies have investigated both single and ensemble machine learning-based models for the classification of phishing attacks. This study carried out a performance analysis of selected single and ensemble machine learning (ML) classifiers in phishing classification. The focus is to investigate how these algorithms behave in the classification of phishing attacks in the chosen dataset. Logistic Regression and Decision Trees were chosen as the single classifiers, while a simple voting technique and Random Forest were used as the ensemble machine learning algorithms. Accuracy, Precision, Recall, and F1-score were used as performance metrics. The Logistic Regression algorithm recorded an accuracy of 0.86, a precision of 0.89, a recall of 0.87, and an F1-score of 0.81. Similarly, the Decision Trees classifier achieved an accuracy of 0.87, a precision of 0.83, a recall of 0.88, and an F1-score of 0.81. The voting ensemble achieved an accuracy of 0.92, a precision of 0.90, a recall of 0.92, and an F1-score of 0.92. The Random Forest algorithm recorded 0.98, 0.97, 0.98, and 0.97 for accuracy, precision, recall, and F1-score, respectively. From the experimental analyses, the Random Forest algorithm outperformed the voting ensemble and the two single algorithms for phishing URL detection. The study established that the ensemble techniques used in the experiments are more efficient for phishing URL identification than the single classifiers.
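A minimal sketch of the single-versus-ensemble comparison with scikit-learn: Logistic Regression and a Decision Tree as single learners, a soft-voting combination of the two as one simple voting scheme, and a Random Forest, all scored with the four metrics reported above. A synthetic dataset replaces the study's phishing data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic stand-in for URL/page features labelled phishing (1) vs legitimate (0).
X, y = make_classification(n_samples=6000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

lr = LogisticRegression(max_iter=1000)
dt = DecisionTreeClassifier(random_state=0)
models = {
    "logistic": lr,
    "decision_tree": dt,
    "voting": VotingClassifier([("lr", lr), ("dt", dt)], voting="soft"),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
}

for name, m in models.items():
    m.fit(X_tr, y_tr)
    p = m.predict(X_te)
    print(f"{name:14s} acc={accuracy_score(y_te, p):.2f} "
          f"prec={precision_score(y_te, p):.2f} "
          f"rec={recall_score(y_te, p):.2f} f1={f1_score(y_te, p):.2f}")
```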


2021 ◽  
Vol 11 ◽  
Author(s):  
Xin Chen ◽  
Haoru Wang ◽  
Kaiping Huang ◽  
Huan Liu ◽  
Hao Ding ◽  
...  

Purpose: MYCN amplification plays a critical role in defining the high-risk subgroup of patients with neuroblastoma. We aimed to develop and validate CT-based machine learning models for predicting MYCN amplification in pediatric abdominal neuroblastoma. Methods: A total of 172 patients, 47 with MYCN-amplified tumors and 125 without, were enrolled. The cohort was randomly split, using stratified sampling, into training and testing groups. Clinicopathological parameters and radiographic features were selected to construct the clinical predictive model. Regions of interest (ROIs) were segmented on three-phase CT images to extract first-, second-, and higher-order radiomics features. ICC, mRMR, and LASSO methods were used for dimensionality reduction. The features selected from the training group were used to establish radiomics models using Logistic regression, Support Vector Machine (SVM), Bayes, and Random Forest methods. The performance of the four radiomics models was evaluated by the area under the receiver operating characteristic (ROC) curve (AUC) and compared with the DeLong test. A nomogram incorporating clinicopathological parameters, radiographic features, and the radiomics signature was developed through multivariate logistic regression. Finally, the predictive performance of the clinical model, the radiomics models, and the nomogram was evaluated in both the training and testing groups. Results: In total, 1,218 radiomics features were extracted from the ROIs on three-phase CT images, and 14 optimal features, comprising one original first-order feature, eight wavelet-transformed features, and five LoG-transformed features, were selected to construct the radiomics models. In the training group, the AUCs of the Logistic, SVM, Bayes, and Random Forest models were 0.940, 0.940, 0.780, and 0.927, respectively; the corresponding AUCs in the testing group were 0.909, 0.909, 0.729, and 0.851. There was no significant difference among the Logistic, SVM, and Random Forest models, but all performed better than the Bayes model (p < 0.005). The predictive performance of the three-phase Logistic radiomics model was similar to that of the nomogram, and both were better than the clinical model and the radiomics model based on the venous phase alone. Conclusion: The CT-based radiomics signature is able to predict MYCN amplification in pediatric abdominal neuroblastoma with high accuracy using SVM, Logistic, and Random Forest classifiers, while the Bayes classifier yields lower predictive performance. When combined with clinical and radiographic qualitative features, the clinical-radiomics nomogram can improve the performance of predicting MYCN amplification.
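The dimensionality-reduction stage, going from roughly 1,200 radiomics features to a handful before classifier fitting, can be sketched with an L1-penalized logistic regression as the LASSO selector. This is a schematic with placeholder arrays, not the study's pipeline: the ICC reliability and mRMR filtering steps are omitted, and random data stands in for the extracted features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 172, 1218              # cohort size and feature count from the abstract
X = rng.normal(size=(n, p))   # placeholder for extracted radiomics features
y = rng.integers(0, 2, n)     # placeholder MYCN-amplification label

# L1-penalized logistic regression keeps the 14 largest-coefficient features
# (mirroring the 14 optimal features), then an SVM is fitted on the survivors.
lasso = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
    max_features=14, threshold=-np.inf)
model = Pipeline([
    ("scale", StandardScaler()),
    ("select", lasso),
    ("clf", SVC(probability=True)),
])
print("cv AUC:", cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())
```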


Author(s):  
Surender Reddy Salkuti

This paper presents a detailed analysis of big data and machine learning (ML) in the electrical power and energy sector. Big data analytics for smart energy operations, applications, impact, measurement and control, and the associated challenges are presented. Big data and machine learning approaches should be applied only after the power system problem has been analyzed carefully; determining the match between the strengths of big data and machine learning and the problem at hand is of utmost importance. These approaches can be of great help in planning and operating the traditional grid and the smart grid (SG). The basics of big data and machine learning are described in detail, along with their applications in various fields such as electrical power and energy, health care and life sciences, government, telecommunications, web and digital media, retail, finance, e-commerce and customer service, etc. Finally, the challenges and opportunities of big data and machine learning are presented.


2021 ◽  
Vol 11 (18) ◽  
pp. 8596
Author(s):  
Swetha Chittam ◽  
Balakrishna Gokaraju ◽  
Zhigang Xu ◽  
Jagannathan Sankar ◽  
Kaushik Roy

There is a high need in the materials science community for a big data repository of material compositions and the analytics derived from them, such as metal strength. Currently, many researchers maintain their own Excel sheets, prepared manually by their teams by tabulating experimental data collected from scientific journals, and analyze the data by performing manual calculations to determine the strength of a material. In this study, we propose big data storage for materials science data and its processing-parameter information, to address the laborious process of data tabulation from scientific articles; data mining techniques to retrieve the information from databases for big data analytics; and a machine learning prediction model to determine material strength insights. Three models are proposed, based on Logistic Regression, Support Vector Machine (SVM), and Random Forest algorithms. These models are trained and tested using a 10-fold cross-validation approach. The Random Forest classification model performed best on the independent dataset, with 87% accuracy, compared to 72% for Logistic Regression and 78% for SVM.
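The 10-fold cross-validated comparison of the three models can be sketched as follows; synthetic features stand in for the tabulated composition and processing data, and the strength label is a placeholder.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for composition/processing features and a strength class.
X, y = make_classification(n_samples=1000, n_features=15, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "svm": make_pipeline(StandardScaler(), SVC()),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
}
for name, m in models.items():
    acc = cross_val_score(m, X, y, cv=cv, scoring="accuracy").mean()
    print(f"{name:14s} 10-fold accuracy={acc:.3f}")
```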


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Benjamin Ming Kit Siu ◽  
Gloria Hyunjung Kwak ◽  
Lowell Ling ◽  
Pan Hui

Early and accurate prediction of the need for intubation may provide more time for preparation and increase safety margins by avoiding high-risk late intubation. This study evaluates whether machine learning can predict the need for intubation within 24 h using commonly available bedside and laboratory parameters taken at critical care admission. We extracted data from two large critical care databases (MIMIC-III and eICU-CRD). Missing variables were imputed using an autoencoder. Machine learning classifiers using logistic regression and random forest were trained on 60% of the data and tested on the remaining 40%. We compared the performance of the logistic regression and random forest models in predicting intubation in critically ill patients. After excluding patients with limitations of therapy and missing data, we included 17,616 critically ill patients in this retrospective cohort. Within 24 h of admission, 2,292 patients required intubation, whilst 15,324 patients were not intubated. Blood gas parameters (PaO2, PaCO2, HCO3−), Glasgow Coma Score, respiratory variables (respiratory rate, SpO2), temperature, age, and oxygen therapy were used to predict intubation. Random forest had an AUC of 0.86 (95% CI 0.85–0.87) and logistic regression had an AUC of 0.77 (95% CI 0.76–0.78) for intubation prediction. The random forest model had a sensitivity of 0.88 (95% CI 0.86–0.90) and a specificity of 0.66 (95% CI 0.63–0.69), with good calibration throughout the range of intubation risks. The results showed that machine learning can predict the need for intubation in critically ill patients using commonly collected bedside clinical parameters and laboratory results. It may be used in real time to help clinicians predict the need for intubation within 24 h of intensive care unit admission.
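A minimal sketch of the reported comparison: random forest versus logistic regression on a 60/40 split, with ROC AUC plus sensitivity and specificity at an operating threshold. Synthetic data substitutes for MIMIC-III/eICU-CRD, the autoencoder imputation step is omitted, and the 0.2 probability threshold below is an example choice, not the study's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix

# Synthetic stand-in: ~13% intubation rate, mirroring 2,292 / 17,616,
# with 9 features echoing the 9 predictor groups listed in the abstract.
X, y = make_classification(n_samples=17616, n_features=9, weights=[0.87],
                           random_state=0)
# 60/40 train/test split as in the study.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4,
                                          stratify=y, random_state=0)

for name, m in [("random_forest", RandomForestClassifier(n_estimators=300,
                                                         random_state=0)),
                ("logistic", LogisticRegression(max_iter=1000))]:
    m.fit(X_tr, y_tr)
    prob = m.predict_proba(X_te)[:, 1]
    pred = prob >= 0.2  # example operating threshold (assumption)
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    print(f"{name:14s} AUC={roc_auc_score(y_te, prob):.3f} "
          f"sens={tp/(tp+fn):.3f} spec={tn/(tn+fp):.3f}")
```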


2019 ◽  
Author(s):  
Oskar Flygare ◽  
Jesper Enander ◽  
Erik Andersson ◽  
Brjánn Ljótsson ◽  
Volen Z Ivanov ◽  
...  

Background: Previous attempts to identify predictors of treatment outcomes in body dysmorphic disorder (BDD) have yielded inconsistent findings. One way to increase precision and clinical utility could be to use machine learning methods, which can incorporate multiple non-linear associations in prediction models. Methods: This study used a random forests machine learning approach to test whether it is possible to reliably predict remission from BDD in a sample of 88 individuals who had received internet-delivered cognitive behavioral therapy for BDD. The random forest models were compared to traditional logistic regression analyses. Results: Random forests correctly identified 78% of participants as remitters or non-remitters at post-treatment. The accuracy of prediction was lower at subsequent follow-ups (68%, 66%, and 61% correctly classified at the 3-, 12-, and 24-month follow-ups, respectively). Depressive symptoms, treatment credibility, working alliance, and initial severity of BDD were among the most important predictors at the beginning of treatment. By contrast, the logistic regression models did not identify consistent and strong predictors of remission from BDD. Conclusions: The results provide initial support for the clinical utility of machine learning approaches in the prediction of outcomes of patients with BDD. Trial registration: ClinicalTrials.gov ID: NCT02010619.
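One common way to reproduce the predictor-ranking step with a fitted random forest is scikit-learn's permutation importance. The sketch below uses synthetic data at the study's sample size (n = 88) and hypothetical feature names echoing the predictors named above; it illustrates the technique, not the study's analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical predictor names echoing those reported in the abstract.
feature_names = ["depressive_symptoms", "treatment_credibility",
                 "working_alliance", "initial_bdd_severity", "age", "gender"]
X, y = make_classification(n_samples=88, n_features=len(feature_names),
                           n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
# Shuffle each feature in turn and measure the drop in held-out accuracy.
imp = permutation_importance(rf, X_te, y_te, n_repeats=50, random_state=0)
for i in np.argsort(imp.importances_mean)[::-1]:
    print(f"{feature_names[i]:22s} {imp.importances_mean[i]:+.3f}")
```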

