Validation of Machine Learning Models for Health Insurance Risks Assessment

The success of a universal healthcare policy is impossible without the use of insurance instruments. The healthcare and insurance industries are on the verge of integrating seamlessly with the help of sensors and algorithms. This research work focuses on validating algorithms that can help to model and classify health insurance risk data. Several algorithms, namely Logistic Regression (LR), K-Nearest Neighbors (KNN), Decision Tree (DT), Naive Bayes (NB), and Support Vector Machine (SVM), were evaluated, and an objective validation of these algorithms is demonstrated. To maintain the replicability of the study, the data and code are available in a public repository. From the study, it is clear that the KNN algorithm is best suited as a risk classifier. This is evident from the values of R², the error metrics, completeness score, explained variance, normalized mutual information score, V-measure score, precision, recall, F1 score, and accuracy metrics. Secondly, the algorithms were validated using 10-fold cross-validation with five types of performance metrics. In almost all cases, it was found that the KNN algorithm performs consistently and is numerically the most suitable. This can be attributed to the tight standard deviations of its performance metrics across the evaluation. From all the validation tests, it can be claimed that, on the current dataset, the KNN algorithm with the Accuracy, Homogeneity Score, Explained Variance, and Normalized Mutual Information Score hyper-parameter configuration is the best performer.
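The validation protocol summarized above can be reproduced with a standard scikit-learn workflow. The sketch below is a minimal illustration, assuming the risk data are available as a CSV file with a hypothetical name and label column; it is not the authors' exact code.

```python
# Minimal sketch: 10-fold cross-validation of a KNN risk classifier
# with several of the metrics mentioned in the abstract.
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset layout: feature columns plus a "risk" label column.
df = pd.read_csv("insurance_risk.csv")
X, y = df.drop(columns=["risk"]), df["risk"]

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_validate(
    model, X, y, cv=10,
    scoring=["accuracy", "precision_macro", "recall_macro", "f1_macro"],
)

# Consistency check: a small standard deviation across folds indicates
# the stable behaviour attributed to KNN in the study.
for name, values in scores.items():
    if name.startswith("test_"):
        print(f"{name}: mean={values.mean():.3f}, std={values.std():.3f}")
```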

Author(s): Abdullahi Adeleke, Noor Azah Samsudin, Mohd Hisyam Abdul Rahim, Shamsul Kamal Ahmad Khalid, Riswan Efendi

Machine learning involves the task of training systems to make decisions without being explicitly programmed. Important among machine learning tasks is classification, the process of training machines to make predictions from predefined labels. Classification is broadly categorized into three distinct groups: single-label (SL), multi-class, and multi-label (ML) classification. This research work presents an application of a multi-label classification (MLC) technique to automating the labeling of Quranic verses. MLC has been gaining attention in recent years owing to the increasing number of works addressing real-world classification problems with multi-label data. In traditional classification problems, a pattern is associated with a single label from a set of disjoint labels. In MLC, however, an instance of data is associated with a set of labels. In this paper, three standard MLC methods, binary relevance (BR), classifier chain (CC), and label powerset (LP), are implemented with four baseline classifiers: support vector machine (SVM), naïve Bayes (NB), k-nearest neighbors (k-NN), and J48. The research methodology adopts the multi-label problem transformation (PT) approach. The results are validated using six conventional performance metrics: hamming loss, accuracy, one-error, micro-F1, macro-F1, and average precision. From the results, the classifiers effectively achieved above the 70% accuracy mark. Overall, SVM achieved the best results with the CC and LP algorithms.
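The problem transformation methods named above can be sketched directly in scikit-learn, which provides binary relevance (one classifier per label) and classifier chains; label powerset is available in the separate scikit-multilearn package. In this illustrative sketch a synthetic multi-label dataset stands in for the Quranic verse data.

```python
# Sketch of two problem transformation methods (binary relevance and
# classifier chain) evaluated with hamming loss and micro/macro F1.
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier, ClassifierChain
from sklearn.svm import LinearSVC
from sklearn.metrics import hamming_loss, f1_score

X, Y = make_multilabel_classification(n_samples=500, n_classes=5, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.3, random_state=0)

methods = {
    "binary relevance": MultiOutputClassifier(LinearSVC()),
    "classifier chain": ClassifierChain(LinearSVC(), random_state=0),
}
for name, clf in methods.items():
    Y_pred = clf.fit(X_tr, Y_tr).predict(X_te)
    print(name,
          "hamming loss:", round(hamming_loss(Y_te, Y_pred), 3),
          "micro-F1:", round(f1_score(Y_te, Y_pred, average="micro"), 3),
          "macro-F1:", round(f1_score(Y_te, Y_pred, average="macro"), 3))
```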


2019, Vol. 9 (18), pp. 3723
Author(s): Sharif, Mumtaz, Shafiq, Riaz, Ali, ...

The rise of social media has led to an increasing online cyber-war via hateful and violent comments or speeches, and even slick videos, that promotes extremism and radicalization. Sensing cyber-extreme content on microblogging sites, specifically Twitter, is a challenging and evolving research area, since such content is short, noisy, context-dependent, and dynamic in nature. The related tweets were crawled using query words and then carefully labelled into two classes: Extreme (having two sub-classes: pro-Afghanistan government and pro-Taliban) and Neutral. An Exploratory Data Analysis (EDA) using Principal Component Analysis (PCA) was performed on the tweet data (represented by Term Frequency-Inverse Document Frequency (TF-IDF) features) to reduce the high-dimensional feature space into a low-dimensional (usually 2-D or 3-D) space. The PCA-based visualization showed better cluster separation between the two classes (extreme and neutral), whereas cluster separation within the sub-classes of the extreme class was not clear. The paper also discusses the pros and cons of applying PCA as an EDA tool for textual data, which is usually represented by a high-dimensional feature set. Furthermore, classification algorithms such as naïve Bayes, K-Nearest Neighbors (KNN), random forest, Support Vector Machine (SVM), and ensemble classification methods (with bagging and boosting) were applied both with PCA-based reduced features and with the complete feature set (TF-IDF features extracted from n-gram terms in the tweets). The analysis showed that SVM achieved an average accuracy of 84%, higher than the other classification models. It is pertinent to mention that this is a novel research contribution in the context of the Afghanistan war zone for Twitter content analysis using machine learning methods.
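The TF-IDF plus PCA pipeline described above can be sketched as follows. This is a minimal illustration only: the document list is a dummy stand-in for the crawled, labelled tweets, and scikit-learn's PCA needs a dense matrix, so the sparse TF-IDF output is densified (TruncatedSVD is the usual alternative for large corpora).

```python
# Sketch: TF-IDF features, PCA for a 2-D exploratory projection, and an
# SVM fitted on the reduced features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Dummy documents and labels standing in for the labelled tweets
# (0 = neutral, 1 = extreme); the real corpus is not reproduced here.
docs = ["tweet text one", "tweet text two", "tweet text three",
        "tweet text four", "tweet text five", "tweet text six"]
labels = [0, 0, 0, 1, 1, 1]

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)
X_2d = PCA(n_components=2).fit_transform(X.toarray())  # PCA requires dense input

clf = SVC(kernel="linear").fit(X_2d, labels)           # baseline on reduced space
print(clf.predict(X_2d))
```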


Recommendation systems are a subdivision of information filtering that seeks to predict the rating or preference a user would give to an item. Recommendation systems produce user-customized suggestions for products or services. They are used in services such as the Google search engine, YouTube, Gmail, and product recommendation on e-commerce websites. These systems usually depend on a content-based approach. In this paper, we develop this type of recommendation system using several algorithms: K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Logistic Regression (LR), Multinomial Naive Bayes (MNB), and Multi-Layer Perceptron (MLP). These predict the nearest categories from the News Category Dataset; among these categories, we recommend the most common sentence to a user and analyze the performance metrics. The approach is tested on the News Category Dataset, which contains roughly 200k news headlines in 41 classes, collected from HuffPost over the years 2012-2018.
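The headline classification step described here has the general shape sketched below; the file path and the "headline"/"category" field names are assumptions about the dataset layout, and Multinomial Naive Bayes is shown as one of the interchangeable baselines.

```python
# Sketch: predict a news category from a headline with TF-IDF features
# and a Multinomial Naive Bayes baseline.
import json
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical path and field names for the JSON-lines news category file.
with open("News_Category_Dataset.json") as f:
    records = [json.loads(line) for line in f]
headlines = [r["headline"] for r in records]
categories = [r["category"] for r in records]

X_tr, X_te, y_tr, y_te = train_test_split(headlines, categories,
                                          test_size=0.2, random_state=42)
model = Pipeline([("tfidf", TfidfVectorizer(stop_words="english")),
                  ("clf", MultinomialNB())])   # swap in KNN/SVM/LR/MLP here
model.fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te)))
```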


Author(s): Umar Sidiq, Syed Mutahar Aaqib, Rafi Ahmad Khan

Classification is one of the most important supervised learning data mining techniques, used to classify predefined data sets. Classification is mainly used in the healthcare sector for making decisions, building diagnosis systems, and giving better treatment to patients. In this work, the data set used is taken from a recognized laboratory in Kashmir. The entire research work was carried out with Anaconda3 5.2.0, an open-source platform, under a Windows 10 environment. An experimental study was carried out using classification techniques such as k-nearest neighbors, support vector machine, decision tree, and naïve Bayes. The decision tree obtained the highest accuracy of 98.89% over the other classification techniques.
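The experimental comparison described here follows a common scikit-learn pattern; the sketch below shows its general shape, with a bundled toy dataset standing in for the (non-public) laboratory data.

```python
# Sketch: compare the accuracy of the four classifiers on a held-out split.
from sklearn.datasets import load_breast_cancer   # stand-in for the lab data
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

classifiers = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Naive Bayes": GaussianNB(),
}
for name, clf in classifiers.items():
    acc = accuracy_score(y_te, clf.fit(X_tr, y_tr).predict(X_te))
    print(f"{name}: {acc:.4f}")
```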


Author(s): Dorian Ruiz Alonso, Claudia Zepeda Cortés, Hilda Castillo Zacatelco, José Luis Carballido Carranza

In this work, we propose the extension of a methodology for the multi-label classification of feedback according to the Hattie and Timperley feedback model, incorporating a hyperparameter tuning stage. We analyze whether adding a hyperparameter tuning stage before running the support vector machine, random forest, and multi-label k-nearest neighbors algorithms improves the performance metrics of the multi-label classifiers. These classifiers automatically assign the feedback generated by a teacher on the activities submitted by students in online courses on the Blackboard platform to the task, process, regulation, praise, and other levels proposed in the feedback model by Hattie and Timperley. The grid search strategy is used to refine the hyperparameters of each algorithm. The results show that tuning the hyperparameters improves the performance metrics for the data set used.
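Grid-search hyperparameter tuning of a multi-label classifier, as described above, has the general shape sketched below; the parameter grid and the binary-relevance wrapper around a random forest are illustrative assumptions, and a synthetic multi-label dataset stands in for the feedback data.

```python
# Sketch: grid-search hyperparameter tuning for a multi-label classifier
# (binary relevance around a random forest), scored with micro-F1.
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier

X, Y = make_multilabel_classification(n_samples=300, n_classes=5, random_state=0)

param_grid = {                      # illustrative grid only
    "estimator__n_estimators": [100, 300],
    "estimator__max_depth": [None, 10],
}
search = GridSearchCV(
    MultiOutputClassifier(RandomForestClassifier(random_state=0)),
    param_grid, scoring="f1_micro", cv=5,
)
search.fit(X, Y)
print(search.best_params_, round(search.best_score_, 3))
```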


Author(s): Hani Bani-Salameh, Shadi M. Alkhatib, Moawyiah Abdalla, Mo’taz Al-Hami, Ruaa Banat, ...

Background: Diabetes and hypertension are two of the most common diseases in the world. As they unfavorably affect people of different age groups, they have become a cause of concern and must be predicted and diagnosed well in advance. Objective: This research aims to determine the effectiveness of artificial neural networks (ANNs) in predicting diabetes and blood pressure diseases and to point out the factors which have a high impact on these diseases. Sample: This work used two online datasets consisting of data collected from 768 individuals. We applied neural network algorithms to predict whether the individuals have those two diseases based on a set of factors. Diabetes prediction is based on five factors: age, weight, fat ratio, glucose, and insulin, while blood pressure prediction is based on six factors: age, weight, fat ratio, blood pressure, alcohol, and smoking. Method: A model based on the Multi-Layer Perceptron (MLP) neural network was implemented. The inputs of the network were the factors for each disease, while the output was the prediction of the disease’s occurrence. The model performance was compared with other classifiers such as Support Vector Machine (SVM) and K-Nearest Neighbors (KNN). We used performance metrics to assess the accuracy and performance of the MLP. Also, a tool was implemented to help diagnose the diseases and to interpret the results. Result: The model predicted the two diseases with a correct classification rate (CCR) of 77.6% for diabetes and 68.7% for hypertension. The results indicate that the MLP correctly predicts the probability of being diseased or not, and that its performance is significantly better than that of both SVM and KNN. This shows the MLP's effectiveness in early disease prediction.
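A minimal sketch of the comparison described above follows, assuming the five diabetes factors are columns of a CSV file with hypothetical names; the MLP architecture shown is an illustrative choice, not the authors' exact configuration.

```python
# Sketch: multi-layer perceptron predicting diabetes from five factors
# (age, weight, fat ratio, glucose, insulin), compared with SVM and KNN.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("diabetes.csv")            # hypothetical file and column names
X = df[["age", "weight", "fat_ratio", "glucose", "insulin"]]
y = df["diabetic"]                          # 0 = healthy, 1 = diabetic

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
models = {
    "MLP": MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
}
for name, clf in models.items():
    pipe = make_pipeline(StandardScaler(), clf)
    print(name, "CCR:", round(pipe.fit(X_tr, y_tr).score(X_te, y_te), 3))
```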


Author(s): Ahmet Elbir, Hamza Osman Ilhan, Mehmet Furkan Aydin, Yunus Emre Demirbulut

One of the most important problems facing telecommunication companies is the potential transfer of customers between firms. To avoid this problem, it is very important to identify customers who are likely to leave. In this study, the performance of classification and clustering algorithms from machine learning has been evaluated and compared on the analysis of potential customer attrition, commonly reported as churn analysis. K-nearest neighbors, decision trees, random forests, support vector machines, and naive Bayes methods were tested within the scope of classification. Additionally, K-Means and hierarchical clustering methods were tested. The performance of the methods was evaluated according to the accuracy, precision, sensitivity, and F-measure performance metrics.
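The evaluation described here follows a familiar scikit-learn pattern; the sketch below compares one classifier and one clustering method on a hypothetical churn dataset (assumed to have numeric features and a 0/1 "churn" label), reporting the four metrics listed.

```python
# Sketch: churn analysis with one classifier (random forest) and one
# clustering method (k-means), scored with accuracy, precision,
# recall (sensitivity) and F-measure.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             adjusted_rand_score)

# Hypothetical dataset layout: numeric feature columns plus a 0/1 "churn" label.
df = pd.read_csv("telecom_churn.csv")
X, y = df.drop(columns=["churn"]), df["churn"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
prec, rec, f1, _ = precision_recall_fscore_support(y_te, y_pred, average="binary")
print("accuracy:", accuracy_score(y_te, y_pred),
      "precision:", prec, "sensitivity:", rec, "F-measure:", f1)

# Clustering view: group customers into two clusters and measure how well
# the clusters agree with the churn labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster/label agreement (ARI):", adjusted_rand_score(y, clusters))
```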


Author(s): Yasin Kaya

Bundle Branch Block (BBB) beats are the most common Electrocardiogram (ECG) arrhythmias and can be indicators of significant heart disease. This study aimed to provide an effective machine-learning method for the detection of BBB beats. To this end, statistical and temporal features were calculated and the most valuable ones were identified using feature selection algorithms. Forward search, backward elimination, and genetic algorithms were used for feature selection. Three different classifiers, K-Nearest Neighbors (KNN), neural networks, and support vector machines, were used comparatively in this study. Accuracy, specificity, and sensitivity performance metrics were calculated in order to compare the results. Normal sinus rhythm (N), Right Bundle Branch Block (RBBB), and Left Bundle Branch Block (LBBB) ECG beat types were used in the study. All beats of these three types in the MIT-BIH arrhythmia database were used in the experiments. All of the feature sets yielded promising classification accuracy for BBB classification. The KNN classifier using backward-elimination-selected features achieved the highest classification accuracy in the study, at 99.82%. The results showed the proposed approach to be successful in the detection of BBB beats.
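Backward elimination wrapped around a KNN classifier, the best-performing configuration reported above, can be sketched with scikit-learn's SequentialFeatureSelector. The sketch assumes a synthetic binary-labelled feature matrix standing in for the statistical and temporal ECG-beat features (the actual task has three beat classes).

```python
# Sketch: backward-elimination feature selection wrapped around KNN,
# then evaluation of accuracy, sensitivity and specificity (binary case).
from sklearn.datasets import make_classification   # stand-in for ECG features
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
selector = SequentialFeatureSelector(knn, n_features_to_select=8,
                                     direction="backward", cv=5)
X_tr_sel = selector.fit_transform(X_tr, y_tr)
X_te_sel = selector.transform(X_te)

knn.fit(X_tr_sel, y_tr)
tn, fp, fn, tp = confusion_matrix(y_te, knn.predict(X_te_sel)).ravel()
print("accuracy:", (tp + tn) / (tp + tn + fp + fn),
      "sensitivity:", tp / (tp + fn), "specificity:", tn / (tn + fp))
```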


Data, 2019, Vol. 5 (1), pp. 2
Author(s): Shiny Abraham, Chau Huynh, Huy Vu

Hydrologic soil groups play an important role in the determination of surface runoff, which, in turn, is crucial for soil and water conservation efforts. Traditionally, placement of soil into appropriate hydrologic groups is based on the judgement of soil scientists, primarily relying on their interpretation of guidelines published by regional or national agencies. As a result, large-scale mapping of hydrologic soil groups results in widespread inconsistencies and inaccuracies. This paper presents an application of machine learning for classification of soil into hydrologic groups. Based on features such as percentages of sand, silt and clay, and the value of saturated hydraulic conductivity, machine learning models were trained to classify soil into four hydrologic groups. The results of the classification obtained using algorithms such as k-Nearest Neighbors, Support Vector Machine with Gaussian Kernel, Decision Trees, Classification Bagged Ensembles and TreeBagger (Random Forest) were compared to those obtained using estimation based on soil texture. The performance of these models was compared and evaluated using per-class metrics and micro- and macro-averages. Overall, performance metrics related to kNN, Decision Tree and TreeBagger exceeded those for SVM-Gaussian Kernel and Classification Bagged Ensemble. Among the four hydrologic groups, it was noticed that group B had the highest rate of false positives.
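The per-class and micro-/macro-averaged evaluation described above corresponds to a standard multi-class report; a minimal sketch follows, assuming hypothetical column names for the soil properties and a random forest standing in for the TreeBagger model.

```python
# Sketch: classify soil into hydrologic groups A-D from sand/silt/clay
# percentages and saturated hydraulic conductivity, then report per-class
# metrics plus micro and macro averages.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier   # "TreeBagger" analogue
from sklearn.metrics import classification_report, f1_score

df = pd.read_csv("soil_properties.csv")   # hypothetical column names
X = df[["sand_pct", "silt_pct", "clay_pct", "ksat"]]
y = df["hydrologic_group"]                # values A, B, C, D

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
y_pred = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)

print(classification_report(y_te, y_pred))                 # per-class metrics
print("micro-F1:", f1_score(y_te, y_pred, average="micro"),
      "macro-F1:", f1_score(y_te, y_pred, average="macro"))
```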


Entropy, 2021, Vol. 23 (2), pp. 257
Author(s): Anaahat Dhindsa, Sanjay Bhatia, Sunil Agrawal, Balwinder Singh Sohi

The accurate classification of microbes is critical in today’s context for monitoring the ecological balance of a habitat. Hence, in this research work, a novel method to automate the process of identifying microorganisms has been implemented. To extract the bodies of microorganisms accurately, a generalized segmentation mechanism is proposed which combines a convolution filter (Kirsch) with a variance-based pixel clustering algorithm (Otsu). With exhaustive corroboration, a set of twenty-five features was identified to map the characteristics and morphology of all kinds of microbes. Multiple techniques for feature selection were tested, and it was found that mutual information (MI)-based models gave the best performance. Exhaustive hyperparameter tuning of the multilayer perceptron (MLP), k-nearest neighbors (KNN), quadratic discriminant analysis (QDA), logistic regression (LR), and support vector machine (SVM) was done. It was found that the radial SVM required further improvisation to attain the maximum possible level of accuracy. A comparative analysis between SVM and the improvised SVM (ISVM) through a 10-fold cross-validation method ultimately showed that ISVM resulted in 2% higher performance in terms of accuracy (98.2%), precision (98.2%), recall (98.1%), and F1 score (98.1%).
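The segmentation step that combines a Kirsch convolution filter with Otsu's variance-based thresholding can be sketched roughly as below. This is a simplified illustration under stated assumptions (a grayscale image at a hypothetical path, and only four of the eight Kirsch compass rotations), not the authors' exact pipeline.

```python
# Sketch: Kirsch edge response followed by Otsu thresholding to extract
# microorganism bodies from a grayscale micrograph.
import numpy as np
from scipy.ndimage import convolve
from skimage import io, filters

image = io.imread("microbe_sample.png", as_gray=True)   # hypothetical path

# One Kirsch compass kernel; the full operator uses eight 45-degree rotations
# and keeps the maximum response. Only the four 90-degree rotations are shown
# here for brevity.
base = np.array([[5, 5, 5], [-3, 0, -3], [-3, -3, -3]], dtype=float)
responses = [convolve(image, np.rot90(base, k)) for k in range(4)]
edges = np.max(np.abs(responses), axis=0)

# Otsu's method picks a global threshold from the edge-response histogram.
mask = edges > filters.threshold_otsu(edges)
print("foreground fraction:", mask.mean())
```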

