An Effective Data Sampling Procedure for Imbalanced Data Learning on Health Insurance Fraud Detection

Fraud detection has received considerable attention from academic researchers and industry worldwide owing to its growing importance. Insurance datasets are enormous, with skewed distributions and high dimensionality. Skewed class distribution and data volume are considered significant problems when analyzing insurance datasets, because they increase misclassification rates. Although sampling approaches such as random oversampling and SMOTE can help balance the data, they can also increase computational complexity and degrade model performance. More sophisticated techniques are therefore needed to balance the skewed classes efficiently. This research focuses on optimizing the learner for fraud detection by applying a Fused Resampling and Cleaning Ensemble (FusedRCE) for effective sampling in health insurance fraud detection. We hypothesized that meticulous oversampling followed by guided data cleaning would improve prediction performance and the learner's understanding of the minority fraudulent classes compared to other sampling techniques. The proposed model works in three steps. In the first step, PCA is applied to extract the necessary features and reduce the dimensionality of the data. In the second step, a hybrid combination of k-means clustering and SMOTE oversampling is used to resample the imbalanced data. Because oversampling introduces considerable noise, the third step applies the Tomek Link algorithm to the balanced data to remove the noisy samples generated during oversampling. The Tomek Link algorithm clears the boundary between minority and majority class samples, making the data cleaner and less noisy. The resulting dataset is used to train four classification algorithms, Logistic Regression, Decision Tree Classifier, k-Nearest Neighbors, and Neural Networks, evaluated with repeated 5-fold cross-validation.
Compared to the other classifiers, Neural Networks with FusedRCE had the highest average prediction rate, 98.9%. The results were also measured using F1 score, precision, recall, and AUC. The results show that the proposed method performed significantly better than other fraud detection approaches in health insurance, predicting more fraudulent data with greater accuracy and a 3x increase in training speed.
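
The cleaning step in the abstract above relies on Tomek links: pairs of samples from different classes that are each other's nearest neighbour. The following is a minimal sketch of that idea using only the Python standard library, not the authors' FusedRCE implementation. The function names, the Euclidean distance, and the choice to drop the majority-class member of each link are assumptions made for illustration; the abstract does not say which member of a pair is removed.

```python
import math


def nearest_neighbor(i, points):
    """Return the index of the point closest to points[i], excluding i itself."""
    best, best_dist = None, math.inf
    for j, other in enumerate(points):
        if j == i:
            continue
        dist = math.dist(points[i], other)  # Euclidean distance (assumption)
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def remove_tomek_links(points, labels, majority_label):
    """Drop the majority-class member of every Tomek link (assumed convention)."""
    drop = set()
    for i, j in tomek_links(points, labels):
        drop.add(i if labels[i] == majority_label else j)
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Toy data: label 0 is the majority class, label 1 the minority (fraud) class.
points = [(0, 0), (0, 1), (1, 0), (3.0, 3.0), (3.2, 3.0), (6, 6), (6, 7)]
labels = [0, 0, 0, 0, 1, 1, 1]

clean_points, clean_labels = remove_tomek_links(points, labels, majority_label=0)
```

In the toy data, the majority sample at (3.0, 3.0) sits right next to a minority sample, so the two form a link and the majority sample is dropped. This brute-force search is quadratic in the number of samples and is meant only to show the definition.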

Application of Clustering Methods to Health Insurance Fraud Detection

2006 International Conference on Service Systems and Service Management ◽

10.1109/icsssm.2006.320598 ◽

2006 ◽

Cited By ~ 11

Author(s):

Yi Peng ◽

Gang Kou ◽

Alan Sabatka ◽

Zhengxin Chen ◽

Deepak Khazanchi ◽

...

Keyword(s):

Health Insurance ◽

Fraud Detection ◽

Insurance Fraud ◽

Clustering Methods

Synthetic Minority Oversampling and Smote Regularised Deep Autoencoders Neural Network Techniques for Fraud Prediction in Financial Payment Services

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.l3419.1081219 ◽

2019 ◽

Vol 8 (12) ◽

pp. 3908-3915

Keyword(s):

Neural Network ◽

Machine Learning ◽

Financial Institutions ◽

Fraud Detection ◽

Machine Learning Algorithms ◽

Decision Tree Classifier ◽

Class Imbalance Problem ◽

Good Recall ◽

Tree Classifier ◽

Payment Services

Fraud in financial payment services is the most prevalent form of cybercrime. The rapid growth of e-commerce and mobile payments in recent years is behind the rising incidence of fraud in financial payment services. According to McKinsey, "fraud losses throughout the world could be close to $44 billion by 2025." Every year, fraudulent card transactions cause billions of US dollars in losses. To reduce these losses, it is essential to design effective fraud detection algorithms, which depend on sophisticated machine learning methods to help fraud investigators. Fraud detection systems have therefore gained great significance for banks and financial institutions. Although fraudulent transactions are very rare compared to genuine transactions, care must be taken to predict them so that financial institutions can maintain customer integrity. Because fraud is rare compared to normal operations, we face the class imbalance problem. We applied the Synthetic Minority Oversampling TEchnique (SMOTE) and an ensemble of sampling methods (Balanced Random Forest Classifier, Balanced Bagging Classifier, Easy Ensemble Classifier, RUS Boost) to ensemble machine learning algorithms. Performance was assessed using sensitivity, specificity, precision, and ROC area. The purpose of this article is to analyze different predictive models to see how precisely they detect whether a transaction is a standard payment or a fraud. Instead of misclassifying a real transaction as fraud, this model seeks to improve detection of fraud. We noted that ensemble learning using maximum voting detects fraud better than the other classifiers. The Decision Tree Classifier, Logistic Regression, and Balanced Bagging Classifier are combined, and the resulting proposed algorithm is the OptimizedEnsembleFD algorithm.
The sample size is then increased and deep learning is applied. The proposed system, the Smote Regularised Deep Autoencoders (SRD Autoencoders) neural network, is found to perform better, with good recall and accuracy, on this large dataset.
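
Maximum (majority) voting, as mentioned in the abstract above, can be sketched as follows, together with the sensitivity and specificity measures it names. This is an illustrative sketch under assumptions, not the OptimizedEnsembleFD algorithm itself: the function names are hypothetical, and the per-model predictions and true labels are made-up toy values, not results from the paper.

```python
from collections import Counter


def max_vote(predictions_per_model):
    """Combine per-model label lists by taking the most common label per sample."""
    return [
        Counter(votes).most_common(1)[0][0]
        for votes in zip(*predictions_per_model)
    ]


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)


# Toy predictions (1 = fraud, 0 = genuine) from three hypothetical models.
tree_pred = [1, 0, 0, 1, 0]
logreg_pred = [1, 0, 1, 0, 0]
bagging_pred = [0, 0, 1, 1, 0]

ensemble_pred = max_vote([tree_pred, logreg_pred, bagging_pred])

y_true = [1, 0, 1, 0, 0]  # toy ground truth
sens, spec = sensitivity_specificity(y_true, ensemble_pred)
```

With three models and two classes, a tie cannot occur. In this toy case the ensemble catches both fraud cases that individual models miss, at the cost of one false alarm.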

A Hybrid Technique for Health Insurance Fraud Detection on Highly Imbalanced Dataset

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijrte.f1210.0886s19 ◽

2019 ◽

Vol 8 (6S) ◽

pp. 1091-1095

Keyword(s):

Health Insurance ◽

Random Forest ◽

Cross Validation ◽

Insurance Industry ◽

Imbalanced Data ◽

Fraud Detection ◽

Heterogeneous Data ◽

Classification Model ◽

Hybrid Technique ◽

Class Imbalance Problem

The health insurance industry produces a massive amount of heterogeneous data, and detecting fraud in these data is a challenging task. Highly imbalanced data poses a major challenge for insurance data analysis, and classifying imbalanced data is a critical issue for fraud detection methodologies. Fraud accounts for less than 10% of the whole data. In this study, we use highly imbalanced data and propose a hybrid method for addressing the class imbalance problem that combines SMOTE, cross-validation, and Random Forest. We applied various sampling techniques to Medicare data and then built a classification model. We observed that SMOTE with Random Forest and cross-validation produced excellent results. Our model should be capable of identifying all the relevant (fraud) instances, i.e., it should have a high recall value. SMOTE with Random Forest had an average recall of 86% and an overall accuracy of 90%, which can be considered good among the existing models.
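
The core of SMOTE, which the abstract above combines with Random Forest, is linear interpolation between a minority sample and one of its nearest minority neighbours. The following is a minimal standard-library sketch of that interpolation step only. It is not the authors' pipeline, and the function name, the parameter `k`, the fixed seed, and the toy points are assumptions for illustration.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between a random
    minority sample and one of its k nearest minority neighbours.
    Requires len(minority) > k."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # k nearest minority neighbours of the base point (Euclidean distance).
        others = [p for idx, p in enumerate(minority) if idx != base_idx]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Toy minority (fraud) samples in two dimensions.
fraud = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 1.8)]
new_points = smote(fraud, n_new=6)
```

Every synthetic point lies on a line segment between two existing minority samples, so the new points stay inside the region already occupied by the minority class. In practice the oversampling would be applied only to the training folds during cross-validation, so that synthetic points do not leak into the evaluation data.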

Health Insurance Fraud Detection

Advanced Information and Knowledge Processing - Optimization Based Data Mining: Theory and Applications ◽

10.1007/978-0-85729-504-0_14 ◽

2011 ◽

pp. 233-235 ◽

Cited By ~ 1

Author(s):

Yong Shi ◽

Yingjie Tian ◽

Gang Kou ◽

Yi Peng ◽

Jianping Li

Keyword(s):

Health Insurance ◽

Fraud Detection ◽

Insurance Fraud

A Purview of the Impact of Supervised Learning Methodologies on Health Insurance Fraud Detection

Advances in Intelligent Systems and Computing - Information Systems Design and Intelligent Applications ◽

10.1007/978-981-10-7512-4_98 ◽

2018 ◽

pp. 978-984 ◽

Cited By ~ 3

Author(s):

Ananthi Sheshasaayee ◽

Surya Susan Thomas

Keyword(s):

Health Insurance ◽

Supervised Learning ◽

Fraud Detection ◽

Insurance Fraud ◽

The Impact

Artificial neural networks and decision tree classifier performance on medium resolution ASTER data to detect gully networks in southern Italy

10.1117/12.660602 ◽

2006 ◽

Cited By ~ 1

Author(s):

A. Ghaffari ◽

G. Priestnall ◽

M. L. Clarke

Keyword(s):

Neural Networks ◽

Artificial Neural Networks ◽

Decision Tree ◽

Southern Italy ◽

Decision Tree Classifier ◽

Aster Data ◽

Medium Resolution ◽

Tree Classifier ◽

Classifier Performance ◽

Artificial Neural

Hybrid Deep Neural Network for Handling Data Imbalance in Precursor MicroRNA

Frontiers in Public Health ◽

10.3389/fpubh.2021.821410 ◽

2021 ◽

Vol 9 ◽

Author(s):

Elakkiya R. ◽

Deepak Kumar Jain ◽

Ketan Kotecha ◽

Sharnil Pandya ◽

Sai Siddhartha Reddy ◽

...

Keyword(s):

Neural Network ◽

Decision Tree ◽

Deep Neural Network ◽

Imbalanced Data ◽

Vital Role ◽

Biological Data ◽

Decision Tree Classifier ◽

Genome Data ◽

Tree Classifier

Over the last decade, the field of bioinformatics has grown rapidly, and robust bioinformatics tools will play a vital role in future progress. Scientists working in bioinformatics conduct a large amount of research to extract knowledge from the available biological data. Several bioinformatics problems have emerged as a result of the creation of massive amounts of imbalanced data. The classification of precursor microRNA (pre miRNA) from imbalanced RNA genome data is one such problem. Studies have shown that pre miRNAs could serve as oncogenes or tumor suppressors in various cancer types. This paper introduces a Hybrid Deep Neural Network framework (H-DNN) for the classification of pre miRNA in imbalanced data. The proposed H-DNN framework integrates Deep Artificial Neural Networks (Deep ANN) and Deep Decision Tree Classifiers. The Deep ANN in the proposed H-DNN helps to extract meaningful features, and the Deep Decision Tree Classifier helps to classify the pre miRNA accurately. H-DNN was evaluated on genomes of animals, plants, humans, and Arabidopsis with an imbalance ratio of up to 1:5000, and on virus genomes with a ratio of 1:400. Experimental results showed an accuracy of more than 99% in all cases, and the time complexity of the proposed H-DNN is also much lower than that of the other existing approaches.
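
For context when reading accuracy figures at imbalance ratios such as 1:5000 and 1:400, the short sketch below works out the accuracy of a trivial baseline that always predicts the majority class. The function name is hypothetical and the calculation is plain arithmetic on the stated ratios; it does not reproduce any experiment from the paper.

```python
def majority_baseline_accuracy(minority_count, majority_count):
    """Accuracy obtained by always predicting the majority class."""
    return majority_count / (minority_count + majority_count)


# Imbalance ratios quoted in the abstract (minority : majority).
acc_1_to_5000 = majority_baseline_accuracy(1, 5000)  # 5000 / 5001
acc_1_to_400 = majority_baseline_accuracy(1, 400)    # 400 / 401
```

Both baselines already exceed 99% accuracy (about 99.98% and 99.75% respectively), which is why recall, precision, or F1 on the minority class are usually reported alongside accuracy for data this imbalanced.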
