An Effective Data Sampling Procedure for Imbalanced Data Learning on Health Insurance Fraud Detection

Shamitha S Kotekani, Ilango Velchamy

doi:10.20532/cit.2020.1005216

Open Access

Abstract

Fraud detection has received considerable attention from academic researchers and industry worldwide due to its increasing popularity. Insurance datasets are enormous, with skewed class distributions and high dimensionality. Skewed class distribution and data volume are significant problems when analyzing insurance datasets because they increase misclassification rates. Although sampling approaches such as random oversampling and SMOTE can help balance the data, they can also increase computational complexity and degrade the model's performance, so more sophisticated techniques are needed to balance the skewed classes efficiently. This research focuses on optimizing the learner for fraud detection by applying a Fused Resampling and Cleaning Ensemble (FusedRCE) for effective sampling in health insurance fraud detection. We hypothesized that meticulous oversampling followed by guided data cleaning would improve prediction performance and the learner's understanding of the minority fraudulent classes compared to other sampling techniques. The proposed model works in three steps. In the first step, PCA is applied to extract the necessary features and reduce the dimensionality of the data. In the second step, a hybrid combination of k-means clustering and SMOTE oversampling is used to resample the imbalanced data. Because oversampling introduces considerable noise, the third step uses the Tomek Link algorithm to thoroughly clean the balanced data and remove the noisy samples generated during oversampling. The Tomek Link algorithm clears the boundary between minority and majority class samples, making the data more precise and freer from noise. The resulting dataset is used to train four classification algorithms (Logistic Regression, Decision Tree Classifier, k-Nearest Neighbors, and Neural Networks) with repeated 5-fold cross-validation. Compared to the other classifiers, Neural Networks with FusedRCE had the highest average prediction rate, 98.9%. The results were also evaluated using F1 score, Precision, Recall, and AUC. The results show that the proposed method performed significantly better than any other fraud detection approach in health insurance, predicting more fraudulent data with greater accuracy and a 3x increase in training speed.
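The resampling and cleaning steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' FusedRCE implementation: it omits the PCA and k-means clustering stages, uses plain SMOTE-style interpolation, and assumes that both members of each Tomek link are removed (the abstract does not say which members are dropped). The function names and the toy data are hypothetical, and the code assumes at least two minority samples.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def tomek_links(points, labels):
    """Return index pairs (i, j) that are mutual nearest neighbours
    and carry different class labels."""
    def nearest(i):
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )

    nn = [nearest(i) for i in range(len(points))]
    return [
        (i, j) for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def resample_and_clean(majority, minority, seed=0):
    """Oversample the minority class to parity, then remove Tomek links."""
    n_new = len(majority) - len(minority)
    synthetic = smote_oversample(minority, n_new, seed=seed)
    points = list(majority) + list(minority) + synthetic
    labels = [0] * len(majority) + [1] * (len(minority) + len(synthetic))
    # Assumption: drop both members of every Tomek link
    drop = {idx for pair in tomek_links(points, labels) for idx in pair}
    kept = [i for i in range(len(points)) if i not in drop]
    return [points[i] for i in kept], [labels[i] for i in kept]


# Toy two-feature data (invented for illustration): 0 = legitimate, 1 = fraud
majority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.2, 0.4), (1.0, 1.0)]
minority = [(1.1, 1.0), (1.2, 1.2), (1.0, 1.3)]

X, y = resample_and_clean(majority, minority)
print(len(X), "samples kept;", sum(y), "labelled as fraud")
```

In practice the cleaned data would then be passed to a classifier and evaluated with repeated cross-validation, as the abstract describes. Removing only the majority-class member of each link is a common alternative to the choice made here.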

Journal: Journal of Computing and Information Technology
Publication Date: Oct 21, 2021
Citations: 3
License type: cc-by-nd
