Abstract

A data-driven Fraud detection model for insurance business can be seen as a two-phase method. Phase I is data-preprocessing of a given dataset, in which, handling class imbalance is a major challenge. Phase II is that of classification using Machine Learning models. It is important to comprehend if there is any influence of the technique used in Phase I on the efficiency of the model used for Phase II. A natural query that intrigues one is whether there is a golden combination of a technique in Phase I and a specific model in Phase II for assured best performance of a Fraud Detection Model.In this work, we study a few techniques for handling data imbalance issue namely, SMOTE, MWMOTE, ADASYN and TGAN in combination with various classifier models like Random Forest (RF), Decision Trees (DT), Support Vector Machines (SVM), LightGBM, XGBoost and Gradient Boosting Machines (GBM). The study is conducted on a dataset for motor vehicle insurance fraud detection.We present a comparison of various combinations of data imbalance technique and classifier models. It is observed that the combination of TGAN in Phase I and GBM in Phase II gives the best performance. This combination performs best in terms of important metrics such as false positive rate, precision and specificity. We obtained the lowest false positive rate of 0.0011 and precision of 0.9988 which minimizes the most critical risk for the insurance company of falsely classifying a non-fraud claim as a fraud. Finally, the specificity of 0.9989 indicates that the model was also very good at predicting the non-fraudulent claim.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call