Abstract This paper addresses a real-world classification problem related to the management of claims received by an insurance company. Building the classifier is difficult because of the large number of missing values and the inherent class imbalance. Once the data have been partitioned, the training set undergoes an intensive two-stage grid search. The first stage identifies the most promising missing value imputation approach. The second stage starts from the training data imputed with the best method and runs a grid search over a family of undersampling and oversampling techniques with different settings, taking into account only seen data. The result is the training set used for classifier training. The main objective of the work is to find the combination of data mining techniques that best suits the data set, using a pipeline that chains two data preparation methods from different families: missing values are addressed first, and data rebalancing techniques are applied afterwards. The study focuses on classifiers based on Bayesian and lazy approaches as well as decision trees, evaluated on metrics such as the area under the ROC curve (AUC), Cohen’s kappa, accuracy and the F-measure, among others. In the scenario faced in this paper, where more than forty percent of the values are missing for many features, imputation by the mean or the mode is preferable to Expectation Maximization Imputation.
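The two-stage preparation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data and the choice of plain random over- and undersampling are assumptions made for illustration, and the grid searches and classifier training are omitted. The sketch only shows the ordering (imputation first, rebalancing second) and that every statistic is computed from the training partition alone.

```python
import random
from collections import Counter
from statistics import mean


def impute_mean_mode(train_rows, rows, numeric_cols):
    """Fill None with the training mean (numeric) or training mode (nominal)."""
    fill = []
    for j in range(len(train_rows[0])):
        seen = [r[j] for r in train_rows if r[j] is not None]
        if j in numeric_cols:
            fill.append(mean(seen))
        else:
            fill.append(Counter(seen).most_common(1)[0][0])
    return [[fill[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: column 0 is numeric, column 1 is nominal.
train_X = [[1.0, "a"], [None, "a"], [3.0, None], [5.0, "b"], [None, "a"], [2.0, "b"]]
train_y = [0, 0, 0, 0, 1, 1]
test_X = [[None, None]]
numeric_cols = {0}

# Stage 1: imputation, with fill values taken from the training rows only.
train_imp = impute_mean_mode(train_X, train_X, numeric_cols)
test_imp = impute_mean_mode(train_X, test_X, numeric_cols)

# Stage 2: rebalancing, applied to the imputed training set only.
over_X, over_y = random_oversample(train_imp, train_y)
under_X, under_y = random_undersample(train_imp, train_y)
```

In a full study each stage would be wrapped in a search over candidate methods and settings, scored on held-out folds of the training data, and the test partition would never be resampled.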