Abstract

Acute pancreatitis (AP) is an inflammation of the pancreas that can be fatal or lead to further complications, depending on the severity of the attack. Early identification of high-risk AP patients can save lives by enabling timely care, intensive treatment, and better resource allocation. In this era of data and technology, instead of relying on manual scoring systems, researchers are employing advanced machine learning and data mining models to detect patients at high risk of mortality early. Existing work on AP mortality prediction is scarce, and the few available studies have shortcomings that make them impractical for clinical deployment. In this work, we address these issues. One main obstacle is the lack of high-quality public datasets for AP, which are crucial for effectively training ML models: the available datasets are small, contain many missing values, and suffer from severe class imbalance. We combined and augmented three public datasets, MIMIC-III, MIMIC-IV, and eICU, to obtain a larger dataset, and our experiments show that the augmented data trains classifiers better than the original small datasets. Moreover, we employed emerging techniques to handle the underlying issues in the data. The results show that the iterative imputer is the best method for filling missing values in AP data, outperforming not only basic techniques but also KNN-based imputation. We first addressed class imbalance by downsampling, which appeared to give decent results on small test sets; however, extensive experiments on large test sets showed that downsampling produces misleading and poor results for AP data. Next, we applied various upsampling techniques under two class splits: a 50:50 and a 70:30 majority–minority split. Four tabular generative adversarial networks, CTGAN, TGAN, CopulaGAN, and CTAB, and a variational autoencoder, TVAE, were deployed for synthetic data generation. SMOTE was also utilized for upsampling. The computational results show that the Random Forest (RF) classifier outperformed all other classifiers on the 50:50 split generated by CTGAN, achieving an Fβ of 0.702 and a recall of 0.833. RF results on the TVAE-generated data were comparable, with an Fβ of 0.698. With SMOTE-based upsampling, a deep neural network (DNN) performed best, with an Fβ score of 0.671.
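The abstract's finding that iterative imputation beats KNN-based imputation can be illustrated with scikit-learn's implementations. The sketch below uses a small synthetic matrix with one correlated column as a stand-in for the AP clinical features (the actual MIMIC/eICU feature set is not listed in the abstract, so the data here is purely illustrative):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

# Hypothetical toy matrix standing in for AP clinical features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] = 2 * X[:, 0] + 0.1 * rng.normal(size=200)  # correlated feature pair

# Knock out ~20% of the entries at random to mimic missing values.
X_missing = X.copy()
mask = rng.random(X.shape) < 0.2
X_missing[mask] = np.nan

# Iterative (MICE-style) imputation: models each feature as a
# regression on the remaining features, cycling until convergence.
X_iter = IterativeImputer(max_iter=10, random_state=0).fit_transform(X_missing)

# KNN imputation for comparison: fills each gap from the k most
# similar complete rows.
X_knn = KNNImputer(n_neighbors=5).fit_transform(X_missing)

# Compare reconstruction error on the entries that were masked out.
def rmse(A):
    return float(np.sqrt(np.mean((A[mask] - X[mask]) ** 2)))

print(f"iterative RMSE: {rmse(X_iter):.3f}  KNN RMSE: {rmse(X_knn):.3f}")
```

Because the iterative imputer exploits cross-feature correlations (such as the engineered dependence between columns 0 and 1 above), it typically reconstructs masked values more accurately than neighbor averaging, which is consistent with the result reported for AP data.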
