Standard statistical analyses often exclude incomplete observations, which is particularly problematic when predicting rare outcomes such as HIV positivity. In the linkage to HIV care dataset, there were initially 553 complete HIV-positive cases, and imputation added a further 554. Four imputation methods (Amelia, Hmisc, mice, and missForest) were evaluated. Simulations were conducted across various scenarios on the complete data to guide imputation for the full dataset. A random forest model was used to predict HIV status, and the methods were assessed on imputation precision, overall prediction accuracy, and sensitivity. Although missForest produced imputed values closer to the observed ones, this did not translate into better predictive models. Hmisc and mice imputations led to higher prediction accuracy and sensitivity, with median accuracy increasing from 64% to 76% and median sensitivity rising from 0.4 to 0.75. Hmisc and Amelia were the fastest imputation methods. Additionally, oversampling the minority class combined with undersampling the majority class did not improve predictions of new HIV-positive cases when only the complete observations were used. However, increasing the minority class information through imputation enhanced sensitivity for predicting cases in this class.
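The abstract reports accuracy and sensitivity separately because, for a rare outcome, the two can diverge sharply. The following minimal Python sketch illustrates that distinction. The labels are invented purely for illustration and are not taken from the study, and the helper names (`confusion_counts`, `accuracy`, `sensitivity`) are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary outcome."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Share of all observations classified correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def sensitivity(y_true, y_pred):
    """Share of true positive cases that the model detects: TP / (TP + FN)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else float("nan")


# Made-up example: 2 positive cases among 10 observations (a rare outcome).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A model that always predicts "negative" misses every positive case.
always_negative = [0] * 10

# A model that finds one of the two positives at the cost of one false alarm.
finds_one = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]

for name, y_pred in [("always negative", always_negative), ("finds one", finds_one)]:
    print(name, accuracy(y_true, y_pred), sensitivity(y_true, y_pred))
```

Both hypothetical models classify the same share of observations correctly, yet only the second detects any positive case. This is why a gain in sensitivity for the minority class is worth reporting alongside overall accuracy.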