In real-world data, predictive models for clinical risks (such as adverse drug reactions, hospital readmission, and chronic disease onset) constantly struggle with data-quality problems: redundant and highly correlated features, extreme class imbalance, and, most importantly, a large number of missing values. In most existing work, each patient is represented as a fixed-length vector of values from some feature space, and missing values must be imputed, which introduces considerable noise into prediction when the data set is highly incomplete. The other challenges either remain unresolved or are only partially addressed during modeling, without a systematic approach. In this paper, we propose a novel framework to address these low-quality problems. We first treat each patient as a bag containing a variable number of feature-value pairs, called instances, and map the instances to an embedding space through our proposed feature embedding method, so that the model learns from them directly. In this way, predictive models naturally avoid the negative impact of missing data. A novel multi-instance neural network is then attached, which uses two computational modules to deal with correlated and redundant features: multi-head attention and attention-based multi-instance pooling. These modules capture correlations among instances and locate the valuable information in each instance or bag. The feature embedding and the multi-instance neural network are parameterized and optimized jointly in an end-to-end manner. Moreover, training is carried out under both main and auxiliary supervision with focal loss functions to mitigate the effect of a highly imbalanced label set. The proposed framework is named AMI-Net3. We evaluate it on three real-world data sets, each with a different clinical risk prediction task: adverse drug reaction to risperidone, schizophrenia relapse, and invasive fungal infection.
Comprehensive experimental results show that the proposed method is effective and outperforms competitive baselines.
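To make the bag-of-instances idea concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a hypothetical embedding that scales a per-feature vector by the observed value, a simplified attention score (a query vector dotted with the tanh of each instance, without the learned projection a full model would use), and the standard binary focal loss form. The feature names, `make_bag`, `embed_instance`, `attention_pool`, and the default `gamma` and `alpha` values are all illustrative assumptions; the multi-head attention module is omitted.

```python
import math
import random


def make_bag(record):
    """Keep only observed (feature, value) pairs; missing values are skipped, not imputed."""
    return [(name, value) for name, value in record.items() if value is not None]


def embed_instance(name, value, feature_vectors):
    """Hypothetical embedding: scale the feature's vector by the observed value."""
    return [value * w for w in feature_vectors[name]]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(instances, query):
    """Simplified attention-based multi-instance pooling.

    Each instance gets a score query . tanh(instance); the bag representation
    is the softmax-weighted sum of the instances.
    """
    scores = [sum(q * math.tanh(x) for q, x in zip(query, inst)) for inst in instances]
    weights = softmax(scores)
    dim = len(instances[0])
    pooled = [sum(w * inst[d] for w, inst in zip(weights, instances)) for d in range(dim)]
    return pooled, weights


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Illustrative setup: random (untrained) parameters and made-up feature names.
random.seed(0)
dim = 4
features = ["age", "dose_mg", "alt", "bmi"]
feature_vectors = {f: [random.uniform(-1, 1) for _ in range(dim)] for f in features}
query = [random.uniform(-1, 1) for _ in range(dim)]

# A patient with two missing measurements becomes a bag of two instances.
patient = {"age": 0.4, "dose_mg": None, "alt": 0.9, "bmi": None}
bag = make_bag(patient)
instances = [embed_instance(n, v, feature_vectors) for n, v in bag]
pooled, weights = attention_pool(instances, query)
```

Because the bag holds only observed pairs, two patients with different numbers of recorded features still pool to vectors of the same dimension, which is what lets a downstream classifier skip imputation. In the focal loss, the factor `(1 - p_t)**gamma` shrinks the contribution of well-classified examples, so training on a heavily imbalanced label set is not dominated by the easy majority class.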