Adverse drug reactions (ADRs) are an ongoing public health concern. Because traditional methods for discovering ADRs are costly and limited, it is prudent to predict ADRs from existing data using non-invasive methods such as machine learning. Although various studies have addressed ADR prediction using non-clinical data, a process that leverages both demographic and non-clinical data is missing. In addition, the importance of individual features in ADR prediction has yet to be fully explored. This study aims to develop an ADR prediction model based on demographic and non-clinical data and to identify the highest contributing factors. We focus on 30 common and severe ADRs reported to the Food and Drug Administration (FDA) between 2012 and 2023. We developed random forest (RF) and deep learning (DL) models that ingest demographic data (e.g., age and gender of patients) and non-clinical data, which include chemical, molecular, and biological drug characteristics. We unified both demographic and non-clinical data sources into a single dataset for ADR prediction. Model performance was assessed via the area under the receiver operating characteristic curve (AUC) and the mean average precision (MAP). We demonstrated that our parsimonious models, which include only the 20 most important features (5 demographic and 15 non-clinical, of which 13 are molecular and 2 biological), achieve ADR prediction performance comparable to that of a less practical, feature-rich model using all 2,315 features. Specifically, our models achieved AUCs of 0.611 and 0.674 for the RF and DL algorithms, respectively. We hope our research provides researchers and clinicians with valuable insights and facilitates future research designs by identifying top ADR predictors (including demographic information) and practical parsimonious models.