Large language models (LLMs) have the potential to revolutionize the healthcare industry. They could reduce the burden on healthcare systems, increase care accessibility in underserved areas, and provide multilingual support to break down language barriers. However, these models, although equipped with vast amounts of medical knowledge and the ability to understand and generate human-like text, require properly curated input data (i.e., prompt engineering) to produce accurate diagnoses and reliable, personalized treatment plans for patients. Multiple Myeloma (MM) is a complex hematological malignancy characterized by the uncontrolled proliferation of plasma cells in the bone marrow. Disease management for MM is particularly challenging because of its multisystemic nature, driven by the varying volume of malignant cells within the bone marrow. Implementing LLMs for the clinical assessment of patients with MM requires feature selection to develop the most effective prompts for these models. Here, we used a machine learning (ML) approach to identify the salient features from a typical visit day that correlate most strongly with disease volume on the same day. These features are the best candidates to reflect the multisystemic and dynamic nature of MM at each visit, and they could be incorporated into LLM prompts to develop a system-based assessment for clinic visits. Methods: This study examined 1,472 clinical observations. To select a curated list of features associated with same-day M-spike values, 43 clinical and laboratory variables were input into an ML model. Random Forest (RF), an ensemble of regression trees suited to nonlinear multiple regression, was selected as the model. The data were randomly divided into a training set (80%) and a test set (20%) for model validation. Using bootstrapping to generate 500 data sets, a random forest of regression trees was constructed, and results and estimates were aggregated across the trees.
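A minimal sketch of the modeling setup described above, using synthetic data with the same dimensions (1,472 observations, 43 predictors). The data, feature effects, and noise level are illustrative assumptions, not the study's actual dataset; scikit-learn's `RandomForestRegressor` stands in for the RF implementation.

```python
# Illustrative sketch of the described RF workflow on SYNTHETIC data;
# the true dataset, variables, and effect sizes are not reproduced here.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_obs, n_features = 1472, 43  # dimensions matching the study description

X = rng.normal(size=(n_obs, n_features))
# Simulate an M-spike-like target driven by a few predictors plus noise
# (hypothetical coefficients chosen only for demonstration)
y = 1.5 * X[:, 0] + 0.8 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(scale=0.2, size=n_obs)

# 80% training / 20% test split, as in the study design
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each of the 500 trees is fit on a bootstrap resample of the training
# data, and predictions are aggregated (averaged) across trees.
rf = RandomForestRegressor(n_estimators=500, bootstrap=True, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))  # R^2 on the held-out 20%
```

The bootstrap aggregation is built into the forest itself: `bootstrap=True` makes each tree's training set a resample of the 80% split, so no manual resampling loop is needed.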
To determine the importance of each covariate, models with and without that covariate were compared. Results: The residual distribution of the RF model indicated that nearly all M-spike values predicted from the 43 variables were distributed evenly on either side of zero (Fig. 1). The weighted contribution of each of the 43 independent variables was determined by individually removing a variable from the ML algorithm and measuring the effect on the mean squared error (MSE) (Fig. 2). Removal of the first-lagged M-spike, serum total protein, second-lagged M-spike, serum IgG, serum IgM, and serum IgA had the greatest effects on the ML algorithm. M-spike values predicted by the ML algorithm correlated highly with laboratory-measured SPEP values, as indicated by Pearson and Spearman correlation coefficients close to +1: using all 43 variables, the Pearson coefficient was 0.96 and the Spearman coefficient was 0.91. Feature-selected modeling was then performed to reduce the number of variables needed to predict the M-spike. Five RF models with different predictor sets were compared: Model A included all 43 predictors, Model B the ten most important variables, Model C the top five variables, Model D the first- and second-lagged M-spike and serum total protein, and Model E the first-lagged M-spike and serum total protein. Pearson's r and root mean square error (RMSE) were used to compare the models. Pearson's r values for Models A, B, C, and D were 0.96, 0.96, 0.96, and 0.95, respectively, and the RMSE values were 0.21, 0.19, 0.19, and 0.22. Model E, using only two variables after feature selection, still accurately predicted the M-spike value (Pearson's r = 0.95; RMSE 0.22). The Pearson's r values for the feature-selected models A, B, C, D, and E were 0.95, 0.96, 0.96, 0.95, and 0.91.
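The importance-by-removal procedure and the full-versus-reduced model comparison can be sketched as follows. This is a small synthetic example (8 hypothetical features rather than 43); it illustrates the leave-one-covariate-out logic, not the study's actual variables or results.

```python
# Hedged sketch of importance-by-removal: drop each feature in turn,
# refit the forest, and record the rise in test-set MSE. Synthetic data.
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 8))
# Hypothetical target: feature 0 matters most, feature 1 weakly, rest are noise
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

def fit_mse(cols):
    """Fit an RF on the given feature columns; return test MSE and predictions."""
    rf = RandomForestRegressor(n_estimators=200, random_state=1)
    rf.fit(X_tr[:, cols], y_tr)
    pred = rf.predict(X_te[:, cols])
    return mean_squared_error(y_te, pred), pred

base_mse, _ = fit_mse(list(range(8)))

# Importance of feature j = increase in MSE when j is excluded and the model refit
importance = {}
for j in range(8):
    kept = [c for c in range(8) if c != j]
    mse_j, _ = fit_mse(kept)
    importance[j] = mse_j - base_mse

# Reduced ("feature-selected") model from the top-ranked features, compared
# with the full model by Pearson's r and RMSE, mirroring the Model A..E idea
top = sorted(importance, key=importance.get, reverse=True)[:2]
red_mse, red_pred = fit_mse(top)
print(top, pearsonr(y_te, red_pred)[0], np.sqrt(red_mse))
```

Note that retraining after removal (as described in the abstract) measures a variable's marginal contribution given the remaining covariates; correlated predictors can therefore share importance, which is one reason a two-variable model can perform nearly as well as the full one.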
Conclusion: Accurate prompt engineering to create a global assessment of the myeloma clone by an LLM requires a curated set of variables that correlate with disease volume. Here, we developed an ML model for feature selection using same-day data available in the patient chart. These features could be used, in order of importance, to provide focused, comprehensive prompts aligned with the patient's context. In future studies, the quality of AI-assisted disease assessment using these models should be compared with assessments performed by real-world providers to ensure that LLMs generate written assessments that accurately reflect the patient's health status.