Background: Multiple myeloma (MM) evolves over years. Pre-MM states, such as monoclonal gammopathy of undetermined significance (MGUS) and smoldering MM, are asymptomatic and often missed. By the time active MM is diagnosed, it is often associated with organ damage. We hypothesized that applying a machine learning (ML) approach to electronic medical records (EMR) could help develop a predictive model that identifies the population at highest risk for MM. Methods: An observational retrospective study was performed using data extracted from the Clalit Health Services (CHS) EMR. CHS insures 4.7 × 10⁶ individuals (53% of the Israeli population). The study included CHS members diagnosed with MM between 2002 and 2019 and their controls. First, we compared numerous clinical and laboratory parameters of MM patients in the pre-MM period (5 years to 2 months prior to diagnosis) with those of controls. Then, an ML approach was used to develop a risk prediction model using a gradient boosting algorithm. The unit of analysis was a patient who underwent a blood test in a given month. The training set included units from "future MM" patients and from matched controls. Model performance was evaluated on a separate test set comprising blood tests performed in 2014 by patients who were not included in the training set and had not been diagnosed with MM at the time of their blood tests. Lastly, a simplified model was constructed by excluding MM-specific variables and applying logistic regression. Results: We identified 4982 MM patients, of whom 4256 had the relevant lab tests and were therefore eligible for comparison. In the pre-MM period, "future MM" patients had higher ESR; lower Hb, neutrophil count (ANC), and neutrophil/lymphocyte ratio; and higher levels of serum globulins, urinary protein, serum IgG, and ferritin than controls. They were also more likely than controls to have immune deficiencies, myelodysplastic syndromes, and familial Mediterranean fever. 
Use of certain medications (tranquilizers, antidiabetics, calcium antagonists, statins) was associated with a reduced risk of MM. The gradient boosting predictive model was developed using 19,129 learning units from MM cases and 382,580 controls. The test set included 268,058 blood tests, of which 368 (0.14%) belonged to patients who were diagnosed with MM within 5 years. The model performed well, with an area under the curve (AUC) of 0.836. Value ranges of the MM predictors stratified the risk of developing the disease in the future (Figure 1). The simplified logistic regression model included 9 parameters: age, sex, uric acid, LDH, RBC, lymphocyte %, ANC, HDL, and non-HDL cholesterol. It had an AUC of 0.794. An example of its use is provided in Table 1. Conclusions: Using a large database and an ML approach, we were able to develop a predictive model of MM risk. Using only a few widely available parameters, the model can provide a predicted MM risk for any individual who undergoes simple blood tests. The model can be used as a first-line screening tool, pointing clinicians to the individuals most at risk for MM and allowing them to focus further workup accordingly.
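To make the reported quantities concrete, the following is a minimal Python sketch of how a logistic regression model turns a weighted sum of parameters into a predicted risk, and how an AUC summarizes ranking performance. It is illustrative only and is not the authors' implementation: the function names `logistic_risk` and `auc`, the coefficient values, and the toy scores are all hypothetical placeholders, and the AUC is assumed here to be the usual rank-based (receiver operating characteristic) definition. Only the test-set counts (368 of 268,058) come from the abstract.

```python
import math


def logistic_risk(features, coefficients, intercept):
    """Predicted probability from a logistic regression model.

    features and coefficients are dicts keyed by parameter name.
    """
    z = intercept + sum(coefficients[name] * value
                        for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive
    case scores higher than a randomly chosen negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Share of test-set blood tests followed by an MM diagnosis within 5 years
# (counts taken from the abstract).
prevalence_pct = round(100 * 368 / 268058, 2)

# Hypothetical placeholder coefficients, NOT estimates from the study.
toy_coefficients = {"age": 0.03, "ldh": 0.002}
toy_intercept = -8.0
toy_patient = {"age": 70, "ldh": 400}
toy_risk = logistic_risk(toy_patient, toy_coefficients, toy_intercept)

# Toy scores and outcomes (1 = later diagnosed, 0 = not), for illustration.
toy_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])

print(f"prevalence: {prevalence_pct}%")
print(f"toy predicted risk: {toy_risk:.4f}")
print(f"toy AUC: {toy_auc}")
```

With an outcome this rare (about 0.14% of test-set blood tests), a rank-based measure such as the AUC is more informative than raw accuracy, since labeling every test as negative would already be correct more than 99% of the time.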