Abstract

Introduction: Diffuse large B-cell lymphoma (DLBCL) is the most common lymphoma worldwide and usually follows an aggressive clinical course. The Ann Arbor staging system and the International Prognostic Index (IPI), which are commonly used in clinical practice for risk stratification, have known limitations. Machine learning (ML) has emerged as a promising tool for more comprehensive and deeper data analysis. We sought to evaluate the ability of ML to predict survival in DLBCL, compared with the Ann Arbor staging system, using a large national database.

Methodology: We applied the ML algorithm XGBoost to the National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) database to predict overall survival (OS) and lymphoma-specific survival (LSS). For the prediction analysis, we transformed the survival labels into a simple Boolean format: “alive” was encoded as 0, and both “dead” and “dead (attributable to this cancer diagnosis)” were encoded as 1. We used one-hot encoding to convert categorical variables into binary vectors. The data set was divided into a training set (80%) and a test set (20%). We further split the training set into training and validation subsets using stratified 5-fold cross-validation. Hyperparameter optimization was performed on the validation subsets. The model drew on a broad range of attributes for its predictions. To understand how each attribute contributed to the predictions, we calculated its importance score in XGBoost.

Results: A total of 64,912 patients with DLBCL were identified and their data were extracted. The majority were Caucasian (78.9%), and the median age fell in the 60-to-69-year range. The model predicted OS and LSS with areas under the curve (AUC) of 0.89 and 0.75, respectively (Figure 1). Factors selected by the model for survival prediction included the presence or absence of B symptoms, treatment status, and disease stage. For OS and LSS, the model found B symptoms to be the highest contributing factor, with importance scores of 0.205 and 0.167, respectively. Other important factors included age and stage IV for OS, and stage IV and clinically asymptomatic status for LSS. The least important factors were the location of the primary lymphoma site and the year of diagnosis (Table 1).

Conclusion: Machine learning tools can help predict survival in patients with DLBCL and may challenge current staging systems. Our results warrant validation in future prospective studies.
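
The preprocessing steps described in the methodology (Boolean survival labels, one-hot encoding, an 80/20 split, and stratified 5-fold cross-validation) can be illustrated with a minimal Python sketch that uses only the standard library. This is not the authors' code: the toy records, the field names `stage` and `status`, the random seeds, and all function names are hypothetical, and the XGBoost fitting, hyperparameter search, and importance scoring are omitted.

```python
import random
from collections import defaultdict

# Label mapping as described in the abstract; the exact SEER field
# strings are an assumption made for this illustration.
LABEL_MAP = {
    "alive": 0,
    "dead": 1,
    "dead (attributable to this cancer diagnosis)": 1,
}

def encode_label(status):
    """Map a survival status string to 0 (alive) or 1 (dead)."""
    return LABEL_MAP[status.strip().lower()]

def one_hot(rows, column):
    """Return the sorted categories and one binary vector per row."""
    categories = sorted({row[column] for row in rows})
    vectors = [[1 if row[column] == c else 0 for c in categories]
               for row in rows]
    return categories, vectors

def train_test_split(n, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n-1 and hold out a fraction as the test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return sorted(idx[n_test:]), sorted(idx[:n_test])

def stratified_folds(indices, labels, k=5, seed=0):
    """Deal each class's indices round-robin into k folds so that every
    fold keeps roughly the same class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    for class_indices in by_class.values():
        rng.shuffle(class_indices)
        for pos, i in enumerate(class_indices):
            folds[pos % k].append(i)
    return folds

# Hypothetical toy records standing in for SEER rows.
stages = ["I", "II", "III", "IV"]
statuses = ["alive", "dead",
            "dead (attributable to this cancer diagnosis)",
            "alive", "alive"]
rows = [{"stage": stages[i % 4], "status": statuses[i % 5]}
        for i in range(50)]

labels = [encode_label(row["status"]) for row in rows]
categories, vectors = one_hot(rows, "stage")
train_idx, test_idx = train_test_split(len(rows))
folds = stratified_folds(train_idx, labels, k=5)

# Each fold serves once as the validation set; the other four train.
for fold_number, validation in enumerate(folds):
    training = [i for f in folds if f is not validation for i in f]
    print(fold_number, len(training), len(validation))
```

In a real pipeline, each training/validation pair would be passed to the model during hyperparameter search, and the held-out 20% would be scored only once at the end.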
