Novel scoring systems have been developed in recent years to improve on the prognostic accuracy of the historical clinical staging systems (Rai, Binet) for chronic lymphocytic leukemia (CLL). Most of them, however, rely on discretized or dichotomized biomarker values to infer prognosis. Here we analyzed the immunophenotypic and (immuno)genetic profiles of a large CLL cohort by applying unsupervised machine learning methods that treat prognostic factors as continuous variables, with the aim of identifying novel relationships and interactions likely missed by conventional models. The study included 989 CLL patients with Rai stage 0, I, or II (50%, 36%, and 14%, respectively), of whom 420 (42.4%) were treated, analyzed between 2003 and 2020. Treatment-free survival (TFS) was calculated from the date of sampling (median TFS, 46 months). The median time from diagnosis to sampling was 7.2 months (~60% of cases within 12 months). The laboratory-based markers studied were: expression of CD20, FMC7, CD49d, CD49c, CD38, CD23, CD43, CD22, and ZAP-70 by flow cytometry (reported as % of positive cells); the cytogenetic abnormalities del13, tris12, del11, and del17 detected by FISH (reported as % of nuclei with an abnormal signal); TP53 mutational status by deep NGS (reported as % variant allele fraction, VAF); and IGHV gene mutational status (reported as % mutations). By fitting a Cox proportional hazards model for TFS to this pool of features, we selected those with a p-value <0.05, i.e. del11, tris12, TP53 mutations, % IGHV mutations, and expression of CD38 and CD49d. We then grouped similar profiles with an unsupervised k-means algorithm, with the number of clusters chosen by the elbow method, partitioning the observations into 6 clusters (C1-C6). Clustering confidence for each patient was estimated through a leave-one-out procedure; the average normalized score was 0.83 (0.85, 0.86, 0.79, 0.91, 0.73, and 0.84 for C1 through C6, respectively). Centroid analysis was used to evaluate which features most strongly defined each cluster, as detailed below.
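The k-means and elbow steps described above can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not state: the continuous markers are z-score standardized before clustering, the centroids are seeded with a deterministic farthest-point rule, and the input is a small synthetic table with two hypothetical columns (% IGHV mutations and % CD49d-positive cells) rather than study data. The within-cluster sum of squares (inertia) is computed for several values of k so that the point where the curve flattens (the "elbow") can be inspected.

```python
def zscore_columns(rows):
    """Standardize each column to mean 0 and unit (population) variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [((sum((v - m) ** 2 for v in c) / n) ** 0.5) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(rows, k):
    """Deterministic seeding (an assumption of this sketch): start from the
    first row, then repeatedly add the row farthest from its nearest seed."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        nxt = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(rows, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids, inertia)."""
    centroids = farthest_point_init(rows, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j]))
                      for r in rows]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    inertia = sum(sq_dist(r, centroids[lab]) for r, lab in zip(rows, labels))
    return labels, centroids, inertia


# Synthetic rows: [% IGHV mutations, % CD49d-positive cells] (not study data).
toy = [
    [0.5, 5], [1.0, 10], [1.5, 8],     # low IGHV mutation, low CD49d
    [0.8, 90], [1.2, 95], [2.0, 85],   # low IGHV mutation, high CD49d
    [8.0, 6], [10.0, 12], [12.0, 9],   # high IGHV mutation, low CD49d
]

scaled = zscore_columns(toy)
inertias = {k: kmeans(scaled, k)[2] for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 3))  # look for the k after which the drop levels off
```

On this toy table the inertia falls sharply up to k=3 and only marginally afterwards, which is the pattern the elbow method looks for. In practice a library implementation with multiple random restarts would normally replace the hand-written loop.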
C1 (n=220): all cases with <4.1% IGHV mutations, mostly (70%) IGHV-unmutated (UM, <2% cut-off), low CD49d expression (<30% of positive cells) in 90% of cases, low representation of the other features (tris12, del11, CD38, TP53 mutations); C2 (n=210): high CD49d expression (99% of cases), balanced IGHV status (52% UM), low representation of the other features; C3 (n=147): tris12 cases (100%; >10% of nuclei), with concurrent high expression of CD49d (95%) and CD38 (67%), slightly enriched in UM IGHV cases (60%); C4 (n=303): cases heavily mutated in IGHV genes (mutation range 4.1-22.0%), low representation of all the other features; C5 (n=52): TP53-mutated cases with a high mutation burden (VAF 35-97%), skewed towards UM IGHV (70%), with none of the other features playing a relevant role; C6 (n=57): highly clonal del11q cases (range 50-98%), mutually exclusive with TP53 mutations (2 mutated cases only), mostly UM IGHV (89%), low representation of all the other features. Notably, cases bearing TP53 mutations were also present in C1-C4, although they represented a minority (5-10%) and mostly had a low VAF (median 6.7%, range 5-39%). Kaplan-Meier analysis revealed heterogeneous behaviors: C2, C3, C5, and C6 reached 50% TFS at 59, 25, 5, and 9 months, respectively, whereas 50% TFS was not reached for C1 and C4 (Figure, A). Hierarchical agglomerative clustering identified 3 major risk classes (Figure, B). The high-risk class (n=109) comprised C5 and C6 (TP53 mutations and del11q); the intermediate-risk class (n=577) comprised C1, C2, and C3; the low-risk class (n=303) consisted of C4 only (CLL with highly mutated IGHV). The 50% TFS for the high-, intermediate-, and low-risk classes was 7 months, 50 months, and not reached, respectively. In conclusion, we present here a novel machine-learning-driven, laboratory-based classification for predicting the risk of early treatment in CLL. Our approach identifies clusters at different risk with some novelties: i) a high IGHV mutational burden (i.e. >4%) in the absence of other markers (e.g.
CD49d) identifies patients with a particularly benign clinical course; ii) TP53 mutations and del11q associate with a high risk of early treatment only if present in the vast majority of the CLL clone; iii) an IGHV status with a low burden of mutations (i.e. <4%) along with CD49d expression or tris12 identifies patients at intermediate risk. These novel stratifications should be incorporated in risk algorithms for treatment prediction of CLL patients. Validation in additional independent cohorts is needed.

Figure 1
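The "50% TFS" values quoted above are read off Kaplan-Meier curves as the first time at which the estimated treatment-free fraction falls to 50% or below. A minimal sketch of that calculation follows. It assumes each patient contributes a follow-up time in months and a flag that is 1 if treatment started and 0 if the patient was censored; the function names and the example values are hypothetical and are not study data.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the treatment-free fraction.

    times  : follow-up in months from sampling
    events : 1 if treatment started at that time, 0 if censored
    Returns a list of (time, survival) pairs, one per distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            surv *= 1 - n_events / at_risk
            curve.append((t, surv))
        at_risk -= n_leaving  # events and censored patients both leave
    return curve


def time_to_half(curve):
    """First time the estimate is <= 0.5, or None if it is never reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical follow-up data for two illustrative groups.
group_a = kaplan_meier([5, 9, 12, 20, 30, 40], [1, 1, 0, 1, 0, 0])
group_b = kaplan_meier([10, 20, 30], [1, 0, 0])
print(time_to_half(group_a), time_to_half(group_b))
```

A return value of `None` corresponds to the "not reached" entries reported for C1, C4, and the low-risk class: the estimated curve never drops to 50% within the available follow-up.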