Background: The availability of multimodal patient data, such as demographics, clinical, imaging, treatment, quality of life, outcomes and wearables data, as well as genome sequencing, have paved the way for the development of multimodal clinical solutions that introduce personalized or precision medicine. The clinical report is an information layer that contains relevant information about the disease in addition to the patient's point of view. Natural language processing (NLP) is a branch of artificial intelligence (AI) and its pre-trained language models are the key technology for extracting value from this data layer. Aims: This project was conducted by GenoMed4all and Synthema EU consortia, with the aim to: 1) Build an AI language model specific for the hematology domain. 2) Use NLP technology to extract relevant information from clinical reports and perform unsupervised stratification of patients, in order to 3) demonstrate that the clinical report is earlier access to data relative to disease clinical phenotype and biology and provide important information for patient stratification and prediction of clinical outcomes. Methods: To translate text sentences into numerical embeddings, we implemented bidirectional encoder representations from transformers (BERT) framework. To learn text representations and correlations within data, we performed domain-adaptation by fine-tuned pre-trained model on hematological clinical reports of patients with myeloproliferative neoplasms (MPN), myelodysplastic syndrome (MDS) and acute myeloid leukemia (AML). Patient stratification was performed by HDBSCAN clustering on text embedding encoded by BERT (HematoBERT). Clusters validation was performed by assessing patients' diagnosis and survival probability. Finally, we compared domain-tuned HematoBERT vs pre-trained non-contextualized models. Results: We implemented HematoBERT based on the bert-base-multilingual-uncased version of BERT. Training data were hematological text reports of 1,328 patients. During fine-tuning, texts were tokenized, then we randomly replaced 15% of the tokens with masked tokens, training the model to predict them. We performed stratification using clinical reports from a validation cohort of 360 patients. We identified 7 clusters, defined according to similar words in meaning that were placed in a specific topic. We extracted the most important words and concepts for each cluster (topic) and we summarized them into effective descriptions for each group of patients. Two clusters included MDS patients with excess blasts, and without excess blasts with ring sideroblasts and del5q (n=69, n=115). One cluster included patients with excess blasts and MDS/MPN (n= 33). Two clusters included MPN patients with primary and secondary myelofibrosis, and MPN patients most including subjects affected with polycythemia vera and essential thrombocythemia (n=35, n=46). Two clusters included patients with AML from MDS and therapy-related AML, and patients with de novo AML (n=22, n=42). Clinical validation was performed based on the diagnosis and survival probability of patients assigned to clusters. Patients' diagnoses were compatible with the cluster assignment (Figure 1). Frequency of gene mutations (as assessed by targeted Next-Generation Sequencing) among different clusters reflected the well-known genotypic-phenotypic associations in MDS, MPN and AML. Kaplan-Maier curves indicated significative risk stratification in clusters in terms of survival probability (Figure 2), similar to stratifications performed on clinical and genomic data. Finally, we evaluate the domain adaptation by comparing the model to other pre-trained non-contextualized ones. Pseudo perplexity score (PPS), accuracy and F1 score were calculated to quantify how good the models are when they see new data, predicting the next word given the context of the sentence. HematoBERT obtained high PPS, accuracy and F1 scores, outperforming the other models also trained on generic clinical domains. Conclusion: Domain-adapted language models are able to understand contexts and correlations in documents. HematoBERT can be used to extract relevant features from clinical reports. This data layer is relevant to perform disease stratification of patients based on clinical and genomic information and could be integrated into next-generation multimodal models of personalized medicine.
Read full abstract