Abstract

Head and neck cancers, a group of malignancies affecting the oral cavity, pharynx, and larynx, carry high morbidity and severely affect quality of life, which underscores the need for effective, non-invasive diagnostic tools. Raman spectroscopy-based liquid biopsy detects cancers by analyzing intrinsic Raman fingerprints in biofluids without the need for molecular labels, making it a promising point-of-care screening method. However, Raman scattering is weak, and spectra from biological samples are difficult to interpret because overlapping peaks, fluorescence background, cosmic rays, and instrument noise obscure the signal. Extensive preprocessing and advanced analytical techniques are usually needed to isolate the true Raman signal. While machine learning models have shown promise in classifying cancer status from Raman spectra, variations between Raman systems, a lack of standardized protocols, and small patient cohorts limit their clinical utility. Our goal is to develop a model for Raman spectroscopy-based liquid biopsy that generalizes across different measurement conditions and clinical indications. A key challenge is the high cost of acquiring extensive labeled training data. To overcome this, we use a Siamese-based, self-supervised model trained on unlabeled spectral data. The model generates synthetic spectra that mimic real variations across Raman systems and measurement protocols by preprocessing the input and adding simulated factors such as fluorescence background, noise, and cosmic rays. By maximizing the cosine similarity between the embeddings of the generated spectra and those of their originals, the model learns meaningful features of the Raman spectra. This approach not only expands the training set but also enhances the model’s stability across different instruments.
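The augmentation and similarity objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `augment_spectrum` and `negative_cosine_loss`, the quadratic form of the fluorescence background, the noise level, the spike amplitudes, and the random linear map standing in for the encoder are all assumptions chosen only for illustration.

```python
import numpy as np


def augment_spectrum(spectrum, rng, noise_sd=0.02, max_spikes=2):
    """Return a synthetic variant of one preprocessed spectrum (1-D array).

    All distortion parameters are illustrative assumptions, not values
    taken from the abstract.
    """
    spectrum = np.asarray(spectrum, dtype=float)
    n = spectrum.size
    x = np.linspace(-1.0, 1.0, n)

    # Simulated fluorescence background: a smooth, random quadratic curve
    # shifted so that it is non-negative (assumed functional form).
    c0, c1, c2 = rng.uniform(-0.5, 0.5, size=3)
    background = c0 + c1 * x + c2 * x**2
    background -= background.min()

    # Simulated instrument noise: additive Gaussian noise.
    noise = rng.normal(0.0, noise_sd, size=n)

    synthetic = spectrum + background + noise

    # Simulated cosmic rays: a few single-channel spikes at random positions.
    n_spikes = rng.integers(0, max_spikes + 1)
    positions = rng.integers(0, n, size=n_spikes)
    synthetic[positions] += rng.uniform(1.0, 3.0, size=n_spikes)
    return synthetic


def negative_cosine_loss(z_a, z_b):
    """Loss that is minimized when two embeddings point the same way,
    so minimizing it maximizes their cosine similarity."""
    cosine = np.dot(z_a, z_b) / (np.linalg.norm(z_a) * np.linalg.norm(z_b))
    return -float(cosine)


rng = np.random.default_rng(0)

# Toy "original" spectrum: one Gaussian peak on a hypothetical wavenumber axis.
wavenumbers = np.linspace(400.0, 1800.0, 500)
original = np.exp(-0.5 * ((wavenumbers - 1003.0) / 8.0) ** 2)
synthetic = augment_spectrum(original, rng)

# Placeholder encoder: an untrained random linear map to 16 dimensions.
# A real Siamese encoder would be a trained neural network.
W = rng.normal(size=(16, original.size))
loss = negative_cosine_loss(W @ original, W @ synthetic)
print(f"negative cosine loss for one pair: {loss:.3f}")
```

In a training loop, this loss would be averaged over many original/synthetic pairs and minimized with respect to the encoder weights, so that the embedding becomes insensitive to the simulated background, noise, and spikes.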
In our preliminary study, we analyzed thousands of Raman spectra measured on liquid biopsy specimens (blood plasma and saliva) collected from a cohort of 71 head and neck cancer patients (spanning early- and late-stage disease) and 44 non-cancerous controls. Each specimen was drop-cast onto a quartz substrate, and 25 repeated Raman point scans were performed in a grid over the dried spot. Spectra were divided into training and test sets at an 80:20 ratio. We augmented the training set to 9,200 spectra, and t-SNE plots of the test-set embeddings (1,150 spectra) showed clear clustering by sample type, with distinct clusters for healthy individuals and cancer patients. Although cancer and control samples separated clearly, separation among cancer patients at different stages remains limited and will be the target of future work building on this study. These results suggest that our label-free approach can effectively distinguish between healthy and cancerous samples using unlabeled data, advancing the potential for widespread clinical application of Raman-based liquid biopsy powered by machine learning.

Citation Format: Yifei Gu, Kwan Lun Chiu, Andrés Muñoz-Jaramillo, Andrew C. Birkeland, Maria Navas-Moreno, Randy Carney. Towards a generalized machine learning model for Raman spectroscopy-based liquid biopsy in cancer screening using self-supervised learning [abstract]. In: Proceedings of the AACR Special Conference: Liquid Biopsy: From Discovery to Clinical Implementation; 2024 Nov 13-16; San Diego, CA. Philadelphia (PA): AACR; Clin Cancer Res 2024;30(21_Suppl):Abstract nr B056.