Abstract B065: FabricaTM: A large-scale data simulation platform isolates tumor signal from cell-free DNA and improves tissue of origin prediction accuracy

Kade Pettie, Shiva Farashahi, Jackson Killian, Dorna Kashef, Josh Hubbell, Feras Hantash, Jocelyn Charlton, Kieran Chacko

doi:10.1158/1557-3265.liqbiop24-b065

Abstract

Cell-free DNA (cfDNA) liquid biopsy is a promising non-invasive method for disease detection, but limited data availability and the complexity of cfDNA composition pose barriers to model development. Cancer prediction models trained on cfDNA data from clinically collected blood samples often struggle to isolate tumor-derived signals from confounding factors, and tissue of origin (TOO) models suffer from small sample sizes for rarer tumor types. To address these challenges, we developed FabricaTM, a platform to generate diverse, large-scale, high-fidelity simulated cfDNA datasets for robust and accurate machine learning (ML) model training. A key advantage of FabricaTM is its ability to enhance ML models by matching confounding variables between non-cancer and cancer samples. We first synthesized age-balanced non-cancer samples, then simulated tumor DNA shedding into circulation from 502 reference biopsy samples across 16 indications, optimizing for final tumor content distributions ranging from 0.001% to 14.2%. All reference non-cancer and biopsy samples consisted of aligned bisulfite sequencing reads from a custom target hybrid capture panel. This approach expanded a non-cancer dataset of 177 samples to 1,212 simulated samples and generated 6,400 simulated cancer cfDNA samples matching the non-cancers in age distribution. We then assessed this dataset for equivalence with clinical data and for its ability to improve cancer prediction. Non-linear dimensionality reduction and clustering of methylation signal showed that simulated samples clustered indistinguishably from 2,290 samples (1,149 non-cancer, 1,141 treatment-naïve cancer) from the CORE-HH clinical study (NCT05435066).
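The abstract does not describe how FabricaTM simulates tumor DNA shedding, so the following is only a minimal sketch of one common way to build such a mixture: randomly combining reads from a tumor biopsy pool and a non-cancer cfDNA pool at a chosen tumor fraction. The function names, the log-uniform draw, and the per-read binomial sampling are all assumptions made for illustration. The bounds follow the reported range, read as 0.001% to 14.2%.

```python
import math
import random


def sample_tumor_fraction(rng, low=1e-5, high=0.142):
    """Draw a tumor fraction log-uniformly between `low` and `high`.

    Assumption: a log-uniform draw is used here only to cover several
    orders of magnitude; the abstract does not state the distribution.
    """
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def mix_reads(noncancer_reads, tumor_reads, tumor_fraction, n_total, rng):
    """Return `n_total` (read, is_tumor) pairs with an expected tumor share
    of `tumor_fraction`. Hypothetical helper, not the platform's method."""
    # Each simulated read is tumor-derived with probability `tumor_fraction`.
    n_tumor = sum(rng.random() < tumor_fraction for _ in range(n_total))
    # Sample with replacement so small reference pools can still be used.
    tumor_part = rng.choices(tumor_reads, k=n_tumor)
    normal_part = rng.choices(noncancer_reads, k=n_total - n_tumor)
    mixed = [(r, True) for r in tumor_part] + [(r, False) for r in normal_part]
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    rng = random.Random(0)
    fraction = sample_tumor_fraction(rng)
    sample = mix_reads(["n1", "n2", "n3"], ["t1", "t2"], fraction, 10_000, rng)
    observed = sum(is_tumor for _, is_tumor in sample) / len(sample)
    print(f"target fraction {fraction:.5f}, observed {observed:.5f}")
```

In this sketch the reads are opaque placeholders; with real data each element would be an aligned bisulfite sequencing read or a per-read methylation summary.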
A binary cancer prediction model trained on the FabricaTM-generated dataset and evaluated on the CORE-HH dataset showed reduced reliance on age-associated signal (R² = 0.15) and increased tuning towards tumor content, with minor cost to sensitivity (<4% at 95% specificity), relative to a model trained and cross-validated on the CORE-HH data (R² = 0.60). To evaluate the benefit of these data for TOO prediction, we trained multiclass TOO models using FabricaTM's tumor biopsy reference data with and without the addition of our simulated data for 10 target indication groupings and benchmarked their performance on our clinical dataset. Training with both data types yielded a 10% increase in balanced accuracy. Moreover, peak performance was achieved at different training sample numbers per indication, illustrating FabricaTM's potential to estimate sample size requirements during study design. Our findings suggest FabricaTM-generated data can augment limited datasets to train ML models to identify tumor-specific biomarkers obscured by technical and demographic biases in real-world data. The age-balancing approach can readily be applied to other confounders to help models learn true disease signatures. The platform is also adaptable to other diseases and biofluids (e.g., Alzheimer's disease, urine), allowing for extension to diagnostic and screening development beyond cancer and blood across the care spectrum.

Citation Format: Kade Pettie, Shiva Farashahi, Jackson Killian, Dorna Kashef, Josh Hubbell, Feras Hantash, Jocelyn Charlton, Kieran Chacko. FabricaTM: A large-scale data simulation platform isolates tumor signal from cell-free DNA and improves tissue of origin prediction accuracy [abstract]. In: Proceedings of the AACR Special Conference: Liquid Biopsy: From Discovery to Clinical Implementation; 2024 Nov 13-16; San Diego, CA. Philadelphia (PA): AACR; Clin Cancer Res 2024;30(21_Suppl):Abstract nr B065.
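For readers less familiar with the two metrics reported in the abstract, sensitivity at 95% specificity and balanced accuracy, the sketch below shows one standard way to compute each. It is an illustration only: the function names and the thresholding convention (a sample is called positive when its score is strictly above the threshold) are assumptions, and the authors' exact evaluation code is not described.

```python
import math


def sensitivity_at_specificity(scores, labels, specificity=0.95):
    """Fraction of positives scoring above a threshold chosen so that at
    least `specificity` of negatives score at or below it.

    `labels` uses 1 for cancer and 0 for non-cancer.
    """
    negatives = sorted(s for s, y in zip(scores, labels) if y == 0)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    # Index of the smallest negative score that keeps the required share of
    # negatives at or below the threshold (rounded to avoid float noise).
    k = math.ceil(round(specificity * len(negatives), 9)) - 1
    threshold = negatives[max(k, 0)]
    return sum(s > threshold for s in positives) / len(positives)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes weigh as much as common ones."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)
```

Balanced accuracy is a natural choice for TOO models because it averages recall over tumor types rather than over samples, so rarer indications are not swamped by common ones.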


