Abstract
Cell-free DNA (cfDNA) liquid biopsy is a promising non-invasive method for disease detection, but data availability and the complexity of cfDNA composition pose barriers to model development. Cancer prediction models trained on cfDNA data from clinically collected blood samples often struggle to isolate tumor-derived signals from confounding factors, and tissue of origin (TOO) models suffer from small sample sizes for rarer tumor types. To address these challenges, we developed FabricaTM, a platform to generate diverse, large-scale, high-fidelity simulated cfDNA datasets for robust and accurate machine learning (ML) model training. A key advantage of FabricaTM is its ability to enhance ML models by matching confounding variables between non-cancer and cancer samples. We first synthesized age-balanced non-cancer samples, then simulated tumor DNA shedding into circulation from 502 reference biopsy samples across 16 indications, optimizing for final tumor content distributions ranging from 0.001% to 14.2%. All reference non-cancer and biopsy samples consisted of aligned bisulfite sequencing reads from a custom target hybrid capture panel. This approach expanded a non-cancer dataset of 177 samples to 1,212 simulated samples and generated 6,400 simulated cancer cfDNA samples matching the non-cancers in age distribution. We then assessed this dataset for equivalence with clinical data and for its ability to improve cancer prediction. Non-linear dimensionality reduction and clustering of methylation signal showed that simulated samples cluster indistinguishably from 2,290 samples (1,149 non-cancer, 1,141 treatment-naïve cancer) from the CORE-HH clinical study (NCT05435066).
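The simulated shedding step described above can be pictured as an in-silico mixture of reads. The following is a minimal sketch only, not the FabricaTM implementation: the function names, the log-uniform draw of tumor content, the definition of tumor content as a fraction of reads, and sampling with replacement are all assumptions made for illustration.

```python
import math
import random


def sample_tumor_fraction(rng, low=1e-5, high=0.142):
    """Draw a tumor fraction between low and high.

    The bounds mirror the 0.001% to 14.2% range quoted in the abstract;
    the log-uniform shape is an assumption, not the published distribution.
    """
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def mix_reads(non_cancer_reads, tumor_reads, tumor_fraction, n_total, rng):
    """Build one simulated cfDNA sample of n_total reads.

    Roughly tumor_fraction of the reads are drawn (with replacement) from the
    tumor biopsy reference and the rest from a non-cancer reference.
    Both input lists are assumed to be non-empty.
    """
    n_tumor = round(n_total * tumor_fraction)
    n_background = n_total - n_tumor
    mixed = rng.choices(tumor_reads, k=n_tumor)
    mixed += rng.choices(non_cancer_reads, k=n_background)
    rng.shuffle(mixed)  # remove the ordering introduced by concatenation
    return mixed


# Hypothetical usage with placeholder read labels instead of aligned reads.
rng = random.Random(0)
fraction = sample_tumor_fraction(rng)
sample = mix_reads(["N"] * 1000, ["T"] * 1000, fraction, n_total=10_000, rng=rng)
```

In practice each list element would be an aligned bisulfite sequencing read rather than a label, and a confounder such as age would be matched when choosing which non-cancer reference to pair with each biopsy.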
A binary cancer prediction model trained on the FabricaTM-generated dataset and evaluated on the CORE-HH dataset showed reduced reliance on age-associated signal (R²=0.15) and increased tuning towards tumor content, at a minor cost to sensitivity (<4% at 95% specificity), relative to a model trained and cross-validated on the CORE-HH data (R²=0.60). To evaluate the benefit of these data for TOO prediction, we trained multiclass TOO models using FabricaTM's tumor biopsy reference data with and without the addition of our simulated data for 10 target indication groupings and benchmarked their performance on our clinical dataset. Training with both data types yielded a 10% increase in balanced accuracy. Moreover, peak performance was achieved at different training sample numbers per indication, illustrating FabricaTM's potential to estimate sample size requirements during study design. Our findings suggest FabricaTM-generated data can augment limited datasets to train ML models to identify tumor-specific biomarkers obscured by technical and demographic biases in real-world data. The age-balancing approach can readily be applied to other confounders to help models learn true disease signatures. The platform is also adaptable to other diseases and biofluids (e.g., Alzheimer's disease, urine), allowing for extension to diagnostic and screening development beyond cancer and blood across the care spectrum.

Citation Format: Kade Pettie, Shiva Farashahi, Jackson Killian, Dorna Kashef, Josh Hubbell, Feras Hantash, Jocelyn Charlton, Kieran Chacko. FabricaTM: A large-scale data simulation platform isolates tumor signal from cell-free DNA and improves tissue of origin prediction accuracy [abstract]. In: Proceedings of the AACR Special Conference: Liquid Biopsy: From Discovery to Clinical Implementation; 2024 Nov 13-16; San Diego, CA. Philadelphia (PA): AACR; Clin Cancer Res 2024;30(21_Suppl):Abstract nr B065.