Abstract

Circulating tumor DNA (ctDNA) fraction has been shown to be a prognostic factor for treatment response in patients with metastatic cancers. However, algorithms that accurately determine ctDNA fraction across purity ranges are still lacking. Commonly used liquid biopsy gene panels (~70 genes) often approximate the ctDNA fraction in a sample from somatic variant allele frequencies, which yields large confidence intervals because few somatic mutations are detected per sample and their clonality and copy number are uncertain. Genome-wide signatures such as nucleosome positioning, measured by windowed protection scores (WPS), have been shown to reflect genomic and epigenomic abnormalities specific to cancer and have been proposed as biomarkers for cancer detection. However, genome-wide assays sacrifice somatic mutation detection because of their low sequencing coverage, so in clinical practice a second assay is required to find actionable mutations. We leverage data from ~6,000 cell-free DNA samples to develop methods for accurately estimating the ctDNA fraction in clinical samples sequenced with a 554-gene panel covering 3 Mb of genomic content. First, we use Gradient Boosting with 17 features extracted from standard variant calling, copy number and QC pipelines. This approach has been shown to be highly accurate in samples of high ctDNA fraction and therefore provides reliable training labels for those samples. We suggest that more sophisticated methods, using genome-wide signatures, can increase prediction accuracy in samples of lower ctDNA fraction. Second, we develop a deep learning model for predicting ctDNA fraction using base-pair resolution WPS over the genomic territory covered by the gene panel. Using a Morlet wavelet transform, we converted the one-dimensional WPS representation of each autosome into a two-dimensional image. A 22-channel image per sample was generated and used as input to a Convolutional Neural Network (22-CNN) model.
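The path from fragments to a 22-channel image can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes the commonly used WPS definition (fragments fully spanning a window minus fragments with an endpoint inside it) and a plain NumPy Morlet transform. The function names, the 120 bp window, the wavelet parameter `w0`, the chosen scales, and the synthetic 190 bp periodic tracks of equal length are all hypothetical choices made for illustration.

```python
import numpy as np


def windowed_protection_score(fragments, region_len, window=120):
    """WPS per position (assumed definition): fragments fully spanning the
    window centred on the position minus fragments with an endpoint in it.
    `fragments` holds (start, end) pairs with inclusive coordinates."""
    half = window // 2
    starts = np.array([s for s, _ in fragments])
    ends = np.array([e for _, e in fragments])
    wps = np.zeros(region_len, dtype=int)
    for pos in range(region_len):
        lo, hi = pos - half, pos + half
        spanning = np.sum((starts < lo) & (ends > hi))
        start_in = (starts >= lo) & (starts <= hi)
        end_in = (ends >= lo) & (ends <= hi)
        # A fragment with either endpoint in the window is counted once.
        wps[pos] = spanning - np.sum(start_in | end_in)
    return wps


def morlet_scalogram(signal, scales, w0=6.0):
    """Magnitude of a Morlet wavelet transform, one row per scale.
    Returns an array of shape (len(scales), len(signal))."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()
    rows = []
    for s in scales:
        half = int(np.ceil(4 * s))  # truncate the wavelet at +/- 4 scales
        if 2 * half + 1 > signal.size:
            raise ValueError("scale too large for this signal length")
        t = np.arange(-half, half + 1) / s
        wavelet = np.exp(1j * w0 * t) * np.exp(-0.5 * t**2) / np.sqrt(s)
        rows.append(np.abs(np.convolve(signal, wavelet, mode="same")))
    return np.vstack(rows)


# Toy check of the WPS definition: one fragment covering positions 0..200.
toy_wps = windowed_protection_score([(0, 200)], region_len=201)

# Synthetic stand-ins for 22 autosomal WPS tracks (hypothetical data):
# sinusoids with a 190 bp period and a small phase shift per channel.
positions = np.arange(2000)
scales = [150, 180, 210]
tracks = [np.sin(2 * np.pi * (positions + 5 * k) / 190) for k in range(22)]

# One scalogram per autosome, stacked as channels for a CNN input.
image = np.stack([morlet_scalogram(t, scales) for t in tracks])
print(image.shape)  # (22, 3, 2000)
```

In this sketch each channel is a scale-by-position image, so periodic structure in the WPS track shows up as a bright band at the matching scale. Real tracks differ in length across autosomes, so some resizing or tiling step would be needed before stacking; the abstract does not say how this was handled.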
The 22-CNN was trained to predict the ctDNA fraction and to classify whether a sample was from a ctDNA shedder (ctDNA fraction > 0) or not. The dataset was divided equally into training and test sets of ~2,800 samples each. We used five-fold cross-validation (CV) during training and evaluated each model on the unseen test set. The mean squared error of the 22-CNN ctDNA prediction model on the test set was 0.0072 ± 0.0013 [0.006-0.009], and the accuracy of the classification model on the test set was 0.79 ± 0.006 [0.78-0.79]. In this work, we present both machine learning and deep learning models to reliably estimate the ctDNA fraction in plasma samples from cancer patients. These methods address a key challenge in the field and are applicable to other liquid biopsy assays.

Citation Format: Aamna M. Al-Shehhi, Chintan Parmar, Ben Gustafson, Angad Singh, Rebecca Leary, Markus Riester, O. Alejandro Balbin. Accurate quantification of tumor DNA in liquid biopsies using deep learning [abstract]. In: Proceedings of the AACR Virtual Special Conference on Artificial Intelligence, Diagnosis, and Imaging; 2021 Jan 13-14. Philadelphia (PA): AACR; Clin Cancer Res 2021;27(5_Suppl):Abstract nr PR-08.