Abstract

UG100 is a novel next-generation sequencing platform that combines high throughput with significantly lower sequencing cost. Previous studies have demonstrated the broad applicability of UG100 data for whole-genome germline variant calling, single-cell transcriptomics and whole-genome methylation analysis, as well as for detecting cancer signatures in cfDNA at very low fractions of circulating tumor DNA. Somatic variant calling is a natural application for this platform, as the lower sequencing cost enables deeper sequencing coverage. Here, we describe the implementation and evaluation of a somatic variant calling pipeline for UG100 whole-genome sequencing data. Since deep-learning-based methods currently outperform statistical methods for germline variant calling on UG100 data, we cast somatic variant calling as a classification problem. Specifically, we trained a classifier to distinguish whether a candidate at a given locus is a somatic variant or a sequencing error. We used a version of DeepVariant optimized for UG100 data to train the deep-learning classifier in three scenarios: tumor only, tumor with an unmatched background sample, and matched tumor-normal samples. The labeled truth set for training was generated by mixing whole-genome-sequenced samples from the Genome in a Bottle project over a wide range of proportions (0-100% mixing ratio) to simulate various allele frequencies, at an average genome coverage of 100x. The tumor/normal model was the best-performing of the three, with a recall of >98% for SNPs and 90% for indels at allele fractions above 10%. Notably, the model also showed high specificity, with 16 false positive SNPs and 19 false positive indels at allele fractions above 10% called on the chromosome held out from training (chr20). We then applied the model to WGS data from three well-characterized pairs of matched tumor and normal cell lines: HCC1143, COLO829 and HCC1395.
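The mixing design described above can be illustrated with a short calculation. The following is a minimal sketch, not the authors' pipeline: it assumes two diploid samples mixed by read fraction, with a variant present in one sample and absent (or at a known copy number) in the other. The function names and parameters are hypothetical and chosen only for illustration.

```python
def expected_allele_fraction(mix_ratio, alt_copies_a, alt_copies_b=0, ploidy=2):
    """Expected ALT allele fraction at a site in a two-sample mixture.

    mix_ratio    -- fraction of reads drawn from sample A (0.0 to 1.0)
    alt_copies_a -- ALT allele copies in sample A (0, 1 or 2 for a diploid)
    alt_copies_b -- ALT allele copies in sample B
    ploidy       -- assumed copy number at the site in both samples
    """
    if not 0.0 <= mix_ratio <= 1.0:
        raise ValueError("mix_ratio must be between 0 and 1")
    af_a = alt_copies_a / ploidy
    af_b = alt_copies_b / ploidy
    # Reads are a weighted blend of the two samples.
    return mix_ratio * af_a + (1.0 - mix_ratio) * af_b


def expected_alt_reads(mix_ratio, coverage, alt_copies_a, alt_copies_b=0, ploidy=2):
    """Expected number of ALT-supporting reads at a given mean coverage."""
    af = expected_allele_fraction(mix_ratio, alt_copies_a, alt_copies_b, ploidy)
    return coverage * af


if __name__ == "__main__":
    # A variant heterozygous in sample A and absent in sample B,
    # with sample A contributing 20% of reads at 100x coverage.
    print(expected_allele_fraction(0.2, alt_copies_a=1))
    print(expected_alt_reads(0.2, coverage=100, alt_copies_a=1))
```

Under these assumptions, a site heterozygous in one sample and absent in the other reaches an allele fraction of 10% at a 20% mixing ratio, so varying the mixing ratio from 0 to 100% spans allele fractions from 0 to 50% for heterozygous sites and 0 to 100% for homozygous sites.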
We evaluated performance on the pre-defined UG-HCR (Ultima Genomics High Confidence Region), which covers 95% of the human genome. The DeepVariant models performed very well at calling SNPs (>92% recall at allele frequencies above 10%) and indels (>90% recall). The calls were also highly specific, with fewer than one called variant per Mb absent from the ground truth across the UG-HCR. Lastly, we applied the models to 8 unpaired cell lines with known driver mutations and called 34 of 34 driver mutations of length ≤20 bp that appear in COSMIC (100% recall). We expect the UG100 sequencer to become an important tool for somatic genome analysis and to enable deep whole-genome sequencing to become a routine assay in clinical oncology.

Citation Format: Maya Levy, Doron Shem-Tov, Hila Benjamin, Sima Benjamin, Ilya Soifer, Shlomit Gilad, Danit Lebanony, Nika Iremadze, Eti Meiri, Doron Lipson, Omer Barad. Calling somatic variants from UG100 data using deep learning [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2023; Part 1 (Regular and Invited Abstracts); 2023 Apr 14-19; Orlando, FL. Philadelphia (PA): AACR; Cancer Res 2023;83(7_Suppl):Abstract nr 3134.
