Abstract

Tumor-specific molecular profiling by next-generation sequencing (NGS) of tissue or blood (liquid biopsy) has tremendous potential to guide personalized cancer treatment. However, in current clinical practice [1], tumor or blood is often sequenced without matched germline normal tissue or buffy coat. Because the tumor cell content of specimens varies widely, and because sequencing detects both germline and somatic variants, confident identification of somatic mutations is very challenging in the absence of a matched normal. Identifying somatic variants in cell-free DNA is even more difficult, mainly because of the low yield of tumor-derived DNA in plasma samples. An ensemble tree-based machine learning method was developed to classify tumor-specific somatic mutations in the absence of a matched normal (CSMutan). The algorithm uses multiple public databases and variant allele frequencies as features to distinguish somatic mutations from germline variants. Using The Cancer Genome Atlas (TCGA) and in-house datasets, models were built for each of three types of tumor specimen: fresh frozen; formalin-fixed, paraffin-embedded (FFPE); and plasma. Model performance was evaluated by cross-validation and/or by testing on independent sets of samples. Lung adenocarcinoma samples from TCGA were used to build the fresh frozen model, which achieved an area under the ROC curve (AUC) of 0.9968 on an independent data set of lung squamous cell carcinoma samples, also from TCGA. In-house sample sets were used to build the FFPE model and the plasma model. The FFPE model achieved an ROC AUC of 0.993, and the plasma model achieved an AUC of 0.997, based on ten-fold cross-validation.
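The general idea of classifying variants with a tree ensemble and scoring it by cross-validated ROC AUC can be illustrated with a minimal sketch. The abstract does not specify the ensemble algorithm, the databases, or the feature encoding, so everything here is an assumption made for illustration: the data are synthetic, the three features (`vaf`, `in_germline_db`, `in_somatic_db`) are hypothetical stand-ins, and bagged one-split trees (decision stumps) replace whatever ensemble CSMutan actually uses. The AUC is computed on pooled out-of-fold scores.

```python
import random

def make_variants(n, rng):
    """Synthetic variants (NOT real data). Label 1 = somatic, 0 = germline."""
    rows, labels = [], []
    for _ in range(n):
        somatic = rng.random() < 0.3
        if somatic:
            # Assumed: somatic variants show lower, more variable allele frequency.
            vaf = min(max(rng.gauss(0.20, 0.10), 0.01), 1.0)
            in_germline_db, in_somatic_db = rng.random() < 0.05, rng.random() < 0.60
        else:
            # Assumed: germline variants cluster near 0.5 (het) or 1.0 (hom).
            vaf = min(max(rng.gauss(rng.choice([0.5, 1.0]), 0.05), 0.01), 1.0)
            in_germline_db, in_somatic_db = rng.random() < 0.90, rng.random() < 0.05
        rows.append((vaf, float(in_germline_db), float(in_somatic_db)))
        labels.append(int(somatic))
    return rows, labels

def fit_stump(rows, labels, feature, rng, max_thresholds=16):
    """One-split tree on a single feature, chosen by weighted Gini impurity."""
    values = sorted({r[feature] for r in rows})
    cuts = [(a + b) / 2 for a, b in zip(values, values[1:])]
    if len(cuts) > max_thresholds:
        cuts = rng.sample(cuts, max_thresholds)
    base = sum(labels) / len(labels)
    best = (float("inf"), 0.0, base, base)  # (impurity, cut, left_rate, right_rate)
    for cut in cuts:
        left = [y for r, y in zip(rows, labels) if r[feature] <= cut]
        right = [y for r, y in zip(rows, labels) if r[feature] > cut]
        if not left or not right:
            continue
        pl, pr = sum(left) / len(left), sum(right) / len(right)
        gini = len(left) * pl * (1 - pl) + len(right) * pr * (1 - pr)
        if gini < best[0]:
            best = (gini, cut, pl, pr)
    return feature, best[1], best[2], best[3]

def fit_ensemble(rows, labels, rng, n_trees=15):
    """Bagging: each stump sees a bootstrap sample and one random feature."""
    n, n_features = len(rows), len(rows[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        feature = rng.randrange(n_features)
        trees.append(fit_stump([rows[i] for i in idx],
                               [labels[i] for i in idx], feature, rng))
    return trees

def predict_score(trees, row):
    """Average of the stumps' leaf somatic rates; higher means more somatic-like."""
    votes = [left if row[f] <= cut else right for f, cut, left, right in trees]
    return sum(votes) / len(votes)

def roc_auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def cross_validated_auc(rows, labels, k=10, seed=0):
    """k-fold cross-validation; AUC is computed on pooled out-of-fold scores."""
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)
    scores = [0.0] * len(rows)
    for fold in range(k):
        test = set(order[fold::k])
        train = [i for i in order if i not in test]
        trees = fit_ensemble([rows[i] for i in train],
                             [labels[i] for i in train], rng)
        for i in test:
            scores[i] = predict_score(trees, rows[i])
    return roc_auc(labels, scores)

rows, labels = make_variants(400, random.Random(42))
auc = cross_validated_auc(rows, labels)
print(f"pooled 10-fold ROC AUC on synthetic data: {auc:.3f}")
```

Because the synthetic classes are separable by construction, the printed AUC says nothing about performance on real tumor or plasma data; the sketch only shows how features, ensemble, and evaluation fit together.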
Although sequencing matched normal tissue is the preferred and most accurate method for identifying high-confidence somatic mutations, this work demonstrates that somatic mutations can be identified in tissue or blood with high accuracy when matched normal samples are not available. Future work will add more, and more diverse, cancer samples for both training and independent validation, which may improve accuracy and generalizability.

1. Meyerson, M., Gabriel, S. & Getz, G. Advances in understanding cancer genomes through second-generation sequencing. Nature Reviews Genetics 11, 685-696 (2010).

Citation Format: Lijing Yao, Preeti Lal, Li-Tai Fang, John Lee, John Palma, Andre Rosenthal, Bernd Hinzmann, Alex Lovejoy, Hugo Y. K. Lam. A method to identify somatic mutations from tumor samples in the absence of matched normal tissue [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2018; 2018 Apr 14-18; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2018;78(13 Suppl):Abstract nr 5293.