Abstract

Background: As high-throughput sequencing (HTS) technology advances, HTS is routinely used in cancer research as well as in clinical settings. However, we often need to identify somatic variants from tumor tissue without a matched normal control. This is challenging largely due to 1) the intrinsic nature of the tumor, such as purity, ploidy and clonality; 2) technical errors introduced by imperfect experimental procedures; 3) the complexity of the human genome; and 4) the unavailability of a matched normal sample to distinguish germline variants from somatic variants. Several efforts have been made to resolve this issue, such as SGZ and PureCN. However, their performance remains unsatisfactory. Most of these methods focus on tumor heterogeneity inference followed by variant allele frequency (VAF) estimation, while largely ignoring the characteristics of variant-harboring alignments. A better understanding of the characteristics of non-reference alleles in the NGS pipeline is essential for developing an improved variant calling and filtering approach. Therefore, we present a comprehensive evaluation of the characteristics of non-reference alleles.

Method: Whole exome sequencing (WES) was carried out. Three types of non-reference allele were identified as follows. Somatic and germline variants were identified using MuTect (with matched normal) and HaplotypeCaller, respectively, according to GATK best practice. For artifact identification, Pisces was used in tumor-only mode, and the somatic and germline datasets were subtracted from the retained variants. Variants with either low VAF or low sequencing depth were excluded from all three datasets. In terms of feature engineering, we defined a set of features that may differentiate between variant types, including VAF, population frequency, mappability-related genomic elements, alignment integrity, etc. Corresponding analyses were carried out to evaluate these features individually.
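The labeling scheme described in the Method can be sketched as a set subtraction followed by a quality filter. This is only an illustration under stated assumptions: the `Variant` record, the function names, and the `MIN_VAF` and `MIN_DEPTH` cut-offs are hypothetical, since the abstract does not report the thresholds or data structures actually used.

```python
from dataclasses import dataclass

# Hypothetical cut-offs: the abstract does not state the values used.
MIN_VAF = 0.05
MIN_DEPTH = 20


@dataclass(frozen=True)
class Variant:
    """Minimal variant record (illustrative, not a full VCF model)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    vaf: float   # variant allele frequency in the tumor sample
    depth: int   # sequencing depth at the site

    @property
    def key(self):
        # Two calls are treated as the same variant if site and alleles match.
        return (self.chrom, self.pos, self.ref, self.alt)


def passes_filters(variant, min_vaf=MIN_VAF, min_depth=MIN_DEPTH):
    """Keep variants that have neither low VAF nor low depth."""
    return variant.vaf >= min_vaf and variant.depth >= min_depth


def label_variants(somatic, germline, tumor_only):
    """Build the three labeled datasets.

    somatic    -- calls from the matched-normal somatic caller
    germline   -- calls from the germline caller
    tumor_only -- calls from the tumor-only caller
    Tumor-only calls found in neither of the other two sets are
    labeled as artifacts.
    """
    known = {v.key for v in somatic} | {v.key for v in germline}
    artifacts = [v for v in tumor_only if v.key not in known]

    datasets = {"somatic": somatic, "germline": germline, "artifact": artifacts}
    # The VAF/depth filter is applied to all three datasets.
    return {
        label: [v for v in calls if passes_filters(v)]
        for label, calls in datasets.items()
    }
```

Matching on chromosome, position and alleles is one simple way to define "the same variant"; a real pipeline would also need to normalize indel representation before comparing call sets.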
Results and conclusion: The above analysis can be seen as a process of feature evaluation and prioritization. We fitted a selection of features to a number of supervised classifiers, including random forest, adaptive boosting, gradient boosting and a voting classifier (soft voting). Precision and recall were measured using 5-fold cross-validation. Random forest achieved the highest performance: 0.90 and 0.93 for germline prediction, 0.93 and 0.91 for somatic prediction, and 0.95 and 0.94 for artifact prediction. This result suggests that, by carefully selecting relevant features, we can achieve acceptable accuracy of variant detection even when a matched normal is not available. Admittedly, we found some drawbacks along the way; for example, some recurrent artifacts are not accounted for by our features. In addition, our approach may be compromised when analyzing tumors with extreme purity. Further research is required to include additional features that address these drawbacks and achieve better performance.

Citation Format: Cheng Yan, Weihua Guo, Hongling Yuan, Yufei Yang. Comprehensive characterization of nonreference allele occurring in high throughput sequencing of tumor sample [abstract]. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27-28 and Jun 22-24. Philadelphia (PA): AACR; Cancer Res 2020;80(16 Suppl):Abstract nr 4433.
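For readers unfamiliar with the evaluation described in the results, the following is a minimal sketch of how per-class precision and recall can be computed on each of K = 5 held-out folds. The function names are hypothetical, the fold split is a plain unshuffled partition, and no classifier is included; it illustrates the metric, not the authors' code.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous, unshuffled folds."""
    # The first n_samples % k folds get one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def per_class_precision_recall(y_true, y_pred, labels):
    """Return {label: (precision, recall)}, treating each label one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Report 0.0 when a class is never predicted or never present.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        scores[label] = (precision, recall)
    return scores
```

In use, a classifier would be trained on each `train` split and its predictions on the matching `test` split passed to `per_class_precision_recall` with the labels `"germline"`, `"somatic"` and `"artifact"`; the per-fold scores can then be averaged.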
