Abstract

e18700
Background: Previous studies on mutation calling have documented capture kit batch effects in Whole Exome Sequencing (WES) data from The Cancer Genome Atlas (TCGA) database, which hinder direct comparison between samples prepared with different capture kits. In classification, for example, a cancer type sampled exclusively with one capture kit in the training set would be classified with very low accuracy if the testing set was sampled with another capture kit. To enable cross-capture-kit, between-cancer genotype analyses with the TCGA dataset, we developed a novel read count transformation algorithm that removes capture kit batch effects. The algorithm was tested with our machine learning model, which uses Tandem Repeat Sequence (TRS) mutation markers as training features.
Methods: The proposed algorithm transforms TRS read count data to remove low-quality samples, read depth differences, and capture kit batch effects from the dataset.
Results: 1) We investigated the TRS read counts of WES samples. In particular, we show that TRS site read counts correlate within capture kits but not across capture kits. This suggests that WES read count is largely independent of an exon's location in the genome and is more strongly correlated with capture kit probes. 2) The TRS detection rate of the samples within each capture kit was found to be normally distributed. Outliers with a very low TRS detection rate can be used for quality filtering. 3) The transformation algorithm effectively removes capture kit batch effects from the dataset while retaining cancer-specific signals in the samples. Before the transformation algorithm is applied, cancer type classification accuracy is low (~0-25%) when the testing data set uses a different capture kit from the training data set. We show that applying the transformation algorithm improves cancer type classification accuracy by over 65%.
Conclusions: We demonstrated that direct comparison of WES TRS read count data across capture kits is possible after application of our transformation algorithm. This opens the path to cross-capture-kit between-cancer genotype analyses with the TCGA dataset, which were previously unfeasible due to capture kit batch effects.
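
The abstract names three things the transformation removes (low-quality samples, read depth differences, and capture kit batch effects) but does not give its formulas. The sketch below is therefore only an illustrative stand-in, not the algorithm evaluated above. It assumes a samples-by-sites matrix of TRS read counts and one capture kit label per sample. The function names, the z-score cutoff, the "fraction of total reads" depth correction, and the per-kit, per-site standardization are all assumptions made for illustration.

```python
import numpy as np

def filter_low_detection(counts, kits, z_cutoff=-3.0):
    """Return a boolean mask of samples to keep (hypothetical helper).

    The detection rate is taken to be the fraction of TRS sites with at
    least one read. Within each kit, samples whose detection rate has a
    z-score at or below z_cutoff are flagged as low quality.
    """
    counts = np.asarray(counts)
    kits = np.asarray(kits)
    detection_rate = (counts > 0).mean(axis=1)
    keep = np.ones(len(kits), dtype=bool)
    for kit in np.unique(kits):
        idx = kits == kit
        sd = detection_rate[idx].std()
        if sd > 0:  # if every sample has the same rate, keep them all
            z = (detection_rate[idx] - detection_rate[idx].mean()) / sd
            keep[idx] = z > z_cutoff
    return keep

def normalize_depth(counts):
    """Express each sample's counts as fractions of its total read count."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    return counts / np.where(totals == 0, 1.0, totals)

def standardize_within_kit(values, kits):
    """Center and scale every site separately within each capture kit."""
    values = np.asarray(values, dtype=float)
    kits = np.asarray(kits)
    out = np.zeros_like(values)
    for kit in np.unique(kits):
        idx = kits == kit
        mu = values[idx].mean(axis=0)
        sd = values[idx].std(axis=0)
        out[idx] = (values[idx] - mu) / np.where(sd == 0, 1.0, sd)
    return out

# Toy data with made-up numbers: 2 hypothetical kits x 3 samples x 4 TRS sites.
kits = np.array(["kitA", "kitA", "kitA", "kitB", "kitB", "kitB"])
counts = np.array([
    [10, 40, 5, 20],
    [20, 80, 10, 40],
    [12, 44, 6, 22],
    [50, 5, 30, 10],
    [100, 10, 60, 20],
    [55, 6, 33, 12],
])
keep = filter_low_detection(counts, kits)
transformed = standardize_within_kit(normalize_depth(counts[keep]), kits[keep])
print(transformed.round(2))
```

Two limits of this stand-in are worth noting. With n samples in a kit, a single outlier's z-score cannot go below roughly -sqrt(n - 1), so a cutoff of -3 only takes effect in kits with more than ten samples. Per-kit centering also removes any biological signal that is confounded with the kit, for example when a cancer type was sampled with only one kit, whereas the abstract reports that its transformation retains cancer-specific signals.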
