Abstract

Existing cancer benchmark data sets for human sequencing data use germline variants, synthetic methods, or expensive validations, none of which are satisfactory for providing a large collection of true somatic variation across a whole genome. Here we propose a data set, Lineage-derived Somatic Truth (LinST), of short somatic mutations in the HT115 colon cancer cell line that are validated using a known cell lineage. The data set includes thousands of mutations and a high-confidence region covering 2.7 gigabases per sample.

Highlights

  • Existing cancer benchmark data sets for human sequencing data use germline variants, synthetic methods, or expensive validations, none of which are satisfactory for providing a large collection of true somatic variation across a whole genome

  • We provide a benchmarking data set of validated somatic mutations, Lineage-derived Somatic Truth (LinST), in a human colon cancer cell line with a DNA polymerase epsilon (POLE) proofreading deficiency (HT115)[13]

  • Lineage sequencing (LinSeq) could be repeated on other cancer types and samples to generate further benchmarking data sets, with the potential for testing various wet-laboratory techniques as well


Summary

Introduction

Existing cancer benchmark data sets for human sequencing data use germline variants, synthetic methods, or expensive validations, none of which are satisfactory for providing a large collection of true somatic variation across a whole genome. We propose a data set, Lineage-derived Somatic Truth (LinST), of short somatic mutations in the HT115 colon cancer cell line that are validated using a known cell lineage. The data set includes thousands of mutations and a high-confidence region covering 2.7 gigabases per sample. Even variants that are called by multiple methods are not guaranteed to be true positives. This demonstrates a critical need for high-quality benchmarking data that could be used to resolve the discrepancies. Synthetic truth data penalize callers that model somatic variation better than the simulations do.
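The idea of validating mutations against a known cell lineage can be illustrated with a small sketch. It is not the LinST pipeline. It assumes a simplified rule: a candidate somatic mutation is treated as lineage-consistent when the samples carrying it are exactly the descendants of one branch of the lineage. The tree, the sample names, and the functions `lineage_clades` and `is_lineage_consistent` are hypothetical and introduced only for illustration.

```python
from typing import Dict, FrozenSet, Iterable, List, Set

# Hypothetical lineage: internal nodes map to their children.
# Nodes that never appear as keys are the sequenced samples (leaves).
LINEAGE: Dict[str, List[str]] = {
    "root": ["A", "B"],
    "A": ["A1", "A2"],
    "B": ["B1", "B2"],
}


def _leaves_under(tree: Dict[str, List[str]], node: str,
                  out: Set[FrozenSet[str]]) -> FrozenSet[str]:
    """Return the leaves below `node`, recording every clade seen in `out`."""
    children = tree.get(node, [])
    if not children:
        leaves = frozenset([node])
    else:
        leaves = frozenset().union(
            *(_leaves_under(tree, child, out) for child in children)
        )
    out.add(leaves)
    return leaves


def lineage_clades(tree: Dict[str, List[str]], root: str) -> Set[FrozenSet[str]]:
    """Collect the leaf set of every branch of the lineage.

    Assumption of this sketch: the clade containing all samples is excluded,
    because a variant present in every sample cannot be separated from a
    germline variant by the lineage alone.
    """
    out: Set[FrozenSet[str]] = set()
    all_leaves = _leaves_under(tree, root, out)
    out.discard(all_leaves)
    return out


def is_lineage_consistent(carriers: Iterable[str],
                          clades: Set[FrozenSet[str]]) -> bool:
    """True when the carrier samples match exactly one branch of the lineage."""
    return frozenset(carriers) in clades


clades = lineage_clades(LINEAGE, "root")
shared_by_siblings = is_lineage_consistent({"A1", "A2"}, clades)
split_across_branches = is_lineage_consistent({"A1", "B1"}, clades)
```

Under these assumptions, a call seen in both `A1` and `A2` is supported by the lineage, whereas a call seen only in `A1` and `B1` contradicts it and would be a candidate false positive. A real analysis would also need to account for sequencing coverage and missed calls, which this sketch ignores.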

