Abstract

To test the performance of a new sequencing platform, develop an updated somatic calling pipeline, and establish a reference for future benchmarking experiments, we performed whole-genome sequencing of three common cancer cell lines (COLO-829, HCC-1143 and HCC-1187) along with their matched normal cell lines to high sequencing depth (up to 278x coverage) on both Illumina HiSeqX and NovaSeq sequencing instruments. Somatic calling was generally consistent between the two platforms despite minor differences at the read level. We designed and implemented a novel pipeline for the analysis of tumor-normal samples, using multiple variant callers. We show that, coupled with a high-confidence filtering strategy, the use of a combination of tools improves the accuracy of somatic variant calling. We also demonstrate the utility of the dataset by creating an artificial purity ladder to evaluate the somatic pipeline and to benchmark methods for estimating purity and ploidy from tumor-normal pairs. The data and the results of the pipeline are made accessible to the cancer genomics community.

Highlights

  • The field of cancer genomics has exploded with the development of high-throughput sequencing, largely driven by Illumina’s short-read sequencing technology

  • In both Read 1 and Read 2, NovaSeq instruments produced more stretches of Gs than HiSeq X Ten (HiSeqX), which we attributed to an artifact of the NovaSeq platform's 2-color chemistry, in which G is detected as the absence of signal

  • While there were some differences in Single Nucleotide Variant (SNV) and indel calls between the two pipelines, we found that copy-number variant (CNV) recall was very similar between them based on a gene-level comparison (99.8% recall for both our pipeline and the Sanger pipeline)

Introduction

The field of cancer genomics has exploded with the development of high-throughput sequencing, largely driven by Illumina’s short-read sequencing technology. With the introduction of any new sequencing technology, it is important to investigate its error profiles and biases, and to understand their impact on downstream analyses. This is especially important for cancer data analysis, where varying tumor purity and intra-tumor heterogeneity make it challenging to distinguish low-frequency somatic variants from sequencing noise. We have created a whole-genome reference dataset of three matched tumor-normal cell lines sequenced deeply on both HiSeqX and NovaSeq, employed it to evaluate our somatic pipeline, and released it to the genomics community. We believe it can serve as a reference dataset alongside similar datasets from real tumors[12,13] or cancer cell lines[8,14].
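
To illustrate why tumor purity matters for detecting low-frequency variants, the sketch below computes the expected variant allele fraction (VAF) of a clonal somatic variant from the standard mixture relationship VAF = p·m / (p·C_t + (1 − p)·C_n), where p is purity, m the number of mutated copies, and C_t and C_n the tumor and normal copy numbers at the locus. It is not taken from the pipeline described here; the function names, the purity values, and the 80x depth are hypothetical choices made for illustration only.

```python
def expected_vaf(purity, tumor_cn=2, mutated_copies=1, normal_cn=2):
    """Expected VAF of a clonal somatic variant in a tumor-normal mixture.

    Illustrative helper (hypothetical, not part of the described pipeline).
    purity         -- fraction of cells that are tumor, in [0, 1]
    tumor_cn       -- total copy number of the locus in tumor cells
    mutated_copies -- number of tumor copies carrying the variant
    normal_cn      -- copy number of the locus in normal cells
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    if not 0 <= mutated_copies <= tumor_cn:
        raise ValueError("mutated_copies must be between 0 and tumor_cn")
    # Average number of copies of the locus per cell in the mixture.
    total_copies = purity * tumor_cn + (1.0 - purity) * normal_cn
    if total_copies == 0:
        raise ValueError("locus is absent from the mixture")
    return purity * mutated_copies / total_copies


def expected_alt_reads(purity, depth, **locus):
    """Expected number of variant-supporting reads at a given depth."""
    return depth * expected_vaf(purity, **locus)


if __name__ == "__main__":
    depth = 80  # hypothetical coverage, chosen only for this example
    for purity in (1.0, 0.75, 0.5, 0.25, 0.1):  # hypothetical ladder steps
        vaf = expected_vaf(purity)
        reads = expected_alt_reads(purity, depth)
        print(f"purity={purity:.2f}  VAF={vaf:.3f}  alt reads={reads:.1f}")
```

For a heterozygous variant in a diploid region, the expected VAF is half the purity, so the number of supporting reads shrinks in proportion to purity and approaches the level of sequencing noise in low-purity samples.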
