Abstract

Bringing together cancer genomes from different projects increases power and allows the investigation of pan-cancer molecular mechanisms. However, working with whole genomes sequenced over several years in different sequencing centres requires a framework for comparing the quality of these sequences. We used the Pan-Cancer Analysis of Whole Genomes (PCAWG) cohort as a test case to construct such a framework. This cohort contains whole cancer genomes of 2832 donors from 18 sequencing centres. We developed a non-redundant set of five quality control (QC) measurements to establish a star rating system. These QC measures reflect known differences in sequencing protocol, provide a guide to downstream analyses, and allow samples of poor quality to be excluded. We have found that this is an effective framework of quality measures. The implementation of the framework is available at: https://dockstore.org/containers/quay.io/jwerner_dkfz/pancanqc:1.2.2.

Highlights

  • We found similar results for a subset of 348 samples sequenced at the Broad Institute (Supplementary Fig. 9), for which CGHub[27] recorded metadata about the sequencing time and the instruments used

  • While the proportion of calls supported by all four callers varies greatly by sample, we find that samples with four stars or more tend to have higher proportions than samples with fewer than four stars for single base mutations (SSM), somatic insertion/deletion mutations (SIM) and somatic structural mutations (SStM)

  • Though the model explains only a small proportion of the variance in the data, the results show that an increasing percentage of paired reads mapping to different chromosomes in tumour samples has a significant negative effect on the proportion of calls supported by four callers for SSM, SIM and SStM



Methods

The individual QC measures and the star rating for each normal-tumour sample pair in PCAWG are provided in Supplementary Data 1, which covers 2959 normal-tumour genome pairs from 2832 donors. Our analysis included samples that were later placed on the exclusion or grey list by the PCAWG consortium, some because of the quality measures we highlighted, others because of incomplete metadata or other issues such as contamination. The number of reads covering each base of the genome was determined and the mean was calculated.
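The mean-coverage calculation described above can be illustrated with a minimal Python sketch. It is not the pipeline's implementation. It assumes reads are given as half-open `(start, end)` intervals on a single toy reference sequence, and the function names `depths_from_intervals` and `mean_coverage` are hypothetical. A real workflow would derive per-base depth from aligned BAM files.

```python
from typing import Iterable, List, Tuple


def depths_from_intervals(reads: Iterable[Tuple[int, int]], genome_length: int) -> List[int]:
    """Count the reads covering each base of a toy reference.

    Reads are half-open (start, end) intervals; this input format is an
    assumption made for illustration only.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        start = max(start, 0)
        end = min(end, genome_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1

    # The running sum of the difference array is the depth at each base.
    depths = []
    running = 0
    for position in range(genome_length):
        running += diff[position]
        depths.append(running)
    return depths


def mean_coverage(depths: List[int]) -> float:
    """Mean number of reads per base over all positions of the reference."""
    if not depths:
        raise ValueError("no positions supplied")
    return sum(depths) / len(depths)


if __name__ == "__main__":
    toy_reads = [(0, 4), (2, 6), (5, 10)]  # made-up intervals, not real data
    toy_depths = depths_from_intervals(toy_reads, genome_length=10)
    print(toy_depths)
    print(mean_coverage(toy_depths))
```

Averaging over every position, including bases with zero depth, keeps uncovered regions in the denominator. Whether the original analysis excludes any regions from that denominator is not stated in this excerpt.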

