Abstract

IntroductionWhole-genome sequencing of cancer tumours is more a research tool nowadays, but going to be used in clinical settings in the near future to facilitate precision medicine. While large institutions have built up in-house bioinformatics solutions for their own data analysis, robust and portable workflows combining multiple software have been lacking, making it difficult for individual research groups to utilise the potential of this research field. Here we present Sarek, a robust, easy-to-instal workflow for identification of both somatic and germline mutations from paired tumour/normal/relapse samples.Material and methodsSarek is open source and implemented in Nextflow; a domain specific programming language to enable portability and reproducibility. With the help of docker containers the versions of the underlying software can be maintained. Furthermore, with Singularity it is possible to run the workflow on protected clusters with no internet connexion.The workflow starts from raw FASTQ files, and follows the GATK best practices to prepare the recalibrated files with joint realignment around indels for both the tumour and the normal data. Reads are alignment to the GRCh38 human reference in an ALT-aware settings using BWA, however, it is possible to assign other references. HaplotypeCaller and Strelka2 germline calls are collected for both the tumour and the normal sample, and Manta provides germline structural variants. The somatic variations are calculated by running MuTect2, Strelka and FreeBayes (and MuTect1 optionally). Somatic structural variants are delivered by Manta, and ASCAT estimates ploidy, tumour heterogeneity and CNVs.The resulting variant call files are annotated by SnpEff and Ensembl-VEP. The annotated calls are further filtered and prioritised by our custom methods. During running the workflow quality control metrics are also calculated and aggregated by MultiQC.Results and discussionsSarek was validated on a real dataset with known golden set of somatic mutations. In a real settings, whole-genome sequencing (WGS, 45–60x coverage) of patient-matched tumour and blood derived-DNA is being performed on a set of 80 paediatric brain tumour samples of the Swedish Childhood Tumour Biobank. The workflow helps to produce, filter, prioritise and characterise both germline and somatic variations.ConclusionSarek is a portable bioinformatics pipeline for WGS normal/tumour matched samples, aiding precision medicine by improved subtyping and to gain novel functional insights in a reproducible framework.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call