Building an optimized pipeline for whole-exome sequencing

M D'Antonio, N Sanna, G Pesole, T Castrignanò, P D'Onorio De Meo, B Elmi

doi:10.14806/ej.18.a.389

Abstract

Motivations. Managing the huge amount of data produced by NGS platforms requires non-trivial IT skills. Furthermore, the long list of freely available tools for NGS data analysis makes it difficult to choose the pipeline components. An additional layer of complexity comes from the need to integrate all steps into a single analysis: the best tool for a specific purpose may be incompatible with other tools in the pipeline, e.g. a tool that outputs a statistical summary when the next step requires raw data. For whole-exome analysis, building effective pipelines that relate variants to their samples and controls and annotate them from multiple sources requires a large customization effort.

Methods. Our proposed pipeline walks through several steps to perform a full analysis.

1) Before the short reads are mapped against the reference genome, pre-processing is necessary. FastQ files should be checked for integrity and cleaned of any unwanted symbols that could alter the behavior of NGS tools. Quality checks can be carried out with tools such as FastQC to ensure that the sequences reach the minimum mean quality needed for a complete analysis.

2) Alignment is usually performed with BWA [1], which supports gapped alignment and offers a good compromise between speed and accuracy. When the sequences have known problems, e.g. FastQC reports poor quality in the first or last bases sequenced, other tools can perform a more sensitive alignment at a lower speed.

3) BWA provides mapping results in SAM format [2], the most widely used format for alignment output. This text-based format should be converted into its binary equivalent, BAM, with SAMtools; BAM files can be sorted and indexed to enable faster operations in subsequent steps.

4) Before variants are searched for in the binary mapping data, further processing is required to prevent artifacts in the results. Quality recalibration is required to correct anomalies in the quality scores caused by sequencing and mapping. Duplicates are in most cases the result of PCR amplification and should be removed, as they lead to false positives. Realignment around known indel positions should also be carried out to remove further artifacts.

5) Single Nucleotide Polymorphisms (SNP: a single nucleotide occurring in one member is replaced by another nucleotide in the other member) and Deletion-Insertion Polymorphisms (DIP: a short nucleotide sequence present in one member is omitted in the other member) can now be called from the mapping data obtained in the previous four steps.

6) The SNPs and DIPs obtained carry various scores, which must be evaluated to enforce a minimum depth of coverage and a minimum quality score, and so remove false positives from the list.

7) When dealing with multiple WES data lanes, the usual scenario is a combination of affected and unaffected tissue samples. In this case haplotype phasing is critical information, as it allows the discovery of compound heterozygous or homozygous mutations.

8) The last critical aspect of variant calling is to associate as many annotations as possible with the variant list, e.g. annotations stored in databases such as dbSNP, 1000 Genomes, etc.

After these steps the data can be saved into custom databases to allow cross-linking, intersections, statistics and much more.

Results. We have tested different freely available algorithms used at the alignment and post-alignment stages and integrated them with custom-built scripts to provide the most suitable and complete combination for producing significant whole-exome results. Hence, we have customized the whole-exome data analysis pipeline to preferentially retain true variants by minimizing the incidence of false positives, and we provide benchmarks for choosing the right analytical tools.
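As an illustration only, the FastQ integrity check of step 1 and the depth/quality filter of step 6 might be sketched in Python as follows. This is not the authors' implementation: the function names, the Phred+33 encoding, the four-line FastQ record layout, the use of the VCF QUAL column and INFO/DP field, and the thresholds (QUAL >= 30, DP >= 10) are all assumptions chosen for the example.

```python
from typing import Dict, Iterable, Iterator, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding
VALID_BASES = set("ACGTN")


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality); raise ValueError on a malformed record.

    Assumes unwrapped, four-line FastQ records.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # skip blank lines between records
        try:
            seq = next(it).rstrip("\r\n")
            plus = next(it).rstrip("\r\n")
            qual = next(it).rstrip("\r\n")
        except StopIteration:
            raise ValueError(f"truncated record at {header!r}") from None
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"bad record markers at {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at {header!r}")
        if not set(seq.upper()) <= VALID_BASES:
            raise ValueError(f"unexpected symbol in sequence at {header!r}")
        yield header[1:], seq, qual


def mean_quality(qual: str) -> float:
    """Mean Phred score of one quality string (0.0 for an empty read)."""
    if not qual:
        return 0.0
    return sum(ord(c) - PHRED_OFFSET for c in qual) / len(qual)


def passes_filters(vcf_line: str, min_qual: float = 30.0, min_depth: int = 10) -> bool:
    """True if a VCF data line meets the hypothetical QUAL and INFO/DP thresholds."""
    fields = vcf_line.rstrip("\r\n").split("\t")
    qual = fields[5]  # VCF column 6: QUAL
    info: Dict[str, str] = dict(
        item.split("=", 1) for item in fields[7].split(";") if "=" in item
    )
    if qual == "." or "DP" not in info:
        return False  # missing values cannot be shown to pass
    return float(qual) >= min_qual and int(info["DP"]) >= min_depth
```

For instance, a read whose quality string is `IIII` has a mean Phred score of 40 under the Phred+33 assumption, and a variant line with QUAL 50 and `DP=3` would be rejected by the default depth threshold. In a real pipeline these thresholds would be tuned to the sequencing depth and the variant caller in use.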
