snpQT: flexible, reproducible, and comprehensive quality control and imputation of genomic data

Christina Vasilopoulou, Benjamin Wingfield, Andrew P Morris, William Duddy

doi:10.12688/f1000research.53821.2

Open Access

Abstract

Quality control of genomic data is an essential but complicated multi-step procedure, often requiring separate installation of, and expert familiarity with, a combination of different bioinformatics tools. Software incompatibilities and inconsistencies across computing environments are recurrent challenges, leading to poor reproducibility. Existing semi-automated or automated solutions lack comprehensive quality checks, flexible workflow architecture, and user control. To address these challenges, we have developed snpQT: a scalable, stand-alone software pipeline using nextflow and BioContainers, for comprehensive, reproducible and interactive quality control of human genomic data. snpQT offers some 36 discrete quality filters or correction steps in a complete standardised pipeline, producing graphical reports to demonstrate the state of data before and after each quality control procedure. This includes human genome build conversion, population stratification against data from the 1000 Genomes Project, automated population outlier removal, and built-in imputation with its own pre- and post-imputation quality controls. Common input formats are used, and a synthetic dataset and comprehensive online tutorial are provided for testing, educational purposes, and demonstration. The snpQT pipeline is designed to run with minimal user input and coding experience; quality control steps are implemented with numerous user-modifiable thresholds, and workflows can be flexibly combined. snpQT is open source and freely available at https://github.com/nebfield/snpQT. A comprehensive online tutorial and installation guide, covering the workflow through to GWAS, is available at https://snpqt.readthedocs.io/en/latest/; it introduces snpQT using a synthetic demonstration dataset and a real-world Amyotrophic Lateral Sclerosis SNP-array dataset.

Highlights

The option -profile standard,singularity can be replaced with any of the other profile choices we provide, depending on your installation and needs, e.g. whether local imputation is needed or not (-profile standard,[docker/conda]) and whether you wish to run your experiments on a High Performance Computing (HPC) cluster (-profile cluster,[singularity/modules]).
For a logistic regression analysis on the toy dataset, the user can run the following command:

    nextflow run main.nf \
      -profile standard,singularity \
      -resume \
      -params-file parameters.yaml \
      --bed data/toy.bed \
      --bim data/toy.bim \
      --fam data/toy.fam \
      --qc \
      --pop_strat \
      --gwas \
      --results results_toy/ \
      --sexcheck false
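When the same analysis is launched repeatedly with different profiles or input files, it can help to assemble the command programmatically. The following is a minimal Python sketch, not part of snpQT: the function name `build_snpqt_command` and its parameters are hypothetical, and only the flags quoted in the command above are used.

```python
import shlex


def build_snpqt_command(profile="standard,singularity",
                        prefix="data/toy",
                        results="results_toy/",
                        params_file="parameters.yaml"):
    """Return the toy-dataset command as an argv list.

    Hypothetical helper: it only rearranges the flags quoted in the text
    and assumes the .bed/.bim/.fam files share one path prefix.
    """
    return [
        "nextflow", "run", "main.nf",
        "-profile", profile,
        "-resume",
        "-params-file", params_file,
        "--bed", f"{prefix}.bed",
        "--bim", f"{prefix}.bim",
        "--fam", f"{prefix}.fam",
        "--qc",
        "--pop_strat",
        "--gwas",
        "--results", results,
        "--sexcheck", "false",
    ]


if __name__ == "__main__":
    # Default profile, as in the quoted command.
    print(shlex.join(build_snpqt_command()))
    # Swapping in another profile named in the text.
    print(shlex.join(build_snpqt_command(profile="standard,docker")))
```

The argv list can be passed to `subprocess.run` from the snpQT directory; printing it first with `shlex.join` makes it easy to check the command before launching a long run.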

Summary

14 Jul 2021

The analyst may encounter incompatibility and scalability problems and installation difficulties, and may spend valuable time becoming familiar with a number of different tools that sometimes lack detailed documentation. Software architecture tools such as nextflow and BioContainers can address these issues and have been proposed as automated solutions; however, restrictions exist in terms of limited and relatively rigid QC analysis that lacks steps such as imputation, a limited variety of threshold choices and plot outputs, and the requirement for users to have extensive knowledge of the software in order to tailor their analysis. We present snpQT (shown in Figure 1): a standardised, flexible, scalable, automatic pipeline tool that provides comprehensive quality control, with imputation and association analysis, including publication-ready figures for data interpretation and validation for every QC step. Detailed reports, including distribution plots both before and after applying each QC threshold, aid the user in decision-making, and it is easy to re-run an analysis with modified thresholds to arrive at optimal output.
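The idea of comparing data before and after a threshold, then re-running with a modified threshold, can be illustrated with a small Python sketch. This is not snpQT's implementation: the metric (per-variant missingness), the variant identifiers, the values, and both thresholds are made up for illustration.

```python
def apply_threshold(values, max_allowed):
    """Split items into (kept, removed) using an upper threshold."""
    kept = {k: v for k, v in values.items() if v <= max_allowed}
    removed = {k: v for k, v in values.items() if v > max_allowed}
    return kept, removed


# Hypothetical per-variant missingness (fraction of samples with no call).
missingness = {"rs1": 0.00, "rs2": 0.01, "rs3": 0.04, "rs4": 0.12}

# "Re-running" with a modified threshold is just a second pass
# over the same input with a different cut-off.
for threshold in (0.02, 0.05):
    kept, removed = apply_threshold(missingness, threshold)
    print(f"threshold={threshold}: before={len(missingness)} "
          f"after={len(kept)} removed={sorted(removed)}")
```

Reporting the counts before and after each cut-off mirrors, in a very reduced form, the role the before/after distribution plots play in choosing thresholds.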

Journal: F1000Research	Publication Date: Nov 29, 2021
Citations: 1	License type: CC BY 4.0
