Abstract

Quality control of genomic data is an essential but complicated multi-step procedure, often requiring separate installation of, and expert familiarity with, a combination of different bioinformatics tools. Software incompatibilities and inconsistencies across computing environments are recurrent challenges, leading to poor reproducibility. Existing semi-automated or automated solutions lack comprehensive quality checks, flexible workflow architecture, and user control. To address these challenges, we have developed snpQT: a scalable, stand-alone software pipeline using nextflow and BioContainers, for comprehensive, reproducible and interactive quality control of human genomic data. snpQT offers some 36 discrete quality filters or correction steps in a complete standardised pipeline, producing graphical reports to demonstrate the state of data before and after each quality control procedure. This includes human genome build conversion, population stratification against data from the 1000 Genomes Project, automated population outlier removal, and built-in imputation with its own pre- and post-imputation quality controls. Common input formats are used, and a synthetic dataset and comprehensive online tutorial are provided for testing, educational purposes, and demonstration. The snpQT pipeline is designed to run with minimal user input and coding experience; quality control steps are implemented with numerous user-modifiable thresholds, and workflows can be flexibly combined in custom combinations. snpQT is open source and freely available at https://github.com/nebfield/snpQT. A comprehensive online tutorial and installation guide, covering the workflow through to GWAS, is provided at https://snpqt.readthedocs.io/en/latest/; it introduces snpQT using a synthetic demonstration dataset and a real-world Amyotrophic Lateral Sclerosis SNP-array dataset.

Highlights

  • Quality control of genomic data is an essential but complicated multistep procedure, often requiring separate installation and expert familiarity with a combination of different bioinformatics tools

  • The option -profile standard,singularity can be replaced with any of the other profiles we provide, depending on your installation and needs, e.g. whether local imputation is needed or not (-profile standard,[docker/conda]), and whether you wish to run your experiments on a High Performance Computing (HPC) cluster (-profile cluster,[singularity/modules])

  • For a logistic regression analysis on the toy dataset, the user can run the following command:

      nextflow run main.nf \
        -profile standard,singularity \
        -resume \
        -params-file parameters.yaml \
        --bed data/toy.bed \
        --bim data/toy.bim \
        --fam data/toy.fam \
        --qc \
        --pop_strat \
        --gwas \
        --results results_toy/ \
        --sexcheck false
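
The profile substitution described in the highlights can be sketched as a small shell helper. This is an illustrative sketch only, not part of snpQT: the function name snpqt_cmd is hypothetical, it accepts only the profile combinations named in the text above (snpQT itself may support others), and it prints the command rather than running it. All other flags are copied unchanged from the toy-dataset command.

```shell
#!/bin/sh
# Hypothetical helper (not part of snpQT): print the toy-dataset command
# for a chosen profile pair instead of executing it.
# Usage: snpqt_cmd <standard|cluster> <singularity|docker|conda|modules>
snpqt_cmd() {
  profile="$1,$2"

  # Assumption: only the combinations mentioned in the text are accepted.
  case "$profile" in
    standard,singularity|standard,docker|standard,conda) ;;
    cluster,singularity|cluster,modules) ;;
    *) echo "unsupported profile: $profile" >&2; return 1 ;;
  esac

  # Flags below are taken verbatim from the toy-dataset example.
  cmd="nextflow run main.nf -profile $profile -resume -params-file parameters.yaml"
  cmd="$cmd --bed data/toy.bed --bim data/toy.bim --fam data/toy.fam"
  cmd="$cmd --qc --pop_strat --gwas --results results_toy/ --sexcheck false"
  echo "$cmd"
}

# Example: show the command for an HPC run that uses environment modules.
snpqt_cmd cluster modules
```

Printing the command first makes it easy to check the profile before committing to a long run; once it looks right, the printed line can be pasted into the terminal from the snpQT directory.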


Summary

14 Jul 2021

The analyst may encounter incompatibility and scalability problems and installation difficulties, and may spend valuable time familiarizing themselves with a number of different tools that sometimes lack detailed documentation. Software architecture tools such as nextflow and BioContainers can address these issues and have been proposed as automated solutions; however, existing solutions are restricted by limited and relatively rigid QC analysis that lacks steps such as imputation, by a limited variety of threshold choices and plot outputs, and by the requirement for users to have extensive knowledge of the software in order to tailor their analysis. We present snpQT (shown in Figure 1): a standardised, flexible, scalable, automatic pipeline tool that provides comprehensive quality control, with imputation and association analysis, including publication-ready figures for data interpretation and validation for every QC step. Detailed reports, including distribution plots both before and after applying each QC threshold, aid the user in decision-making, and it is easy to re-run an analysis with modified thresholds to arrive at optimal output.
