Abstract

Data quality is an often overlooked aspect of omics data analysis. This is particularly relevant in studies of chemical and pathogen exposures, which can modify an individual's epigenome and transcriptome with effects that persist over time. Portable quality control (QC) pipelines for multiple omics datasets are therefore needed. To meet this need, portable quality assurance (QA) metrics, metric acceptability criteria, and pipelines to compute these metrics were developed and consolidated into a single framework covering 12 different omics assays. The performance of these QA metrics and pipelines was evaluated on human data generated by the Defense Advanced Research Projects Agency (DARPA) Epigenetic CHaracterization and Observation (ECHO) program. Twelve analytical pipelines were developed, leveraging standard tools where possible, and containerized using Singularity to ensure portability and scalability. Datasets for these 12 omics assays were analyzed, the results were summarized, and the quality metrics and thresholds used were described. We found that these pipelines enabled early identification of lower-quality datasets, datasets with insufficient reads that required additional sequencing, and experimental protocols needing refinement. The omics data analysis and QC pipelines reported and discussed in this article are available as open-source resources for the omics and life sciences communities.
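
The abstract notes that each QC pipeline was containerized with Singularity to ensure portability. As a minimal, hypothetical sketch of what invoking one such containerized QC step could look like (the image name, tool, file paths, and helper function below are illustrative assumptions, not names from the published pipelines), a host script might bind a data directory into the container and run a standard QC tool such as FastQC:

```python
import subprocess

def run_qc_step(image, command, data_dir):
    """Run one QC step inside a Singularity container.

    Hypothetical illustration: 'image', 'command', and 'data_dir'
    are placeholders, not identifiers from the published pipelines.
    """
    # 'singularity exec' runs a command inside the container image;
    # '--bind' mounts the host data directory at /data so the tool
    # can read inputs and write its QC reports back to the host.
    result = subprocess.run(
        ["singularity", "exec", "--bind", f"{data_dir}:/data", image, *command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Example usage (illustrative only): compute read-level QC metrics
# for one sequencing sample with FastQC.
# run_qc_step("qc_pipeline.sif",
#             ["fastqc", "/data/sample_R1.fastq.gz", "-o", "/data/qc"],
#             "/path/to/run")
```

Packaging each pipeline as a single container image in this way is what makes the framework portable: the same image and invocation work unchanged on a laptop or an HPC cluster, without local installation of the underlying tools.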
