NPARS-A Novel Approach to Address Accuracy and Reproducibility in Genomic Data Science.

Li Ma, Jason Muesse, Erich A Peterson, Katy Marino, Donald J Johann, Matthew A Steliga, Ik Jae Shin

doi:10.3389/fdata.2021.725095

Abstract

Background: Accuracy and reproducibility are vital in science and present a significant challenge in the emerging discipline of data science, especially when the data are scientifically complex and massive in size. Further complicating matters, in the field of genomic-based science, high-throughput sequencing technologies generate considerable amounts of data that need to be stored, manipulated, and analyzed using a plethora of software tools. Researchers are rarely able to reproduce published genomic studies. Results: Presented is a novel approach that facilitates accuracy and reproducibility for large genomic research data sets. All data needed are loaded into a portable local database, which serves as an interface for well-known software frameworks. These include Python-based Jupyter Notebooks and RStudio projects with R Markdown. All software is encapsulated using Docker containers and managed by Git, simplifying software configuration management. Conclusion: Accuracy and reproducibility in science are of paramount importance. For the biomedical sciences, advances in high-throughput technologies, molecular biology, and quantitative methods are providing unprecedented insights into disease mechanisms. With these insights comes the associated challenge of scientific data that are complex and massive in size. This makes collaboration, verification, validation, and reproducibility of findings difficult. To address these challenges, the NGS post-pipeline accuracy and reproducibility system (NPARS) was developed. NPARS is a robust software infrastructure and methodology that can encapsulate data, code, and reporting for large genomic studies. This paper demonstrates the successful use of NPARS on large and complex genomic data sets across different computational platforms.

Highlights

The intersection of data science, analytics, and precision medicine is having an increasingly important role in the formation and delivery of health care, especially in cancer, where the treatment regimens are complex and becoming more individualized (Ginsburg and Phillips, 2018).
Our understanding of the genomic basis of disease is being transformed by the combination of next-generation sequencing (NGS) and state-of-the-art computational data analysis, which are empowering the entry of innovative molecular assays into the clinic and further enabling precision medicine (Berger and Mardis, 2018).
A central tenet in science that distinctly extends into data science is accuracy, which is the quality or state of being correct or precise.

Summary

Introduction

The intersection of data science, analytics, and precision medicine is having an increasingly important role in the formation and delivery of health care, especially in cancer, where the treatment regimens are complex and becoming more individualized (Ginsburg and Phillips, 2018). Data science is a nascent, cross-disciplinary field that can be viewed as an amalgamation of classic disciplines. These include, but are not limited to, statistics, applied mathematics, and computer science. Importantly, the field is focused on finding nonobvious and useful patterns in large datasets (Kelleher and Tierney, 2018). A central tenet in science that distinctly extends into data science is accuracy, which is the quality or state of being correct or precise. It is defined as the ratio of correctly predicted observations to the total observations, and is utilized to measure predictive power.
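The definition of accuracy given above can be illustrated with a minimal sketch. The function name `accuracy` and the example labels are hypothetical, chosen only for illustration; they are not taken from NPARS or from the study's data.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], observed: Sequence[str]) -> float:
    """Return the ratio of correctly predicted observations to total observations.

    Hypothetical helper for illustration only; not part of NPARS.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if len(observed) == 0:
        raise ValueError("at least one observation is required")
    # Count the positions where the prediction matches the observed label.
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Made-up labels: five sites, each called as "variant" or "reference".
predicted_calls = ["variant", "reference", "variant", "reference", "variant"]
observed_calls = ["variant", "reference", "reference", "reference", "variant"]

# Four of the five predictions match the observed labels.
print(accuracy(predicted_calls, observed_calls))
```

In this made-up example, four of five predictions agree with the observed labels, so the ratio is 4/5.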

Methods

Results

Discussion

Conclusion