Abstract

Background: The massive amounts of data from next generation sequencing (NGS) methods pose various challenges with respect to data security, storage and metadata management. While there is a broad range of data analysis pipelines, these challenges remain largely unaddressed to date.

Results: We describe the integration of the open-source metadata management system iRODS (Integrated Rule-Oriented Data System) with a cancer genome analysis pipeline in a high performance computing environment. The system allows for customized metadata attributes as well as fine-grained protection rules and is augmented by a user-friendly front-end for metadata input. This results in a robust, efficient end-to-end workflow that accounts for data security, central storage and unified metadata information.

Conclusions: Integrating iRODS with an NGS data analysis pipeline is a suitable method for addressing the challenges of data security, storage and metadata management in NGS environments.

Highlights

  • The massive amounts of data from next generation sequencing (NGS) methods pose various challenges with respect to data security, storage and metadata management

  • The present paper describes the implementation of a system designed to embed the in-house next generation sequencing (NGS) analysis pipeline into an end-to-end workflow, using the comprehensive data management system iRODS (Integrated Rule-Oriented Data System) [1] and a web-based, in-house developed front-end for metadata input and workflow management

  • Based on our experiences with NGS workflows in high performance computing (HPC) environments [2, 3], we have decided to use iRODS since it allows for customized metadata attributes, fine-grained protection rules as well as a query system to quickly organize and review the results of a cancer genome analysis workflow

Summary

Introduction

The massive amounts of data from next generation sequencing (NGS) methods pose various challenges with respect to data security, storage and metadata management. NGS is an increasingly cost-efficient and reliable method for obtaining whole genomes or exomes (i.e., the protein-coding part of the genome) in a relatively short time. These challenges need to be addressed in order to enable exploration, analysis and effective dissemination of the resulting data. Based on our experiences with NGS workflows in HPC environments [2, 3], we have decided to use iRODS since it allows for customized metadata attributes, fine-grained protection rules as well as a query system to quickly organize and review the results of a cancer genome analysis workflow.
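To make the three properties named above more concrete (customized metadata attributes, protection rules and a query system), the following is a minimal in-memory sketch in Python. It does not use the iRODS API and is not the implementation described in the paper. It only mirrors the attribute-value-unit form in which iRODS stores metadata. The class names, attribute names, user names and paths are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AVU:
    """One attribute-value-unit triple, the form iRODS uses for metadata."""
    attribute: str
    value: str
    unit: str = ""


@dataclass
class DataObject:
    """Hypothetical stand-in for a stored file, its metadata and its readers."""
    path: str
    metadata: set = field(default_factory=set)
    readers: set = field(default_factory=set)


def query(objects, user, **criteria):
    """Return sorted paths readable by `user` whose metadata match all criteria."""
    hits = []
    for obj in objects:
        if user not in obj.readers:
            continue  # simplified protection rule: unreadable objects are skipped
        # Assumes at most one value per attribute (a simplification for this sketch).
        attrs = {avu.attribute: avu.value for avu in obj.metadata}
        if all(attrs.get(key) == value for key, value in criteria.items()):
            hits.append(obj.path)
    return sorted(hits)


# Hypothetical catalog of pipeline outputs with user-defined attributes.
catalog = [
    DataObject("/zone/run1/tumor.bam",
               {AVU("sample_type", "tumor"), AVU("coverage", "60", "x")},
               {"alice", "bob"}),
    DataObject("/zone/run1/normal.bam",
               {AVU("sample_type", "normal"), AVU("coverage", "30", "x")},
               {"alice"}),
    DataObject("/zone/run2/tumor.bam",
               {AVU("sample_type", "tumor"), AVU("coverage", "60", "x")},
               {"bob"}),
]

if __name__ == "__main__":
    print(query(catalog, "alice", sample_type="tumor"))
    print(query(catalog, "bob", sample_type="tumor", coverage="60"))
```

In this sketch the access check runs before the metadata match, so a query never reveals the existence of objects the user is not allowed to read. A real deployment would delegate both steps to the data management system rather than to application code.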
