Abstract

With ever-increasing amounts of data being produced by next-generation sequencing (NGS) experiments, the requirements placed on supporting e-infrastructures have grown. In this work, we provide recommendations based on the collective experiences from participants in the EU COST Action SeqAhead for the tasks of data preprocessing, upstream processing, data delivery, and downstream analysis, as well as long-term storage and archiving. We cover demands on computational and storage resources, networks, software stacks, automation of analysis, education, and also discuss emerging trends in the field. E-infrastructures for NGS require substantial effort to set up and maintain over time, and with sequencing technologies and best practices for data analysis evolving rapidly, it is important to prioritize both processing capacity and e-infrastructure flexibility when making strategic decisions to support the data analysis demands of tomorrow. Due to increasingly demanding technical requirements we recommend that e-infrastructure development and maintenance be handled by a professional service unit, be it internal or external to the organization, and emphasis should be placed on collaboration between researchers and IT professionals.

Electronic supplementary material: The online version of this article (doi:10.1186/s13742-016-0132-7) contains supplementary material, which is available to authorized users.

Highlights

  • Parallel sequencing, known as next-generation sequencing (NGS), has reduced the cost and increased the throughput of biological sequencing, enabling the study of biological phenomena on a detailed level with great promise for improving clinical care [1,2,3]

  • We observe that the most common e-infrastructure components include high-performance computing (HPC) resources equipped with batch systems, commonly connected to shared network-attached storage (NAS). Another e-infrastructure component that is gaining in popularity in NGS is cloud computing [7] on virtualized resources, and in this context we focus primarily on infrastructure as a service (IaaS)

  • The time to plan, procure, install, and test e-infrastructure is considerably longer than the time necessary to obtain an operational data-generating instrument, and it is not uncommon that new sequencers are acquired before the supporting e-infrastructure is fully deployed, forcing them to run at reduced capacity


Summary

Background

Parallel sequencing, known as next-generation sequencing (NGS), has reduced the cost and increased the throughput of biological sequencing, enabling the study of biological phenomena on a detailed level with great promise for improving clinical care [1,2,3]. The operation can be more challenging when the two phases run on different infrastructures. This case is likely most common when the two phases run at different centers, but it can also happen when all users are under the same roof, since it can be advantageous to use separate e-infrastructures for upstream processing and downstream analysis: these operations have different usage patterns and require different system configurations (e.g., memory size and storage bandwidth). For partners invested in a long-term collaboration, the upstream organization can consider providing the downstream users access to its computational resources near the data storage – a solution adopted by the European Molecular Biology Laboratory European Bioinformatics Institute (EMBL-EBI) Embassy cloud [16]; in this manner, the most voluminous data never needs to be transferred from where it was generated. To help devise a rational archival policy, one should estimate the total cost of long-term data storage and compare it with the cost of regenerating the data, even considering resequencing if it is possible to store or obtain new samples.
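
The storage-versus-regeneration comparison described above can be illustrated with a minimal Python sketch. This is not a method taken from the article: the function names, the flat per-terabyte yearly price, and all figures in the usage note are hypothetical assumptions chosen only to make the arithmetic concrete. A real policy would also need to account for falling storage prices, replication, and whether samples remain available for resequencing.

```python
def storage_cost(size_tb, cost_per_tb_year, years):
    """Total cost of keeping `size_tb` terabytes for `years` years.

    Assumes a flat yearly price per terabyte (a simplifying assumption).
    """
    return size_tb * cost_per_tb_year * years


def cheaper_to_archive(size_tb, cost_per_tb_year, years, regeneration_cost):
    """Return True when long-term storage costs less than regenerating the data.

    `regeneration_cost` is a hypothetical single figure covering resequencing
    and reprocessing; it only makes sense if samples can be stored or obtained.
    """
    return storage_cost(size_tb, cost_per_tb_year, years) < regeneration_cost


if __name__ == "__main__":
    # Hypothetical figures: a 2 TB data set at 100 cost units per TB per year,
    # compared against 1500 cost units to regenerate the data.
    for years in (5, 10, 20):
        total = storage_cost(2, 100, years)
        decision = "archive" if cheaper_to_archive(2, 100, years, 1500) else "regenerate"
        print(f"{years} years: storage costs {total}, suggestion: {decision}")
```

With these made-up numbers, archiving is cheaper over 5 years, while over 10 or 20 years regeneration becomes the cheaper option. The retention period is therefore as important an input as the storage price itself.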

Discussion and outlook
Conclusions