Abstract
Analyzing and storing data and results from next-generation sequencing (NGS) experiments is a challenging task, hampered by ever-increasing data volumes and frequent updates of analysis methods and tools. Storage and computation have grown beyond the capacity of personal computers, and there is a need for suitable e-infrastructures for processing. Here we describe UPPNEX, an implementation of such an infrastructure, tailored to the needs of data storage and analysis of NGS data in Sweden, serving various labs and multiple instruments from the major sequencing technology platforms. UPPNEX comprises resources for high-performance computing, large-scale and high-availability storage, an extensive bioinformatics software suite, up-to-date reference genomes and annotations, a support function with system and application experts, as well as a web portal and support ticket system. UPPNEX applications are numerous and diverse, and include whole-genome, de novo, and exome sequencing, targeted resequencing, SNP discovery, RNA-Seq, and methylation analysis. More than 300 projects utilize UPPNEX, including large undertakings such as the sequencing of the flycatcher and Norwegian spruce. We describe the strategic decisions made when investing in hardware, setting up maintenance and support, and allocating resources, and we illustrate major challenges such as managing data growth. We conclude by summarizing our experiences and observations with UPPNEX to date, providing insights into the successful and less successful decisions made.
Highlights
Analyzing and storing data and results from next-generation sequencing (NGS) experiments is a challenging task, hampered by ever-increasing data volumes and frequent updates of analysis methods and tools
In this paper we present a Swedish infrastructure, UPPMAX Cluster and Storage for Next-Generation Sequencing (UPPNEX), aimed at meeting these challenges by providing a high-performance cluster and storage system equipped with an actively maintained bioinformatics software suite, as well as application experts to assist with bioinformatics analysis
Measures to manage data growth on the parallel storage system at Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) have been the implementation of stricter policies for allowances, cleaning up of temporary data, compressing files in inefficient file formats like raw text, and an increased use of the SweStore national storage
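One of the storage measures listed above, compressing files kept as raw text, can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used at UPPMAX: the function name `compress_text_files` and the default list of suffixes are assumptions made for the example, and gzip is chosen only because it ships with the standard library.

```python
import gzip
import shutil
from pathlib import Path
from typing import List, Sequence


def compress_text_files(
    root: Path,
    suffixes: Sequence[str] = (".fastq", ".sam", ".txt"),  # assumed suffixes
) -> List[Path]:
    """Gzip every file under `root` whose suffix is in `suffixes`.

    Returns the paths of the compressed files that were written.
    """
    compressed = []
    # sorted() materializes the listing before any file is created or removed.
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        target = path.with_name(path.name + ".gz")
        with path.open("rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        # Remove the original only after the compressed copy is complete.
        path.unlink()
        compressed.append(target)
    return compressed
```

Files in already-compressed binary formats are left untouched because their suffixes are not in the list.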
Summary
Since its inauguration in 2009, UPPNEX has displayed a roughly linear increase in the number of projects, reaching 357 active projects in April 2013 (see Figure 2a). To access the UPPNEX system, users need to use a terminal, log in via Secure Shell (SSH), and use Linux command line tools to submit and monitor jobs, skills that are not common among biologists. This has required substantial effort from UPPNEX over the years to educate a large number of new users, many of whom had only used graphical operating systems such as Microsoft Windows.
There are characteristics similar to those of UPPNEX, such as the use of a resource management system, a central NFS-mounted file system, a variety of node sizes (in terms of RAM size), and a large selection of pre-installed software for NGS analysis. GenomeSpace is a rather new project, and it will be interesting to see how well this architecture and strategy will perform compared to the more traditional HPC-based approach taken by UPPNEX and other organizations.
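To give a sense of what submitting a job through a resource management system involves, the sketch below renders a batch script from a small job description. It assumes the resource manager is SLURM, which the text above does not state. The class `JobSpec`, the function `render_batch_script`, and the project identifier in the usage example are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobSpec:
    """Hypothetical description of one batch job."""
    project: str   # compute project (account) to charge
    job_name: str  # label shown in the queue listing
    cores: int     # number of tasks requested
    walltime: str  # time limit, e.g. "02:00:00"
    command: str   # shell command to run on the compute node


def render_batch_script(spec: JobSpec) -> str:
    """Return the text of a SLURM batch script for `spec`."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --account={spec.project}",
        f"#SBATCH --job-name={spec.job_name}",
        f"#SBATCH --ntasks={spec.cores}",
        f"#SBATCH --time={spec.walltime}",
        "",
        spec.command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "example_project" is a placeholder, not a real project identifier.
    demo = JobSpec("example_project", "align", 8, "02:00:00", "echo hello")
    print(render_batch_script(demo))
```

Under the same assumption, the rendered text would be saved to a file on the login node, submitted with `sbatch`, and monitored with `squeue`.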