Virtual Grid Engine: a simulated grid engine environment for large-scale supercomputers

Satoshi Ito, Tatsuo Nishiki, Rui Yamaguchi, Satoru Miyano, Masaaki Yadome, Hikaru Inoue, Shigeru Ishiduki

doi:10.1186/s12859-019-3085-x

Abstract

Background: Supercomputers have become indispensable infrastructure in science and industry. In particular, most state-of-the-art scientific results utilize massively parallel supercomputers ranked in TOP500. However, their use is still limited in the bioinformatics field because the asynchronous parallel processing service of Grid Engine is not provided on them. To encourage the use of massively parallel supercomputers in bioinformatics, we developed middleware called Virtual Grid Engine (VGE), which enables software pipelines to automatically perform their tasks as MPI programs.

Results: We conducted basic tests to measure the time VGE requires to assign jobs to workers. The results showed that the overhead of the employed algorithm was 246 microseconds and that our software can manage thousands of jobs smoothly on the K computer. We also ran a practical test in the bioinformatics field. This test included two tasks: splitting the input FASTQ data and aligning it with BWA. The calculation used 25,055 nodes (200,440 cores) and finished in three hours.

Conclusion: We identified four important requirements for this kind of software: a non-privileged server program, multiple job handling, dependency control, and usability. We carefully designed the software against these requirements and checked each of them. VGE fulfilled all four and achieved good performance in a large-scale analysis.

Highlights

The use of supercomputers in bioinformatics has become common with the unprecedented increase in the amount of biomedical data, e.g., DNA sequence data, and the demand for complex data analysis using multiple software tools
We developed Message Passing Interface (MPI)-based middleware named Virtual Grid Engine (VGE) that enables software pipelines based on the Grid Engine (GE) system to run on massively parallel supercomputers
The basic performance of VGE depends on the overhead time for assigning jobs

Summary

Introduction

The use of supercomputers in bioinformatics has become common with the unprecedented increase in the amount of biomedical data, e.g., DNA sequence data, and the demand for complex data analysis using multiple software tools. Most state-of-the-art scientific results utilize massively parallel supercomputers ranked in TOP500, yet only a few bioinformatics studies have used them [3,4,5,6]. Their use is still limited in the bioinformatics field because the asynchronous parallel processing service of Grid Engine is not provided on them. To encourage the use of massively parallel supercomputers in bioinformatics, we developed middleware called Virtual Grid Engine (VGE), which enables software pipelines to automatically perform their tasks as MPI programs.
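To make the idea of asynchronous job handling with dependency control more concrete, the following is a minimal sketch in Python. It is not VGE's implementation and does not use MPI: a thread pool stands in for the worker processes, and the function name `run_pipeline`, the job names, and the `deps` mapping are all hypothetical, chosen only for illustration. The example mirrors the split-then-align structure of the practical test, with plain Python callables in place of the real tools.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def run_pipeline(jobs, deps, num_workers=4):
    """Run callables in `jobs` (name -> callable), honoring `deps`
    (name -> set of prerequisite job names). Returns completion order.

    Illustrative stand-in only: threads play the role of workers.
    """
    # Prerequisites still outstanding for each job not yet dispatched.
    remaining = {name: set(deps.get(name, ())) for name in jobs}
    running = {}          # future -> job name
    finished_order = []

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while remaining or running:
            # Dispatch every job whose prerequisites have all finished.
            ready = [name for name, pre in remaining.items() if not pre]
            for name in ready:
                del remaining[name]
                running[pool.submit(jobs[name])] = name

            if not running:
                # Nothing is running and nothing can start.
                raise ValueError("dependency cycle or unknown prerequisite")

            # Block until at least one running job completes.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for fut in done:
                name = running.pop(fut)
                fut.result()  # re-raise any exception from the job
                finished_order.append(name)
                # Release jobs that were waiting on this one.
                for pre in remaining.values():
                    pre.discard(name)

    return finished_order


# Usage: one "split" job followed by two "align" jobs that depend on it.
log = []
jobs = {
    "split": lambda: log.append("split"),
    "align_0": lambda: log.append("align_0"),
    "align_1": lambda: log.append("align_1"),
}
deps = {"align_0": {"split"}, "align_1": {"split"}}
order = run_pipeline(jobs, deps, num_workers=2)
```

In this sketch the two alignment jobs may finish in either order, but neither is dispatched before the split job has completed. A real MPI-based design would replace the thread pool with message exchange between a master rank and worker ranks, which is where a per-job assignment overhead would arise.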

Objectives

Results

Discussion

Conclusion