A parallel and fault tolerant file system based on NFS servers

F Garcia,J Fernandez,A Calderon,J Carretero,J.M Perez

doi:10.1109/empdp.2003.1183570

Abstract

One important piece of system software for clusters is the parallel file system. All current parallel file systems and parallel I/O libraries for clusters do not use standard servers, thus it is very difficult to use these systems in heterogeneous environments. However why use proprietary or special-purpose servers on the server end of a parallel file system when you have most of the necessary functionality in NFS servers already? This paper describes the fault tolerance implemented in Expand (Expandable Parallel File System), a parallel file system based on NFS servers. Expand allows the transparent use of multiple NFS servers as a single file system, providing a single name space. The different NFS servers are combined to create a distributed partition where files are stripped. Expand requires no changes to the NFS server and uses RPC operations to provide parallel access to the same file. Expand is also independent of the clients, because all operations are implemented using RPC and NFS protocol. Using this system, we can join heterogeneous servers (Linux, Solaris, Windows 2000, etc.) to provide a parallel and distributed partition. Fault tolerance is achieved using RAID techniques applied to parallel files. The paper describes the design of Expand and the evaluation of a prototype of Expand, using the MPI-IO interface. This evaluation has been made in Linux clusters and compares Expand with PVFS.

Full Text