Launching large computing applications on a disk-less cluster

Rainer Schwemmer,Juan Manuel Caicedo Carvajal,Niko Neufeld

doi:10.1088/1742-6596/331/5/052032

Rainer Schwemmer, Juan Manuel Caicedo Carvajal + Show 1 more

Open Access

https://doi.org/10.1088/1742-6596/331/5/052032

Copy DOI

Abstract

The LHCb Event Filter Farm system is based on a cluster of the order of 1.500 disk-less Linux nodes. Each node runs one instance of the filtering application per core. The amount of cores in our current production environment is 8 per machine for the old cluster and 12 per machine on extension of the cluster.Each instance has to load about 1.000 shared libraries, weighting 200 MB from several directory locations from a central repository. The repository is currently hosted on a SAN and exported via NFS. The libraries are all available in the local file system cache on every node. Loading a library still causes a huge number of requests to the server though, because the loader will try to probe every available path. Measurements show there are between 100.000-200.000 calls per application instance start up. Multiplied by the numbers of cores in the farm, this translates into a veritable DDoS attack on the servers, which lasts several minutes. Since the application is being restarted frequently, a better solution had to be found.scpRolling out the software to the nodes is out of the question, because they have no disks and the software in it's entirety is too large to put into a ram disk. To solve this problem we developed a FUSE based file systems which acts as a permanent, controllable cache that keeps the essential files that are necessary in stock.

Full Text