Abstract

The CMS analysis computing model has always relied on jobs running near the data, with data allocation among CMS compute centers organized at the management level, based on the expected needs of the CMS community. While this model provided high CPU utilization during job run times, there were periods when a large fraction of CPUs at certain sites sat idle for lack of demand, even as terabytes of data were never accessed. To improve the utilization of both CPU and disk, CMS is moving toward controlled overflowing of jobs from sites that host the data but are oversubscribed to sites with spare CPU and network capacity, with those jobs accessing the data through real-time Xrootd streaming over the WAN. The major limiting factor for remote data access is the ability of the source storage system to serve the data, so the number of jobs accessing it must be carefully controlled. The CMS approach is to implement the overflow by means of glideinWMS, a Condor-based pilot system, providing the WMS with the known storage limits and letting it schedule jobs within those limits. This paper presents the detailed architecture of the overflow-enabled glideinWMS system, together with operational experience from the past six months.
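
The scheduling idea in the abstract can be made concrete with a short sketch. The code below is purely illustrative and not the glideinWMS implementation the paper describes: the site names, the per-source reader limits, and the greedy first-come policy are all assumptions made for the example.

from dataclasses import dataclass

@dataclass
class SourceSite:
    """A site whose storage serves data to remote overflow jobs."""
    name: str
    max_remote_readers: int         # assumed known serving limit of the storage system
    active_remote_readers: int = 0  # overflow jobs currently streaming from this site

    def has_capacity(self) -> bool:
        return self.active_remote_readers < self.max_remote_readers

@dataclass
class Job:
    job_id: str
    data_site: str  # site hosting this job's input data

def schedule_overflow(jobs, sources, spare_cpu_slots):
    """Greedily start overflow jobs on spare CPU elsewhere, but only while the
    source storage system serving each job's data stays within its read limit."""
    started, deferred = [], []
    for job in jobs:
        source = sources[job.data_site]
        if spare_cpu_slots > 0 and source.has_capacity():
            source.active_remote_readers += 1
            spare_cpu_slots -= 1
            started.append(job.job_id)
        else:
            deferred.append(job.job_id)  # wait for local CPU or for source capacity
    return started, deferred

if __name__ == "__main__":
    sources = {
        "T2_A": SourceSite("T2_A", max_remote_readers=2),
        "T2_B": SourceSite("T2_B", max_remote_readers=1),
    }
    queue = [Job("job0", "T2_A"), Job("job1", "T2_A"),
             Job("job2", "T2_A"), Job("job3", "T2_B")]
    print(schedule_overflow(queue, sources, spare_cpu_slots=10))
    # -> (['job0', 'job1', 'job3'], ['job2']): T2_A's limit defers job2

In the system the paper describes, enforcement of such limits sits inside the WMS itself; HTCondor, for instance, supports per-resource concurrency limits applied during matchmaking, rather than an external loop like the one above.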
