Pilot factory – a Condor-based system for scalable Pilot Job generation in the Panda WMS framework

Po-Hsiang Chiu, Maxim Potekhin

doi:10.1088/1742-6596/219/6/062041


Open Access

https://doi.org/10.1088/1742-6596/219/6/062041


Abstract

The Panda Workload Management System is designed around the concept of the Pilot Job, a "smart wrapper" for the payload executable that can probe the environment on the remote worker node before pulling down the payload from the server and executing it. This design allows for improved logging and monitoring capabilities as well as flexibility in workload management. In the Grid environment (such as the Open Science Grid), Panda Pilot Jobs are submitted to remote sites via mechanisms that ultimately rely on Condor-G. Our experience has shown that when a large number of Panda jobs are simultaneously routed to a particular remote site, the increased load that Pilot Job submission places on the head node of the cluster may limit overall scalability. We have developed a Condor-inspired solution to this problem that uses a schedd-based glidein, whose role is to redirect pilots to the native batch system. Once a glidein schedd is installed and running, it can be used in exactly the same way as a local schedd; from the user's perspective, Pilots submitted this way are therefore quite similar to jobs submitted to the local Condor pool.
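
The Pilot Job pattern described above (probe the worker node first, then pull and run the payload) can be illustrated with a minimal Python sketch. This is not Panda's pilot code: the names `probe_environment`, `run_pilot`, and `fetch_payload` are hypothetical, the checks (required tools, free scratch space) are assumed examples of what a pilot might verify, and `fetch_payload` stands in for whatever request the real pilot makes to the server.

```python
import shutil
import subprocess
import sys


def probe_environment(required_tools=(), min_free_bytes=0, workdir="."):
    """Return a list of problems found on the worker node (empty means usable).

    The specific checks are illustrative assumptions, not Panda's actual probes.
    """
    problems = []
    for tool in required_tools:
        if shutil.which(tool) is None:
            problems.append(f"missing tool: {tool}")
    if shutil.disk_usage(workdir).free < min_free_bytes:
        problems.append("insufficient scratch space")
    return problems


def run_pilot(fetch_payload, required_tools=(), min_free_bytes=0):
    """Probe first; only contact the server for a payload if the node is usable.

    fetch_payload is a hypothetical callable that returns the payload command
    as a list of strings, or None when the server has no work to hand out.
    """
    problems = probe_environment(required_tools, min_free_bytes)
    if problems:
        # The payload is never requested, so no real job is wasted on a bad node.
        return {"status": "rejected", "problems": problems}

    command = fetch_payload()
    if command is None:
        return {"status": "idle"}

    # Capture output so the wrapper can report it for logging and monitoring.
    result = subprocess.run(command, capture_output=True, text=True)
    return {
        "status": "finished",
        "returncode": result.returncode,
        "stdout": result.stdout,
    }


if __name__ == "__main__":
    # Stand-in payload: a trivial Python command instead of a real job.
    report = run_pilot(lambda: [sys.executable, "-c", "print('payload ran')"])
    print(report["status"], report.get("returncode"))
```

Because the payload is requested only after the probe succeeds, a misconfigured worker node costs one rejected pilot rather than a failed user job, which is one way to read the logging and flexibility benefits mentioned above.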

Full Text