Abstract

The CMS experiment has an HTCondor Global Pool, composed of more than 200K CPU cores available for Monte Carlo production and the analysis of data. The submission of user jobs to this pool is handled either by CRAB, the standard workflow management tool used by CMS users to submit analysis jobs requiring event processing of large amounts of data, or by CMS Connect, a service focused on final-stage Condor-like analysis jobs and applications that already have a workflow job manager in place. The latter scenario can bring cases in which workflows need further adjustments in order to work efficiently in a globally distributed pool of resources. For instance, the generation of matrix elements for high energy physics processes via Madgraph5_aMC@NLO and the usage of tools not (yet) fully supported by the CMS software, such as TensorFlow with GPU support, are tasks with particular requirements. A special adaptation, either at the pool factory level (advertising GPU resources) or at the execute level (e.g., handling special parameters that describe certain needs of the remote execute nodes during submission), is needed in order to work adequately in the CMS Global Pool. This contribution describes the challenges and the efforts made towards adapting such workflows so they can properly profit from the Global Pool via CMS Connect.

Highlights

  • The CMS experiment has an HTCondor Global Pool, composed of more than 200K CPU cores available for Monte Carlo production and the analysis of data

  • While submission of CMS [1] user jobs to the Global Pool [2] is mostly managed by CRAB [3], the standard analysis workflow management tool, the generation of matrix elements for high energy physics processes via Madgraph5_aMC@NLO [4] and the usage of machine learning tools with GPU resources are independent use-cases that require special adaptation in order to take advantage of the Global Pool resources

  • CMS Connect [5] provides a service where users can submit HTCondor jobs to the CMS Global Pool with a submission interface similar to those provided by analysis facilities physicists are familiar with, such as the CERN Analysis Facility [6]

Summary

The submission system

While the submission of CMS [1] user jobs to the Global Pool [2] is mostly managed by CRAB [3], the standard analysis workflow management tool, the generation of matrix elements for high energy physics processes via Madgraph5_aMC@NLO [4] and the usage of machine learning tools with GPU resources are independent use cases that require special adaptation in order to take advantage of the Global Pool resources. CMS Connect [5] provides a service where users can submit HTCondor jobs to the CMS Global Pool (a global HTCondor pool provisioned by GlideinWMS) with a submission interface similar to those provided by analysis facilities physicists are familiar with, such as the CERN Analysis Facility [6]. This service complements CRAB, as illustrated, by dealing with a different set of analysis workflows, such as Madgraph gridpacks and the use of GPU resources with TensorFlow [7] jobs. The name of each gridpack is stored as an HTCondor ClassAd that is later used on the monitoring side to make this classification.
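The two adaptations described above can be sketched in a single HTCondor submit description: a custom ClassAd carrying the gridpack name for monitoring, and a GPU request so the negotiator only matches execute nodes advertising GPU resources. This is a minimal sketch, not the actual CMS Connect configuration; the executable name, gridpack label, and the `+GridpackName` attribute in particular are illustrative assumptions, with `request_gpus` being standard HTCondor syntax.

```
# Sketch of a submit description file for a gridpack job via CMS Connect.
# Names below (run_gridpack.sh, +GridpackName, the gridpack label) are
# hypothetical; only the HTCondor keywords themselves are standard.
universe       = vanilla
executable     = run_gridpack.sh
request_cpus   = 8
request_memory = 4 GB

# Custom ClassAd: records which gridpack this job generates, so the
# monitoring side can classify jobs by gridpack name.
+GridpackName  = "ttbar_0j_NLO"

# For TensorFlow-style workloads, request a GPU so the job is matched
# only to execute nodes that advertise GPU resources.
request_gpus   = 1

output = job_$(Cluster)_$(Process).out
error  = job_$(Cluster)_$(Process).err
log    = job_$(Cluster).log
queue 1
```

Submitting with `condor_submit` would then make `GridpackName` queryable, e.g. via `condor_q -af GridpackName`, which is the kind of attribute a monitoring dashboard can aggregate on.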

Deep learning and GPU resources
Using TensorFlow and GPU resources in the Global Pool
Conclusions