Abstract

In this work, we benchmark the time and memory requirements of several Python routines to identify the optimal choice among the available tensor contraction operations. We scrutinize how to accelerate the bottleneck tensor operations of Pythonic coupled-cluster implementations in the Cholesky linear algebra domain, utilizing an NVIDIA Tesla V100S PCIe 32GB (rev 1a) graphics processing unit (GPU). The NVIDIA Compute Unified Device Architecture (CUDA) API is accessed through CuPy, an open-source Python library designed as a drop-in replacement for NumPy on GPUs. Due to the limitations of video memory, the GPU calculations must be performed batch-wise. Timing results for selected contractions involving large tensors are presented. The CuPy implementation leads to a speed-up of a factor of 10-16 for the bottleneck tensor contractions compared to computations on 36 central processing unit (CPU) cores. Finally, we compare example CCSD and pCCD-LCCSD calculations performed solely on CPUs to their CPU-GPU hybrid implementation, which yields a speed-up of a factor of 3-4 over the CPU-only variant.
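To illustrate the batch-wise strategy outlined above, the following is a minimal sketch (not the authors' implementation) of how a large tensor contraction can be evaluated with cupy.einsum, the GPU drop-in for numpy.einsum, slicing one free index into batches so each chunk fits in video memory. The function name `batched_contract`, the index labels, and the tensor shapes are illustrative assumptions.

```python
import numpy as np
import cupy as cp


def batched_contract(t2, eri, batch_size=64):
    """Contract t2[i,j,a,b] with eri[a,b,c,d] -> out[i,j,c,d],
    batching over the i index to limit GPU memory use.

    Hypothetical example; shapes and labels are for illustration only.
    """
    n_i = t2.shape[0]
    eri_gpu = cp.asarray(eri)  # keep the reused tensor resident on the GPU
    out = np.empty(t2.shape[:2] + eri.shape[2:], dtype=t2.dtype)
    for start in range(0, n_i, batch_size):
        stop = min(start + batch_size, n_i)
        chunk = cp.asarray(t2[start:stop])  # host -> device, one batch
        res = cp.einsum("ijab,abcd->ijcd", chunk, eri_gpu)
        out[start:stop] = cp.asnumpy(res)  # device -> host
    return out
```

Because CuPy mirrors the NumPy API, replacing `np` with `cp` in the contraction call is often the only change required; the batching loop is what keeps the working set below the 32 GB of device memory.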
