Energy consumption optimization of the Total-FETI solver and BLAS routines by changing the CPU frequency

David Horak, Radim Sojka, Martin Beseda, Lubomir Riha, Jakub Kruzik

doi:10.1109/hpcsim.2016.7568453

Abstract

The energy consumption of supercomputers is one of the critical problems for the upcoming Exascale supercomputing era. Awareness of power and energy consumption is required on both the software and the hardware side. This poster evaluates the energy consumption of linear-system solvers based on the Total Finite Element Tearing and Interconnecting (TFETI) method [2], an established method for solving real-world engineering problems, as implemented in the PERMON toolbox [1], and the energy consumption of BLAS routines. The experiments focus on the effect of the CPU frequency. This work is performed in the scope of the READEX project (Runtime Exploitation of Application Dynamism for Energy-efficient eXascale computing) [6].

The measurements were performed on the Taurus system installed at TU Dresden, which is based on Intel Xeon E5-2680 processors (Intel Haswell micro-architecture). The system contains over 1400 nodes equipped with FPGA-based power instrumentation called HDEEM (High Definition Energy Efficiency Monitoring), which allows fine-grained and more accurate power and energy measurements. The measurements can be accessed through the HDEEM library, which lets developers take energy measurements before and after a region of interest.

We have evaluated the effect of the CPU frequency on the energy consumption of the TFETI solver for a synthetic 3D cube linear elasticity benchmark. On the dualized problem $MPFX = MPd$, we have evaluated the effect of frequency tuning on the energy consumption of the essential processing kernels of the TFETI method. TFETI has two main phases: preprocessing and solve. In preprocessing, the stiffness matrix $K$ must be regularized and factorized, and the matrices $G$ and $GG^T$ must be assembled, with $GG^T$ then factorized as well. Both operations are among the most time- and energy-consuming operations.
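
The before-and-after measurement pattern mentioned above can be sketched as follows. This is a minimal illustration, not the HDEEM library interface: `FakeEnergyCounter`, `energy_region`, and the joule values are hypothetical stand-ins, and a real setup would replace `read_joules` with a call into the actual measurement library.

```python
from contextlib import contextmanager


class FakeEnergyCounter:
    """Hypothetical stand-in for a node-level energy counter (NOT the HDEEM API)."""

    def __init__(self):
        self.joules = 0.0

    def consume(self, joules):
        # Simulates work that draws the given amount of energy.
        self.joules += joules

    def read_joules(self):
        # Returns the cumulative energy consumed so far.
        return self.joules


@contextmanager
def energy_region(name, read_joules, results):
    """Read the counter before and after a region and store the difference."""
    start = read_joules()
    try:
        yield
    finally:
        results[name] = results.get(name, 0.0) + (read_joules() - start)


counter = FakeEnergyCounter()
results = {}

# Two regions of interest named after the TFETI phases; the numbers are invented.
with energy_region("preprocessing", counter.read_joules, results):
    counter.consume(120.0)

with energy_region("solve", counter.read_joules, results):
    counter.consume(80.0)

print(results)
```

Wrapping each region in a context manager keeps the start and end readings paired even if the region raises an exception, which is one reasonable way to attribute energy to individual kernels.
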
The solve phase employs the Preconditioned Conjugate Gradient (PCG) algorithm, which consists of sparse matrix-vector multiplications (by the $F$, $P$, $M_L$, and $M_D$ matrices), vector dot products, and AXPY operations. In each iteration, the direct solver must be applied twice, performing forward and backward solves: once for the action of the pseudoinverse $K^+$ and once for the coarse problem solution, i.e., the action of $(GG^T)^{-1}$. The multiplication by the dense Schur complement matrix adds an additional operator with different computational characteristics, potentially increasing the exploitable dynamism.

The poster provides results for two types of frequency tuning: (1) static tuning and (2) dynamic tuning. For static tuning, the frequency is set before execution and kept constant during the runtime. For dynamic tuning, the frequency is changed during program execution to adapt the system to the actual needs of the application. The poster shows that static tuning brings up to 11.84% energy savings compared to the default CPU settings (the highest clock rate). Dynamic tuning improves this by up to a further 2.68%. In total, the approach presented in this paper shows the potential to save up to 14.52% of energy for TFETI-based solvers; see Table 1.

Further energy consumption evaluations were performed with selected sparse and dense BLAS Level 1, 2, and 3 routines. For benchmarking, we used a set of matrices from the University of Florida collection [4]. We employed the AXPY, Sparse Matrix-Vector, Sparse Matrix-Matrix, Dense Matrix-Vector, Dense Matrix-Matrix, and Sparse Matrix-Dense Matrix multiplication routines from the Intel Math Kernel Library (MKL) [3]. The measured characteristics illustrate the different energy consumption of the BLAS routines, as some operations are memory-bound and others are compute-bound. Based on our recommendations, one can explore dynamic frequency switching to achieve significant energy savings of up to 23%; for more details, see Table 2.
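
The difference between static and dynamic tuning described above can be illustrated with a small sketch. The region names, frequencies, and energy values below are invented for illustration and are not the measurements from Table 1 or Table 2. The sketch also assumes that switching frequency between regions costs nothing, which a real system would not guarantee.

```python
# Hypothetical energy in joules per region and CPU frequency in GHz.
# These values are made up; they only mimic the idea that memory-bound and
# compute-bound regions may prefer different frequencies.
ENERGY = {
    "factorization": {1.2: 50.0, 1.8: 44.0, 2.5: 48.0},
    "spmv":          {1.2: 20.0, 1.8: 23.0, 2.5: 30.0},
    "coarse_solve":  {1.2: 15.0, 1.8: 13.0, 2.5: 14.0},
}


def total_energy(energy, freq):
    """Energy of the whole run when every region uses the same frequency."""
    return sum(table[freq] for table in energy.values())


def static_tuning(energy):
    """Pick the single frequency that minimizes the total energy."""
    freqs = next(iter(energy.values())).keys()
    best = min(freqs, key=lambda f: total_energy(energy, f))
    return best, total_energy(energy, best)


def dynamic_tuning(energy):
    """Pick the best frequency for each region independently."""
    plan = {name: min(table, key=table.get) for name, table in energy.items()}
    total = sum(energy[name][freq] for name, freq in plan.items())
    return plan, total


default_freq = 2.5  # the highest clock rate plays the role of the default setting
baseline = total_energy(ENERGY, default_freq)
static_freq, static_total = static_tuning(ENERGY)
dynamic_plan, dynamic_total = dynamic_tuning(ENERGY)

print(f"default {default_freq} GHz: {baseline} J")
print(f"static  {static_freq} GHz: {static_total} J")
print(f"dynamic {dynamic_plan}: {dynamic_total} J")
```

Because the dynamic plan is free to choose the static frequency in every region, its total can never exceed the static one under the zero-overhead assumption; any real gain depends on how much the regions differ and on the cost of switching.
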

Full Text