Abstract
Two essential problems in computer algebra, namely polynomial factorization and polynomial greatest common divisor computation, can be efficiently solved thanks to multiple polynomial evaluations in two variables using modular arithmetic. In this article, we focus on the efficient computation of such polynomial evaluations on one single CPU core. We first show how to leverage SIMD (single instruction, multiple data) computing for modular arithmetic on AVX2 and AVX-512 units, using both intrinsics and OpenMP compiler directives. Then we increase the operational intensity and exploit instruction-level parallelism in order to improve the compute efficiency of these polynomial evaluations. Altogether, this yields performance gains of up to about 5x on AVX2 and 10x on AVX-512.
Highlights
Computer Algebra, also called symbolic computation, consists of developing algorithms and data structures for manipulating mathematical objects in an exact way
We show that the optimized AVX implementation of van der Hoeven et al. [14] can safely be used in our polynomial evaluation; we propose the first implementation of such a modular multiplication algorithm on AVX-512, as well as the corresponding FP-based modular addition
We justify the choice of a modular multiplication algorithm relevant for HPC and SIMD computing
Summary
Computer Algebra, also called symbolic computation, consists of developing algorithms and data structures for manipulating mathematical objects in an exact way. Computing modulo a 64-bit prime p makes it possible to use machine integers and native CPU operations, instead of arbitrary-precision integers. Since these partial modular polynomial evaluations are currently a performance bottleneck for polynomial factorizations and gcd computations, we aim in this article to speed up their computation on modern CPUs. We focus here on a single compute server, since most symbolic computations are usually performed on personal workstations. We show how to significantly improve the performance of the modular polynomial evaluation by increasing the operational intensity via data reuse, and by filling the pipelines of the floating-point units. This is achieved thanks to the introduction of multiple “dependent” and “independent” evaluations and loop unrolling.
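As a baseline for the kernels discussed above, the following hedged sketch shows polynomial evaluation modulo a word-size prime via Horner's rule, plus two "independent" evaluations interleaved to expose instruction-level parallelism. The function names are illustrative, and the scalar `%`-based reduction stands in for the paper's SIMD modular arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* Horner's rule modulo p, with coefficients and the point already
 * reduced mod p. unsigned __int128 (a GCC/Clang extension) keeps the
 * intermediate product exact before reduction. */
static uint64_t eval_mod(const uint64_t *c, int n, uint64_t x, uint64_t p) {
    uint64_t r = 0;
    for (int i = n - 1; i >= 0; --i)
        r = (uint64_t)(((unsigned __int128)r * x + c[i]) % p);
    return r;
}

/* Two independent evaluations interleaved: the two dependency chains
 * can proceed in parallel and help fill the arithmetic pipelines. */
static void eval2_mod(const uint64_t *c, int n, uint64_t x0, uint64_t x1,
                      uint64_t p, uint64_t *r0, uint64_t *r1) {
    uint64_t a = 0, b = 0;
    for (int i = n - 1; i >= 0; --i) {
        a = (uint64_t)(((unsigned __int128)a * x0 + c[i]) % p);
        b = (uint64_t)(((unsigned __int128)b * x1 + c[i]) % p);
    }
    *r0 = a;
    *r1 = b;
}
```

Each iteration of plain Horner depends on the previous one, so a single evaluation cannot keep the pipelines busy; interleaving several independent points (and unrolling the loop) is the technique the article uses to raise throughput.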