Abstract

Two essential problems in computer algebra, namely polynomial factorization and polynomial greatest common divisor computation, can be solved efficiently thanks to multiple polynomial evaluations in two variables using modular arithmetic. In this article, we focus on the efficient computation of such polynomial evaluations on a single CPU core. We first show how to leverage SIMD (single instruction, multiple data) computing for modular arithmetic on AVX2 and AVX-512 units, using both intrinsics and OpenMP compiler directives. We then increase the operational intensity and exploit instruction-level parallelism in order to improve the compute efficiency of these polynomial evaluations. Altogether, this results in performance gains of up to about 5x on AVX2 and 10x on AVX-512.

Highlights

  • Computer algebra, also called symbolic computation, consists of developing algorithms and data structures for manipulating mathematical objects in an exact way

  • We show that the optimized AVX implementation of van der Hoeven et al. [14] can safely be used in our polynomial evaluation, and we propose the first implementation of this modular multiplication algorithm on AVX-512, as well as the corresponding FP-based modular addition

  • We first justify the choice of a modular multiplication algorithm suited to HPC and SIMD computing

Summary

Introduction

Computer algebra, also called symbolic computation, consists of developing algorithms and data structures for manipulating mathematical objects in an exact way. Computing modulo a 64-bit prime p makes it possible to use machine integers and native CPU operations instead of arbitrary-precision integers. Since partial modular polynomial evaluations are currently a performance bottleneck for polynomial factorization and gcd computations, we aim in this article to speed up their computation on modern CPUs. We focus here on one compute server since most symbolic computations are usually performed on personal workstations. We show how to significantly improve the performance of the modular polynomial evaluation by increasing the operational intensity via data reuse, and by filling the pipelines of the floating-point units. This is achieved through the introduction of multiple “dependent” and “independent” evaluations and loop unrolling.
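To make the idea of several independent evaluations more concrete, here is a minimal scalar sketch in C. It is not the article's implementation: it uses plain integer arithmetic instead of the FP-based SIMD modular arithmetic, it evaluates a univariate polynomial with Horner's rule, and it assumes p < 2^32 so that every intermediate value fits in a 64-bit word. The names `mul_add_mod`, `horner_mod`, and `horner_mod_x4` are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: (a * b + c) mod p.
   Assumes a, b, c < p < 2^32, so a * b + c cannot overflow 64 bits. */
static inline uint64_t mul_add_mod(uint64_t a, uint64_t b, uint64_t c,
                                   uint64_t p)
{
    return (a * b + c) % p;
}

/* Baseline: one Horner evaluation of coef[0] + coef[1]*x + ... mod p.
   Each iteration depends on the previous one, so the pipeline is
   mostly idle while a multiply-add-reduce step completes. */
static uint64_t horner_mod(const uint64_t *coef, int n, uint64_t x,
                           uint64_t p)
{
    assert(p < (1ULL << 32));
    uint64_t acc = 0;
    for (int i = n - 1; i >= 0; --i)
        acc = mul_add_mod(acc, x, coef[i], p);
    return acc;
}

/* Four evaluations of the same polynomial at four points, interleaved.
   The four accumulators carry no dependency on each other, which gives
   the CPU independent work to overlap, and each coefficient is loaded
   once and reused four times (more arithmetic per memory access). */
static void horner_mod_x4(const uint64_t *coef, int n, const uint64_t x[4],
                          uint64_t p, uint64_t out[4])
{
    assert(p < (1ULL << 32));
    uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    for (int i = n - 1; i >= 0; --i) {
        const uint64_t c = coef[i];
        a0 = mul_add_mod(a0, x[0], c, p);
        a1 = mul_add_mod(a1, x[1], c, p);
        a2 = mul_add_mod(a2, x[2], c, p);
        a3 = mul_add_mod(a3, x[3], c, p);
    }
    out[0] = a0;
    out[1] = a1;
    out[2] = a2;
    out[3] = a3;
}
```

For example, with the polynomial 3 + 2x + x^2 and p = 101, both functions give 38 at x = 5. The interleaved version only illustrates the dependency structure; any speedup depends on the compiler, the CPU, and the cost of the modular reduction.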

Presentation
Polynomial factorization
Polynomial gcd
The matrix method
Multi-core parallel evaluation
Selection of the modular arithmetic algorithm
SIMD programming paradigms
SIMD intrinsics and the AVX-512 version
Microbenchmarks
Integration in polynomial evaluation
Multiple dependent evaluations
Multiple independent evaluations
Loop unrolling
Performance results
Without extra memory requirements
Findings
Conclusion
