The article presents algorithms for parallel and vectorized full-word addition of big unsigned integers with carry propagation. Because of the propagation, software parallelization and vectorization of non-polynomial addition of big integers have long been considered impractical due to data dependencies between digits of the operands. The presented algorithms are based upon parallel and vectorized detection of carry origins within elements of vector operands, masking bits which correspond to those elements and subsequent scalar addition of the resulting integers. The acquired bits can consequently be taken into account to adjust the sum using the proposed generalization of the Kogge-Stone method. Essentially, the article formalizes and experimentally verifies parallel and vectorized implementation of carry-lookahead adders applied at arbitrary granularity of data. This approach is noticeably beneficial for manycore, CUDA and vectorized implementation using AVX-512 with masked instructions. Experiments show that the parallel and vectorized implementations of the proposed algorithms can be multiple times faster compared to a sequential ripple-carry adder or adders based on redundant number systems such as one used in the GNU Multiple Precision library.
Read full abstract