Abstract

The Berlekamp–Massey algorithm finds the shortest linear feedback shift register for a binary input sequence. A wide range of applications like cryptography and digital signal processing use this algorithm. This research proposes novel parallel mechanisms offered by heterogeneous CPU and GPU hardwares in order to achieve the best possible performance for BMA. The proposed bitwise implementation of the BMA algorithm is almost 35 times faster than state of the art implementations. This further improvement is achieved by using SIMD instructions which provides data level parallelism. This new approach can be 4.6 and 35 times faster than a bitwise CPU and state of the art implementations, respectively. In order to achieve the highest possible speedup over a multi-core structure, a multi-threading implementation is introduced in this research. By leveraging on OpenMP we were able to obtain a speedup of 10 times over 12 cores server. The GPU device with thousands of processing cores can bring great speedup over the best CPU implementation. Two other parallel mechanisms offered by GPU are concurrent kernel execution and streaming. They achieve 14.5 and 2.2 times of speedup compared to CPU serial and typical CUDA implementations, respectively. Also, the performance of the openMP code with SIMD instructions is compared with GPU stream implementation. The effectiveness of the proposed method is evaluated in a real world error correction application and it achieves 6.8 times of speedup.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call