Graphics processors Unit (GPU) architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of general purpose applications compared to contemporary general- purpose processors (CPUs). However, there are several optimization techniques which are used to maximize the benefit of the GPU resources. This research exploits optimization techniques for CUDA enabled GPU architecture in order to achieve the best possible performance for Berlekamp-Massey Algorithm (BMA) as a case study. Berlekamp-Massey Algorithm (BMA) is one of the best solutions to find the shortest linear feedback shift register which is very important for several applications such as digital processing and cryptography. The experimental results show that the optimized BMA implementation is almost 160 × faster than non-bit CPU serial implementation, 7 × faster than bit serial implementation and 4 × faster than an initial parallel bit implementation.