Reduced Memory Viterbi Decoding for Hardware-accelerated Speech Recognition

Abstract

Large Vocabulary Continuous Speech Recognition (LVCSR) systems require a Viterbi search through a large state space to find the most probable sequence of phonemes for a given sound sample. This requires storing and updating a large Active State List (ASL) in on-chip memory (OCM) at regular intervals (called frames), which poses a major performance bottleneck for speech decoding. Most prior works use hash tables for OCM storage together with beam-width pruning to restrict the ASL size; achieving decent accuracy and performance then incurs a large OCM, numerous acoustic probability computations, and frequent DRAM accesses. We propose using a binary search tree for ASL storage and a max heap to track the worst-cost state and efficiently replace it when a better state is found. With this approach, the ASL size can be reduced from over 32K to 512 with minimal impact on recognition accuracy for a 7,000-word vocabulary model. Combined with a caching technique for acoustic scores, this reduces the DRAM data accessed by 31× and the acoustic probability computations by 26×. The approach has also been implemented in hardware on a Xilinx Zynq FPGA at 200 MHz using the Vivado SDS compiler. We study the tradeoffs among the amount of OCM used, word error rate, and decoding speed to show the effectiveness of the approach. The resulting implementation runs faster than real time with 91% fewer block RAMs.
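The bounded-ASL idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's hardware implementation: a Python dict stands in for the binary search tree used for state lookup, and `heapq` (a min-heap over negated costs) emulates the max heap that tracks the worst-cost state so it can be replaced when a better candidate arrives. All class and method names here are hypothetical.

```python
import heapq

class BoundedActiveStateList:
    """Sketch: keep at most `capacity` active states, evicting the
    current worst-cost state when a better one arrives. heapq is a
    min-heap, so costs are negated to emulate a max heap over costs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []   # entries: (-cost, state_id); may contain stale entries
        self.cost = {}   # state_id -> best cost (stand-in for the BST lookup)

    def insert(self, state_id, cost):
        """Try to add a state; returns True if it is (still) active."""
        if state_id in self.cost:                 # already active: keep better cost
            if cost < self.cost[state_id]:
                self.cost[state_id] = cost
                heapq.heappush(self.heap, (-cost, state_id))
            return True
        if len(self.cost) < self.capacity:        # room left: just add it
            self.cost[state_id] = cost
            heapq.heappush(self.heap, (-cost, state_id))
            return True
        # ASL full: admit only if strictly better than the current worst state
        worst_cost, worst_id = self._worst()
        if cost >= worst_cost:
            return False
        del self.cost[worst_id]
        heapq.heappop(self.heap)                  # _worst() left the valid worst on top
        self.cost[state_id] = cost
        heapq.heappush(self.heap, (-cost, state_id))
        return True

    def _worst(self):
        """Return (cost, state_id) of the worst active state,
        lazily discarding stale heap entries left by updates/evictions."""
        while True:
            neg, sid = self.heap[0]
            if sid in self.cost and self.cost[sid] == -neg:
                return -neg, sid
            heapq.heappop(self.heap)
```

With the capacity fixed at, say, 512, each insertion stays O(log K) while the memory footprint is bounded regardless of how large the search frontier would otherwise grow, which is the property the abstract exploits to shrink the on-chip ASL storage.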

Similar Papers
  • Research Article
  • Citations: 74
  • 10.1109/tasl.2011.2155060
Exemplar-Based Sparse Representation Features: From TIMIT to LVCSR
  • Nov 1, 2011
  • IEEE Transactions on Audio, Speech, and Language Processing
  • Tara N Sainath + 4 more

The use of exemplar-based methods, such as support vector machines (SVMs), k-nearest neighbors (kNNs) and sparse representations (SRs), in speech recognition has thus far been limited. Exemplar-based techniques utilize information about individual training examples and are computationally expensive, making it particularly difficult to investigate these methods on large-vocabulary continuous speech recognition (LVCSR) tasks. While research in LVCSR provides a good testbed to tackle real-world speech recognition problems, research in this area suffers from two main drawbacks. First, the overall complexity of an LVCSR system makes error analysis quite difficult. Second, exploring new research ideas on LVCSR tasks involves training and testing state-of-the-art LVCSR systems, which can entail a long turnaround time. This makes a small vocabulary task such as TIMIT more appealing. TIMIT provides a phonetically rich and hand-labeled corpus that allows easy insight into new algorithms. However, research ideas explored for small vocabulary tasks do not always provide gains on LVCSR systems. In this paper, we combine the advantages of using both small and large vocabulary tasks by taking well-established techniques used in LVCSR systems and applying them on TIMIT to establish a new baseline. We then utilize these existing LVCSR techniques in creating a novel set of exemplar-based sparse representation (SR) features. Using these existing LVCSR techniques, we achieve a phonetic error rate (PER) of 19.4% on the TIMIT task. The additional use of SR features reduces the PER to 18.6%. We then explore applying the SR features to a large vocabulary Broadcast News task, where we achieve a 0.3% absolute reduction in word error rate (WER).

  • Book Chapter
  • Citations: 12
  • 10.1007/11939993_45
Improved Large Vocabulary Continuous Chinese Speech Recognition by Character-Based Consensus Networks
  • Jan 1, 2006
  • Yi-Sheng Fu + 2 more

Word-based consensus networks have been verified to be very useful in minimizing word error rate (WER) for large vocabulary continuous speech recognition in western languages. Considering the special structure of the Chinese language, this paper points out that character-based rather than word-based consensus networks should work better for Chinese. This is verified by extensive experimental results also reported in the paper.

  • Research Article
  • Citations: 9
  • 10.1007/s10772-019-09637-2
A usage of the syllable unit based on morphological statistics in Korean large vocabulary continuous speech recognition system
  • Sep 25, 2019
  • International Journal of Speech Technology
  • Hyok-Chol Ri

In large vocabulary continuous speech recognition (LVCSR), choosing a reasonable recognition unit is important for improving system performance. In Korean continuous speech recognition, a morph rather than a word is typically used as the recognition unit due to Korean's agglutinative nature, and good performance is achieved by combining high-frequency morph sequences, which leads to an increased vocabulary size and a high out-of-vocabulary (OOV) rate. Sub-lexical units such as syllables and graphones are widely used for inflectional languages, but they have not been introduced successfully for Korean speech recognition due to the weakness of their linguistic information. In this paper, we investigate the use of a syllable unit to resolve the mismatch between recognition unit and vocabulary size that occurs frequently in Korean large vocabulary speech recognition. We apply local segmentation into syllables based on morphological statistics and perform experiments using a language model (LM) constructed from mixed unit types: morphemes, combined morphemes, and syllables. The proposed model obtains an absolute reduction of around 0.4% in word error rate (WER) compared to a traditional LM consisting of morphemes and combined morphemes.

  • Research Article
  • Citations: 7
  • 10.1016/j.specom.2015.07.007
Regularized minimum variance distortionless response-based cepstral features for robust continuous speech recognition
  • Jul 29, 2015
  • Speech Communication
  • Md Jahangir Alam + 2 more


  • Research Article
  • Citations: 93
  • 10.1016/s0167-6393(02)00031-6
Korean large vocabulary continuous speech recognition with morpheme-based recognition units
  • Mar 4, 2002
  • Speech Communication
  • Oh-Wook Kwon + 1 more


  • Conference Article
  • Citations: 1
  • 10.2991/iccsee.2013.244
Discriminative Language Model With Part-of-speech for Mandarin Large Vocabulary Continuous Speech Recognition System
  • Jan 1, 2013
  • Yujing Si + 4 more

A statistical language model, trained on a large text corpus, is an integral component of many speech and natural language processing systems, such as speech recognition and machine translation. It is a probabilistic model that describes the distribution of natural language. Over the last few decades, the N-gram language model (LM) has been the most popular technique since it is simple and effective. However, N-gram language model training is based on the maximum likelihood rule, resulting in suboptimal output in speech recognition systems. In this paper, a discriminatively trained language model (DLM), which directly focuses on minimizing speech recognition word error rate (WER), is employed to improve the performance of the speech recognition system. In particular, the part-of-speech (POS) feature is used to train the DLM alongside the n-gram features. Experimental results show that the DLM with n-gram features gives a 1% absolute reduction in WER; combining n-gram features with the POS feature yields a further 0.4% absolute reduction in WER.

  • Conference Article
  • Citations: 15
  • 10.1109/asru.2003.1318472
Warping and scaling of the minimum variance distortionless response
  • Nov 30, 2003
  • M Wolfel + 2 more

Spectral estimation based on the minimum variance distortionless response (MVDR) is well-known in the signal processing literature and has been shown to be superior to linear prediction for robust speech recognition. In this work we propose two techniques to improve the resolution and the robustness of the MVDR spectral estimate: The first is a time-domain technique to estimate an all-pole model based on the warped short time frequency axis such as the Mel-frequency. The second is a method for scaling the height of the spectral envelope in order to extract robust features for large vocabulary continuous speech recognition systems which must operate in noisy conditions. Moreover, we show that these two techniques can be combined to good effect. In a series of speech recognition experiments on the Switchboard corpus, the combination of our proposed approaches achieved a word error rate (WER) of 35.9%, which is clearly superior to the 37.0% WER obtained by the common MVDR and the 37.2% WER obtained by the widely used Fourier transform.

  • Conference Article
  • 10.1109/cisp.2013.6743974
Improved lattice rescoring by using speech attributes in Large Vocabulary Continuous Speech Recognition systems
  • Dec 1, 2013
  • Xinglong Gao + 2 more

Acoustic modeling in Large Vocabulary Continuous Speech Recognition (LVCSR) systems, normally based on context-dependent phones, is heavily limited in how well the transcriptions represent the variation of the raw speech utterance. To describe this relationship more accurately, this paper presents an alternative strategy in which speech attributes are used to capture acoustic characteristics and improve LVCSR performance. A series of relevant experiments shows that speech attributes can serve as complementary knowledge resources that bring richer information than a basic phone-based system. Hence, speech attribute information is integrated into the phone-based LVCSR system during lattice rescoring. For both read and Conversational Telephone Speech (CTS) style LVCSR tasks, experimental results show that the combined system reduces Word Error Rate (WER) by about 3-5% relative.

  • Research Article
  • Citations: 3
  • 10.15388/informatica.2004.048
Specifics of Hidden Markov Model Modifications for Large Vocabulary Continuous Speech Recognition
  • Jan 1, 2004
  • Informatica
  • Darius Šilingas + 1 more

Specifics of hidden Markov model-based speech recognition are investigated. Influence of modeling simple and context-dependent phones, using simple Gaussian, two and three-component Gaussian mixture probability density functions for modeling feature distribution, and incorporating language model are discussed. Word recognition rates and model complexity criteria are used for evaluating suitability of these modifications for practical applications. Development of large vocabulary continuous speech recognition system using HTK toolkit and WSJCAM0 English speech corpus is described. Results of experimental investigations are presented.

  • Conference Article
  • Citations: 3
  • 10.1109/o-cocosda50338.2020.9295036
Enhancing Large Vocabulary Continuous Speech Recognition System for Urdu-English Conversational Code-Switched Speech
  • Nov 5, 2020
  • Muhammad Umar Farooq + 4 more

This paper presents a first step towards a Large Vocabulary Continuous Speech Recognition (LVCSR) system for Urdu-English code-switched conversational speech. Urdu is the national language and lingua franca of Pakistan, with 100 million speakers worldwide. English, on the other hand, is the official language of Pakistan and is commonly mixed with Urdu in daily communication. Urdu, being an under-resourced language, has no substantial Urdu-English code-switched corpus available for developing a speech recognition system. In this research, a readily available spontaneous Urdu speech corpus (25 hours) is revised and used to enhance a read-speech Urdu LVCSR system for recognizing code-switched speech. This data set is split into 20 hours of training and 5 hours of test data. 10 hours of Urdu BroadCast (BC) data are collected and annotated in a semi-supervised way to enhance the system further. For acoustic modeling, the state-of-the-art DNN-HMM technique is used without any prior GMM-HMM training and alignments. Various techniques to improve the language model using monolingual data are investigated. The overall percent Word Error Rate (WER) on the test set is reduced from 40.71% to 26.95%.

  • Research Article
  • Citations: 42
  • 10.1016/j.specom.2013.01.008
Using different acoustic, lexical and language modeling units for ASR of an under-resourced language – Amharic
  • Feb 14, 2013
  • Speech Communication
  • Martha Yifiru Tachbelie + 2 more


  • Conference Article
  • Citations: 1
  • 10.1109/icot.2013.6521203
Customizable cloud-healthcare dialogue system based on LVCSR with prosodic-contextual post-processing
  • Mar 1, 2013
  • Bo-Wei Chen + 5 more

This work presents a customized cloud-healthcare dialogue system design based on large vocabulary continuous speech recognition (LVCSR) with prosodic-contextual post-processing. The customized cloud-healthcare dialogue system includes two parts. The first part is the cloud dialogue management and strategy, which manages and provides services on demand. The second part is a web-based reminder and a customizable interface, which offer settings for reminder events and the customizable dialogue system. Moreover, for higher speech recognition accuracy, this work proposes a prosodic-contextual post-processing mechanism, which finds the best sentence among potential recognition results by using syllable segmentation, pitch analysis, and contextual analysis. In the experiment, five healthcare scenarios for the elderly are designed for evaluation. The analysis indicates that the average mean opinion score (MOS) can reach as high as 4.23. Additionally, the word error rate (WER) of LVCSR with the proposed prosodic-contextual post-processing is improved by 9.21%. These results show that the proposed system is suitable for the elderly in daily living and demonstrates the feasibility of the idea.

  • Conference Article
  • Citations: 2
  • 10.1109/asru.2017.8268923
Future vector enhanced LSTM language model for LVCSR
  • Dec 1, 2017
  • 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
  • Qi Liu + 2 more

Language models (LMs) play an important role in large vocabulary continuous speech recognition (LVCSR). However, traditional language models only predict the next single word given the history, while consecutive predictions over a sequence of words are usually demanded and useful in LVCSR. The mismatch between single-word prediction during training and the long-term sequence prediction demanded in practice may lead to performance degradation. In this paper, a novel enhanced long short-term memory (LSTM) LM using a future vector is proposed. In addition to the given history, the rest of the sequence is also embedded by future vectors. This future vector can be incorporated into the LSTM LM, giving it the ability to model much longer-term sequence-level information. Experiments show that the proposed LSTM LM obtains better BLEU scores for long-term sequence prediction. For speech recognition rescoring, although the proposed LSTM LM obtains only very slight gains on its own, it appears strongly complementary to the conventional LSTM LM: rescoring with both the new and conventional LSTM LMs achieves a very large improvement in word error rate.

  • Conference Article
  • Citations: 9
  • 10.1109/icsda.2011.6085990
Morpheme concatenation approach in language modeling for large-vocabulary Uyghur speech recognition
  • Oct 1, 2011
  • Mijit Ablimit + 2 more

For large-vocabulary continuous speech recognition (LVCSR) of highly-inflected languages, selection of an appropriate recognition unit is the first important step. The morpheme-based approach is often adopted because of its high coverage and linguistic properties. But morpheme units are short, often consisting of one or two phonemes, so they are more likely to be confused in ASR than word units. Word units generally provide better linguistic constraint, but increase the vocabulary size explosively, causing OOV (out-of-vocabulary) and data sparseness problems in language modeling. In this research, we investigate approaches to selecting word entries by concatenating morpheme sequences so as to reduce word error rate (WER). Specifically, we compare the ASR results of the word-based model with those of the morpheme-based model, and extract typical patterns that reduce the WER. This method has been successfully applied to an Uyghur LVCSR system, resulting in a significant reduction of WER without a drastic increase in vocabulary size.

  • Research Article
  • Citations: 2
  • 10.1186/s13636-017-0121-5
Classification-based spoken text selection for LVCSR language modeling
  • Oct 17, 2017
  • EURASIP Journal on Audio, Speech, and Music Processing
  • Vataya Chunwijitra + 1 more

Large vocabulary continuous speech recognition (LVCSR) has naturally been demanded for transcribing daily conversations, while developing spoken text data to train LVCSR is costly and time-consuming. In this paper, we propose a classification-based method to automatically select social media data for constructing a spoken-style language model in LVCSR. Three classification techniques, SVM, CRF, and LSTM, trained on words and parts-of-speech, are comparatively experimented with to identify the degree of spoken style in each social media sentence. Spoken-style utterances are chosen by incremental greedy selection based on the score of the SVM or the CRF classifier or the output classified as "spoken" by the LSTM classifier. With the proposed method, just 51.8, 91.6, and 79.9% of the utterances in a Twitter text collection are marked as spoken utterances by the SVM, CRF, and LSTM classifiers, respectively. A baseline language model is then improved by interpolating with the one trained on these selected utterances. The proposed model is evaluated on two Thai LVCSR tasks: social media conversations and a speech-to-speech translation application. Experimental results show that all three classification-based data selection methods clearly help reduce the overall spoken test set perplexities. Regarding the LVCSR word error rate (WER), they achieve 3.38, 3.44, and 3.39% WER reduction, respectively, over the baseline language model, and 1.07, 0.23, and 0.38% WER reduction, respectively, over the conventional perplexity-based text selection approach.
