This paper describes a scalable hardware accelerator for speech recognition that uses a two-pass decoding algorithm with word-dependent N-best Viterbi beam search. The observation probability calculation (senone scoring) and the first pass of decoding, which uses a bigram language model, are implemented in hardware. The word lattice output by the first pass is used by software for the second pass, which applies a trigram language model. The proposed design uses a logic-on-memory approach that exploits high-bandwidth NOR flash memory to improve random-read performance for senone scoring and first-pass decoding, both of which are memory-intensive operations. The proposed HW/SW co-design achieves an overall speedup of 4.3X over a 2.4 GHz Intel Core 2 Duo processor running the CMU Sphinx speech recognition software, while consuming an estimated 1.72 W of power. The hardware accelerator improves speech recognition accuracy by supporting larger acoustic models and word dictionaries while maintaining real-time performance.
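
As an illustration of the beam pruning that underlies the first decoding pass, the following is a minimal software sketch of one frame of a generic Viterbi beam search. It is not the word-dependent N-best variant or the hardware design described here; the function name `viterbi_beam_step`, the dictionary-based data layout, and the log-domain beam width are assumptions made for this example only.

```python
import math


def viterbi_beam_step(active, transitions, obs_logprob, beam):
    """Advance a Viterbi beam search by one frame (illustrative sketch).

    active:      dict mapping state -> best log score so far
    transitions: dict mapping state -> list of (next_state, log transition prob)
    obs_logprob: dict mapping state -> log observation score for this frame
                 (the role played by senone scores)
    beam:        non-negative beam width in the log domain
    """
    candidates = {}
    for state, score in active.items():
        for nxt, log_a in transitions.get(state, []):
            new_score = score + log_a + obs_logprob[nxt]
            # Viterbi recursion: keep only the best-scoring predecessor.
            if new_score > candidates.get(nxt, -math.inf):
                candidates[nxt] = new_score

    if not candidates:
        return {}

    # Beam pruning: drop states scoring more than `beam` below the best one.
    best = max(candidates.values())
    return {s: v for s, v in candidates.items() if v >= best - beam}
```

Each active state must look up its transition and observation scores on every frame, and the set of active states changes from frame to frame, which is why the access pattern is dominated by random reads. A narrower beam keeps fewer states active and reduces that traffic at the risk of pruning the correct hypothesis.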