Reduced Memory Viterbi Decoding for Hardware-accelerated Speech Recognition

Abstract

Large Vocabulary Continuous Speech Recognition (LVCSR) systems require a Viterbi search through a large state space to find the most probable sequence of phonemes for a given sound sample. This requires storing and updating a large Active State List (ASL) in on-chip memory (OCM) at regular intervals (called frames), which poses a major performance bottleneck for speech decoding. Most prior works use hash tables for OCM storage together with beam-width pruning to restrict the ASL size; achieving decent accuracy and performance then incurs a large OCM, numerous acoustic probability computations, and frequent DRAM accesses. We propose using a binary search tree for ASL storage and a max heap to track the worst-cost state and efficiently replace it when a better state is found. With this approach, the ASL size can be reduced from over 32K to 512 with minimal impact on recognition accuracy for a 7,000-word vocabulary model. Combined with a caching technique for acoustic scores, this reduces the DRAM data accessed by 31× and the acoustic probability computations by 26×. The approach has also been implemented in hardware on a Xilinx Zynq FPGA at 200 MHz using the Vivado SDS compiler. We study the tradeoffs among the amount of OCM used, word error rate, and decoding speed to show the effectiveness of the approach. The resulting implementation runs faster than real time with 91% fewer block RAMs.
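The bounded-ASL idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's hardware implementation: a Python dict stands in for the binary search tree used for state lookup, and `heapq` (a min-heap over negated costs) emulates the max heap that tracks the worst-cost state so it can be replaced when a better candidate arrives. All class and method names here are hypothetical.

```python
import heapq

class BoundedActiveStateList:
    """Sketch: keep at most `capacity` active states, evicting the
    current worst-cost state when a better one arrives. heapq is a
    min-heap, so costs are negated to emulate a max heap over costs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []   # entries: (-cost, state_id); may contain stale entries
        self.cost = {}   # state_id -> best cost (stand-in for the BST lookup)

    def insert(self, state_id, cost):
        """Try to add a state; returns True if it is (still) active."""
        if state_id in self.cost:                 # already active: keep better cost
            if cost < self.cost[state_id]:
                self.cost[state_id] = cost
                heapq.heappush(self.heap, (-cost, state_id))
            return True
        if len(self.cost) < self.capacity:        # room left: just add it
            self.cost[state_id] = cost
            heapq.heappush(self.heap, (-cost, state_id))
            return True
        # ASL full: admit only if strictly better than the current worst state
        worst_cost, worst_id = self._worst()
        if cost >= worst_cost:
            return False
        del self.cost[worst_id]
        heapq.heappop(self.heap)                  # _worst() left the valid worst on top
        self.cost[state_id] = cost
        heapq.heappush(self.heap, (-cost, state_id))
        return True

    def _worst(self):
        """Return (cost, state_id) of the worst active state,
        lazily discarding stale heap entries left by updates/evictions."""
        while True:
            neg, sid = self.heap[0]
            if sid in self.cost and self.cost[sid] == -neg:
                return -neg, sid
            heapq.heappop(self.heap)
```

With the capacity fixed at, say, 512, each insertion stays O(log K) while the memory footprint is bounded regardless of how large the search frontier would otherwise grow, which is the property the abstract exploits to shrink the on-chip ASL storage.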

Similar Papers
  • Research Article
  • Citations: 74
  • 10.1109/tasl.2011.2155060
Exemplar-Based Sparse Representation Features: From TIMIT to LVCSR
  • Nov 1, 2011
  • IEEE Transactions on Audio, Speech, and Language Processing
  • Tara N Sainath + 4 more

The use of exemplar-based methods, such as support vector machines (SVMs), k-nearest neighbors (kNNs) and sparse representations (SRs), in speech recognition has thus far been limited. Exemplar-based techniques utilize information about individual training examples and are computationally expensive, making it particularly difficult to investigate these methods on large-vocabulary continuous speech recognition (LVCSR) tasks. While research in LVCSR provides a good testbed to tackle real-world speech recognition problems, research in this area suffers from two main drawbacks. First, the overall complexity of an LVCSR system makes error analysis quite difficult. Second, exploring new research ideas on LVCSR tasks involves training and testing state-of-the-art LVCSR systems, which can entail a long turnaround time. This makes a small vocabulary task such as TIMIT more appealing. TIMIT provides a phonetically rich and hand-labeled corpus that allows easy insight into new algorithms. However, research ideas explored for small vocabulary tasks do not always provide gains on LVCSR systems. In this paper, we combine the advantages of using both small and large vocabulary tasks by taking well-established techniques used in LVCSR systems and applying them on TIMIT to establish a new baseline. We then utilize these existing LVCSR techniques in creating a novel set of exemplar-based sparse representation (SR) features. Using these existing LVCSR techniques, we achieve a phonetic error rate (PER) of 19.4% on the TIMIT task. The additional use of SR features reduces the PER to 18.6%. We then explore applying the SR features to a large vocabulary Broadcast News task, where we achieve a 0.3% absolute reduction in word error rate (WER).

  • Book Chapter
  • Citations: 12
  • 10.1007/11939993_45
Improved Large Vocabulary Continuous Chinese Speech Recognition by Character-Based Consensus Networks
  • Jan 1, 2006
  • Yi-Sheng Fu + 2 more

Word-based consensus networks have been verified to be very useful in minimizing word error rate (WER) for large vocabulary continuous speech recognition in western languages. Considering the special structure of the Chinese language, this paper points out that character-based rather than word-based consensus networks should work better for Chinese. This is verified by extensive experimental results also reported in the paper.

  • Research Article
  • Citations: 9
  • 10.1007/s10772-019-09637-2
A usage of the syllable unit based on morphological statistics in Korean large vocabulary continuous speech recognition system
  • Sep 25, 2019
  • International Journal of Speech Technology
  • Hyok-Chol Ri

In large vocabulary continuous speech recognition (LVCSR), choosing a reasonable recognition unit is important for improving system performance. In Korean continuous speech recognition, a morph rather than a word is typically used as the recognition unit due to Korean's agglutinative nature, and good performance is achieved by combining high-frequency morph sequences, which leads to an increased vocabulary size and a high out-of-vocabulary (OOV) rate. Sub-lexical units such as syllables and graphones are widely used for inflectional languages, but they have not been introduced successfully for Korean speech recognition due to the weakness of their linguistic information. In this paper, we investigate the use of a syllable unit to resolve the mismatch between recognition unit and vocabulary size that occurs frequently in Korean large vocabulary speech recognition. We apply local segmentation into syllables based on morphological statistics and perform experiments using a language model (LM) constructed from mixed unit types: morphemes, combined morphemes, and syllables. The proposed model obtains an absolute reduction of around 0.4% in word error rate (WER) compared to a traditional LM consisting of morphemes and combined morphemes.

  • Research Article
  • Citations: 7
  • 10.1016/j.specom.2015.07.007
Regularized minimum variance distortionless response-based cepstral features for robust continuous speech recognition
  • Jul 29, 2015
  • Speech Communication
  • Md Jahangir Alam + 2 more


  • Research Article
  • Citations: 93
  • 10.1016/s0167-6393(02)00031-6
Korean large vocabulary continuous speech recognition with morpheme-based recognition units
  • Mar 4, 2002
  • Speech Communication
  • Oh-Wook Kwon + 1 more


  • Conference Article
  • Citations: 1
  • 10.2991/iccsee.2013.244
Discriminative Language Model With Part-of-speech for Mandarin Large Vocabulary Continuous Speech Recognition System
  • Jan 1, 2013
  • Yujing Si + 4 more

A statistical language model, trained on a large text corpus, is an integral component of many speech and natural language processing systems, such as speech recognition and machine translation. It is a probabilistic model that describes the distribution of natural language. Over the last few decades, the N-gram language model (LM) has been the most popular technique since it is simple and effective. However, N-gram language model training is based on the maximum likelihood rule, resulting in suboptimal output in speech recognition systems. In this paper, a discriminatively trained language model (DLM), which directly focuses on minimizing speech recognition word error rate (WER), is employed to improve the performance of the speech recognition system. In particular, the part-of-speech (POS) feature is used to train the DLM alongside the n-gram features. Experimental results show that the DLM with n-gram features gives a 1% absolute reduction in WER; combining n-gram features with the POS feature yields a further 0.4% absolute reduction in WER.

  • Conference Article
  • Citations: 15
  • 10.1109/asru.2003.1318472
Warping and scaling of the minimum variance distortionless response
  • Nov 30, 2003
  • M Wolfel + 2 more

Spectral estimation based on the minimum variance distortionless response (MVDR) is well-known in the signal processing literature and has been shown to be superior to linear prediction for robust speech recognition. In this work we propose two techniques to improve the resolution and the robustness of the MVDR spectral estimate: The first is a time-domain technique to estimate an all-pole model based on the warped short time frequency axis such as the Mel-frequency. The second is a method for scaling the height of the spectral envelope in order to extract robust features for large vocabulary continuous speech recognition systems which must operate in noisy conditions. Moreover, we show that these two techniques can be combined to good effect. In a series of speech recognition experiments on the Switchboard corpus, the combination of our proposed approaches achieved a word error rate (WER) of 35.9%, which is clearly superior to the 37.0% WER obtained by the common MVDR and the 37.2% WER obtained by the widely used Fourier transform.

  • Conference Article
  • 10.1109/cisp.2013.6743974
Improved lattice rescoring by using speech attributes in Large Vocabulary Continuous Speech Recognition systems
  • Dec 1, 2013
  • Xinglong Gao + 2 more

Acoustic modeling in Large Vocabulary Continuous Speech Recognition (LVCSR) systems, normally based on context-dependent phones, is heavily limited in how well the transcriptions represent the variation of the raw speech utterance. To describe this relationship more accurately, this paper presents an alternative strategy in which speech attributes are used to capture acoustic characteristics and improve LVCSR performance. A series of relevant experiments shows that speech attributes can serve as complementary knowledge resources that bring richer information than a basic phone-based system. Hence, speech attribute information is integrated into the phone-based LVCSR system during lattice rescoring. For both read and Conversational Telephone Speech (CTS) style LVCSR tasks, experimental results show that the combined system reduces Word Error Rate (WER) by about 3-5% relative.

  • Research Article
  • Citations: 3
  • 10.15388/informatica.2004.048
Specifics of Hidden Markov Model Modifications for Large Vocabulary Continuous Speech Recognition
  • Jan 1, 2004
  • Informatica
  • Darius Šilingas + 1 more

Specifics of hidden Markov model-based speech recognition are investigated. Influence of modeling simple and context-dependent phones, using simple Gaussian, two and three-component Gaussian mixture probability density functions for modeling feature distribution, and incorporating language model are discussed. Word recognition rates and model complexity criteria are used for evaluating suitability of these modifications for practical applications. Development of large vocabulary continuous speech recognition system using HTK toolkit and WSJCAM0 English speech corpus is described. Results of experimental investigations are presented.

  • Conference Article
  • Citations: 3
  • 10.1109/o-cocosda50338.2020.9295036
Enhancing Large Vocabulary Continuous Speech Recognition System for Urdu-English Conversational Code-Switched Speech
  • Nov 5, 2020
  • Muhammad Umar Farooq + 4 more

This paper presents a first step towards a Large Vocabulary Continuous Speech Recognition (LVCSR) system for Urdu-English code-switched conversational speech. Urdu is the national language and lingua franca of Pakistan, with 100 million speakers worldwide. English, on the other hand, is the official language of Pakistan and is commonly mixed with Urdu in daily communication. Urdu, being an under-resourced language, has no substantial Urdu-English code-switched corpus available for developing a speech recognition system. In this research, a readily available spontaneous Urdu speech corpus (25 hours) is revised and used to enhance a read-speech Urdu LVCSR system for recognizing code-switched speech. This data set is split into 20 hours of training and 5 hours of test data. 10 hours of Urdu BroadCast (BC) data are collected and annotated in a semi-supervised way to enhance the system further. For acoustic modeling, the state-of-the-art DNN-HMM technique is used without any prior GMM-HMM training and alignments. Various techniques to improve the language model using monolingual data are investigated. The overall percent Word Error Rate (WER) on the test set is reduced from 40.71% to 26.95%.

  • Research Article
  • Citations: 42
  • 10.1016/j.specom.2013.01.008
Using different acoustic, lexical and language modeling units for ASR of an under-resourced language – Amharic
  • Feb 14, 2013
  • Speech Communication
  • Martha Yifiru Tachbelie + 2 more


  • Conference Article
  • Citations: 1
  • 10.1109/icot.2013.6521203
Customizable cloud-healthcare dialogue system based on LVCSR with prosodic-contextual post-processing
  • Mar 1, 2013
  • Bo-Wei Chen + 5 more

This work presents a customized cloud-healthcare dialogue system design based on large vocabulary continuous speech recognition (LVCSR) with prosodic-contextual post-processing. The customized cloud-healthcare dialogue system includes two parts. The first part is the cloud dialogue management and strategy, which manages and provides services on demand. The second part is a web-based reminder and a customizable interface, which offer settings for reminder events and the customizable dialogue system. Moreover, for higher speech recognition accuracy, this work proposes a prosodic-contextual post-processing mechanism, which finds the best sentence among potential recognition results by using syllable segmentation, pitch analysis, and contextual analysis. In the experiment, five healthcare scenarios for the elderly are designed for evaluation. The analysis indicates that the average mean opinion score (MOS) can reach as high as 4.23. Additionally, the word error rate (WER) of LVCSR with the proposed prosodic-contextual post-processing is improved by 9.21%. These results show that the proposed system is suitable for the elderly in daily living and demonstrates the feasibility of the idea.

  • Conference Article
  • Citations: 2
  • 10.1109/asru.2017.8268923
Future vector enhanced LSTM language model for LVCSR
  • Dec 1, 2017
  • 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
  • Qi Liu + 2 more

Language models (LMs) play an important role in large vocabulary continuous speech recognition (LVCSR). However, traditional language models only predict the next single word given the history, while consecutive predictions over a sequence of words are usually demanded and useful in LVCSR. The mismatch between single-word prediction during training and the long-term sequence prediction demanded in practice may lead to performance degradation. In this paper, a novel enhanced long short-term memory (LSTM) LM using a future vector is proposed. In addition to the given history, the rest of the sequence is also embedded by future vectors. This future vector can be incorporated into the LSTM LM, giving it the ability to model much longer-term sequence-level information. Experiments show that the proposed LSTM LM obtains better BLEU scores for long-term sequence prediction. For speech recognition rescoring, although the proposed LSTM LM obtains only very slight gains on its own, it appears strongly complementary to the conventional LSTM LM: rescoring with both the new and conventional LSTM LMs achieves a very large improvement in word error rate.

  • Conference Article
  • Citations: 9
  • 10.1109/icsda.2011.6085990
Morpheme concatenation approach in language modeling for large-vocabulary Uyghur speech recognition
  • Oct 1, 2011
  • Mijit Ablimit + 2 more

For large-vocabulary continuous speech recognition (LVCSR) of highly-inflected languages, selection of an appropriate recognition unit is the first important step. The morpheme-based approach is often adopted because of its high coverage and linguistic properties. But morpheme units are short, often consisting of one or two phonemes, so they are more likely to be confused in ASR than word units. Word units generally provide better linguistic constraint, but increase the vocabulary size explosively, causing OOV (out-of-vocabulary) and data sparseness problems in language modeling. In this research, we investigate approaches to selecting word entries by concatenating morpheme sequences so as to reduce word error rate (WER). Specifically, we compare the ASR results of the word-based model with those of the morpheme-based model, and extract typical patterns that reduce the WER. This method has been successfully applied to an Uyghur LVCSR system, resulting in a significant reduction of WER without a drastic increase in vocabulary size.

  • Research Article
  • Citations: 2
  • 10.1186/s13636-017-0121-5
Classification-based spoken text selection for LVCSR language modeling
  • Oct 17, 2017
  • EURASIP Journal on Audio, Speech, and Music Processing
  • Vataya Chunwijitra + 1 more

Large vocabulary continuous speech recognition (LVCSR) has naturally been demanded for transcribing daily conversations, while developing spoken text data to train LVCSR is costly and time-consuming. In this paper, we propose a classification-based method to automatically select social media data for constructing a spoken-style language model in LVCSR. Three classification techniques, SVM, CRF, and LSTM, trained on words and parts-of-speech, are comparatively experimented with to identify the degree of spoken style in each social media sentence. Spoken-style utterances are chosen by incremental greedy selection based on the score of the SVM or the CRF classifier or the output classified as "spoken" by the LSTM classifier. With the proposed method, just 51.8, 91.6, and 79.9% of the utterances in a Twitter text collection are marked as spoken utterances by the SVM, CRF, and LSTM classifiers, respectively. A baseline language model is then improved by interpolating with the one trained on these selected utterances. The proposed model is evaluated on two Thai LVCSR tasks: social media conversations and a speech-to-speech translation application. Experimental results show that all three classification-based data selection methods clearly help reduce the overall spoken test set perplexities. Regarding the LVCSR word error rate (WER), they achieve 3.38, 3.44, and 3.39% WER reduction, respectively, over the baseline language model, and 1.07, 0.23, and 0.38% WER reduction, respectively, over the conventional perplexity-based text selection approach.
