Abstract

This paper presents an online, low-latency, high-performance speech recognition system built on a bidirectional long short-term memory (BLSTM) acoustic model. To achieve this, we adopt a server-client architecture and a context-sensitive-chunk-based approach. For each client, the speech recognition server manages a main thread and a decoder thread, and it additionally runs one shared worker thread. The main thread communicates with the connected client, extracts speech features, and buffers them. The decoder thread performs speech recognition, comprising the proposed multichannel parallel acoustic score computation for the BLSTM acoustic model, the proposed deep neural network (DNN)-based voice activity detector, and Viterbi decoding. The proposed acoustic score computation method uses the worker thread to estimate the acoustic scores of a context-sensitive-chunk BLSTM acoustic model on speech features batched across concurrent clients. The proposed DNN-based voice activity detector detects short pauses within long utterances to reduce response latency. In experiments on Korean speech recognition, the proposed acoustic score computation increases the number of concurrent clients from 22 to 44. Combined with a frame-skipping method, the number is further increased to 59 clients with only a small accuracy degradation. Moreover, the proposed DNN-based voice activity detector reduces the average user-perceived latency from 11.71 s to 3.09–5.41 s.
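The batching idea behind the multichannel parallel acoustic score computation can be sketched as follows. This is a minimal illustration, not the paper's implementation: the chunk shapes, the `score_batch` stub standing in for the BLSTM forward pass, and all function names are assumptions introduced for the example.

```python
import numpy as np

# Illustrative shapes: each client submits a context-sensitive chunk of
# FRAMES feature vectors of dimension FEAT_DIM; the acoustic model emits
# STATE_DIM state scores per frame. All values are assumptions.
FRAMES, FEAT_DIM, STATE_DIM = 40, 80, 512

def score_batch(batch):
    """Stand-in for the BLSTM forward pass: one call scores the chunks
    of all concurrent clients at once instead of one call per client."""
    weights = np.zeros((FEAT_DIM, STATE_DIM))  # placeholder parameters
    return batch @ weights                     # (n_clients, FRAMES, STATE_DIM)

def batched_scores(chunks_by_client):
    """Stack the pending chunks from concurrent clients, score them in a
    single batched call, and scatter the scores back per client."""
    ids = list(chunks_by_client)
    batch = np.stack([chunks_by_client[c] for c in ids])  # (n, FRAMES, FEAT_DIM)
    scores = score_batch(batch)
    return {c: scores[i] for i, c in enumerate(ids)}
```

The gain comes from amortizing one model invocation over many clients' chunks, which is what allows the worker thread to serve more concurrent decoder threads.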

Highlights

  • The main thread communicates with the connected client, manages the decoder thread, extracts the speech feature vectors from the received audio segments, and buffers the feature vectors into a ring buffer

  • This paper presents an online multichannel automatic speech recognition (ASR) system employing a bidirectional long short-term memory (BLSTM) acoustic model (AM), which is rarely deployed in industry even though it is one of the best-performing AMs

  • We present a server-client-based online ASR system employing a BLSTM AM, a state-of-the-art AM
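The first highlight describes the main thread buffering extracted feature vectors into a ring buffer that the decoder thread drains. A minimal sketch of such a buffer is shown below; the class name, capacity handling, and overflow policy (overwrite oldest) are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

class FeatureRingBuffer:
    """Fixed-capacity ring buffer of feature vectors: the main thread
    pushes frames as audio arrives, the decoder thread pops them."""

    def __init__(self, capacity, feat_dim):
        self.buf = np.zeros((capacity, feat_dim))
        self.capacity = capacity
        self.head = 0   # next write position
        self.tail = 0   # next read position
        self.count = 0  # frames currently buffered

    def push(self, frames):
        """Main thread: append feature vectors, overwriting the oldest
        frames on overflow (an assumed policy for this sketch)."""
        for f in frames:
            self.buf[self.head] = f
            self.head = (self.head + 1) % self.capacity
            if self.count == self.capacity:
                self.tail = (self.tail + 1) % self.capacity
            else:
                self.count += 1

    def pop(self, n):
        """Decoder thread: read up to n frames in arrival order."""
        n = min(n, self.count)
        idx = [(self.tail + i) % self.capacity for i in range(n)]
        self.tail = (self.tail + n) % self.capacity
        self.count -= n
        return self.buf[idx].copy()
```

In a real server the two threads would synchronize access, e.g. with a lock or condition variable, which is omitted here for brevity.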


Summary

Introduction

Deep learning with GPUs and large amounts of speech data has greatly accelerated the advance of speech recognition [1,2,3,4,5]. In line with this advancement, automatic speech recognition (ASR) has been deployed in a wide range of services. Research on ASR deployment can be classified into (a) on-device systems and (b) server-client systems. In both cases, there exists a trade-off between ASR accuracy and real-time performance.


