End-to-End Speech Endpoint Detection Utilizing Acoustic and Language Modeling Knowledge for Online Low-Latency Speech Recognition

Inyoung Hwang, Joon-Hyuk Chang

doi:10.1109/access.2020.3020696

Abstract

Speech endpoint detection (EPD) benefits from the decoder state features (DSFs) of an online automatic speech recognition (ASR) system. However, the DSFs are obtained via the ASR decoding process, which can become prohibitively expensive, especially in limited-resource scenarios such as embedded devices. To address this problem, this paper proposes a language model (LM)-based end-of-utterance (EOU) predictor. It is trained in an end-to-end manner to determine the framewise probabilities of the EOU token conditioned on the previous word history, which is obtained from the 1-best decoding hypothesis of the ASR system, so that no actual decoding process is required in the test step. Further, a novel end-to-end EPD strategy is presented that incorporates phonetic embedding (PE)-based acoustic modeling knowledge and the proposed EOU predictor-based language modeling knowledge into an acoustic feature embedding (AFE)-based EPD approach within a recurrent neural network (RNN)-based EPD framework. The proposed EPD algorithm is built upon ensemble RNNs whose three parts, namely the proposed LM-based EOU predictor, the AFE-based EPD, and the PE-based acoustic model (AM), are trained independently, each in accordance with its own target. The ensemble RNNs are concatenated at the level of their last hidden layers and then fed into a fully-connected deep neural network (DNN)-based EPD classifier, which is trained in accordance with the ultimate EPD target. Thereafter, they are jointly retrained in a second step of the DNN training to yield a lower endpoint error. The proposed EPD framework was evaluated in terms of endpoint accuracy and word error rate on the CHiME-3 and large-scale ASR tasks. The experimental results show that the proposed EPD algorithm outperforms the conventional EPD approaches.
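
As a rough illustration of the fusion step described in the abstract, the sketch below concatenates per-frame last-hidden-layer vectors from three branches and passes them through a small fully-connected classifier. It is not the authors' implementation: the branch outputs are random stand-ins for the AFE, PE, and EOU-predictor RNNs, and the layer sizes, activations, helper names (`concat_embeddings`, `dense`, `init_layer`, `epd_classifier`), and untrained weights are all assumptions made for illustration.

```python
import math
import random

def concat_embeddings(afe, pe, eou):
    """Concatenate per-frame hidden vectors from the three branches."""
    if not (len(afe) == len(pe) == len(eou)):
        raise ValueError("all branches must cover the same number of frames")
    return [a + p + e for a, p, e in zip(afe, pe, eou)]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, weights, bias, activation):
    """One fully-connected layer: activation(W x + b)."""
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def init_layer(rng, n_in, n_out):
    """Small random weights; a trained system would load learned parameters."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def epd_classifier(fused, hidden_layer, output_layer):
    """Map each fused frame vector to an endpoint probability."""
    probs = []
    for x in fused:
        h = dense(x, *hidden_layer, relu)
        probs.append(dense(h, *output_layer, sigmoid)[0])
    return probs

# Hypothetical sizes: 5 frames, branch widths 4 (AFE), 3 (PE), 2 (EOU predictor).
rng = random.Random(0)
n_frames, d_afe, d_pe, d_eou = 5, 4, 3, 2
afe = [[rng.random() for _ in range(d_afe)] for _ in range(n_frames)]
pe = [[rng.random() for _ in range(d_pe)] for _ in range(n_frames)]
eou = [[rng.random() for _ in range(d_eou)] for _ in range(n_frames)]

fused = concat_embeddings(afe, pe, eou)
hidden = init_layer(rng, d_afe + d_pe + d_eou, 8)
output = init_layer(rng, 8, 1)
probs = epd_classifier(fused, hidden, output)
```

In the two-step scheme the abstract describes, the branches would first be trained separately and the classifier trained on top of them, after which all parts would be jointly retrained; the sketch covers only the forward pass through the fused representation.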

Highlights

Spoken dialogue systems make it possible to control contemporary devices, such as smartphones, navigation systems, and AI speakers, through natural voice interaction.
Since the decoder state features (DSFs)-based approach in [39] and the proposed endpoint detection (EPD) approach are both based on combinations of trained embeddings, namely [acoustic feature embedding (AFE), word embedding (WE), DSFs] and [AFE, phonetic embedding (PE), decoder embedding (DE)], respectively, the sub-EPD systems based on each single embedding and on their combinations were tested to verify the superiority of the DE for the proposed EPD algorithm.
To evaluate the EPD systems in terms of the early endpoint error itself, we report the word error rate (WER) as well as the early endpoint time, which describes how prematurely the final EPD decision is triggered relative to the true EPD label.
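
The early endpoint time mentioned in the last highlight can be sketched as a simple frame-index comparison. The function name `endpoint_timing`, the 10 ms frame shift, and the companion "late" value are assumptions for illustration; the paper's exact definition may differ.

```python
def endpoint_timing(predicted_frame, true_frame, frame_shift_ms=10.0):
    """Return how early or late the endpoint decision fired, in milliseconds.

    frame_shift_ms is an assumed frame hop, not a value taken from the paper.
    """
    diff_ms = (predicted_frame - true_frame) * frame_shift_ms
    return {
        "early_ms": max(0.0, -diff_ms),  # decision before the true endpoint
        "late_ms": max(0.0, diff_ms),    # decision after the true endpoint
    }

early_case = endpoint_timing(predicted_frame=180, true_frame=200)
late_case = endpoint_timing(predicted_frame=230, true_frame=200)
```

A decision that fires early cuts off speech and would typically show up as a higher WER, which is presumably why the two measures are reported together.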

Summary

INTRODUCTION

Spoken dialogue systems make it possible to control contemporary devices, such as smartphones, navigation systems, and AI speakers, through natural voice interaction. It was observed that the bottleneck features of the DNN-based acoustic model (AM), called phonetic embedding (PE), which is trained to predict senones (tied triphone states) [18], lead to improved SAD and EPD performance [19]–[21]. Another way is to find the EOU directly from the sequential input features by employing a long short-term memory (LSTM) [22], whereas the traditional EPD schemes consist of a separate SAD and online decoder. A grid-LSTM DNN (GLDNN) [31] was introduced by employing the grid-LSTM in the first layer, instead of the convolutional layer of the CLDNN, to improve the EPD performance. These feature mapping-based EPD approaches often prematurely abandon the speech region because of a hesitation pause, or they incur a higher detection latency, since they cannot adequately consider the context of the input feature sequences, such as phone or word alignments.
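
The trade-off between premature cut-off and detection latency noted above can be illustrated with a generic thresholding rule over framewise end-of-utterance scores. This is not a method from the paper: the function `first_endpoint`, the threshold, the `min_run` hold length, and the score sequence are all hypothetical.

```python
def first_endpoint(eou_probs, threshold=0.5, min_run=3):
    """Return the first frame index at which the score has stayed at or
    above the threshold for min_run consecutive frames, or None."""
    run = 0
    for i, p in enumerate(eou_probs):
        run = run + 1 if p >= threshold else 0
        if run >= min_run:
            return i
    return None

# Hypothetical scores: a short hesitation pause at frames 2-3,
# then the real end of the utterance from frame 6 onward.
scores = [0.1, 0.2, 0.8, 0.9, 0.3, 0.1, 0.7, 0.8, 0.9, 0.95]

short_hold = first_endpoint(scores, min_run=2)  # fires during the hesitation
long_hold = first_endpoint(scores, min_run=4)   # waits for the real endpoint
```

With a short hold the rule fires inside the hesitation pause, while a longer hold avoids that at the cost of a later decision. A score that lacks phone or word context cannot resolve this tension on its own.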

REVIEW OF PREVIOUS WORKS

PROPOSED END-TO-END ENDPOINT DETECTION BASED ON ENSEMBLE RNNs

EXPERIMENTS AND RESULTS

CONCLUSION

Journal: IEEE Access
Publication Date: Jan 1, 2020
Citations: 5
License type: CC BY 4.0
