Abstract

Many end-to-end approaches have been proposed to detect predefined keywords. In multi-keyword scenarios, two bottlenecks still need to be resolved: (1) the data that contain keyword(s) are sparsely distributed, and (2) the timestamps of the detected keywords are inaccurate. In this paper, to alleviate the first issue and further improve the performance of the end-to-end ASR front-end, we propose a biased loss function that guides the recognizer to pay more attention to the speech segments containing the predefined keywords. As for the second issue, we solve it by modifying the forced alignment applied to the end-to-end ASR front-end: to obtain frame-level alignments, we employ a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) based acoustic model (AM) as an auxiliary aligner. The proposed system is evaluated in the OpenSAT20 Evaluation held by the National Institute of Standards and Technology (NIST). The performance of our end-to-end KWS system is comparable to, and sometimes slightly better than, that of the conventional hybrid KWS system. With the fused results of the end-to-end and conventional KWS systems, we won first prize in the KWS track. On the dev dataset (a part of the SAFE-T corpus), the system outperforms the baseline by a large margin: our system with the GMM-HMM aligner achieves lower segmentation-aware word error rates (a relative decrease of 7.9–19.2%) and higher overall actual term-weighted values (a relative increase of 3.6–11.0%), which demonstrates the effectiveness of the proposed method. For more precise alignments, a DNN-based AM can be used as the aligner at the cost of more computation.
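The abstract does not give the exact form of the biased loss; the following is a minimal PyTorch-style sketch of the idea only, assuming a CTC front-end and a hypothetical per-utterance `has_keyword` flag, where keyword-bearing utterances are up-weighted so the recognizer attends more to them.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a keyword-biased CTC objective (not the paper's exact
# formulation): utterances that contain a predefined keyword receive a larger
# weight, biasing training toward keyword-bearing speech segments.
class BiasedCTCLoss(nn.Module):
    def __init__(self, blank=0, keyword_weight=2.0):
        super().__init__()
        self.ctc = nn.CTCLoss(blank=blank, reduction="none", zero_infinity=True)
        self.keyword_weight = keyword_weight  # assumed bias factor

    def forward(self, log_probs, targets, input_lens, target_lens, has_keyword):
        # log_probs: (T, N, C) log-softmax encoder outputs
        # has_keyword: (N,) bool tensor, True if the utterance contains a keyword
        per_utt = self.ctc(log_probs, targets, input_lens, target_lens)  # (N,)
        weights = torch.where(has_keyword,
                              torch.full_like(per_utt, self.keyword_weight),
                              torch.ones_like(per_utt))
        return (weights * per_utt).mean()
```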

Highlights

  • Accurate spoken term detection (STD), also known as keyword search (KWS), is a vital downstream application of automatic speech recognition (ASR)

  • We evaluate the impact of text alignments produced by connectionist temporal classification (CTC) based and Gaussian mixture model-hidden Markov model (GMM-HMM) based aligners

  • In order to evaluate the impact of segmentation error on ASR, we first calculate the segmentation-aware word error rate (WER) under different parameter settings; the results are shown in Tables 6 and 7 (a minimal WER sketch follows this list)

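As context for the segmentation-aware WER highlight above, here is a minimal, self-contained sketch of plain word error rate computed via edit distance; the segmentation-aware variant used in the paper additionally accounts for segmentation boundaries, which is not reproduced here.

```python
# Minimal word error rate (WER) via Levenshtein distance over word sequences.
# This is plain WER, not the paper's segmentation-aware variant.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the fire department now", "the fire department not"))  # 0.25
```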

Summary

Introduction

Accurate spoken term detection (STD), also known as keyword search (KWS), is a vital downstream application of automatic speech recognition (ASR). The typical KWS pipeline is based on a lattice obtained from a Large Vocabulary Continuous Speech Recognition (LVCSR) system, in which a neural acoustic model (AM) and a word-based language model (LM) are both applied. End-to-end ASR systems, in contrast, only require the segmented speech and the corresponding text. There are two popular approaches to end-to-end ASR, namely connectionist temporal classification (CTC) [2,3,4] and the attention mechanism [5,6,7,8,9]. The OpenSAT20 Evaluation makes use of simulated public safety communications spoken in English and offers three evaluation tasks: SAD, KWS, and ASR. Although our team completed all three tasks, we focus here on the ASR and KWS tasks.
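As an illustration of the CTC approach mentioned above, the following is a minimal, hedged PyTorch sketch of computing the CTC objective for an end-to-end ASR front-end; the encoder, feature dimensions, and vocabulary size are illustrative assumptions, not the paper's actual model.

```python
import torch
import torch.nn as nn

# Toy CTC-trained encoder: an LSTM over acoustic features followed by a linear
# projection to the output vocabulary (index 0 reserved for the CTC blank).
feat_dim, hidden, vocab = 80, 256, 100
encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
proj = nn.Linear(hidden, vocab)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

batch, frames, max_label_len = 4, 200, 20
feats = torch.randn(batch, frames, feat_dim)               # acoustic features
targets = torch.randint(1, vocab, (batch, max_label_len))  # token labels (no blanks)
input_lens = torch.full((batch,), frames, dtype=torch.long)
target_lens = torch.randint(5, max_label_len, (batch,))

enc_out, _ = encoder(feats)                                    # (N, T, H)
log_probs = proj(enc_out).log_softmax(dim=-1).transpose(0, 1)  # (T, N, C) for CTCLoss
loss = ctc_loss(log_probs, targets, input_lens, target_lens)
loss.backward()
print(float(loss))
```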
