Abstract

Spoken utterance retrieval has been studied extensively over the last decades, with the purpose of indexing large audio databases or of detecting keywords in continuous speech streams. While closed corpora can be indexed via a batch process, on-line spotting systems have to detect the targeted spoken utterances synchronously. We propose a two-level architecture for on-the-fly term spotting. The first level performs a fast detection of the speech segments that probably contain the targeted utterance. The second level refines the detection on the selected segments, using a speech recognizer based on a query-driven decoding algorithm. Experiments are conducted on both broadcast and spontaneous speech corpora, and we investigate the impact of the level of spontaneity on system performance. Results show that our method remains effective even when recognition rates are significantly degraded by disfluencies.
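The cascade described above can be sketched as a minimal pipeline: a cheap first-level filter selects candidate segments, and an expensive second-level pass confirms or rejects them. This is only an illustration of the two-level control flow, not the paper's system; the detectors below (`coarse_detect`, `refine`) and the transcript-based scoring are hypothetical stand-ins for the acoustic first level and the query-driven decoder.

```python
# Hypothetical sketch of a two-level on-the-fly term spotter.
# Level 1 is a cheap filter; level 2 (costly) runs only on candidates.

def coarse_detect(segment, query, threshold=0.5):
    """Level 1 (stand-in): fraction of query words found in a coarse
    transcript of the segment; cheap but imprecise."""
    tokens = set(segment["coarse_transcript"].split())
    words = query.split()
    hits = sum(1 for w in words if w in tokens)
    return hits / max(len(words), 1) >= threshold

def refine(segment, query):
    """Level 2 (stand-in): expensive confirmation pass, here a simple
    substring match against a full transcript."""
    return query in segment["full_transcript"]

def spot(stream, query):
    """Parse segments sequentially, yielding detections as they occur,
    so a hit is notified synchronously rather than after batch indexing."""
    for segment in stream:
        if coarse_detect(segment, query):   # fast first level
            if refine(segment, query):      # refinement on candidates only
                yield segment["id"]

stream = [
    {"id": 0, "coarse_transcript": "weather report today",
     "full_transcript": "the weather report today"},
    {"id": 1, "coarse_transcript": "traffic jam highway",
     "full_transcript": "a traffic jam on the highway"},
]
print(list(spot(stream, "weather report")))  # -> [0]
```

The design point is that the second level never sees segments rejected by the first, which is what makes synchronous spotting affordable on a continuous stream.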

Highlights

  • Term detection has been extensively studied over the last decades in two different contexts: spoken term detection (STD) in large speech databases, and keyword spotting in continuous speech streams

  • While the STD task relies on the indexing of the whole speech database, word spotting systems perform a sequential parsing of the speech stream with the purpose of detecting the targeted word sequence

  • We present a two-level architecture for on-the-fly term spotting, in which the full process is query-driven


Summary

Introduction

Term detection has been extensively studied over the last decades in two different contexts: spoken term detection (STD) in large speech databases, and keyword spotting in continuous speech streams. We focus on on-the-fly term spotting, where a detection must be notified synchronously, at the moment it occurs in the speech stream. This task corresponds to usage scenarios where early detection is critical, such as the supervision and automation of operator-assisted calls [1, 2]. For all these detection tasks, the performance reported in the literature is quite good in clean conditions, especially on broadcast news data, which has been widely used for benchmarking speech processing systems [3, 4]. In more difficult conditions, such as noisy or spontaneous speech, performance is dramatically degraded by recognition errors [5, 6, 7].

