Abstract

The huge amount of information stored in audio and video repositories makes search on speech (SoS) a priority area nowadays. Within SoS, Query-by-Example Spoken Term Detection (QbE STD) aims to retrieve data from a speech repository given a spoken query. Research in this area is continuously fostered by the organization of QbE STD evaluations. This paper presents a multi-domain, internationally open evaluation for QbE STD in Spanish. The evaluation task is to retrieve the speech files that contain the queries, providing their start and end times and a score that reflects the confidence given to each detection. Three Spanish speech databases covering different domains were employed in the evaluation: the MAVIR database, which comprises a set of talks from workshops; the RTVE database, which includes broadcast television (TV) shows; and the COREMAH database, which contains spontaneous two-person conversations about different topics. The evaluation was designed carefully so that several analyses of the main results can be carried out. We present the evaluation itself, the three databases, the evaluation metrics, the systems submitted to the evaluation, the results, and detailed post-evaluation analyses based on query properties (within-vocabulary/out-of-vocabulary queries, single-word/multi-word queries, and native/foreign queries). Fusion results of the primary systems submitted to the evaluation are also presented. Three teams took part in the evaluation, and ten different systems were submitted. The results suggest that the QbE STD task remains an open problem and that the performance of these systems is highly sensitive to changes in the data domain. Nevertheless, QbE STD strategies are able to outperform text-based STD in unseen data domains.

Highlights

  • The huge amount of information stored in audio and audiovisual repositories makes it necessary to develop efficient methods for search on speech (SoS)

  • If participants employ large vocabulary continuous speech recognition (LVCSR) for processing the audio, the out-of-vocabulary (OOV) words must first be removed from the system dictionary, and other methods have to be used for searching OOV queries

  • In QbE spoken term detection (STD), a hypothesized occurrence is called a detection; if the detection corresponds to an actual occurrence, it is called a hit; otherwise it is called a false alarm (FA)
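
The detection, hit, and false alarm terminology above can be illustrated with a minimal sketch. The text here does not say how a detection is matched to an actual occurrence, so the sketch assumes a simple rule: a detection is a hit if its time span overlaps a reference occurrence that has not already been matched. The names `overlaps` and `count_hits_and_false_alarms` are hypothetical and are not part of the evaluation tooling.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_time, end_time) in seconds


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if the two time intervals share any span of time."""
    return a[0] < b[1] and b[0] < a[1]


def count_hits_and_false_alarms(
    detections: List[Interval], references: List[Interval]
) -> Tuple[int, int]:
    """Label each detection as a hit or a false alarm (FA).

    ASSUMPTION: a detection is a hit when it overlaps a reference
    occurrence that no earlier detection has claimed. The official
    matching rule of the evaluation may differ.
    """
    matched = set()  # indices of reference occurrences already claimed
    hits = 0
    false_alarms = 0
    for det in detections:
        match = next(
            (
                i
                for i, ref in enumerate(references)
                if i not in matched and overlaps(det, ref)
            ),
            None,
        )
        if match is None:
            false_alarms += 1
        else:
            matched.add(match)
            hits += 1
    return hits, false_alarms


# Toy data for one query in one file: three detections, two true occurrences.
detections = [(1.0, 1.5), (4.0, 4.4), (9.0, 9.5)]
references = [(1.1, 1.6), (6.0, 6.5)]
hits, false_alarms = count_hits_and_false_alarms(detections, references)
```

In this toy case only the first detection overlaps a true occurrence, so it is the single hit and the other two detections are false alarms.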


Summary

Introduction

The huge amount of information stored in audio and audiovisual repositories makes it necessary to develop efficient methods for search on speech (SoS). QbE STD has traditionally been addressed using three different approaches: methods based on the word/subword transcription of the query, methods based on template matching of features, and hybrid approaches. An interesting alternative is [54], which proposes hashing the phone posteriors to speed up search and to enable searching on massively large datasets. These template matching-based methods were found to outperform subword transcription-based techniques in QbE STD [67] and can be effectively employed to build language-independent STD systems, since prior knowledge of the language involved in the speech data is not necessary. The system in [77] employs a syllable-based speech recognizer and dynamic programming at the triphone state level to output detections, and applies DNN posteriorgram-based rescoring.
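
To make the idea of template matching of features concrete, the following is a minimal sketch of subsequence dynamic time warping (DTW), a technique commonly used for this purpose. The text above does not name a specific matching algorithm, so the choice of subsequence DTW, the Euclidean frame distance, and the function names `frame_distance` and `subsequence_dtw` are all assumptions made for illustration; real systems typically operate on posteriorgram or bottleneck features rather than the toy one-dimensional frames used here.

```python
import math
from typing import Sequence, Tuple

Frame = Sequence[float]


def frame_distance(a: Frame, b: Frame) -> float:
    """Euclidean distance between two feature frames (an assumed choice)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def subsequence_dtw(
    query: Sequence[Frame], utterance: Sequence[Frame]
) -> Tuple[float, int, int]:
    """Find the utterance region that best matches the whole query.

    Returns (cost normalized by query length, start frame, end frame).
    The query may start and end at any utterance frame.
    """
    n, m = len(query), len(utterance)
    # cost[i][j]: best accumulated cost of aligning query[0..i]
    # with a region of the utterance that ends at frame j.
    cost = [[math.inf] * m for _ in range(n)]
    # start[i][j]: utterance frame where that best alignment begins.
    start = [[0] * m for _ in range(n)]

    for j in range(m):  # the match is free to begin at any frame
        cost[0][j] = frame_distance(query[0], utterance[j])
        start[0][j] = j

    for i in range(1, n):
        for j in range(m):
            # Vertical step: another query frame maps to the same frame j.
            best_cost, best_start = cost[i - 1][j], start[i - 1][j]
            if j > 0:
                # Diagonal and horizontal steps.
                for c, s in (
                    (cost[i - 1][j - 1], start[i - 1][j - 1]),
                    (cost[i][j - 1], start[i][j - 1]),
                ):
                    if c < best_cost:
                        best_cost, best_start = c, s
            cost[i][j] = frame_distance(query[i], utterance[j]) + best_cost
            start[i][j] = best_start

    end = min(range(m), key=lambda j: cost[n - 1][j])
    return cost[n - 1][end] / n, start[n - 1][end], end


# Toy example: the query pattern 0, 1, 2 is embedded at frames 2..4.
query = [[0.0], [1.0], [2.0]]
utterance = [[5.0], [5.0], [0.0], [1.0], [2.0], [5.0]]
score, start_frame, end_frame = subsequence_dtw(query, utterance)
```

Because the comparison is made directly on feature frames, no transcription or language-specific lexicon is needed, which is consistent with the language-independence argument above. A lower normalized cost indicates a better match, so a detection score would be some decreasing function of it.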

Methods
ALBAYZIN 2018 QbE STD evaluation
Comparison to other QbE STD international evaluations
Evaluation
Development data
Performance analysis of QbE STD systems for INV and OOV queries
Comparison to the ALBAYZIN 2016 QbE STD evaluation
Conclusions
