Abstract

This work assesses different approaches for speech and non-speech segmentation of audio data and proposes a new, high-level representation of audio signals based on phoneme recognition features suitable for speech/non-speech discrimination tasks. Unlike previous model-based approaches, where speech and non-speech classes were usually modeled by several models, we develop a representation where just one model per class is used in the segmentation process. For this purpose, four measures based on consonant-vowel pairs obtained from different phoneme speech recognizers are introduced and applied in two different segmentation-classification frameworks. The segmentation systems were evaluated on different broadcast news databases. The evaluation results indicate that the proposed phoneme recognition features are better than the standard mel-frequency cepstral coefficients and posterior probability-based features (entropy and dynamism). The proposed features proved to be more robust and less sensitive to different training and unforeseen conditions. Additional experiments with fusion models based on cepstral and the proposed phoneme recognition features produced the highest scores overall, which indicates that the most suitable method for speech/non-speech segmentation is a combination of low-level acoustic features and high-level recognition features.

Highlights

  • Speech/non-speech (SNS) segmentation is the task of partitioning audio streams into speech and non-speech segments

  • A good segmentation of continuous audio streams into speech and non-speech has many practical applications. It is usually applied as a preprocessing step in real-world systems for automatic speech recognition (ASR) [28], like broadcast news (BN) transcription [4, 7, 34], automatic audio indexing and summarization [17, 18], audio and speaker diarization [12, 20, 24, 30, 37], and all other applications where efficient speech detection helps to greatly reduce computational complexity and generate more understandable and accurate outputs

  • The main issue was to find the best combination of representations and classifications, which should be robust to different BN shows, different environments, different languages, and different non-speech types of signals, and should be integrated into systems for further speech processing of the BN data


Summary

Introduction

Speech/non-speech (SNS) segmentation is the task of partitioning audio streams into speech and non-speech segments. A good segmentation of continuous audio streams into speech and non-speech has many practical applications. It is usually applied as a preprocessing step in real-world systems for automatic speech recognition (ASR) [28], like broadcast news (BN) transcription [4, 7, 34], automatic audio indexing and summarization [17, 18], audio and speaker diarization [12, 20, 24, 30, 37], and all other applications where efficient speech detection helps to greatly reduce computational complexity and generate more understandable and accurate outputs. Previous research focused more on developing and evaluating characteristic features for classification, and systems were designed to work on already-segmented data.
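The posterior-probability features mentioned above (entropy and dynamism) can be sketched briefly. This is a minimal illustration, not the paper's implementation: it assumes the commonly used per-frame definitions over phoneme posterior probabilities, and the posterior matrices here are synthetic stand-ins for recognizer outputs. Speech tends to produce peaked, rapidly changing posteriors (low entropy, high dynamism), while music and noise tend to produce flatter, more static ones.

```python
import numpy as np

def entropy_dynamism(posteriors):
    """Segment-level entropy and dynamism of phoneme posteriors.

    posteriors: (T, K) array; each row is a per-frame posterior
    distribution over K phoneme classes.
    Returns the mean per-frame entropy and the mean squared
    frame-to-frame change (dynamism) over the segment.
    """
    p = np.clip(posteriors, 1e-12, 1.0)             # guard log(0)
    entropy = -(p * np.log(p)).sum(axis=1)          # per-frame entropy
    dynamism = ((p[1:] - p[:-1]) ** 2).sum(axis=1)  # per-transition change
    return entropy.mean(), dynamism.mean()

# Synthetic stand-ins: peaked, varying rows mimic speech; a single
# flat distribution repeated over all frames mimics stationary noise.
T, K = 200, 40
rng = np.random.default_rng(0)
speechlike = rng.dirichlet(np.full(K, 0.1), size=T)
noiselike = np.tile(rng.dirichlet(np.full(K, 10.0)), (T, 1))

h_speech, d_speech = entropy_dynamism(speechlike)
h_noise, d_noise = entropy_dynamism(noiselike)
```

On these synthetic inputs the speech-like segment yields lower entropy and higher dynamism than the noise-like one, which is the separation such features exploit for SNS classification.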

