Abstract

Computational Auditory Scene Analysis (CASA) has been a focus of recent literature on speech separation from monaural mixtures. The performance of current CASA systems on voiced speech separation depends strongly on the robustness of the algorithm used for pitch frequency estimation. We propose a new system that estimates the pitch (frequency) range of a target utterance and separates the voiced portions of the target speech. The algorithm first estimates the pitch range of the target speech in each frame of data in the modulation frequency domain, and then uses the estimated pitch range to segregate the target speech. The pitch range estimation method is based on an onset and offset algorithm. Speech separation is performed by filtering the mixture signal with a mask extracted from the modulation spectrogram. A systematic evaluation shows that the proposed system extracts the majority of the target speech signal with minimal interference and outperforms previous systems in both pitch extraction and voiced speech separation.

Highlights

  • Speech separation, as a solution to the cocktail party problem, is a well-known challenge with important applications

  • Pitch range estimation: the proposed system is first evaluated on pitch range estimation, using utterances chosen from Lee’s database [23] and a corpus of 100 mixtures of speech and interference [24] that is commonly used in Computational Auditory Scene Analysis (CASA) research; see, e.g., [13,25,26]

  • The intrusions are N0: 1 kHz pure tone; N1: white noise; N2: noise bursts; N3: cocktail party noise; N4: rock music; N5: siren; N6: trill telephone; N7: female speech; N8: male speech; and N9: female speech. They vary considerably; for example, N3 is noise-like, while N5 contains strong harmonic sounds. Together they form a realistic corpus for evaluating how well a CASA system deals with various types of interference


Summary

Introduction

Speech separation, as a solution to the cocktail party problem, is a well-known challenge with important applications. Many methods have been proposed for monaural speech enhancement; for example, see [3,4,5,6,7]. These methods usually assume certain statistical properties of the interference and tend to lack the capacity to deal with a variety of interferences. While monaural speech separation systems still perform poorly, the human auditory system handles the task proficiently. According to Bregman [5], the auditory scene analysis (ASA) procedure can be separated into two theoretical stages: segmentation and grouping. The speech signal is first transformed into a higher-dimensional space (such as a two-dimensional time-frequency representation), and similar time-frequency (T-F) units are then segmented to form distinct regions [6].
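The decomposition and segmentation steps described above can be illustrated with a minimal sketch. It is not the system proposed in this paper: a plain short-time Fourier transform stands in for the auditory filterbank that CASA systems often use, and a fixed magnitude threshold with 4-neighbour grouping stands in for a real segmentation rule. The function names (`stft_magnitude`, `segment_tf_units`), the frame length, the hop size, the threshold, and the two-tone test signal are all assumptions made for illustration.

```python
import cmath
import math


def stft_magnitude(signal, frame_len, hop):
    """Return one list of DFT magnitudes (bins 0..frame_len//2) per frame."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Periodic Hann window to limit spectral leakage.
        windowed = [
            x * (0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len))
            for n, x in enumerate(frame)
        ]
        mags = []
        for k in range(frame_len // 2 + 1):
            acc = sum(
                w * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n, w in enumerate(windowed)
            )
            mags.append(abs(acc))
        frames.append(mags)
    return frames


def segment_tf_units(tf, threshold):
    """Group adjacent above-threshold T-F units into labelled regions.

    Returns (labels, n_regions); label 0 marks units below the threshold.
    """
    n_frames, n_bins = len(tf), len(tf[0])
    labels = [[0] * n_bins for _ in range(n_frames)]
    n_regions = 0
    for t in range(n_frames):
        for f in range(n_bins):
            if tf[t][f] < threshold or labels[t][f]:
                continue
            # Start a new region and flood-fill its 4-connected neighbours.
            n_regions += 1
            labels[t][f] = n_regions
            stack = [(t, f)]
            while stack:
                ct, cf = stack.pop()
                for nt, nf in ((ct - 1, cf), (ct + 1, cf), (ct, cf - 1), (ct, cf + 1)):
                    if (
                        0 <= nt < n_frames
                        and 0 <= nf < n_bins
                        and not labels[nt][nf]
                        and tf[nt][nf] >= threshold
                    ):
                        labels[nt][nf] = n_regions
                        stack.append((nt, nf))
    return labels, n_regions


# Hypothetical test signal: two steady tones at 500 Hz and 2500 Hz, 8 kHz sampling.
fs = 8000
signal = [
    math.sin(2 * math.pi * 500 * n / fs) + math.sin(2 * math.pi * 2500 * n / fs)
    for n in range(256)
]
tf = stft_magnitude(signal, frame_len=64, hop=32)
labels, n_regions = segment_tf_units(tf, threshold=4.0)
print(n_regions)  # one region per tone
```

With a 64-sample frame at 8 kHz, the two tones fall on bins 4 and 20, so the threshold isolates two well-separated regions. A real CASA front end would replace both the transform and the grouping rule; the sketch only shows how T-F units that are adjacent and similar can be collected into regions before the grouping stage.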

[Figure and outline residue: axis labels "Acoustic Frequency", "Modulation Frequency", and "Percentage of error detection"; annotations "Onset front member" and "Matching offset front member"; section titles "Pitch range estimation" and "Proposed Algorithm".]