Abstract

In this paper, we propose a novel speech enhancement paradigm that addresses the problem of retrieving a desired speech signal in a multi-talker environment. The paradigm is a three-step procedure consisting of separation, ranking, and enhancement. First, a speech separation system, which could be a conventional spatial filter bank or a more advanced separation system, separates the mixtures of speech signals captured by the microphones into speech signals from candidate speakers. Next, novel ranking algorithms, proposed in this paper, determine the talker-of-interest among the separated speech signals. Finally, the speech signal of the talker-of-interest is estimated as a linear combination of the separated signals, with weights determined by the ranking algorithms. The ranking algorithms exploit turn-taking patterns between conversational partners to determine the talker-of-interest among competing speakers. Unlike some existing solutions, they do not require access to additional sensors, e.g., EEG electrodes or cameras, but rely only on microphone signals. Specifically, the proposed algorithms rank the separated speech signals based on the probability of speech overlaps and gaps with the user’s own voice. The highest-ranked speech signal is that of the talker with minimum probability of speech overlap and gap with the user’s own voice. The proposed ranking algorithms are shown to be highly effective at determining the talker-of-interest, since conversational partners, i.e., the user and the talker-of-interest, behaviorally avoid speech overlaps and gaps.
We evaluate the proposed speech enhancement paradigm in two practical hearing-aid applications, where the objective is to enhance the speech signal of a conversational partner in a multi-talker environment. The evaluation results demonstrate that the proposed speech enhancement systems significantly outperform conventional speech enhancement systems in both applications.
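
The ranking and enhancement steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that frame-wise binary voice activity decisions are already available for the user's own voice and for each separated candidate, it estimates the overlap-gap probability as a simple empirical frame frequency, and it uses one-hot weights for the linear combination. The function names `overlap_gap_rate`, `rank_candidates`, and `combine` are hypothetical.

```python
import numpy as np


def overlap_gap_rate(own_vad, cand_vad):
    """Empirical fraction of frames that are an overlap (both active)
    or a gap (both silent). Assumption: frame-wise binary VAD inputs."""
    own = np.asarray(own_vad, dtype=bool)
    cand = np.asarray(cand_vad, dtype=bool)
    overlap = own & cand      # user and candidate speak at the same time
    gap = ~own & ~cand        # neither the user nor the candidate speaks
    return float(np.mean(overlap | gap))


def rank_candidates(own_vad, cand_vads):
    """Return candidate indices sorted from lowest to highest overlap-gap
    rate, together with the rates themselves."""
    rates = np.array([overlap_gap_rate(own_vad, v) for v in cand_vads])
    order = np.argsort(rates, kind="stable")
    return order, rates


def combine(separated, weights):
    """Linear combination of the separated signals (rows of `separated`)."""
    return np.asarray(weights, dtype=float) @ np.asarray(separated, dtype=float)


# Toy example: candidate 0 alternates perfectly with the user,
# candidate 1 speaks independently of the user's turns.
own_vad = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
cand_vads = [
    [0, 0, 0, 1, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
]
order, rates = rank_candidates(own_vad, cand_vads)

# Assumed hard decision: all weight on the top-ranked candidate.
weights = np.zeros(len(cand_vads))
weights[order[0]] = 1.0
separated = np.random.default_rng(0).standard_normal((len(cand_vads), 160))
enhanced = combine(separated, weights)
```

In this toy case the perfectly alternating candidate has the lowest overlap-gap rate and is ranked first. Softer weights, for example ones derived from a probability assigned to each candidate, would fit the same `combine` call.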

Highlights

  • The cocktail party problem is often regarded as one of the most difficult situations any speech enhancement system may encounter

  • We see a significant improvement in terms of both ESTOI and PESQ when using minimum overlap-gap (MOG) and Bayesian MOG (BMOG) compared to normalized cross-correlation (NCC) and MMI

  • The softer gain function translates to higher ESTOI and PESQ scores, but a slightly lower segmental SNR score. Both speech enhancement systems using MOG and BMOG are extremely effective at retrieving a conversational partner in a multi-speaker situation as they perform close to the oracle beamformer


Summary

INTRODUCTION

The cocktail party problem is often regarded as one of the most difficult situations any speech enhancement system may encounter. Current direction-of-arrival (DOA) estimators, such as SRP-PHAT [6], maximum likelihood [7,8], and deep learning-based DOA estimators [9], are not able to robustly handle a conversational partner in a multi-speaker environment without additional a priori information on the conversational partner’s location or voice activity. These DOA estimators will indecisively switch between the candidate speakers as being the conversational partner, leading to an enhanced signal of unacceptable intelligibility and quality. Estimated candidate speakers may be ranked using EEG signals, retrieved from EEG electrodes placed on the scalp of the user, to detect the user’s attention to a conversational partner; EOG signals from in-ear electrodes to estimate eye gaze; and cameras to track eye movements and estimate eye gaze [14]–[18]. While these signals have the potential to support the decision of determining the talker-of-interest, they require additional sensors, which increase equipment cost, increase wearing inconvenience, and likely increase computational cost and power consumption. In Section V, we evaluate the performance of the proposed speech enhancement paradigm and (B)MOG algorithms in two speech enhancement applications.

SPEECH INTERACTION IN CONVERSATIONS
THE MINIMUM OVERLAP-GAP ALGORITHM
BAYESIAN MOG FOR PROBABILITY-BASED SPEAKER RANKING
PARAMETER ESTIMATION FROM CONVERSATIONAL SPEECH DATABASE
EVALUATION IN SPEECH ENHANCEMENT APPLICATIONS
IMPLEMENTATION OF THE MOG AND BMOG ALGORITHMS
STATE-OF-THE-ART METHODS FOR SPEAKER RANKING
APPLICATION 1
APPLICATION 2
CONCLUSION
EXPECTED MISCLASSIFICATION RATE FOR MOG