Voice activity classification using beamformer-output-ratio

Thuy N Tran ,William G Cowley ,André Pollok

doi:10.1109/ausctw.2012.6164913

Abstract

In a conversation between multiple speakers, each person participates in the speech at different times. Therefore the active speakers in each speech segment are unknown. However, identifying the voice activity (VA) of the speakers of interest is required for adaptive beamforming techniques such as minimum variance distortionless response beamforming and the adaptive blocking beamforming (AB). Considering two speakers, this paper addresses a voice activity classification (VAC) problem that focuses on identifying the active speaker(s) in each speech segment. The proposed method is based on a new concept, the beamformer-output-ratio (BOR). This value is calculated from the outputs of two different beamformers steering at two speakers. The first part of the paper introduces the definition of BOR, the VAC method using BOR and simulation results. The simulations are based on real recordings and show a high classification accuracy. In the second part of the paper, the theoretical results of the BOR of the delay-and-sum (DS) beamforming are presented, including BOR formula derived in different environments and its behaviour in relation to parameter errors.

Full Text