This paper presents a speech-based system for estimating autism severity, combined with automatic speaker diarization. Speaker diarization was performed by two methods: the first used acoustic features, namely Mel-Frequency Cepstral Coefficients (MFCC) and pitch, and the second used x-vectors, embeddings extracted from Deep Neural Networks (DNN). Both diarization methods were trained using a Fully Connected Deep Neural Network (FCDNN). We then trained a Convolutional Neural Network (CNN) to estimate autism severity from 48 acoustic and prosodic speech features. One hundred thirty-two young children were recorded with a distant microphone in the Autism Diagnostic Observation Schedule (ADOS) examination room. Of the two diarization methods, MFCC and pitch achieved the better Diarization Error Rate (DER), 26.91%. Using this diarization method, the severity estimation system achieved a Pearson correlation of 0.606 between the predicted and actual autism severity scores (i.e., ADOS scores). Clinical Relevance: The presented system identifies children's speech segments and estimates their autism severity score.
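The abstract reports diarization performance as a Diarization Error Rate (DER). As a point of orientation, a minimal sketch of the standard NIST-style DER decomposition (false-alarm, missed-speech, and speaker-confusion time over total scored speech time) is shown below; the function name and example durations are illustrative, not taken from the paper.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """Return DER as a fraction of total scored speech time.

    false_alarm  -- seconds labeled speech where there is none
    missed       -- seconds of speech the system did not detect
    confusion    -- seconds attributed to the wrong speaker
    total_speech -- total seconds of scored reference speech
    """
    return (false_alarm + missed + confusion) / total_speech


# Hypothetical durations: 10 s false alarm, 8 s missed, 9 s confused
# out of 100 s of reference speech.
der = diarization_error_rate(10.0, 8.0, 9.0, 100.0)
print(f"DER = {der:.2%}")  # → DER = 27.00%
```

A lower DER indicates that more of the recording was assigned to the correct speaker, which matters here because the severity estimator is applied only to segments diarized as the child's speech.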