Abstract

Utterance clustering is an actively researched topic in audio signal processing and machine learning. This study aims to improve the performance of utterance clustering by processing multichannel (stereo) audio signals. Processed audio signals were generated by combining the left- and right-channel audio signals in a few different ways, and embedded features (also called d-vectors) were then extracted from those processed signals. This study applied the Gaussian mixture model to supervised utterance clustering. In the training phase, a parameter-sharing Gaussian mixture model was used to train the model for each speaker. In the testing phase, the speaker with the maximum likelihood was selected as the detected speaker. Experiments with real audio recordings of multiperson discussion sessions showed that, in more complicated conditions, the proposed method using multichannel audio signals achieved significantly better performance than a conventional method using mono audio signals.
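The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which channel combinations were used, so the average and half-difference mixes below are assumptions, and the function names (combine_channels, gmm_log_likelihood, detect_speaker) and toy model parameters are hypothetical. The d-vector extraction step is omitted, and the parameter-sharing training scheme is not reproduced; only the maximum-likelihood selection over per-speaker diagonal-covariance Gaussian mixture models is shown.

```python
import math


def combine_channels(left, right):
    """Return several candidate mono signals built from a stereo pair.

    The specific combinations are illustrative assumptions, not the
    combinations used in the study.
    """
    return {
        "left": list(left),
        "right": list(right),
        "sum": [(l + r) / 2 for l, r in zip(left, right)],
        "diff": [(l - r) / 2 for l, r in zip(left, right)],
    }


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        component_logs.append(log_p)
    # Log-sum-exp over components for numerical stability.
    m = max(component_logs)
    return m + math.log(sum(math.exp(c - m) for c in component_logs))


def detect_speaker(x, speaker_models):
    """Pick the speaker whose GMM assigns x the highest likelihood."""
    return max(
        speaker_models,
        key=lambda s: gmm_log_likelihood(x, *speaker_models[s]),
    )


# Toy per-speaker models: (weights, means, variances), 2-D features.
toy_models = {
    "A": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "B": ([0.5, 0.5], [[5.0, 5.0], [6.0, 6.0]], [[1.0, 1.0], [1.0, 1.0]]),
}

if __name__ == "__main__":
    mixes = combine_channels([1.0, 0.0], [0.0, 1.0])
    print(mixes["sum"], mixes["diff"])
    print(detect_speaker([0.2, 0.1], toy_models))
```

In practice, the feature vector passed to detect_speaker would be a d-vector extracted from one of the combined signals, and the model parameters would come from the training phase rather than being written by hand.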

Highlights

  • With the development of artificial intelligence (AI), many techniques are now applied in daily life, such as automatic speech recognition (ASR) [1] and speaker recognition

  • Utterance clustering is a popular topic in speech processing that can be used for speaker diarization [2] and ASR

  • Here, a new method of audio signal processing was proposed for utterance clustering [9]


Summary

Introduction

With the development of artificial intelligence (AI), many techniques are now applied in daily life, such as automatic speech recognition (ASR) [1] and speaker recognition. Most studies are based on laboratory data sets, and their methods do not handle real-world problems well. Both formal and informal meetings have more segments with overlapping speech than segments with only one speaker [3]. A key aspect of performance improvement in utterance clustering is audio feature embeddings. This study initially tried several published methods in our own experimental research, but their results were not as good as we had hoped. To address this problem, a new method of audio signal processing was proposed here for utterance clustering [9]. The challenge this study aims to address is how to handle low-quality audio data recorded in real-world discussion settings.

