A Distributed Approach to Speaker Count Problem in an Open-Set Scenario by Clustering Pitch Features

Sakshi Pandey, Amit Banerjee

doi:10.3390/info12040157

Abstract

Counting the number of speakers in an audio sample can lead to innovative applications, such as a real-time ranking system. Researchers have studied advanced machine learning approaches for solving the speaker count problem. However, these solutions are not efficient in real-time environments, as they require pre-processing of a finite set of data samples. Another approach is to use unsupervised learning or audio processing techniques. Research in this category is limited and does not consider the large-scale open-set environment. In this paper, we propose a distributed clustering approach to address the speaker count problem. The separability of the speakers is computed using statistical pitch parameters. The proposed solution uses the multiple microphones available in smartphones across a large geographical area to capture audio samples and extract statistical pitch features from them. These features are shared between the nodes to estimate the number of speakers in the neighborhood. One of the major challenges is to reduce the error count that arises due to the proximity of the users and the multiple microphones. We evaluate the algorithm’s performance using real smartphones in a multi-group arrangement by capturing parallel conversations between users in both indoor and outdoor scenarios. The average error count distance is 1.667 in a multi-group scenario. The average error count distance in indoor environments is 16% better than in the outdoor environment.
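
The idea of sharing statistical pitch features between nodes and clustering them to estimate a speaker count can be illustrated with a minimal Python sketch. It is not the authors' algorithm. The choice of mean and standard deviation as the pitch statistics, the greedy threshold clustering, the 15 Hz threshold, the convention that 0 marks an unvoiced frame, the sample pitch tracks, and the reading of "error count distance" as the absolute difference between estimated and actual counts are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def pitch_features(pitch_track_hz):
    """Summarise one audio segment as (mean, std) of its voiced pitch values.

    Assumption: a value of 0 marks an unvoiced frame and is ignored.
    """
    voiced = [f for f in pitch_track_hz if f > 0]
    return (mean(voiced), pstdev(voiced))


def count_speakers(feature_vectors, threshold=15.0):
    """Greedy threshold clustering (an illustrative stand-in, not the paper's method).

    A feature vector joins the first cluster whose centroid lies within
    `threshold`; otherwise it starts a new cluster. The number of clusters
    is returned as the estimated speaker count.
    """
    clusters = []  # each entry: [sum_of_means, sum_of_stds, member_count]
    for mu, sigma in feature_vectors:
        for c in clusters:
            centroid = (c[0] / c[2], c[1] / c[2])
            if math.dist(centroid, (mu, sigma)) <= threshold:
                c[0] += mu
                c[1] += sigma
                c[2] += 1
                break
        else:
            clusters.append([mu, sigma, 1])
    return len(clusters)


def error_count_distance(estimated, actual):
    """Assumed definition: absolute difference between the two counts."""
    return abs(estimated - actual)


# Hypothetical pitch tracks (Hz) captured by two nearby nodes. The first
# segment of each node comes from the same speaker, heard by both microphones.
node_a = [[118, 122, 0, 120, 125], [210, 0, 215, 205, 220]]
node_b = [[119, 121, 123, 0, 117], [165, 160, 0, 170, 168]]

# Nodes share features rather than raw audio, then cluster the pooled set.
shared = [pitch_features(track) for track in node_a + node_b]
estimate = count_speakers(shared)
print(estimate, error_count_distance(estimate, actual=3))
```

In this toy setup, pooling the features before clustering lets the speaker captured by both microphones fall into a single cluster instead of being counted twice, which is the kind of over-counting the abstract describes as a major challenge.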

Highlights

Advancements in smart devices have increased demand for applications that can deliver customized user experiences in real time, such as finding nearby restaurants with a high rating
We evaluated the performance of the proposed distributed architecture for both single- and multi-group scenarios
We considered our lab and the university canteen during the peak hours of lunchtime for maximum background noise

Summary

Introduction

Advancements in smart devices have increased demand for applications that can deliver customized user experiences in real time, such as finding nearby restaurants with a high rating. Determining the number of speakers in a conversation is one such attribute, commonly referred to as the speaker count problem. It can be useful for applications such as real-time ranking systems [1]. Ranking algorithms statistically quantify users’ feedback to rank the popularity or usefulness of a product or an object. These ranking systems are useful for determining the popularity of a restaurant or movie. Applications: The real-time distributed speaker count architecture can be used in a restaurant, movie theater, or shopping mall to rank the popularity of an event, object, or place. This is based on the assumption that a place’s popularity is directly related to the number of people present nearby. The methodology can also be used to determine audience participation in a lecture room, for example to analyze the popularity of lectures in a university.

Objectives

Results

Conclusion