Abstract

In recent years, machine learning techniques have produced state-of-the-art results in several audio-related tasks. The success of these approaches is largely due to the availability of large open-source datasets and advances in computational resources. A shortcoming of these methods, however, is that they often fail to generalize to real-life scenarios because of domain mismatch. One such task is foreground speech detection from wearable audio devices. Interfering factors such as dynamically varying environmental conditions, including background speakers, TV, or radio audio, make foreground speech detection a challenging task. Moreover, obtaining precise moment-to-moment annotations of audio streams for analysis and model training is time-consuming and costly. In this work, we use multiple instance learning (MIL) to facilitate the development of such models from coarse labels, i.e., annotations available only at a lower time resolution. We show how MIL can be applied to localize foreground speech in coarsely labeled audio and report both bag-level and instance-level results. We also study different pooling methods and how they can be adapted to the densely distributed events observed in our application. Finally, we show improvements from using speech activity detection embeddings as features for foreground speech detection.
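As an illustrative sketch only (not the authors' implementation), the snippet below shows how a bag-level posterior can be obtained from instance-level (frame-level) posteriors under a few common MIL pooling operators; the frame classifier, feature dimension, and pooling choices are placeholder assumptions.

```python
# Illustrative MIL setup for coarsely labeled audio: each "bag" is a clip with a
# single coarse label, each "instance" is a short frame. Only bag-level labels
# are available for training; instance posteriors are pooled into a bag posterior.
import torch
import torch.nn as nn

class FrameClassifier(nn.Module):
    """Per-frame (instance-level) foreground-speech scorer; architecture is hypothetical."""
    def __init__(self, feat_dim=40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (batch, frames, feat_dim)
        return self.net(x).squeeze(-1)    # instance posteriors: (batch, frames)

def pool_instances(p, method="max"):
    """Aggregate instance posteriors into one bag-level posterior per clip."""
    if method == "max":                   # classic MIL assumption: one positive frame suffices
        return p.max(dim=1).values
    if method == "mean":                  # smoother choice for densely distributed events
        return p.mean(dim=1)
    if method == "linear-softmax":        # weights each frame by its own posterior
        return (p * p).sum(dim=1) / p.sum(dim=1).clamp(min=1e-8)
    raise ValueError(method)

# Toy usage with random features and coarse (bag-level) labels.
model = FrameClassifier()
feats = torch.randn(8, 300, 40)           # 8 clips x 300 frames x 40-dim features
bag_labels = torch.randint(0, 2, (8,)).float()
bag_posterior = pool_instances(model(feats), method="linear-softmax")
loss = nn.functional.binary_cross_entropy(bag_posterior, bag_labels)
loss.backward()
```

Only the bag-level label enters the loss; the instance-level posteriors learned along the way are what allow foreground speech to be localized within each coarsely labeled clip.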

Highlights

  • Wearable devices are used widely in a variety of health- and lifestyle-related applications, from tracking personal fitness to monitoring patients suffering from physical and mental ailments

  • We propose a method for localizing foreground speech within audio clips using multiple instance learning

  • Since this model is trained on frame-level (10 ms duration) features, we use the max operation to aggregate frame-level posteriors to obtain utterance predictions, as in the sketch following this list

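A minimal sketch of this aggregation step, assuming frame-level posteriors at a 10 ms hop (array sizes and the 0.5 threshold are illustrative, not from the paper):

```python
# Max-pool frame-level posteriors into an utterance-level (bag-level) prediction;
# the per-frame posteriors themselves localize foreground speech within the clip.
import numpy as np

frame_posteriors = np.random.rand(500)         # e.g. 500 frames = 5 s of audio at 10 ms/frame
utterance_score = frame_posteriors.max()       # bag-level posterior via max pooling
utterance_label = int(utterance_score > 0.5)   # clip-level foreground/background decision

# Instance-level localization: which 10 ms frames are flagged as foreground speech.
foreground_frames = np.flatnonzero(frame_posteriors > 0.5)
foreground_times_s = foreground_frames * 0.01  # frame index -> time in seconds
```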

Introduction

Wearable devices are used widely in a variety of health- and lifestyle-related applications, from tracking personal fitness to monitoring patients suffering from physical and mental ailments. Audio signals can provide important cues about a person’s environment, their speech communication, and social interaction patterns [3]. The quantity and quality of communication and social interactions have been shown to be linked to a person’s well-being, happiness, and overall sense of life satisfaction [4, 5]. Multiple wearable technologies aimed at obtaining unobtrusive audio recordings in natural, non-laboratory, real-world conditions have been proposed [8, 9, 10]. In such an egocentric setting, we are typically interested in detecting and analyzing speech uttered by the participant wearing the device, which is commonly referred to as foreground speech [11].
