Abstract

Audio wearable devices, or hearables, are becoming an increasingly popular consumer product. Some hearables contain an in-ear microphone to capture audio signals inside the user’s occluded ear canal. The microphone is mainly used to pick up speech in noisy environments, but it can also capture other signals, such as nonverbal events, that could be used to interact with the device or a computer. Teeth or tongue clicking could enable discreet interaction with a device, while coughing or throat-clearing sounds could be used to monitor the user’s health. In this paper, 10 human-produced nonverbal audio events are detected and classified in real time with a classifier based on the Bag-of-Audio-Words algorithm. To build this algorithm, different clustering and classification methods are compared. Mel-Frequency Cepstral Coefficient features are used alongside Auditory-inspired Amplitude Modulation features and Per-Channel Energy Normalization features. To combine the different features, concatenation at the input level is compared with concatenation at the histogram level. The real-time detector is built using the detection-by-classification technique, classifying on a 400 ms window with 75% overlap. The detector is tested in a controlled noisy environment on 10 subjects. The classifier achieved a sensitivity of 81.5%, while the detector using the same classifier achieved a sensitivity of 69.9% in a quiet environment.
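The Bag-of-Audio-Words pipeline summarized above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature frames here are random stand-ins for the MFCC/AAM/PCEN features, and the codebook size, frame counts, and k-means settings are illustrative assumptions.

```python
# Hedged sketch of a Bag-of-Audio-Words (BoAW) representation:
# 1) cluster training frames into a codebook of "audio words",
# 2) quantize each frame of a window to its nearest word,
# 3) summarize the window as a normalized word histogram that a
#    classifier (e.g. an SVM) can consume.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for per-frame features (e.g. 13-dim MFCCs); in the paper these
# would come from short frames inside each 400 ms analysis window.
n_frames, n_dims = 500, 13
train_frames = rng.normal(size=(n_frames, n_dims))

# Learn the codebook by clustering training frames (size is an assumption).
n_words = 16
codebook = KMeans(n_clusters=n_words, n_init=10, random_state=0)
codebook.fit(train_frames)

def boaw_histogram(window_frames: np.ndarray) -> np.ndarray:
    """Map each frame to its nearest audio word and return the
    normalized histogram: a fixed-length vector per window."""
    words = codebook.predict(window_frames)
    hist = np.bincount(words, minlength=n_words).astype(float)
    return hist / hist.sum()

# One window's variable number of frames becomes one fixed-length vector.
h = boaw_histogram(train_frames[:40])
```

Histogram-level feature combination, as compared in the paper, would correspond to building one such histogram per feature type and concatenating the histograms, rather than concatenating the raw feature vectors before clustering.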
