Despite advances in audio- and motion-based human activity recognition (HAR), a practical, power-efficient, and privacy-sensitive activity recognition system has remained elusive. State-of-the-art systems often depend on audio data that is both power-hungry to capture and privacy-invasive, which is especially challenging for resource-constrained wearables such as smartwatches. To avoid the need for an always-on audio-based activity classifier, we first use power- and compute-optimized IMUs, sampled at 50 Hz, as a trigger for detecting activity events. Once an event is detected, a multimodal deep learning model augments the motion data with audio captured on the smartwatch. We subsample this audio to rates ≤ 1 kHz, rendering spoken content unintelligible while also reducing power consumption on mobile devices. Our multimodal model achieves a recognition accuracy of 92.2% across 26 daily activities in four indoor environments. Our findings show that subsampling audio from 16 kHz down to 1 kHz, in concert with motion data, does not significantly reduce inference accuracy. We also analyze the intelligibility of speech content and the power requirements of audio sampled below 1 kHz, and demonstrate that the proposed approach can improve the practicality of human activity recognition systems.
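The abstract does not specify how the 16 kHz → 1 kHz subsampling is performed; the following is a minimal sketch, assuming offline processing in Python with SciPy's polyphase resampler (the function name `subsample_audio` is illustrative and not from the paper). Decimating with an anti-aliasing filter removes all content above the new Nyquist frequency (500 Hz at a 1 kHz rate), which is what degrades speech intelligibility while preserving low-frequency activity cues.

```python
import numpy as np
from scipy.signal import resample_poly

def subsample_audio(x: np.ndarray, orig_sr: int = 16_000, target_sr: int = 1_000) -> np.ndarray:
    """Resample a mono audio buffer from orig_sr down to target_sr.

    resample_poly applies an anti-aliasing FIR filter before decimation,
    so content above target_sr / 2 (500 Hz here) is removed.
    """
    g = int(np.gcd(orig_sr, target_sr))
    return resample_poly(x, up=target_sr // g, down=orig_sr // g)

# Example: one second of 16 kHz audio -> 1,000 samples at 1 kHz.
audio_16k = np.random.randn(16_000).astype(np.float32)  # placeholder buffer
audio_1k = subsample_audio(audio_16k)
assert audio_1k.shape[0] == 1_000
```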