Abstract

Dynamic vision sensor (DVS) cameras enable energy-activity-proportional visual sensing by propagating only the events produced by changes in the observed scene. Because these events are generated asynchronously, DVS cameras offer μs-scale latency while eliminating the redundant data transmission inherent to classical, frame-based cameras. However, the potential of DVS to improve the energy efficiency of IoT sensor nodes can only be fully realized with efficient and flexible systems that tightly integrate sensing, processing, and actuation capabilities. In this paper, we propose a complete end-to-end pipeline for DVS event data classification implemented on the Kraken parallel ultra-low power (PULP) system-on-chip and apply it to gesture recognition. A dedicated on-chip peripheral interface for DVS cameras aggregates the received events into ternary event frames. We process these event frames with a fully ternarized two-stage temporal convolutional network (TCN). The neural network can be executed either on Kraken's PULP cluster of general-purpose RISC-V cores or on CUTIE, the on-chip ternary neural network accelerator. We perform extensive ablations on network structure, training, and data generation parameters. We achieve a validation accuracy of 97.7% on the 11-class DVS128 gesture dataset, a new record for embedded implementations. With in-silicon power and energy measurements, we demonstrate a classification energy of 7 μJ at a latency of 0.9 ms when running the TCN on CUTIE, a 67× reduction in inference energy compared to the state of the art in embedded gesture recognition. The processing system consumes as little as 4.7 mW in continuous inference, enabling always-on gesture recognition and closing the gap between the efficiency potential of DVS cameras and application scenarios.
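To illustrate the ternary event-frame aggregation described above, the following is a minimal NumPy sketch, not the on-chip implementation: on Kraken this step is performed in hardware by the dedicated DVS peripheral. The function name, the (timestamp, x, y, polarity) event layout, and the window length are assumptions for illustration only.

```python
import numpy as np

def events_to_ternary_frames(events, width=128, height=128, window_us=10_000):
    """Aggregate a stream of DVS events into ternary event frames.

    `events` is assumed to be an (N, 4) array of (timestamp_us, x, y, polarity)
    rows with polarity in {-1, +1}. Within each fixed time window, the net
    polarity at every pixel is reduced to {-1, 0, +1} by taking its sign.
    """
    events = np.asarray(events)
    if events.size == 0:
        return np.zeros((0, height, width), dtype=np.int8)

    t0 = events[:, 0].min()
    # Index of the time window each event falls into.
    win_idx = ((events[:, 0] - t0) // window_us).astype(int)
    n_windows = int(win_idx.max()) + 1

    frames = np.zeros((n_windows, height, width), dtype=np.int32)
    # Accumulate signed polarities per pixel and window.
    np.add.at(
        frames,
        (win_idx, events[:, 2].astype(int), events[:, 1].astype(int)),
        events[:, 3].astype(int),
    )
    # Ternarize: net positive activity -> +1, net negative -> -1, no activity -> 0.
    return np.sign(frames).astype(np.int8)
```

Each resulting frame is a height × width map of {-1, 0, +1} values, which matches the ternary input domain expected by a ternarized network such as the TCN executed on CUTIE.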
