Abstract

Lip reading is a visual alternative that can enhance the intelligibility of traditional speech recognition, and it can benefit from retina-like event cameras that focus on dynamic movements. Spiking Neural Networks (SNNs) are inherently well-suited to cooperate with event cameras. However, employing SNNs for event-based lip reading presents significant challenges, particularly in effectively extracting spatio-temporal features from input events and in distinguishing between phonetically similar elements. To address these challenges, this paper proposes a novel event-based lip-reading model that leverages the SNN framework, enriched by a designed Spatial-Temporal Attention Block (STAB) and a triplet loss. Specifically, STAB comprises spatial and temporal attention branches that dynamically emphasize the spatial and temporal characteristics most reflective of lip movements, together with a fusion mechanism that integrates these spatial and temporal cues into a comprehensive and focused representation of lip movements. In addition, we incorporate a triplet loss into the SNN training process, further improving the model's ability to distinguish between visually similar words. Experimental results demonstrate the superior performance of our SNN and validate the effectiveness of STAB and the triplet loss. We also conduct an energy consumption analysis to confirm the model's energy efficiency.
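
The abstract describes STAB only at a high level (a spatial attention branch, a temporal attention branch, and a fusion step) and a triplet loss used during training. The sketch below is a hypothetical, minimal PyTorch illustration of that two-branch-plus-fusion idea, not the authors' implementation; all names, tensor shapes, and layer choices (STABSketch, the (B, T, C, H, W) layout, the 7x7 spatial convolution, the 0.2 triplet margin) are assumptions made purely for illustration.

    # Hypothetical sketch (not the paper's code): a spatial-temporal attention block
    # over event features shaped (B, T, C, H, W), plus a triplet-loss term.
    import torch
    import torch.nn as nn

    class STABSketch(nn.Module):
        """Toy spatial-temporal attention: two branches followed by a fusion step."""
        def __init__(self, channels: int):
            super().__init__()
            # Spatial branch: per-pixel weights from channel statistics.
            self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
            # Temporal branch: per-timestep weights from globally pooled features.
            self.temporal_fc = nn.Sequential(
                nn.Linear(channels, channels // 4),
                nn.ReLU(inplace=True),
                nn.Linear(channels // 4, 1),
            )
            # Fusion: 1x1 conv mixing the two attended feature maps.
            self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            B, T, C, H, W = x.shape
            flat = x.reshape(B * T, C, H, W)

            # Spatial attention: avg/max over channels -> 2xHxW descriptor -> sigmoid map.
            desc = torch.cat([flat.mean(1, keepdim=True), flat.amax(1, keepdim=True)], dim=1)
            s_att = torch.sigmoid(self.spatial_conv(desc))            # (B*T, 1, H, W)
            spatial_out = (flat * s_att).reshape(B, T, C, H, W)

            # Temporal attention: global-average-pool each frame -> per-timestep weight.
            pooled = x.mean(dim=(3, 4))                               # (B, T, C)
            t_att = torch.softmax(self.temporal_fc(pooled), dim=1)    # (B, T, 1)
            temporal_out = x * t_att.unsqueeze(-1).unsqueeze(-1)

            # Fusion of the two branches back to the original channel width.
            both = torch.cat([spatial_out, temporal_out], dim=2)      # (B, T, 2C, H, W)
            fused = self.fuse(both.reshape(B * T, 2 * C, H, W))
            return fused.reshape(B, T, C, H, W)

    # Example usage with made-up sizes: 2 clips, 30 time bins, 32 channels.
    block = STABSketch(channels=32)
    feats = torch.randn(2, 30, 32, 22, 44)
    out = block(feats)                                                # same shape, re-weighted

    # Triplet loss on clip-level embeddings, pushing visually similar words apart.
    triplet = nn.TripletMarginLoss(margin=0.2)
    anchor, positive, negative = (torch.randn(4, 128) for _ in range(3))
    aux_loss = triplet(anchor, positive, negative)  # added to the usual classification loss

In an actual SNN pipeline the attended features would feed spiking neuron layers and the triplet term would be combined with the classification loss; the sketch only conveys how two attention branches and a fusion step might be wired together.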
