Abstract
Event-based object recognition has drawn increasing attention owing to the distinctive advantages of event cameras, namely low power consumption and high dynamic range. For this new modality, previous works built on handcrafted low-level descriptors are vulnerable to noise and generalize poorly. Recent works instead design deep neural networks to extract event features. However, they either lack sufficient data to fully train the event-based model or, because they rely on a single-view network, fail to encode spatial and temporal cues simultaneously. In this work, we address these limitations by proposing a multi-view attention-aware network. It projects an event stream onto multi-view 2D maps, which lets it leverage well-trained 2D models and exploit complementary spatio-temporal information. In addition, an attention mechanism strengthens the complementary information across the different streams for better joint inference. Comprehensive experiments show that our model clearly outperforms state-of-the-art methods and confirm the efficacy of our multi-view fusion framework for event data.
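The abstract does not state which 2D projections or which attention design the network uses. The following is a minimal sketch of the general idea under two illustrative assumptions. First, an event stream of `(x, y, t, polarity)` tuples is turned into three count maps (the spatial x-y plane and the temporal t-x and t-y planes). Second, per-view feature vectors are fused with scaled dot-product attention. The names `project_multi_view`, `view_features`, `attention_fuse` and the `query` vector are hypothetical. `view_features` is only a placeholder for a pretrained 2D backbone and is not the authors' model.

```python
import math
from typing import Dict, List, Sequence, Tuple

Event = Tuple[int, int, float, int]  # (x, y, t, polarity), assumed layout
Grid = List[List[float]]


def project_multi_view(events: Sequence[Event], width: int, height: int,
                       t_bins: int) -> Dict[str, Grid]:
    """Accumulate event counts on three 2D planes (an assumed choice of views)."""
    xy = [[0.0] * width for _ in range(height)]   # rows = y, cols = x
    tx = [[0.0] * width for _ in range(t_bins)]   # rows = time bin, cols = x
    ty = [[0.0] * height for _ in range(t_bins)]  # rows = time bin, cols = y
    if not events:
        return {"xy": xy, "tx": tx, "ty": ty}
    t0 = min(e[2] for e in events)
    span = (max(e[2] for e in events) - t0) or 1.0  # avoid division by zero
    for x, y, t, _p in events:                      # polarity ignored here
        b = min(int((t - t0) / span * t_bins), t_bins - 1)
        xy[y][x] += 1.0
        tx[b][x] += 1.0
        ty[b][y] += 1.0
    return {"xy": xy, "tx": tx, "ty": ty}


def view_features(grid: Grid) -> List[float]:
    """Hypothetical stand-in for a 2D backbone: [total, peak, occupancy]."""
    cells = [v for row in grid for v in row]
    occupancy = sum(1 for v in cells if v > 0) / len(cells)
    return [sum(cells), max(cells), occupancy]


def attention_fuse(features: List[List[float]],
                   query: List[float]) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention over views: softmax(q . f_i / sqrt(d))."""
    d = len(query)
    scores = [sum(q * f for q, f in zip(query, feat)) / math.sqrt(d)
              for feat in features]
    m = max(scores)                                 # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    fused = [sum(w * feat[k] for w, feat in zip(weights, features))
             for k in range(d)]
    return weights, fused


# Usage: three toy events on a 2x2 sensor, two time bins.
events = [(0, 0, 0.0, 1), (1, 0, 0.5, -1), (1, 1, 1.0, 1)]
views = project_multi_view(events, width=2, height=2, t_bins=2)
feats = [view_features(views[k]) for k in ("xy", "tx", "ty")]
weights, fused = attention_fuse(feats, query=[0.1, 0.1, 0.1])
```

In this sketch each view preserves a different slice of the same stream. The x-y map keeps spatial shape but discards timing, while the t-x and t-y maps keep temporal order. The softmax weights let the fusion step favour whichever view scores highest against the query. In a trained network the query and the per-view features would be learned rather than fixed.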
More From: IEEE Transactions on Circuits and Systems for Video Technology