Abstract

Semantic segmentation is an important part of scene understanding in autonomous systems. Accurate segmentation results help the autonomous system better understand its surrounding environment, which improves the task success rate and enhances system reliability. However, due to the hardware limitations of frame-based cameras, semantic segmentation based on RGB images under low-light conditions remains a difficult problem. The event camera, an emerging bio-inspired vision sensor, can fill this gap with its high dynamic range. Most previous works resort to cross-modality knowledge distillation from a pretrained image-based teacher network to train an event-based network for semantic segmentation. However, direct knowledge distillation is inadequate because the two networks take different input modalities: supervising event features with image features cannot make full use of the knowledge in the teacher network. We therefore propose a Modality Translation and Fusion (MTF) framework to distill diverse cross-modality knowledge. Specifically, we first develop a Modality Translation (MT) module to convert events into the image modality. With different input modalities, two feature extractors are built to learn diverse and complementary knowledge from the teacher network. Then, to better utilize this knowledge, we propose a Residual-based Coordinate Attention Fusion (RCAF) module to fuse the multi-scale features from the two modalities. Finally, extensive experiments show that MTF is superior to state-of-the-art (SOTA) approaches.
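
The sketch below illustrates one way the described pipeline could be wired up: an MT module that translates an event voxel grid into an image-like tensor, two feature extractors (one per modality), and an RCAF-style fusion of multi-scale features that could then be supervised by a frozen image-based teacher. All module designs, channel sizes, and the use of PyTorch are assumptions for illustration only, not the paper's actual implementation.

```python
# Minimal sketch of an MTF-style student, assuming PyTorch and hypothetical module designs.
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(in_ch, out_ch, stride=2):
    """3x3 conv + BN + ReLU, standing in for a real backbone stage."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class ModalityTranslation(nn.Module):
    """MT module (hypothetical): maps an event voxel grid to an image-like tensor."""
    def __init__(self, event_bins=5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(event_bins, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, 3, padding=1),  # 3 output channels, resembling an RGB frame
        )

    def forward(self, events):
        return self.net(events)


class RCAFusion(nn.Module):
    """RCAF-style fusion (hypothetical): residual fusion gated by coordinate-wise attention."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
        )

    def forward(self, feat_event, feat_image):
        # Pool along H and W separately (coordinate-attention flavour), then gate.
        pooled = F.adaptive_avg_pool2d(feat_image, (feat_image.size(2), 1)) \
               + F.adaptive_avg_pool2d(feat_image, (1, feat_image.size(3)))
        gate = self.attn(pooled)
        return feat_event + gate * feat_image  # residual fusion of the two modalities


class MTFStudent(nn.Module):
    """Student with two extractors (event branch + translated-image branch) and fusion."""
    def __init__(self, event_bins=5, num_classes=19):
        super().__init__()
        self.mt = ModalityTranslation(event_bins)
        self.event_enc = nn.ModuleList([conv_block(event_bins, 64), conv_block(64, 128)])
        self.image_enc = nn.ModuleList([conv_block(3, 64), conv_block(64, 128)])
        self.fuse = nn.ModuleList([RCAFusion(64), RCAFusion(128)])
        self.head = nn.Conv2d(128, num_classes, 1)

    def forward(self, events):
        translated = self.mt(events)            # events -> image modality
        fused, fe, fi = [], events, translated
        for e_stage, i_stage, fuse in zip(self.event_enc, self.image_enc, self.fuse):
            fe, fi = e_stage(fe), i_stage(fi)
            fused.append(fuse(fe, fi))          # multi-scale cross-modality fusion
        logits = self.head(fused[-1])
        return logits, fused                    # fused features could be matched to teacher features


if __name__ == "__main__":
    student = MTFStudent()
    voxel = torch.randn(2, 5, 128, 256)         # batch of event voxel grids
    logits, feats = student(voxel)
    print(logits.shape, [f.shape for f in feats])
```

In this sketch the fused multi-scale features are returned alongside the logits so that a distillation loss against a pretrained image-based teacher could be applied at each scale; the specific loss and teacher architecture are left out, as the abstract does not specify them.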
