Abstract

As a key technology for scene understanding, real-time semantic segmentation has become an important topic in computer vision in recent years. However, some lightweight networks designed for real-time semantic segmentation have only limited receptive fields, so they cannot effectively perceive multi-scale objects in images. In addition, although fast-downsampling feature extraction networks reduce computation, they lose detailed spatial information, which degrades prediction accuracy. In this paper, we propose an efficient lightweight semantic segmentation network called BFMNet to address these issues. First, we use a lightweight bilateral structure to encode semantic and detailed information from images separately, introducing feature interactions during the encoding process. Furthermore, we design a novel Multi-Scale Context Aggregation Module (MSCAM) that helps the network perceive multi-scale objects, which is crucial for semantic segmentation. Finally, we introduce a new fusion module (AEFM) that uses attention to facilitate bilateral feature fusion. Our network achieves competitive results on three popular semantic segmentation benchmarks: Cityscapes, CamVid, and COCO-Stuff. Specifically, on a single 2080Ti GPU, our network yields 77.7% mIoU at 63.7 FPS on the Cityscapes test set at an input resolution of 768×1536. Considering the speed-accuracy trade-off, we also report results at a 1024×2048 input resolution: 78.9% mIoU at 31.4 FPS. On the CamVid test set, our network achieves 75.6% mIoU at 95.8 FPS, and on the COCO-Stuff validation set, it achieves 31.2% mIoU.
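To make the multi-scale context idea concrete, the sketch below shows a generic multi-scale context aggregation block in PyTorch. The abstract does not specify MSCAM's internals, so the branch design here (parallel dilated 3×3 convolutions plus a global-pooling branch, fused by a 1×1 convolution) is an illustrative assumption in the spirit of the module's stated goal, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleContextBlock(nn.Module):
    """Illustrative multi-scale context aggregation (hypothetical; not the
    paper's MSCAM). Parallel dilated 3x3 branches enlarge the receptive
    field at several rates, a global-pooling branch adds image-level
    context, and a 1x1 convolution fuses the concatenated branches."""

    def __init__(self, in_ch, out_ch, rates=(1, 2, 4)):
        super().__init__()
        # One dilated-convolution branch per rate.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        # Image-level context via global average pooling.
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.ReLU(inplace=True),
        )
        # 1x1 fusion of all branches back to out_ch channels.
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [branch(x) for branch in self.branches]
        # Upsample the pooled branch back to the input's spatial size.
        g = F.interpolate(self.global_branch(x), size=(h, w),
                          mode='bilinear', align_corners=False)
        feats.append(g)
        return self.fuse(torch.cat(feats, dim=1))

if __name__ == "__main__":
    block = MultiScaleContextBlock(128, 128)
    y = block(torch.randn(1, 128, 24, 48))
    print(y.shape)  # torch.Size([1, 128, 24, 48])
```

Because each dilated branch sees context at a different effective scale, concatenating them lets a single block respond to both small and large objects, which matches the receptive-field limitation the abstract identifies in lightweight networks.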
