Abstract

With the rapid development of flexible vision sensors and visual sensor networks, computer vision tasks such as object detection and tracking are entering a new phase. Accordingly, more challenging comprehensive tasks, including instance segmentation, can develop rapidly. Most state-of-the-art network frameworks for instance segmentation are based on Mask R-CNN (mask region-convolutional neural network). However, experimental results confirm that Mask R-CNN does not always successfully predict instance details. The scale-invariant fully convolutional network structure of Mask R-CNN ignores the difference in spatial information between receptive fields of different sizes: a large-scale receptive field focuses more on detailed information, whereas a small-scale receptive field focuses more on semantic information. Consequently, the network cannot consider the relationship between the pixels at the object edge, and these pixels are misclassified. To overcome this problem, Mask-Refined R-CNN (MR R-CNN) is proposed, in which the stride of ROIAlign (region of interest align) is adjusted. In addition, the original fully convolutional layer is replaced with a new semantic segmentation layer that realizes feature fusion by constructing a feature pyramid network and summing the forward and backward transmissions of feature maps of the same resolution. The segmentation accuracy is substantially improved by combining the feature layers that focus on global and detailed information. Experimental results on the COCO (Common Objects in Context) and Cityscapes datasets demonstrate that the segmentation accuracy of MR R-CNN is about 2% higher than that of Mask R-CNN with the same backbone. The average precision on large instances reaches 56.6%, which is higher than those of all state-of-the-art methods. In addition, the proposed method has low time cost and is easy to implement.
The experiments on the Cityscapes dataset also prove that the proposed method has great generalization ability.
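The fusion step described above, summing feature maps of the same resolution so that detail-oriented and semantics-oriented features are combined, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: real networks apply learned 1x1 lateral convolutions before the sum; here nearest-neighbor upsampling and matching channel counts are assumed, and the shapes are illustrative.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse(shallow, deep):
    # Element-wise sum of a shallow, detail-rich map with an upsampled
    # deep, semantics-rich map, after bringing both to the same resolution.
    return shallow + upsample2x(deep)

# Illustrative shapes: a 28x28 shallow map and a 14x14 deep map.
shallow = np.random.rand(256, 28, 28)
deep = np.random.rand(256, 14, 14)
fused = fuse(shallow, deep)
assert fused.shape == (256, 28, 28)
```

The element-wise sum keeps the channel count fixed, so the fused map can be fed to the next segmentation layer without any reshaping.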

Highlights

  • Thanks to the rapid development of flexible vision sensors and visual sensor networks, computer vision has entered a new phase

  • MR R-CNN is trained on the COCO 2017 dataset [24]

  • COCO is a large dataset released by Microsoft in 2014 that can be used for computer vision tasks such as object detection, semantic segmentation, keypoint detection, and instance segmentation



Introduction

Sensors 2020, 20, 1010

Thanks to the rapid development of flexible vision sensors and visual sensor networks, computer vision has entered a new phase. Many vision applications build on underlying algorithms such as instance segmentation [5,6], image classification [7,8,9], object localization [10,11,12,13], and semantic segmentation [14,15,16,17]. For sensors that use instance segmentation as the underlying algorithm, such as sensors for autonomous driving and 3D vision, refined segmentation helps improve driving safety, the quality of 3D reconstruction, etc. The current instance segmentation models with satisfactory results can be regarded as extensions of two-stage detection, in which Faster R-CNN (faster region-convolutional neural network) [11] and FPN (feature pyramid network) [20] are the basis. Faster R-CNN provides a network foundation with high detection accuracy, and FPN is a common method of combining layer features to improve network performance.
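FPN's way of combining layer features is a top-down pass: starting from the deepest backbone output, each level is upsampled and summed with the next shallower level. The following is a minimal NumPy sketch of that pass under simplifying assumptions (2x stride between levels, equal channel counts, nearest-neighbor upsampling instead of the learned lateral 1x1 convolutions used in the actual FPN).

```python
import numpy as np

def build_top_down(c_maps):
    """Build a fused pyramid from backbone maps ordered shallow -> deep.

    Each map is (C, H, W), with H and W halving at every level.
    The deepest map is passed through; every shallower level is
    summed with the 2x-upsampled level above it (top-down fusion).
    """
    p = [None] * len(c_maps)
    p[-1] = c_maps[-1]
    for i in range(len(c_maps) - 2, -1, -1):
        up = p[i + 1].repeat(2, axis=1).repeat(2, axis=2)
        p[i] = c_maps[i] + up
    return p

# Illustrative three-level backbone output.
c_maps = [np.random.rand(8, 32, 32),
          np.random.rand(8, 16, 16),
          np.random.rand(8, 8, 8)]
pyramid = build_top_down(c_maps)
```

Each output level keeps its input resolution, so detectors can pick the pyramid level whose scale matches the object being segmented.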

