Abstract

Instance segmentation of desktop objects is important for service robots. However, most previous works on desktop environments are restricted to segmenting only the visible regions of target objects. When a target object is placed behind another, an algorithm that performs only visible-region segmentation cannot provide accurate appearance information for the occluded object. To solve this problem, we propose an invisible–visible query guided amodal mask measurement network based on a hierarchical transformer for desktop scenes, which can perceive the entire appearance of objects in the presence of occlusions. In this method, an RGB-D backbone fuses features from the RGB and depth images, and a pixel decoder then generates multi-scale feature maps. A hierarchical transformer decoder predicts invisible, visible, and amodal masks simultaneously. To strengthen the associations between the three prediction branches, we propose a query transform module that transfers object queries between adjacent branches. Since an amodal mask is the union of the corresponding invisible and visible masks, we propose an invisible–visible mixture loss that combines masks from the invisible and visible branches to further supervise the network. Our method is trained on synthetic datasets of desktop objects and evaluated on both visible and amodal real-world datasets. Compared with other recent segmentation algorithms, our method achieves competitive performance.
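To make the mixture-loss idea concrete, the sketch below composes an amodal prediction from the invisible- and visible-branch outputs and supervises it against the amodal ground truth. This is a minimal sketch, not the paper's stated formulation: the function name `mixture_loss`, the sigmoid mask logits, the soft-union composition rule, and the binary cross-entropy supervision are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mixture_loss(inv_logits: torch.Tensor,
                 vis_logits: torch.Tensor,
                 amodal_gt: torch.Tensor) -> torch.Tensor:
    """Hypothetical invisible-visible mixture loss (illustrative sketch).

    inv_logits, vis_logits: (B, H, W) mask logits from the invisible and
    visible branches. amodal_gt: (B, H, W) binary amodal ground truth.
    """
    inv_prob = torch.sigmoid(inv_logits)
    vis_prob = torch.sigmoid(vis_logits)
    # Soft union: a pixel belongs to the amodal mask if it is covered by
    # either the invisible or the visible prediction.
    amodal_prob = inv_prob + vis_prob - inv_prob * vis_prob
    return F.binary_cross_entropy(amodal_prob, amodal_gt.float())
```

In training, a term like this would presumably be added to the per-branch mask losses, encouraging the invisible, visible, and amodal heads to stay mutually consistent.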