Referring expression segmentation (RES), the task of localizing specific instance-level objects from free-form linguistic descriptions, has emerged as a crucial frontier in human–AI interaction. It demands an intricate understanding of both visual and textual context and typically requires extensive training data. This paper introduces RESMatch, the first semi-supervised learning (SSL) approach for RES, aimed at reducing reliance on exhaustive data annotation. RESMatch can also leverage the abundance of image–text paired data available in the current era of large-model training, improving RES performance without costly semantic annotation, as our experiments confirm. Although existing SSL techniques are effective for image segmentation, we find that they fall short on RES. To address challenges such as the comprehension of free-form linguistic descriptions and the variability of object attributes, RESMatch introduces three adaptations: revised strong perturbation, text augmentation, and adjustments for pseudo-label quality and strong–weak supervision. RESMatch achieves state-of-the-art (SOTA) results across a variety of experimental settings. This pioneering work lays the groundwork for future research on semi-supervised learning for referring expression segmentation.
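For readers unfamiliar with strong–weak consistency training, the sketch below illustrates the general pattern the abstract alludes to: pseudo-labels are produced on a weakly perturbed view and used to supervise a strongly perturbed view with an augmented expression. All names (model, weak_aug, strong_aug, text_aug) and the confidence threshold TAU are illustrative assumptions, not RESMatch's actual implementation.

```python
# Hypothetical sketch of one unlabeled-sample step in strong-weak consistency
# training for referring expression segmentation; not the authors' code.
import torch
import torch.nn.functional as F

TAU = 0.9  # assumed confidence threshold for pseudo-label filtering


def unsup_step(model, image, expression, weak_aug, strong_aug, text_aug):
    """Pseudo-label the weakly perturbed view, supervise the strongly perturbed view."""
    with torch.no_grad():
        # Weak view: per-pixel foreground probabilities from the current model.
        weak_logits = model(weak_aug(image), expression)        # shape (1, H, W)
        prob = torch.sigmoid(weak_logits)
        pseudo_mask = (prob > 0.5).float()                      # hard pseudo-label
        # Keep only pixels where the prediction is confidently fg or bg.
        confident = ((prob > TAU) | (prob < 1.0 - TAU)).float()

    # Strong view: perturbed image plus augmented expression is pushed
    # toward the pseudo-label on the confident pixels only.
    strong_logits = model(strong_aug(image), text_aug(expression))
    loss = F.binary_cross_entropy_with_logits(
        strong_logits, pseudo_mask, reduction="none"
    )
    return (loss * confident).sum() / confident.sum().clamp(min=1.0)
```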