Abstract

We present Search Anything, a novel approach to similarity search in images. In contrast to other approaches to image similarity search, Search Anything lets users issue point, box, and text prompts to search for similar regions in a set of images. The region selected by a prompt is automatically segmented, and a binary feature vector is extracted from it. This feature vector is then used as a query against an index of image regions, and the images that contain the matching regions are returned. Search Anything is trained in a self-supervised manner on mask features extracted by the FastSAM foundation model and semantic features for masked image regions extracted by the CLIP foundation model, learning binary hash code representations for image regions. By coupling these two foundation models, images can be indexed and searched at a more fine-grained level than retrieving only entire similar images. Experiments on several datasets from different domains in a zero-shot setting demonstrate the benefits of Search Anything as a versatile region-based similarity search approach for images, and qualitative results further support its efficacy. Ablation studies evaluate how the proposed combination of semantic and segmentation features, together with masking, improves the performance of Search Anything over a baseline that uses CLIP features alone. For large regions, relative improvements of up to 9.87% in mean average precision are achieved. Furthermore, considering context is beneficial when searching for small image regions; a context of three times an object's bounding box gives the best results. Finally, we measure computation time and determine storage requirements.
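To make the index-and-query flow described in the abstract concrete, the following is a minimal Python sketch of a region-level index of binary codes searched by Hamming distance. The FastSAM segmentation and CLIP feature extraction are replaced by placeholder functions, and the hash length, the random-projection hashing, and all function names are illustrative assumptions rather than the paper's learned model.

```python
# Minimal sketch of a region-level binary-hash index and query flow.
# FastSAM segmentation and CLIP feature extraction are replaced by
# placeholders; the learned hashing network of Search Anything is NOT
# reproduced here, only the overall indexing/search structure.
import numpy as np

HASH_BITS = 256  # assumed hash-code length, not taken from the paper

def extract_region_embedding(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Placeholder for the FastSAM + CLIP region feature extractor."""
    rng = np.random.default_rng(int(mask.sum()) % 2**32)
    return rng.standard_normal(512).astype(np.float32)

def to_binary_hash(embedding: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Sign of a random projection as a stand-in for the learned hash codes."""
    return (embedding @ projection > 0).astype(np.uint8)

def hamming_search(query: np.ndarray, codes: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k stored codes with the smallest Hamming distance."""
    distances = np.count_nonzero(codes != query, axis=1)
    return np.argsort(distances)[:k]

# Index: one binary code per segmented region, with a pointer back to its image.
rng = np.random.default_rng(0)
projection = rng.standard_normal((512, HASH_BITS)).astype(np.float32)
region_codes, region_to_image = [], []
for image_id in range(10):
    image = rng.random((224, 224, 3))
    for _ in range(3):  # pretend the segmenter produced three masks per image
        mask = rng.random((224, 224)) > 0.5
        code = to_binary_hash(extract_region_embedding(image, mask), projection)
        region_codes.append(code)
        region_to_image.append(image_id)
region_codes = np.stack(region_codes)

# Query: segment the prompted region, hash it, and look up similar regions.
query_image = rng.random((224, 224, 3))
query_mask = rng.random((224, 224)) > 0.5
query_code = to_binary_hash(extract_region_embedding(query_image, query_mask),
                            projection)
hits = hamming_search(query_code, region_codes)
print("images containing similar regions:", [region_to_image[i] for i in hits])
```

The sketch keeps a mapping from each stored region code back to its source image, so a nearest-code lookup returns the images that contain the matching regions, mirroring the retrieval behavior the abstract describes.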