Effective water management and flood prevention are critical challenges encountered by both urban and rural areas, necessitating precise and prompt monitoring of waterbodies. As a fundamental step in the monitoring process, waterbody segmentation involves precisely delineating waterbody boundaries from imagery. Previous research using satellite images often lacks the resolution and contextual detail needed for local-scale analysis. In response to these challenges, this study seeks to address them by leveraging common natural images that are more easily accessible and provide higher resolution and more contextual information compared to satellite images. However, the segmentation of waterbodies from ordinary images faces several obstacles, including variations in lighting, occlusions from objects like trees and buildings, and reflections on the water surface, all of which can mislead algorithms. Additionally, the diverse shapes and textures of waterbodies, alongside complex backgrounds, further complicate this task. While large-scale vision models have typically been leveraged for their generalizability across various downstream tasks that are pre-trained on large datasets, their application to waterbody segmentation from ground-level images remains underexplored. Hence, this research proposed the Visual Aquatic Generalist (VAGen) as a countermeasure. Being a lightweight model for waterbody segmentation inspired by visual In-Context Learning (ICL) and Visual Prompting (VP), VAGen refines large visual models by innovatively adding learnable perturbations to enhance the quality of prompts in ICL. As demonstrated by the experimental results, VAGen demonstrated a significant increase in the mean Intersection over Union (mIoU) metric, showing a 22.38% enhancement when compared to the baseline model that lacked the integration of learnable prompts. Moreover, VAGen surpassed the current state-of-the-art (SOTA) task-specific models designed for waterbody segmentation by 6.20%. The performance evaluation and analysis of VAGen indicated its capacity to substantially reduce the number of trainable parameters and computational overhead, and proved its feasibility to be deployed on cost-limited devices including unmanned aerial vehicles (UAVs) and mobile computing platforms. This study thereby makes a valuable contribution to the field of computer vision, offering practical solutions for engineering applications related to urban flood monitoring, agricultural water resource management, and environmental conservation efforts.
Read full abstract