Abstract

Inferring urban functions from street view images (SVIs) has gained tremendous momentum. The recent rise of large-scale vision-language pretrained models offers a way to address long-standing challenges in this regard, such as the heavy reliance on labeled samples and computing resources. In this paper, we present a novel prompting framework that enables the pretrained vision-language model CLIP to effectively infer fine-grained urban functions from SVIs in a zero-shot manner, that is, without labeled samples or model training. The prompting framework, UrbanCLIP, comprises an urban taxonomy and several urban function prompt templates, designed to (1) bridge the abstract urban function categories and the concrete urban object types that CLIP can readily understand, and (2) mitigate interference from irrelevant objects in SVIs, such as street-side trees and vehicles. We conduct extensive experiments to verify the effectiveness of UrbanCLIP. The results indicate that the zero-shot UrbanCLIP substantially outperforms several competitive supervised baselines, e.g., a fine-tuned ResNet, and its advantages become more prominent in cross-city transfer tests. In addition, UrbanCLIP's zero-shot performance is considerably better than that of the vanilla CLIP. Overall, UrbanCLIP is a simple yet effective framework for urban function inference, and it showcases the potential of foundation models for geospatial applications.
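To make the idea concrete, the sketch below illustrates how zero-shot urban function scoring with prompt templates could be wired up using the open-source CLIP library. The taxonomy, the prompt template, and the max-over-prompts aggregation are illustrative assumptions for demonstration only; they are not the actual UrbanCLIP taxonomy or templates described in the paper.

```python
# Illustrative zero-shot urban-function scoring with the open-source CLIP
# (https://github.com/openai/CLIP). Taxonomy, template, and aggregation
# below are assumed placeholders, not the paper's implementation.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical mapping from abstract function categories to concrete,
# CLIP-recognizable urban object types.
taxonomy = {
    "residential": ["apartment buildings", "detached houses"],
    "commercial": ["shops and storefronts", "restaurants and cafes"],
    "industrial": ["factories and warehouses"],
}
template = "a street view photo of {}"  # assumed prompt template

prompts, categories = [], []
for category, objects in taxonomy.items():
    for obj in objects:
        prompts.append(template.format(obj))
        categories.append(category)

image = preprocess(Image.open("street_view.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)     # image-text similarity scores
    probs = logits_per_image.softmax(dim=-1)[0]  # one probability per prompt

# Aggregate prompt-level scores into category-level scores (max over prompts).
scores = {}
for category, p in zip(categories, probs.tolist()):
    scores[category] = max(scores.get(category, 0.0), p)

predicted = max(scores, key=scores.get)
print(predicted, scores)
```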
