SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing

Zhecheng Wang, Rajanie Prabha, Jiajun Wu, Tianyuan Huang, Ram Rajagopal

doi:10.1609/aaai.v38i6.28393

Abstract

Remote sensing imagery, despite its broad applications in helping achieve Sustainable Development Goals and tackle climate change, has not yet benefited from the recent advancements of versatile, task-agnostic vision language models (VLMs). A key reason is that the large-scale, semantically diverse image-text dataset required for developing VLMs is still absent for remote sensing images. Unlike natural images, remote sensing images and their associated text descriptions cannot be efficiently collected from the public Internet at scale. In this work, we bridge this gap by using geo-coordinates to automatically connect open, unlabeled remote sensing images with rich semantics covered in OpenStreetMap, and thus construct SkyScript, a comprehensive vision-language dataset for remote sensing images, comprising 2.6 million image-text pairs covering 29K distinct semantic tags. With continual pre-training on this dataset, we obtain a VLM that surpasses baseline models with a 6.2% average accuracy gain in zero-shot scene classification across seven benchmark datasets. It also demonstrates the ability of zero-shot transfer for fine-grained object attribute classification and cross-modal retrieval. We hope this dataset can support the advancement of VLMs for various multi-modal tasks in remote sensing, such as open-vocabulary classification, retrieval, captioning, and text-to-image synthesis.
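The geo-coordinate linking idea described in the abstract can be illustrated with a minimal sketch. It assumes each image tile has a known geographic bounding box and each OpenStreetMap object is reduced to a single point with a tag dictionary. The names `Tile`, `OsmObject`, `caption_from_tags`, and `pair_tiles_with_tags` are hypothetical and chosen for illustration only; the actual SkyScript pipeline is not specified here and may differ substantially.

```python
from dataclasses import dataclass


@dataclass
class Tile:
    """A remote sensing image tile with its geographic bounding box (degrees)."""
    tile_id: str
    west: float
    south: float
    east: float
    north: float


@dataclass
class OsmObject:
    """A simplified OpenStreetMap object: one point plus key-value tags."""
    lon: float
    lat: float
    tags: dict


def contains(tile: Tile, obj: OsmObject) -> bool:
    """Return True if the object's point lies inside the tile's bounding box."""
    return tile.west <= obj.lon <= tile.east and tile.south <= obj.lat <= tile.north


def caption_from_tags(tags: dict) -> str:
    """Turn tags into a plain text description (an assumed, simplistic format)."""
    return ", ".join(f"{key}: {value}" for key, value in sorted(tags.items()))


def pair_tiles_with_tags(tiles, objects):
    """Build (tile_id, caption) pairs for every object that falls inside a tile."""
    pairs = []
    for tile in tiles:
        for obj in objects:
            if contains(tile, obj):
                pairs.append((tile.tile_id, caption_from_tags(obj.tags)))
    return pairs


# Illustrative data only; coordinates and tags are made up for this example.
tiles = [Tile("t0", -122.2, 37.4, -122.1, 37.5)]
objects = [
    OsmObject(-122.15, 37.45, {"power": "plant", "plant:source": "solar"}),
    OsmObject(0.0, 0.0, {"natural": "water"}),
]
pairs = pair_tiles_with_tags(tiles, objects)
print(pairs)
```

The nested loop is quadratic in the number of tiles and objects, so a real dataset of millions of pairs would likely need a spatial index; the sketch only shows the matching logic.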

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Mar 24, 2024
Citations: 1
