Satellite imagery depicts the Earth’s surface remotely and provides comprehensive information for many applications, such as land use monitoring and urban planning. Existing studies on unsupervised representation learning for satellite images consider only the images’ geographic information and ignore human activity factors. To bridge this gap, we propose using Point-of-Interest (POI) data to capture human factors and design a contrastive learning-based framework that consolidates the representation of satellite imagery with POI information. In addition, we introduce a season-invariant representation learning model for satellite imagery, since human factors change little across seasons. An attention model then adaptively merges the representations from the geographic, seasonal, and POI perspectives. Using real-world datasets collected from Beijing, we evaluate our method on predicting socioeconomic indicators. The results show that the representation containing POI information outperforms the geographic representation in estimating commercial activity-related indicators. Our proposed attentional framework can estimate the socioeconomic indicators with an R² of 0.874 and outperforms the baseline methods. Furthermore, we explore how the representations of satellite images differ across socioeconomic statuses. Finally, we investigate the impact of geographic and POI perspective information on the representation learning process, as well as the effect of satellite imagery at various spatial resolutions.