Abstract

Map quality is of great importance to location-based services (LBS) such as navigation and route planning. Typically, a map can be extracted either from vehicle GPS trajectories or from aerial images. Unfortunately, the quality of the extracted maps is often unsatisfactory due to the inherent quality issues of these two data sources. Compared with extracting maps from a single data source, cross-modal map extraction methods consider both sources and often achieve better results. However, almost all existing cross-modal methods are based on CNNs, which fail to sufficiently model global information. To overcome this problem, we propose MoviNet, a novel cross-modal map extraction method that combines a vision transformer (ViT) with a CNN. Specifically, instead of only partially integrating global information in the fusion scheme as in previous works, MoviNet adopts the lightweight ViT model MobileViT as its encoder to strengthen the model's ability to capture global information. Meanwhile, we introduce a new lightweight yet effective fusion scheme that generates modal-unified fusion features from the features of the two modalities, enhancing the information representation ability of each modality. Extensive experiments on the Beijing and Porto datasets show that our method outperforms all baselines. Code: https://github.com/Chan6688/MoviNet
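The core idea of cross-modal fusion described above, i.e. combining per-location features from a trajectory branch and an imagery branch into a single modal-unified feature map, can be illustrated with a minimal sketch. This is not the paper's actual fusion scheme; all names, shapes, and the concatenate-then-project design (a 1x1-convolution analogue) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature maps from the two modalities: C channels on an H x W grid.
# (Shapes and branch names are assumptions, not taken from the paper.)
H, W, C = 8, 8, 16
img_feat = rng.standard_normal((H, W, C))   # aerial-imagery branch features
traj_feat = rng.standard_normal((H, W, C))  # rasterized GPS-trajectory branch features

def fuse(a, b, w):
    """Concatenate the two modality features channel-wise and project
    back to C channels, yielding a modal-unified fusion feature.
    The per-pixel linear projection plays the role of a 1x1 convolution."""
    x = np.concatenate([a, b], axis=-1)  # (H, W, 2C)
    return np.tanh(x @ w)                # (H, W, C), bounded activation

# Random projection weights stand in for learned parameters.
w = rng.standard_normal((2 * C, C)) / np.sqrt(2 * C)
fused = fuse(img_feat, traj_feat, w)
print(fused.shape)  # (8, 8, 16)
```

In a trained model the projection would be learned end-to-end, and the fused features would be fed back to enhance each modality's representation, as the abstract describes.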
