Abstract

Scene change detection (SCD) is the task of identifying changes of interest between bi-temporal images acquired at different times. A central challenge in SCD is identifying interesting changes while suppressing noisy changes induced by camera motion or environmental variation, such as viewpoint shifts, dynamic backgrounds, and changing outdoor conditions. These noisy changes cause corresponding pixel pairs to differ spatially (positional relations) and temporally (intensity relations). Due to their limited local receptive fields, traditional models based on convolutional neural networks (CNNs) struggle to establish the long-range relations needed to capture semantic changes. To address these challenges, we explore the potential of transformers for SCD and propose a transformer-based SCD architecture (TransCD). Guided by the intuition that an SCD model should be able to model both interesting and noisy changes, we incorporate a siamese vision transformer (SViT) into a feature-difference SCD framework. Our motivation is that the SViT can establish global semantic relations and model long-range context, making it more robust to noisy changes. In addition, unlike pure CNN-based models with high computational complexity, the proposed model is more efficient and has fewer parameters. Extensive experiments on the CDNet-2014 dataset demonstrate that the proposed TransCD (SViT-E1-D1-32) outperforms state-of-the-art SCD models, achieving an F1 score of 0.9361, an improvement of 7.31%.
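The siamese feature-difference idea described above can be sketched in a few lines: both temporal images pass through one encoder with shared weights, and per-location feature distances are thresholded into a change mask. The toy encoder below is a single projection standing in for the paper's SViT; all names, sizes, and the threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def encode(tokens, weights):
    """Shared (siamese) encoder: one nonlinear projection of token
    features -- a hypothetical stand-in for the paper's SViT."""
    return np.tanh(tokens @ weights)

def change_mask(img_t0, img_t1, weights, threshold=0.1):
    """Feature-difference SCD sketch: encode both temporal inputs with
    the SAME weights, score each token by its feature distance, and
    threshold the scores into a binary change mask."""
    f0 = encode(img_t0, weights)
    f1 = encode(img_t1, weights)
    score = np.abs(f0 - f1).mean(axis=1)  # per-token change score
    return score > threshold

rng = np.random.default_rng(0)
n_tokens, in_dim, feat_dim = 16, 8, 4     # toy sizes, not from the paper
w = rng.normal(size=(in_dim, feat_dim))
t0 = rng.normal(size=(n_tokens, in_dim))
t1 = t0.copy()
t1[:4] += 3.0                             # simulate a change in 4 tokens

mask = change_mask(t0, t1, w)
print(mask)                               # identical tokens score exactly 0
```

Because the two branches share weights, identical inputs yield identical features, so unchanged regions produce a zero difference by construction; the model's robustness to noisy changes then rests on the encoder mapping noise-perturbed pairs to nearby features.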
