Extracting reliable metric features for scene matching is challenging when images come from multiple sources and viewpoints. To meet the requirements of multi-source, multi-view scene matching, we propose a Siamese network model with spatial-relation-aware feature perception and fusion. The key contributions of this work are as follows: (1) To enhance the coherence of multi-view image features, we investigate relation-aware feature perception: by decomposing spatial relations into vectors along the horizontal (H) and vertical (W) directions, the model perceives the distribution consistency of image features in each direction. (2) To establish a consistent metric relationship, we study a large-scale local information perception strategy that selects a trade-off receptive-field scale suited to the sizes of mainstream aerial and satellite images. (3) After obtaining the multi-scale metric features, we propose a feature selection and fusion strategy to improve metric confidence: the significance of the distinct feature levels of the backbone network is systematically assessed before fusion, so that the pivotal components of the metric features are emphasized during fusion. Experimental results on the University-1652 dataset and on collected real-scene data confirm that the proposed method improves the reliability of the metric model, and its demonstrated effectiveness suggests applicability to diverse scene matching tasks.
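The abstract does not specify the implementation of the spatial relation vector decomposition; as a minimal sketch of contribution (1), assuming average pooling over each spatial axis of a C x H x W feature map (the function names and the cosine-similarity consistency measure are illustrative, not taken from the paper):

```python
import numpy as np

def spatial_relation_vectors(feat):
    """Decompose a C x H x W feature map into two direction-aware
    relation vectors by average pooling along each spatial axis."""
    h_vec = feat.mean(axis=2)  # pool over W -> C x H vector (horizontal direction)
    w_vec = feat.mean(axis=1)  # pool over H -> C x W vector (vertical direction)
    return h_vec, w_vec

def direction_consistency(feat_a, feat_b):
    """Cosine similarity of the direction vectors of two views: a simple
    proxy for distribution consistency along the H and W directions."""
    def cos(u, v):
        u, v = u.ravel(), v.ravel()
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))
    ha, wa = spatial_relation_vectors(feat_a)
    hb, wb = spatial_relation_vectors(feat_b)
    return cos(ha, hb), cos(wa, wb)

# Toy example: a random feature map compared with itself (C=4, H=8, W=8)
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8, 8))
sims = direction_consistency(a, a)
```

In this sketch, identical inputs yield a consistency of 1.0 in both directions; in a Siamese setting the two inputs would instead be the feature maps of the two views being matched.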