Siamese-based trackers have been widely utilized in UAV visual tracking due to their outstanding performance. However, UAV visual tracking encounters numerous challenges, such as similar targets, scale variations, and background clutter. Existing Siamese trackers face two significant issues: firstly, they rely on single-branch features, limiting their ability to achieve long-term and accurate aerial tracking. Secondly, current tracking algorithms treat multi-level similarity responses equally, making it difficult to ensure tracking accuracy in complex airborne environments. To tackle these challenges, we propose a novel UAV tracking Siamese network named the contextual enhancement–interaction and multi-scale weighted fusion network, which is designed to improve aerial tracking performance. Firstly, we designed a contextual enhancement–interaction module to improve feature representation. This module effectively facilitates the interaction between the template and search branches and strengthens the features of each branch in parallel. Specifically, a cross-attention mechanism within the module integrates the branch information effectively. The parallel Transformer-based enhancement structure improves the feature saliency significantly. Additionally, we designed an efficient multi-scale weighted fusion module that adaptively weights the correlation response maps across different feature scales. This module fully utilizes the global similarity response between the template and the search area, enhancing feature distinctiveness and improving tracking results. We conducted experiments using several state-of-the-art trackers on aerial tracking benchmarks, including DTB70, UAV123, UAV20L, and UAV123@10fps, to validate the efficacy of the proposed network. The experimental results demonstrate that our tracker performs effectively in complex aerial tracking scenarios and competes well with state-of-the-art trackers.