Remote sensing visual question answering (RSVQA) offers a user-friendly way to analyze remote sensing images (RSIs) across a variety of tasks. However, current methods often overlook geospatial objects, which appear at multiple scales and require contextual information to interpret. Furthermore, little research has addressed modeling and reasoning over the long-distance dependencies between entities, which leads to one-sided and inaccurate answer predictions. To overcome these limitations, we propose the Scale-Aware Multi-level Feature Pyramid Network (SAMFPN), which integrates contextual and multi-scale information through a Feature Pyramid Network (FPN) and co-attention mechanisms. The SAMFPN module incorporates a multi-level FPN to capture both global and local contextual information. Additionally, it introduces a Visual-Question Collaboration Fusion (VQCF) module that jointly embeds and learns visual and textual information. Our experimental results demonstrate the superior accuracy and robustness of the proposed model compared with existing models. These outcomes indicate that SAMFPN effectively captures multi-scale contextual information, making it a reliable solution for RSVQA tasks.
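
To make the described pipeline concrete, the sketch below shows one plausible arrangement of the components named in the abstract: multi-level FPN features flattened into visual tokens and fused with question tokens via bidirectional co-attention. This is a minimal illustration under assumed PyTorch conventions; all class names (`SAMFPN`, `VQCF`), dimensions, and design details are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch, assuming PyTorch; names and dimensions are hypothetical.
import torch
import torch.nn as nn


class VQCF(nn.Module):
    """Hypothetical visual-question collaboration fusion via co-attention."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Question tokens attend to visual tokens, and vice versa.
        self.q_to_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v_to_q = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, v: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # v: (B, Nv, dim) visual tokens; q: (B, Nq, dim) question tokens.
        v_ctx, _ = self.v_to_q(v, q, q)  # vision enriched by the question
        q_ctx, _ = self.q_to_v(q, v, v)  # question enriched by vision
        # Pool both streams and concatenate into a joint embedding.
        return torch.cat([v_ctx.mean(1), q_ctx.mean(1)], dim=-1)


class SAMFPN(nn.Module):
    """Sketch: multi-level FPN features fused with the question for answering."""

    def __init__(self, dim: int = 256, num_answers: int = 100):
        super().__init__()
        # Lateral 1x1 convs project three backbone levels to a shared width.
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, dim, kernel_size=1) for c in (512, 1024, 2048)
        )
        self.fusion = VQCF(dim)
        self.classifier = nn.Linear(2 * dim, num_answers)

    def forward(self, feats: list, q: torch.Tensor) -> torch.Tensor:
        # FPN top-down pathway: upsample coarser levels and add them in,
        # mixing global (coarse) and local (fine) context.
        p = [lat(f) for lat, f in zip(self.laterals, feats)]
        for i in range(len(p) - 1, 0, -1):
            p[i - 1] = p[i - 1] + nn.functional.interpolate(
                p[i], size=p[i - 1].shape[-2:], mode="nearest"
            )
        # Flatten every pyramid level into tokens so fusion sees all scales.
        tokens = torch.cat([lvl.flatten(2).transpose(1, 2) for lvl in p], dim=1)
        return self.classifier(self.fusion(tokens, q))


# Toy usage with ResNet-like feature shapes and 20 question tokens.
feats = [torch.randn(2, 512, 28, 28), torch.randn(2, 1024, 14, 14),
         torch.randn(2, 2048, 7, 7)]
q = torch.randn(2, 20, 256)
logits = SAMFPN()(feats, q)
print(logits.shape)  # torch.Size([2, 100])
```

In this reading, the cross-attention in both directions is what lets question tokens act as long-range queries over spatially distant image regions, while the pyramid tokens supply the multi-scale context the abstract emphasizes.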