As a challenging cross-modal task, video question answering (VideoQA) aims to fully understand video content and answer relevant questions. Most current methods characterize videos by extracting appearance and motion features separately, ignoring both the interaction between the two and their interaction with the question. Moreover, crucial semantic interactions between visual objects are overlooked. In this paper, we propose a novel Relation-aware Graph Reasoning (ReGR) framework for video question answering, which first combines appearance–motion and location–semantic interaction relations among visual objects. For the interaction between appearance and motion, we design a question-guided Appearance–Motion Block that captures the interdependence between appearance and motion features. For the interaction between location and semantics, we design a Location–Semantic Block that employs a Multi-Relation Graph Attention Network to capture the geometric positions of objects and the semantic interactions between them. Finally, a question-driven Multi-Visual Fusion module yields more accurate multimodal representations. Extensive experiments on three benchmark datasets, TGIF-QA, MSVD-QA, and MSRVTT-QA, demonstrate the superiority of the proposed ReGR over state-of-the-art methods.
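As a concrete illustration of the pipeline described above, the minimal PyTorch sketch below shows one possible instantiation of the three components: a question-guided appearance–motion cross-attention block, a multi-relation graph attention over object features, and a question-driven weighted fusion. The tensor shapes, attention designs, and hyperparameters here are illustrative assumptions, not the exact architecture of ReGR.

```python
# Illustrative sketch only; module internals are assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class AppearanceMotionBlock(nn.Module):
    """Question-guided cross-attention between appearance and motion features (assumed design)."""
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.app2mot = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mot2app = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, app, mot, question):
        # Condition both streams on the question, then let them attend to each other.
        q = self.q_proj(question).unsqueeze(1)          # (B, 1, D)
        app_q, mot_q = app + q, mot + q                 # question-guided features
        app_out, _ = self.mot2app(app_q, mot_q, mot_q)  # appearance attends to motion
        mot_out, _ = self.app2mot(mot_q, app_q, app_q)  # motion attends to appearance
        return app_out, mot_out


class MultiRelationGAT(nn.Module):
    """Graph attention over objects with one head per relation type (e.g., location, semantic)."""
    def __init__(self, dim, num_relations=2):
        super().__init__()
        self.attn = nn.ModuleList([nn.Linear(2 * dim, 1) for _ in range(num_relations)])
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_relations)])

    def forward(self, nodes, adjs):
        # nodes: (B, N, D) object features; adjs: (R, B, N, N) relation-specific adjacency masks.
        out = nodes
        for r, adj in enumerate(adjs):
            h = self.proj[r](nodes)                               # (B, N, D)
            n = h.size(1)
            pair = torch.cat([h.unsqueeze(2).expand(-1, -1, n, -1),
                              h.unsqueeze(1).expand(-1, n, -1, -1)], dim=-1)
            scores = self.attn[r](pair).squeeze(-1)               # (B, N, N)
            scores = scores.masked_fill(adj == 0, float("-inf"))
            alpha = torch.nan_to_num(torch.softmax(scores, dim=-1))
            out = out + alpha @ h                                 # aggregate per relation
        return out


class ReGRSketch(nn.Module):
    """End-to-end sketch: appearance-motion interaction, object graph, question-driven fusion."""
    def __init__(self, dim):
        super().__init__()
        self.am_block = AppearanceMotionBlock(dim)
        self.ls_block = MultiRelationGAT(dim)
        self.gate = nn.Linear(dim, 3)                             # question-driven fusion weights

    def forward(self, app, mot, objects, adjs, question):
        app, mot = self.am_block(app, mot, question)              # (B, T, D) each
        obj = self.ls_block(objects, adjs)                        # (B, N, D)
        w = torch.softmax(self.gate(question), dim=-1)            # (B, 3)
        pooled = torch.stack([app.mean(1), mot.mean(1), obj.mean(1)], dim=1)
        return (w.unsqueeze(-1) * pooled).sum(1)                  # fused video-question feature


# Toy usage (dim must be divisible by num_heads):
# model = ReGRSketch(dim=512)
# feat = model(torch.randn(2, 16, 512), torch.randn(2, 16, 512),
#              torch.randn(2, 10, 512), torch.ones(2, 2, 10, 10), torch.randn(2, 512))
```

The fused representation would then be passed to an answer decoder appropriate to the task (e.g., classification over an answer vocabulary); that stage is omitted here.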