ABSTRACT

The use of Unmanned Aerial Vehicles (UAVs) in remote sensing (RS) has surged in recent years, offering valuable insights into Earth dynamics and human activities. However, this growth has produced a substantial volume of video data, rendering manual screening and analysis impractical. Consequently, there is a pressing need for automated interpretation models for these aerial videos. In this paper, we propose a novel approach that leverages visual dialogue to enhance aerial video captioning. Our model adopts an encoder-decoder architecture, integrating a Visual Question Answering (VQA) task before the captioning task. The VQA task enriches the captioning process by soliciting additional information about the video content. Specifically, our video encoder is based on the Vision Transformer (ViT-L/16), while the decoder employs a distilled version of the Generative Pre-trained Transformer-2 (DistilGPT-2). To validate our model, we introduce a novel benchmark dataset named CapERA-VQA, comprising videos accompanied by sets of questions, answers, and captions. Experimental results demonstrate the effectiveness of the proposed approach in enhancing the automated captioning of aerial videos.
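
To make the described encoder-decoder pairing concrete, the following is a minimal PyTorch sketch of a ViT-L/16 frame encoder coupled to a DistilGPT-2 decoder via a learned visual prefix. This is an illustration under stated assumptions, not the paper's implementation: the checkpoint names (`google/vit-large-patch16-224`, `distilgpt2`), the prefix-conditioning scheme, and all dimensions are assumed for the sketch, and the VQA stage that precedes captioning is omitted.

```python
import torch
import torch.nn as nn
from transformers import ViTModel, GPT2LMHeadModel


class VideoCaptioner(nn.Module):
    """Hypothetical sketch: ViT-L/16 encodes frames, DistilGPT-2 decodes a caption."""

    def __init__(self):
        super().__init__()
        # ViT-L/16 backbone (hidden size 1024); checkpoint choice is an assumption.
        self.encoder = ViTModel.from_pretrained("google/vit-large-patch16-224")
        # DistilGPT-2 language model (hidden size 768).
        self.decoder = GPT2LMHeadModel.from_pretrained("distilgpt2")
        # Project per-frame visual features into the decoder's embedding space.
        self.proj = nn.Linear(1024, 768)

    def forward(self, frames, caption_ids):
        # frames: (batch, num_frames, 3, 224, 224); fold frames into the batch
        # dimension so the image encoder sees one frame at a time.
        b, t = frames.shape[:2]
        feats = self.encoder(pixel_values=frames.flatten(0, 1)).last_hidden_state[:, 0]
        prefix = self.proj(feats.view(b, t, -1))           # (b, t, 768) visual prefix
        tok_emb = self.decoder.transformer.wte(caption_ids)
        inputs = torch.cat([prefix, tok_emb], dim=1)
        # Mask the loss over the visual prefix positions (-100 is ignored).
        pad = torch.full((b, t), -100, dtype=torch.long, device=caption_ids.device)
        labels = torch.cat([pad, caption_ids], dim=1)
        return self.decoder(inputs_embeds=inputs, labels=labels).loss
```

Prefix conditioning is one common way to couple a pretrained language model to visual features; in the proposed pipeline, the preceding VQA stage would additionally supply question-answer text to the decoder, which this sketch does not model.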