Abstract

Visual Question Answering (VQA) is the problem of automatically answering a natural language question about a given image or video. Standard Arabic is the sixth most spoken language in the world; however, to the best of our knowledge, there are neither research attempts nor datasets for VQA in Arabic. In this paper, we generate the first Visual Arabic Question Answering (VAQA) dataset, which is fully automatically generated. The dataset consists of almost 138k Image-Question-Answer (IQA) triplets and is specialized in yes/no questions about real-world images. A novel database schema and an IQA ground-truth generation algorithm are specially designed to facilitate automatic VAQA dataset creation. We propose the first Arabic-VQA system, in which the VQA task is formulated as a binary classification problem. The proposed system consists of five modules, namely visual feature extraction, question pre-processing, textual feature extraction, feature fusion, and answer prediction. Since this is the first research on VQA in Arabic, we investigate several approaches in the question channel to identify the most effective approaches for Arabic question pre-processing and representation. For this purpose, 24 Arabic-VQA models are developed, in which two question-tokenization approaches, three word-embedding algorithms, and four LSTM networks with different architectures are investigated. A comprehensive performance comparison is conducted between all these Arabic-VQA models on the VAQA dataset. Experiments indicate that the performance of all Arabic-VQA models ranges from 80.8% to 84.9%, and that the Arabic-specific question pre-processing choices of handling the special case of separating the yes/no question tool "هل" and embedding the question words using fine-tuned Word2Vec models from AraVec2.0 significantly improve performance. The best-performing model is the one that treats the question tool "هل" as a separate token, embeds the question words using the AraVec2.0 Skip-Gram model, and extracts textual features using a one-layer unidirectional LSTM. Furthermore, our best Arabic-VQA model is compared with related VQA models developed on other popular VQA datasets in a different natural language, considering only their performance on yes/no questions according to the scope of this paper, and shows very comparable performance.
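The following is a minimal sketch, not the authors' released code, of the five-module pipeline the abstract describes: pre-extracted image features and an Arabic question are mapped to visual and textual feature vectors, fused, and passed to a binary yes/no classifier. The layer sizes, the element-wise-product fusion, and all variable names are illustrative assumptions; only the overall structure (word embeddings such as AraVec2.0 Skip-Gram vectors, a one-layer unidirectional LSTM for the question, and a binary answer head) follows the configuration reported above.

```python
# Hypothetical sketch of the described Arabic-VQA yes/no architecture (PyTorch).
# Dimensions, fusion operator, and names are assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class ArabicVQAYesNo(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, img_feat_dim=2048, hidden_dim=512):
        super().__init__()
        # Textual channel: word embeddings (e.g., initialized from fine-tuned
        # AraVec2.0 Skip-Gram vectors) followed by a one-layer unidirectional LSTM.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1, batch_first=True)
        # Visual channel: project pre-extracted CNN image features.
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        # Answer prediction head: binary yes/no classification.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, img_feats, question_token_ids):
        q_emb = self.embedding(question_token_ids)     # (B, T, embed_dim)
        _, (h_n, _) = self.lstm(q_emb)                  # final LSTM hidden state
        q_feat = h_n[-1]                                # (B, hidden_dim)
        v_feat = torch.relu(self.img_proj(img_feats))   # (B, hidden_dim)
        fused = q_feat * v_feat                         # element-wise feature fusion
        return self.classifier(fused).squeeze(-1)       # logits for yes/no answer
```

In such a setup, the question pre-processing step would tokenize the Arabic question and, per the best-reported configuration, keep the question tool "هل" as its own token before lookup in the embedding table; training would use a binary cross-entropy loss (e.g., nn.BCEWithLogitsLoss) on the yes/no labels.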
