Abstract Multilingual question answering (MQA) is an effective access to multilingual data to provide accurate and precise answers, irrespective of language. Although a wide range of datasets is available for monolingual QA systems in natural language processing, benchmark datasets specifically designed for MQA are considerably limited. The absence of comprehensive and benchmark datasets hinders the development and evaluation of MQA systems. To overcome this issue, the proposed work attempts to develop the EHMQuAD dataset, an MQA dataset for low-resource languages such as Hindi and Marathi accompanying the English language. The EHMQuAD dataset is developed using a synthetic corpora generation approach, and an alignment is performed after translation to make the dataset more accurate. Further, the EHMMQA model is proposed to create an abstract framework that uses a deep neural network that accepts pairs of questions and context and returns an accurate answer based on those questions. The shared question and shared context representation have been designed separately to develop this system. The experiments of the proposed model are conducted on the MMQA, Translated SQuAD, XQuAD, MLQA, and EHMQuAD datasets, and EM and F1-score are used as performance measures. The proposed model (EHMMQA) is collated with state-of-the-art MQA baseline models for all possible monolingual and multilingual settings. The results signify that EHMMQA is a considerable step toward the MQA system utilizing Hindi and Marathi languages. Hence, it becomes a new state-of-the-art model for Hindi and Marathi languages.
Read full abstract