THQuAD: Turkish Historic Question Answering Dataset for Reading Comprehension

Fatih Soygazi, Soner Cengiz, Okan Ciftci, Ugurcan Kok

doi:10.1109/ubmk52708.2021.9559013

Abstract

Question answering (QA) is a field of natural language processing and information retrieval that aims to answer questions posed in natural language. In this paper, we present THQuAD, a Turkish question answering dataset, together with baseline results obtained with contextualized word embeddings. THQuAD combines two datasets: the first is TQuad, on Turkish Islamic science history, within the scope of the Teknofest 2018 “Artificial Intelligence competition”; the second, prepared by us, is on Ottoman history, within the scope of the Teknofest 2020 “Doğal Dil İşleme Yarışması” (Natural Language Processing Competition). THQuAD is a reading comprehension dataset consisting of questions, answers, and passages. Our objective is to answer a specific question by understanding the passage and extracting the answer from it. We generate contextualized word embeddings from pre-trained Turkish BERT, ELECTRA, and ALBERT language models after fine-tuning them with neural networks under different hyperparameter settings.
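The extraction step described above can be illustrated with a minimal sketch. It assumes the common span-prediction setup used with BERT-style models, in which the model scores every token as a possible answer start and answer end and the best-scoring valid span is returned; the abstract does not confirm that this exact procedure was used. The function name `best_span`, the `max_len` limit, the example passage, and all scores are hypothetical and are not taken from THQuAD.

```python
from typing import List, Tuple


def best_span(start_scores: List[float],
              end_scores: List[float],
              max_len: int = 10) -> Tuple[int, int]:
    """Return the (start, end) token indices with the highest combined score.

    Only spans with start <= end and at most max_len tokens are considered.
    """
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        # Limit the end index so the span never exceeds max_len tokens.
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Hypothetical passage tokens and model scores (illustration only).
tokens = ["Osman", "I", "founded", "the", "Ottoman", "state", "around", "1299", "."]
start_scores = [0.1, 0.0, 0.0, 0.0, 0.2, 0.0, 0.3, 2.5, 0.0]
end_scores = [0.0, 0.2, 0.0, 0.0, 0.0, 0.4, 0.0, 2.1, 0.1]

s, e = best_span(start_scores, end_scores)
answer = " ".join(tokens[s:e + 1])
print(answer)  # prints "1299"
```

In a real system the two score lists would come from a fine-tuned language model rather than being written by hand; the span search itself stays the same.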

Full Text