Abstract

Human-computer interaction evolved with the emergence of virtual assistants, which in practice is tied to the evolution of Question Answering (QA) systems. This evolution demands more powerful, helpful, and aware QA systems, able to provide high-quality answers to a wide range of questions. One way to meet this requirement is to combine multiple restricted-domain QA systems into a high-quality open-domain QA system. Such a system can use a routing mechanism based on a hierarchical question domain classifier to select the appropriate restricted-domain QA system to answer the user's question. However, creating and maintaining the large and robust dataset of labeled questions required to train this mechanism is impractical to do by hand. To tackle this problem, in this study we present a strategy for automatically generating labeled question datasets from the same documents that the QA systems use as their sources of information. To validate the proposed approach, we created a large dataset and used it to train a hierarchical question domain classifier. We then evaluated the performance of this classifier on human-written and human-labeled questions. The results indicate that the automatically generated questions are of high quality and can therefore be safely used in real-world applications.
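As a rough illustration of the routing idea described in the abstract, the sketch below trains a two-level (coarse domain, then subdomain) question classifier and uses the predicted leaf label to dispatch a question to one restricted-domain QA system. This is only a minimal sketch under stated assumptions: the domain labels, toy training questions, scikit-learn models, and the `route` helper are hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch: hierarchical question domain classification used for routing.
# All domains, example questions, and model choices here are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled questions: (question, coarse domain, fine subdomain)
train = [
    ("What are the side effects of ibuprofen?", "health", "health/pharmacology"),
    ("How is type 2 diabetes diagnosed?",       "health", "health/diagnosis"),
    ("How do I reset my router password?",      "tech",   "tech/networking"),
    ("Why does my laptop battery drain fast?",  "tech",   "tech/hardware"),
]
questions = [q for q, _, _ in train]

# Level 1: coarse domain classifier over all questions.
coarse_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
coarse_clf.fit(questions, [c for _, c, _ in train])

# Level 2: one fine-grained classifier per coarse domain.
fine_clfs = {}
for domain in {c for _, c, _ in train}:
    subset = [(q, f) for q, c, f in train if c == domain]
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit([q for q, _ in subset], [f for _, f in subset])
    fine_clfs[domain] = clf

def route(question, qa_systems):
    """Dispatch the question to the restricted-domain QA system whose key
    matches the predicted leaf domain; qa_systems maps leaf labels to callables."""
    coarse = coarse_clf.predict([question])[0]
    leaf = fine_clfs[coarse].predict([question])[0]
    return qa_systems[leaf](question)
```

In this reading of the approach, the classifier only decides *which* restricted-domain QA system answers; the answer itself still comes from that system's own document collection.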
