Abstract

With the proliferation of question answering (Q&A) services, studies on building a knowledge base (KB) from unstructured data on the Web using various information extraction (IE) methodologies have received significant attention. Existing IE approaches, including machine reading comprehension (MRC), can find the correct answer to a question if that answer exists in the document. However, most are prone to extracting incorrect answers rather than producing no answer when the correct answer does not exist in the given documents. This weakness can cause serious problems when such technologies are applied to practical services such as AI speakers. We propose a novel open-domain IE system that alleviates the weaknesses of previous approaches. The proposed system integrates elaborate document selection, sentence selection, and a knowledge extraction ensemble method to obtain high specificity while maintaining a realistically achievable level of precision. Based on this framework, we extract answers to Korean open-domain user queries from unstructured documents collected from multiple Web sources. To evaluate our system, we build a benchmark dataset from the SKTelecom AI Speaker log. The KYLIN infobox generator and BiDAF were used as baseline models. The experimental results demonstrate that the proposed method outperforms the baseline models and is practically applicable to real-world services.
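For concreteness, the sketch below shows the general shape of such a select-then-extract pipeline: documents are filtered first, then candidate sentences, and finally an extraction ensemble that is allowed to abstain instead of guessing. This is an illustrative outline only, not the authors' implementation; the function names, placeholder keyword-overlap scoring, and the confidence threshold are assumptions made for the example.

```python
# Illustrative sketch of a select-then-extract pipeline with abstention.
# All scoring logic is a placeholder (keyword overlap / fixed scores), chosen
# only to make the control flow runnable; it is not the paper's method.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Extraction:
    answer: str
    score: float


def select_documents(query: str, corpus: List[str], top_k: int = 5) -> List[str]:
    """Stage 1: keep the documents most relevant to the query."""
    overlap = lambda text: sum(word in text for word in query.split())
    return sorted(corpus, key=overlap, reverse=True)[:top_k]


def select_sentences(query: str, documents: List[str], top_k: int = 10) -> List[str]:
    """Stage 2: narrow the selected documents down to candidate sentences."""
    sentences = [s.strip() for d in documents for s in d.split(".") if s.strip()]
    overlap = lambda text: sum(word in text for word in query.split())
    return sorted(sentences, key=overlap, reverse=True)[:top_k]


def extract_answers(query: str, sentences: List[str]) -> List[Extraction]:
    """Stage 3: an ensemble of extractors scores candidate answers.
    Placeholder: each candidate is the sentence itself with a fixed score."""
    return [Extraction(answer=s, score=0.5) for s in sentences]


def answer_query(query: str, corpus: List[str], threshold: float = 0.8) -> Optional[str]:
    """Return an answer only when the ensemble is confident; otherwise abstain."""
    documents = select_documents(query, corpus)
    candidates = extract_answers(query, select_sentences(query, documents))
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.score)
    return best.answer if best.score >= threshold else None
```

Abstaining (returning no answer) rather than always returning the best-scoring candidate is what lets such a system trade a little recall for the specificity emphasized in the abstract.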

Highlights

  • Formal knowledge bases (KBs), such as the Linked Open Data Cloud (LOD) [1], are used to express and share knowledge by connecting and assigning resources on the Web

  • The KB is a core element of question answering (Q&A) service systems and is considered an important research subject in artificial intelligence as a technology for storing and retrieving answers to user queries

  • Machine reading comprehension (MRC) might result in poor performance on unstructured Web documents because it cannot guarantee that the retrieved document contains the correct answer


Summary

Introduction

Formal knowledge bases (KBs), such as the Linked Open Data Cloud (LOD) [1], are used to express and share knowledge by connecting and assigning resources on the Web. Approaches for extracting such knowledge from unstructured Web documents (information extraction, IE) fall into three types. The first type requires an expert to create IE rules for a specific domain and extracts knowledge whenever a matching rule pattern is found in a document. The second type extracts information using supervised machine learning and deep learning models. The third type applies machine reading comprehension (MRC), in which information is extracted under the assumption that a correct answer exists in the document, as in the Stanford Question Answering Dataset (SQuAD) [2]. This assumption can result in poor performance on unstructured documents on the Web because there is no guarantee that a retrieved document contains the correct answer.
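As a minimal illustration of this limitation (the span scores below are hypothetical and not produced by any particular model), a SQuAD 1.1-style reader must always return its best-scoring span, so it will still "extract" something from a document that does not contain the answer, whereas a reader allowed to abstain can return no answer when every candidate is weak:

```python
# Hypothetical span scores for a question whose answer is NOT in the document.
span_scores = {"in 1998": 0.31, "Seoul": 0.22}

def read_forced(scores):
    """SQuAD 1.1-style reading: always return the best-scoring span."""
    return max(scores, key=scores.get)

def read_with_abstention(scores, no_answer_threshold=0.7):
    """Reader that may abstain when no span is convincing enough."""
    span, score = max(scores.items(), key=lambda kv: kv[1])
    return span if score >= no_answer_threshold else None

print(read_forced(span_scores))           # "in 1998"  -> a confident-looking wrong answer
print(read_with_abstention(span_scores))  # None       -> abstains instead of guessing
```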
