Abstract

The task of relation extraction is to classify the relation between two entities in a sentence. Distant supervision relation extraction can automatically align entities in text with a Knowledge Base, without labeled training data. For low-resource-language relation extraction, such as Tibetan, the main problem is the lack of labeled training data. In this paper, we propose an improved distantly supervised relation extraction model based on the Piecewise Convolutional Neural Network (PCNN) to expand the Tibetan corpus. We add a self-attention mechanism and the soft-label method to reduce wrong labels, and use Embeddings from Language Models (ELMo) to address the semantic ambiguity problem. Meanwhile, according to the characteristics of Tibetan, we combine the word vector and the part-of-speech vector to extract deep features of words. Finally, the experimental results show that the P@avg value increases by 14.4% over the baseline.

Highlights

  • Relation extraction is one of the most fundamental tasks in Natural Language Processing (NLP), which aims to classify the semantic relationship between two entities in a sentence

  • To account for the uncertainty of instance labels, Zeng et al. [2] proposed the Piecewise Convolutional Neural Network (PCNN) with Multi-instance Learning (MIL), which uses a convolutional architecture with piecewise max pooling to automatically extract sentence features; its P@avg value reached 67.6% on the New York Times (NYT)+Freebase dataset [16]

  • This paper conducts the comparative experiments in English and Tibetan corpus
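The piecewise max pooling mentioned in the highlights splits the convolution output into three segments at the two entity positions and max-pools each segment separately, so positional structure around the entities is preserved. A minimal NumPy sketch of this pooling step (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def piecewise_max_pooling(conv_features, e1_pos, e2_pos):
    """PCNN-style piecewise max pooling (sketch).

    conv_features: (seq_len, n_filters) convolution output for one sentence.
    e1_pos, e2_pos: token indices of the two entities; they divide the
    sentence into three segments, each max-pooled independently.
    """
    left, right = sorted((e1_pos, e2_pos))
    segments = [
        conv_features[: left + 1],            # up to and including entity 1
        conv_features[left + 1 : right + 1],  # between the two entities
        conv_features[right + 1 :],           # after entity 2
    ]
    pooled = [
        seg.max(axis=0) if len(seg) else np.zeros(conv_features.shape[1])
        for seg in segments
    ]
    return np.concatenate(pooled)  # shape: (3 * n_filters,)
```

The concatenated result (three maxima per filter) is what feeds the classifier, instead of a single sentence-wide max as in a plain CNN.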


Summary

INTRODUCTION

Relation extraction is one of the most fundamental tasks in Natural Language Processing (NLP), which aims to classify the semantic relationship between two entities in a sentence. Traditional supervised methods require high-quality manually annotated data and rely on human-designed features, so they are difficult to apply directly to low-resource languages such as Tibetan. To address this problem, the distant supervision (DS) method was proposed to obtain a large-scale labeled training corpus automatically. The main contributions of this paper (Y. Sun et al.: Improved Distant Supervised Model in Tibetan Relation Extraction Using ELMo and Attention) are: (1) using ELMo to generate dynamic word vectors based on the current context as input to the PCNN model, which can alleviate the semantic ambiguity problem; and (2) since the Tibetan KB is small-scale, and Tibetan part of speech clearly indicates the grammatical and semantic structure of sentences, adding self-attention to combine the word vector and the part-of-speech vector in Tibetan to extract internal features of words, which can reduce the weights of noisy instances and alleviate the wrong-labeling problem.
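Contribution (2) fuses each token's word vector with its part-of-speech vector and then applies self-attention over the fused sequence. A minimal single-head scaled dot-product sketch of that fusion step, assuming simple concatenation (the paper's exact parameterization may differ, and all names here are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_and_self_attend(word_vecs, pos_vecs):
    """Concatenate word and part-of-speech vectors, then apply
    single-head scaled dot-product self-attention (sketch).

    word_vecs: (seq_len, d_word), pos_vecs: (seq_len, d_pos)
    Returns context-aware token features of shape (seq_len, d_word + d_pos).
    """
    x = np.concatenate([word_vecs, pos_vecs], axis=-1)  # token fusion
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)       # pairwise token similarity
    weights = softmax(scores, axis=-1)  # one attention distribution per token
    return weights @ x                  # weighted mixture of all tokens
```

Each output row is a convex combination of all fused token vectors, so tokens whose word+POS representation matters more for the sentence receive higher weight; in the DS setting this is what lets the model down-weight noisy instances.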

RELATED WORK
SENTENCE FEATURE EXTRACTION WITH PCNN
ATTENTION MECHANISM
SELECTIVE-ATTENTION MECHANISM
DATA PREPROCESSING
CONCLUSION