Abstract

Representing variable-length texts (e.g., sentences, documents) with low-dimensional continuous vectors has attracted considerable recent interest due to its successful application in various NLP tasks. During learning, most existing methods treat all words equally, regardless of their possibly different intrinsic nature. We believe that for some types of documents (e.g., news articles), entity mentions are more informative than ordinary words, and that properly exploiting them can benefit certain tasks. In this paper, we propose a novel approach for learning low-dimensional vector representations of documents. The learned representations capture not only the words in a document, but also its entity mentions and the connections between different entities. Experimental results demonstrate that our approach significantly improves text clustering and text classification performance, and outperforms previous studies on the TAC-KBP entity linking benchmark.
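The abstract does not spell out the training objective, but the core idea of treating entity mentions as first-class tokens when learning document vectors can be approximated with off-the-shelf tooling. The sketch below is a minimal illustration under stated assumptions, not the authors' method: it replaces entity-mention spans with canonical entity IDs (the `ENT:` token convention, the toy corpus, and the upstream entity linker are all hypothetical) and trains a standard paragraph-vector (Doc2Vec) model over the mixed word/entity stream using gensim.

```python
# Minimal sketch (NOT the paper's method): approximate entity-aware document
# embeddings by collapsing entity-mention spans into canonical entity IDs and
# training a standard paragraph-vector (Doc2Vec) model over the mixed stream.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document is a token list plus (start, end, entity_id) spans,
# assumed to come from some upstream entity linker (not part of this sketch).
docs = [
    (["barack", "obama", "visited", "berlin", "today"],
     [(0, 2, "ENT:Barack_Obama"), (3, 4, "ENT:Berlin")]),
    (["berlin", "hosted", "a", "climate", "summit"],
     [(0, 1, "ENT:Berlin")]),
]

def entity_aware_tokens(words, spans):
    """Replace each mention span with its canonical entity ID token."""
    out, i = [], 0
    for start, end, ent in sorted(spans):
        out.extend(words[i:start])   # keep ordinary words as-is
        out.append(ent)              # collapse the mention into one entity token
        i = end
    out.extend(words[i:])
    return out

tagged = [TaggedDocument(words=entity_aware_tokens(w, s), tags=[f"doc_{k}"])
          for k, (w, s) in enumerate(docs)]

# Documents, words, and entity tokens share one embedding space.
model = Doc2Vec(tagged, vector_size=50, window=3, min_count=1, epochs=40)

print(model.dv["doc_0"][:5])                              # learned document vector
print(model.infer_vector(["obama", "in", "berlin"])[:5])  # embed unseen text
```

Collapsing each mention into a single shared entity token means documents mentioning the same entity pull toward each other in the learned space; the paper's actual model additionally exploits connections between different entities, which this sketch does not attempt to reproduce.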

