From character to document representation with global context awareness

Zhenzhou Wu, Xin Zheng, Daniel Dahlmeier

doi:10.1145/3162957.3162973

Abstract

Bag-of-Words with TF-IDF or other weighting schemes is a commonly adopted approach to document representation. However, such representations fail to capture sequential or semantic information in a sentence, and they lead to high-dimensional vectors because of misspellings, acronyms, and so on. Distributed word embedding and even document embedding methods have been proposed to encode semantic or contextual information, but the quality of the resulting representation is not always good. To relieve the above-mentioned problems, we propose a high-quality document representation model that takes word morphology, together with the semantic and sequential information of the global context, into consideration. The proposed model can outperform state-of-the-art traditional methods, word embedding-based models, and character-aware models on a text classification task.

Full Text