Abstract

Simple linear models, which typically learn word-level representations that are later combined into document representations, have recently shown impressive performance. To improve document-level classification, it is crucial to understand the factors that affect the quality of the document vector. In this paper, we propose the concept of containers and explore the properties of word containers and document containers through experiments and theoretical analysis. We find that the document container has a fixed capacity: on very large text datasets, a document vector obtained by simply averaging too many word embeddings exceeds that capacity and loses semantic and syntactic information. We also propose an efficient approach to document representation that uses clustering algorithms to divide a document container into several subcontainers and establishes relationships between them. We further report and discuss the properties of two clustering-based methods, DVEM-Kmeans and DVEM-Random, on large text datasets using sentiment analysis and topic classification tasks. The results show that our models outperform existing state-of-the-art simple linear models in generating high-quality document representations for document-level classification and relatedness tasks. Our approach can also be incorporated into other neural network models, such as convolutional neural networks, recurrent neural networks, and generative adversarial networks, in supervised or semi-supervised settings.
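The abstract does not spell out how the subcontainers are built or combined. The following is a minimal sketch of one plausible reading of a DVEM-Kmeans-style document vector, assuming word embeddings are clustered with k-means, averaged within each cluster, and the cluster means concatenated; the function name dvem_kmeans_vector, the choice of k, and the size-based ordering are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def dvem_kmeans_vector(word_vectors: np.ndarray, k: int = 4) -> np.ndarray:
    """Build a document vector from (n_words, dim) word embeddings by
    clustering them into k subcontainers and concatenating cluster means."""
    k = min(k, len(word_vectors))  # guard against very short documents
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(word_vectors)
    # Average the embeddings that fall into each subcontainer.
    means = [word_vectors[labels == c].mean(axis=0) for c in range(k)]
    # Order subcontainers by cluster size so the concatenation is deterministic.
    order = np.argsort([-np.sum(labels == c) for c in range(k)])
    return np.concatenate([means[c] for c in order])
```

Under these assumptions the document vector has dimension k × dim, and each subcontainer averages only the words assigned to its cluster, which is one way the fixed-capacity limit on a single averaged vector could be sidestepped.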
