Abstract

Document representation is one of the foundations of natural language processing. The bag-of-words (BoW) model, a representative document representation model, is both simple and effective. However, the traditional BoW model suffers from sparsity and a lack of latent semantic relations. In this paper, to address these problems, we propose two tolerance rough set-based BoW models, called TRBoW1 and TRBoW2 according to their weight calculation methods. Unlike popular supervised representation methods, they are unsupervised and require no prior knowledge. By extending each document to its upper approximation with TRBoW1 or TRBoW2, the semantic relations among documents are mined and document vectors become denser. Comparative experiments on text classification across different datasets show that our methods outperform various existing document representation methods.
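
For intuition only, the sketch below illustrates the general tolerance rough set idea behind the proposed models: each term is assigned a tolerance class of co-occurring terms, and a document's BoW vector is extended to its upper approximation. The co-occurrence threshold theta and the uniform weight of 1 given to added terms are illustrative assumptions, not the actual TRBoW1/TRBoW2 weighting schemes.

```python
# Illustrative sketch of a tolerance rough set upper approximation over BoW
# vectors. The threshold `theta` and the uniform weight for added terms are
# assumptions for illustration, not the paper's TRBoW1/TRBoW2 weighting rules.
from collections import Counter
from itertools import combinations

def tolerance_classes(docs, theta):
    """Map each term to the terms that co-occur with it in at least `theta` documents."""
    cooc = Counter()
    for doc in docs:
        for t1, t2 in combinations(sorted(set(doc)), 2):
            cooc[(t1, t2)] += 1
    classes = {t: {t} for doc in docs for t in doc}
    for (t1, t2), count in cooc.items():
        if count >= theta:
            classes[t1].add(t2)
            classes[t2].add(t1)
    return classes

def upper_approximation_bow(doc, classes):
    """Extend a document's BoW counts with terms whose tolerance class meets the document."""
    terms = set(doc)
    extended = Counter(doc)
    for term, cls in classes.items():
        if term not in terms and cls & terms:
            extended[term] += 1  # illustrative uniform weight for the added terms
    return dict(extended)

docs = [
    ["rough", "set", "document", "model"],
    ["rough", "set", "theory"],
    ["bag", "of", "words", "document", "model"],
]
classes = tolerance_classes(docs, theta=1)  # toy threshold for this tiny corpus
print(upper_approximation_bow(docs[0], classes))
```

On this toy corpus, the first document's vector gains the terms "theory", "bag", "of", and "words" through their tolerance classes, showing how latent semantic relations among documents surface and the representation becomes denser.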

Highlights

  • With the explosive growth of the Internet, massive amounts of text data accumulate constantly

  • We propose two novel document representation learning models, TRBoW1 and TRBoW2, which adopt the tolerance rough set model to improve the traditional BoW model

  • The proposed TRBoW1 and TRBoW2 models extend each document to its upper approximation



Introduction

With the explosive growth of the Internet, massive amounts of text data accumulate constantly. Unlike numerical data, which are structured, document or text data are unstructured. As the basis of natural language processing (NLP) and text mining tasks, efficient text or document representation is important. The main challenge of document representation is transforming unstructured text data into a structured form. A good document representation should, on the one hand, faithfully reflect the content of the document and, on the other hand, be able to distinguish different documents. Additionally, it should perform well in NLP applications such as text classification, information retrieval, and text clustering.
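
As a point of reference, the minimal sketch below shows the traditional BoW baseline discussed above: raw documents are turned into structured count vectors. It uses scikit-learn's CountVectorizer purely for illustration (a tooling choice, not part of this paper); the mostly-zero output matrix shows the sparsity drawback that the proposed models address.

```python
# Minimal sketch of the traditional bag-of-words baseline: raw documents are
# mapped to structured count vectors. scikit-learn is used here purely for
# illustration; it is not part of the proposed method.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "rough set theory for document representation",
    "bag of words document model",
    "text classification with sparse vectors",
]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)  # 3 x |vocabulary| sparse count matrix
print(vectorizer.get_feature_names_out())
print(bow.toarray())  # mostly zeros: the sparsity drawback of traditional BoW
```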
