Abstract

Document representation is one of the foundations of natural language processing. The bag-of-words (BoW) model, a representative document representation model, is both simple and effective. However, the traditional BoW model suffers from sparsity and a lack of latent semantic relations. In this paper, to address these problems, we propose two tolerance rough set-based BoW models, called TRBoW1 and TRBoW2 according to their weight calculation methods. Unlike popular supervised representation methods, they are unsupervised and require no prior knowledge. By extending each document to its upper approximation with TRBoW1 or TRBoW2, the semantic relations among documents are mined and document vectors become denser. Comparative experiments on text classification across different datasets show that our methods outperform various existing document representation methods.
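
For intuition only, the sketch below illustrates the general tolerance rough set idea behind the proposed models: each term is assigned a tolerance class of co-occurring terms, and a document's BoW vector is extended to its upper approximation. The co-occurrence threshold theta and the uniform weight of 1 given to added terms are illustrative assumptions, not the actual TRBoW1/TRBoW2 weighting schemes.

```python
# Illustrative sketch of a tolerance rough set upper approximation over BoW
# vectors. The threshold `theta` and the uniform weight for added terms are
# assumptions for illustration, not the paper's TRBoW1/TRBoW2 weighting rules.
from collections import Counter
from itertools import combinations

def tolerance_classes(docs, theta):
    """Map each term to the terms that co-occur with it in at least `theta` documents."""
    cooc = Counter()
    for doc in docs:
        for t1, t2 in combinations(sorted(set(doc)), 2):
            cooc[(t1, t2)] += 1
    classes = {t: {t} for doc in docs for t in doc}
    for (t1, t2), count in cooc.items():
        if count >= theta:
            classes[t1].add(t2)
            classes[t2].add(t1)
    return classes

def upper_approximation_bow(doc, classes):
    """Extend a document's BoW counts with terms whose tolerance class meets the document."""
    terms = set(doc)
    extended = Counter(doc)
    for term, cls in classes.items():
        if term not in terms and cls & terms:
            extended[term] += 1  # illustrative uniform weight for the added terms
    return dict(extended)

docs = [
    ["rough", "set", "document", "model"],
    ["rough", "set", "theory"],
    ["bag", "of", "words", "document", "model"],
]
classes = tolerance_classes(docs, theta=1)  # toy threshold for this tiny corpus
print(upper_approximation_bow(docs[0], classes))
```

On this toy corpus, the first document's vector gains the terms "theory", "bag", "of", and "words" through their tolerance classes, showing how latent semantic relations among documents surface and the representation becomes denser.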

Highlights

  • With the explosive growth of the Internet, massive amounts of text data accumulate constantly

  • We propose two novel document representation learning models, TRBoW1 and TRBoW2, which adopt the tolerance rough set model to improve the traditional BoW model

  • The proposed TRBoW1 and TRBoW2 models extend each document to its upper approximation



Introduction

With the explosive growth of the Internet, massive amounts of text data accumulate constantly. Unlike numerical data, which are structured, document or text data are unstructured. As the basis of natural language processing (NLP) and text mining tasks, efficient text or document representation is important. The main challenge of document representation is transforming unstructured text data into a structured form. A good document representation should, on the one hand, faithfully reflect the content of the document and, on the other hand, be able to distinguish different documents. Additionally, it should perform well in NLP applications such as text classification, information retrieval, and text clustering.
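
As a point of reference, the minimal sketch below shows the traditional BoW baseline discussed above: raw documents are turned into structured count vectors. It uses scikit-learn's CountVectorizer purely for illustration (a tooling choice, not part of this paper); the mostly-zero output matrix shows the sparsity drawback that the proposed models address.

```python
# Minimal sketch of the traditional bag-of-words baseline: raw documents are
# mapped to structured count vectors. scikit-learn is used here purely for
# illustration; it is not part of the proposed method.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "rough set theory for document representation",
    "bag of words document model",
    "text classification with sparse vectors",
]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)  # 3 x |vocabulary| sparse count matrix
print(vectorizer.get_feature_names_out())
print(bow.toarray())  # mostly zeros: the sparsity drawback of traditional BoW
```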
