Abstract

With the Bag-of-Words model, a document corpus is initially represented by a Terms-Documents matrix. However, this high-dimensional raw Terms-Documents matrix needs to be transformed into a lower-dimensional semantic Concepts-Documents matrix, both to reduce the dimension of the feature space and to create more meaningful features. This paper analyzes two feature transformation (FT) models on the Terms-Documents matrix: the FT model based on Low-Rank Approximation (LRA) and the FT model based on Word Embedding (WE). Each has its own strengths and weaknesses in text transformation. The LRA-based FT takes a purely mathematical perspective, aiming to cover the original dispersed term set of the corpus as well as possible in a statistical sense, while the WE-based FT uses available word embedding vectors to enrich the contextual content of the corpus representation. Therefore, two combinations of the LRA-based FT and the WE-based FT, named LRAintoWE-based FT and WEintoLRA-based FT, are proposed to obtain comprehensive FTs that appropriately capture both the statistical and the contextual information. Experimental results on three benchmark datasets show that the information from the WE-based FT and the LRA-based FT can be integrated, and that their integration as LRAintoWE-based FT and WEintoLRA-based FT can improve classification performance compared with using either one alone.
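To make the two base transformations concrete, the following is a minimal sketch, not the paper's implementation. It assumes that the LRA step is a truncated SVD (one common low-rank approximation; the paper may use a different one) and that the WE step projects term counts through a terms-by-dimensions embedding matrix. The names `X`, `E`, `k`, `lra_transform`, and `we_transform` are hypothetical, and random values stand in for real term counts and pretrained embeddings. The combined LRAintoWE and WEintoLRA variants are not sketched because the abstract does not define them.

```python
import numpy as np


def lra_transform(X, k):
    """Map a terms x documents matrix to a k x documents matrix.

    Assumption: LRA is realised as a truncated SVD, and documents are
    projected onto the top-k left singular vectors of X.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k = U[:, :k]          # terms x k basis of "concepts"
    return U_k.T @ X        # k x documents


def we_transform(X, E):
    """Map a terms x documents matrix to a d x documents matrix.

    Assumption: E is a terms x d matrix whose rows are word embedding
    vectors, so each document becomes a count-weighted sum of the
    embeddings of its terms.
    """
    return E.T @ X          # d x documents


# Toy data: 6 terms, 4 documents, 3-dimensional embeddings (all synthetic).
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(6, 4)).astype(float)
E = rng.normal(size=(6, 3))

C_lra = lra_transform(X, k=2)
C_we = we_transform(X, E)
print(C_lra.shape, C_we.shape)
```

Both functions return a Concepts-Documents matrix with far fewer rows than the number of terms. The difference lies in where the concept basis comes from: the corpus statistics themselves in the LRA case, and external embedding vectors in the WE case.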
