Abstract

Text representation is a critical issue for exploring the insights behind text. Many models have been developed to represent text in defined forms, such as numeric vectors, so that the similarity between documents can be calculated using well-known distance measures. In this paper, we aim to build a model that represents text semantically, whether in one document or across multiple documents, using a combination of the hierarchical Latent Dirichlet Allocation (hLDA), Word2vec, and Isolation Forest models. The proposed model learns a vector for each document using the relationship between the vectors of its words and the hierarchy of topics generated by the hierarchical Latent Dirichlet Allocation model. The Isolation Forest model is then used to represent multiple documents as a single profile, which makes it easy to find documents similar to that profile. The proposed text representation model outperforms traditional text representation models when applied to represent scientific papers before performing content-based scientific paper recommendation for researchers.
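
To make the pipeline concrete, here is a minimal Python sketch of the profile-building idea, assuming gensim and scikit-learn. The toy corpus, the plain averaging of word vectors (the paper instead weights words by the hLDA topic hierarchy), and all parameters are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import IsolationForest

# Toy corpus: each document is a list of tokens (hypothetical data).
docs = [
    ["topic", "modeling", "for", "text", "representation"],
    ["word", "embeddings", "capture", "semantics"],
    ["recommending", "scientific", "papers", "to", "researchers"],
]

# Train Word2vec on the corpus to obtain a vector per word.
w2v = Word2Vec(sentences=docs, vector_size=50, min_count=1, epochs=50)

def doc_vector(tokens):
    # Plain average of word vectors; the paper instead combines them
    # using the hLDA topic hierarchy, which this sketch omits.
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0)

X = np.vstack([doc_vector(d) for d in docs])

# Fit an Isolation Forest on one researcher's document vectors: the fitted
# forest acts as a single "profile"; a higher score_samples value means a
# candidate document is more similar to the profile.
profile = IsolationForest(n_estimators=100, random_state=0).fit(X)
candidate = doc_vector(["semantics", "of", "scientific", "text"])
print(profile.score_samples(candidate.reshape(1, -1)))
```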

Highlights

  • With the rapid growth in the volume of text data and documents over the internet, from social media, news articles, scientific papers, and surveys, it has become critical to find an effective model for representing the text features in documents before using them in text mining, information retrieval, and recommendation systems

  • The model exploits the hierarchical Latent Dirichlet Allocation (hLDA) model to learn a hierarchy of topics generated from a document corpus, combined with the word2vec model to capture the semantics behind the document text (a minimal hLDA sketch follows these highlights)

  • The evaluation of the proposed model is conducted through different experiments on recommending scientific papers to researchers, against methods that apply similar techniques: the concept-based model and the Latent Dirichlet Allocation (LDA)+Word2vec model
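
As referenced above, here is a minimal hLDA sketch. It uses the tomotopy library, which is an assumption on our part; the paper does not prescribe an implementation. hLDA places every document on a root-to-leaf path in a learned tree of topics.

```python
import tomotopy as tp

# Toy tokenized corpus (hypothetical data).
docs = [
    ["text", "representation", "document", "vector"],
    ["topic", "hierarchy", "latent", "dirichlet", "allocation"],
]

mdl = tp.HLDAModel(depth=3)      # learn a 3-level topic tree
for tokens in docs:
    mdl.add_doc(tokens)
mdl.train(1000)                  # run 1000 sampling iterations

for doc in mdl.docs:
    print(doc.path)              # topic ids from root to leaf for this document
```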

Introduction

With the rapid growth in the volume of text data and documents over the internet, from social media, news articles, scientific papers, and surveys, it has become critical to find an effective model for representing the text features in documents before using them in text mining, information retrieval, and recommendation systems. The Bag-of-Words (BOW) model is one of the most popular models for representing documents [1]. It relies on the frequencies of the words within a document to build a fixed-length document vector, but it fails to capture a word's importance across a collection of documents. The Term Frequency-Inverse Document Frequency (TF-IDF) model has been applied to represent a document as a numeric vector [2, 3]. With the vital need to capture word semantics in order to build an effective model for document representation, topic modeling techniques have been proposed for representing documents. Latent Dirichlet Allocation (LDA) [6] is one of the main topic modeling methods. It represents each document as a distribution over a mixture of topics with certain probabilities, while each topic is represented as a distribution over a mixture of words.
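
As a quick illustration of these baseline representations, the following scikit-learn sketch builds BOW, TF-IDF, and LDA representations for a toy two-document corpus; the corpus and parameter choices are assumptions for demonstration only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "topic models represent documents as mixtures of topics",
    "word frequencies alone ignore word importance across documents",
]

bow = CountVectorizer().fit_transform(corpus)    # Bag-of-Words: raw counts
tfidf = TfidfVectorizer().fit_transform(corpus)  # counts reweighted by rarity

# LDA: each document becomes a distribution over latent topics,
# and each topic a distribution over words.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(bow)              # one topic mixture per document
print(doc_topics)
```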
