Abstract

Network-based document clustering forms clusters of documents based on their significance and the strength of the relationships among them. This approach can be used with various types of metadata that express the significance of the documents and the relationships among them. In this study, we defined a probabilistic network graph for fine-grained document clustering and developed a probabilistic generative model and calculation method. Furthermore, we devised a novel neural-network-based network embedding learning method that considers the significance of a document based on its ranking under external measures, such as the download counts of relevant files, and reflects the relationship strength between documents. By considering document significance, the reputable documents of each cluster can be placed at its center and shown as representative documents for tasks such as data analysis and data representation. In evaluation tests, the proposed ranking-based network embedding method performed significantly better with various algorithms, such as the k-means algorithm and common word/phrase-based clustering methods, than existing network embedding approaches.

Highlights

  • Document clustering involves the grouping of similar documents into clusters

  • Preliminary study (Word2Vec and Doc2Vec): Word2Vec learns to predict the context words w_j, w_{j+1}, ..., w_{j+n} of an input word w_i, which belongs to a word set w consisting of V words, by using a neural-network-based model that consists of an input layer, a projection layer, and an output layer

  • While Word2Vec is a technique for embedding words, Doc2Vec is a technique for embedding documents, paragraphs, or sentences
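
The three-layer structure mentioned in the highlights (input layer, projection layer, output layer) can be illustrated with a minimal forward pass. This is only a sketch under stated assumptions, not the authors' implementation: the vocabulary size `V`, the embedding size `D`, the random initialization, and the function names are all hypothetical, and no training step is shown.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_matrix(rows, cols, rng):
    """Small random weights (an assumed initialization scheme)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def skipgram_forward(word_index, w_in, w_out):
    """Predict a distribution over context words for one input word.

    Input layer: the word is represented by its index (a one-hot vector).
    Projection layer: the one-hot vector selects one row of w_in.
    Output layer: one score per vocabulary word, normalized by softmax.
    """
    hidden = w_in[word_index]
    vocab_size = len(w_out[0])
    scores = [
        sum(h * w_out[d][v] for d, h in enumerate(hidden))
        for v in range(vocab_size)
    ]
    return softmax(scores)


rng = random.Random(0)
V, D = 5, 3                    # hypothetical vocabulary and embedding sizes
w_in = init_matrix(V, D, rng)  # input-to-projection weights (word vectors)
w_out = init_matrix(D, V, rng) # projection-to-output weights
probs = skipgram_forward(2, w_in, w_out)
print(probs)
```

After training, the rows of `w_in` would serve as the word embeddings. Doc2Vec extends the same idea by learning a vector for each document, paragraph, or sentence.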


Summary

INTRODUCTION

Document clustering involves the grouping of similar documents into clusters. It is a well-established approach for obtaining insight and performing analyses by clustering large volumes of documents without using a prebuilt high-cost learning set. Search-engine-based significance analysis ranks documents based on their semantic relevance and focuses on information related to their internal meaning. One disadvantage of this method is that it is prone to abuse, which refers to the intentional inclusion of important words that are irrelevant to the actual content of the documents. As stated above, we used a method of measuring document significance based on indices that are completely independent of the content of the documents in question, such as the number of downloads in the case of mobile apps. This approach fundamentally prevents abuse, which is the biggest problem encountered in network-based document significance analysis. We calculated the similarity between documents by comparing the latent vectors of the documents, which were learned by evaluating the quality of documents based on their external ranking information.
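
Comparing latent vectors can be sketched as follows. The summary does not say which similarity measure was used, so cosine similarity is an assumption here, and the document names and vector values are made up purely for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def most_similar(name, vectors):
    """Return the name of the other document closest to `name`."""
    return max(
        (other for other in vectors if other != name),
        key=lambda other: cosine_similarity(vectors[name], vectors[other]),
    )


# Hypothetical latent vectors for three documents (for example, app descriptions).
doc_vectors = {
    "app_a": [1.0, 0.0, 1.0],
    "app_b": [2.0, 0.0, 2.0],
    "app_c": [0.0, 3.0, 0.0],
}

print(most_similar("app_a", doc_vectors))
```

Because cosine similarity ignores vector length, `app_a` and `app_b` count as near-identical even though their magnitudes differ. A clustering algorithm such as k-means could then be run on the same vectors.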

FRAMEWORK
NET2VEC
PRELIMINARY STUDY
NE WITH AUTHORITY RANKING LOSS
Findings
CONCLUSION