Network-Based Document Clustering Using External Ranking Loss for Network Embedding

Yeo Chan Yoon, Hyung Kuen Gee, Heuiseok Lim

doi:10.1109/access.2019.2948662

Open Access

https://doi.org/10.1109/access.2019.2948662

Abstract

Network-based document clustering forms clusters of documents based on their significance and the strength of the relationships among them. This approach can be used with various types of metadata that express the significance of the documents and the relationships among them. In this study, we defined a probabilistic network graph for fine-grained document clustering and developed a probabilistic generative model and a calculation method. Furthermore, we devised a novel neural-network-based network embedding learning method that considers the significance of a document based on its rankings under external measures, such as the download counts of relevant files, and reflects the relationship strength between documents. By considering the significance of a document, reputable documents of clusters can be centralized and shown as representative documents for tasks such as data analysis and data representation. In evaluation tests, the proposed ranking-based network embedding method performed significantly better across various algorithms, such as the k-means algorithm and common word/phrase-based clustering methods, than existing network embedding approaches.

Highlights

Document clustering involves the grouping of similar documents into clusters.
Preliminary study (Word2Vec and Doc2Vec): Word2Vec learns to predict the context words w_j, w_{j+1}, ..., w_{j+n} of an input word w_i, which belongs to a word set w consisting of V words, by using a neural-network-based model that consists of an input layer, a projection layer, and an output layer.
While Word2Vec is a technique for embedding words, Doc2Vec is a technique for embedding documents, paragraphs, or sentences.
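
As a rough illustration of the Word2Vec highlight, the sketch below builds the (input word, context word) training pairs that a skip-gram style model would learn to predict. It is not the authors' code: the function name `skipgram_pairs`, the symmetric `window` parameter, and the toy token list are assumptions made for illustration, and the neural network itself (input, projection, and output layers) is omitted.

```python
def skipgram_pairs(tokens, window=2):
    """Return (input_word, context_word) pairs for a skip-gram style model.

    Assumption: the context of a word is every other word within
    `window` positions on either side of it.
    """
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy sentence; a real corpus would be far larger.
tokens = ["network", "based", "document", "clustering"]
pairs = skipgram_pairs(tokens, window=1)
for input_word, context_word in pairs:
    print(input_word, "->", context_word)
```

Each pair would serve as one training example: the model receives the input word and is trained to assign high probability to the context word. Doc2Vec extends the same idea by adding a vector for the document, paragraph, or sentence.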

Summary

INTRODUCTION

Document clustering involves the grouping of similar documents into clusters. It is a well-established approach for obtaining insight and performing analyses by clustering large volumes of documents without using a prebuilt, high-cost learning set. Search-engine-based significance analysis ranks documents based on their semantic relevance and focuses on information related to their internal meaning. One disadvantage of this method is that it is prone to abuse, that is, the intentional inclusion of important words that are irrelevant to the actual content of the documents. Instead, we used a method of measuring document significance based on indices that are completely independent of the content of the documents in question, such as the number of downloads in the case of mobile apps. This approach fundamentally prevents abuse, which is the biggest problem encountered in network-based document significance analysis. We calculated the similarity between documents by comparing the latent vectors of the documents, which were learned by evaluating the quality of documents based on their external ranking information.
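
The two ideas in this paragraph can be sketched in a few lines. The summary does not give the paper's similarity measure or loss function, so the code below swaps in two generic stand-ins: cosine similarity for comparing latent vectors, and a pairwise hinge ranking loss that penalizes model scores whose order disagrees with an external index such as download counts. All names (`cosine_similarity`, `pairwise_ranking_loss`, `margin`) and all numbers are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two latent vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_ranking_loss(scores, external_counts, margin=1.0):
    """Generic pairwise hinge loss (an assumption, not the paper's loss).

    For every pair where document i has a higher external count than
    document j, the loss is zero only if scores[i] exceeds scores[j]
    by at least `margin`.
    """
    loss = 0.0
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if external_counts[i] > external_counts[j]:
                loss += max(0.0, margin - (scores[i] - scores[j]))
    return loss


# Hypothetical latent vectors for three mobile-app descriptions.
vec_a, vec_b, vec_c = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]
sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)

# Hypothetical model scores and download counts for the same three apps.
downloads = [1000, 10, 500]
loss_partly_wrong = pairwise_ranking_loss([2.0, 0.5, 1.0], downloads)
loss_well_ordered = pairwise_ranking_loss([3.0, 0.0, 1.5], downloads)
print(sim_ab, sim_ac, loss_partly_wrong, loss_well_ordered)
```

Because the download counts never depend on the document text, stuffing a description with popular keywords cannot lower this kind of loss, which is the abuse-resistance property the paragraph describes.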

FRAMEWORK

NET2VEC

PRELIMINARY STUDY

NE WITH AUTHORITY RANKING LOSS

Findings

CONCLUSION

Journal: IEEE Access
Publication Date: Jan 1, 2019
License type: CC BY 4.0
