Abstract

Clustering analysis is a basic and essential method for mining heterogeneous information networks, which consist of multiple types of objects and rich semantic relations among different object types. Heterogeneous information networks are ubiquitous in real-world applications, such as bibliographic networks and social media networks. Unfortunately, most existing approaches, such as spectral clustering, are designed to analyze homogeneous information networks, which are composed of only one type of object and one type of link. Some recent studies have focused on heterogeneous information networks and produced methods such as RankClus and NetClus. However, these methods often assume that the heterogeneous information network follows a simple schema, such as a bityped network schema or a star network schema. To overcome these limitations, we model the heterogeneous information network as a tensor, without any restriction on the network schema. Then, a tensor CP decomposition method is adapted to formulate the clustering problem in heterogeneous information networks. Further, we develop two stochastic gradient descent algorithms, namely SGDClus and SOSClus, which cluster multityped objects effectively and simultaneously. Experimental results on both synthetic datasets and a real-world dataset demonstrate that the proposed clustering framework can model heterogeneous information networks efficiently and can outperform state-of-the-art clustering methods.

Highlights

  • Heterogeneous information networks are ubiquitous in real-world applications

  • The hand-drawn blue circles over the SOSClus curve in Figure 4 show that SOSClus can escape from a local minimum and find the global minimum, while SGDClus stops at the first local minimum it reaches

  • The cost in each iteration is O(K|E| + (K^2 + K)N), where K is the number of clusters, |E| is the number of edges in the network, and N is the total number of objects in the network


Summary

Introduction

Heterogeneous information networks are ubiquitous in real-world applications. They consist of multiple types of objects and rich semantic relations among different object types. For example, the DBLP database is an open resource that contains most bibliographic information on computer science. The corresponding bibliographic network contains four types of objects: author (A), paper (P), venue (i.e., conference or journal) (V), and term (T). Edges between author and paper are labeled “write” or “written by”, edges between paper and venue are labeled “publish” or “published by”, and edges between paper and term are labeled “contain” or “contained in”.
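To make the schema above concrete, the following is a minimal sketch of how such a bibliographic network might be stored as typed edges and then turned into the index set of a sparse four-mode tensor. The toy objects (a1, p1, v1, t1, and so on), the helper names, and the rule that one tensor entry corresponds to one (author, paper, venue, term) combination joined through a shared paper are all assumptions made for illustration; the paper's own tensor construction may differ.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy data following the A/P/V/T schema described in the text.
WRITES = [("a1", "p1"), ("a2", "p1"), ("a2", "p2")]      # author "write" paper
PUBLISHED_IN = [("p1", "v1"), ("p2", "v2")]              # paper "published by" venue
CONTAINS = [("p1", "t1"), ("p1", "t2"), ("p2", "t2")]    # paper "contain" term


def group_by_paper(edges, paper_first):
    """Map each paper to the objects linked to it by one relation."""
    grouped = defaultdict(list)
    for left, right in edges:
        paper, other = (left, right) if paper_first else (right, left)
        grouped[paper].append(other)
    return grouped


def tensor_entries(writes, published_in, contains):
    """Return the nonzero (author, paper, venue, term) index tuples.

    Assumption: an entry is nonzero when the author, venue, and term
    are all linked to the same paper.
    """
    authors = group_by_paper(writes, paper_first=False)
    venues = group_by_paper(published_in, paper_first=True)
    terms = group_by_paper(contains, paper_first=True)
    entries = set()
    for paper, paper_authors in authors.items():
        combos = product(paper_authors, venues.get(paper, []), terms.get(paper, []))
        for author, venue, term in combos:
            entries.add((author, paper, venue, term))
    return entries


entries = tensor_entries(WRITES, PUBLISHED_IN, CONTAINS)
for entry in sorted(entries):
    print(entry)
```

Storing only the nonzero index tuples keeps the representation sparse, which matters because a dense tensor over all authors, papers, venues, and terms would be far larger than the number of links actually observed.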
