A Methodology for Mining Document-Enriched Heterogeneous Information Networks

M Grcar, N Lavrac, N Trdin

doi:10.1093/comjnl/bxs058

Abstract

The paper presents a new methodology for mining heterogeneous information networks, motivated by the fact that, in many real-life scenarios, documents are available in heterogeneous information networks, such as interlinked multimedia objects containing titles, descriptions and subtitles. The methodology consists of transforming documents into bag-of-words vectors, decomposing the corresponding heterogeneous network into separate graphs, computing structural-context feature vectors with PageRank, and finally, constructing a common feature vector space in which knowledge discovery is performed. We exploit this feature vector construction process to devise an efficient centroid-based classification algorithm. We demonstrate the approach by applying it to the task of categorizing video lectures. We show that our approach exhibits low time and space complexity without compromising the classification accuracy. In addition, we provide a qualitative analysis of the results by employing a data visualization technique.
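The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The toy texts, the single "linked lectures" graph, the use of personalized PageRank with restart as the structural-context vector, the equal weighting of the two views, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def bag_of_words(texts):
    """Term-frequency vectors over a shared vocabulary, L2-normalised."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    rows = []
    for t in texts:
        counts = Counter(t.lower().split())
        rows.append(l2_normalize([counts[w] for w in vocab]))
    return rows

def personalized_pagerank(adj, source, damping=0.85, iters=50):
    """Power iteration with restart at `source` (assumed variant).
    `adj` maps each node index to a list of neighbour indices."""
    n = len(adj)
    rank = [0.0] * n
    rank[source] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        for u in range(n):
            if adj[u]:
                share = damping * rank[u] / len(adj[u])
                for v in adj[u]:
                    nxt[v] += share
            else:
                nxt[source] += damping * rank[u]  # dangling node: jump back
        nxt[source] += 1.0 - damping  # restart probability
        rank = nxt
    return rank

def combine(bow_rows, graph_rows, weight=0.5):
    """Concatenate both views; the square-root scaling makes the dot product
    of two combined vectors a weighted sum of the per-view dot products."""
    a, b = math.sqrt(weight), math.sqrt(1.0 - weight)
    return [[a * x for x in bw] + [b * y for y in l2_normalize(gr)]
            for bw, gr in zip(bow_rows, graph_rows)]

def train_centroids(vectors, labels):
    """One normalised centroid per class."""
    sums = {}
    for vec, lab in zip(vectors, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
    return {lab: l2_normalize(acc) for lab, acc in sums.items()}

def classify(vec, centroids):
    """Return the label of the centroid with the highest dot product."""
    dot = lambda p, q: sum(x * y for x, y in zip(p, q))
    return max(centroids, key=lambda lab: dot(vec, centroids[lab]))

# Hypothetical toy data: four lecture descriptions and one link graph.
texts = ["graph mining network", "network pagerank graph",
         "protein gene cell", "cell biology gene"]
adj = {0: [1], 1: [0], 2: [3], 3: [2]}
labels = ["cs", "cs", "bio", "bio"]

bow = bag_of_words(texts)
ppr = [personalized_pagerank(adj, i) for i in range(len(texts))]
features = combine(bow, ppr)

# Train on lectures 0 and 2, then classify the held-out lectures 1 and 3.
centroids = train_centroids([features[0], features[2]], [labels[0], labels[2]])
predictions = [classify(features[i], centroids) for i in (1, 3)]
```

With several graphs derived from the network, one PageRank vector per graph would be appended in the same way, each with its own weight. Because every class is represented by a single centroid, classifying a new item costs one dot product per class.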

Full Text