Distributed Relationship Mining over Big Scholar Data

Da Zhang, Mansur R Kabuka

doi:10.1109/tetc.2018.2829772

Abstract

In this paper, we propose a system infrastructure that organizes big scholar data into a large knowledge graph, discovers the meta paths between entities, and calculates the relevancy between entities in the graph. The core infrastructure is built on the secure, private Amazon Elastic Compute Cloud (Amazon EC2) platform. It distributes the data evenly across the repositories, processes the data in parallel using the open-source Spark framework, manages computing resources using YARN and Hadoop HDFS, and discovers relationships between different types of entities in a distributed manner. On top of this infrastructure, we implement four relationship discovery tasks: citation recommendation, potential collaborator discovery, similar venue measurement, and paper-to-venue recommendation. For these relationship mining tasks, we propose a mixed and weighted meta path (MWMP) method to explore the potential relationships between different types of entities. To verify the accuracy of our algorithm and measure its parallelization speedup, we set up clusters on the Amazon EC2 platform.
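
The abstract does not define MWMP, so the following is only a minimal single-machine sketch of the general idea of scoring relevancy by combining several meta paths with weights. It is not the authors' method. The toy graph, the relation names (`writes`, `published_in`), the two meta paths (author-paper-author and author-paper-venue-paper-author), the helper functions, and the weights 0.7 and 0.3 are all assumptions made for illustration.

```python
from collections import defaultdict

def compose(r1, r2):
    """Chain two relations stored as {src: {dst: count}}, counting path instances."""
    out = defaultdict(lambda: defaultdict(int))
    for src, mids in r1.items():
        for mid, c1 in mids.items():
            for dst, c2 in r2.get(mid, {}).items():
                out[src][dst] += c1 * c2
    return {src: dict(dsts) for src, dsts in out.items()}

def transpose(r):
    """Reverse the direction of a relation."""
    out = defaultdict(dict)
    for src, dsts in r.items():
        for dst, c in dsts.items():
            out[dst][src] = c
    return dict(out)

def path_counts(relations):
    """Count instances of a meta path given as a sequence of relations."""
    result = relations[0]
    for r in relations[1:]:
        result = compose(result, r)
    return result

def weighted_relevance(x, y, weighted_paths):
    """Weighted sum of raw path counts between x and y (hypothetical scoring rule)."""
    return sum(w * counts.get(x, {}).get(y, 0) for w, counts in weighted_paths)

# Hypothetical toy data: authors write papers, papers appear in venues.
writes = {"alice": {"p1": 1, "p2": 1}, "bob": {"p2": 1}, "carol": {"p3": 1}}
published_in = {"p1": {"KDD": 1}, "p2": {"KDD": 1}, "p3": {"KDD": 1}}

# Meta path author-paper-author (co-authorship).
apa = path_counts([writes, transpose(writes)])
# Meta path author-paper-venue-paper-author (shared venue).
apvpa = path_counts(
    [writes, published_in, transpose(published_in), transpose(writes)]
)

# Assumed weights; in practice these would be tuned or learned per task.
weighted_paths = [(0.7, apa), (0.3, apvpa)]
alice_bob = weighted_relevance("alice", "bob", weighted_paths)
alice_carol = weighted_relevance("alice", "carol", weighted_paths)
```

In this toy graph, alice and bob are linked by both meta paths, while alice and carol are linked only through the shared venue, so `alice_bob` comes out higher than `alice_carol`. The raw counts are not normalized, and a distributed version would express `compose` as joins over partitioned data rather than nested loops.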

Full Text