Abstract

In recent years, an increasing number of knowledge bases have been built as linked data, and the resulting datasets have grown substantially. It is neither reasonable to store a large amount of triple data in a single graph, nor appropriate to store RDF in named graphs keyed by class URIs, because the many joins required across graphs can cause performance problems. This paper presents an agglomerative-adapted partition approach for large-scale graphs, realized as a bottom-up merging process. The proposed algorithm partitions triple data at three levels: blank nodes, associated nodes, and inference nodes. Blank nodes, as well as classes and nodes involved in reasoning rules, are better stored with an optimal neighbor node in the same partition rather than split across separate partitions. Associated nodes are merged starting from the node with the smallest merging cost, and this step is repeated until the target number of partitions is reached. Finally, the feasibility and rationality of the merging algorithm are analyzed in detail through bibliographic cases. The partitioning methods proposed in this paper can be applied to distributed storage, data retrieval, data export, and semantic reasoning over large-scale triple graphs. In future work, we will investigate setting the number of partitions automatically with machine learning algorithms.
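As an illustration of the bottom-up merging idea, the sketch below shows one way an agglomerative partition of an RDF-like graph could be implemented. This is a minimal sketch, not the paper's exact algorithm: the function name `agglomerative_partition`, the brute-force pair search, and the cost function are our own illustrative assumptions, and the paper's separate handling of blank nodes and inference nodes is only noted in a comment.

```python
from collections import defaultdict

def agglomerative_partition(triples, target_partitions):
    """Merge graph nodes bottom-up until `target_partitions` partitions remain."""
    # Build an undirected adjacency view over subjects and objects.
    adjacency = defaultdict(set)
    for s, _, o in triples:
        adjacency[s].add(o)
        adjacency[o].add(s)

    # In the paper's scheme, blank nodes and inference nodes are first
    # attached to an optimal neighbor; here we only sketch the
    # associated-node phase, starting from singleton partitions.
    partitions = {node: {node} for node in adjacency}

    def merge_cost(a, b):
        # Illustrative cost (an assumption, not the paper's definition):
        # favor pairs connected by many edges, penalize combined size.
        cross = sum(1 for n in partitions[a]
                    for m in adjacency[n] if m in partitions[b])
        return len(partitions[a]) + len(partitions[b]) - cross

    # Repeatedly merge the cheapest pair of partitions (brute force).
    while len(partitions) > target_partitions:
        reps = list(partitions)
        a, b = min(((x, y) for i, x in enumerate(reps) for y in reps[i + 1:]),
                   key=lambda pair: merge_cost(*pair))
        partitions[a] |= partitions.pop(b)
    return list(partitions.values())


# Hypothetical bibliographic triples, echoing the paper's test domain.
triples = [
    ("book1", "dc:creator", "author1"),
    ("book1", "dc:subject", "topic1"),
    ("book2", "dc:creator", "author1"),
    ("book2", "dc:subject", "topic2"),
]
print(agglomerative_partition(triples, 2))
```

The loop mirrors the abstract's description: at each step the cheapest merge is performed, and the process repeats until the target number of partitions is reached.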

Highlights

  • With the rapid development of linked data, more and more organizations are using this mature technology to build and publish their knowledge bases or datasets (Erkimbaev, Zitserman, Kobzev, Serebrjakov, & Teymurazov, 2013; Knoblock et al., 2017; Liu Wei & Zhu Qinghua, 2019)

  • This paper presents an agglomerative-adapted partition approach for large-scale Resource Description Framework (RDF) graphs

  • According to the characteristics of the ontology structure and triples, we propose a bottom-up, multi-layer node-merging algorithm that comprises blank-node merging, associated-node merging, and inference-node merging


Summary

INTRODUCTION

With the rapid development of linked data, more and more organizations are using this mature technology to build and publish their knowledge bases or datasets (Erkimbaev, Zitserman, Kobzev, Serebrjakov, & Teymurazov, 2013; Knoblock et al., 2017; Liu Wei & Zhu Qinghua, 2019). As can be seen from the latest linked open data (LOD) cloud, there is a growing number of big datasets, such as DBpedia, SciGraph, VIAF, UniProt, and so on. All of these large datasets have become the infrastructure and core components of their fields. DBpedia data is categorized into hundreds of entity classes and has been linked to by a number of applications and datasets. These large datasets often provide segmented downloads for different classes on their official publishing sites. To apply these datasets, we need to dump and restore them in a local repository, which introduces a number of challenges. Against this background, we propose a simple agglomerative-adapted partition approach that can be used to split triples in large-scale graphs.

LITERATURE REVIEW
METHODS
EXPERIMENTS AND ANALYSIS
CONCLUSION AND FUTURE WORK