Abstract

In this paper, we present LODsyndesis, a suite of services over the datasets of the entire Linked Open Data Cloud, which offers fast, content-based dataset discovery and object co-reference. Emphasis is placed on supporting scalable cross-dataset reasoning for finding all information about any entity, together with its provenance. Other tasks that can benefit from these services are those related to the quality and veracity of data: collecting all information about an entity, and the cross-dataset inference that this makes feasible, allows spotting existing contradictions, and also provides information for data cleaning or for estimating and suggesting which data are probably correct or more accurate. In addition, we show how these services can assist the enrichment of existing datasets with more features for obtaining better predictions in machine learning tasks. Finally, we report measurements that reveal the sparsity of the current datasets as regards their connectivity, which in turn justifies the need for advancing the current methods for data integration. Measurements focusing on the cultural domain are also included, specifically measurements over datasets using CIDOC CRM (Conceptual Reference Model), and connectivity measurements of British Museum data. The services of LODsyndesis are based on special indexes and algorithms and allow the indexing of 2 billion triples in around 80 min using a cluster of 96 computers.

Highlights

  • In recent years, a large volume of open data has been published and this volume keeps increasing. Such open data need to be Findable, Accessible, Interoperable and Reusable (FAIR; see [1] for more information on the FAIR principles), and for this reason there is an effort to use standards and good practices to achieve these targets

  • The main difficulties are as follows: (i) publishers tend to use different models and formats for the representation of their data; (ii) different URIs (Uniform Resource Identifiers) or languages are used for describing the same entities; (iii) publishers describe their data by using different concepts, e.g., CIDOC CRM (Conceptual Reference Model) [3] represents the birth date of a person as an event, while DBpedia [4] uses a single triple for the same fact; (iv) data from different sources can be inconsistent or conflicting; (v) a lot of complementary information occurs in different sources; and (vi) many datasets are updated very frequently

  • We observed that the publications domain is more connected compared to the average connectivity in the LOD Cloud


Summary

Introduction

A large volume of open data has been published and this volume keeps increasing. In order to find all URIs and facts about an entity, say El Greco, we have to index and enrich numerous datasets through cross-dataset inference. For this reason, i.e., to assist the process of semantic integration of data at large scale, we have designed and developed novel indexes, methods and tools [5,6,7]. The major characteristic of LODsyndesis is that it indexes the whole content of hundreds of datasets in the Linked Open Data cloud, taking into consideration the closure of equivalence relationships, and to the best of our knowledge LODsyndesis is the “largest knowledge graph of Linked Data that includes all inferred equivalence relationships”. All these semantics-aware indexes are exploited to perform fast connectivity analytics and to offer advanced connectivity services that are of primary importance for several real-world tasks.
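
To make the notion of "closure of equivalence relationships" more concrete, the following is a minimal, single-machine sketch in Python. It is not the LODsyndesis implementation, which relies on special indexes and a cluster; it only illustrates the idea with a union-find structure over `owl:sameAs` statements. The URIs, predicates and dataset labels (`a:ElGreco`, `ex:birthPlace`, `D1`, and so on) are made up for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

SAME_AS = "owl:sameAs"


def find(parent, x):
    """Return the representative of x's class, compressing the path."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def equivalence_classes(quads):
    """Group URIs connected by owl:sameAs, directly or transitively."""
    parent = {}
    for s, p, o, _dataset in quads:
        if p != SAME_AS:
            continue
        parent.setdefault(s, s)
        parent.setdefault(o, o)
        root_s, root_o = find(parent, s), find(parent, o)
        if root_s != root_o:
            parent[root_o] = root_s  # merge the two classes
    classes = defaultdict(set)
    for uri in parent:
        classes[find(parent, uri)].add(uri)
    return list(classes.values())


def facts_about(uri, quads):
    """Collect (predicate, object, dataset) for every URI equivalent to uri."""
    classes = equivalence_classes(quads)
    same = next((c for c in classes if uri in c), {uri})
    return {(p, o, d) for s, p, o, d in quads if s in same and p != SAME_AS}


# Toy input: (subject, predicate, object, source dataset). All values are invented.
quads = [
    ("a:ElGreco", SAME_AS, "b:El_Greco", "D1"),
    ("b:El_Greco", SAME_AS, "c:Greco", "D2"),
    ("a:ElGreco", "ex:birthPlace", "ex:Crete", "D1"),
    ("c:Greco", "ex:movement", "ex:Mannerism", "D3"),
    ("x:Other", "ex:birthPlace", "ex:Paris", "D3"),
]

if __name__ == "__main__":
    for fact in sorted(facts_about("c:Greco", quads)):
        print(fact)
```

In this toy input no dataset states directly that `a:ElGreco` and `c:Greco` are the same entity; the link is inferred through `b:El_Greco`. Querying any one of the three URIs therefore returns the facts from both `D1` and `D3`, each tagged with the dataset it came from, which is a small-scale analogue of finding all information about an entity together with its provenance.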

RDF and Linked Data
Related Work
Semantic Indexing Process
Performing Connectivity Analytics
LODsyndesis Services and Use Cases
How to Find the URI of an Entity
Connectivity Analytics for Publications Domain
Connectivity Analytics for British Museum
Conclusions about Connectivity of the LOD Cloud
Findings
Conclusions
