Abstract

Ontology-based data management and knowledge graphs have emerged in recent years as efficient approaches for managing and utilizing large and diverse data sets. Research on algorithms for automatic semantic labeling and modeling, a prerequisite for both, has made steady progress in the form of new approaches. These algorithms vary in the type of information used (data schema, values, or metadata) as well as in the underlying methodology (e.g., use of different machine learning methods or external knowledge bases). The approaches established over the years, however, still come with various weaknesses. Most are evaluated on a few small data corpora specific to the approach. This reduces comparability and limits what can be said about the general applicability and performance of those approaches. Other research areas, such as computer vision or natural language processing, solve this problem by providing unified data corpora for the evaluation of specific algorithms and tasks. In this paper, we present and publish VC-SLAM to lay the necessary foundation for future research. This corpus allows the evaluation and comparison of semantic labeling and modeling approaches across different methodologies, and it is the first corpus that additionally allows textual data documentation to be leveraged for semantic labeling and modeling. Each of the 101 data sets it contains consists of labels, data, and metadata, as well as corresponding semantic labels and a semantic model that were manually created by human experts using an ontology built explicitly for the corpus. We provide statistical information about the corpus and a critical discussion of its strengths and shortcomings, and we test the corpus with existing methods for labeling and modeling.

Highlights

  • Semantic mapping as an essential component of ontology-based data management (OBDM) and knowledge graph creation has received increased attention in recent years

  • In order to explore the use of data corpora in the field of semantic mapping, we first distinguish between the different methods, as they depend on different types of data that need to be available in a data corpus

  • We introduced the Versatile Corpus for Semantic Labeling And Modeling (VC-SLAM)


Summary

Introduction

Semantic mapping, an essential component of ontology-based data management (OBDM) and knowledge graph creation, has received increased attention in recent years. In this context, we understand semantic mapping as the linking of data attributes of a data set with elements of an ontology. Existing data sets are adapted to the previously used methods for semantic modeling (schema- and data-driven) and contain only the necessary labels and values, since there was no need for more data. This hinders the investigation of modeling methodologies that have not been the focus of past research but offer further potential, such as metadata-based semantic modeling [2].
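
To make the notion of semantic mapping concrete, the following minimal Python sketch represents a mapping as a set of links between data attributes and ontology elements. It is an illustration only, not the format used by VC-SLAM: the class names (`SemanticLabel`, `SemanticModel`), the attribute names (`lat`, `lon`), and the `ex:` ontology identifiers are all hypothetical. The `relations` field assumes that a semantic model additionally connects the mapped ontology elements to one another, which is a common reading of the term rather than something stated here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SemanticLabel:
    """Link between one data attribute and one ontology element (hypothetical type)."""
    attribute: str  # attribute (e.g. column) name in the raw data set
    concept: str    # identifier of the ontology element it is mapped to


@dataclass
class SemanticModel:
    """Labels plus relations between ontology elements (assumed structure)."""
    labels: List[SemanticLabel] = field(default_factory=list)
    # Each relation is a (subject, predicate, object) triple of ontology identifiers.
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def concept_for(self, attribute: str) -> Optional[str]:
        """Return the ontology element mapped to an attribute, or None if unmapped."""
        for label in self.labels:
            if label.attribute == attribute:
                return label.concept
        return None


# Hypothetical example: two attributes of a sensor data set mapped to ontology elements.
example_model = SemanticModel(
    labels=[
        SemanticLabel(attribute="lat", concept="ex:Latitude"),
        SemanticLabel(attribute="lon", concept="ex:Longitude"),
    ],
    relations=[
        ("ex:Location", "ex:hasLatitude", "ex:Latitude"),
        ("ex:Location", "ex:hasLongitude", "ex:Longitude"),
    ],
)
```

In this sketch, `example_model.concept_for("lat")` looks up the ontology element linked to the `lat` attribute, and an attribute without a label yields `None`.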

Landscape of Semantic Corpora
Semantic Labeling and Modeling
Utilized Corpora and Resulting Obstacles
Objectives for Semantic Mapping Corpora
Description of the Corpus
Data Set Identification and Acquisition
Modeling Setup
Modeling
Statistics and Discussion
Raw Data
Metadata
Semantic Models
Final Ontology
VC-SLAM for Semantic Labeling
VC-SLAM for Semantic Modeling
Findings
Limitations
Conclusions and Outlook
