Abstract

Big data plays a significant role in promoting both manufacturing and scientific development through industrial digitization and emerging interdisciplinary research. Semantic web technologies have also experienced great progress, and scientific communities and practitioners have contributed to the problem of big data management with ontological models, controlled vocabularies, linked datasets, data models, query languages, as well as tools for transforming big data into knowledge from which decisions can be made. Despite the significant impact of big data and semantic web technologies, we are entering a new era where domains like genomics are projected to grow very rapidly in the next decade. In this next era, integrating big data demands novel and scalable tools for enabling not only big data ingestion and curation but also efficient large-scale exploration and discovery. Federated query processing techniques provide a solution to scale up to large volumes of data distributed across multiple data sources. These techniques resort to source descriptions to identify relevant data sources for a query, as well as to find efficient execution plans that minimize the total execution time of a query and maximize the completeness of the answers. This chapter summarizes the main characteristics of a federated query engine, reviews the current state of the field, and outlines the problems that remain open and represent grand challenges for the area.

Highlights

  • Federated query processing techniques provide a solution to scale up to large volumes of data distributed across multiple data sources

  • RDF Molecule Templates (RDF-MTs) are merged based on their semantic descriptions defined by the ontology, e.g., in RDFS

  • SPLENDID provides a hybrid solution by combining Vocabulary of Interlinked Datasets (VoID) descriptions for data source selection along with SPARQL ASK queries submitted to each dataset at run-time for verification
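
The hybrid strategy described for SPLENDID can be sketched in two phases: prune candidate sources with VoID-style predicate descriptions, then confirm each survivor with a SPARQL ASK probe at run time. The following Python sketch simulates this under illustrative assumptions; the endpoint URLs, data, and function names are hypothetical and the ASK check is evaluated against in-memory triples rather than real SPARQL endpoints.

```python
# VoID-style descriptions: the predicates each source claims to serve.
VOID_DESCRIPTIONS = {
    "http://endpoint-a.example/sparql": {"foaf:name", "foaf:knows"},
    "http://endpoint-b.example/sparql": {"dbo:birthPlace", "foaf:name"},
}

# Simulated endpoint contents, standing in for the remote datasets.
ENDPOINT_TRIPLES = {
    "http://endpoint-a.example/sparql": {("ex:alice", "foaf:name", '"Alice"')},
    "http://endpoint-b.example/sparql": {("ex:bob", "dbo:birthPlace", "ex:Berlin")},
}

def ask(endpoint, pattern):
    """Simulate a SPARQL ASK query: does any triple at the endpoint
    match the pattern? Terms starting with '?' are variables."""
    def term_matches(term, pat):
        return pat.startswith("?") or term == pat
    return any(all(term_matches(t, p) for t, p in zip(triple, pattern))
               for triple in ENDPOINT_TRIPLES[endpoint])

def select_sources(pattern):
    """Two-phase source selection: VoID pruning, then ASK verification."""
    _, predicate, _ = pattern
    candidates = [ep for ep, preds in VOID_DESCRIPTIONS.items()
                  if predicate in preds]
    return [ep for ep in candidates if ask(ep, pattern)]

# Both endpoints advertise foaf:name, but only endpoint A actually holds
# a matching triple for this subject, so the ASK phase prunes endpoint B.
print(select_sources(("ex:alice", "foaf:name", "?name")))
# prints ['http://endpoint-a.example/sparql']
```

The run-time ASK phase is what makes the approach hybrid: static VoID statistics alone would have routed the query to both endpoints, producing a wasted request.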

Introduction

The number and variety of data collections have grown exponentially over recent decades, and a similar growth rate is expected in the coming years. Data is usually ingested in myriad unstructured formats and may suffer from reduced quality due to biases, ambiguities, and noise. These issues increase the complexity of data integration solutions. Techniques that solve interoperability issues while addressing the data complexity challenges imposed by big data characteristics are required [402]. Exemplary approaches include GEMMS [365], PolyWeb [244], BigDAWG [119], Ontario [125], and Constance [179]. These systems collect metadata about the main characteristics of the heterogeneous data collections, e.g., their formats and query capabilities. Rich descriptions of the properties and capabilities of the data have proven crucial for enabling these systems to perform query processing effectively.
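
As a concrete illustration of the kind of metadata such systems maintain, the following minimal Python sketch models a source catalog recording format and query capabilities per collection, in the spirit of systems like Ontario or PolyWeb. The record fields, source names, and capability labels are hypothetical, chosen only to show how a federation layer could route a query fragment to sources that can answer it.

```python
from dataclasses import dataclass, field

@dataclass
class SourceDescription:
    """Hypothetical metadata record about one heterogeneous data
    collection: its name, its data format, and the query languages
    its wrapper can accept."""
    name: str
    data_format: str                      # e.g. "RDF", "CSV", "JSON"
    query_capabilities: set = field(default_factory=set)

# An illustrative catalog of heterogeneous sources in a federation.
CATALOG = [
    SourceDescription("clinical-records", "CSV", {"SQL"}),
    SourceDescription("gene-annotations", "RDF", {"SPARQL"}),
    SourceDescription("publications", "JSON", {"MongoQL"}),
]

def sources_supporting(capability):
    """Select the sources to which a query fragment expressed in the
    given query language can be pushed down."""
    return [s.name for s in CATALOG if capability in s.query_capabilities]

print(sources_supporting("SPARQL"))  # prints ['gene-annotations']
```

Keeping such descriptions up to date is exactly what lets a federated engine decide, per subquery, which sources are relevant and which capabilities can be exploited during execution.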

Data Integration Systems
Classification of Data Integration Systems
Data Integration in the Era of Big Data
Federated Query Processing
Data Source Description
Query Decomposition and Source Selection
Query Planning and Optimization
Query Execution
Grand Challenges and Future Work