Abstract

With the advance of high-throughput technologies, biological data sources are growing at an exponential rate. Data integration systems that combine data from heterogeneous sources help biologists investigate the outcomes of their experiments. However, the heterogeneity of these sources at the syntactic, schema, and semantic levels still poses considerable challenges for achieving interoperability among them. In this chapter, we propose a new semantic data integration system based on a mediator approach. The system offers a unified interface for query processing and data exploration over four well-known proteomic data sources: UniProt (protein annotation), STRING (protein-protein interaction), PDB (protein structure), and PubMed (biomedical citations). A domain ontology lets users formulate their queries in terms defined in the ontology. We present a query rewriting algorithm that uses the annotated ontology to convert queries posed over the ontology into queries over the sources. The architecture uses the Apache Spark framework to perform the query rewriting and execution needed to query the integrated data sources.

