Background and objective: The adoption of new technologies in clinical care systems has made a large amount of valuable data available. However, this data is usually heterogeneous and must be harmonized before it can be integrated and analysed. We propose a semantic-driven harmonization framework that (1) enables the meaningful sharing and integration of healthcare data across institutions and (2) facilitates the analysis and exploitation of the shared data.

Methods: The framework includes an ontology-based common data model (the SCDM), a data transformation pipeline and a semantic query system. Heterogeneous datasets, mapped to different terminologies, are integrated by using an ontology-based infrastructure rooted in a top-level ontology. A graph database is generated from these mappings, and a web-based semantic query system facilitates data exploration.

Results: Several datasets from different European institutions have been integrated by using the framework in the context of the European H2020 Precise4Q project. Through the query system, data scientists were able to explore the data and use it to build machine learning models.

Conclusions: The flexible data representation using RDF, together with the formal semantic underpinning provided by the SCDM, has enabled the semantic integration, querying and advanced exploitation of heterogeneous data in the context of the Precise4Q project.