Abstract

Open tabular data published under open government initiatives typically contains a spatial dimension, a temporal dimension, and the numeric data itself, which captures information such as health indicators, pollution readings, and sanitation status. Semantic harmonisation of numeric data entails linking numeric data columns with web-accessible semantic entities from an ontology, a machine-readable knowledge representation. These semantic entities are embedded in a knowledge graph, allowing information from disparate sources to be integrated under common semantic definitions across spatial and temporal dimensions. Multiple research efforts have contributed to recovering the semantics of numeric columns in tables; however, they are either restricted to a single domain or rely on the numeric data already existing as linked data tuples in known ontologies. We present a novel yet simple approach that uses a supervised machine learning classifier (Random Forests) and semantic web techniques to generate semantics for numeric columns in tabular data. The approach has been tested, with encouraging results, on over 100 tabular datasets downloaded from data.gov.in (the Indian Open Government Data Portal), spanning multiple domains such as Health and Family Welfare, Agriculture, and Environment. We also present a use case for this work, which is being implemented in collaboration with the ministries of the Government of Karnataka for knowledge aggregation and dissemination of sustainable development data.
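
To make the notion of linking a numeric column to an ontology entity concrete, the following is a minimal sketch of how such a link might be serialised as N-Triples. It is an illustration under stated assumptions, not the method used in this work: the `example.org` URIs, the `hasSemanticType` property, and the names `ColumnAnnotation`, `column_uri`, and `to_ntriples` are all hypothetical, and the concept URI that a classifier would predict is supplied by hand here.

```python
from dataclasses import dataclass

# Hypothetical namespace; none of these URIs come from the paper.
BASE = "http://example.org/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


@dataclass(frozen=True)
class ColumnAnnotation:
    dataset_id: str    # identifier of the source table
    column_name: str   # header of the numeric column
    concept_uri: str   # ontology entity the column is linked to (assumed given)


def column_uri(ann: ColumnAnnotation) -> str:
    """Build a web-style identifier for one column of one dataset."""
    slug = ann.column_name.strip().lower().replace(" ", "_")
    return f"{BASE}dataset/{ann.dataset_id}/column/{slug}"


def to_ntriples(ann: ColumnAnnotation) -> list[str]:
    """Return two N-Triples statements: a label and the semantic link."""
    subject = column_uri(ann)
    # Escape backslashes and double quotes for the N-Triples string literal.
    label = ann.column_name.replace("\\", "\\\\").replace('"', '\\"')
    return [
        f'<{subject}> <{RDFS_LABEL}> "{label}" .',
        f'<{subject}> <{BASE}vocab/hasSemanticType> <{ann.concept_uri}> .',
    ]


if __name__ == "__main__":
    ann = ColumnAnnotation(
        dataset_id="health_2015",
        column_name="Infant Mortality Rate",
        concept_uri=f"{BASE}ontology/InfantMortalityRate",
    )
    print("\n".join(to_ntriples(ann)))
```

Once columns from different tables point at the same concept URI, they can be queried together in a knowledge graph regardless of how their headers were originally worded.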
