GIScience 2016 Short Paper Proceedings

A Data-Driven Approach for Detecting and Quantifying Modeling Biases in Geo-Ontologies Using a Discrepancy Index

Bo Yan, Krzysztof Janowicz, and Yingjie Hu
STKO Lab, Department of Geography, University of California, Santa Barbara, USA
{boyan,jano,yingjiehu}@geog.ucsb.edu

Abstract

Geo-ontologies play an important role in fostering the publication, retrieval, reuse, and integration of geographic data within and across domains. The status quo of geo-ontology engineering often follows a centralized top-down approach, namely a group of domain experts collaboratively formalizing key concepts and their relationships. On the one hand, such an approach makes use of the invaluable knowledge and experience of subject matter experts and captures their perception of the world. On the other hand, it can introduce biases and ontological commitments that do not correspond well to the data that will be semantically lifted using these ontologies. In this work, we propose a data-driven method to calculate a Discrepancy Index in order to identify and quantify the potential modeling biases in current geo-ontologies. In other words, instead of trying to measure quality, we determine how much the ontology differs from what would be expected when looking at the data alone.

Keywords: geo-ontology; ontology engineering; DBpedia; Linked Data; Discrepancy Index

Introduction

Due to the diverse and eclectic nature of geographic information, geographic data usually come from different sources and in different formats, and are conceptualized from different perspectives. These heterogeneities in terms of provenance and standards create a barrier to integrating data for more comprehensive analysis. Geo-ontologies provide a promising way to alleviate this long-standing issue by enabling a flexible integration of geographic information based on semantics, i.e., regardless of representational choices and syntax.

However, the common practice of developing geo-ontologies top-down, by a team of knowledge engineers and domain experts, carries the risk of generating biased or unsuitable geo-ontologies (Hu and Janowicz, 2016). To give a concrete example, in the current version of DBpedia's ontology (DBpedia 2015-10), the class Canal is classified as a sibling class of River, and both are defined as subclasses of Stream. This seems to be a rational classification at first glance since canals are usually channels of water. However, Stream is a subclass of BodyOfWater, and BodyOfWater is a subclass of NaturalPlace. Due to the transitivity of the rdfs:subClassOf relationship, canals become natural places. This is an odd modeling choice, as a canal is defined as “an artificial waterway constructed to allow the passage of boats or ships inland or to convey water for irrigation” according to the Oxford dictionary. Words such as “artificial” and “constructed” make canals man-made features rather than natural places. This example indicates that top-down geo-ontologies may suffer from issues such as modeling biases, oversights, and ontological commitments that do not represent the real data needs well.

Scrutinizing the geo-ontologies and making revisions manually on a regular basis are common solutions to such problems. But such methods are usually labor-intensive and create a gap between the geo-ontology and its corresponding Linked Dataset. In this research, we introduce initial results on a Discrepancy Index that helps geo-ontology engineers by detecting and quantifying potential issues using a series of data mining steps.

Proposed Method

Our approach consists of two parallel threads. The first thread comes from Linked Datasets that are transformed from unstructured data, such as Wikipedia pages. This thread focuses on the bottom-up