Abstract

GIScience 2016 Short Paper Proceedings

Which Kobani? A Case Study on the Role of Spatial Statistics and Semantics for Coreference Resolution Across Gazetteers

Rui Zhu, Krzysztof Janowicz, Bo Yan, and Yingjie Hu
STKO Lab, Department of Geography, University of California, Santa Barbara, USA
{ruizhu,jano,boyan,yingjiehu}@geog.ucsb.edu

Abstract

Identifying the same places across different gazetteers is a key prerequisite for spatial data conflation and interlinkage. Conventional approaches mostly rely on combining spatial distance with string matching and structural similarity measures, while ignoring relations among places and the semantics of place types. In this work, we propose to use spatial statistics to mine semantic signatures for place types and use these signatures for coreference resolution, i.e., to determine whether records from different gazetteers refer to the same place. We implement 27 statistical features for computing these signatures and apply them at the type and entity levels to determine the corresponding places between two gazetteers, GeoNames and DBpedia. The city of Kobani, Syria, is used as a running example to demonstrate the feasibility of our approach. The experimental results show that the proposed signatures have the potential to improve the performance of coreference resolution.

Keywords: Spatial statistics, coreference resolution, gazetteers, semantic signatures

Introduction and Motivation

Coreference resolution across gazetteers is an important prerequisite for spatial data conflation and interlinkage. Conventional approaches, such as coordinate matching, string matching, and feature type matching, often focus on the footprints, names, and types of places, as well as the combination of these three properties (Sehgal et al., 2006; Shvaiko and Euzenat, 2013). However, such approaches have their limitations.
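As a rough illustration of the conventional baseline described above (spatial distance combined with string matching), the following minimal sketch may help. It is not the authors' implementation: the function names, the record layout, the 10 km search radius, and the 0.8 name-similarity threshold are all assumptions made for illustration, and the coordinates in the usage note are approximate placeholders.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def name_similarity(a, b):
    """Normalised similarity in [0, 1]; 1.0 means identical (case-insensitive)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def baseline_match(rec_a, rec_b, max_km=10.0, min_name_sim=0.8):
    """Hypothetical baseline: match only if both the centroids and the names are close.

    Records are dicts with 'name', 'lat', 'lon' keys (an assumed layout).
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    close = haversine_km(rec_a["lat"], rec_a["lon"], rec_b["lat"], rec_b["lon"]) <= max_km
    similar = name_similarity(rec_a["name"], rec_b["name"]) >= min_name_sim
    return close and similar
```

Calling `baseline_match({"name": "Kobani", "lat": 36.89, "lon": 38.35}, {"name": "Ayn al-Arab", "lat": 36.89, "lon": 38.35})` returns `False` even though the two placeholder records share a centroid, because the toponyms have little in common as strings. This is the kind of false negative that motivates looking beyond names and footprints.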
Today, most gazetteers still rely on centroids for representing geographic features (even for feature types such as counties, rivers, or oceans). These centroids differ significantly across datasets, often by more than 100 km. Furthermore, it is difficult to select a place-type-agnostic distance threshold as the initial search radius. Polygon- and polyline-based matching, e.g., using Hausdorff distance, comes with its own limitations, scale and the resulting generalization being key problems. For string matching, such as using Levenshtein distance, the same place may have substantially different toponyms (e.g., Ayn al-Arab in TGN and Kobani in DBpedia), while different places may share common names. In addition, simply relying on direct feature type matching is likely to fail since different gazetteers employ incompatible typing schemata/ontologies. In conjunction, these problems often lead to either false negative or false positive matches.

In previous work, we proposed using spatial signatures, which are derived from spatial statistics, to understand the semantics of place types bottom-up (Zhu et al., 2016). In this work, we apply these signatures to coreference resolution. The spatial statistics used are selected from three perspectives; a detailed list is shown in Table 1:

• Spatial point pattern analysis. Point coordinates are used to quantitatively measure the spatial point patterns of place types (such as populated place). Kernel density estimation, Ripley’s K, and standard deviational ellipse analysis are conducted, and the corresponding statistics are obtained for representing the signatures. Furthermore, we computed these statistics from both local and global aspects.

• Spatial autocorrelation analysis. In order to capture the interaction between places, we converted the point patterns into raster maps where each pixel represents the intensity of points. Spatial correlation statistics, such as Moran’s I and semivariograms, are subsequently used to improve the signatures.

• Spatial interactions with other geographic features. In contrast to the first two perspectives, this group of statistical features is derived by integrating other geographic features. These
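The spatial autocorrelation perspective in the list above (rasterize the points, then compute Moran’s I) can be sketched as follows. This is a simplified sketch under stated assumptions, not the authors' feature implementation: the function names, the grid resolution, the bounding box, and the choice of binary rook-contiguity weights (each cell neighbours the cells directly above, below, left, and right) are all assumptions introduced here.

```python
def rasterize_points(points, n_rows, n_cols, bounds):
    """Count points per cell so each pixel holds the point intensity.

    points: iterable of (x, y) pairs; bounds: (min_x, min_y, max_x, max_y).
    Row 0 corresponds to the smallest y values. Points outside bounds are skipped.
    """
    min_x, min_y, max_x, max_y = bounds
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y in points:
        if not (min_x <= x <= max_x and min_y <= y <= max_y):
            continue
        # Clamp so points lying exactly on the upper edge fall in the last cell.
        col = min(int((x - min_x) / (max_x - min_x) * n_cols), n_cols - 1)
        row = min(int((y - min_y) / (max_y - min_y) * n_rows), n_rows - 1)
        grid[row][col] += 1
    return grid

def morans_i_rook(grid):
    """Global Moran's I of a raster under binary rook-contiguity weights.

    I = (N / W) * sum_ij w_ij * z_i * z_j / sum_i z_i^2, with z the deviations
    from the mean. The weights are symmetric, so counting each neighbouring
    pair once in both the numerator and W gives the same ratio.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    n = n_rows * n_cols
    mean = sum(map(sum, grid)) / n
    z = [[v - mean for v in row] for row in grid]
    squared_sum = sum(v * v for row in z for v in row)
    if squared_sum == 0:
        raise ValueError("Moran's I is undefined for a constant raster")
    cross, n_pairs = 0.0, 0
    for r in range(n_rows):
        for c in range(n_cols):
            if c + 1 < n_cols:  # right-hand neighbour
                cross += z[r][c] * z[r][c + 1]
                n_pairs += 1
            if r + 1 < n_rows:  # neighbour in the next row
                cross += z[r][c] * z[r + 1][c]
                n_pairs += 1
    return (n / n_pairs) * (cross / squared_sum)
```

Under this weighting, values near +1 indicate that cells with similar intensities cluster together, values near -1 indicate a checkerboard-like alternation, and values near 0 suggest no spatial structure. A statistic of this kind, computed per place type, is one plausible ingredient of a signature; the actual features used are the ones listed in Table 1 of the paper.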
