Abstract

This abstract presents an approach to building a geospatial ontology from Wikipedia and using it in BioCaster, a system for detecting and tracking infectious disease outbreaks from online news. Motivated by the need to interpret the geospatial dynamics of events, we built a database containing the names of countries and major cities from Wikipedia. We started by automatically extracting country and dependent territory names and sub-country (subdivision and dependent area) names, as listed in ISO 3166-1 and ISO 3166-2, respectively. Then we re-created the part-whole relation between countries and sub-countries by manually verifying the links from each country to its sub-countries. The building process is therefore semi-automatic: locations are extracted automatically and verified with human aid. In addition, we extracted the longitude and latitude of each location for use in Google Maps and Google Earth applications. Finally, we combined the geospatial hierarchy from Wikipedia with the BioCaster ontology (BCO). The preliminary results show a geospatial ontology with two administrative levels: 243 countries and 4,025 sub-countries. The geospatial ontology was integrated into the extant BCO, a multilingual public health ontology focusing on infectious diseases, and was made available at http://biocaster.nii.ac.jp.

The geospatial ontology was used to develop an algorithm for detecting the locations of outbreaks reported in news stories. First, locations in news stories were automatically tagged with a named entity recognizer based on a support vector machine trained on 1,000 manually annotated texts. Second, we mapped location names from the text to identifiers in the geospatial ontology at the country and sub-country levels. Grounding proceeded as follows. We ranked disease-location pairs by frequency in a set of collected articles that shared similar date stamps. We then chose the top disease-location pairs and re-mapped them into each news story by regular expression matching. To infer country names where this information was missing from the text, we manually constructed a ranked list of sub-country and country pairs based on population size.

Data collected over a 10-week period (Dec 20, 2007 to Feb 20, 2008) showed that the system detected 7,412 English articles covering 110 countries and 360 sub-countries, distributed as follows: 58.00% Africa, 18.23% Asia, 11.37% South America, 5.30% North America, 3.40% Middle East, 2.86% Europe and 0.34% Oceania. Relevant articles came predominantly from a few sources such as Google News, the European Media Monitor and ProMED-mail. Among the disease/country outbreaks successfully detected during this period were Ebola in Uganda (Bundibugyo, Kampala, Mbarara), yellow fever in Brazil (Goias, Sao Paulo), avian influenza in Indonesia (Jakarta, Banten), and cholera in Vietnam (Ha Noi, Ha Tay).

The results were plotted on a publicly available Google Map and indicate that our geospatial ontology met our requirements. In the future, we plan to extend the ontology to deeper levels such as districts and sub-districts (wards, towns, villages). We will also consider evaluating our geospatial ontology and comparing it with other available resources such as GAZ and DBpedia.
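The grounding steps described above (frequency ranking of disease-location pairs, regular expression re-mapping, and country inference from a ranked list) can be illustrated with a minimal Python sketch. This is not the BioCaster implementation: the function names, the article dictionaries, and the small lookup table are all hypothetical, and the ordering of candidate countries in the table is only illustrative of a manually built, population-based ranking.

```python
import re
from collections import Counter

# Hypothetical fragment of a sub-country -> country lookup. Each list is
# assumed to be ordered by a manually assigned, population-based rank.
SUBCOUNTRY_TO_COUNTRIES = {
    "Bundibugyo": ["Uganda"],
    "Kampala": ["Uganda"],
    "Goias": ["Brazil"],
    "Punjab": ["Pakistan", "India"],  # ambiguous name; order is illustrative
}


def rank_pairs(articles):
    """Rank (disease, location) pairs by frequency across a batch of
    articles assumed to share similar date stamps."""
    counts = Counter()
    for article in articles:
        for disease in article["diseases"]:
            for location in article["locations"]:
                counts[(disease, location)] += 1
    return counts.most_common()


def remap(text, top_pairs):
    """Keep the top-ranked pairs whose disease and location both occur
    in the story text, using whole-word regular expression matching."""
    hits = []
    for (disease, location), _count in top_pairs:
        disease_re = r"\b" + re.escape(disease) + r"\b"
        location_re = r"\b" + re.escape(location) + r"\b"
        if re.search(disease_re, text, re.IGNORECASE) and re.search(
            location_re, text, re.IGNORECASE
        ):
            hits.append((disease, location))
    return hits


def infer_country(location):
    """Return the highest-ranked country for a sub-country name,
    or None when the name is not in the lookup table."""
    candidates = SUBCOUNTRY_TO_COUNTRIES.get(location)
    return candidates[0] if candidates else None


# Toy input standing in for tagged news articles (hypothetical data).
articles = [
    {"diseases": ["ebola"], "locations": ["Bundibugyo"]},
    {"diseases": ["ebola"], "locations": ["Bundibugyo", "Kampala"]},
    {"diseases": ["yellow fever"], "locations": ["Goias"]},
]
ranked = rank_pairs(articles)
story = "Health officials confirmed new Ebola cases in Bundibugyo district."
for disease, location in remap(story, ranked[:2]):
    print(disease, location, infer_country(location))
```

Restricting the re-mapping to the top-ranked pairs means a story is only grounded to locations that other articles from the same period also associate with the disease, which is one plausible reading of the ranking step in the abstract.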