The Semantic Web realization depends on the availability of a critical mass of metadata for the web content, associated with the respective formal knowledge about the world. We claim that the Semantic Web, at its current stage of development, is in a state of a critically need of metadata generation and usage schemata that are specific, well-defined and easy to understand. This paper introduces our vision for a holistic architecture for semantic annotation, indexing, and retrieval of documents with regard to extensive semantic repositories. A system (called KIM), implementing this concept, is presented in brief and it is used for the purposes of evaluation and demonstration. A particular schema for semantic annotation with respect to real-world entities is proposed. The underlying philosophy is that a practical semantic annotation is impossible without some particular knowledge modelling commitments. Our understanding is that a system for such semantic annotation should be based upon a simple model of real-world entity classes, complemented with extensive instance knowledge. To ensure the efficiency, ease of sharing, and reusability of the metadata, we introduce an upper-level ontology (of about 250 classes and 100 properties), which starts with some basic philosophical distinctions and then goes down to the most common entity types (people, companies, cities, etc.). Thus it encodes many of the domain-independent commonsense concepts and allows straightforward domain specific extensions. On the basis of the ontology, a large-scale knowledge base of entity descriptions is bootstrapped, and further extended and maintained. Currently, the knowledge bases usually scales between 105 and 106 descriptions. Finally, this paper presents a semantically enhanced information extraction system, which provides automatic semantic annotation with references to classes in the ontology and to instances. The system has been running over a continuously growing document collection (currently about 0.5 million news articles), so it has been under constant testing and evaluation for some time now. On the basis of these semantic annotations, we perform semantic based indexing and retrieval where users can mix traditional IR (information retrieval) queries and ontology-based ones. We argue that such large-scale, fully automatic methods are essential for the transformation of the current largely textual web into a semantic web.
Read full abstract