Generating metadata from web documents: a systematic approach

Hsiang-Yuan Hsueh, Kun-Fu Huang, Chun-Nan Chen

doi:10.1186/2192-1962-3-7

Abstract

This paper proposes a mechanism that generates an RDF Semantic Web schema from a set of Web documents to serve as their semantic metadata. By analyzing both the structured and unstructured content of Web documents, semi-structured Web documents can be conceptualized as resource objects with inter-relationships in an RDF diagram. Technically, the hyperlinks, basic annotations, and keywords in Web documents are analyzed, and the corresponding RDF schema is generated following the mechanism and rules proposed in this paper. Once the semantic metadata of document sets on the Web is systematically translated rather than manually edited, semantic operations on the Web, such as semantic query or semantic search, are expected to become possible.

Highlights

With the popularity of the Internet and the World Wide Web (WWW, Web), the number of documents on the Web is growing dramatically.
Solution to generate metadata from Web documents: this paper proposes a mechanism for constructing the Semantic Web with a bi-directional approach. Content providers and developers need to generate schema-model-like metadata as the semantic information of the Web resources and documents they maintain; data service providers such as search engine vendors can then acquire and maintain the semantic information of the whole Web, which makes semantic search possible, including attribute-oriented or arithmetic-based query operations.
In this paper, we propose a six-step systematic mechanism that generates a Resource Description Framework (RDF) Semantic Web schema from a set of Web documents as the corresponding schema-model-like semantic metadata.
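
To make the idea of translating a Web document into RDF-style metadata more concrete, the following is a minimal sketch in Python. It is not the six-step mechanism of the paper, whose rules are not reproduced here; it only illustrates, under stated assumptions, how hyperlinks, basic annotations (the title and meta tags), and keywords might be turned into subject-predicate-object triples. The names PageFacts and page_to_triples, the example.org namespace, the linksTo property, and the sample page are all hypothetical; the Dublin Core element names are used merely as one plausible vocabulary.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical namespace for this illustration only.
EX = "http://example.org/schema#"
# Dublin Core element set, assumed here as one plausible vocabulary.
DC = "http://purl.org/dc/elements/1.1/"


class PageFacts(HTMLParser):
    """Collect the title, meta annotations, and hyperlinks of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.meta = {}    # e.g. {"keywords": "rdf, metadata"}
        self.links = []   # absolute URLs of outgoing hyperlinks
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_to_triples(url, html):
    """Return (subject, predicate, object) triples describing one page."""
    facts = PageFacts(url)
    facts.feed(html)
    triples = []
    if facts.title.strip():
        triples.append((url, DC + "title", facts.title.strip()))
    if "description" in facts.meta:
        triples.append((url, DC + "description", facts.meta["description"]))
    # Each comma-separated keyword becomes its own triple.
    for keyword in facts.meta.get("keywords", "").split(","):
        if keyword.strip():
            triples.append((url, DC + "subject", keyword.strip()))
    # Each hyperlink becomes a relationship between two resources.
    for target in facts.links:
        triples.append((url, EX + "linksTo", target))
    return triples


sample = """<html><head><title>RDF primer</title>
<meta name="keywords" content="rdf, metadata">
</head><body><a href="/schema.html">schema</a></body></html>"""

for triple in page_to_triples("http://example.org/index.html", sample):
    print(triple)
```

Once pages are described this way, an attribute-oriented query reduces to filtering triples by predicate and object, for example selecting every subject whose subject keyword is "rdf". That is the kind of operation plain keyword matching over semi-structured text cannot express directly.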

Summary

Introduction

With the popularity of the Internet and the World Wide Web (WWW, Web), the number of documents on the Web is growing dramatically. Content on the Web has become the dominant resource that users turn to for problem solving, yet using and querying this information resource remains a challenge. Owing to the semi-structured nature of documents on the Web, people cannot get the content or documents they really need from search and query processes on the Web. Typically, semi-structured documents can only be “navigated” by the user. It is almost impossible for a Web document to be semantically understood by a machine without preprocessing.

Objectives

Discussion

Conclusion