Abstract

Content in numerous Web data sources, designed primarily for human consumption, are not directly amenable to machine processing. Automated semantic analysis of such content facilitates their transformation into machine-processable and richly structured semantically annotated data. This paper describes a learning-based technique for semantic analysis of schematic data which are characterized by being template-generated from backend databases. Starting with a seed set of hand-labeled instances of semantic concepts in a set of Web pages, the technique learns statistical models of these concepts using light-weight content features. These models direct the annotation of diverse Web pages possessing similar content semantics. The principles behind the technique find application in information retrieval and extraction problems. Focused Web browsing activities require only selective fragments of particular Web pages but are often performed using bookmarks which fetch the contents of the entire page. This results in information overload for users of constrained interaction modality devices such as small-screen handheld devices. Fine-grained information extraction from Web pages, which are typically performed using page specific and syntactic expressions known as wrappers, suffer from lack of scalability and robustness. We report on the application of our technique in developing semantic bookmarks for retrieving targeted browsing content and semantic wrappers for robust and scalable information extraction from Web pages sharing a semantic domain.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.