Abstract
Previous research on automatic information extraction has had difficulty acquiring and representing useful domain knowledge and coping with the structural heterogeneity among different information sources. As a result, many real-world information sources with complex document structures could not be analyzed correctly. To resolve these problems, this paper presents a method of building intelligent systems that use domain knowledge to mine information extraction rules from semi-structured Web pages. The system automatically generates a wrapper for each information source and performs information extraction and information integration by applying that wrapper to the corresponding source. Both the domain knowledge and the wrapper are represented as XML documents to increase flexibility and interoperability. Tests of our prototype system on several real-estate information sites indicate that it creates correct wrappers for most Web sources and consequently facilitates effective information extraction from heterogeneous information sources.