Abstract

Existing wrapper learning methods require various forms of assumptions or information about the document structure, and many of them can only handle documents with simple structures. To handle a richer set of semi-structured documents and minimize the burden on the user, we develop a new method, known as HISER (HIerarchical record Structure and Extraction Rule learning). Our HISER approach employs a two-stage learning task, namely, hierarchical record structure learning and extraction rule learning. In hierarchical record structure learning, we automatically generate a representation of the hierarchical structure of the records in an information source. In extraction rule learning, extraction rules are induced for each node in the hierarchical record structure. This design can handle missing items, multi-valued items, and items in unrestricted order. We also incorporate both syntactic and semantic generalization in the learning process to enrich the expressiveness of the extraction rules.
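To make the idea of a hierarchical record structure with one extraction rule per node more concrete, the following is a minimal illustrative sketch, not HISER itself. It assumes a hand-written structure (HISER learns it automatically) and uses plain regular expressions as a stand-in for the induced extraction rules; the names `Node`, `extract`, and the sample `book` record are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a hypothetical hierarchical record structure."""
    name: str
    # Stand-in extraction rule: a regex with exactly one capture group.
    pattern: Optional[str] = None
    multi_valued: bool = False
    children: List["Node"] = field(default_factory=list)


def extract(node: Node, text: str):
    """Apply each leaf's rule to the record text, independent of item order."""
    if node.children:
        return {child.name: extract(child, text) for child in node.children}
    values = re.findall(node.pattern, text)
    if node.multi_valued:
        return values                        # zero or more values
    return values[0] if values else None     # a missing item becomes None


# Hand-written structure for illustration only.
book = Node("book", children=[
    Node("title", r"Title:\s*([^;]+)"),
    Node("author", r"Author:\s*([^;]+)", multi_valued=True),
    Node("price", r"Price:\s*\$([\d.]+)"),
])

# Items appear out of order, the author is multi-valued, the price is missing.
record = "Author: A. Smith; Title: Wrappers; Author: B. Jones"
result = extract(book, record)
```

In this sketch, `result["author"]` holds both author strings, `result["title"]` is found even though it does not come first, and `result["price"]` is `None` because the record has no price.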
