Abstract
The Hidden Markov Model (HMM) is an important approach to information extraction (IE). When applied to Web IE, however, HMM-based approaches suffer from several problems because they do not take Web-specific features into account. In this paper we present a Generalized Hidden Markov Model (GHMM) that extends the traditional HMM by exploiting Web-specific information for Web IE. First, we use the Web content block, rather than the term, as the basic extraction unit. Second, instead of following the traditional sequential state transition order, we determine the state transition order of the GHMM from the layout structure of the corresponding Web page. Third, we use multiple emission features instead of a single emission feature. In this way the GHMM better accommodates Web IE. Experiments show promising results compared with traditional HMM-based Web IE.
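To make the idea of decoding over content blocks with several emission features concrete, the following is a minimal sketch and not the model described in the paper. It assumes the blocks have already been ordered by some layout analysis, and that the emission features are conditionally independent given the state, so their probabilities multiply. The function name `viterbi_blocks`, the state labels, the feature names, and all probabilities are hypothetical and chosen only for illustration.

```python
import math


def viterbi_blocks(blocks, states, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most likely state sequence for a list of content blocks.

    Each block is a dict mapping a feature name to its observed value.
    The list order is assumed to be the layout-derived transition order.
    """
    def log(p):
        # Floor probabilities so unseen events do not produce log(0).
        return math.log(max(p, floor))

    def log_emit(state, block):
        # Assumption: features are independent given the state, so the
        # per-feature log-probabilities are summed.
        return sum(log(emit_p[state][f].get(v, 0.0)) for f, v in block.items())

    # best[t][s] = (log-prob of the best path ending in s at block t, previous state)
    best = [{s: (log(start_p[s]) + log_emit(s, blocks[0]), None) for s in states}]
    for block in blocks[1:]:
        prev = best[-1]
        col = {}
        for s in states:
            score, arg = max((prev[r][0] + log(trans_p[r][s]), r) for r in states)
            col[s] = (score + log_emit(s, block), arg)
        best.append(col)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for col in reversed(best[1:]):
        state = col[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical toy model: two states, two emission features per block.
states = ["title", "body"]
start_p = {"title": 0.8, "body": 0.2}
trans_p = {
    "title": {"title": 0.1, "body": 0.9},
    "body": {"title": 0.1, "body": 0.9},
}
emit_p = {
    "title": {"font": {"large": 0.9, "small": 0.1},
              "position": {"top": 0.8, "lower": 0.2}},
    "body": {"font": {"large": 0.2, "small": 0.8},
             "position": {"top": 0.3, "lower": 0.7}},
}
blocks = [
    {"font": "large", "position": "top"},
    {"font": "small", "position": "lower"},
    {"font": "small", "position": "lower"},
]

path = viterbi_blocks(blocks, states, start_p, trans_p, emit_p)
print(path)
```

With these toy numbers the decoder labels the first block as a title and the remaining two as body text. A single-feature HMM is the special case in which every block dict holds exactly one entry.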