Abstract
Extracting useful information from heterogeneous Web data is of great importance. In this paper, we propose a novel heterogeneous Web data extraction algorithm based on a modified hidden conditional random fields model. Because traditional linear-chain conditional random fields cannot effectively handle complex, heterogeneous Web data extraction, we modify the standard hidden conditional random fields in three respects: (1) a hidden Markov model is used to calculate the hidden variables; (2) the model is trained in two stages, where in the first stage each training data sequence is learned using the hidden Markov model so that the hidden variables become visible, and in the second stage the parameters are learned for a given sequence; and (3) the objective functions of the hidden conditional random fields are revised, and the heterogeneous Web data are extracted by maximizing the posterior probability of the modified hidden conditional random fields. Finally, we evaluate performance experimentally on two standard datasets, the EData dataset and the Research Papers dataset. Compared with existing Web data extraction methods, the results show that the proposed algorithm extracts useful information from heterogeneous Web data effectively and efficiently.