Abstract

Web tables are widespread on the Web, appearing in online shopping sites, supply-and-demand information pages, and search results. Extracting structured table data from Web tables is therefore a necessary and significant problem, because semi-structured Web tables are inconvenient to use directly in Web application systems such as user recommendation and supply-and-demand analysis systems. Web pages can be parsed into tree structures, and Web table information in the parse tree exhibits a conspicuous hierarchical structure. Moreover, the sub-trees corresponding to homologous Web table data regions have similar structures. Motivated by these observations, this paper proposes a data region extraction method based on the top-down tree edit distance, called EtractDRs. It uses the tree edit distance to measure the similarity of tree structures, merges structures whose edit distances are below a pre-specified threshold to form candidate table data regions, and applies heuristic rules to obtain the final data regions. Experiments on table data from 25 Web sites show that, compared with the state-of-the-art MDR algorithm, which uses the string edit distance, our algorithm improves recall and the F value by up to 39.4% and 26.15% respectively, while still achieving better accuracy.
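The core idea of comparing sibling sub-trees by a top-down tree edit distance and merging those under a threshold can be illustrated with a minimal Python sketch. It is not the paper's implementation. The `Node` class, the unit edit costs, the normalization by the larger sub-tree size, the threshold value of 0.3, and the choice to compare only adjacent siblings are all assumptions made for illustration. The heuristic rules that select the final data regions are not described in the abstract and are not sketched here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical parse-tree node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the sub-tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def top_down_distance(a: Node, b: Node) -> int:
    """Top-down tree edit distance with unit costs (an assumed cost model).

    Roots are always matched (cost 1 if their tags differ). Children are
    aligned left to right; a child can only be inserted or deleted together
    with its whole sub-tree, so that operation costs the sub-tree size.
    """
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),      # delete sub-tree
                d[i][j - 1] + size(b.children[j - 1]),      # insert sub-tree
                d[i - 1][j - 1]
                + top_down_distance(a.children[i - 1], b.children[j - 1]),
            )
    return (0 if a.tag == b.tag else 1) + d[m][n]


def normalized_distance(a: Node, b: Node) -> float:
    """Distance scaled to [0, 1] by the larger sub-tree (an assumption)."""
    return top_down_distance(a, b) / max(size(a), size(b))


def candidate_regions(parent: Node, threshold: float = 0.3) -> List[List[Node]]:
    """Group runs of adjacent children whose distance is below the threshold."""
    regions: List[List[Node]] = []
    run: List[Node] = []
    kids = parent.children
    for prev, cur in zip(kids, kids[1:]):
        if normalized_distance(prev, cur) < threshold:
            if not run:
                run = [prev]
            run.append(cur)
        elif run:
            regions.append(run)
            run = []
    if run:
        regions.append(run)
    return regions


def make_row() -> Node:
    """A toy table row with two cells, used only for the example below."""
    return Node("tr", [Node("td"), Node("td")])


# Toy page: a heading, three structurally identical rows, and a trailing div.
page = Node("body", [Node("h1"), make_row(), make_row(), make_row(),
                     Node("div", [Node("p")])])
regions = candidate_regions(page)
```

In this toy page the three identical rows have pairwise distance 0 and form a single candidate region, while the heading and the trailing `div` differ too much from their neighbours to be merged.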
