A web table extraction algorithm based on tree edit distance

Ying Liu, Xue-Gang Hu, Gong-Qing Wu

doi:10.1109/anthology.2013.6784738

Abstract

Web tables are widespread, appearing in online shopping pages, supply-and-demand information pages, and search results. Extracting structured data from them is therefore a necessary and significant task. However, semi-structured Web tables are inconvenient to use directly in Web application systems such as user recommendation and supply-and-demand analysis systems. Web pages can be parsed into tree structures, and Web table information in the parse tree exhibits a conspicuous hierarchical structure. Moreover, within homologous Web table data regions, the corresponding sub-trees have similar structures. Motivated by this, this paper proposes a data region extraction method based on the top-down tree edit distance, called EtractDRs. It uses the tree edit distance to measure the similarity of tree structures, merges structures whose edit distances are lower than a pre-specified threshold into candidate table data regions, and applies heuristic rules to obtain the final data regions. Experiments on table data from 25 Web sites show that, compared with the state-of-the-art MDR algorithm, which uses the string edit distance, our algorithm improves recall and the F value by up to 39.4% and 26.15% respectively, while still maintaining better accuracy.
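
The core idea described above, measuring sub-tree similarity with a top-down tree edit distance and merging similar neighbours into candidate data regions, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes one common formulation of the top-down distance, in which the roots are compared first and insertions and deletions apply only to whole sub-trees, with unit cost per node. The `Node` class, the normalization by combined tree size, the restriction to adjacent siblings, and the threshold value of 0.3 are all hypothetical choices made for illustration; the heuristic rules that select the final data regions are not specified in the abstract and are omitted.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical parse-tree node: an HTML tag plus ordered children."""
    tag: str
    children: list[Node] = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes in the sub-tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def top_down_distance(a: Node, b: Node) -> int:
    """Top-down tree edit distance (assumed unit cost per node).

    Roots are compared first; the child sequences are then aligned like
    strings, where deleting or inserting a child costs its whole sub-tree
    and matching two children costs their recursive distance.
    """
    root_cost = 0 if a.tag == b.tag else 1
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),   # delete sub-tree
                d[i][j - 1] + size(b.children[j - 1]),   # insert sub-tree
                d[i - 1][j - 1]
                + top_down_distance(a.children[i - 1], b.children[j - 1]),
            )
    return root_cost + d[m][n]


def normalized_distance(a: Node, b: Node) -> float:
    """Distance scaled by combined size (an assumed normalization)."""
    return top_down_distance(a, b) / (size(a) + size(b))


def candidate_regions(parent: Node, threshold: float = 0.3) -> list[list[Node]]:
    """Merge runs of adjacent siblings whose distance is below the threshold."""
    regions: list[list[Node]] = []
    current: list[Node] = []
    for child in parent.children:
        if current and normalized_distance(current[-1], child) < threshold:
            current.append(child)
        else:
            if len(current) > 1:
                regions.append(current)
            current = [child]
    if len(current) > 1:
        regions.append(current)
    return regions


def row(cells: int) -> Node:
    """Build a toy table row with the given number of cells."""
    return Node("tr", [Node("td") for _ in range(cells)])


# Toy page: a heading, three structurally similar rows, and a paragraph.
page = Node("body", [Node("h1"), row(2), row(2), row(3), Node("p")])
regions = candidate_regions(page)
```

On this toy page the three rows end up in a single candidate region, while the heading and the paragraph are left out because their distance to the neighbouring rows exceeds the threshold. Comparing whole sub-trees rather than flattened tag strings is what lets a row with one extra cell still count as similar to its neighbours.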

Full Text