INFORMATION EXTRACTION VERSUS TEXT SEGMENTATION FOR WEB CONTENT MINING

Pavlina Fragkou

doi:10.1142/s0218194013500332

Abstract

The information explosion on the Web aggravates the problem of effective information retrieval. Although various approaches in the literature aim to enhance retrieval, they prove insufficient because they poorly exploit the actual content of a page with regard to specific semantic content. This paper extends an existing method for automatic semantic segmentation. That method first partitions a web page into blocks based on its visual layout and a set of heuristics. A subsequent step partitions the page based on the appearance of specific types of named entities, with the help of a machine learning algorithm. Our work extends the initial method in several directions. First, it examines alternative named entities as features in the learning step. Second, it extends the initial corpus. Third, it evaluates and compares the initial method using metrics from text segmentation. Fourth, it incorporates the result of text segmentation as a feature in the learning process. Finally, it applies two text segmentation algorithms to evaluate the effectiveness of manual annotation. The reported results show that the synergy of semantic-based and text segmentation algorithms strongly depends on the predefined semantic model used for text segmentation.

Full Text