HTML text segmentation for Web page summarization by a key sentence extraction method

Wataru Sunayama, Akihiro Iyama, Masahiko Yachida

doi:10.1002/scj.20416

Abstract

The information that search engines display in their result lists is important for quickly finding the desired information. In particular, the summary of each Web page in the search results helps the user judge the content of the page and see how the input search term is used in it, namely, the relation between the search term and the Web page. However, the summaries produced by conventional search engines have problems: some extract only the opening text and do not contain the search term, while others contain the search term but truncate the sentence in the middle, so that neither the context of the term nor the content of the Web page can be determined. A summary in sentence units is therefore desirable. However, HTML text includes many nonsentence items that contain no punctuation, and if these are left unprocessed, it is difficult for a key sentence extraction system that treats sentences as units to produce a summary. In this paper, we propose an HTML text segmentation system that divides the source text of each Web page into meaningfully connected groups of text corresponding to sentences. We also verify experimentally that the text generated by this system can be used effectively in Web page summarization. © 2006 Wiley Periodicals, Inc. Syst Comp Jpn, 37(7): 26–36, 2006; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/scj.20416
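To make the segmentation idea concrete, the following is a minimal sketch and not the system described in the paper, whose rules are not given in this abstract. It assumes a simple heuristic: text is cut at a hypothetical set of block-level tags (`BLOCK_TAGS`), so that unpunctuated items such as menu entries become their own units, and the text inside each block is then split at sentence-ending punctuation. The names `SegmentingParser` and `segment_html` are illustrative only.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags mark boundaries between groups of text.
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "td", "th", "tr", "table",
              "br", "title", "h1", "h2", "h3", "h4", "h5", "h6"}
# Content of these tags is never visible text, so it is skipped.
SKIP_TAGS = {"script", "style"}
# Split after Latin sentence enders followed by whitespace,
# or directly after the Japanese full stop.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+|(?<=。)")


class SegmentingParser(HTMLParser):
    """Collects sentence-like text units from HTML source."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        # Normalize whitespace, then split the block into sentences.
        text = " ".join("".join(self._buffer).split())
        self._buffer = []
        if text:
            self.segments.extend(s for s in SENTENCE_END.split(text) if s)

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def segment_html(html):
    """Return the list of text units found in an HTML string."""
    parser = SegmentingParser()
    parser.feed(html)
    parser.close()
    return parser.segments
```

Inline tags such as `<b>` do not break a unit in this sketch, so a sentence interrupted by emphasis markup stays whole, while each list item is emitted separately even though it carries no punctuation. A sentence-based key sentence extractor could then score these units directly.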
