Web Content Extraction by Integrating Textual and Visual Importance of Web Pages

J Anitha, K Nethra

doi:10.5120/15861-4785

Open Access

https://doi.org/10.5120/15861-4785

Abstract

Web pages contain a large amount of information that is useful in real-world applications. Additional content such as links, footers, headers, and advertisements complicates content extraction. Such irrelevant content is treated as noise. A method is therefore needed to extract the informative content and discard the noisy content from Web pages. This work integrates textual and visual importance to extract the informative content. First, a Web page is converted into a DOM (Document Object Model) tree. For each node in the DOM tree, textual importance and visual importance are calculated. The two are combined to form a hybrid density. A density sum is then calculated and used in the content extraction algorithm to extract the informative content from Web pages. The performance of Web content extraction is evaluated using precision, recall, F-measure, and accuracy.

Keywords: Content Extraction, Web Content Mining, DOM tree, Vision-based Page Segmentation.
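The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything specific here is an assumption made for illustration: textual importance is taken as characters per tag in a subtree, visual importance is a hypothetical per-node weight in [0, 1] (for example, a share of rendered area), hybrid density is their product, and the density sum of a node is the sum of its children's hybrid densities. The names `Node`, `hybrid_density`, `density_sum`, and `extract_main` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Set, Tuple


@dataclass
class Node:
    """A simplified DOM node (hypothetical structure for this sketch)."""
    tag: str
    text: str = ""        # text owned directly by this node
    visual: float = 1.0   # ASSUMED visual importance weight in [0, 1]
    children: List["Node"] = field(default_factory=list)


def char_count(node: Node) -> int:
    # Total characters in the subtree rooted at node.
    return len(node.text) + sum(char_count(c) for c in node.children)


def tag_count(node: Node) -> int:
    # Total tags in the subtree, counting the node itself.
    return 1 + sum(tag_count(c) for c in node.children)


def text_density(node: Node) -> float:
    # ASSUMED textual importance: characters per tag.
    return char_count(node) / tag_count(node)


def hybrid_density(node: Node) -> float:
    # ASSUMED combination rule: textual importance times visual weight.
    return text_density(node) * node.visual


def density_sum(node: Node) -> float:
    # Sum of the children's hybrid densities; a leaf uses its own density.
    if not node.children:
        return hybrid_density(node)
    return sum(hybrid_density(c) for c in node.children)


def all_nodes(node: Node) -> Iterator[Node]:
    # Pre-order traversal of the tree.
    yield node
    for child in node.children:
        yield from all_nodes(child)


def extract_main(root: Node) -> Node:
    # Pick the node with the largest density sum as the informative block.
    return max(all_nodes(root), key=density_sum)


def precision_recall_f(extracted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    # Set-based evaluation of extracted items against a hand-labelled gold set.
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy page: a navigation bar, an article with two paragraphs, and a footer.
page = Node("body", children=[
    Node("nav", visual=0.2, children=[
        Node("a", "Home", visual=0.2),
        Node("a", "About", visual=0.2),
    ]),
    Node("article", visual=0.9, children=[
        Node("p", "x" * 200, visual=0.9),
        Node("p", "y" * 150, visual=0.9),
    ]),
    Node("footer", "Copyright", visual=0.1),
])

main_block = extract_main(page)
print(main_block.tag)
```

On this toy page the text-heavy, visually prominent `article` node receives the largest density sum, while the link-only navigation bar and the footer score low. A real system would build the tree from parsed HTML and derive the visual weights from rendering or page segmentation; both are outside the scope of this sketch.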

Full Text