Abstract
Extracting the main content of a web page is a major issue in text mining. It provides less noisy input for subsequent fine-grained natural language processing methods. We present an unsupervised learning method to extract the main textual content of a web page. It relies on three stages: a clustering step applied to the text blocks within a single web page, a selection phase that identifies the clusters associated with the main content, and a classification phase carried out on the data labeled by the two previous steps. The overall method makes it possible to extract the main content at scale, since it is fully unsupervised and its complexity at prediction time is low. Experiments are conducted to validate the generalization of the classifier and the quality of the obtained results.
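The three stages can be illustrated with a minimal sketch. It is not the method evaluated in the paper: the block features (log word count and link density), the two-cluster k-means, the "longest blocks win" selection rule, and the nearest-centroid classifier are all assumptions chosen to keep the example short, and every function name is hypothetical.

```python
import math
from statistics import mean

def features(block):
    # Assumed features: log word count and fraction of words inside links.
    n_words = len(block["text"].split())
    link_density = block["link_words"] / n_words if n_words else 0.0
    return (math.log1p(n_words), link_density)

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return tuple(mean(dim) for dim in zip(*points))

def two_means(points, iters=10):
    # Stage 1 (assumed form): cluster the blocks of ONE page into two groups.
    centers = [min(points), max(points)]  # deterministic initialisation
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                  for p in points]
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = centroid(members)
    return labels, centers

def pseudo_label(page_blocks):
    # Stage 2 (assumed rule): the cluster with the larger mean log word
    # count is treated as main content.
    points = [features(b) for b in page_blocks]
    labels, centers = two_means(points)
    main = 0 if centers[0][0] >= centers[1][0] else 1
    return [(p, lab == main) for p, lab in zip(points, labels)]

def train(pages):
    # Stage 3 (assumed classifier): nearest centroid, fitted on pseudo-labels
    # pooled over all pages. Assumes both classes occur in the pooled data.
    labeled = [pair for page in pages for pair in pseudo_label(page)]
    pos = centroid([p for p, is_main in labeled if is_main])
    neg = centroid([p for p, is_main in labeled if not is_main])
    return pos, neg

def predict(model, block):
    # Prediction costs two distance computations per block.
    pos, neg = model
    x = features(block)
    return sq_dist(x, pos) < sq_dist(x, neg)

# Toy pages: each block has its text and the number of words inside links.
pages = [
    [{"text": "Home About Contact", "link_words": 3},
     {"text": "word " * 80, "link_words": 2},
     {"text": "word " * 60, "link_words": 0},
     {"text": "Privacy Terms", "link_words": 2}],
    [{"text": "Login Register Help", "link_words": 3},
     {"text": "word " * 120, "link_words": 1}],
]
model = train(pages)
```

Once `model` is fitted, `predict(model, block)` labels a block from an unseen page without rerunning the clustering, which is what keeps prediction cheap in this sketch.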