Main Content Extraction from Web Pages

Stanislas Morbieu, Guillaume Bruneval, Francois-Xavier Bois, Mohamed Lacarne, Mohamed Kone

doi:10.1109/icmla51294.2020.00162

Publication Date: Dec 1, 2020

Citations: 1

Affiliation: Kerneos (France)

Keywords: Main Content, Natural Language Processing Methods, Single Web Page, Web Page, Content Of Page, Previous Steps, Text Mining, Content Extraction, Single Page, Unsupervised Method

Abstract

Extracting the main content of a web page is a major issue in text mining: it provides less noisy input for subsequent fine-grained natural language processing methods. We present an unsupervised learning method to extract the main textual content of a web page. It relies on three stages: a clustering step applied to the text blocks within a single web page, a selection phase that identifies the clusters associated with the main content, and a classification phase carried out on the data labeled by the two previous steps. The overall method makes it possible to extract the main content at scale, since it is fully unsupervised and its complexity at prediction time is low. Experiments are conducted to validate the generalization of the classifier and the quality of the results obtained.
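The three stages described in the abstract can be illustrated with a minimal Python sketch. The abstract does not name the features, the clustering algorithm, the selection rule, or the classifier, so everything below is an assumption made for illustration only: the `Block` type, the two features (word count and link density), the 2-means clustering, the "long text, few links" selection heuristic, and the nearest-centroid classifier are hypothetical stand-ins, not the authors' choices.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Block:
    """A text block from a page (hypothetical representation)."""
    text: str
    link_chars: int = 0  # number of characters inside links (assumed feature)


def features(block):
    """Map a block to (scaled word count, link density). Both are assumptions."""
    n = len(block.text)
    words = len(block.text.split())
    return (min(words, 100) / 100.0, block.link_chars / n if n else 0.0)


def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    return tuple(mean(col) for col in zip(*points))


def two_means(points, iters=10):
    """Stage 1: cluster the blocks of ONE page (tiny deterministic 2-means)."""
    centers = [min(points), max(points)]  # seeds: extremes of the first feature
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min((0, 1), key=lambda k: dist2(p, centers[k])) for p in points]
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = centroid(members)
    return labels, centers


def select_main_cluster(centers):
    """Stage 2: pick the cluster with long text and few links (assumed rule)."""
    return max((0, 1), key=lambda k: centers[k][0] - centers[k][1])


def pseudo_label(page):
    """Run stages 1 and 2 on a page; return (features, is_main) pairs."""
    points = [features(b) for b in page]
    labels, centers = two_means(points)
    main = select_main_cluster(centers)
    return [(p, lab == main) for p, lab in zip(points, labels)]


def train(pages):
    """Stage 3: fit a nearest-centroid classifier on the pseudo-labels.

    Assumes both classes occur at least once across the training pages.
    """
    labeled = [pair for page in pages for pair in pseudo_label(page)]
    pos = [p for p, y in labeled if y]
    neg = [p for p, y in labeled if not y]
    return centroid(pos), centroid(neg)


def predict(model, block):
    """Cheap prediction: two distance computations per block, no clustering."""
    pos, neg = model
    p = features(block)
    return dist2(p, pos) <= dist2(p, neg)


page = [
    Block("Home", link_chars=4),
    Block("About us", link_chars=8),
    Block("The extraction of the main content of a web page is a major issue in text mining. " * 2),
    Block("It provides less noisy input content prior to fine-grained natural language processing."),
    Block("Contact", link_chars=7),
]
model = train([page])
print(predict(model, Block("Subscribe to our newsletter", link_chars=27)))  # False
print(predict(model, Block("We present an unsupervised learning method to extract the main textual content of a web page.")))  # True
```

The split mirrors the abstract's argument about scale: clustering and cluster selection run only while building the pseudo-labeled training set, whereas prediction on a new page needs just the trained classifier, which keeps the per-block cost low.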
