Web Data Extraction with Hierarchical Clustering and Rich Features

Yong Quan Dong, Xiang Jun Zhao, Gong Jie Zhang

doi:10.4028/www.scientific.net/amm.55-57.1003

Journal: Applied Mechanics and Materials

Publication Date: May 3, 2011

Affiliation: Jiangsu Normal University

Keywords: Detail Pages, Web Data Extraction, Hierarchical Clustering Techniques, Rich Features, Content Feature, Hierarchical Clustering, Multiple Pages, Hierarchical Features, Clustering Features, Structure Features

Abstract

A novel approach is proposed to automatically extract data records from detail pages using hierarchical clustering techniques. The approach uses information from the listing pages to identify the content blocks in detail pages, which narrows the scope of Web data extraction. It also makes full use of structure and content features to cluster content feature vectors. Finally, it aligns the data elements of multiple detail pages to extract the data records. Experimental results on test beds of real web pages show that the approach achieves high extraction accuracy and substantially outperforms existing techniques.
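
The abstract does not specify the feature set, the similarity measure, or the linkage rule, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each page element is a hypothetical `(tag_path, text)` pair, scores structure by shared tag-path prefix and content by coarse character classes, and merges elements with average-linkage agglomerative clustering. The names `structure_sim`, `content_sim`, `cluster`, the weight `w`, and the `threshold` value are all illustrative assumptions.

```python
from itertools import combinations

def structure_sim(path_a, path_b):
    """Fraction of leading tag-path steps shared by two elements (assumed feature)."""
    a, b = path_a.split("/"), path_b.split("/")
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / max(len(a), len(b))

def content_features(text):
    """Coarse content features: which character classes occur (assumed feature)."""
    feats = set()
    if any(c.isdigit() for c in text):
        feats.add("digit")
    if any(c.isalpha() for c in text):
        feats.add("alpha")
    if "$" in text:
        feats.add("currency")
    return feats

def content_sim(text_a, text_b):
    """Jaccard similarity of the two content-feature sets."""
    fa, fb = content_features(text_a), content_features(text_b)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)

def similarity(e1, e2, w=0.5):
    """Weighted mix of structure and content similarity; w is a hypothetical weight."""
    return w * structure_sim(e1[0], e2[0]) + (1 - w) * content_sim(e1[1], e2[1])

def cluster(elements, threshold=0.8):
    """Average-linkage agglomerative clustering; stops when no pair reaches threshold."""
    clusters = [[i] for i in range(len(elements))]

    def link(c1, c2):
        total = sum(similarity(elements[i], elements[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (a, b), best = max(
            (((i, j), link(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Made-up elements from two detail pages: a title and a price on each.
elements = [
    ("html/body/div/h1", "Canon EOS 500D"),
    ("html/body/div/span", "$499.00"),
    ("html/body/div/h1", "Nikon D90"),
    ("html/body/div/span", "$649.50"),
]
print(cluster(elements))  # prints [[0, 2], [1, 3]]
```

Each resulting cluster groups elements that play the same role on different pages (here, titles with titles and prices with prices), which is one simple way to read the alignment step the abstract mentions.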
