Abstract

We present a novel profile-based focused crawling system for the increasingly popular social media-sharing websites. In this system, user profiles serve as ranking criteria that guide the crawling process. We divide a user's profile into two parts: an internal part, which comes from the user's own contributions, and an external part, which comes from the user's social contacts. To expand the crawling topic, we adopt a cotagging topic-discovery scheme for social media-sharing websites. To extract data efficiently and effectively for focused crawling, we first develop a path string-based page classification method that identifies list pages, detail pages, and profile pages. Identifying the correct page type is essential: it lets us extract the correct information from each type of page and subsequently estimate a reasonable ranking for each link encountered while crawling. Our experiments demonstrate the robustness of our profile-based focused crawler, as well as a significant improvement in harvest ratio over breadth-first and online page importance computation (OPIC) crawlers, when crawling the Flickr website for two different topics.

Highlights

  • Social media-sharing websites such as Flickr and YouTube are becoming more and more popular

  • We propose to use a Document Object Model (DOM) path string-based method for page classification

  • We assume that, with the path string method, not having to consider schema path strings saves considerable effort when extracting real data

Summary

Introduction

Social media-sharing websites such as Flickr and YouTube are becoming more and more popular. Little attention has been paid to effectively exploiting a second type of information, the user profiles, to enhance focused search on social media websites. We exploit the users' profile information from social media-sharing websites to develop a more accurate focused crawler that is expected to enhance the accuracy of multimedia search. To begin the focused crawling process, we first need to accurately identify the type of each page. To this end, we propose a Document Object Model (DOM) path string-based method for page classification.
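To make the notion of a DOM path string concrete, the following is a minimal sketch that collects, for every non-empty text node in a page, the chain of tag names leading from the root to that node. It is an illustration under stated assumptions, not the classification method itself: the names `PathStringCollector` and `path_strings` are hypothetical, the "/" separator is an arbitrary choice, and how path strings are compared or scored to tell list, detail, and profile pages apart is not shown here.

```python
from html.parser import HTMLParser

# HTML void elements never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathStringCollector(HTMLParser):
    """Hypothetical helper: records (path string, text) for each text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tag names of the currently open elements
        self.paths = []   # collected (path string, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append(("/".join(self.stack), text))


def path_strings(html):
    """Return the (path string, text) pairs of an HTML document, in order."""
    collector = PathStringCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


if __name__ == "__main__":
    page = "<html><body><ul><li>Alice</li><li>Bob</li></ul></body></html>"
    for path, text in path_strings(page):
        print(path, "->", text)
```

In this sketch, two text nodes that share the same path string (such as the two `li` items above) sit at structurally equivalent positions in the page, which is the kind of regularity a path string-based classifier could draw on.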

Related Work
Motivation for Profile-Based Focused Crawling
Path String-Based Page Classification
Page Classification Using Path String
Profile-Based Focused Crawler
Cotagging Topic Discovery
Profile-Based Focused Crawling System
Experimental Results
Conclusions and Future Work
