Abstract
News aggregation websites collect news from various online sources using crawling techniques and provide a unified view to millions of users. Since news sources update information frequently, aggregators have to recrawl them from time to time in order to archive the news content durably. The majority of recrawling techniques assume unlimited resources and zero operating cost. In reality, however, resources and budget are limited, and it is impossible to crawl every news source at every point in time. To the best of our knowledge, no existing technique discusses a crawling strategy that retrieves the maximum amount of information in a resource- or budget-constrained environment. In this paper, we present a framework, AcT, that supports two accuracy-aware personalized crawling techniques for attaining the optimal accuracy level of information retrieval. In the first scheme, the crawling frequency is given as a resource constraint, and the aim is to find the schedule that maximizes accuracy. In the second scheme, the accuracy level is given, and we optimize the crawling frequency and the corresponding crawling schedule. We propose a supervised technique that monitors each news source for a particular time period and collects its news update patterns. For the first scheme, these patterns are analyzed using mixed integer programming to discover the optimal crawling schedule; for the second scheme, a greedy strategy is proposed to discover the optimal crawling frequency and crawling schedule. We developed a crawler for 87 news sources and performed a series of experiments to demonstrate the quality and efficiency of the proposed techniques against benchmark strategies.
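The abstract does not spell out the greedy strategy, so the following is only a minimal sketch of how the second scheme could look under stated assumptions, not the AcT algorithm itself. It assumes that time is split into discrete slots, that the monitoring phase yields a count of new items per slot, and that an item stays retrievable for a fixed number of slots after publication. The names `captured`, `greedy_schedule`, `updates`, `window`, and `target` are hypothetical and do not come from the paper.

```python
from typing import List


def captured(updates: List[int], crawls: List[int], window: int) -> int:
    """Count retrieved items under an assumed model: an item published in
    slot t can be retrieved by any crawl in slots t .. t + window - 1."""
    total = 0
    for t, count in enumerate(updates):
        if any(t <= c < t + window for c in crawls):
            total += count
    return total


def greedy_schedule(updates: List[int], window: int, target: float) -> List[int]:
    """Add one crawl slot at a time, always the slot with the largest gain,
    until the captured fraction of items reaches `target` (hypothetical helper)."""
    if window < 1 or not 0.0 <= target <= 1.0:
        raise ValueError("window must be >= 1 and target must be in [0, 1]")
    total_items = sum(updates)
    crawls: List[int] = []
    if total_items == 0:
        return crawls  # nothing was published, so no crawl is needed
    while captured(updates, crawls, window) / total_items < target:
        current = captured(updates, crawls, window)
        # Pick the unused slot whose addition captures the most new items.
        best_slot = max(
            (s for s in range(len(updates)) if s not in crawls),
            key=lambda s: captured(updates, crawls + [s], window) - current,
        )
        crawls.append(best_slot)
    return sorted(crawls)


# Observed update pattern for one source: new items per slot (made-up numbers).
pattern = [0, 5, 3, 0, 0, 4, 1, 0]
schedule = greedy_schedule(pattern, window=2, target=0.9)
print(schedule, len(schedule))
```

In this sketch the length of the returned schedule plays the role of the crawling frequency, and the returned slots play the role of the crawling schedule. A greedy choice of this kind is simple and fast but is not guaranteed to give the smallest possible number of crawls in general.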