Abstract

In traditional web crawling, every crawled web page is first stored in a database. As a result, this approach can store unnecessary web pages, and it requires additional running time to construct a sentiment dictionary for a particular domain because sentiment words must be identified by scanning all web pages in the database. To address these problems, we first define the sentiment-aware web crawling problem and then propose two hash-based methods to implement it. One is based on hash join and the other on bucket-sorted hash join. In particular, we propose a novel bucket-sorted hash join for efficient sentiment-aware web crawling. Our experimental results show that the proposed web crawling method using bucket-sorted hash join outperforms existing web crawling methods by significantly reducing both running time and storage space. With the proposed method, the sentiment-aware task takes 0.016 seconds per web page, and database space is reduced by 59% compared to existing web crawling methods.

Highlights

  • Data mining techniques that extract useful information hidden in objective facts have long been widely used, but with the spread of smart devices and social network services, recent studies on analyzing and aggregating people's subjective information have become increasingly important

  • Public opinion and market research are no longer conducted through traditional surveys; instead, relevant data are collected automatically from the web, and the pros and cons on the surveyed topic are summarized through sentiment analysis

  • In this work, we propose a new sentiment-aware web crawling approach that filters unnecessary web pages during web crawling


Summary

INTRODUCTION

Data mining techniques that extract useful information hidden in objective facts have long been widely used, but with the spread of smart devices and social network services, recent studies on analyzing and aggregating people's subjective information have become increasingly important. The same process is repeated until the queue is empty. In this manner, traditional web crawling methods are likely to store the downloaded web pages in a file system, and all of these pages are scanned when a sentiment dictionary for a particular domain is constructed. To make the sentiment-aware task efficient, we propose a solution that fits our problem by borrowing the existing hash join algorithm. We call this approach sentiment-aware web crawling based on hash join. The proposed bucket-sorted hash join method is faster than the hash-join-based method for the sentiment-aware task in web crawling.
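
The hash-join idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`build_table`, `probe_page`, `crawl_filter`), the tokenizer, the bucket count, and the rule "keep a page if it contains at least one sentiment word" are all assumptions made for illustration. The sketch builds a hash table over a small set of sentiment words (build phase) and then probes it with the tokens of each downloaded page (probe phase), so that pages without sentiment words are never stored.

```python
import re

# Hypothetical tokenizer: lowercase alphabetic tokens (an assumption).
TOKEN_RE = re.compile(r"[a-z']+")


def build_table(sentiment_words, num_buckets=16):
    """Build phase: place each sentiment word into a hash bucket."""
    table = [[] for _ in range(num_buckets)]
    for word in sentiment_words:
        w = word.lower()
        bucket = table[hash(w) % num_buckets]
        if w not in bucket:
            bucket.append(w)
    return table


def probe_page(table, page_text):
    """Probe phase: look up every page token in its bucket."""
    matches = set()
    for token in TOKEN_RE.findall(page_text.lower()):
        if token in table[hash(token) % len(table)]:
            matches.add(token)
    return matches


def crawl_filter(pages, sentiment_words):
    """Keep only pages containing at least one sentiment word (assumed rule)."""
    table = build_table(sentiment_words)
    kept = {}
    for url, text in pages.items():
        matches = probe_page(table, text)
        if matches:
            kept[url] = matches
    return kept


# Illustrative data only; the URLs and texts are made up.
pages = {
    "page-a": "The battery life is great, but the screen is terrible.",
    "page-b": "Store hours: 9 to 5.",
}
kept = crawl_filter(pages, ["great", "terrible", "poor"])
```

In this sketch only `page-a` would be stored, together with the sentiment words found in it, which mirrors the goal of filtering pages during the crawl rather than scanning the whole database afterwards.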

RELATED WORK
MAIN PROPOSAL
EXPERIMENTAL VALIDATION
EXPERIMENTAL RESULTS
Findings
DISCUSSION
CONCLUSIONS