Abstract
Web crawlers can automatically extract information from web pages, but some pages repeat keywords to inflate their search rankings. To address this issue, we propose an adaptive PageRank algorithm and build a crawler system around it. Specifically, we generate a relationship matrix from the access relationships among the crawled web pages, then iteratively generate a probability matrix based on the number of web pages, and finally display the crawled web pages in descending order of their calculated weights. In addition, we propose to control the iterative process in PageRank with the coherence of anchor texts. The crawling functions of the system are implemented in Python. Experimental results demonstrate that the system collects data at high speed. Compared with the Hints and classical PageRank crawler systems, the proposed method achieves higher precision and recall.
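As an illustration of the ranking step described above, the following is a minimal sketch of classical PageRank only, not the adaptive variant proposed here. It builds transition probabilities from a link graph, iterates until the weights stabilize, and returns pages in descending order of weight. The function names, the damping factor of 0.85, the tolerance-based stopping rule, and the toy link graph are all assumptions made for this example; the anchor-text coherence criterion used to control the iteration is not specified in the abstract and is therefore omitted.

```python
def pagerank(links, damping=0.85, tol=1e-8, max_iter=200):
    """Classical PageRank by power iteration (illustrative sketch).

    links: dict mapping a page to the list of pages it links to.
    damping, tol, max_iter: assumed values, not taken from the paper.
    """
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution

    for _ in range(max_iter):
        # Weight held by pages without out-links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}

        # Each page passes an equal share of its weight to every page it links to.
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new_rank[t] += share

        # Stop once the weights change by less than the tolerance (L1 norm).
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


def ranked_pages(links, **kwargs):
    """Return pages sorted in descending order of PageRank weight."""
    scores = pagerank(links, **kwargs)
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical link graph standing in for crawled access relationships.
    example_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page in ranked_pages(example_links):
        print(page)
```

A plain dictionary is used in place of an explicit matrix to keep the sketch free of dependencies; for a large crawl, a sparse matrix representation would likely be a better fit.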