Abstract

This paper proposes a method for finding and extracting academic information from conference Web pages. The main contributions are as follows: (1) A lightweight topic crawling method based on a search engine is used to crawl academic conference Web pages. (2) A new vision-based page segmentation algorithm is proposed that improves the results of the classical VIPS algorithm by introducing a complete tree. This algorithm divides Web pages into text blocks. (3) Using a Bayesian network classifier, all text blocks are classified into 10 categories according to their visual features, keyword features, and text content features. The initial classification results have 75% precision and 67% recall. (4) The context information of text blocks is employed to repair and refine the initial classification results, which improve to 96% precision and 98% recall. Finally, academic information is easily extracted from the classified text blocks. Experimental results on real-world datasets show that our method is effective and efficient for finding and extracting academic information from conference Web pages.

Keywords: Topic crawler; Web information extraction; Page segmentation
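
The idea in contribution (4), using the labels of neighbouring text blocks to repair an initial classification, can be illustrated with a minimal sketch. The rule below is an assumption chosen for illustration and is not the repair procedure of the paper: a block is relabelled when the blocks immediately before and after it share a label that differs from its own. The function name `smooth_labels` and the category names are hypothetical.

```python
from typing import List


def smooth_labels(labels: List[str]) -> List[str]:
    """Relabel a block when both neighbours agree with each other but not with it.

    Illustrative rule only (an assumption, not the paper's method).
    Decisions are made against the original labels, so one repair
    never triggers another within the same pass.
    """
    repaired = list(labels)
    for i in range(1, len(labels) - 1):
        prev_label, next_label = labels[i - 1], labels[i + 1]
        if prev_label == next_label and labels[i] != prev_label:
            repaired[i] = prev_label
    return repaired


# Hypothetical initial classifier output for seven consecutive text blocks.
initial = ["title", "topic", "topic", "date", "topic", "committee", "committee"]
repaired = smooth_labels(initial)
print(repaired)
# ['title', 'topic', 'topic', 'topic', 'topic', 'committee', 'committee']
```

Here the isolated "date" block inside a run of "topic" blocks is treated as a likely misclassification and corrected, while labels at genuine boundaries between categories are left unchanged.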
