Extracting News from Server Side Databases by Query Interfaces

Hao Han

doi:10.1080/08874417.2014.11645686

Abstract

Web news has become an important information resource, and collecting and analyzing it can yield desired information. This paper proposes an effective and efficient Web-based knowledge acquisition approach that extracts the full content of Web news from news site databases, using the sites' own news search engines as query interfaces. We do not crawl the news sites to collect news pages. Instead, we use the news search engines affiliated with the news sites to search for the desired news articles directly in the news site databases. We submit search keywords to these search engines and extract the full content of the returned news articles without machine learning or pattern matching. The approach is applicable to general news sites, and the experimental results show that it can extract a large amount of Web news content from news site databases automatically, quickly, and accurately.
