Abstract
Personalized collection of professional dynamic information must meet three requirements: crawl only information published within a specified time window, extract information that is accurate and focused on the user's interests, and ensure that the collected information is complete. Combining jsoup and Lucene with research on the extensibility of Heritrix, this paper proposes a set of personalization strategies and methods for professional dynamic information collection: analyze the composition of each URL and, according to its type, filter out information published outside the specified time window; set up different cleaning templates to extract attribute items from websites of different styles; establish a per-user blacklist to filter out information with poor relevance to the user's interests; and apply a correction function to improve the robustness of the crawler and guarantee the completeness of the information. The collection component of the Dynamic Information Collecting System of Oil and Gas Resources project serves as a case study for verification. The results show that the strategies and methods achieve the three objectives: timeliness, accuracy and interest focus, and completeness. The proposed strategies and methods may be applied to the construction of industrial dynamic information collection systems.
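As an illustration of the first strategy (filtering by time window from the composition of the URL), the following is a minimal sketch rather than the paper's implementation. The class name `UrlDateFilter`, the method names, the recognized date layouts (`/2015/03/27/`, `/2015-03-27/`, `/20150327/`), and the decision to keep URLs that carry no recognizable date are all assumptions made for this example. It uses only the Java standard library and is not wired into Heritrix.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decides whether a URL falls inside a publication window.
class UrlDateFilter {

    // Assumed URL date layouts: yyyy/MM/dd, yyyy-MM-dd, yyyy_MM_dd, yyyyMMdd.
    // The lookarounds stop the pattern from matching inside a longer digit run.
    private static final Pattern DATE_IN_URL = Pattern.compile(
            "(?<!\\d)(20\\d{2})[/\\-_]?(\\d{2})[/\\-_]?(\\d{2})(?!\\d)");

    // Returns the first valid calendar date embedded in the URL, if any.
    static Optional<LocalDate> extractDate(String url) {
        Matcher m = DATE_IN_URL.matcher(url);
        while (m.find()) {
            try {
                return Optional.of(LocalDate.of(
                        Integer.parseInt(m.group(1)),
                        Integer.parseInt(m.group(2)),
                        Integer.parseInt(m.group(3))));
            } catch (DateTimeException e) {
                // Digits that do not form a real date (e.g. month 13): keep searching.
            }
        }
        return Optional.empty();
    }

    // True if the URL should be crawled for the window [from, to] (inclusive).
    // Assumption: URLs without a recognizable date are kept, so that
    // completeness is not sacrificed for timeliness.
    static boolean shouldCrawl(String url, LocalDate from, LocalDate to) {
        return extractDate(url)
                .map(d -> !d.isBefore(from) && !d.isAfter(to))
                .orElse(true);
    }
}
```

In a real crawler, a check of this kind would typically run before a candidate URL is scheduled for download, and URLs without an embedded date would need a different test, such as a date extracted from the page itself.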