Abstract

The article deals with the issues of data extraction from Web-resources on the example of gathering information on vacancies. There are three main interacting parts of this process: data source, database, and an expert. The main problematic aspects of the data mining process are the availability of several data sources; data representation in different languages; extraction data from different file formats; multiple updating of repetitive operations and data. The advantages and disadvantages of Web Mining methods were analyzed and defined. They are DOM tree analysis, line parsing, usage of regular expressions, XML parsing and visual approach. Method of DOM tree using XPath was applied in the paper. The method of comparator identification for modeling the data extraction process was proposed. The component, which receives the search topic and the search start page, carries out a thematically directed extraction. The comparator compares the extracted word from the page with the words of the search model. The application of the above-mentioned approach is presented for identifying a vacancy on the job search site. The thesaurus of employers' requirements is developed. Words-indicators of the required vacancies are presented in three languages. The parser work was set up. The parser processes the documents and retrieves the data used to fill a particular data model. The developed module works as follows. It begins to work with obtaining an array of necessary pages from the selected Web site. The next step is the analysis of Web page‘s structure. Then it is necessary to get the content of a specific HTML page, which contains the necessary information for its further retrieval and processing. As a result ―vacancy model‖ is developed. The model should include the following elements: vacancy title; date of adding a job to the site; the city where the applicant needs to work; requirements for the candidate; applicant duties; working conditions. Extraction of requirements, liabilities, and conditions was defined as the most problematic area, whereas the same information can be presented in a different way. In order to unify requirement experts were engaged.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.