Abstract

Extracting information from an information platform, site or system is possible if the information is structured or annotated in a way that is convenient for subsequent analysis and data processing, decision making and reasoning. The goal of this paper is to review and categorize various techniques, tools, and libraries for extracting information from unstructured web content (platforms, sites, systems), and to develop a JavaScript application that crawls and extracts data from dynamic web pages without the need to browse, read and search the page content. The paper presents an implementation of a particular JavaScript web scraper that retrieves a list of news headlines from the official European Union Agriculture and Rural Development website without the need for the content of the document to be read by users. The web scraper is configured to extract the searched content directly from the source HTML code of the document, regardless of whether the information is explicit or implicit. It also searches all pages related to the document. Finally exports data in a proper format. The benefits of such a tool for extracting web content from source code are related to saving time, manual labour and means of generating quality content in the biotech and agriculture industry.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call