Abstract
The World Wide Web is a large, rich, and accessible information system whose user base is growing rapidly. To retrieve information from the web in response to users' requests, search engines are built to access web pages. As search engine systems play a significant role in cybernetics, telecommunication, and physics, many efforts have been made to enhance their capacity. However, most of the data on the web are unmanaged, making it impossible for current search engine mechanisms to access the entire network at once. The web crawler is therefore a critical part of a search engine, navigating the web and downloading the full text of web pages. Web crawlers may also be applied to detect missing links and for community detection in complex networks and cybernetic systems. However, template-based crawling techniques cannot handle the layout diversity of objects on web pages. In this paper, a web crawler module was designed and implemented to extract article-like content from 495 websites. It uses a machine learning approach with visual cues, trivial HTML, and text-based features to filter out clutter. The outcomes are promising for extracting article-like content from websites, contributing to the development of search engine systems, and future research is geared towards proposing higher-performance systems.
Highlights
While the World Wide Web comprises a tremendous amount of information from different areas, its content structure is not centrally organized in a specified way and has no predefined data model. [Mini and Jatinder, 2014] The data presented on the Web is mostly text, which can appear in various dissimilar formats. [Jain and Subodh, 2018] A Web crawler is a computer program that downloads data from the World Wide Web in a systematic, methodical, and automated manner. [Avinash et al, 2010; Kausar et al, 2013] It is also known as a spider, spider-bot, ant, automatic indexer, bot, or worm [Kobayashi and Takeda, 2000], and is typically used for Web indexing.
In summary, the search operation is a traversal of a directed graph. [Kausar et al, 2013] Using the graph structure of the World Wide Web, web crawlers move from page to page, reaching new web pages through the links on the pages they have already visited.
The website dataset was collected from 495 Uniform Resource Locators (URLs), corresponding to 495 web pages.
Summary
While the World Wide Web (commonly known as the Web) comprises a tremendous amount of information from different areas, its content structure is not centrally organized in a specified way and has no predefined data model. [Mini and Jatinder, 2014] The data presented on the Web is mostly text, which can appear in various dissimilar formats. [Jain and Subodh, 2018] A Web crawler is a computer program that downloads data from the World Wide Web in a systematic, methodical, and automated manner. [Avinash et al, 2010; Kausar et al, 2013] It is also known as a spider, spider-bot, ant, automatic indexer, bot, or worm [Kobayashi and Takeda, 2000], and is typically used for Web indexing. The World Wide Web has a graph structure in which the links displayed on a web page can be used to open other web pages. [Jain and Subodh, 2018] In summary, the search operation is a traversal of a directed graph (the Internet). [Kausar et al, 2013] Using the graph structure of the World Wide Web, web crawlers move from page to page, reaching new web pages through the links on the pages they have already visited. The crawling process starts by retrieving web pages and inserting them into local repositories [Martin et al, 2004]. Web crawlers generate a replica of all visited pages, which are later processed and indexed by search engines. [Kausar et al, 2013; Pant
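The traversal described in this summary can be illustrated with a minimal sketch. It is not the crawler module from the paper: it assumes a breadth-first visiting order, and the `TOY_WEB` dictionary, the `fetch` stand-in, and the `crawl` function are hypothetical names introduced here so the example runs without network access. It only shows the general idea of following links through a directed graph and storing a replica of each visited page in a local repository.

```python
from collections import deque

# Hypothetical toy web: each URL maps to (page text, outgoing links).
# A real crawler would download and parse pages instead.
TOY_WEB = {
    "http://a.example/": ("home", ["http://a.example/news", "http://b.example/"]),
    "http://a.example/news": ("news", ["http://a.example/"]),
    "http://b.example/": ("other", ["http://c.example/"]),
    "http://c.example/": ("leaf", []),
}


def fetch(url):
    """Stand-in for an HTTP download; returns (text, links) or None."""
    return TOY_WEB.get(url)


def crawl(seed, max_pages=100):
    """Traverse the link graph breadth-first, starting from the seed URL.

    Returns a local repository mapping each visited URL to its page text.
    """
    repository = {}            # replica of every visited page
    frontier = deque([seed])   # URLs waiting to be visited
    seen = {seed}              # URLs already queued, to avoid revisiting cycles

    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:       # unreachable page: skip it
            continue
        text, links = page
        repository[url] = text
        for link in links:     # follow the directed edges of the graph
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository
```

Calling `crawl("http://a.example/")` visits every page reachable from the seed exactly once, even though the toy graph contains a cycle; the resulting repository is what a search engine would later process and index.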