Web Crawler: Design And Implementation For Extracting Article-Like Contents

Ngo Le Huy Hien, Thai Quang Tien, Nguyen Van Hieu

doi:10.35470/2226-4116-2020-9-3-144-151


Open Access


Abstract

The World Wide Web is a large, rich, and accessible information system whose user base is growing rapidly. To retrieve information from the web in response to users' requests, search engines are built to access web pages. As search engine systems play a significant role in cybernetics, telecommunication, and physics, many efforts have been made to enhance their capacity. However, most of the data on the web are unmanaged, making it impossible for current search engine mechanisms to access the entire network at once. The web crawler is therefore a critical part of a search engine, navigating the web and downloading the full text of web pages. Web crawlers may also be applied to detect missing links and for community detection in complex networks and cybernetic systems. However, template-based crawling techniques cannot handle the layout diversity of objects on web pages. In this paper, a web crawler module was designed and implemented in an attempt to extract article-like content from 495 websites. It uses a machine learning approach with visual cues, trivial HTML, and text-based features to filter out clutter. The outcomes are promising for extracting article-like content from websites, contributing to the development of search engine systems and to future research geared towards higher-performance systems.

Highlights

While the World Wide Web comprises a tremendous amount of information from different areas, its content structure is not centrally organized in a specified way and has no predefined data model [Mini and Jatinder, 2014]. The data presented on the Web is mostly text, which can appear in many dissimilar formats [Jain and Subodh, 2018]. A Web crawler is a computer program designed to download data from the World Wide Web in a systematic, methodical, and automated manner [Avinash et al, 2010; Kausar et al, 2013]. It is also called a spider or spider-bot, ant, automatic indexer, bot, or worm [Kobayashi and Takeda, 2000], and is typically used for Web indexing.
In summary, the search operation is a traversal of a directed graph [Kausar et al, 2013]. Using the graph structure of the World Wide Web, web crawlers can move from page to page and reach new web pages through the links on a page.
The website dataset was collected from 495 Uniform Resource Locators (URLs), corresponding to 495 web pages.

Summary

Introduction

While the World Wide Web (commonly known as the Web) comprises a tremendous amount of information from different areas, its content structure is not centrally organized in a specified way and has no predefined data model [Mini and Jatinder, 2014]. The data presented on the Web is mostly text, which can appear in many dissimilar formats [Jain and Subodh, 2018]. A Web crawler is a computer program designed to download data from the World Wide Web in a systematic, methodical, and automated manner [Avinash et al, 2010; Kausar et al, 2013]. It is also called a spider or spider-bot, ant, automatic indexer, bot, or worm [Kobayashi and Takeda, 2000], and is typically used for Web indexing.

The World Wide Web has a graph structure in which the links displayed on a web page can be used to open other web pages [Jain and Subodh, 2018]. In summary, the search operation is a traversal of a directed graph (the Internet) [Kausar et al, 2013]. Using the graph structure of the World Wide Web, web crawlers can move from page to page and reach new web pages through the links on a page. A web crawler starts by retrieving web pages and inserting them into local repositories [Martin et al, 2004]. Web crawlers generate a replica of all visited pages, which is later processed and indexed by search engines [Kausar et al, 2013; Pant
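The traversal described in this introduction can be illustrated with a minimal sketch. It is not the crawler module built in the paper; it only assumes a breadth-first visit order, a caller-supplied `fetch` function, and a plain dictionary standing in for the local repository. The names `LinkExtractor`, `crawl`, `FAKE_WEB`, and the `example.test` URLs are hypothetical and chosen for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first traversal of the link graph starting at `seed`.

    `fetch` maps a URL to its HTML text, or returns None when the page
    cannot be retrieved (an assumption of this sketch).
    Returns {url: html}, standing in for the local repository.
    """
    repository = {}
    frontier = deque([seed])   # pages waiting to be visited
    seen = {seed}              # every URL ever queued, to avoid revisits
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        repository[url] = html  # store a replica of the visited page
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            target, _ = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return repository


# A tiny in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/c#top">C</a>',
    "http://example.test/c": "<p>no outgoing links</p>",
}

pages = crawl("http://example.test/", FAKE_WEB.get)
print(list(pages))
```

Passing `FAKE_WEB.get` as `fetch` keeps the example offline; a real crawler would replace it with an HTTP client and would also need politeness rules such as rate limiting and `robots.txt` handling, which this sketch leaves out.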



Journal: Cybernetics and Physics
Publication Date: Nov 30, 2020
Citations: 4
License type: cc-by
