Pattern Matching-based scraping of news websites

Hamza Salem, Manuel Mazzara

doi:10.1088/1742-6596/1694/1/012011

Abstract

Web scraping is the process of extracting content from human-readable websites in order to import it into local storage such as databases or CSV files. Designing and carrying out this data extraction is time-consuming: it requires an analysis of the website, the data representation of the objects comprising its structure (the DOM), its HTML tags, and its Cascading Style Sheets (CSS) classes. We aim to support this process through automation. In this paper, we propose a pattern mining technique to scrape news and blog websites by recognizing the title and body based on a content structure pattern. The approach consists of three steps: extracting the news website structure, constructing a pattern of the HTML content, and implementing the pattern as a set of rules in web scraping. It is a simple, general, and straightforward way to extract articles, each consisting of a title and a body, from any blog or news website.
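The abstract does not spell out the pattern itself, so the following is only a minimal sketch of what "implementing the pattern as a set of rules" might look like. It assumes a hypothetical rule set in which the first `<h1>` element holds the title and every `<p>` element contributes to the body; the class name `ArticleParser`, the function `extract_article`, and the sample markup are illustrative and are not taken from the paper. Only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects text from the tags named in a hypothetical title/body rule set."""

    def __init__(self, title_tag="h1", body_tag="p"):
        super().__init__()
        self.title_tag = title_tag    # assumed rule: this tag holds the title
        self.body_tag = body_tag      # assumed rule: these tags hold the body
        self.title = ""
        self.paragraphs = []
        self._current = None          # tag currently being captured, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start capturing only when no capture is already in progress.
        if self._current is None and tag in (self.title_tag, self.body_tag):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Collapse runs of whitespace inside the captured text.
        text = " ".join("".join(self._buffer).split())
        if tag == self.title_tag:
            if not self.title:        # keep only the first title match
                self.title = text
        elif text:
            self.paragraphs.append(text)
        self._current = None


def extract_article(html):
    """Apply the rule set to one page and return its title and body."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "body": "\n".join(parser.paragraphs)}


# Illustrative markup, not taken from any real news site.
sample = """
<html><body>
  <nav><a href="/">Home</a></nav>
  <h1>Example <em>headline</em></h1>
  <div class="content">
    <p>First paragraph.</p>
    <p>Second   paragraph.</p>
  </div>
</body></html>
"""
article = extract_article(sample)
```

Keeping the rules as constructor parameters means a different site structure could be handled by passing other tag names, without changing the parsing logic. A real pattern would presumably also consider CSS classes and DOM position, as the abstract suggests.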

Journal: Journal of Physics: Conference Series
Publication Date: Dec 1, 2020
Citations: 5
License type: cc-by

Similar Papers

An Approach of Web Scraping on News Website based on Regular Expression
Achmad Maududie ... Windi Eka Yulia Retnani
01 Nov 2018

Implementation of Web Scraping on News Sites Using the Supervised Learning Method
İlköğretim Online | VOL. 20
01 Jan 2020

Computer Vision-based Web Scraping for Internet Forums
Eric C Dallmeier
19 May 2021

Chapter 5 - CSS
Mario Heiderich ... David Lindsay
Web Application Obfuscation
01 Jan 2010
