Scraping Relevant Images from Web Pages without Download

Erdinç Uzun

doi:10.1145/3616849

Abstract

Automatically scraping relevant images from web pages is an error-prone and time-consuming task, leading experts to prefer manually preparing extraction patterns for a website. Existing web scraping tools are built on these patterns. However, this manual approach is laborious and requires specialized knowledge. Automatic extraction approaches, while a potential solution, require large training datasets and numerous features, including width, height, pixels, and file size, that can be difficult and time-consuming to obtain. To address these challenges, we propose a semi-automatic approach that does not require an expert, utilizes small training datasets, and has a low error rate while saving time and storage. Our approach involves clustering web pages from a website and suggesting several pages for a non-expert to annotate relevant images. The approach then uses these annotations to construct a learning model based on textual data from the HTML elements. In the experiments, we used a dataset of 635,015 images from 200 news websites, each containing 100 pages, with 22,632 relevant images. When comparing several machine learning methods for both automatic approaches and our proposed approach, the AdaBoost method yields the best performance results. When using automatic extraction approaches, the best f-Measure that can be achieved is 0.805 with a learning model constructed from a large training dataset consisting of 120 websites (12,000 web pages). In contrast, our approach achieved an average f-Measure of 0.958 for 200 websites with only six web pages annotated per website. This means that a non-expert only needs to examine 1,200 web pages to determine the relevant images for 200 websites. Our approach also saves time and storage space by not requiring the download of images and can be easily integrated into currently available web scraping tools, because it is based on textual data.
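The abstract does not list the exact textual features the approach uses, so the following is only a minimal sketch of the general idea: describing each image purely from the HTML markup, without downloading the image file. The class `ImageFeatureParser`, the function `extract_image_features`, and the chosen feature names (`parent_tag`, `parent_class`, `depth`) are hypothetical illustrations, not the author's feature set. In a full pipeline, feature dictionaries like these would be vectorized and passed to a classifier such as AdaBoost, trained on the few pages a non-expert annotates.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ImageFeatureParser(HTMLParser):
    """Collects text-only features for every <img> (hypothetical feature set)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open ancestors as (tag, class) pairs
        self.images = []  # one feature dict per <img> element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            parent_tag, parent_class = self.stack[-1] if self.stack else ("", "")
            self.images.append({
                "src": attrs.get("src") or "",
                "alt": attrs.get("alt") or "",
                "class": attrs.get("class") or "",
                "parent_tag": parent_tag,
                "parent_class": parent_class,
                "depth": len(self.stack),
            })
        elif tag not in VOID_TAGS:
            self.stack.append((tag, attrs.get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def extract_image_features(html):
    """Return textual features for all images in an HTML string (no image download)."""
    parser = ImageFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.images


# Made-up sample page: a logo (likely irrelevant) and a lead photo (likely relevant).
sample = """
<html><body>
  <div class="header"><img src="/static/logo.png" alt="Site logo"></div>
  <article class="story">
    <figure class="lead-photo"><img src="/media/2023/flood.jpg" alt="Flooded street"></figure>
  </article>
</body></html>
"""
features = extract_image_features(sample)
for f in features:
    print(f["src"], f["parent_tag"], f["parent_class"], f["depth"])
```

Because every feature comes from the page source alone, no image bytes are fetched, which is where the time and storage savings described above would come from.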

Journal: ACM Transactions on the Web	Publication Date: Oct 11, 2023
Citations: 1

Similar Papers

Identification of Query Forms for Retrieving the Information From Deep Web
Nripendra Narayan Das ... Ela Kumar
Transactions on Machine Learning and Artificial Intelligence | VOL. 2
31 Dec 2015

Automatic Data Extraction from Data-Rich Web Pages
Dongdong Hu ... Xiaofeng Meng
01 Jan 2004

An efficient regular expression inference approach for relevant image extraction
Hayri Volkan Agun ... Erdinç Uzun
Applied Soft Computing | VOL. 135
14 Jan 2023

Automatically Discovering Relevant Images From Web Pages
Erdinc Uzun ... Tarik Yerlikaya
IEEE Access | VOL. 8
01 Jan 2020
