An efficient regular expression inference approach for relevant image extraction

Hayri Volkan Agun, Erdinç Uzun

doi:10.1016/j.asoc.2023.110030

Abstract

Traditional approaches for automatically extracting relevant images from web pages are error-prone and time-consuming. Web data extraction approaches typically try to improve this task by preparing larger datasets and finding new features, but both operations are difficult and laborious. In this study, we propose a fully automated approach, based on the alignment of regular expressions, for extracting relevant images from web pages. This is the first time that automatically constructed regular expressions have been applied to a classification task. To this end, a multi-stage inference approach is developed that generates regular expressions from the attribute values of relevant and irrelevant image elements in web pages. The proposed approach reduces the complexity of aligning two regular expressions by applying a constraint to a version of the Levenshtein distance algorithm. The classification accuracy of the regular expression approaches is compared with that of naive Bayes, logistic regression, J48, and multilayer perceptron classifiers on a balanced relevant image retrieval dataset consisting of 360 image element samples from 10 shopping websites. According to the cross-validation results, regular expression inference-based classification achieved an F-measure of 0.98 with only 5 frequent n-grams and outperformed the other classifiers on the same set of features. The classification efficiency of the proposed approach is measured at 0.108 ms, which is very competitive with the other classifiers.
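The abstract does not say which constraint is placed on the Levenshtein distance computation. As an illustration only, the sketch below assumes one common choice, a diagonal band: cells of the dynamic-programming table with |i - j| greater than a chosen `band` are never evaluated, which cuts the work from O(n·m) to roughly O(n·band). The function name `banded_levenshtein`, the `band` parameter, and the token lists are hypothetical and are not taken from the paper.

```python
def banded_levenshtein(a, b, band):
    """Edit distance between sequences a and b, restricted to a diagonal band.

    Assumption for illustration: the constraint is a band of half-width
    `band` around the main diagonal. The result is exact whenever the true
    distance is <= band; otherwise it is only an upper bound.
    Returns None when the length difference alone exceeds the band.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None  # the final cell (n, m) lies outside the band

    INF = float("inf")
    # Row 0: turning an empty prefix of a into b[:j] costs j insertions.
    prev = [j if j <= band else INF for j in range(m + 1)]

    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if i <= band:
            cur[0] = i  # i deletions, still inside the band
        lo = max(1, i - band)
        hi = min(m, i + band)
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(
                prev[j] + 1,         # deletion
                cur[j - 1] + 1,      # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = cur
    return prev[m]


# Hypothetical attribute values of two image elements, split into tokens.
relevant = ["product", "image", "large"]
candidate = ["product", "image", "thumb"]
distance = banded_levenshtein(relevant, candidate, band=1)
```

Because the function only compares elements for equality, it works on strings as well as on token lists, so the same routine could in principle align sequences of regular expression tokens.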

Journal: Applied Soft Computing
Publication Date: Jan 14, 2023
Citations: 3
