Abstract

Data, collected using advanced web scraping technologies, are crucial to the growth of e-commerce in today's world of highly demanding, hyper-personalized consumer experiences. However, core data extraction engines fail because they cannot adapt to dynamic changes in website content. This study investigates an intelligent and adaptive web data extraction system built on convolutional and Long Short-Term Memory (LSTM) networks: the You Only Look Once (YOLO) algorithm automatically detects product details as image regions on web pages, and Tesseract LSTM extracts the text from the detected regions. The proposed system does not need a core data extraction engine and can thus adapt to dynamic changes in website layout. Experiments conducted on real-world retail cases demonstrate an image detection precision of 97% and a character extraction precision of 99%. In addition, a mean average precision of 74% is obtained with an input dataset of 45 objects or images.
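The precision and mean average precision figures quoted above can be illustrated with a small sketch. It is not the authors' evaluation code: the function names, the 0.5 intersection-over-union (IoU) threshold, and the all-point interpolation used for average precision are assumptions, since the abstract does not say which variant was used.

```python
from typing import List, Sequence, Tuple

# (left, top, right, bottom) in pixels; this box layout is an assumption.
BoxT = Tuple[float, float, float, float]


def iou(a: BoxT, b: BoxT) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def match_detections(detections: Sequence[Tuple[float, BoxT]],
                     truths: Sequence[BoxT],
                     threshold: float = 0.5) -> List[bool]:
    """Flag each detection (highest score first) as true or false positive.

    Each ground-truth box can be matched by at most one detection.
    """
    unmatched = list(truths)
    flags = []
    for _score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= threshold:
            unmatched.remove(best)
            flags.append(True)
        else:
            flags.append(False)
    return flags


def precision(flags: Sequence[bool]) -> float:
    """True positives divided by all detections."""
    return sum(flags) / len(flags) if flags else 0.0


def average_precision(flags: Sequence[bool], n_truths: int) -> float:
    """Area under the precision-recall curve (all-point interpolation)."""
    if n_truths == 0:
        return 0.0
    precisions, recalls, tp = [], [], 0
    for rank, is_tp in enumerate(flags, start=1):
        tp += is_tp
        precisions.append(tp / rank)
        recalls.append(tp / n_truths)
    # Make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Toy data: two ground-truth boxes, three detections (one spurious).
truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 31))]
flags = match_detections(detections, truths)
print(precision(flags), average_precision(flags, len(truths)))
```

Mean average precision is then the mean of `average_precision` over all object classes; with a single class the two coincide.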

Highlights

  • Data, collected using advanced web scraping technologies, are crucial to the growth of e-commerce in today’s world of highly demanding, hyper-personalized consumer experiences

  • For each experiment, the following figures are included: an input image with and without user login, indicating an error/no-error condition; an output image with a bounding box around the user login, indicating an error/no-error condition and demonstrating automated image or object detection with the deep-learning-based You Only Look Once (YOLO) model; and output text demonstrating the proposed web data extraction algorithm. Section 4.3.1, Experiments 1 and 2: extracting data from a single product specification page on the Amazon retail site without and with changes in website layout

  • Data extracted from single product specification pages in the Amazon retail website are shown in Figs. 9 and 10
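
The flow these experiments illustrate (input image, detected bounding boxes, extracted text) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Box`, `crop`, and `extract_fields` are hypothetical names, the image is a plain list of pixel rows, and the detector and OCR steps are passed in as callables so that a YOLO model and a Tesseract wrapper could be plugged in without the sketch depending on either library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Image = List[List[int]]  # grayscale image as rows of pixel values (assumption)


@dataclass(frozen=True)
class Box:
    """Detected region in pixel coordinates; right and bottom are exclusive."""
    label: str
    left: int
    top: int
    right: int
    bottom: int
    score: float


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the box."""
    return [row[box.left:box.right] for row in image[box.top:box.bottom]]


def extract_fields(image: Image,
                   detect: Callable[[Image], Iterable[Box]],
                   read_text: Callable[[Image], str],
                   min_score: float = 0.5) -> Dict[str, str]:
    """Detect regions, keep the best box per label, and OCR each crop."""
    best: Dict[str, Box] = {}
    for box in detect(image):
        if box.score < min_score:
            continue  # discard low-confidence detections
        if box.label not in best or box.score > best[box.label].score:
            best[box.label] = box
    return {label: read_text(crop(image, box)).strip()
            for label, box in best.items()}


# Stand-ins for a trained detector and an OCR engine (purely illustrative).
def fake_detect(image: Image) -> List[Box]:
    return [Box("price", 0, 0, 2, 1, 0.9),
            Box("title", 0, 1, 3, 2, 0.8),
            Box("price", 1, 1, 2, 2, 0.3)]


def fake_ocr(region: Image) -> str:
    # Pretend each pixel value is a character code.
    return " " + "".join(chr(p) for row in region for p in row) + "\n"


page = [[ord("$"), ord("9"), 0],
        [ord("T"), ord("V"), ord("s")]]
fields = extract_fields(page, fake_detect, fake_ocr)
print(fields)
```

Because the text is read from detected image regions rather than from the page markup, nothing in this sketch refers to HTML structure, which is the property the paper relies on to tolerate layout changes.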


Summary

Introduction

Data, collected using advanced web scraping technologies, are crucial to the growth of e-commerce in today’s world of highly demanding, hyper-personalized consumer experiences. Recent advancements in machine learning and Artificial Intelligence (AI) have opened up new opportunities, even in extensively studied research programs in numerous domains, including medical imaging (e.g., image recognition), transportation (e.g., feature extraction in self-driving cars)[1,2], and traffic scenarios (e.g., object detection)[3,4]. These advancements encourage the (pdf, doc, or txt files), websites, and images that use Optical Character Recognition (OCR)[5], subsequently inspiring the development of automated web data extraction systems through leading-edge technology solutions[6,7]. Web data extraction is explored using repetitive blocks[11], with their respective attributes obtained from classification-based approaches. This data extraction technique demonstrates good accuracy and adaptability to layout changes in websites. CNNs are typically applied to accurate object detection[13], semantic segmentation[22] (using a selective search algorithm to propose possible regions of interest)[16,23], and object classification.

