A novel algorithm for extracting the user reviews from web pages

Erdem Uçar, Pınar Tüfekci, Erdinç Uzun

doi:10.1177/0165551516666446

Abstract

Extracting user reviews from websites such as forums, blogs, newspapers, e-commerce and travel sites is crucial for text processing applications (e.g. sentiment analysis, trend detection/monitoring and recommendation systems) that need to work with structured data. Traditional algorithms consist of three processes: Document Object Model (DOM) tree creation, extraction of features from this tree, and machine learning. However, these algorithms increase the time complexity of the extraction process. This study proposes a novel algorithm with two complementary stages. The first stage determines which HTML tags correspond to the review layout for a web domain, using the DOM tree, its features and decision tree learning. The second stage extracts the review layout from web pages in that domain using the tags found in the first stage. The second stage is more time-efficient, being approximately 21 times faster than the first stage. Moreover, it achieves a relatively high accuracy of 96.67% in our experiments on review block extraction.
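
To make the second stage concrete, here is a minimal sketch of extraction with an already known tag. It is not the authors' implementation: it assumes the first stage has reported the review layout as a tag name plus a class value (here the hypothetical pair "div" and "review"), that the markup is reasonably well formed, and that a streaming pass with Python's standard html.parser is an acceptable stand-in for whatever extraction method the paper uses. The names ReviewBlockExtractor and extract_reviews are illustrative only.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ReviewBlockExtractor(HTMLParser):
    """Collect the text of every element matching a known (tag, class) pair."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag              # assumed output of the first stage
        self.css_class = css_class  # assumed output of the first stage
        self.depth = 0              # nesting depth inside the current block
        self.parts = []             # text fragments of the current block
        self.reviews = []           # finished review texts

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1         # nested element inside a review block
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.depth = 1          # entering a review block

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:         # leaving the review block
            text = "".join(self.parts)
            self.reviews.append(" ".join(text.split()))  # normalise whitespace
            self.parts = []

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def extract_reviews(html, tag, css_class):
    """Return the text of each block matching the given tag and class."""
    parser = ReviewBlockExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.reviews
```

Calling extract_reviews(page_html, "div", "review") on each page of a domain returns one string per matching block. Because no features are computed and no classifier is run per page, a pass of this kind illustrates why reusing the found tags can be cheaper than repeating the full first stage, although the speed-up reported in the abstract refers to the authors' own method, not to this sketch.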

Journal: Journal of Information Science
Publication Date: Sep 1, 2016
Citations: 6

Similar Papers

Web Content Extraction by Integrating Textual and Visual Importance of Web Pages
J Anitha ... K Nethra
International Journal of Computer Applications | VOL. 91
18 Apr 2014

Hunting for DOM-Based XSS vulnerabilities in mobile cloud-based online social network
Shashank Gupta ... Pooja Chaudhary
Future Generation Computer Systems | VOL. 79
12 Jun 2017

Exploiting DOM Mutation for the Detection of Ad-injecting Browser Extension
Azreen Zaini ... Anazida Zainal
09 Sep 2018

Ontology Augmentation via Attribute Extraction from Multiple Types of Sources
Xiu Susie Fang ... Xianzhi Wang
01 Jan 2015
