Hybrid model of content extraction

Pir Abdul Rasool Qureshi, Nasrullah Memon

doi:10.1016/j.jcss.2011.10.012

Open Access

https://doi.org/10.1016/j.jcss.2011.10.012

Journal: Journal of Computer and System Sciences
Publication Date: Nov 2, 2011
Citations: 21
License type: publisher-specific-oa

Affiliation: Maersk (Denmark)

Abstract

We present a hybrid model for content extraction from HTML documents. The model operates on the Document Object Model (DOM) tree of the HTML document. It evaluates each tree node and its associated statistical features, such as link density and the distribution of text across the node, to predict the significance of the node to the overall content of the document. Once the significance of the nodes is determined, formatting characteristics such as fonts, styles, and node position are evaluated to identify nodes whose formatting is similar to that of the significant nodes. The proposed hybrid model is derived from two different models, one based on statistical features and the other on formatting characteristics, and it achieved the best accuracy. We validate the model with experiments conducted on standard data sets. The results show that the proposed model outperformed other existing content extraction models. We present a browser-based implementation of the proposed model as a proof of concept and compare the implementation strategy with various state-of-the-art implementations. We also discuss various applications of the proposed model, with special emphasis on open source intelligence.
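
The statistical stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact features or weights, so the scoring rule (a node's share of the document text multiplied by one minus its link density), the names `Node`, `TreeBuilder`, `link_density`, and `score_nodes`, and the use of Python's standard `html.parser` in place of a browser DOM are all assumptions made for illustration. The sketch also assumes well-formed HTML and does not cover the formatting-based stage.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they carry no text of their own.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical stand-in for a DOM element with two text counters."""

    def __init__(self, tag, attrs=(), parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.text_chars = 0  # characters of text in this subtree
        self.link_chars = 0  # characters of that text inside <a> elements


class TreeBuilder(HTMLParser):
    """Builds a simple element tree; assumes well-formed input."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root
        self.anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        # Ignore void tags and stray end tags that do not match.
        if tag in VOID_TAGS or self.current.tag != tag:
            return
        if tag == "a":
            self.anchor_depth -= 1
        self.current = self.current.parent

    def handle_data(self, data):
        count = len(data.strip())
        if count == 0 or self.current.tag in ("script", "style"):
            return
        # Credit the text to the enclosing element and all its ancestors.
        node = self.current
        while node is not None:
            node.text_chars += count
            if self.anchor_depth > 0:
                node.link_chars += count
            node = node.parent


def link_density(node):
    """Fraction of a node's text that sits inside links."""
    return node.link_chars / node.text_chars if node.text_chars else 0.0


def score_nodes(html):
    """Return (node, score) pairs using the assumed scoring rule."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    total = builder.root.text_chars or 1
    scores = []
    stack = list(builder.root.children)
    while stack:
        node = stack.pop()
        text_share = node.text_chars / total
        scores.append((node, text_share * (1.0 - link_density(node))))
        stack.extend(node.children)
    return scores
```

Under this assumed rule, a navigation block made up entirely of links scores zero, while a block of plain paragraph text scores in proportion to its share of the document's text. A real system along the lines of the abstract would then compare formatting characteristics of the remaining nodes against the high-scoring ones.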
