The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction

Marcin Michał Mirończuk

doi:10.1007/s10115-017-1097-2

Abstract

This study proposes an information extraction system, called BigGrams, which retrieves relevant, structured information (relevant phrases, keywords) from semi-structured web pages, i.e. HTML documents. For this purpose, a novel semi-supervised wrapper induction algorithm has been developed and embedded in the BigGrams system. The wrapper induction algorithm uses formal concept analysis to induce information extraction patterns. In this article, the author also (1) presents the impact of the configuration of the information extraction system's components on information extraction results and (2) tests the boosting mode of the system. Based on empirical research, the author established that the proposed taxonomy of seeds and the HTML tag-level analysis, combined with appropriate pre-processing, improve information extraction results. The boosting mode also works well when certain requirements are met, i.e. when well-diversified input data are ensured.
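To make the idea of seed-driven wrapper induction concrete, the following is a minimal sketch and not the BigGrams algorithm itself: it does not use formal concept analysis, and the function names (`induce_wrappers`, `extract`), the fixed context `width`, and the `min_support` threshold are assumptions made for illustration only. It collects the character contexts that surround known seed values in an HTML page, keeps the contexts shared by several seeds, and then reuses them as extraction patterns.

```python
import re
from collections import defaultdict


def induce_wrappers(pages, seeds, width=4, min_support=2):
    """Return (left, right) contexts that surround at least min_support distinct seeds.

    Hypothetical simplification: contexts are fixed-width character windows.
    """
    support = defaultdict(set)
    for page in pages:
        for seed in seeds:
            for match in re.finditer(re.escape(seed), page):
                left = page[max(0, match.start() - width):match.start()]
                right = page[match.end():match.end() + width]
                # Skip occurrences too close to the page boundary.
                if len(left) == width and len(right) == width:
                    support[(left, right)].add(seed)
    return {ctx for ctx, found in support.items() if len(found) >= min_support}


def extract(pages, wrappers):
    """Apply each (left, right) context as a pattern and collect the enclosed text."""
    found = set()
    for page in pages:
        for left, right in wrappers:
            pattern = re.escape(left) + r"(.+?)" + re.escape(right)
            found.update(re.findall(pattern, page))
    return found


# Illustrative input: one page and two seed values (made-up data).
pages = ["<ul><li>Warsaw</li><li>Krakow</li><li>Gdansk</li></ul>"]
seeds = {"Warsaw", "Krakow"}

wrappers = induce_wrappers(pages, seeds)
extracted = extract(pages, wrappers)
print(wrappers)
print(sorted(extracted))
```

In this toy input, the two seeds share one context, so the induced pattern also picks up the third list item, which was not among the seeds. Requiring support from more than one seed is a simple guard against patterns that fit a single occurrence by accident.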


Journal: Knowledge and Information Systems
Publication Date: Aug 20, 2017
Citations: 14
License type: open-access

Similar Papers

Inducing information extraction systems for new languages via cross-language projection
Ellen Riloff ... David Yarowsky
01 Jan 2002

Use of a Fast Information Extraction Method as a Decision Support Tool
Mahmudul Sheikh ... Sumali Conlon
Journal of International Technology and Information Management | VOL. 19
01 Jan 2009

Join Optimization of Information Extraction Output: Quality Matters!
Alpa Jain ... Luis Gravano
01 Mar 2009

Use of Natural Language Processing to Infer Sites of Metastatic Disease From Radiology Reports at Scale.
See Boon Tay ... Ryan Shea Ying Cong Tan
JCO clinical cancer informatics | VOL. 8
01 May 2024
