A Novel Approach to Data Extraction on Hyperlinked Webpages

Shaukat Shaukat, Khushi Khushi, Masood Masood

doi:10.3390/app9235102

Abstract

The World Wide Web holds an enormous amount of useful data presented as HTML tables. These tables are often linked to other web pages that provide further detail on certain attribute values. Extracting the schema of such relational tables is a challenge because there is no standard format and a lack of published algorithms. We downloaded 15,000 web pages from various websites using our in-house web crawler. Tables were extracted from the HTML code, and table rows were labeled with appropriate class labels. Conditional random fields (CRF) were used to classify table rows, and a nondeterministic finite automaton (NFA) algorithm was designed to identify simple, complex, hyperlinked, or non-linked tables. A simple schema was extracted for non-linked tables; for linked tables, a relational schema in the form of primary and foreign keys (PKs and FKs) was developed. Child tables were concatenated with the parent table’s attribute value (PK), which serves as the foreign key (FK). As a result, these tables can support more powerful queries using the join operation. Manual checking of the linked web table results showed 99% precision and 68% recall. Our downloadable corpus of 15,000 pages and the novel algorithm provide a basis for further research in this field.
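The parent/child linking step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`attach_foreign_key`, `join_tables`), the `_fk` column suffix, and the sample rows are all hypothetical, and the tables are assumed to be already extracted into lists of dictionaries.

```python
def attach_foreign_key(parent_rows, pk_column, child_tables):
    """Add the parent's PK value to every child row as an FK column.

    child_tables maps a parent PK value to the rows extracted from the
    page that this value links to (hypothetical, pre-extracted data).
    """
    fk_column = f"{pk_column}_fk"  # naming convention assumed for this sketch
    linked = []
    for parent in parent_rows:
        pk_value = parent[pk_column]
        for child in child_tables.get(pk_value, []):
            linked.append({fk_column: pk_value, **child})
    return linked


def join_tables(parent_rows, child_rows, pk_column):
    """Inner join of child rows with their parent row on PK = FK."""
    fk_column = f"{pk_column}_fk"
    parents_by_pk = {row[pk_column]: row for row in parent_rows}
    return [
        {**parents_by_pk[child[fk_column]], **child}
        for child in child_rows
        if child[fk_column] in parents_by_pk
    ]


# Hypothetical parent table whose "team" values link to detail pages.
parent = [
    {"team": "Lions", "city": "Lahore"},
    {"team": "Eagles", "city": "Karachi"},
]
# Hypothetical child tables found behind those links (only one team is linked).
children = {
    "Lions": [
        {"player": "A", "runs": 40},
        {"player": "B", "runs": 12},
    ],
}

linked_children = attach_foreign_key(parent, "team", children)
joined = join_tables(parent, linked_children, "team")
```

Here each child row carries the parent's key, so a query such as "all players of teams based in a given city" becomes a single join rather than a manual traversal of hyperlinks.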

Highlights

Over the years, the World Wide Web (WWW) has gained significant popularity and is now regarded as a treasure trove of information.
On the World Wide Web (WWW), data are often shown in two-dimensional, grid-like structures referred to as tables.
HTML web tables can be in different formats, as described by [35].

Summary

Introduction

The World Wide Web (WWW) has gained significant popularity and is now regarded as a treasure trove of information. This information takes many forms, including images, text, audio, and video. A tabular representation of data on the web is considered more effective and precise than a non-tabular one. The number of tables available on the web ranges from hundreds of thousands to millions [1]. Data in tabular form is usually brief, yet very rich in information.

Methods

Results

Conclusion