Abstract

Automatic information extraction from scientific documents published online is useful in applications such as tagging, web indexing and search engine optimization. As a result, it has become one of the most active areas of research in text mining. Although various information extraction techniques have been proposed in the literature, they are effective only on domain-specific documents with a static, well-defined format, and their accuracy degrades with even a slight modification of that format. To overcome these issues, a novel ontological framework for information extraction (OFIE) using a fuzzy rule base (FRB) and word sense disambiguation (WSD) is proposed. The proposed approach is validated on a significantly wider range of document domains, sourced from well-known publishing services such as IEEE, ACM, Elsevier and Springer. We have also compared the proposed approach against state-of-the-art techniques. The experimental results show that the proposed approach is less sensitive to changes in the document format and achieves a significantly better average accuracy of 89.14% and an F-score of 89%.
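The accuracy and F-score quoted in the abstract are standard evaluation metrics. The sketch below shows how such figures are commonly computed, assuming the usual textbook definitions; the abstract does not state the exact evaluation protocol. The function names `accuracy` and `f_score` and the input numbers are hypothetical and are not taken from the paper.

```python
def accuracy(correct: int, total: int) -> float:
    """Fraction of extracted fields that match the ground truth."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Convention: the score is 0 when both precision and recall are 0.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


if __name__ == "__main__":
    # Illustrative inputs only (assumed values, not results from the paper).
    print(f"accuracy = {accuracy(3, 4):.2f}")
    print(f"F1       = {f_score(0.8, 0.5):.4f}")
```

The F-score balances precision against recall, which is why an extraction approach is typically reported with both an accuracy figure and an F-score rather than accuracy alone.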

Highlights

  • Scientific repositories maintained by research societies such as IEEE, ACM, Elsevier and Springer have become an increasingly important tool for diverse stakeholders that include researchers, businesses, research institutions, government agencies as well as funding agencies[1]

  • The published articles are hosted in the form of structured and unstructured portable document format (PDF) of varying sizes

  • The authors further developed their own tool called PAXAT, which works on the rich text features (RTF) of documents taken from articles published by ACM, IEEE, Springer and arXiv


Summary

INTRODUCTION

Scientific repositories maintained by research societies such as IEEE, ACM, Elsevier and Springer have become an increasingly important tool for diverse stakeholders, including researchers, businesses, research institutions, government agencies and funding agencies[1]. These repositories host millions of published documents that provide rich and useful information to the stakeholders[2]. The bulk of the scientific documents hosted in the publishers' digital libraries[18], [19] are unstructured, which makes it a considerable challenge to extract the required information from such repositories reliably and efficiently[20]–[22]. In general, both information extraction and metadata extraction are sensitive to changes in the document format.

PROBLEM OVERVIEW AND LITERATURE REVIEW
STRUCTURAL INFORMATION EXTRACTION
WORD SENSE DISAMBIGUATION
ONTOLOGY AND DIGITAL LIBRARY
PERFORMANCE ANALYSIS
CONCLUSION AND FUTURE WORK