Abstract

The number of scientific publications is believed to double every five years. Citation indexes and digital libraries store these publications as complete PDFs and/or as terms extracted from the documents. This document-level indexing makes it hard for digital repositories, and for the scientific community that relies on them, to handle advanced user requirements. For instance, a query such as “Give me those papers that contain the term ‘PageRank’ in their results section” cannot be answered unless the papers are indexed section-wise. The problem has attracted attention from researchers and from prestigious international challenges at top venues, such as the Semantic Publishing Challenge at ESWC. One important kind of metadata to extract from research papers is section information such as IMRAD (Introduction, Methodology, Results, and Discussion). Researchers have presented different approaches to identify section headings and map them to IMRAD sections. Existing studies have used parameters such as dictionary terms, the template of a paper, and in-text citation frequency to map section headings onto logical sections. A critical analysis of the state of the art revealed that some potentially valuable features have been ignored, features that might yield more accurate mapping. In this study, we propose a novel approach that combines new features with previously well-known features to map section headings to IMRAD. The newly proposed features are: (1) a variant of the in-text citation count, (2) figure counts, (3) table counts, and (4) implicit mapping of subheadings. The data set contains 5000 research papers collected from CiteSeer. Evaluation of the proposed approach against three state-of-the-art approaches revealed improvements in average precision of 18.96%, 21.77%, and 9.50% over Ding et al., Shahid et al., and Habib et al., respectively.
This research has significant implications for citation indexes and digital libraries.
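
To make the section-wise indexing requirement concrete, the following is a minimal sketch of an index keyed by (term, section) rather than by term alone. It is an illustration under stated assumptions, not the indexing scheme of any particular digital library: the function names, the in-memory dictionary layout, and the two toy papers are all hypothetical.

```python
from collections import defaultdict

def build_section_index(papers):
    """Build a (term, section label) -> set of paper ids index.

    `papers` maps a paper id to a dict of section label -> section text.
    This layout is an assumption made for the sketch.
    """
    index = defaultdict(set)
    for paper_id, sections in papers.items():
        for label, text in sections.items():
            for token in text.lower().split():
                term = token.strip('.,;:()"')  # crude punctuation cleanup
                if term:
                    index[(term, label)].add(paper_id)
    return index

def papers_with_term_in_section(index, term, label):
    """Return the sorted ids of papers whose `label` section contains `term`."""
    return sorted(index.get((term.lower(), label), set()))

# Hypothetical toy corpus: two papers, each already split into sections.
papers = {
    "p1": {
        "Introduction": "PageRank is a link analysis algorithm.",
        "Results": "Our method beats the baseline.",
    },
    "p2": {
        "Introduction": "We study ranking.",
        "Results": "PageRank scores improved by a clear margin.",
    },
}
index = build_section_index(papers)
```

With this layout, `papers_with_term_in_section(index, "PageRank", "Results")` only considers text that was labelled as a results section, which a purely document-level term index cannot do.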

Highlights

  • Communication in science is realized through scientific publications

  • The proposed approach takes advantage of accurately identifying the subsections and mapping them to IMRAD headings based on their main section mapping to achieve better results

  • The data set, containing PDF files of research papers, is collected from the CiteSeer digital library

Summary

INTRODUCTION

Communication in science is realized through scientific publications. Owing to recent scientific advances, a tremendous increase has been reported in the number of publications on the Web. The IMRAD structure states that a research paper should comprise logical sections such as Introduction, Methods, Results, and Discussion. One earlier study used an extensive dictionary of terms to identify sections; applied to 866 full-text articles containing 6866 sections, the technique achieved 81% accuracy. Shahid and Afzal [2] extended the technique of Ding et al. [8] with different dictionary terms, along with research paper templates and layout, to identify section headings and map them to the IMRAD structure. The proposed approach identifies subsections accurately and maps them to IMRAD headings based on the mapping of their main section, in order to achieve better results.
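
The idea of mapping a subsection through its main section can be sketched as follows. This is a simplified illustration, not the authors' implementation: the `IMRAD_TERMS` dictionary is a small hypothetical stand-in for the dictionaries used in the cited studies, headings are assumed to carry decimal numbering such as "3.1", and the citation, figure, and table count features are not modelled.

```python
import re

# Hypothetical, deliberately tiny dictionary of heading terms per IMRAD label.
IMRAD_TERMS = {
    "Introduction": ("introduction", "background", "motivation"),
    "Methodology": ("method", "approach", "materials", "data collection"),
    "Results": ("result", "evaluation", "experiment"),
    "Discussion": ("discussion", "conclusion", "future work"),
}

HEADING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+(.*)$")

def split_heading(heading):
    """'3.1 Crawler design' -> (['3', '1'], 'Crawler design')."""
    match = HEADING.match(heading)
    if not match:
        return [], heading.strip()
    return match.group(1).split("."), match.group(2).strip()

def lookup(title):
    """Dictionary-based mapping of a heading title; None if nothing matches."""
    lowered = title.lower()
    for label, terms in IMRAD_TERMS.items():
        if any(term in lowered for term in terms):
            return label
    return None

def map_headings(headings):
    """Map each heading to an IMRAD label (or None).

    Main sections are mapped by dictionary lookup. A subsection inherits
    the label of its main section when that label is known, and falls
    back to dictionary lookup otherwise.
    """
    main_labels = {}  # top-level section number -> label
    mapped = []
    for heading in headings:
        parts, title = split_heading(heading)
        if len(parts) > 1 and main_labels.get(parts[0]) is not None:
            label = main_labels[parts[0]]  # implicit mapping via main section
        else:
            label = lookup(title)
            if len(parts) == 1:
                main_labels[parts[0]] = label
        mapped.append((heading, label))
    return mapped

headings = [
    "1 Introduction",
    "2 Literature Review",
    "3 Methodology",
    "3.1 Crawler design",
    "4 Results and Evaluation",
    "4.1 Precision per section",
    "5 Conclusion",
]
mapping = dict(map_headings(headings))
```

In this sketch "3.1 Crawler design" contains no dictionary term, yet it is labelled Methodology because its main section "3 Methodology" is. "2 Literature Review" stays unmapped, which shows the kind of case where a dictionary alone is insufficient and additional features are needed.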

LITERATURE REVIEW
DATA COLLECTION
METHODOLOGY
RESULTS AND EVALUATION
Result
VIII. CONCLUSION