Abstract

The unprecedented growth of research publications across diversified domains has overwhelmed the research community. Extracting this enormous volume of information by manually analyzing research documents is a cumbersome process. To automatically extract the content of a document in a structured way, its metadata and content must be annotated. The scientific community has focused on automatic content extraction by forming different heuristics and applying various machine learning techniques. ESWC, one of the renowned conference organizers, runs a state-of-the-art challenge to extract metadata such as authors, affiliations, countries in affiliations, supplementary material, sections, tables, figures, funding agencies, and EU-funded projects from PDF files of research articles. We propose a feature-centric technique that extracts the logical layout structure of articles from publishers with diversified composition styles. To extract unique metadata from a research article placed in a logical layout structure, we developed a four-staged novel approach, “FLAG-PDFe”. The approach is built upon distinct and generic features based on the textual and geometric information in the raw content of research documents. In the first stage, the distinct features are used to identify the different physical layout components of an individual article. Since research journals follow their own publishing styles and layout formats, we develop generic features to handle these diversified publishing patterns. In the third stage, after a comprehensive evaluation of the generic features and of candidate machine learning models, we employ support vector classification (SVC) to extract the logical layout structure (LLS), i.e., the sections of an article. Finally, we apply heuristics on the LLS to extract the desired metadata of an article. The outcomes of the study are obtained using a gold-standard data set.
The results yield a recall of 0.877, a precision of 0.928, and an F-measure of 0.897. Our approach achieves a 16% gain in F-measure compared to the best approach of the ESWC challenge.
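To make the feature-centric idea in the abstract concrete, the sketch below computes a few simple textual and geometric features for a text block extracted from a PDF page. This is a minimal illustration only: the block fields (`text`, `font_size`, `is_bold`, `y_position`) and the particular features are assumptions for the sake of example, not the paper's actual feature set; in FLAG-PDFe such vectors would feed an SVC classifier that labels blocks with their logical role.

```python
# Sketch: textual + geometric features for one PDF text block.
# Field names and feature choices are illustrative assumptions,
# not the exact features used by FLAG-PDFe.

def block_features(block, body_font_size=10.0, page_height=792.0):
    text = block["text"]
    words = text.split()
    return {
        # Textual features
        "n_words": len(words),
        "uppercase_ratio": sum(c.isupper() for c in text) / max(len(text), 1),
        "ends_with_period": text.rstrip().endswith("."),
        # Geometric/typographic features
        "relative_font_size": block["font_size"] / body_font_size,
        "is_bold": int(block["is_bold"]),
        "relative_y": block["y_position"] / page_height,
    }

# A hypothetical heading block: larger bold text near the top of a page.
heading = {"text": "3 Proposed Methodology", "font_size": 14.0,
           "is_bold": True, "y_position": 700.0}
print(block_features(heading)["relative_font_size"])  # 1.4
```

A section heading typically stands out on exactly such features (short, bold, enlarged font, no trailing period), which is why a linear classifier like SVC can separate headings from body text once the features are generic enough to survive differing publisher styles.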

Highlights

  • The plethora of research over the web increases rapidly due to the millions of research articles published annually [1]–[3]

  • We present a brief overview of the different machine learning algorithms evaluated for our proposed methodology; comprehensive details and computational complexities are available in [37]–[41]

  • To evaluate the results, standard evaluation measures such as recall, precision, and F-measure are employed. These measures are based on classification counts known as true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN)
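The evaluation measures named in the last highlight follow directly from the confusion counts. A minimal sketch, using illustrative counts rather than the paper's actual figures:

```python
# Standard IR evaluation measures computed from confusion counts.
def precision(tp, fp):
    # Fraction of predicted positives that are correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of actual positives that were found.
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Illustrative counts only, not the paper's reported results.
tp, fp, fn = 90, 10, 15
print(round(precision(tp, fp), 3))   # 0.9
print(round(recall(tp, fn), 3))      # 0.857
print(round(f_measure(tp, fp, fn), 3))  # 0.878
```

Note that TN does not appear in any of the three formulas; it matters for accuracy but not for precision, recall, or F-measure, which is why these measures are preferred when the negative class dominates, as it does in metadata extraction.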


Summary

Introduction

The plethora of research over the web increases rapidly due to the millions of research articles published annually [1]–[3]. Scholars often pose queries based on complex scenarios to retrieve the research documents they require from this colossal scientific resource. Researchers post their queries to find scholarly articles on famous online search engines like Google Scholar1 or Semantic Scholar2 and renowned digital libraries like DBLP3 or ACM4. These platforms do not hold adequate potential to intelligently process a query, which results in an excess of results. This is due to the fact that these search engines harness citation indexes and full-text search of articles to retrieve information, wherein one of the potential aspects, structural information, is overlooked. Human-understandable metadata like author name, affiliation, country, email, section headings with levels, funding agency,

1. https://scholar.google.com
2. https://www.semanticscholar.org/
3. http://dblp.uni-trier.de/
4. https://dl.acm.org/

Methods
Discussion
Conclusion