Abstract

A core idea of the existing Material Genome Project is the application of data and artificial intelligence to accelerate material innovation. A lack of data, however, hinders the development of novel materials. The figures and captions in the materials literature summarize essential information about the entire document and provide ample image sample data for research. Extracting figures and captions from the literature is therefore critical to addressing the lack of data. Although some PDF parsing tools can extract information from documents, they generally identify a document's figures by parsing the document into a concrete structure. Because layouts are inconsistent across journals, these tools often produce incorrect recognition results. Thus, an efficient figure and caption extraction network, FCENet, is proposed in the present study. Unlike other extraction tools, this study is the first to adopt instance segmentation models to detect figures and their captions and then extract them. FCENet builds upon BlendMask and introduces a horizontal and vertical attention module. This study splits the BlendMask detection head into two branches, i.e., figure detection and caption detection, which increases final detection accuracy and speed. Nearly 3000 material documents were collected for model training and testing. The final experiments show that FCENet performs significantly better than other existing instance segmentation models. Its box and mask mAP (mean Average Precision) are 8.51% and 12.59% higher than those of BlendMask, respectively. This study hopes that a considerable amount of material image data can be acquired via FCENet to support machine learning and data mining in the material area.

Highlights

  • Scientific research literature represents the most cutting-edge research results in a range of research areas

  • This study is the first to apply instance segmentation to the extraction of figures and captions from document images

  • This study collects and produces an instance segmentation dataset in COCO format and selects box and mask mAP as evaluation metrics, consistent with the COCO dataset
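To make the last highlight concrete, the following is a minimal sketch of what a COCO-format record with figure and caption instances could look like, together with the box IoU (intersection over union) that underlies box mAP. The category ids, file name, page size, and coordinates are hypothetical and are not taken from the dataset of this study; only the field names follow the public COCO annotation layout.

```python
import json

# Hypothetical label map: the actual category ids used in the study are not stated.
CATEGORIES = [{"id": 1, "name": "figure"}, {"id": 2, "name": "caption"}]


def make_annotation(ann_id, image_id, category_id, bbox, polygon):
    """Build one COCO-style instance annotation; bbox is [x, y, width, height]."""
    x, y, w, h = bbox
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x, y, w, h],
        "area": w * h,  # box area, used here as a simple stand-in for mask area
        "segmentation": [polygon],  # flat list of polygon vertices x1, y1, x2, y2, ...
        "iscrowd": 0,
    }


def box_iou(a, b):
    """Intersection over union of two [x, y, width, height] boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


# One page image with one figure and the caption directly below it (made-up values).
dataset = {
    "images": [{"id": 1, "file_name": "page_0001.png", "width": 1240, "height": 1754}],
    "categories": CATEGORIES,
    "annotations": [
        make_annotation(1, 1, 1, [100, 200, 400, 300],
                        [100, 200, 500, 200, 500, 500, 100, 500]),
        make_annotation(2, 1, 2, [100, 510, 400, 40],
                        [100, 510, 500, 510, 500, 550, 100, 550]),
    ],
}

if __name__ == "__main__":
    print(json.dumps(dataset, indent=2))
    # Compare a hypothetical predicted figure box with the ground-truth figure box.
    print(box_iou([110, 210, 400, 300], dataset["annotations"][0]["bbox"]))
```

In the standard COCO evaluation, a prediction counts as a true positive when its IoU with a ground-truth instance of the same category exceeds a threshold, and the reported AP is averaged over thresholds from 0.50 to 0.95. Mask mAP follows the same procedure with the IoU computed on masks instead of boxes.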

Introduction

Scientific research literature represents the most cutting-edge research results in a range of research areas. Its research directions largely determine each area's development and drive progress in science and technology. When these research results flow from science and technology to industry, they can advance human society. The material area is one of the essential areas at present, and its development impacts other areas. A wide variety of areas require materials in practice. The development and application of several critical materials lay the cornerstones for promoting social progress.

The associate editor coordinating the review of this manuscript and approving it for publication was Ting Wang.
