Abstract
New scientific and technological (S&T) knowledge is being introduced rapidly, and the effort required to understand and analyze newly published S&T documents grows daily. Automated text mining and vision recognition techniques alleviate the burden somewhat, but the variety of document layout formats and knowledge content granularities across the S&T field makes this challenging. Therefore, this paper proposes LA-SEE (LAME and Vi-SEE), a knowledge graph construction framework that simultaneously extracts meta-information and useful image objects from S&T documents in various layout formats. We adopt Layout-aware Metadata Extraction (LAME), which can accurately extract metadata from various layout formats, and implement a transformer-based instance segmentation model, Vision-based Semantic Elements Extraction (Vi-SEE), to maximize vision-based semantic element recognition. Moreover, to construct a scientific knowledge graph spanning multiple S&T documents, we define an extensible Semantic Elements Knowledge Graph (SEKG) structure. So far, we have extracted about 6 million semantic elements from 49,649 PDFs. In addition, to illustrate the potential of our SEKG, we provide two promising application scenarios: a scientific knowledge guide across multiple S&T documents and question answering over scientific tables.
Highlights
Decision support systems or specific methods for science and technology (S&T) problems or social issues can be employed effectively across various domain user types related to policymaking, research topic search, research method survey, comparing experimental results, emerging technology trend analyses, etc. Junior researchers may have difficulty collecting target information because they lack domain knowledge
We propose a layout-aware semantic element extraction (LA-SEE) framework that can extract meta and semantic knowledge from S&T documents and construct a knowledge graph (KG) with the extracted semantic elements
We propose two user scenarios based on the proposed Semantic Elements Knowledge Graph (SEKG) to confirm promising applications
Summary
Decision support systems or specific methods for science and technology (S&T) problems or social issues can be employed effectively across various domain user types related to policymaking, research topic search, research method survey, comparing experimental results, emerging technology trend analyses, etc. However, junior researchers may have difficulty collecting target information because they lack domain knowledge. To address such limitations, this study aims to enable a sophisticated decision support system by extracting semantic elements from S&T documents and constructing a knowledge graph (KG) with those elements. The recently proposed SciNLP-KG is an end-to-end natural language processing (NLP) KG construction approach built from 30,000 NLP papers, focusing on four extracted relationship types among tasks, datasets, and evaluation metrics. Its relationship extraction modules still achieved an F1-score below 80%. Liu et al. [9] defined a metaknowledge architecture to construct structural knowledge from documents, in contrast with previous KGs but similar to the present paper's approach. They employed a multi-modal metaknowledge extraction model to extract and organize metaknowledge elements (e.g., titles, authors, abstracts, and sections) from a government policy document dataset and DocBank [10]. We propose two user scenarios based on the proposed SEKG to confirm promising applications.
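To make the idea of a knowledge graph built from semantic elements more concrete, the following is a minimal sketch and not the SEKG schema itself, which this summary does not specify. The class name ToyElementGraph, the node kinds, the has_element relation, and the sample documents are all hypothetical and chosen only for illustration. It shows documents linked to extracted elements (titles, tables, figures) and a simple lookup across documents, loosely in the spirit of the knowledge guide scenario.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    node_id: str
    kind: str       # hypothetical kinds: "document", "title", "table", "figure"
    text: str = ""  # extracted text or caption, empty for document nodes


@dataclass
class ToyElementGraph:
    """Illustrative graph of documents and their extracted elements."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (source_id, relation, target_id)

    def add_node(self, node_id, kind, text=""):
        self.nodes[node_id] = Node(node_id, kind, text)

    def add_edge(self, source_id, relation, target_id):
        self.edges.append((source_id, relation, target_id))

    def elements_of(self, doc_id, kind=None):
        """Return the elements attached to one document, optionally by kind."""
        found = []
        for source_id, relation, target_id in self.edges:
            if source_id == doc_id and relation == "has_element":
                node = self.nodes[target_id]
                if kind is None or node.kind == kind:
                    found.append(node)
        return found

    def documents_mentioning(self, keyword):
        """Return sorted ids of documents with an element containing keyword."""
        keyword = keyword.lower()
        hits = set()
        for source_id, relation, target_id in self.edges:
            if relation == "has_element" and keyword in self.nodes[target_id].text.lower():
                hits.add(source_id)
        return sorted(hits)


def build_example_graph():
    """Build a tiny graph from made-up documents (assumed data, not real results)."""
    graph = ToyElementGraph()
    graph.add_node("doc1", "document")
    graph.add_node("doc1:title", "title", "Layout-aware extraction from PDFs")
    graph.add_node("doc1:table1", "table", "Table 1: F1-score per element type")
    graph.add_node("doc2", "document")
    graph.add_node("doc2:title", "title", "Instance segmentation for figures")
    graph.add_node("doc2:fig1", "figure", "Figure 1: segmentation pipeline")
    for doc_id, element_id in [
        ("doc1", "doc1:title"),
        ("doc1", "doc1:table1"),
        ("doc2", "doc2:title"),
        ("doc2", "doc2:fig1"),
    ]:
        graph.add_edge(doc_id, "has_element", element_id)
    return graph


if __name__ == "__main__":
    graph = build_example_graph()
    print([n.node_id for n in graph.elements_of("doc1", kind="table")])
    print(graph.documents_mentioning("segmentation"))
```

Keeping edges as plain (source, relation, target) triples is one simple way to leave such a structure open to new element types and relations; a production system would more likely use a dedicated graph store.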