Abstract

Research Data Management (RDM) in the natural sciences establishes a structured foundation for organizing and preserving scientific data. Effective management of, and access to, diverse data sources are crucial for supporting domain scientists in future knowledge discovery. Scientific publications, a primary data source usually distributed in Portable Document Format (PDF), are a rich source of information comprising text, tables, figures, and metadata. These components convey information individually or in combination and open up promising research directions. To exploit this potential, the data components must first be acquired from the publications, and the relevant information must then be extracted from each of them. Furthermore, modeling the extracted information as a Heterogeneous Information Network (HIN) of publications enhances accessibility, collaboration, and information harvesting within the natural sciences domain. We developed a comprehensive framework, designed for user accessibility and broad applicability, that models diverse information from marine science publications as a HIN. The framework comprises three modules: Data Acquisition, Information Extraction, and Information Modeling. The Data Acquisition (DA) module extracts the various data components from the relevant publications and transforms them into machine-readable formats. The Information Extraction (IE) module includes two sub-modules: a Named Entity Recognition (NER) module, trained on annotated marine science text and capable of extracting eight different types of entities from plain text; and an information parser module responsible for extracting quantitative information from tabular data. The parser first detects and then extracts scientific measurements, relevant spatial information, and other available characteristics. Finally, the Information Modeling module presents the information extracted from the data components and links it together. The information is thereby structured into a HIN of scientific publications, which delivers diverse information to domain experts effectively and supports the Research Data Management initiative.
