A Data-Driven Approach for Integrating Multi-Source Scientific Categorization Labels

Huidong Wu, Jianping Li, Dengsheng Wu

doi:10.1016/j.procs.2024.08.190

Abstract

This study presents a novel framework for integrating scientific categorization labels from diverse sources, addressing the complexity that differing classification systems introduce into the scientific literature. Using a SciBERT-based encoder and classification prediction models, we propose a method to map and integrate labels across different databases. The approach encodes the titles of scientific documents into vector embeddings, trains a classification model on these embeddings, and then uses this model to categorize and harmonize labels from various sources. We apply this methodology to datasets from the 2017 National Natural Science Foundation of China (NSFC) and the Chinese Science Citation Database (CSCD), demonstrating the framework’s ability to improve label consistency and thematic coherence across multiple scientific disciplines. The results highlight a significant improvement in how readily scientific literature can be navigated and understood, and show the potential of this approach to support more efficient and integrated management of scientific knowledge.
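The three steps named in the abstract (encode titles, train a classifier, use it to harmonize labels from a second source) can be illustrated with a minimal sketch. It is not the authors' implementation. To stay runnable without model downloads, it assumes a normalized bag-of-words vector in place of the SciBERT encoder and a nearest-centroid rule in place of the classification model. The function names, titles, and labels are all hypothetical and are not drawn from the NSFC or CSCD data.

```python
import math
from collections import Counter, defaultdict


def encode_title(title):
    """Stand-in encoder (ASSUMPTION): L2-normalised bag-of-words as a sparse dict.

    The paper uses a SciBERT-based encoder; this only mimics "title -> vector".
    """
    counts = Counter(title.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def train_centroids(titles, labels):
    """Stand-in classifier (ASSUMPTION): one mean embedding per label."""
    sums = defaultdict(Counter)
    sizes = Counter(labels)
    for title, label in zip(titles, labels):
        for tok, weight in encode_title(title).items():
            sums[label][tok] += weight
    return {
        label: {tok: w / sizes[label] for tok, w in vec.items()}
        for label, vec in sums.items()
    }


def predict(centroids, title):
    """Return the label whose centroid has the largest dot product with the title."""
    vec = encode_title(title)

    def score(label):
        centroid = centroids[label]
        return sum(w * centroid.get(tok, 0.0) for tok, w in vec.items())

    return max(centroids, key=score)


def map_labels(centroids, titles, source_labels):
    """Map each label of the second source to its majority predicted label."""
    votes = defaultdict(Counter)
    for title, source_label in zip(titles, source_labels):
        votes[source_label][predict(centroids, title)] += 1
    return {src: c.most_common(1)[0][0] for src, c in votes.items()}


# Hypothetical toy data: source A supplies the target label system.
titles_a = [
    "deep learning for image recognition",
    "neural network training methods",
    "protein folding in yeast cells",
    "gene expression in plant cells",
]
labels_a = [
    "Information Sciences",
    "Information Sciences",
    "Life Sciences",
    "Life Sciences",
]

# Source B uses a different label system that should be aligned with A.
titles_b = [
    "image recognition with neural network",
    "deep learning training",
    "gene expression in yeast",
    "protein expression in plant cells",
]
labels_b = ["Computer Science", "Computer Science", "Biology", "Biology"]

centroids = train_centroids(titles_a, labels_a)
mapping = map_labels(centroids, titles_b, labels_b)
print(mapping)
```

In this sketch each label of the second source is assigned to the label of the first source that the classifier predicts most often for its documents. Majority voting is one simple aggregation choice, assumed here for illustration.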

Full Text