Implementing the National Science and Technology Innovation Strategy places exponentially growing demands on the knowledge services of literature information institutions. The subject thesaurus is the most important knowledge organization tool for information retrieval and can be widely used for semantic citation, organization, and retrieval of literature resources. This study aims to develop an innovative algorithm for constructing a subject thesaurus from massive literature resource data by mining academic neologisms and the semantic relationships between those neologisms and the subject system. We first collect a literature corpus and carry out the corresponding data pre-processing. We then use the FastText model to mine academic neologisms and construct an automatic categorization model for them based on the BERT and TextCNN algorithms. The proposed algorithm is validated on 8.1 million multi-source, heterogeneous literature records in the field of marine disciplines. The results show that the algorithm can effectively replace 90% of the manual annotation volume, mine a large number of high-quality marine neologisms, and successfully build a marine science knowledge base with a pass rate of 82.6% in expert review, indicating high accuracy and some prospects for engineering application.
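The neologism-mining step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes candidate terms have already been extracted and counted, and it swaps trained FastText vectors for a plain Jaccard overlap of character n-grams, which only mimics the subword idea behind FastText. The function names, the `min_count` threshold, and the toy data are all hypothetical.

```python
from collections import Counter


def char_ngrams(term, n=3):
    """Character n-grams with boundary markers, in the spirit of FastText subwords."""
    padded = f"<{term.lower()}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets (a stand-in for cosine similarity of embeddings)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mine_neologisms(candidate_counts, thesaurus, min_count=2):
    """Map each frequent term absent from the thesaurus to its closest known term."""
    known = sorted({t.lower() for t in thesaurus})
    result = {}
    for term, count in candidate_counts.items():
        # Skip rare candidates and terms the thesaurus already contains.
        if count < min_count or term.lower() in known:
            continue
        grams = char_ngrams(term)
        result[term] = max(known, key=lambda k: jaccard(grams, char_ngrams(k)))
    return result


# Hypothetical toy data: an existing thesaurus and candidate-term frequencies.
thesaurus = ["ocean acidification", "coral reef", "tidal energy"]
counts = Counter({"coral bleaching": 5, "coral reef": 9,
                  "tidal turbine": 3, "rare term": 1})
neologisms = mine_neologisms(counts, thesaurus)
```

In this toy run, `coral reef` is dropped because it is already known and `rare term` because it falls below the frequency threshold; the remaining candidates are attached to their nearest existing entry. In a pipeline like the one described, that attachment would instead come from a trained classifier such as the BERT and TextCNN model.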