Abstract

This work studies the applicability of modern machine learning methods to the automatic classification of scientific articles and abstracts. To this end, artificial neural networks, random forest, logistic regression, and support vector machines are examined, taking into account a characteristic feature of scientific texts: a large number of terms specific to particular categories. The stages of data collection and text feature extraction are considered separately. The results are used in developing a decision support system that assigns scientific texts to a department code or abstract journal of the All-Russian Institute of Scientific and Technical Information of the Russian Academy of Sciences.

Highlights

  • Automatic text classification is increasingly in demand because of the growing amount of textual information stored on the Internet

  • To meet this challenge, machine learning algorithms (such as supervised learning algorithms) are applied

  • This study aims to develop a model that can determine the probability that a text belongs to a category of a given rubricator, i.e. to work in a Decision Support System (DSS) mode


Summary

Introduction

To meet the challenge of automatic text classification, machine learning algorithms (such as supervised learning algorithms) are applied. To be trained, they require a set of labeled data, i.e. texts that already carry a class label. The work is carried out as part of the development of a text analysis system for the All-Russian Institute of Scientific and Technical Information of the Russian Academy of Sciences (VINITI RAS) (Viniti.ru, 2019). Documents pass through thematic departments, where specialists assign them topic codes in various classification systems. The number of codes of abstract journals and of the State Rubricator of Scientific and Technical Information (SRSTI) reaches several hundred. The DSS is intended to narrow the set of candidate topics for a text by providing the specialist with an estimated probability for each rubric.
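The DSS mode described here, in which the specialist sees a short list of rubrics with estimated probabilities, can be illustrated with a minimal sketch. It is not the system built in this work. It assumes that some trained classifier has already produced one raw score per rubric, converts the scores to probabilities with a softmax (an assumed choice, not one stated in the source), and keeps the `k` most probable rubrics. The function names `rubric_probabilities` and `shortlist` and the rubric codes are hypothetical placeholders.

```python
import math


def rubric_probabilities(scores):
    """Convert raw per-rubric scores into probabilities with a softmax.

    `scores` maps a rubric code to a real-valued score from any classifier.
    The softmax is an assumption of this sketch, not the paper's method.
    """
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {code: math.exp(s - peak) for code, s in scores.items()}
    total = sum(exps.values())
    return {code: e / total for code, e in exps.items()}


def shortlist(scores, k=3):
    """Return the k most probable rubrics as (code, probability), best first."""
    probs = rubric_probabilities(scores)
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical scores for one text; the codes are placeholders, not real SRSTI codes.
example_scores = {"rubric_A": 2.1, "rubric_B": 0.3, "rubric_C": 1.7, "rubric_D": -0.5}

top = shortlist(example_scores, k=2)
for code, probability in top:
    print(f"{code}: {probability:.2f}")
```

With several hundred possible codes, showing only the few highest-probability rubrics is what reduces the specialist's choice. The final decision still rests with the specialist.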

