Abstract
This paper proposes an approach to automating the annotation (abstract writing) of scientific materials, specifically Russian-language scientific articles, and implements it in practice using machine learning technologies and the fine-tuning of large language models. The relevance of correct and rational abstract compilation is indicated, and the problem of balancing the time-consuming annotation process against compliance with the key requirements for an abstract is highlighted. The fundamentals of annotation set out in the family of standards on information, librarianship, and publishing are analyzed, and a classification of abstracts together with the requirements for their content and functionality is given. The essence and content of the annotation process and the typical structure of the research object are presented schematically. The integration of digital technologies into the annotation process is analyzed, with particular attention to the advantages of introducing machine learning and artificial intelligence. The digital toolkit used for text generation in natural language processing applications is briefly described, and its shortcomings with respect to the problem posed in this article are noted. The research part substantiates the choice of the machine learning model used to solve the conditional text generation problem: existing pre-trained large language models are analyzed and, given the problem statement and the available computing resources, the ruT5-base model is selected. The dataset is described; it comprises scientific articles from journals included in the list of peer-reviewed scientific publications in which the main scientific results of dissertations for the candidate and doctor of science degrees must be published. The data labeling technique, based on the tokenizer of the pre-trained large language model, is characterized, and the numerical characteristics of the dataset distributions and the parameters of the training pipeline are presented graphically and in tables. The model is evaluated with the ROUGE quality metric, and the results are additionally assessed by experts using grammar and logic as the basic criteria. The quality of the automatically generated abstracts is comparable to real texts and meets the requirements of informativeness, structure, and compactness. The article may be of interest to scientists and researchers seeking to optimize their scientific activity by integrating digitalization tools into the article-writing process, as well as to specialists involved in training large language models.
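The abstract refers to fine-tuning the pre-trained ruT5-base model for conditional text generation and evaluating the output with the ROUGE metric. Below is a minimal illustrative sketch of how such a generation-and-evaluation step might look with the Hugging Face `transformers` and `evaluate` libraries; the checkpoint identifier `ai-forever/ruT5-base`, the placeholder texts, and all generation parameters are assumptions for demonstration and do not reproduce the authors' actual training pipeline.

```python
# Illustrative sketch only: generating an abstract for a Russian-language
# article with a ruT5-base checkpoint and scoring it with ROUGE.
# In the article's pipeline the model would first be fine-tuned on pairs
# (article body, author-written abstract); this snippet shows only the
# inference and evaluation step.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import evaluate

MODEL_ID = "ai-forever/ruT5-base"  # assumed Hugging Face Hub checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# Placeholder inputs: the full article text and the author-written abstract.
article_body = "Текст научной статьи ..."
reference_abstract = "Авторская аннотация ..."

# Tokenize the article; long inputs are truncated to a fixed length.
inputs = tokenizer(article_body, return_tensors="pt",
                   max_length=512, truncation=True)

# Generate a candidate abstract with beam search.
summary_ids = model.generate(**inputs, max_length=256,
                             num_beams=4, early_stopping=True)
candidate = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Compare the generated abstract with the reference abstract via ROUGE.
rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=[candidate],
                       references=[reference_abstract])
print(scores)
```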