Abstract
This article discusses problems that arise in the automatic processing of scientific texts and presents a combined method for aspect-oriented analysis of scientific texts in the fundamental disciplines. The method draws on both domain knowledge and statistical text processing. Thematic encyclopedias are proposed as training data: they are not only a source of professional scientific terminology but also an information resource from which knowledge about the subject area can be extracted. The work proposes a structure for templates that extract information from the partially structured text of an encyclopedia, describes the structure of the extracted sets of professional terms, and proposes an algorithm for forming semantic relationships between specialized terms. Knowledge extraction is demonstrated on four encyclopedias: mathematical, physical, chemical, and medical. The general principles by which domain scientific terminology is formed are highlighted, and statistics on the terminological composition of each examined area are given. From the encyclopedia texts, basic semantic graphs of the corresponding scientific fields are constructed, with relations defined between the professional terms. These basic graphs accumulate knowledge about a scientific field and are intended for the subsequent thematic analysis of unstructured texts of scientific articles. The implemented algorithm for extracting the semantics of a given scientific text relies both on amplifying the weights of nodes (terms of the applied domain) and on correcting the semantic relations between graph nodes according to the processed text. Results of experiments on automatically constructing an article's keyword list are reported.
The results were compared with the list of keywords specified by the author of the article. The relevance of correctly extracted terms is determined mainly by semantic links in the basic domain graph and depends significantly less on the number of keywords in the original article, which demonstrates the advantage of the proposed combined method over simple frequency analysis. A sample analysis of articles on mathematics showed good accuracy in extracting key terms relative to the author-specified keyword list.
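The combination of node-weight amplification and relation correction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `extract_keywords`, the parameters `alpha` and `boost`, the scoring formula, and the toy graph are all assumptions made for illustration, and term matching is reduced to exact whole-word lookup.

```python
import re
from collections import Counter
from itertools import combinations


def extract_keywords(text, base_graph, top_k=3, alpha=0.5, boost=1.0):
    """Rank domain terms for one text using a base semantic graph.

    base_graph maps term -> {related_term: relation_weight}.
    The scoring formula is an illustrative assumption, not the paper's.
    """
    sentences = [s.lower() for s in re.split(r"[.!?]+", text) if s.strip()]
    terms = set(base_graph) | {n for nbrs in base_graph.values() for n in nbrs}

    # Undirected copy of the relations, so the base graph is not mutated.
    edges = {}
    for term, nbrs in base_graph.items():
        for nbr, weight in nbrs.items():
            edges[frozenset((term, nbr))] = weight

    freq = Counter()
    for sent in sentences:
        present = []
        for t in sorted(terms):
            hits = len(re.findall(r"\b" + re.escape(t) + r"\b", sent))
            if hits:
                # Step 1: amplify the node weight by occurrences in the text.
                freq[t] += hits
                present.append(t)
        # Step 2: correct relations by strengthening (or creating) edges
        # between terms that co-occur in the same sentence.
        for a, b in combinations(present, 2):
            key = frozenset((a, b))
            edges[key] = edges.get(key, 0.0) + boost

    # Step 3: score = own frequency + alpha * support from related terms.
    score = {}
    for t in terms:
        support = 0.0
        for key, weight in edges.items():
            if t in key and len(key) == 2:
                (other,) = key - {t}
                support += weight * freq[other]
        score[t] = freq[t] + alpha * support

    ranked = sorted((t for t in terms if score[t] > 0),
                    key=lambda t: (-score[t], t))
    return ranked[:top_k]


# Hypothetical toy graph and text, for illustration only.
toy_graph = {
    "group": {"homomorphism": 1.0, "ring": 0.5},
    "ring": {"ideal": 1.0},
    "topology": {"compactness": 1.0},
}
toy_text = ("Every ring has an ideal. A homomorphism between rings preserves "
            "each ideal. The group of units of a ring is studied.")
print(extract_keywords(toy_text, toy_graph))
```

In this sketch, "ring" and "ideal" occur equally often, but "ring" ranks higher because its relations in the graph lend it more support. That mirrors the claim that relevance is driven mainly by semantic links rather than by raw counts.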