Abstract

The CS was developed in the Python programming language to generate a semantic template of a group of documents by the method of latent semantic analysis (LSA). The system comprises eight software modules, each of which performs one stage of the LSA. Two of the modules are unique: the control module for the word-document frequency matrix and the module that measures the semantic distance between the template documents. The CS is adjusted to the content and structure of the document templates by changing the set of modules. The research shows that normalizing the frequency matrix enhances the resolution of the semantic template generated by the LSA. It is shown that removing individual words improves the resolution of the generated semantic template without affecting its semantic content. Evaluating the semantic proximity of documents by the cosine of the difference of angles between the vector of a group of basic words and the vectors of documents further increases the resolution of the generated semantic template. To ensure the continuity of the LSA, a module was introduced in the CS that checks whether the number of words in the frequency matrix is greater than or equal to the number of documents. If this condition is not met, the module restarts the LSA process with a new set of words and documents after removing the inappropriate document and its related words.
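The steps named in the abstract (normalizing the frequency matrix, checking that the number of words is not smaller than the number of documents, and running the LSA) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the unit-length column normalization, and the rule for choosing which document to drop (the one with the fewest word occurrences) are all assumptions made for illustration, and NumPy's singular value decomposition stands in for the LSA stage.

```python
import numpy as np


def normalize_columns(freq):
    """Scale each document column to unit Euclidean length.

    Assumption: the abstract does not say which normalization is used.
    """
    freq = np.asarray(freq, dtype=float)
    norms = np.linalg.norm(freq, axis=0)
    norms[norms == 0] = 1.0  # keep empty documents as all-zero columns
    return freq / norms


def prune_until_valid(freq):
    """Drop documents until the word count is >= the document count.

    Rows are words, columns are documents. Returns the pruned matrix
    and the original indices of the documents that were kept.
    """
    freq = np.asarray(freq, dtype=float)
    kept = list(range(freq.shape[1]))
    while True:
        # Remove words that no longer occur in any remaining document.
        freq = freq[freq.sum(axis=1) > 0]
        n_words, n_docs = freq.shape
        if n_docs == 0 or n_words >= n_docs:
            return freq, kept
        # Hypothetical rule: drop the document with the fewest occurrences.
        worst = int(np.argmin(freq.sum(axis=0)))
        freq = np.delete(freq, worst, axis=1)
        kept.pop(worst)


def lsa(freq, rank=2):
    """Truncated SVD: word coordinates, singular values, document coordinates."""
    u, s, vt = np.linalg.svd(freq, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank, :]


if __name__ == "__main__":
    # Toy matrix with 2 words and 3 documents, so the check fails at first.
    counts = np.array([[1, 0, 2],
                       [0, 1, 1]])
    pruned, kept_docs = prune_until_valid(counts)
    words, sigma, docs = lsa(normalize_columns(pruned), rank=2)
    print("documents kept:", kept_docs)
    print("document coordinates:\n", docs)
```

The loop re-checks the matrix after every removal because dropping a document can leave some words with no occurrences, which changes the word count again.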

Highlights

  • Semantic text analysis is a key problem both in the theory of artificial intelligence systems related to natural language processing (NLP) and in computational linguistics

  • According to the publications reviewed, the semantic template increases the efficiency of search engines, which is a priority at the current stage of information technology development

  • The study examines how the latent semantic analysis (LSA) results are affected by frequency matrix normalization, by the use of words found only once in all documents, and by the use of the cosine of the difference of angles between the vector of the group of basic words and the vectors of documents to account for the semantic distance
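The angle-based distance measure mentioned in the last highlight can be sketched as follows. This is one possible reading, not the authors' definition: it assumes that documents and words are represented by two-dimensional LSA coordinates, that the vector of the group of basic words is the sum of the word vectors, and that angles are measured from the first axis. The function name `angle_cosines` is hypothetical.

```python
import numpy as np


def angle_cosines(doc_coords, word_coords):
    """Cosine of the angle difference between each document and the word group.

    doc_coords:  array of shape (n_docs, 2), 2-D LSA coordinates of documents.
    word_coords: array of shape (n_words, 2), 2-D LSA coordinates of basic words.
    """
    doc_coords = np.asarray(doc_coords, dtype=float)
    word_coords = np.asarray(word_coords, dtype=float)
    # Assumption: the group vector is the sum of the basic-word vectors.
    reference = word_coords.sum(axis=0)
    ref_angle = np.arctan2(reference[1], reference[0])
    doc_angles = np.arctan2(doc_coords[:, 1], doc_coords[:, 0])
    return np.cos(doc_angles - ref_angle)


if __name__ == "__main__":
    words = [[1.0, 0.0], [1.0, 0.0]]
    docs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    print(angle_cosines(docs, words))
```

In two dimensions this value coincides with the ordinary cosine similarity between a document vector and the group vector, so values near 1 indicate documents close to the group of basic words and values near -1 indicate opposite directions.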


Summary

Introduction

Semantic text analysis is a key problem both in the theory of artificial intelligence systems related to natural language processing (NLP) and in computational linguistics. Although it is in demand in almost all spheres of human life, semantic analysis remains one of the most complex mathematical problems. An important task is to develop software for the automatic processing of speech and text data, in order to improve information retrieval systems with advanced features that use natural language queries. Research into the method of latent semantic analysis (LSA) makes it possible to automate a number of text data processing cycles, including document indexing by thematic groups, plagiarism detection, and the building of natural language query databases. A software implementation, especially one that can increase the resolution of the method, is therefore an urgent task for scientists and information technology developers.

Literature review and problem statement
Research goals and objectives
Means and methods of research of LSA of text documents
Analysis of LSA results without exception of individual words
Analysis of the case of the degenerate frequency word stem-document matrix
Conclusions
