Abstract
The computer system (CS) was developed in Python to generate a semantic template of a group of documents by the latent semantic analysis (LSA) method. The system contains eight software modules, each of which performs one stage of the LSA. The module that controls the word-document frequency matrix and the module that measures the semantic distance between the template documents are unique to this system. The CS is adjusted to the content and structure of the document templates by changing the set of modules. The research shows that normalization of the frequency matrix enhances the resolution of the semantic template generated by the LSA. It is also shown that removing words that occur only once improves the resolution of the generated semantic template without affecting its semantic content. Using the cosine of the angle between the vector of a group of basic words and the document vectors as the measure of semantic proximity further increases the resolution of the generated semantic template. To keep the LSA process running, a module was added that checks the frequency matrix to confirm that the number of words exceeds (or equals) the number of documents. If this condition is not met, the module removes the inappropriate document and its associated words and restarts the LSA process with the new set of words and documents.
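Below is a minimal sketch of the pipeline described above, assuming a NumPy-based implementation; the function names, the TF-IDF-style normalization variant, and the dimensionality k of the truncated SVD are illustrative assumptions rather than the authors' actual module interfaces.

import numpy as np

def build_frequency_matrix(documents, vocabulary):
    # Word-document frequency matrix: rows are words, columns are documents.
    matrix = np.zeros((len(vocabulary), len(documents)))
    for j, doc in enumerate(documents):
        tokens = doc.lower().split()
        for i, word in enumerate(vocabulary):
            matrix[i, j] = tokens.count(word)
    return matrix

def words_not_fewer_than_documents(matrix):
    # Continuity check from the abstract: the number of words (rows) must be
    # greater than or equal to the number of documents (columns).
    n_words, n_docs = matrix.shape
    return n_words >= n_docs

def normalize(matrix):
    # One possible normalization scheme (TF-IDF style); the abstract only
    # states that normalization is applied, not which variant is used.
    tf = matrix / np.maximum(matrix.sum(axis=0, keepdims=True), 1)
    df = np.count_nonzero(matrix, axis=1)[:, None]
    idf = np.log(matrix.shape[1] / np.maximum(df, 1)) + 1.0
    return tf * idf

def semantic_template(matrix, k=2):
    # Truncated SVD: keep the k largest singular values to obtain word and
    # document vectors in the reduced semantic space.
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    word_vectors = u[:, :k] * s[:k]
    doc_vectors = vt[:k, :].T * s[:k]
    return word_vectors, doc_vectors

The words_not_fewer_than_documents check corresponds to the continuity condition described in the abstract: if it fails, a document and its associated words would be removed and the pipeline rerun on the reduced matrix.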
Highlights
Semantic text analysis is one of the key problems both in the theory of artificial intelligence systems related to natural language processing (NLP) and in computational linguistics.
According to the cited publications, a semantic template increases the efficiency of search engines, which is a priority at the current stage of information technology development.
– to examine how the LSA results are affected by normalization of the frequency matrix, by words that occur in all documents only once, and by using the cosine of the angle between the vector of the group of basic words and the document vectors to measure semantic distance (a code sketch of this measure follows the list);
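A hedged sketch of how this cosine measure could be computed in the reduced LSA space is given below, reusing the word_vectors and doc_vectors produced by the earlier sketch; aggregating the basic-word vectors by a simple sum is an assumption, since the source does not specify how the group vector is formed.

import numpy as np

def cosine(a, b):
    # Cosine of the angle between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def document_scores(word_vectors, doc_vectors, vocabulary, basic_words):
    # Combine the vectors of the chosen group of basic words (assumed here to
    # be a simple sum) and score every document by the cosine between its
    # vector and that group vector.
    idx = [vocabulary.index(w) for w in basic_words if w in vocabulary]
    group_vector = word_vectors[idx].sum(axis=0)
    return [cosine(group_vector, d) for d in doc_vectors]

Documents with higher scores are semantically closer to the basic-word group; according to the abstract, ranking documents by this measure increases the resolution of the generated semantic template.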
Summary
Semantic text analysis is one of the key problems both in the theory of artificial intelligence systems related to natural language processing (NLP) and in computational linguistics. Although it is in demand in almost all spheres of human life, semantic analysis remains one of the most complex mathematical problems. An important task is developing software for automatic processing of speech and text data to improve information retrieval systems with advanced features that use natural language queries. Research into the latent semantic analysis (LSA) method makes it possible to automate a number of text data processing cycles, including indexing documents by thematic group, detecting plagiarism, and building databases of natural language queries. Software implementation, especially one that can increase the resolution of the method, is therefore an extremely urgent task for researchers and information technology developers.