Development of a computer system for generating semantic template of a group of documents by using latent semantic analysis

Yuriy Taranenko, Maryna Kabanova

doi:10.15587/1729-4061.2016.73551

Abstract

A computer system (CS) was developed in the Python programming language to generate a semantic template of a group of documents by the latent semantic analysis (LSA) method. The system contains eight software modules, each of which performs one stage of the LSA. Two modules are unique: the module that controls the word-document frequency matrix and the module that measures the semantic distance between the template documents. The CS is adjusted to the contents and structure of the document templates by changing the set of modules. According to the research, normalization of the frequency matrix enhances the resolution of the semantic template generated by the LSA. It is proved that the removal of individual words improves the resolution of the generated semantic template and does not affect the semantic content. Evaluating the semantic proximity of documents by the cosine of the difference of angles between the vector of a group of basic words and the vectors of documents increases the resolution of the generated semantic template. To ensure the continuity of the LSA, a module was introduced in the CS that checks whether the number of words in the frequency matrix is greater than or equal to the number of documents. If it is not, the module restarts the LSA process with a new set of words and documents after removing the inappropriate document and its related words.
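The pre-processing stages named in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' modules: the function names `build_frequency_matrix` and `has_enough_words` are hypothetical, tokenization is a plain whitespace split, and "individual words" is read here as words that occur only once in the whole collection. The singular value decomposition stage of the LSA is omitted.

```python
from collections import Counter


def build_frequency_matrix(docs, min_total_count=2):
    """Return (words, matrix), where matrix[i][j] counts word i in document j.

    Assumption: words seen fewer than min_total_count times across the
    whole collection are treated as "individual words" and dropped.
    """
    tokenized = [doc.lower().split() for doc in docs]
    totals = Counter(word for tokens in tokenized for word in tokens)
    words = sorted(w for w, n in totals.items() if n >= min_total_count)
    per_doc = [Counter(tokens) for tokens in tokenized]
    # Counter returns 0 for a word that is absent from a document.
    matrix = [[counts[w] for counts in per_doc] for w in words]
    return words, matrix


def has_enough_words(matrix, n_docs):
    """Compliance check: the number of words must not be below the number of documents."""
    return len(matrix) >= n_docs


# Hypothetical sample documents, used only to exercise the functions.
docs = [
    "semantic analysis of text documents",
    "latent semantic analysis of documents",
    "text search engines",
]
words, matrix = build_frequency_matrix(docs)
print(words)
print(has_enough_words(matrix, len(docs)))
```

When `has_enough_words` returns `False`, the abstract says the process restarts after an inappropriate document and its related words are removed. How that document is chosen is not stated in the abstract, so the sketch stops at the check itself.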

Highlights

Semantic text analysis is one of the key problems of both the theory of artificial intelligence systems related to natural language processing (NLP) and computational linguistics.
According to the publications reviewed, the semantic template increases the efficiency of search engines, which is a priority at the present stage of information technology development.
One research objective is to examine how the latent semantic analysis (LSA) results are affected by frequency matrix normalization, by the use of words found only once across all documents, and by the use of the cosine of the difference of angles between the vector of the group of basic words and the vectors of documents as a measure of semantic distance.
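The two measures mentioned in the highlights can be illustrated as follows. This is a sketch under assumptions: the source does not say which normalization is used, so scaling each document column to unit Euclidean length is assumed here, and the vectors are taken to be two-dimensional, as they would be if only two LSA dimensions were retained. All function names and sample vectors are hypothetical. In two dimensions the cosine of the difference of the polar angles of two vectors equals their ordinary cosine similarity, which the last lines check numerically.

```python
import math


def normalize_columns(matrix):
    """Scale each document column to unit Euclidean length (an assumed scheme)."""
    n_docs = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_docs)]
    return [
        [row[j] / norms[j] if norms[j] else 0.0 for j in range(n_docs)]
        for row in matrix
    ]


def cosine_of_angle_difference(u, v):
    """cos(theta_u - theta_v) for two 2-D vectors, computed from their polar angles."""
    return math.cos(math.atan2(u[1], u[0]) - math.atan2(v[1], v[0]))


def cosine_similarity(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-D coordinates of the basic-word group and of one document.
basic_words_vec = (1.0, 0.0)
doc_vec = (1.0, 1.0)

by_angles = cosine_of_angle_difference(basic_words_vec, doc_vec)
by_dot = cosine_similarity(basic_words_vec, doc_vec)
print(abs(by_angles - by_dot) < 1e-12)
```

A value close to 1 means the document points in nearly the same direction as the group of basic words, and a value close to 0 means the two are nearly orthogonal in the reduced space.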

Summary

Introduction

Semantic text analysis is one of the key problems of both the theory of artificial intelligence systems related to natural language processing (NLP) and computational linguistics. Although it is in demand in almost all spheres of human life, semantic analysis is one of the most complex mathematical problems. An important task is to develop software for automatic processing of speech and text data in order to improve information retrieval systems with advanced features that use natural language queries. Research into the method of latent semantic analysis (LSA) makes it possible to automate a number of text data processing cycles, including document indexing by thematic groups, plagiarism detection, and the formation of databases of natural language queries. Software implementation, especially one that can increase the resolution of the method, is therefore an extremely urgent task for scientists and information technology developers.

Literature review and problem statement

Research goals and objectives

Means and methods of research of LSA of text documents

Analysis of LSA results without exception of individual words

Analysis of the case of the degenerate frequency word stem-document matrix

Conclusions


Journal: Eastern-European Journal of Enterprise Technologies
Publication Date: Aug 30, 2016
License type: CC BY
