Abstract

This publication reviews the problem of representing textual information for subsequent cluster analysis in the context of processing and managing high-dimensional information. Current requirements for analytical, search, and recommendation information systems show that the information technology market still lacks a holistic solution that delivers both sufficient speed and sufficient quality of results. Finding such a solution requires an objective analysis of existing approaches to representing textual information in vector space, in order to form a complete picture of their advantages and disadvantages and to derive criteria for designing a new approach free of the identified weaknesses. The work is analytical and gives an overview of the current state of the problem, and of how thoroughly it has been studied, within a limited subject area. Clustering of text data is the automatic formation of subsets whose elements are documents from an unstructured sample of fixed dimension under study. This process can be classified as unsupervised learning, which implies the absence of an expert who manually assigns class labels to the original sample of documents. However, cluster analysis of text data is impossible without preprocessing: the input data must first be standardized and reduced to a single format. For this stage of cluster analysis, the publication discusses methods for preprocessing text data. The novelty of the publication lies in establishing a theoretical basis for the main methods of text data vectorization by systematizing the proposed assumptions and testing them objectively in a series of experimental studies. The work differs from previously published studies in its systematization and analysis of modern solutions and in its hypotheses about the relevance and effectiveness of the authors' own hybridized approach to text data vectorization.
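
To make the preprocessing and vectorization stage concrete, the following is a minimal sketch of one common vector-space representation, TF-IDF, written with the Python standard library only. It is not the hybridized approach proposed by the authors, and the abstract does not say which methods are analyzed. The function names (`preprocess`, `tfidf_vectors`, `cosine`), the tokenization rule, and the sample documents are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Reduce raw text to a single format: lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using a plain TF-IDF weighting."""
    tokenized = [preprocess(doc) for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical sample documents, used only to exercise the sketch.
docs = [
    "Clustering groups similar documents.",
    "Document clustering is unsupervised learning.",
    "Vector space models represent text as vectors.",
]
vocab, vecs = tfidf_vectors(docs)
```

Each document becomes a vector with one component per vocabulary term, so a clustering algorithm can compare documents by distance or cosine similarity. With this unsmoothed weighting, a term that occurs in every document receives weight zero. The sketch also omits stemming and stop-word removal, so "document" and "documents" are treated as different terms.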
