COMPARATIVE ANALYSIS OF METHODS FOR VECTORIZING HIGH-DIMENSIONAL TEXT DATA

Philip Bulyga, Viktor Kureichik

doi:10.18522/2311-3103-2023-2-212-226

Abstract

This publication reviews the problem of representing textual information for subsequent cluster analysis, in the context of processing and managing high-dimensional information. Modern requirements for analytical, search, and recommendation information systems show that the current information technology market lacks a holistic solution that provides a sufficient level of speed and quality of results. Finding such a solution requires an objective analysis of existing approaches to representing textual information in vector space, in order to form a holistic view of their advantages and disadvantages and to derive criteria for implementing a new approach free of the identified weaknesses. The work is analytical and gives an idea of the current state and degree of elaboration of the problem within a limited subject area.

Clustering of text data is the automatic formation of subsets whose elements are documents from an unstructured sample of fixed dimension under study. This process can be classified as unsupervised learning, which implies the absence of an expert who manually assigns class labels to the original sample of documents. However, cluster analysis of text data is impossible without preprocessing: the input data must first be standardized and reduced to a single format and form. For this stage of cluster analysis, the publication discusses methods for preprocessing text data.

The novelty of the publication lies in forming the theoretical basis of the main methods of text data vectorization by systematizing the proposed assumptions and substantiating them objectively through a series of experimental studies. The main difference between this work and previously published studies is the systematization and analysis of modern solutions, together with hypotheses about the relevance and effectiveness of the authors' own hybridized approach to text data vectorization.
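
The abstract does not name the specific vectorization methods it compares, so the following is only an illustrative sketch of the general idea: reducing raw documents to a single format (lowercased word tokens) and representing them in a shared vector space before any clustering step. TF-IDF weighting and cosine similarity are assumed here as a common baseline, not as the authors' method, and the function names (`tokenize`, `tfidf_vectors`, `cosine`) and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Map each document to a dense TF-IDF vector over a shared vocabulary."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical sample: two documents about clustering, one about vector models.
docs = [
    "Clustering groups similar documents.",
    "Documents are grouped by clustering algorithms.",
    "Vector space models represent text as numbers.",
]
vocabulary, vectors = tfidf_vectors(docs)
```

Once every document has the same vector length, a clustering algorithm can compare documents numerically; in this sample, `cosine(vectors[0], vectors[1])` is positive because the two documents share terms, while `cosine(vectors[0], vectors[2])` is zero because they share none. The dense vectors grow with vocabulary size, which is one face of the high-dimensionality problem the abstract refers to.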

More From: IZVESTIYA SFedU. ENGINEERING SCIENCES

