Abstract

It is known that in natural language processing tasks, representing texts as fixed-length vectors using word-embedding models makes sense only when the vectorized texts are short: the longer the texts being compared, the worse the approach works. This is because word-embedding models lose information when the vector representations of the words that make up a text are converted into a vector representation of the entire text, which usually has the same dimension as the vector of a single word. This paper proposes an alternative way of using pre-trained word-embedding models for text vectorization. The essence of the proposed method is to merge semantically similar elements of the dictionary of a given text corpus by clustering their embeddings, which yields a new dictionary, smaller than the original one, in which each element corresponds to one cluster. The original corpus is reformulated in terms of this new dictionary, after which the reformulated texts are vectorized using one of the dictionary-based approaches (TF-IDF in this work). The resulting vector representation of a text can be further enriched using vectors of the words of the original dictionary, obtained by reducing the dimension of their embeddings within each cluster. The paper describes a series of experiments to determine the optimal parameters of the method and compares the proposed approach with other text-vectorization methods on a text ranking problem: averaging word embeddings with and without TF-IDF weighting, as well as vectorization based on TF-IDF coefficients.
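The core pipeline can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the `embeddings` lookup, the toy `texts`, and the cluster count stand in for a real pre-trained model (e.g., word2vec or fastText), a real corpus, and a tuned parameter; this is not the paper's reference implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative inputs: a pre-trained embedding lookup and a tokenized corpus.
# In practice `embeddings` would come from word2vec/fastText/GloVe.
embeddings = {w: np.random.rand(300) for w in ["cat", "dog", "car", "truck"]}
texts = [["cat", "dog"], ["car", "truck", "dog"]]

# 1. Cluster the embeddings of the dictionary words; each cluster
#    becomes one element of the new, smaller dictionary.
vocab = sorted(embeddings)
X = np.stack([embeddings[w] for w in vocab])
n_clusters = 2  # illustrative; the paper tunes this experimentally
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
word2cluster = {w: f"c{label}" for w, label in zip(vocab, labels)}

# 2. Reformulate each text in terms of cluster identifiers.
reformulated = [" ".join(word2cluster[w] for w in t if w in word2cluster)
                for t in texts]

# 3. Vectorize the reformulated corpus with TF-IDF.
vectors = TfidfVectorizer().fit_transform(reformulated)
print(vectors.shape)  # (number of texts, size of the cluster dictionary)
```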

Highlights

  • It is known that in natural language processing tasks, the representation of texts by fixed-length vectors using word-embedding models makes sense only in cases where the vectorized texts are short

  • The longer the texts being compared, the worse the approach works. This situation is due to the fact that when using word-embedding models, information is lost when converting the vector representations of the words that make up the text into a vector representation of the entire text, which usually has the same dimension as the vector of a single word

  • A series of experiments to determine the optimal parameters of the method is described in the paper; the proposed approach is compared with other methods of text vectorization for the text ranking problem: averaging word embeddings with TF-IDF weighting and without weighting, as well as vectorization based on TF-IDF coefficients


Summary

Description of the approach

To overcome the limitations of the two text-vectorization approaches, the dictionary-based one and the one based on word embeddings, it is proposed to combine them. For each word w_j of the original dictionary V, an enriching vector ee_j is formed. Step 9 (for each text tc_i from Tc): each position (indexed by j) of the TF-IDF vector xc_i corresponds to the TF-IDF coefficient xc_ij of the n-gram wc_j from the dictionary Vc. For each j-th position of xc_i, an enriching vector exc_j is formed as follows. Based on the text t_i from T, enriching vectors are built for all n-grams of length N_min to N_max occurring in t_i by concatenating the enriching vectors of their constituent words (obtained at step 8.3). Each n-gram from t_i corresponds to some n-gram from Vc; the enriching vectors of the n-grams from t_i are averaged over the n-grams of Vc they correspond to. This yields enriching vectors for those positions of xc_i whose n-grams wc_j from Vc occur in tc_i. The result is a set Xce of enriched vector representations of the texts T; a sketch of this enrichment step follows.
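Continuing the sketch above, the enrichment step might look as follows for the unigram case (N_min = N_max = 1). Per-cluster PCA is one plausible reading of "reducing the dimension of their embeddings within each cluster"; the function name `enriched_vector` and the parameter `enrich_dim` are illustrative, not from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Build enriching vectors for the words of the original dictionary by
# reducing the dimension of their embeddings within each cluster
# (per-cluster PCA is an assumption; the paper leaves the method open here).
enrich_dim = 2  # illustrative
enrich = {}
for c in set(word2cluster.values()):
    members = [w for w in vocab if word2cluster[w] == c]
    M = np.stack([embeddings[w] for w in members])
    k = min(enrich_dim, len(members))
    reduced = PCA(n_components=k).fit_transform(M)
    reduced = np.pad(reduced, ((0, 0), (0, enrich_dim - k)))  # common length
    for w, v in zip(members, reduced):
        enrich[w] = v

def enriched_vector(tokens, tfidf_row, cluster_index):
    """Attach to each TF-IDF position j the average enriching vector of the
    words of the text that fall into cluster j; absent clusters stay zero."""
    ex = np.zeros((len(cluster_index), enrich_dim))
    for c, j in cluster_index.items():
        vs = [enrich[w] for w in tokens if word2cluster.get(w) == c]
        if vs:
            ex[j] = np.mean(vs, axis=0)
    return np.concatenate([tfidf_row, ex.ravel()])

vec = TfidfVectorizer()
tfidf = vec.fit_transform(reformulated).toarray()
enriched = [enriched_vector(t, row, vec.vocabulary_)
            for t, row in zip(texts, tfidf)]
```

Each enriched vector concatenates the text's TF-IDF coefficients over the cluster dictionary with a flattened block of per-cluster enriching vectors, so the enriched representations remain directly comparable (e.g., by cosine similarity) for the ranking task.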

Conditions of the experimental evaluation
Results