Character N-gram-Based Word Embeddings for Morphological Analysis of Texts

Tsolak Ghukasyan

doi:10.15514/ispras-2020-32(2)-1

Abstract

This paper presents modifications of the fastText word embedding model that rely solely on character n-grams, and applies them to morphological analysis of texts. fastText is a library for text classification and for learning vector representations. It computes the representation of each word as the sum of that word's own vector and the vectors of its character n-grams. Because fastText stores and uses a separate vector for the whole word, an out-of-vocabulary word has no such vector, which degrades the quality of the resulting word representation. Storing vectors for whole words also means that fastText models usually require a lot of memory for storage and processing. This is especially problematic for morphologically rich languages, given their large number of word forms. Unlike the original fastText model, the proposed modifications pretrain and use vectors only for the character n-grams of a word, which eliminates the reliance on word-level vectors and significantly reduces the number of parameters in the model. Two approaches are used to extract information from a word: internal character n-grams and suffixes. The proposed models are evaluated on morphological analysis and lemmatization of Russian, using the SynTagRus corpus, and achieve results comparable to the original fastText.
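The memory argument can be made concrete with a little arithmetic. The sketch below counts only the input embedding tables and ignores the output layer. All sizes are hypothetical and chosen for illustration; they are not the figures reported in the paper.

```python
def embedding_parameters(vocab_size, ngram_buckets, dim, keep_word_vectors):
    """Number of floats stored in the input embedding tables."""
    rows = ngram_buckets + (vocab_size if keep_word_vectors else 0)
    return rows * dim

# Hypothetical sizes (assumptions, not values from the paper).
VOCAB_SIZE = 1_000_000      # distinct word forms with their own vector
NGRAM_BUCKETS = 2_000_000   # rows available for character n-gram vectors
DIM = 100                   # embedding dimension

full = embedding_parameters(VOCAB_SIZE, NGRAM_BUCKETS, DIM, keep_word_vectors=True)
ngrams_only = embedding_parameters(VOCAB_SIZE, NGRAM_BUCKETS, DIM, keep_word_vectors=False)
saved_fraction = 1 - ngrams_only / full

print(full, ngrams_only, saved_fraction)
```

Under these assumed sizes, dropping the word-level table removes `VOCAB_SIZE * DIM` parameters. The saving grows with the number of word forms, which is why it matters most for morphologically rich languages.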

Highlights

The analyzer was trained on SynTagRus [19], a syntactically annotated corpus of Russian, in its Universal Dependencies v2.4 version.
Ghukasyan Ts. Character N-gram-Based Word Embeddings for Morphological Analysis of Texts

Summary

Introduction

Word vectors are widely and successfully used in many natural language processing tasks, but they have serious shortcomings when handling rare or out-of-vocabulary words, for which embeddings are either unavailable or unsatisfactory. In its representation-learning mode, fastText learns word representations using character n-grams, training a SkipGram or CBOW neural network on unlabeled text. The representation of each word is computed as the sum of the word's own vector and the vectors of its character n-grams. This gives fastText an advantage over other word embedding models: it can compute a representation for an out-of-vocabulary (OOV) word from its character n-grams. This work considers modifications of fastText that remove word-level vectors from the model and rely only on character n-grams for training and for generating representations.

Formally, the representation of a word $w$ is the sum of its word vector and the vectors of its character n-grams:

$$v_w = E^{w}_{w} + \sum_{n=n_{\min}}^{n_{\max}} \sum_{g \in G_n(w)} E^{g}_{g}$$

where $E^{w}$ and $E^{g}$ are the matrices of word vectors and character n-gram vectors, respectively; $G_n(w)$ is the set of character n-grams of length $n$ in $w$; $n$ is the n-gram length ($n_{\min}$ and $n_{\max}$ are hyperparameters); and $W \in \mathbb{R}^{|V| \times d}$ are the parameters of the output layer. The embedding matrices and the output-layer parameters are trained by backpropagation, using negative sampling and stochastic gradient descent.
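The composition of a word representation from a word vector and character n-gram vectors can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not fastText's implementation: the `<` and `>` boundary markers and the 3 to 6 length range follow common fastText usage, CRC32 is a stand-in for the library's own hash function, and the vectors are deterministic toy values instead of trained embeddings. All function and variable names are hypothetical.

```python
import zlib

DIM = 4          # toy embedding size (assumption)
BUCKETS = 1000   # assumed number of hashed n-gram rows

def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of '<word>' with n_min <= n <= n_max."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

def ngram_row(ngram):
    """Map an n-gram to a table row; CRC32 is only a stand-in hash."""
    return zlib.crc32(ngram.encode("utf-8")) % BUCKETS

def make_vector(seed):
    """Deterministic toy vector standing in for a trained embedding."""
    return [((seed * 31 + k * 17) % 7 - 3) / 10 for k in range(DIM)]

# Toy tables: a word-level table and a hashed n-gram table.
word_vectors = {"дом": make_vector(1), "дома": make_vector(2)}
ngram_vectors = [make_vector(r) for r in range(BUCKETS)]

def word_representation(word):
    """Sum of the word-level vector (if any) and the n-gram vectors."""
    # An out-of-vocabulary word has no word-level vector, so it starts from zero.
    vec = list(word_vectors.get(word, [0.0] * DIM))
    for gram in char_ngrams(word):
        row = ngram_vectors[ngram_row(gram)]
        vec = [a + b for a, b in zip(vec, row)]
    return vec

print(word_representation("дом"))     # in-vocabulary: word vector + n-grams
print(word_representation("домик"))   # OOV: n-gram vectors only
```

The OOV branch shows the asymmetry that motivates the paper: an unseen word is represented with one term of the sum missing, while a seen word uses all of them.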

Ngrams-only fastText
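As a rough illustration of an n-grams-only representation, the sketch below drops the word-level table entirely and sums subword vectors alone, with a pluggable extractor for the two kinds of subword units named in the abstract. The exact definitions of "internal character n-grams" and "suffixes" used in the paper are not given here, so both extractors are one plausible reading. The boundary markers, the length limits, the CRC32 stand-in hash, the toy vectors, and all names are assumptions.

```python
import zlib

DIM = 4          # toy embedding size (assumption)
BUCKETS = 1000   # assumed number of hashed subword rows

def internal_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the boundary-marked word (assumed definition)."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

def suffixes(word, max_len=5):
    """Word-final substrings of length 1..max_len, tagged with an end marker."""
    longest = min(max_len, len(word))
    return [word[-k:] + ">" for k in range(1, longest + 1)]

def make_vector(seed):
    """Deterministic toy vector standing in for a trained embedding."""
    return [((seed * 31 + k * 17) % 7 - 3) / 10 for k in range(DIM)]

# The only embedding table in the model: there is no word-level table.
subword_vectors = [make_vector(r) for r in range(BUCKETS)]

def subword_representation(word, extract):
    """Sum the vectors of the subword units produced by `extract`."""
    vec = [0.0] * DIM
    for unit in extract(word):
        row = subword_vectors[zlib.crc32(unit.encode("utf-8")) % BUCKETS]
        vec = [a + b for a, b in zip(vec, row)]
    return vec

print(suffixes("домами", max_len=3))
print(subword_representation("домами", internal_ngrams))
print(subword_representation("домами", suffixes))
```

Because every word goes through the same path, seen and unseen word forms are represented in the same way, and the model size no longer depends on the vocabulary size.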

Data

Results

Conclusion


Journal: Proceedings of the Institute for System Programming of the RAS
Publication Date: Jan 1, 2020
License type: cc-by

Similar Papers

Subwords-Only Alternatives to fastText for Morphologically Rich Languages
Tsolak Ghukasyan ... Karen Avetisyan
Programming and Computer Software | VOL. 47
01 Jan 2020

Adaptive GloVe and FastText Model for Hindi Word Embeddings
Vijay Gaikwad ... Yashodhara Haribhakta
05 Jan 2020

Recognition of Alzheimer’s Dementia From the Transcriptions of Spontaneous Speech Using fastText and CNN Models
Amit Meghanani ... Angarai Ganesan Ramakrishnan
Frontiers in Computer Science | VOL. 3
24 Mar 2021

Analysis of Google Play Store's Sentiment Review on Waqf Digital Platform Using Fasttext Embedding
Muhammad Ichwandar Akrianto ... Indriana Hidayah
16 Feb 2023
