Abstract

Various text models used for solving natural language processing problems are considered. Each model is evaluated via a document classification task, whose accuracy serves as an estimate of the model's comparative effectiveness. Of the two accuracy values obtained on the evaluation and training sets, the minimum is taken as the model's score. A multilayer perceptron with one hidden layer serves as the classifier: its input is a real-valued vector representing the document, and its output is a prediction of the document's class. Depending on the text model used, the input vector is determined either by the text's frequency characteristics or by the distributed vector representations (embeddings) of a pre-trained model's tokens. The results demonstrate the advantage of models based on the Transformer architecture over the other models in the study, such as word2vec, doc2vec, and fastText.
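The classifier described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dimensions, random weights, and function names are assumptions chosen for the example, standing in for a trained one-hidden-layer perceptron that maps a document vector to class probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a 300-dim document vector (e.g. an averaged
# word2vec embedding or a TF-IDF projection), 64 hidden units, 4 classes.
DIM, HIDDEN, CLASSES = 300, 64, 4

# Randomly initialized weights stand in for a trained perceptron.
W1 = rng.normal(0.0, 0.1, (DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.1, (HIDDEN, CLASSES))
b2 = np.zeros(CLASSES)

def predict_proba(doc_vec):
    """One-hidden-layer MLP: document vector -> class probabilities."""
    h = np.maximum(0.0, doc_vec @ W1 + b1)   # ReLU hidden layer
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())        # numerically stable softmax
    return e / e.sum()

doc = rng.normal(size=DIM)                   # stand-in document embedding
probs = predict_proba(doc)
pred_class = int(probs.argmax())             # forecast of the document class
```

The model-selection rule from the abstract would then amount to scoring each text model by `min(train_accuracy, eval_accuracy)` of the classifier built on its vectors.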
