Plagiarism Detection in Armenian Texts Using Intrinsic Stylometric Analysis

Yeva Maksimovna Yeshilbashian, Tsolak Gukasovitch Ghukasyan, Ariana Armenovna Asatryan

doi:10.15514/ispras-2021-33(1)-14


Open Access

https://doi.org/10.15514/ispras-2021-33(1)-14


Abstract

In this work, we study the application of intrinsic stylometric methods to plagiarism detection in Armenian texts. We use two task setups from the PAN series of conferences on text forensics and stylometry: style change detection and style breach detection. Style change detection aims to determine whether a text was written by more than one author, while style breach detection locates the boundaries of stylistically distinct text fragments. For these tasks, we generate synthetic test sets for three genres of text (academic, literature, and news) and use them to evaluate the effectiveness of hierarchical clustering and other relevant models from the PAN conferences. We employ a standard set of character-level, lexical, and readability features, and additionally perform morphological and dependency parsing of text fragments to extract syntactic features that encode information about author style. The evaluation results show that the clustering-based approach fails to correctly detect style changes in longer texts and is only marginally better on shorter texts. For style breach detection, the hierarchical clustering-based approach performs better than a random baseline classifier, but the difference is not sufficient to warrant its practical use. In a complementary experiment, we show that reducing the number of features and the multicollinearity among them via PCA helps to increase the precision of style breach detection methods for certain text categories.

Highlights

At each step of the algorithm, the two clusters whose merger leads to the smallest increase in variance are combined.
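The merge rule in this highlight matches the criterion commonly known as Ward's linkage. The sketch below is a minimal pure-Python illustration of that rule, not the authors' implementation: the function names (`ward_cost`, `ward_clustering`) and the two-dimensional toy feature vectors are assumptions made for the example. The merge cost is the increase in within-cluster sum of squares, |A||B| / (|A| + |B|) times the squared distance between the two centroids.

```python
from itertools import combinations


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_clustering(points, n_clusters):
    """Greedy agglomerative clustering: always merge the cheapest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical 2-D style vectors for five text fragments (illustrative only).
fragments = [(0.10, 0.20), (0.12, 0.19), (0.80, 0.75), (0.82, 0.78), (0.11, 0.22)]
result = ward_clustering(fragments, 2)
```

This brute-force loop re-evaluates every pair at each step, which is fine for a handful of fragments; a library implementation would be the practical choice for real feature matrices.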
The above, computed separately for each punctuation mark; common suffixes: the presence of a specific suffix, and the ratio #[words with the specific suffix] / #[words].
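As a rough illustration of the two suffix features named in this highlight (presence of a suffix, and the share of words carrying it), the following sketch computes both for a tokenized fragment. The function name `suffix_features`, the feature key format, and the sample tokens and suffix are assumptions chosen for the example; the source does not list the actual suffix inventory.

```python
def suffix_features(tokens, suffixes):
    """Return presence and relative-frequency features for each suffix.

    tokens   -- list of word strings from one text fragment
    suffixes -- list of suffix strings to look for (hypothetical inventory)
    """
    total = len(tokens)
    features = {}
    for suffix in suffixes:
        count = sum(1 for token in tokens if token.endswith(suffix))
        # Presence of the specific suffix anywhere in the fragment.
        features[f"has_{suffix}"] = count > 0
        # #[words with the specific suffix] / #[words]
        features[f"ratio_{suffix}"] = count / total if total else 0.0
    return features


# Illustrative tokens; the suffix below is only an example choice.
tokens = ["ազատություն", "և", "գեղեցկություն", "տուն"]
feats = suffix_features(tokens, ["ություն"])
```

Here two of the four tokens end in the chosen suffix, so the presence feature is true and the ratio is 0.5.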

Summary

Style breach detection

The task of style breach detection is to determine whether the document under study has one author or several and, in the multi-author case, to split the document into stylistically homogeneous fragments and indicate the boundaries where the style changes. This task was offered to PAN participants in 2017 [3]. [...] [11] were less effective in this task and were therefore not considered in our work for adaptation and application to Armenian. These methods compare the representations of adjacent sentences. In [10], the document is split into sentences, and sliding windows of three sentences that share one sentence with each other are considered. A sentence is considered an outlier if the mean distance from its vector representation to the other vectors exceeds a given threshold. Besides the fact that this model ran considerably slower than the others, adapting it to Armenian would also have been problematic because pre-training the neural network it uses, and running it effectively, would require a large amount of text data.
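The window-and-threshold rule described for [10] can be sketched as follows. This is a simplified reading under stated assumptions, not the cited method itself: sentence vectors are taken as given, the "other vectors" are assumed to be the other sentences in the same window, Euclidean distance is assumed, and the names `sliding_windows` and `outlier_sentences` are hypothetical.

```python
import math


def sliding_windows(n_sentences, size=3, shared=1):
    """Index windows of `size` sentences; neighbours share `shared` sentences."""
    step = size - shared
    return [list(range(start, start + size))
            for start in range(0, n_sentences - size + 1, step)]


def outlier_sentences(vectors, threshold, size=3, shared=1):
    """Flag sentences whose mean distance to the rest of a window is too large."""
    outliers = set()
    for window in sliding_windows(len(vectors), size, shared):
        for i in window:
            others = [j for j in window if j != i]
            mean_dist = sum(math.dist(vectors[i], vectors[j])
                            for j in others) / len(others)
            if mean_dist > threshold:
                outliers.add(i)
    return sorted(outliers)


# Toy 2-D sentence vectors (assumed); sentence 4 is stylistically far away.
vectors = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (5, 5), (0.1, 0.05), (0, 0.05)]
flagged = outlier_sentences(vectors, threshold=4.0)
```

With only three sentences per window, the neighbours of an outlier also see their mean distance rise, so the threshold has to sit between those two levels; in this toy data the neighbours reach about 3.5 and the outlier about 7.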

Authorship clustering

Methods

Experiments

Datasets

Results

Findings

Conclusion

Journal: Proceedings of the Institute for System Programming of the RAS, vol. 33, issue 1
Publication Date: Jan 1, 2021
Citations: 1
License type: cc-by
