Optimization approach to the choice of explicable methods for detecting anomalies in homogeneous text collections

Elena Baskakova, Fedor Krasnov, Irina Smaznevich

doi:10.15622/ia.20.4.5

Open Access

Abstract

The problem of detecting anomalous documents in text collections is considered. Existing anomaly detection methods are not universal and do not give stable results across data sets. The accuracy of the results depends on the choice of parameters at each step of the solution algorithm, and different parameter sets are optimal for different collections. Not all existing anomaly detection algorithms work effectively with text data, whose vector representation is characterized by high dimensionality and strong sparsity.

The problem of finding anomalies is considered in the following statement: a new document uploaded to an applied intelligent information system must be checked for congruence with a homogeneous collection of documents stored in it. In such systems that process legal documents, the following requirements are imposed on anomaly detection methods: high accuracy, computational efficiency, reproducibility of results and explicability of the solution. Methods satisfying these conditions are investigated.

The paper examines the possibility of evaluating text documents on an anomaly scale by deliberately introducing a foreign document into the collection. A strategy for detecting the novelty of a document in relation to the collection is proposed, which assumes a reasoned selection of methods and parameters. It is shown how the accuracy of the solution is affected by the choice of vectorization options, tokenization principles, dimensionality reduction methods and parameters of novelty detection algorithms.

The experiment was conducted on two homogeneous collections of documents containing technical norms: standards in the field of information technology and railways. The following approaches were used: calculation of the anomaly index as the Hellinger distance between the distributions of the distances from documents to the center of the collection and to the foreign document; optimization of the novelty detection algorithms depending on the methods of vectorization and dimensionality reduction. The vector space was constructed using the TF-IDF transformation and ARTM topic modeling. The following algorithms were tested: Isolation Forest, Local Outlier Factor and One-Class SVM (based on the Support Vector Machine).

The experiment confirmed the effectiveness of the proposed optimization strategy for determining the appropriate anomaly detection method for a given text collection. When searching for an anomaly in the context of topic clustering of legal documents, the Isolation Forest method proved effective. When vectorizing documents using TF-IDF, it is advisable to choose the optimal dictionary parameters and use the One-Class SVM method with the corresponding feature space transformation function.
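The anomaly index described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the distance measure or how the distributions are estimated, so the cosine distance, the fixed histogram bins and the function names `hellinger`, `cosine_distances_to` and `anomaly_index` are all assumptions made here for illustration, not the authors' implementation.

```python
import numpy as np


def hellinger(p, q):
    """Hellinger distance between two discrete distributions that each sum to 1."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


def cosine_distances_to(X, v):
    """Cosine distance from every row of X to the vector v (rows assumed non-zero)."""
    sims = (X @ v) / (np.linalg.norm(X, axis=1) * np.linalg.norm(v))
    # Clip so that rounding error cannot push a value outside [0, 2].
    return np.clip(1.0 - sims, 0.0, 2.0)


def anomaly_index(X, foreign, bins=10):
    """Hellinger distance between two histograms of document distances:
    distances to the collection center and distances to the foreign document.

    X       : (n_docs, n_features) document vectors of the collection
    foreign : (n_features,) vector of the document being checked
    bins    : number of histogram bins (an assumed, tunable parameter)
    """
    center = X.mean(axis=0)
    d_center = cosine_distances_to(X, center)
    d_foreign = cosine_distances_to(X, foreign)
    # Shared bin edges over the full cosine-distance range keep the
    # two histograms comparable.
    edges = np.linspace(0.0, 2.0, bins + 1)
    h_center, _ = np.histogram(d_center, bins=edges)
    h_foreign, _ = np.histogram(d_foreign, bins=edges)
    return hellinger(h_center / h_center.sum(), h_foreign / h_foreign.sum())
```

Under these assumptions the index lies in [0, 1]: it is 0 when the checked document sits exactly at the collection center and approaches 1 when the two distance distributions do not overlap.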

Highlights

As of June 30, 2020, the amounts in the table above include amounts under contracts concluded with related parties (the Company's joint ventures and companies controlled by the Russian Federation) of 36,503 million rubles and 68,328 million rubles, respectively
The accuracy of the results depends on the choice of parameters at each step of the solution algorithm, and different parameter sets are optimal for different collections
Not all existing anomaly detection algorithms work effectively with text data, whose vector representation is characterized by high dimensionality and strong sparsity
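Because the best parameters differ from collection to collection, one way to explore the choice is to score the same foreign document under several configurations. The sketch below uses scikit-learn and is only an illustration: the toy texts, the parameter grid, `TruncatedSVD` as the dimensionality reduction step and the helper names `build_detectors`, `score_foreign` and `compare_configurations` are assumptions, and ARTM topic modeling is not covered.

```python
from itertools import product

from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import make_pipeline
from sklearn.svm import OneClassSVM

# Toy stand-ins for a homogeneous collection and a foreign document (assumed data).
COLLECTION = [
    "standard for railway track gauge and rail fastening requirements",
    "standard for railway signalling equipment and safety requirements",
    "technical requirements for railway rolling stock braking systems",
    "railway track maintenance standard and inspection requirements",
    "safety standard for railway level crossings and signalling",
    "requirements for rail welding and track geometry tolerances",
    "standard for rolling stock wheel sets and axle inspection",
    "technical standard for railway power supply equipment",
]
FOREIGN = "amounts under contracts with related parties in million rubles"


def build_detectors():
    """Factories for the three novelty detectors named in the abstract."""
    return {
        "isolation_forest": lambda: IsolationForest(random_state=0),
        # novelty=True lets LOF score documents that were not in the training set.
        "lof": lambda: LocalOutlierFactor(novelty=True, n_neighbors=5),
        "ocsvm_rbf": lambda: OneClassSVM(kernel="rbf", gamma="scale", nu=0.1),
    }


def score_foreign(collection, foreign_doc, ngram_range, n_components, make_detector):
    """Fit on the collection only, then score the foreign document.

    For all three detectors, a lower decision score means "more anomalous".
    """
    pipe = make_pipeline(
        TfidfVectorizer(ngram_range=ngram_range),
        TruncatedSVD(n_components=n_components, random_state=0),
        make_detector(),
    )
    pipe.fit(collection)
    return float(pipe.decision_function([foreign_doc])[0])


def compare_configurations(collection, foreign_doc):
    """Score the foreign document for every combination in a small assumed grid."""
    results = {}
    grid = product([(1, 1), (1, 2)], [2, 3], build_detectors().items())
    for ngram_range, n_components, (name, make_detector) in grid:
        key = (ngram_range, n_components, name)
        results[key] = score_foreign(
            collection, foreign_doc, ngram_range, n_components, make_detector
        )
    return results


if __name__ == "__main__":
    for key, score in sorted(compare_configurations(COLLECTION, FOREIGN).items()):
        print(key, round(score, 4))
```

The scores of different detectors are not on a common scale, so a real comparison would need a shared accuracy measure over many foreign documents rather than a side-by-side reading of raw scores.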

Summary

DETECTING ANOMALIES IN HOMOGENEOUS TEXT COLLECTIONS

However, for the task considered in this study, finding nonconforming items in a collection of lengthy text documents, semi-supervised learning methods are the most suitable: in real applied systems the general characteristics of "correct" documents are known, and users have an idea of which documents definitely must not enter the collection. When the content specificity of the document collection is absent or only weakly expressed, the feature space can be built with distributional semantics methods based on already known data on word distributions in general-purpose language corpora (word2vec [21]). The advantage of this representation of a text collection for anomaly detection is a much lower dimensionality of the feature space than with the previous two approaches, as well as higher explicability of the solution, which is important when developing applied information systems. Table 1 compares the explicability and accuracy of various groups of machine learning methods and also indicates their computational complexity

[Table residue from the full text; the original layout is not recoverable. Surviving labels: Medium; Accuracy; Neural networks; Document length in the experiment; "Digital twin of a marshalling hump"; "Market imbalance"; Unigrams and bigrams with РС; Outliers; Novelty; SVD; random.]

Journal: Информатика и автоматизация
Publication Date: Aug 3, 2021
Citations: 1
License type: CC BY 4.0
