Anomaly Detection in Arabic Texts using Ngrams and Self Organizing Maps

Abdulwahed Almarimi, Asmaa Salem

doi:10.5121/ijcsea.2021.11402

Abstract

Every written text in any language has one or more authors, and each author has an individual sublanguage. When the authors of a text are not known, the text can be analyzed using methods of data analysis, data mining, and structural analysis. In this paper, two methods for anomaly detection are described: an n-gram method and a system of Self-Organizing Maps working on sequences built from a text. We analyze and compare the results of usable methods for detecting discrepancies in Arabic texts based on character n-gram profiles (the set of normalized character n-gram frequencies of a text). The Arabic texts were analyzed from the point of view of many statistical characteristics, and we applied some heuristics for measuring the dissimilarity of text parts. We evaluate some Arabic texts and show which of their parts contain discrepancies and need further analysis for anomaly detection. The analysis depends on selected parameters prepared in experiments. The system of Self-Organizing Maps is trained on input sequences, after which it determines the text parts with anomalies using a cumulative error and winner analysis in the networks. Both methods have been tested on Arabic texts, and they offer a promising contribution to text analysis.
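
The idea of a character n-gram profile can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the sliding-window parameters (`window`, `step`), and the particular dissimilarity formula (a mean squared relative difference between a part's profile and the whole text's profile, scaled to the range 0 to 1) are assumptions chosen for illustration, since the abstract does not state which heuristics or parameter values were used.

```python
from collections import Counter


def ngram_profile(text, n=3):
    """Return the normalized frequency of every character n-gram in text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}


def dissimilarity(part, whole):
    """Assumed measure: mean squared relative difference, scaled to [0, 1].

    Only n-grams that occur in the part are compared, so every
    denominator is strictly positive.
    """
    if not part:
        return 0.0
    total = 0.0
    for gram, f_part in part.items():
        f_whole = whole.get(gram, 0.0)
        total += (2.0 * (f_part - f_whole) / (f_part + f_whole)) ** 2
    return total / (4 * len(part))


def window_scores(text, n=3, window=200, step=100):
    """Score each sliding window of the text against the whole text."""
    whole = ngram_profile(text, n)
    scores = []
    for start in range(0, max(len(text) - window, 0) + 1, step):
        part = ngram_profile(text[start:start + window], n)
        scores.append((start, dissimilarity(part, whole)))
    return scores
```

Windows whose score is much higher than the rest are candidates for further analysis. Because Python strings are sequences of Unicode code points, the same functions work on Arabic text directly; whether diacritics or letter variants should be normalized first is a preprocessing choice this sketch leaves open.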
