Creating a corpus of Kazakh-language texts for training and testing machine methods for detecting extremist information on web resources

Shynar Mussiraliyeva, Ihor Tereikovskyi, Gulshat Baispay, Moldir Sagynay, Milana Bolatbek

doi:10.52209/1609-1825_2023_3_453

Abstract

The article addresses the creation of a Kazakh-language text corpus, which is needed to build a semantic model for identifying extremist information on web resources. Text data was collected from closed groups of the VKontakte social network, YouTube comments, and news sites using previously developed analytics. The texts were classified into two classes: "extremist" and "neutral". The resulting corpus is intended for further training of models. It contains about 1,200 extremist messages and about 140,000 words in total. The distribution of extremist and neutral texts was examined, and the corpus was analyzed using a word cloud. Data analysis and visualization were carried out in Python 3.7 with the pandas, numpy, matplotlib, plotly, bokeh, cufflinks, spacy, and googletrans packages. Characteristic features of texts with extremist content were identified.
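
The corpus statistics mentioned in the abstract (class distribution, total word count, and the per-class word frequencies that a word cloud visualizes) can be illustrated with a minimal sketch. This is not the authors' code: the toy `corpus` list, the label strings, and the helper names `class_distribution`, `word_frequencies`, and `total_words` are all hypothetical, and whitespace tokenization is an assumed simplification.

```python
from collections import Counter

# Hypothetical toy corpus of (text, label) pairs. The real corpus would be
# loaded from the collected VKontakte, YouTube, and news-site data.
corpus = [
    ("бұл бейтарап мәтін мысалы", "neutral"),
    ("тағы бір бейтарап мәтін", "neutral"),
    ("бұл экстремистік мәтін мысалы", "extremist"),
]


def class_distribution(samples):
    """Count how many texts carry each label."""
    return Counter(label for _, label in samples)


def word_frequencies(samples, label):
    """Count whitespace-separated tokens over all texts with the given label.

    These counts are the kind of input a word cloud is drawn from.
    """
    counts = Counter()
    for text, sample_label in samples:
        if sample_label == label:
            counts.update(text.lower().split())
    return counts


def total_words(samples):
    """Total number of whitespace-separated tokens in the corpus."""
    return sum(len(text.split()) for text, _ in samples)


if __name__ == "__main__":
    print(class_distribution(corpus))
    print(total_words(corpus))
    print(word_frequencies(corpus, "neutral").most_common(3))
```

On a real corpus, the same counts could be computed with pandas (for example, `value_counts` on a label column), which the abstract lists among the packages used.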
