Abstract

This article addresses the creation of a Kazakh-language corpus needed to build a semantic model for identifying extremist information on web resources. Text data was collected from closed groups of the VKontakte social network, YouTube comments, and news sites using previously developed analytics tools. The texts were classified into two classes, «extremist» and «neutral», and a corpus was assembled for further model training. The corpus contains about 1,200 extremist messages and about 140,000 words in total. The distribution of extremist and neutral texts was examined, and the corpus was analyzed using a word cloud. Data analysis and visualization were carried out in Python 3.7 with the pandas, numpy, matplotlib, plotly, bokeh, cufflinks, spacy, and googletrans packages. Characteristic features of texts with extremist content were identified.
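The kind of corpus statistics mentioned above (class distribution, total word count, and the per-class word frequencies that a word cloud is drawn from) can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy `corpus`, the function names, and the whitespace tokenization are all assumptions made for illustration, and it uses only the Python standard library rather than the packages listed in the abstract.

```python
from collections import Counter

# Hypothetical toy corpus of (text, label) pairs. These strings are
# placeholders, not samples from the corpus described in the abstract.
corpus = [
    ("placeholder message one", "neutral"),
    ("placeholder message two two", "neutral"),
    ("placeholder flagged message", "extremist"),
]


def class_distribution(rows):
    """Count how many texts carry each label."""
    return Counter(label for _, label in rows)


def word_frequencies(rows, label=None):
    """Count lowercased, whitespace-separated tokens, optionally for one label.

    Whitespace splitting is a simplifying assumption; a real Kazakh corpus
    would likely need proper tokenization and normalization.
    """
    counts = Counter()
    for text, row_label in rows:
        if label is None or row_label == label:
            counts.update(text.lower().split())
    return counts


distribution = class_distribution(corpus)
total_words = sum(word_frequencies(corpus).values())
extremist_counts = word_frequencies(corpus, label="extremist")

print(distribution)
print(total_words)
print(extremist_counts.most_common(5))
```

The per-label frequency table returned by `word_frequencies` is the kind of input a word cloud renderer typically takes, so the same counts could serve both the numeric summary and the visualization.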

