KazNewsDataset: Single Country Overall Digital Mass Media Publication Corpus

Kirill Yakunin, Rustam Mussabayev, Ravil I Mukhamediev, Viktors Gopejenko, Marina Yelis, Sanzhar Murzakhmetov, Yan Kuchin, Alibek Abdurazakov, Vladimir Barakhnin, Akylbek Zhumabayev, Maksat Kalimoldayev, Ulzhan Ospanova, Timur Buldybayev, Zhazirakhanym Meirambekkyzy

doi:10.3390/data6030031

Abstract

Mass media is one of the most important elements influencing the information environment of society. It is not only a source of information about what is happening but is often the authority that shapes the information agenda and the boundaries and forms of discussion on socially relevant topics. A multifaceted and, where possible, quantitative assessment of mass media performance is crucial for understanding its objectivity, tone, thematic focus, and quality. The paper presents a corpus of Kazakhstani media, which contains over 4 million publications from 36 primary sources (each with at least 500 publications). The corpus also includes more than 2 million texts of Russian media for comparative analysis of the publication activity of the two countries, as well as about 4000 sections of state policy documents. The paper briefly describes the natural language processing and multiple-criteria decision-making methods that form the algorithmic basis of the text and mass media evaluation method, and reports the results of several research cases: identification of propaganda, assessment of the tone of publications, calculation of the level of socially relevant negativity, and comparative analysis of publication activity in the field of renewable energy. Experiments confirm that a topic model of the text corpus can, in general, be used to evaluate socially significant news, identify texts with propagandistic content, and evaluate the sentiment of publications: area under the receiver operating characteristic curve (ROC AUC) values of 0.81, 0.73 and 0.93 were achieved on these tasks. The described cases do not exhaust the possibilities of thematic, tonal, dynamic, and other kinds of analysis of the corpus. The corpus will be of interest to researchers working on the analysis of both multiple publications and mass media, including comparative analysis and the identification of common patterns inherent in the media of different countries.

Highlights

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations
The paper describes a text corpus, which contains over 4 million publications of Kazakhstani media, more than 2 million texts of Russian media and about 4000 sections of state development program documents.
The corpus was used in several research cases.

Summary

Form 2 of Corpus Representation

It is available at [30]. It includes 1,142,735 documents from news websites and social networks, with the same data as in the corpus described above and the following additions:
Sixty-seven columns with weights of handpicked topic groups, labeled with semantic names (group economy, group politics, etc.) and normalized to range from 0 to 1;
Two hundred columns with topic weights obtained through topic modeling.
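As an illustration of what normalizing topic group weights to the range from 0 to 1 can look like, the following is a minimal sketch in Python. The text does not say which normalization the authors applied; min-max scaling is assumed here. The function name `min_max_normalize`, the column names `group_economy` and `group_politics`, and the sample values are all hypothetical.

```python
def min_max_normalize(rows, columns):
    """Rescale each named column to [0, 1] across all rows.

    Assumption: min-max scaling; the source only states that the
    topic group weights were normalized to range from 0 to 1.
    """
    normalized = [dict(row) for row in rows]  # copy so the input is untouched
    for col in columns:
        values = [row[col] for row in rows]
        low, high = min(values), max(values)
        span = high - low
        for row in normalized:
            # A constant column carries no information; map it to 0.0.
            row[col] = (row[col] - low) / span if span else 0.0
    return normalized


# Hypothetical documents with two raw topic group weights each.
docs = [
    {"id": 1, "group_economy": 2.0, "group_politics": 0.5},
    {"id": 2, "group_economy": 6.0, "group_politics": 1.5},
    {"id": 3, "group_economy": 4.0, "group_politics": 1.0},
]

scaled = min_max_normalize(docs, ["group_economy", "group_politics"])
for row in scaled:
    print(row)
```

After scaling, every group column has a minimum of 0 and a maximum of 1, which makes weights of different topic groups directly comparable across documents.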


Limitations of the Study

Findings

Conclusions