Application of principal component analysis to identify semantic differences and estimate relative positioning of network communities in the study of social networks content

I A Rytsarev, A V Kupriyanov, D V Kirsh, R A Paringer

doi:10.1088/1742-6596/1368/5/052032

Abstract

In this paper, we propose an approach to the analysis of social groups and their relative positioning based on the identification of semantic differences in texts represented as frequency dictionaries. The initial textual data was obtained by collecting records from thematic Internet communities. To collect the entries, we implemented a specialized software module that downloads and analyzes posts and comments from open communities of interest in the social network VKontakte. The developed algorithm for compiling frequency dictionaries evaluates the characteristics of the data collected from social networks. For keyword identification, we propose a new approach based on the analysis of the word frequency distribution, using methods for dimensionality reduction of feature spaces. The presented algorithm, which uses principal component analysis, makes it possible to assess the significance of words from the coefficients of the linear transformation. Along with the keywords, we identified semantic differences between social network communities and estimated their relative positioning in the transformed feature space.
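
The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy texts, the function names `frequency_dictionary` and `pca_keywords`, and the choice of ranking words by the absolute coefficients of the first principal component are all assumptions made for illustration. Text preprocessing such as lemmatization or stop-word removal is omitted, and PCA is computed here through a singular value decomposition of the centered frequency matrix.

```python
from collections import Counter

import numpy as np


def frequency_dictionary(texts):
    """Relative word frequencies over all texts of one community."""
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def pca_keywords(dictionaries, n_components=2, top_k=3):
    """Project communities with PCA and rank words by first-component coefficients."""
    # Shared vocabulary: one feature (column) per word.
    vocab = sorted(set().union(*dictionaries))
    X = np.array([[d.get(w, 0.0) for w in vocab] for d in dictionaries])

    # PCA via SVD of the column-centered matrix; rows of Vt are the components.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]

    # Coordinates of each community in the transformed feature space.
    coords = Xc @ components.T

    # Words with the largest absolute coefficients in the first component
    # (an assumed significance criterion for this sketch).
    order = np.argsort(-np.abs(components[0]))[:top_k]
    keywords = [vocab[i] for i in order]
    return coords, keywords, vocab


# Hypothetical toy data standing in for posts collected from three communities.
communities = {
    "football": ["goal match team goal", "team match win"],
    "cooking": ["recipe soup salt recipe", "soup oven salt"],
    "cinema": ["film actor scene film", "actor scene film"],
}

dictionaries = [frequency_dictionary(texts) for texts in communities.values()]
coords, keywords, vocab = pca_keywords(dictionaries)

for name, point in zip(communities, coords):
    print(name, np.round(point, 3))
print("candidate keywords:", keywords)
```

In this sketch, the distances between the rows of `coords` give a rough picture of how the communities are positioned relative to each other, while `keywords` lists the words that contribute most to the first component.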

Full Text