Procedure for checking the uniformity of samples of text documents based on nonparametric criteria

S I Safin, V O Tolcheev

doi:10.26896/1028-6861-2023-89-7-71-77

Abstract

One of the most important tasks in text mining is the formation of sufficiently large, representative, and consistent samples (datasets). Datasets are usually assembled from various information sources. In some cases, because specialized texts in Russian are scarce, the dataset is expanded with translated English-language documents. In such situations, it is advisable to evaluate the uniformity (homogeneity) of the combined arrays. However, this verification is complicated by the fact that documents are represented as multidimensional vectors, and comparing them correctly is a non-trivial task. Because procedures for checking the uniformity of samples in the multidimensional case are insufficiently developed, the problem of possible differences in the data is, in practice, ignored as insignificant. As a result, classifiers are trained on samples that mix quite diverse texts, and the resulting quality of categorization does not improve (or even deteriorates). It is therefore relevant to develop a procedure for checking the uniformity of document samples. To this end, we carry out a comprehensive study of the problem of shift in textual data and identify and analyze the causes of heterogeneity in document arrays. In this study, the datasets consist of bibliographic descriptions of scientific articles (title, abstract, keywords). We develop a procedure for assessing the homogeneity of two samples of approximately the same size that use the same method for calculating term weights. The comparison uses centroids whose dimension equals the size of the common dictionary of the two datasets (where a term is absent from a sample, a zero is placed in the corresponding position of its centroid).
Representing the samples as "terminological portraits" (centroids) reduces the verification of the homogeneity of multidimensional document vectors to the well-studied problem of analyzing two one-dimensional paired (dependent) samples, to which nonparametric criteria (tests) were applied: the sign test and the Wilcoxon signed-rank test. The proposed procedure for checking the uniformity of samples was tested on three collections of documents obtained from Russian- and English-language sources.
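
The centroid comparison described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: term weights are raw term frequencies averaged over the documents of a sample, tokenization is a plain whitespace split, and the Wilcoxon p-value uses the normal approximation, which is only reasonable when many terms differ between the centroids. All function names (`centroid`, `paired_centroids`, `sign_test`, `wilcoxon_signed_rank`) are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import comb, erfc, sqrt


def centroid(docs):
    """Mean term-frequency vector of a sample (assumed weighting scheme)."""
    total = Counter()
    for doc in docs:
        total.update(doc.lower().split())  # naive whitespace tokenizer
    return {term: count / len(docs) for term, count in total.items()}


def paired_centroids(docs_a, docs_b):
    """Align both centroids on the common dictionary; absent terms get 0."""
    ca, cb = centroid(docs_a), centroid(docs_b)
    vocab = sorted(set(ca) | set(cb))
    return vocab, [ca.get(t, 0.0) for t in vocab], [cb.get(t, 0.0) for t in vocab]


def sign_test(x, y):
    """Two-sided exact sign test for paired samples; zero differences dropped."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    k = sum(d > 0 for d in diffs)
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation, tie-corrected)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    if var <= 0:
        return 1.0
    z = (w_plus - mean) / sqrt(var)
    return erfc(abs(z) / sqrt(2))  # two-sided normal p-value


if __name__ == "__main__":
    # Toy samples, invented purely for illustration.
    sample_a = ["text mining of scientific abstracts", "classification of text documents"]
    sample_b = ["mining translated abstracts", "classification of translated documents"]
    vocab, xa, xb = paired_centroids(sample_a, sample_b)
    print("dictionary size:", len(vocab))
    print("sign test p-value:", sign_test(xa, xb))
    print("Wilcoxon p-value:", wilcoxon_signed_rank(xa, xb))
```

A small p-value from either test would suggest that the two terminological portraits differ systematically. The significance level, and how the two tests are combined into a decision, are left open here because the abstract does not specify them.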
