Data analysis examines, interprets, and extracts information from data series, applying various methods and techniques to understand patterns and compare data. One approach to comparing data is to use rank metrics, which quantify how distinct two data series are with respect to the patterns, formats, criteria, and dimensions of both series. Among these metrics, Kendall’s Tau stands out: it is robust and inexpensive, and it is widely used to analyze sequences and genomes, to detect errors in flash memories, and to compare distributions and top-k ranked values. However, a challenge arises when comparing lists of different lengths or lists that do not share the same elements. This happens, for example, when lists are defined by their top-k elements, commonly called k-lists. In this case, there is no guarantee that two k-lists share the same set of elements. Traditional metrics like Kendall’s Tau are designed to quantify differences only between the elements that the lists share. Given this limitation, one possible solution is to apply the metric only to the shared elements of the lists. Another solution, the generalization of Kendall’s Tau proposed by Fagin et al., considers all elements of both lists. However, this generalization of Kendall’s Tau is a semi-metric, as it does not satisfy the triangle inequality. To solve this problem, we propose the Extended Kendall Tau (EKT) metric, which meets all the conditions of a metric while also considering the distinct elements of the compared lists. The proposed metric was evaluated by applying both conventional Kendall’s Tau (KT) and EKT to 40 text files in five different languages (eight files per language). We compared the KT and EKT measures in two scenarios: between files of the “same language” and between files of “other language”.
The results revealed that both methods accurately identify the differences between the “same language” and “other language” groups of texts. However, the numerical results show that EKT highlights the difference between groups of texts in different languages more strongly.
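The k-list limitation described above can be illustrated with a minimal Python sketch. It contrasts the Kendall tau distance restricted to shared elements with the pairwise case analysis of the top-k generalization by Fagin et al. as it is commonly stated. The function names, the penalty parameter `p`, and its default of 0.5 are illustrative assumptions, and this sketch does not implement the EKT metric proposed here.

```python
from itertools import combinations


def kendall_tau_shared(a, b):
    """Count discordant pairs among the elements present in both lists."""
    pos_a = {x: i for i, x in enumerate(a)}
    pos_b = {x: i for i, x in enumerate(b)}
    shared = [x for x in a if x in pos_b]
    return sum(
        1
        for x, y in combinations(shared, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def _penalty_one_missing(pos_full, x, y, x_in_other):
    """Both x and y are in one list; exactly one of them is in the other.

    The element present in the other list is implicitly ranked ahead there,
    so the pair is discordant if the absent element leads in the full list.
    """
    present, absent = (x, y) if x_in_other else (y, x)
    return 1 if pos_full[absent] < pos_full[present] else 0


def kendall_tau_topk(a, b, p=0.5):
    """Top-k generalization over all elements of both lists (assumed form).

    p is the penalty for a pair that appears in one list only (hypothetical
    default; the abstract does not fix a value).
    """
    pos_a = {x: i for i, x in enumerate(a)}
    pos_b = {x: i for i, x in enumerate(b)}
    universe = list(dict.fromkeys(list(a) + list(b)))  # ordered union
    total = 0.0
    for x, y in combinations(universe, 2):
        in_a = (x in pos_a, y in pos_a)
        in_b = (x in pos_b, y in pos_b)
        if all(in_a) and all(in_b):
            # Both elements in both lists: ordinary discordance check.
            if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0:
                total += 1
        elif all(in_a) and any(in_b):
            total += _penalty_one_missing(pos_a, x, y, in_b[0])
        elif all(in_b) and any(in_a):
            total += _penalty_one_missing(pos_b, x, y, in_a[0])
        elif all(in_a) or all(in_b):
            # Both in one list, neither in the other: order unknown.
            total += p
        else:
            # Each element appears in a different list only.
            total += 1
    return total
```

For the lists `["a", "b", "c"]` and `["b", "a", "d"]`, the shared-element count is 1 while the top-k variant gives 2.0, because the pair `(c, d)` is also penalized. For two disjoint lists of three elements each, the shared-element count is 0 even though the lists have nothing in common, whereas the top-k variant gives 12.0 with `p=0.5`.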