A probabilistic similarity index between binary vectors for questionnaire data analysis (abstract only)

Xiaobo Li

doi:10.1145/322917.323053

Abstract

This paper proposes a probabilistic similarity index as a proximity measure between binary feature vectors. Different situations of binary features are analyzed, and an application of the new index to questionnaire analysis is reported.

Binary vectors are the most frequently used data format in computer science. Pattern recognition problems, such as questionnaire data analysis, require that the association between binary pattern vectors or feature vectors be measured. Many association measures, including the simple matching coefficient and the Jaccard coefficient, have been proposed.

Consider two binary vectors of size n, V1 = [V1(i)] and V2 = [V2(i)], i = 1, 2, …, n, where V1(i) ∈ {0,1} and V2(i) ∈ {0,1}. Let the vectors represent two features, where the i-th components correspond to the i-th pattern. We wish to assess the association between these two vectors. We define the elements of the contingency table as follows. All summations are over i from 1 through n.

n1 = Σ V1(i)
n2 = Σ V2(i)
n00 = Σ [1-V1(i)][1-V2(i)]
n01 = Σ [1-V1(i)]V2(i)
n10 = Σ V1(i)[1-V2(i)]
n11 = Σ V1(i)V2(i)

Various elements of the contingency table have been used to measure the similarity between V1 and V2. These measures differ in the coding of (0,1) and in the emphasis placed on the 0-0 and 1-1 matches. In the situation where the coding of 0 and 1 is meaningful and consistent between the vectors, and (1,1) matches and (0,0) matches are equally important, the elements n11 and n00 are directly involved in the computation of the similarity, while n01 and n10 are not.
Indices like the simple matching coefficient are suitable in this case.

If the labeling of 0 and 1 is arbitrary, as when the labels simply denote two distinct states, or as when the two binary vectors represent two binary partitions of the same set, several well-known indices, such as Hubert's G, can be reduced to the binary case for this situation.

The third situation arises in many questionnaire problems, where the coding of 0 and 1 is meaningful and significant, and n11 is treated differently from n00. Consider a questionnaire in which each participant responds to several “yes-no” questions. The answers are coded 0 or 1, where “1” means “yes” and “0” means “no”, “not clear”, “don't remember”, or even “missing data”. The (1,1) matches are significant, while (0,1), (1,0), and (0,0) matches are not. The Jaccard coefficient is an example of an index suitable for this situation. This paper is mainly concerned with this third situation.

Probabilistic measures of association are based on the unusualness of the proximity. This paper discusses the application of a new probabilistic similarity index to questionnaire data analysis. This index is defined as the cumulative probability, under a random permutation hypothesis, that a random labeling of components within a vector has at most the observed number of matches. The permutation null hypothesis is defined as H0(p): all permutations of V2, r(V2)'s, are equally likely. The baseline population under H0(p) is Op = {r(V2): r(V2) is a permutation of V2}.

A new similarity index T is defined as the cumulative probability that a random permutation of V2 has at most the same number of (1,1) matches with V1 as the original V2 has with V1, i.e., T = H(n11), where H is the cumulative hypergeometric probability.

The probabilistic similarity index defined under this model has clear statistical significance and its critical value for testing the null hypothesis can be easily computed.
This unique property of probabilistic proximity measures frees the researcher from Monte Carlo study and provides a natural threshold for vector correspondence. The proposed similarity index is self-standardized and thus does not require a correction for randomness. A pipelined architecture is available for computing this similarity index efficiently.

An application of this probabilistic similarity index to a questionnaire data analysis problem is reported. The application of a randomized version of this index is reported elsewhere [1,2]. The index's performance is compared with that of some well-known similarity measures, and its advantages and limitations are demonstrated.
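
The definitions above can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names are hypothetical, and the hypergeometric parameterization (population size n, n1 marked positions, n2 draws) is one reading of the statement T = H(n11) under the permutation hypothesis. The standard forms of the simple matching and Jaccard coefficients are included only for comparison, and the value returned by the Jaccard function when both vectors are all zeros is an assumed convention.

```python
from math import comb


def contingency(v1, v2):
    """Return (n00, n01, n10, n11) for two equal-length 0/1 sequences."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    n00 = sum((1 - a) * (1 - b) for a, b in zip(v1, v2))
    n01 = sum((1 - a) * b for a, b in zip(v1, v2))
    n10 = sum(a * (1 - b) for a, b in zip(v1, v2))
    n11 = sum(a * b for a, b in zip(v1, v2))
    return n00, n01, n10, n11


def simple_matching(v1, v2):
    """Simple matching coefficient: (n11 + n00) / n."""
    n00, _, _, n11 = contingency(v1, v2)
    return (n11 + n00) / len(v1)


def jaccard(v1, v2):
    """Jaccard coefficient: n11 / (n11 + n10 + n01)."""
    _, n01, n10, n11 = contingency(v1, v2)
    denom = n11 + n10 + n01
    # Assumption: return 0.0 when neither vector contains a 1.
    return n11 / denom if denom else 0.0


def similarity_t(v1, v2):
    """Cumulative hypergeometric probability of at most n11 (1,1) matches.

    Assumed model: V1 is fixed with n1 ones, and the n2 ones of V2 are
    placed uniformly at random among the n positions.
    """
    n = len(v1)
    _, n01, n10, n11 = contingency(v1, v2)
    n1 = n11 + n10  # number of ones in V1
    n2 = n11 + n01  # number of ones in V2
    lowest = max(0, n1 + n2 - n)  # smallest possible number of (1,1) matches
    favourable = sum(
        comb(n1, k) * comb(n - n1, n2 - k) for k in range(lowest, n11 + 1)
    )
    return favourable / comb(n, n2)
```

For V1 = [1,1,1,0,0,0,0,0] and V2 = [1,1,0,1,0,0,0,0], this sketch gives n11 = 2, a simple matching coefficient of 0.75, a Jaccard coefficient of 0.5, and T = 55/56, roughly 0.982.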
