Efficiency Analysis of Jaccard Similarity in Probabilistic Distribution Model

doi:10.25236/ajcis.2023.060208

Abstract

The inner probabilistic properties of big data have a great impact on the performance of pattern recognition systems. Jaccard similarity (JS) is one of the most popular statistical metrics for calculating the similarity of objects in the feature extraction process. This paper combines JS with a probabilistic distribution model to explore the effect of the inner properties of big data. It derives the generalized form of JS for the probabilistic model and determines how to calculate JS for power-law and exponential distributions. Experimental observations show that a power-law distribution has a higher JS than the corresponding exponential distribution, which indicates that the power-law probabilistic structure is the more efficient of the two. The original normalized data in the MNIST database exhibit a more power-law-like distribution, while the randomly translated data exhibit a more exponential-like distribution. The MNIST data with the power-law-like property have a higher JS and are more efficient than the translated data. These observations provide possible guidelines for efficient information coding and processing methods.
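
For readers unfamiliar with the metric, the following is a minimal sketch of Jaccard similarity in Python. The abstract does not state the paper's generalized form, so the second function uses the commonly cited weighted (min/max) extension to non-negative vectors as an assumption; it is not necessarily the form derived in the paper. The function names `jaccard_sets` and `weighted_jaccard` are illustrative only.

```python
from typing import Hashable, Iterable, Sequence


def jaccard_sets(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Classical Jaccard similarity |A ∩ B| / |A ∪ B| for two collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Convention assumed here: two empty sets are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


def weighted_jaccard(x: Sequence[float], y: Sequence[float]) -> float:
    """Weighted Jaccard: sum(min(x_i, y_i)) / sum(max(x_i, y_i)).

    Assumption: this min/max form stands in for a generalized JS on
    non-negative vectors (e.g. probability masses); the paper's own
    generalized form may differ.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if any(v < 0 for v in x) or any(v < 0 for v in y):
        raise ValueError("weighted Jaccard requires non-negative entries")
    numerator = sum(min(p, q) for p, q in zip(x, y))
    denominator = sum(max(p, q) for p, q in zip(x, y))
    if denominator == 0:
        # Same convention as above for two all-zero vectors.
        return 1.0
    return numerator / denominator


if __name__ == "__main__":
    # Sets {1, 2, 3} and {2, 3, 4} share 2 of 4 distinct elements.
    print(jaccard_sets({1, 2, 3}, {2, 3, 4}))
    # Binary indicator vectors reduce to the set-based definition.
    print(weighted_jaccard([1, 0, 1], [1, 1, 0]))
```

On binary indicator vectors the weighted form reduces to the set-based definition, which is why it is often used as a starting point when extending JS to distributions.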
