The exact probability law for the approximated similarity from the Minhashing method

Soumaila Dembele, Gane Samb Lo

doi:10.16929/as/2017.1199.100

Abstract

We propose a probabilistic setting in which we study the probability law of the Rajaraman and Ullman (RU) algorithm and of a modified version of it, denoted RUM. These algorithms aim at estimating the similarity index between huge texts in the context of the web. We give a foundation for this method by showing that, in the ideal case of carefully chosen probability laws, the exact similarity is the mathematical expectation of the random similarity provided by the RUM algorithm. Some extensions are given.
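As an illustration of the kind of statement made in the abstract, the following minimal sketch compares an exact similarity with a minhash-style random estimate. It is a generic permutation-based minhash on small sets, not the RU or RUM algorithm studied in the paper; the function names `jaccard` and `minhash_estimate`, the use of the Jaccard index as the similarity, and the uniform random permutations are all assumptions made for the example.

```python
import random


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def minhash_estimate(a, b, num_perms=200, seed=0):
    """Estimate the Jaccard similarity of sets a and b by minhashing.

    Hypothetical helper for illustration: each trial draws a uniform
    random permutation of the union and checks whether both sets have
    the same lowest-ranked element. The fraction of agreeing trials is
    the random similarity; its expectation is the exact Jaccard index.
    """
    if not a or not b:
        # Two empty sets are treated as identical, otherwise no overlap.
        return 1.0 if a == b else 0.0
    rng = random.Random(seed)
    universe = sorted(a | b)  # elements outside both sets never matter
    matches = 0
    for _ in range(num_perms):
        rng.shuffle(universe)
        rank = {x: i for i, x in enumerate(universe)}
        # The minhash of a set is its element with the smallest rank.
        if min(a, key=rank.__getitem__) == min(b, key=rank.__getitem__):
            matches += 1
    return matches / num_perms


if __name__ == "__main__":
    doc_a = set("abcdefgh")
    doc_b = set("efghijkl")
    print("exact    :", jaccard(doc_a, doc_b))
    print("estimated:", minhash_estimate(doc_a, doc_b, num_perms=2000))
```

In each trial the two minhashes agree exactly when the lowest-ranked element of the union lies in the intersection, so under uniform permutations the estimate averages to the exact similarity. The estimate fluctuates around that value for any finite number of permutations.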
