The Google Similarity Distance

Rudi L. Cilibrasi, Paul M.B. Vitanyi

doi:10.1109/tkde.2007.48

Abstract

Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers, the equivalent of "society" is "database," and the equivalent of "use" is "a way to search the database." We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts, we use the World Wide Web (WWW) as the database, and Google as the search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the WWW using Google page counts. The WWW is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation. We give examples that distinguish between colors and numbers, cluster names of paintings by 17th-century Dutch masters and names of books by English novelists, and show the ability to understand emergencies and primes, and we demonstrate the ability to do a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87 percent with the expert-crafted WordNet categories.
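The abstract describes deriving a distance from page counts without stating the formula. As a minimal sketch, the normalized Google distance is commonly written as NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f(x) and f(y) are the page counts for each term, f(x, y) is the count for pages containing both, and N is the number of pages indexed. The function name `ngd`, its parameter names, the handling of zero counts, and the sample counts below are assumptions made for illustration; no search engine is queried.

```python
import math


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google distance from raw page counts (illustrative sketch).

    fx, fy -- page counts for each term alone
    fxy    -- page count for pages containing both terms
    n      -- total number of indexed pages (a normalizing constant)
    """
    # Assumption of this sketch: treat terms that never occur, or never
    # co-occur, as infinitely far apart instead of taking log(0).
    if min(fx, fy, fxy) <= 0:
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Illustrative counts only, not live search results.
distance = ngd(fx=46_700_000, fy=12_200_000, fxy=2_630_000, n=8_058_044_651)
print(round(distance, 3))
```

With these sample counts the distance comes out at roughly 0.44. Terms that always co-occur (f(x) = f(y) = f(x, y)) get distance 0, and the value grows as the joint count shrinks relative to the individual counts.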

Journal: IEEE Transactions on Knowledge and Data Engineering
Publication Date: Mar 1, 2007
Citations: 1682

Similar Papers

Minimum Normalized Google Distance for Unsupervised Multilingual Chinese-English Word Sense Disambiguation
Pengyuan Liu ... Shiqi Li
01 Dec 2010

Bioinformatics: Searching the net
Steven Kastin ... John Wexler
Seminars in Nuclear Medicine | VOL. 28
01 Apr 1998

Internet Search Engines
Vijay Kasi ... Radhika Jain
01 Jan 2006

Internet Search Engines
Vijay Kasi ... Radhika Jain
18 Jan 2011
