Abstract
Historical interpretation benefits from identifying analogies among famous people: who are the Lincolns, Einsteins, Hitlers, and Mozarts? As a knowledge source underlying many applications in language processing and knowledge representation, Wikipedia provides the information needed to make such comparisons. We investigate several approaches to converting the Wikipedia pages of approximately 600,000 historical figures into vector representations that quantify similarity.

Wikipedia pages are also assigned to categories according to their contents, providing human-annotated labels. A rough similarity estimate could simply count the number of shared Wikipedia categories. However, such counting neither quantifies similarity well (is every pair sharing the same number of categories equally similar?) nor distinguishes the importance of different categories (is "US Presidents" more informative than "State Lawyers" when defining similarity?). We therefore use category counting only as an indicator of high-level agreement with our similarity detection algorithms.

In particular, we investigate four unsupervised approaches to representing the semantic associations of individuals: (1) TF-IDF, (2) weighted averages of distributed word embeddings, (3) LDA topic analysis, and (4) DeepWalk graph embeddings built from page links. All proved effective, but the DeepWalk embedding yielded the best overall accuracy of 88.23% in our evaluation. Combining LDA and DeepWalk yielded even higher performance.

Finally, we demonstrate that our similarity measurements can also be used to recognize the most descriptive Wikipedia categories for historical figures. We rank Wikipedia categories by their categorical coherence, and our ranking agrees with 88.27% of human crowdsourced judgments.

Highlights

- We use labeled features from Wikipedia to generate effective evaluation standards.
- The best approach, DeepWalk, exploits the graph structure of page links.
- We provide an interactive demo at http://peoplesimilarity.appspot.com/.
- We identify the best distance function for each single model.
- We combine models to balance graph structure and semantics.
- We identify the most salient categories associated with Wikipedia entities.
- We collect human responses from Crowdflower for verification.
- We have also fashioned an iOS game app (FameMatch, available on iTunes) for testing.
- Our ranking of Wikipedia categories agrees with 88.27% of human judgments.
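As a minimal illustration of the vector-based approach described in the abstract, the sketch below builds TF-IDF vectors from page texts and scores a pair of figures by cosine similarity, with the shared-category count shown as the rough baseline the abstract critiques. The inputs (`page_texts`, `page_categories`) and the snippets inside them are hypothetical placeholders, not data or code from the paper.

```python
# Illustrative sketch (not the paper's code): TF-IDF vectors + cosine
# similarity for pairs of historical figures, with a shared-category
# count as the rough baseline discussed in the abstract.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical inputs: plain text and category labels per Wikipedia page.
page_texts = {
    "Abraham Lincoln": "16th president of the United States, lawyer ...",
    "George Washington": "1st president of the United States, general ...",
    "Wolfgang Amadeus Mozart": "prolific composer of the Classical era ...",
}
page_categories = {
    "Abraham Lincoln": {"Presidents of the United States", "Illinois lawyers"},
    "George Washington": {"Presidents of the United States"},
    "Wolfgang Amadeus Mozart": {"Classical-era composers"},
}

names = list(page_texts)
tfidf = TfidfVectorizer(stop_words="english").fit_transform(
    [page_texts[n] for n in names]
)
sims = cosine_similarity(tfidf)  # pairwise similarity matrix

def shared_categories(a, b):
    """Baseline: count the categories two figures have in common."""
    return len(page_categories[a] & page_categories[b])

i, j = names.index("Abraham Lincoln"), names.index("George Washington")
print(f"TF-IDF cosine: {sims[i, j]:.3f}, "
      f"shared categories: {shared_categories(names[i], names[j])}")
```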
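The abstract's best-performing model, DeepWalk, embeds pages by running truncated random walks over the page-link graph and feeding the walks to a skip-gram model as if they were sentences. Below is a hedged sketch of that idea using networkx and gensim; the toy edge list, library choices, and parameter values are assumptions for illustration, not the paper's configuration.

```python
# Hedged DeepWalk-style sketch: random walks over a page-link graph,
# then skip-gram embeddings over the walks. Names and sizes are toy values.
import random
import networkx as nx
from gensim.models import Word2Vec

# Toy undirected link graph; a real run would cover ~600,000 pages.
G = nx.Graph([
    ("Abraham Lincoln", "American Civil War"),
    ("Abraham Lincoln", "Presidents of the United States"),
    ("George Washington", "Presidents of the United States"),
    ("George Washington", "American Revolutionary War"),
])

def random_walks(graph, num_walks=10, walk_len=40):
    """Generate truncated random walks, one corpus 'sentence' per walk."""
    walks = []
    nodes = list(graph.nodes)
    for _ in range(num_walks):
        random.shuffle(nodes)  # new start order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_len:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append(walk)
    return walks

# Skip-gram (sg=1) over the walks, as in DeepWalk; dimensions are illustrative.
model = Word2Vec(random_walks(G), vector_size=64, window=5, min_count=0, sg=1)
print(model.wv.most_similar("Abraham Lincoln", topn=2))
```

On this toy graph, figures sharing link neighborhoods (here, the two presidents) end up with nearby embeddings, which is the structural signal the paper's DeepWalk model exploits at Wikipedia scale.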