Abstract

Person name disambiguation on the Web (PNDW) consists of grouping the Web pages retrieved by a search engine when a person’s name is queried according to the individuals they refer to. This problem is of interest to the research community because Internet users often search for information about people on search engines, and also because people’s names are a very ambiguous type of named entity. In addition, the Web domain presents several challenges for natural language processing and information retrieval methods. In this paper, we classify PNDW systems according to their main characteristics: 1) features used to identify different individuals with the same name; 2) mathematical models used to represent the search results; 3) clustering algorithms used to group the Web pages; 4) methods used to address the impact of Web pages from social networking sites; and 5) methods used to deal with the multilingual nature of the Web. Also, we present the data sets most widely used to evaluate PNDW systems. Finally, we analyze the results obtained by the best PNDW systems in the literature.

Highlights

  • Person name disambiguation has attracted interest from the Natural Language Processing (NLP), Information Retrieval (IR) and Text Mining (TM) communities because people's names are a highly ambiguous type of Named Entity (NE).

  • Evaluation metrics. The performance of PNDW systems has been measured with extrinsic evaluation metrics used in clustering problems for two main reasons: (i) PNDW corpora have associated gold standards annotated by experts; and (ii) PNDW has been formalized as a clustering problem.

  • This is for two reasons: (i) these systems have been evaluated only on the Web People Search (WePS) corpora, because the University of Amsterdam (UvA) and MC4WePS corpora are more recent and most PNDW systems have not yet been evaluated on them; and (ii) these systems have been trained on some of the WePS data sets in order to be evaluated on the other WePS collections.
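
To make the idea of an extrinsic clustering metric concrete, the sketch below compares a system clustering of search results against a gold standard using B-Cubed precision, recall and F1. The choice of B-Cubed is an assumption for illustration, since the highlight above does not name a specific metric. The function name `bcubed`, the page ids and the cluster labels are all hypothetical, and the sketch assumes that every page refers to exactly one individual (no overlapping clusters).

```python
from collections import defaultdict


def bcubed(system, gold):
    """Return (precision, recall, f1) for two flat clusterings.

    system, gold: dicts mapping each page id to one cluster label.
    Assumption: each page belongs to exactly one cluster in both inputs.
    """
    if set(system) != set(gold):
        raise ValueError("system and gold must label the same pages")
    if not gold:
        raise ValueError("no pages to evaluate")

    # Invert the label maps: label -> set of pages carrying that label.
    sys_members = defaultdict(set)
    gold_members = defaultdict(set)
    for page, label in system.items():
        sys_members[label].add(page)
    for page, label in gold.items():
        gold_members[label].add(page)

    precision = recall = 0.0
    for page in gold:
        cluster = sys_members[system[page]]    # pages grouped with this page
        category = gold_members[gold[page]]    # pages about the same person
        correct = len(cluster & category)      # always >= 1 (the page itself)
        precision += correct / len(cluster)
        recall += correct / len(category)

    n = len(gold)
    precision /= n
    recall /= n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: five result pages for one ambiguous name.
gold = {"p1": "A", "p2": "A", "p3": "A", "p4": "B", "p5": "B"}
system = {"p1": "c1", "p2": "c1", "p3": "c2", "p4": "c2", "p5": "c2"}
print(bcubed(system, gold))
```

Because the score is computed per page and then averaged, one misplaced page (here `p3`) lowers both precision and recall, which is one reason metrics of this kind need an expert-annotated gold standard.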

Summary

Introduction

Person name disambiguation has attracted interest from the Natural Language Processing (NLP), Information Retrieval (IR) and Text Mining (TM) communities because people's names are a highly ambiguous type of Named Entity (NE). Since 2009, the Text Analysis Conferences (TAC) have organized tasks on the entity linking problem, recently renamed entity discovery and linking. The goal is to link mentions of an entity in a document to entities in a reference knowledge base, usually Wikipedia, or to detect new entities. He et al. [1] and Grütze et al. [2] have presented data sets for entity linking composed exclusively of person names. Person name disambiguation has been addressed in the news domain because people are often at the core of the events reported in the

