Abstract

Personal name disambiguation is an important problem in natural language processing and underlies many automatic information processing tasks. This research explores Chinese personal name disambiguation based on clustering. First, preprocessing transforms the raw corpus into a standardized format. Lexical analysis then performs Chinese word segmentation, part-of-speech tagging, and named entity recognition. Next, we extract features that better discriminate between Chinese personal names, and we create rules for identifying target personal names to improve the experimental results. Several feature-weighting methods are implemented, including Boolean weight, absolute frequency weight, tf-idf weight, and entropy weight. For clustering, agglomerative hierarchical clustering is selected after comparison with other clustering methods. Finally, a labeling approach extracts feature words that represent each cluster. The experiment achieves good results on five groups of Chinese personal names.
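
The weighting and clustering steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy documents, the function names (`feature_weights`, `cosine`, `agglomerative`), the average-link merge criterion, the cosine similarity measure, and the stopping threshold of 0.2 are all assumptions made for illustration. The tf-idf variant shown (raw count times log(N/df)) is one common form, and entropy weighting is omitted because its exact definition is not given here.

```python
import math
from collections import Counter


def feature_weights(docs, scheme="tfidf"):
    """Turn tokenized documents into sparse {feature: weight} vectors."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each feature occurs.
    doc_freq = Counter(feat for doc in docs for feat in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        if scheme == "bool":
            vec = {f: 1.0 for f in counts}                 # present or absent
        elif scheme == "freq":
            vec = {f: float(c) for f, c in counts.items()}  # absolute frequency
        elif scheme == "tfidf":
            # Assumed variant: raw term count times log(N / df).
            vec = {f: c * math.log(n_docs / doc_freq[f]) for f, c in counts.items()}
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def agglomerative(vectors, threshold):
    """Average-link agglomerative clustering over document indices.

    Repeatedly merges the most similar pair of clusters and stops when the
    best average similarity falls below `threshold` (a hypothetical cutoff).
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sims = [cosine(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b]]
                avg = sum(sims) / len(sims)
                if avg > best_sim:
                    best_sim, best_pair = avg, (a, b)
        if best_sim < threshold:
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy context words around one ambiguous name (invented data).
docs = [
    ["professor", "university", "physics"],
    ["university", "physics", "lecture"],
    ["striker", "football", "club"],
    ["football", "club", "goal"],
]
clusters = agglomerative(feature_weights(docs, "tfidf"), threshold=0.2)
print(clusters)
```

In this toy run the two academic contexts and the two football contexts share weighted features with each other but none across groups, so the documents separate into two clusters. In practice the threshold and the linkage criterion would need tuning on labeled data.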

Highlights

  • The ambiguity of named entities is a prevalent phenomenon in natural language

  • Part-of-speech tagging and named entity recognition are performed on the corpus

  • This paper studied the task of Chinese personal name disambiguation based on an unsupervised method

Summary

Introduction

The ambiguity of named entities is a prevalent phenomenon in natural language. Personal names in texts and web pages are highly ambiguous, especially in Chinese datasets. For example, the Chinese personal name “Gao Jun (高军)” has 51 entries in the Baidu Encyclopedia. Eliminating this ambiguity benefits many tasks, such as information retrieval and data summarization. When a person's name is searched on the Internet, for instance, the search engine returns documents about different people who share that name. It is therefore necessary to divide the documents into clusters automatically and to extract the key information of each cluster. This research focuses on this important task and attempts to solve it with unsupervised approaches.
