Abstract
A unique personal identifier is a prerequisite for running person-centric analytical queries and data mining tasks such as fraud detection, expert finding, and credit scoring. Personal names are the most commonly used identifiers of individuals in datasets; however, a person's name may not be unique across a dataset's records, especially when the data are integrated from various sources. Intelligent systems use name matching methods to identify the different name representations of the same person. Previous name matching methods perform inadequately because they consider only name similarities and ignore dissimilarities. A major obstacle to considering dissimilarities is that Part of Name (PON) tags, e.g., first name and last name, are often unavailable. To address this issue, this paper proposes Swash, an unsupervised personal name matching framework. Swash models the information that can be gathered from a name dataset as a layered Heterogeneous Information Network, which facilitates control over the learning process. It predicts PON tags using a self-trainable algorithm and then collectively clusters the name vertices on the network. Evaluations on three public bibliographic datasets (CiteSeer, ArXiv, and DBLP) confirm the significance of the proposed framework. The results show that Swash improves on the F1 score of the state-of-the-art method by up to 38%.