Abstract: Demand is growing for automated validation of new customer company accounts, particularly for resolving name variations within large databases. This demand follows from the scale at which new-customer data is entered into CRM systems, which currently requires manual validation and correction. The primary objective of this survey paper is to identify machine learning methods that can resolve name variations in large databases. To this end, we propose a multi-step approach. First, we apply a computationally fast approximate string matching algorithm to identify matches with very high accuracy. When no such match is found in the system, we use named entity recognition to annotate company names and their suffixes, such as INC, PVT, LTD, CORP, and CO, which may appear in various forms. To resolve ambiguous abbreviations, we explore machine learning algorithms, including the naïve Bayes classifier, decision trees, and support vector machines. We conclude by presenting candidate approximate string matching algorithms, a named entity recognition method, and a model for abbreviation disambiguation. Our review provides an overview of the current state of research in this area and highlights gaps in existing knowledge, offering direction for future research and practical application.
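
As a rough illustration of the first two steps described in the abstract, the following Python sketch tries a fast approximate match first and falls back to suffix annotation when no match clears the threshold. It is only a sketch under stated assumptions: `difflib.SequenceMatcher` stands in for whichever approximate string matching algorithm is chosen, a fixed lookup table stands in for a trained named entity recognition model, and the 0.9 threshold, the suffix table, and the function names `split_name` and `resolve` are all hypothetical. The third step, a classifier for ambiguous abbreviations, is not shown.

```python
import re
from difflib import SequenceMatcher

# Hypothetical suffix table (assumption): variant spellings per canonical suffix.
SUFFIX_VARIANTS = {
    "INC": ("inc", "incorporated"),
    "PVT": ("pvt", "private"),
    "LTD": ("ltd", "limited"),
    "CORP": ("corp", "corporation"),
    "CO": ("co", "company"),
}
_CANONICAL = {
    variant: canon
    for canon, variants in SUFFIX_VARIANTS.items()
    for variant in variants
}


def split_name(raw):
    """Split a raw company name into (core name, canonical suffix list)."""
    tokens = re.findall(r"[a-z0-9&]+", raw.lower())
    suffixes = []
    # Peel suffix tokens off the end only, so a "co" inside the name is kept.
    while len(tokens) > 1 and tokens[-1] in _CANONICAL:
        suffixes.insert(0, _CANONICAL[tokens.pop()])
    return " ".join(tokens), suffixes


def similarity(a, b):
    """Similarity ratio in [0, 1]; a stand-in for the chosen matcher."""
    return SequenceMatcher(None, a, b).ratio()


def resolve(raw, known_names, threshold=0.9):
    """Return the matching known name, or None if nothing clears the threshold."""
    if not known_names:
        return None
    # Step 1: fast approximate match on the lowercased raw strings.
    best = max(known_names, key=lambda k: similarity(raw.lower(), k.lower()))
    if similarity(raw.lower(), best.lower()) >= threshold:
        return best
    # Step 2: annotate suffixes, then compare core names with equal suffixes.
    core, suffixes = split_name(raw)
    for known in known_names:
        known_core, known_suffixes = split_name(known)
        if suffixes == known_suffixes and similarity(core, known_core) >= threshold:
            return known
    return None
```

With a known list such as `["Acme Private Limited", "Globex Corporation"]`, the input `"Acme Pvt. Ltd."` fails the first step because the raw strings differ too much, but matches in the second step once both suffix sequences are normalized to `PVT`, `LTD`. An input with no counterpart, such as `"Initech Inc"`, returns `None` and would be left for manual review or a later disambiguation step.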