Abstract

This paper explores record linkage, a step of the data integration process, with a focus on the entity company. For the integration of company data, the company name is a crucial attribute, and it often includes the legal form. This legal form is not represented concisely and consistently across different data sources, which leads to considerable data quality problems in the subsequent steps of record linkage. To solve these problems, we classify and extract the legal form from the attribute company name. For this purpose, we iteratively developed four different approaches and compared them in a benchmark. The best approach is a hybrid approach combining a rule set and a supervised machine learning model. With our hybrid approach, any company data set from research or business can be processed. Thus, the data quality for subsequent data processing steps such as record linkage can be improved. Furthermore, our approach can be adapted to solve the same data quality problems in other attributes.

Highlights

  • No benchmark dataset for company entity matching exists that contains these record linkage (RL) challenges. We show that this problem can be solved with our Hybrid approach, consisting of a set of rules and a supervised machine learning (ML) method, or with our Deep Learning (DL) approach

  • We show that general data quality problems with the concise and consistent representation of attributes could be solved with such approaches

  • The integration of the data sources is enabled by the data integration process, which consists of the process steps (1) schema matching, (2) record linkage (RL), and (3) data fusion [7]

Summary

Motivation and problem statement

Companies try to integrate data into their decision-making processes as efficiently as possible to achieve corporate added value. The RL step matches data records from different data sources that refer to the same real-world entity, such as companies, products, or persons [7]. We identified these challenges through our data-driven inductive research method [14]. This method describes our approach of analysing our eleven existing data sources (see Table 1) and integrating several of them through an RL process to find general RL challenges for the real-world entity company. Company names can be represented differently in various databases because the legal form is represented inconsistently. This makes company entity matching more difficult. Our paper aims to develop an approach that classifies and extracts the legal form from the company name to improve data quality and support further data processing steps such as RL.
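To make the problem concrete, the following is a minimal sketch of the general idea of extracting a legal form from a company name with a small rule set and deferring to a learned classifier when no rule fires. It is not the rule set or model from the paper: the rule table `LEGAL_FORM_RULES`, the function `extract_legal_form`, and the `fallback` hook are hypothetical names chosen for illustration, and the four legal forms listed are only examples.

```python
import re

# Hypothetical, deliberately tiny rule set (an assumption for illustration):
# canonical legal form -> regex alternation of spelling variants.
LEGAL_FORM_RULES = {
    "GmbH": r"gmbh|gesellschaft\s+mit\s+beschr(?:ä|ae)nkter\s+haftung",
    "AG": r"ag|aktiengesellschaft",
    "Ltd": r"ltd\.?|limited",
    "Inc": r"inc\.?|incorporated",
}

# Compile each rule so that it only matches whole tokens, case-insensitively.
_COMPILED = {
    form: re.compile(rf"(?<!\w)(?:{pattern})(?!\w)", re.IGNORECASE)
    for form, pattern in LEGAL_FORM_RULES.items()
}


def extract_legal_form(name, fallback=None):
    """Return (legal_form_or_None, company_name_without_legal_form).

    Rules are tried first. If none matches, an optional `fallback`
    callable (standing in for a supervised classifier) labels the name.
    """
    for form, regex in _COMPILED.items():
        match = regex.search(name)
        if match:
            # Cut the matched variant out and tidy up leftover separators.
            core = name[:match.start()] + " " + name[match.end():]
            core = re.sub(r"\s+", " ", core).strip(" ,.-")
            return form, core
    if fallback is not None:
        return fallback(name), name.strip()
    return None, name.strip()


# Two spellings of the same company end up with the same matching key.
a = extract_legal_form("Muster Gesellschaft mit beschränkter Haftung")
b = extract_legal_form("Muster GmbH")
same_company_key = a == b
```

In this sketch, both spellings reduce to the same pair of legal form and core name, which is the kind of normalisation that can simplify a later RL comparison. A real rule set would need to handle many more legal forms, combined forms, and names in which a token such as "AG" is not a legal form at all.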
