Developing a Legal Form Classification and Extraction Approach for Company Entity Matching

Felix Kruse, Peter Loos, Jorge Marx Gómez, Jan-Philipp Awick

doi:10.52825/bis.v1i.44

Abstract

This paper explores record linkage, one step of the data integration process, and focuses on the entity company. For the integration of company data, the company name is a crucial attribute, and it often includes the legal form. This legal form is not represented concisely and consistently across different data sources, which leads to considerable data quality problems for the subsequent process steps in record linkage. To solve these problems, we classify and extract the legal form from the attribute company name. For this purpose, we iteratively developed four different approaches and compared them in a benchmark. The best approach is a hybrid approach combining a rule set and a supervised machine learning model. With our hybrid approach, any company data set from research or business can be processed. Thus, the data quality for subsequent data processing steps such as record linkage can be improved. Furthermore, our approach can be adapted to solve the same data quality problems in other attributes.

Highlights

No benchmark dataset for company entity matching exists that contains these record linkage (RL) challenges. We show that this problem can be solved with our hybrid approach, consisting of a set of rules and a supervised machine learning (ML) method, or with our deep learning (DL) approach.
We show that general data quality problems with the concise and consistent representation of attributes could be solved with such approaches.
The integration of the data sources is enabled by the data integration process, which consists of the process steps (1) schema matching, (2) record linkage (RL), and (3) data fusion [7].

Summary

Motivation and problem statement

Companies try to integrate data into their decision-making processes as efficiently as possible to create corporate added value. The RL step matches data records from different data sources that refer to the same real-world entity, such as companies, products, or persons [7, p. We identified these challenges through our data-driven inductive research method [14]. This method describes our approach of analysing our eleven existing data sources (see table 1) and integrating several of them through an RL process to find general RL challenges for the real-world entity company. Company names can be represented differently in various databases due to the inconsistent representation of the legal form. This makes company entity matching more difficult. Our paper aims to develop an approach that classifies and extracts the legal form from the company name to improve data quality and support further data processing steps such as RL.
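To make the problem concrete, the following is a minimal sketch of a rule-based legal form extraction, assuming a small hand-written rule set. It covers only the rule part of a hybrid approach; the supervised ML model is not shown. The names `LEGAL_FORM_RULES`, `extract_legal_form`, and `match_key`, as well as the patterns themselves, are hypothetical illustrations and not the rule set developed in the paper.

```python
import re

# Hypothetical rule set: canonical legal form -> regex for spelling variants.
# More specific forms come first so "GmbH & Co. KG" is not cut down to "KG"
# or "GmbH". These patterns are illustrative assumptions only.
LEGAL_FORM_RULES = {
    "GmbH & Co. KG": r"gmbh\s*(?:&|und|u\.)\s*co\.?\s*kg",
    "GmbH": r"gmbh|gesellschaft\s+mit\s+beschr(?:ä|ae)nkter\s+haftung",
    "AG": r"ag|aktiengesellschaft",
    "Ltd.": r"ltd\.?|limited",
    "Inc.": r"inc\.?|incorporated",
}

# A legal form only counts if it ends the name and follows a space or comma,
# so the "ag" in "Verlag" is not mistaken for the legal form "AG".
COMPILED_RULES = [
    (canonical, re.compile(rf"[\s,]+(?:{pattern})\s*$", re.IGNORECASE))
    for canonical, pattern in LEGAL_FORM_RULES.items()
]


def extract_legal_form(name):
    """Return (base_name, legal_form); legal_form is None if no rule fires."""
    cleaned = name.strip()
    for canonical, regex in COMPILED_RULES:
        match = regex.search(cleaned)
        if match:
            return cleaned[:match.start()].rstrip(" ,"), canonical
    return cleaned, None


def match_key(name):
    """Build a comparison key from the case-folded base name and legal form."""
    base, form = extract_legal_form(name)
    return base.casefold(), form


if __name__ == "__main__":
    for raw in ["MUSTER GMBH", "Muster Gesellschaft mit beschränkter Haftung",
                "Muster GmbH & Co. KG", "Schmidt Verlag"]:
        print(raw, "->", extract_legal_form(raw))
```

With such a key, "MUSTER GMBH" and "Muster Gesellschaft mit beschränkter Haftung" compare as equal, while names for which no rule fires are returned unchanged with `None` as the legal form. In a hybrid setup, those unresolved names are the ones a trained classifier would have to handle.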



Journal: Business Information Systems
Publication Date: Jul 2, 2021
Citations: 2
License type: CC BY 4.0
