A Corpus-Based Sentence Classifier for Entity–Relationship Modelling

Sabrina Šuman, Alen Jakupović, Sanja Čandrlić

doi:10.3390/electronics11060889

Open Access

https://doi.org/10.3390/electronics11060889

Journal: Electronics	Publication Date: Mar 11, 2022
Citations: 1	License type: CC BY 4.0

Affiliation: Polytechnic of Rijeka, University of Rijeka

Abstract

Automated creation of a conceptual data model from user requirements expressed as natural language text is a challenging research area. The complexity of natural language requires deep insight into the semantics buried in words, expressions, and string patterns. For the purpose of natural language processing, we created a corpus of business descriptions and an accompanying lexicon containing all the words in the corpus. This made it possible to define rules for the automatic translation of business descriptions into the entity–relationship (ER) data model. However, since the translation rules do not always lead to accurate translations, we created an additional classification layer: a classifier that assigns one or more of the defined ER method classes to each input sentence. The classifier represents the formalized knowledge of four data modelling experts. This rule-based classification process is based on the extraction of ER information from a given sentence. Following a detailed description, the classification process was evaluated using the standard multiclass performance measures: recall, precision, and accuracy. The accuracy was 96.77% in the learning phase and 95.79% in the testing phase.
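The multiclass measures named in the abstract can be illustrated with a minimal sketch. The function name `multiclass_metrics` and the class labels below are assumptions made for illustration only; they are not the ER method classes or the evaluation code used in the paper.

```python
from collections import Counter


def multiclass_metrics(true_labels, predicted_labels):
    """Return (accuracy, per-class precision, per-class recall)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    classes = sorted(set(true_labels) | set(predicted_labels))
    # Count every (true, predicted) pair: a sparse confusion matrix.
    pair_counts = Counter(zip(true_labels, predicted_labels))
    correct = sum(pair_counts[(c, c)] for c in classes)
    accuracy = correct / len(true_labels) if true_labels else 0.0
    precision, recall = {}, {}
    for c in classes:
        true_positives = pair_counts[(c, c)]
        predicted_as_c = sum(pair_counts[(t, c)] for t in classes)
        actually_c = sum(pair_counts[(c, p)] for p in classes)
        precision[c] = true_positives / predicted_as_c if predicted_as_c else 0.0
        recall[c] = true_positives / actually_c if actually_c else 0.0
    return accuracy, precision, recall


# Hypothetical labels, not the classes defined in the paper.
y_true = ["entity", "relationship", "attribute", "entity", "relationship"]
y_pred = ["entity", "relationship", "entity", "entity", "attribute"]
accuracy, precision, recall = multiclass_metrics(y_true, y_pred)
```

Precision for a class is the share of sentences assigned to it that truly belong to it, recall is the share of its true sentences that were found, and accuracy is the overall share of correct assignments.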

Highlights

We used methods from the NLP (natural language processing) field in the development of an automated (knowledge-based) system to support the creation of ER (entity–relationship) data models
The work includes current research activities and results of the development of a knowledge-based system to support the creation of ER models

Summary

Introduction

We used methods from the NLP (natural language processing) field in the development of an automated (knowledge-based) system to support the creation of ER (entity–relationship) data models. To analyse natural language more deeply, a linguistic corpus was created in the previous research phase; it contains a repository of business descriptions (BDs), BD sentences, words, and POS (part-of-speech) tags. The purpose of creating the corpus was to define a set of translation rules that enable the translation of text-expressed BDs into a text-expressed (formal language) ER data model. This section describes some important concepts such as classification, pattern recognition, and recent text analysis and classification methods, as well as some existing solutions for automated translation of natural language text into an ER data model. Classifications, whose aim is specific, are typically created by an individual (e.g., a scientist, in order to describe a part of reality that is being explored).
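One way to picture the corpus described above (business descriptions, their sentences, words, and POS tags) is the following minimal sketch. All type names, the `build_lexicon` helper, the tag values, and the sample sentence are assumptions for illustration; the paper's actual corpus schema is not given here.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    word: str
    pos: str  # part-of-speech tag; the tag set used here is an assumption


@dataclass
class Sentence:
    tokens: list[Token]


@dataclass
class BusinessDescription:
    title: str
    sentences: list[Sentence] = field(default_factory=list)


def build_lexicon(descriptions):
    """Map each lower-cased word in the corpus to the POS tags seen with it."""
    lexicon = {}
    for bd in descriptions:
        for sentence in bd.sentences:
            for token in sentence.tokens:
                lexicon.setdefault(token.word.lower(), set()).add(token.pos)
    return lexicon


# Hypothetical business description, not taken from the paper's corpus.
bd = BusinessDescription(
    title="Library",
    sentences=[
        Sentence([Token("Members", "NOUN"), Token("borrow", "VERB"), Token("books", "NOUN")])
    ],
)
lexicon = build_lexicon([bd])
```

Keeping the POS tag on each token means that translation rules could, in principle, be written against tag patterns rather than against specific words.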

Objectives

Methods

Results

Conclusion