Design and Development of Schema for Schemaless Databases
A schema database functions as a repository for interconnected data points, facilitating comprehension of data structures by organizing information into tables with rows and columns. These databases use established connections to arrange data, with attribute values linking related tuples. This integrated approach to data management and distributed processing enables schema databases to maintain models even when the working-set size surpasses available RAM. However, challenges such as data quality, storage, the scarcity of data science professionals, data validation, and sourcing from diverse origins persist. Notably, while schema databases excel at reviewing transactions, they often fall short in updating them effectively. To address these issues, a Chimp-based radial basis neural model (CbRBNM) is employed. Initially, the schemaless database was considered and integrated into the Python system. Subsequently, compression functions were applied to both schema and schemaless databases to optimize relational data size by eliminating redundant files. Performance validation involved calculating compression parameters, with the proposed method achieving memory usage of 383.37 MB, a computation time of 0.455 s, a training time of 167.5 ms, and a compression rate of 5.60%. Extensive testing demonstrates that CbRBNM yields a favorable compression ratio and enables direct searching on compressed data, thereby enhancing query performance.
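The two-stage pipeline the abstract describes (eliminate redundant records, then compress, then report a compression rate) can be illustrated with a small, hypothetical Python sketch. This is not the CbRBNM model itself; the deduplication-plus-`zlib` approach and the record layout are stand-ins for the paper's compression functions.

```python
import json
import zlib

def compress_records(records):
    """Deduplicate redundant records, then apply generic byte compression.

    Illustrative stand-in for the redundancy-elimination + compression
    stages described in the abstract, not the actual CbRBNM method.
    """
    # Redundancy-removal step: keep one copy of each distinct record.
    unique = sorted({json.dumps(r, sort_keys=True) for r in records})
    raw = "\n".join(json.dumps(r, sort_keys=True) for r in records).encode()
    packed = zlib.compress("\n".join(unique).encode(), 9)
    # Compression rate: compressed size as a percentage of the original.
    rate = 100.0 * len(packed) / len(raw)
    return packed, rate

# A toy schemaless collection with heavy redundancy (only 3 distinct rows).
records = [{"id": i % 3, "name": "row"} for i in range(100)]
packed, rate = compress_records(records)
```

A lower rate means a smaller compressed representation relative to the raw data, matching the sense in which the abstract reports 5.60%.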
- Research Article
23
- 10.1002/spe.820
- May 29, 2007
- Software: Practice and Experience
In this paper we consider an approach to developing complex database schemas. Apart from the theoretical model of the approach, we also developed a CASE tool named Integrated Information Systems*Case, R.6.2 (IIS*Case) that supports the practical application of the approach. In this paper the basis of our approach to the design and integration of database schemas and ways of using IIS*Case is outlined. The main features of a new version of IIS*Case, developed in Java, are described. IIS*Case is based on the concept of ‘form type’ and supports the conceptual modelling of a database schema, generating subschemas and integrating them into a relational database schema in 3NF. IIS*Case provides an intelligent support for complex and highly formalized design and programming tasks. Having an advanced knowledge of information systems and database design is not a compulsory prerequisite for using IIS*Case. IIS*Case is based on a methodology of gradual integration of independently designed subschemas into a database schema. The process of independent subschema design may lead to collisions in expressing real‐world constraints. IIS*Case uses specialized algorithms for checking the consistency of constraints embedded in a database schema and its subschemas. This paper briefly outlines the application of the process of detecting collisions, and actions the designer may take to resolve them. Copyright © 2007 John Wiley & Sons, Ltd.
- Research Article
12
- 10.1007/s41870-020-00515-8
- Sep 20, 2020
- International Journal of Information Technology
Database schema design has significant importance in software design. Many tools and methods are available for schema design in RDBMSs, but limited attention has been given to schema design in NoSQL, an emerging database technology. NoSQL requires a different approach to designing an efficient schema; for example, in a document database one must decide which information should be stored as an embedded document and which as a referenced document. There are certain rules of thumb for schema design in NoSQL databases. In reengineering projects, especially when moving from an old RDBMS to a new NoSQL system, developing a good and efficient database schema is a very difficult task. In this paper, we propose a schema design advisor model which uses the existing software's SQL query load as input, along with an algorithm for schema design recommendation. We also propose a cost model for the various schemas created by the recommendation model. The proposed model is implemented through a prototype for the MongoDB document database in Java. The prototype produces all possible combinations of schemas and calculates the cost of each schema. The automated schema design process produces all possible combinations of NoSQL schemas, which is difficult with a manual schema design approach.
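The embed-versus-reference decision the abstract mentions can be sketched as a toy cost model. The paper's prototype is in Java and its actual cost model is not reproduced here; the weights below are hypothetical, loosely following the common MongoDB rule of thumb that embedding favors read-heavy workloads (one fetch per query) while referencing favors write-heavy workloads (smaller documents to rewrite).

```python
def schema_cost(read_ratio, doc_size_bytes, embedded):
    """Toy cost model for embedded vs. referenced documents.

    All weights are illustrative assumptions, not the paper's model.
    """
    if embedded:
        read_cost = 1.0                     # single document fetch
        write_cost = doc_size_bytes / 1024  # whole large doc rewritten
    else:
        read_cost = 2.0                     # extra lookup to follow the ref
        write_cost = 0.5                    # only a small document rewritten
    write_ratio = 1.0 - read_ratio
    return read_ratio * read_cost + write_ratio * write_cost

def recommend(read_ratio, doc_size_bytes):
    """Pick the cheaper schema option under the toy model."""
    embed = schema_cost(read_ratio, doc_size_bytes, embedded=True)
    ref = schema_cost(read_ratio, doc_size_bytes, embedded=False)
    return "embed" if embed <= ref else "reference"
```

Under these assumed weights, a read-heavy workload (`recommend(0.9, 2048)`) favors embedding, while a write-heavy workload over large documents (`recommend(0.2, 8192)`) favors referencing.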
- Conference Article
29
- 10.1145/3340482.3342743
- Aug 27, 2019
Data validation is an essential requirement to ensure the reliability and quality of Machine Learning-based Software Systems. However, an exhaustive validation of all data fed to these systems (i.e. up to several thousand features) is practically unfeasible. In addition, there has been little discussion about methods that support software engineers of such systems in determining how thorough to validate each feature (i.e. data validation rigor). Therefore, this paper presents a conceptual data validation approach that prioritizes features based on their estimated risk of poor data quality. The risk of poor data quality is determined by the probability that a feature is of low data quality and the impact of this low (data) quality feature on the result of the machine learning model. Three criteria are presented to estimate the probability of low data quality (Data Source Quality, Data Smells, Data Pipeline Quality). To determine the impact of low (data) quality features, the importance of features according to the performance of the machine learning model (i.e. Feature Importance) is utilized. The presented approach provides decision support (i.e. data validation prioritization and rigor) for software engineers during the implementation of data validation techniques in the course of deploying a trained machine learning model and its software stack.
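The prioritization idea above (risk = probability of low data quality times the feature's impact on the model) can be sketched in a few lines of Python. The equal-weight averaging of the three probability criteria and the field names are assumptions for illustration, not the paper's exact formulation.

```python
def validation_priority(features):
    """Rank features for data validation by estimated risk.

    risk = P(low data quality) * feature importance.
    P(low quality) is a placeholder average of the three criteria the
    abstract names (data source quality, data smells, pipeline quality).
    """
    ranked = []
    for name, f in features.items():
        p_low = (f["source_risk"] + f["smell_risk"] + f["pipeline_risk"]) / 3
        ranked.append((name, round(p_low * f["importance"], 3)))
    # Highest-risk features should be validated first and most rigorously.
    return sorted(ranked, key=lambda x: x[1], reverse=True)

# Hypothetical feature profiles: risks and importance scores in [0, 1].
features = {
    "age":     {"source_risk": 0.2, "smell_risk": 0.1,
                "pipeline_risk": 0.1, "importance": 0.9},
    "zipcode": {"source_risk": 0.8, "smell_risk": 0.6,
                "pipeline_risk": 0.4, "importance": 0.3},
}
ranked = validation_priority(features)
```

Note how a moderately important feature with poor quality signals (`zipcode`) can outrank a highly important but clean one (`age`).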
- Conference Article
7
- 10.1109/bigdata50022.2020.9378228
- Dec 10, 2020
Database schemas evolve over time to satisfy changing application requirements. If this evolution is not performed correctly, some quality attributes are at risk such as data integrity, functional correctness, or maintainability. To help developer teams in the design of database schemas, several design methodologies for NoSQL databases have proposed to use conceptual models during this process. The use of an explicit conceptual model can also help developers in the tasks of schema evolution. In this work-in-progress paper, we propose a framework that, given a change in the conceptual model, identifies what must be modified in a NoSQL database schema and the underlying data. We researched several open source projects that use Apache Cassandra to study the benefits of using a conceptual model during the schema evolution process as well as to understand how these models evolve. In this first work, we have focused on studying seven types of conceptual model changes identified in these projects. For each change we describe the transformation required in the database schema to maintain the consistency between the schema and the model as well as the migration of data required to the new schema version.
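The framework's core mapping, from a conceptual-model change to a schema transformation plus a data migration, can be sketched for two change types. The change kinds, table names, and CQL below are hypothetical illustrations, not the seven change types or the tooling from the paper.

```python
def evolve(change):
    """Map a conceptual-model change to a Cassandra DDL statement and a
    data-migration note. Illustrative only: covers two assumed change
    kinds, not the paper's full catalogue.
    """
    table = change["table"]
    if change["kind"] == "add_attribute":
        ddl = f'ALTER TABLE {table} ADD {change["attr"]} {change["type"]};'
        migration = "backfill the new column with a default or leave it null"
    elif change["kind"] == "remove_attribute":
        ddl = f'ALTER TABLE {table} DROP {change["attr"]};'
        migration = "no data migration needed; existing values are discarded"
    else:
        raise ValueError("unsupported change kind")
    return ddl, migration

ddl, migration = evolve({"kind": "add_attribute", "table": "users",
                         "attr": "email", "type": "text"})
```

Keeping such a mapping explicit is what lets the schema and the conceptual model stay consistent as either one evolves.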
- Book Chapter
- 10.1007/978-3-658-10934-9_23
- Jan 1, 2015
Privacy by design and data protection by design focus mostly on product/service features, high-level design, security measures and organizational practices. However, low-level implementation details can also have a data protection impact that may need to be taken into account. This contribution discusses three emerging software development trends (immutability, schema-less databases and reactive programming) that are not yet well-known outside the software development sector. Each of these trends relies on ideas that may at first glance seem discordant with fundamental data protection principles (such as data quality, data minimization and data retention limitations). Even so, upon closer inspection, they also offer direct or indirect data protection benefits. Depending on the circumstances, the use of these trends may therefore be beneficial or even advisable from a data protection perspective. It nevertheless remains difficult to assess to which extent the new Data Protection Regulation will require these aspects to be integrated into data protection impact assessments and data protection by design/default compliance processes.
- Research Article
3
- 10.3897/biss.5.75686
- Sep 27, 2021
- Biodiversity Information Science and Standards
GBIF (Global Biodiversity Information Facility) is the largest data aggregator of biological occurrences in the world. GBIF was officially established in 2001 and has since aggregated 1.8 billion occurrence records from almost 2000 publishers. GBIF relies heavily on Darwin Core (DwC) for organising the data it receives. GBIF Data Processing Pipelines: every occurrence record published to GBIF goes through three processing steps before it becomes available on GBIF.org: source downloading, parsing into verbatim occurrences, and interpreting verbatim values. Once all records are available in the standard verbatim form, they go through a set of interpretations. In 2018, GBIF processing underwent a significant rewrite to improve speed and maintainability. One of the main goals of this rewrite was to improve the consistency between GBIF's processing and that of the Living Atlases. In connection with this, GBIF's current data validator fell out of sync with GBIF pipelines processing. New GBIF Data Validator: the current GBIF data validator is a service that allows anyone with a GBIF-relevant dataset to receive a report on the syntactical correctness and the validity of the content contained within the dataset. By submitting a dataset to the validator, users can go through the validation and interpretation procedures usually associated with publishing in GBIF and quickly determine potential issues in their data without having to publish it. GBIF is planning to rework the validator because it no longer matches current GBIF pipelines processing. Planned Changes: the new validator will match the processing of the GBIF pipelines project; validations will be saved and will show up on user pages, similar to the way downloads and derived datasets appear now (no more bookmarking validations!); and a downloadable report of issues found will be produced.
Suggested Changes/Ideas: one of the main guiding philosophies for the new validator user interface will be avoiding information overload. The current validator is often quite verbose in its feedback, highlighting data issues that may or may not be fixable or particularly important. The new validator will: generate a map of record geolocations; present issues in order of importance; give "What", "Where", "When" flags priority; and offer possible solutions or suggested fixes for flagged records. We see the hosted portal environment as a way to quickly implement a pre-publication validation environment that is interactive and visual. Potential New Data Quality Flags: the GBIF team has been compiling a list of new data quality flags. Not all of the suggested flags are easy to implement, so GBIF cannot promise the flags will get implemented, even if they are a great idea. The advantage of the new processing pipelines is that almost any new data quality flag or processing step in pipelines will be available for the data validator. Easy new potential flags: country centroid flag: country/province centroids are a known data quality problem. Any zero coordinate flag: sometimes publishers leave either the latitude or longitude field as zero when it should have been left blank or NULL. Default coordinate uncertainty in meters flag: sometimes a default value or code is used for dwc:coordinateUncertaintyInMeters, which might indicate that it is incorrect; this is especially the case for the values 301, 3036, 999, 9999. No higher taxonomy flag: often publishers will leave out the higher taxonomy of a record, which can cause problems for matching to the GBIF backbone taxonomy. Null coordinate uncertainty in meters flag: there has been some discussion that GBIF should encourage publishers to fill in dwc:coordinateUncertaintyInMeters, because every record, even one taken from a Global Positioning System (GPS) reading, has an associated dwc:coordinateUncertaintyInMeters. It is also nice when a data quality flag has an escape hatch, such that a data publisher can get rid of false positives or remove a flag by filling in a value. Batch-type validations that are doable for pipelines, but probably not in the validator, include: outlier: outliers are a known data quality problem; there are generally two types, environmental outliers and distance outliers, and currently GBIF flags neither. Record is sensitive species: a sensitive species record is one where the species is considered vulnerable in some way, usually due to poaching threat or because the species is only found in one area. Gridded dataset: rasterized or gridded datasets, where location information is pinned to a low-resolution grid, are common on GBIF; detecting them is already available with an experimental API (Application Programming Interface).
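Several of the "easy" flags proposed above are simple record-level predicates. The sketch below implements a few of them in Python; the field names follow Darwin Core, but the flag names and thresholds are a simplified reading of the abstract, not GBIF's actual pipelines code.

```python
# Default dwc:coordinateUncertaintyInMeters values the abstract calls out
# as likely incorrect.
SUSPICIOUS_UNCERTAINTY = {301, 3036, 999, 9999}

def flag_record(rec):
    """Apply a handful of the proposed GBIF-style data quality flags.

    Simplified illustration; real GBIF processing is far more involved.
    """
    flags = []
    lat = rec.get("decimalLatitude")
    lon = rec.get("decimalLongitude")
    # A zero latitude or longitude often means "left blank", not 0 degrees.
    if lat == 0 or lon == 0:
        flags.append("ANY_ZERO_COORDINATE")
    unc = rec.get("coordinateUncertaintyInMeters")
    if unc is None:
        flags.append("NULL_COORDINATE_UNCERTAINTY")
    elif unc in SUSPICIOUS_UNCERTAINTY:
        flags.append("DEFAULT_COORDINATE_UNCERTAINTY")
    # Missing higher taxonomy hinders matching to the backbone taxonomy.
    if not any(rec.get(k) for k in ("kingdom", "phylum", "class", "family")):
        flags.append("NO_HIGHER_TAXONOMY")
    return flags
```

The "escape hatch" idea maps naturally onto such predicates: filling in a real `coordinateUncertaintyInMeters` value clears both uncertainty flags.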
- Conference Article
3
- 10.1109/pacrim.2003.1235871
- Oct 14, 2003
Nowadays relational database schemas are designed by using well-known database design techniques such as the entity relationship model and the normalization process. The resulting schemas can be guaranteed to have minimum redundancies if the fifth normal form (5NF) is achieved. However, in more recent database schema design, such as the design of object database schemas, minimizing redundancy does not seem to be treated as an important issue. Functional dependencies may still appear in an object class of the class diagram, thus introducing update anomalies. This paper presents the use of NIAM, a well-established conceptual schema model, as a conceptual model for the design of object databases. A transformation from a NIAM schema to an OODB schema with minimum redundancy is presented. The conceptual schema can also be transformed into an XML schema. This is a good approach to XML schema design since NIAM gives a conceptual framework for the design. A transformation from a NIAM schema to an XML schema is presented. A software tool for object and XML schema generation has been developed.
- Conference Article
- 10.1109/icsmc.1996.561500
- Oct 14, 1996
The main goal of the paper is to give a formalization of the relational database schema integration process as a part of database schema design. We briefly describe the form type concept and module schema design, which are related to the integration process. The database schema of the information system is obtained by progressive pairwise integration of so-called database module schemas, which are usually built by different designers. The integration process is preceded by checking the database module schemas for mutual consistency.
- Research Article
- 10.1145/3770750
- Dec 8, 2025
- Journal of Data and Information Quality
The quality of metadata plays a crucial role in many data FAIRification processes. So much so, in fact, that all four main principles of data FAIRification prescribe the use of high-quality metadata. One of the main data management paradigms where metadata is a first-class citizen is Ontology-Based Data Management (OBDM). The goal of OBDM is to provide users with a reconciled view of a set of heterogeneous data sources by means of a semantic metadata layer comprising an ontology and a mapping. The former is a high-level, declarative representation of the domain of interest written in terms of a logical theory, and the latter is a formal description of the relation between the symbols in the ontology and the data at the sources. In this article, we introduce a novel data quality framework based on OBDM and specifically tailored for metadata analysis. The target of this framework is one of the most common forms of metadata currently in circulation, i.e., the integrity constraints defined by a database schema. Specifically, we will focus on the data quality dimension known as Consistency, i.e., the property of data that is free of contradictions and incoherence. In this context, our techniques provide a set of tools to compare the integrity constraints defined by a database schema against the knowledge encoded in an ontology and check whether these constraints are strict enough (i.e., protect) and not too strict (i.e., are faithful to) for such knowledge. The contribution of the article is the presentation of the framework and the study of the related computational problems. We present a detailed computational complexity analysis of such problems and show that they are decidable for classes of OBDM specifications and integrity constraints that are very popular in practice.
- Research Article
55
- 10.1145/319732.319749
- Sep 1, 1982
- ACM Transactions on Database Systems
This paper addresses the problem of database schema design in the framework of the relational data model and functional dependencies. It suggests that both Third Normal Form (3NF) and Boyce-Codd Normal Form (BCNF) supply an inadequate basis for relational schema design. The main problem with 3NF is that it is too forgiving and does not enforce the separation principle as strictly as it should. On the other hand, BCNF is incompatible with the principle of representation and prone to computational complexity. Thus a new normal form, which lies between these two and captures the salient qualities of both is proposed. The new normal form is stricter than 3NF, but it is still compatible with the representation principle. First a simpler definition of 3NF is derived, and the analogy of this new definition to the definition of BCNF is noted. This analogy is used to derive the new normal form. Finally, it is proved that Bernstein's algorithm for schema design synthesizes schemata that are already in the new normal form.
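The distinction the abstract draws between 3NF and BCNF hinges on functional dependencies whose left-hand side is not a superkey. A standard attribute-closure check makes this concrete; the sketch below is textbook machinery, not code from the paper, and the example relation is the classic city/street/zip case that is in 3NF but violates BCNF.

```python
def closure(attrs, fds):
    """Compute the attribute closure of `attrs` under FDs (lhs, rhs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left side is covered, the right side is implied.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def violates_bcnf(relation, fds):
    """Return the nontrivial FDs X -> Y whose X is not a superkey."""
    return [(lhs, rhs) for lhs, rhs in fds
            if not set(rhs) <= set(lhs)              # skip trivial FDs
            and closure(lhs, fds) != set(relation)]  # lhs not a superkey

# R(city, street, zip) with {city, street} -> {zip} and {zip} -> {city}:
# in 3NF (city is a prime attribute) but not in BCNF.
R = {"city", "street", "zip"}
fds = [({"city", "street"}, {"zip"}), ({"zip"}, {"city"})]
violations = violates_bcnf(R, fds)
```

Here the only violation is `{zip} -> {city}`, exactly the kind of dependency a normal form strictly between 3NF and BCNF must adjudicate.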
- Conference Article
9
- 10.1109/soca.2015.29
- Oct 1, 2015
While database schema options in relational database management systems are few and have been studied for decades, little effort has so far been devoted to NoSQL column stores. Today, schema design for column stores is still based on the gut feeling of the application developer instead of being approached systematically. This is even more critical as "good" schemas in column stores do not only depend on the data model of the application but also on the queries on that data: Poor schema design will either lead to a situation where not all queries can be answered or where some queries will show really poor performance. In this paper, we propose a systematic and informed approach to database schema design in NoSQL column stores by means of automated schema generation and application-specific schema ranking.
- Book Chapter
4
- 10.1007/978-3-030-62522-1_35
- Jan 1, 2020
Database schema design requires careful consideration of the application’s data model, workload, and target database technology to optimize for performance and data size. Traditional normalization schemes used in relational databases minimize data redundancy, whereas NoSQL document-oriented databases favor redundancy and optimize for horizontal scalability and performance. Systematic NoSQL schema design involves multiple dimensions, and a database designer is in practice required to carefully consider (i) which data elements to copy and co-locate, (ii) which data elements to normalize, and (iii) how to encode data, while taking into account factors such as the workload and data model. In this paper, we present a workload-driven document database schema recommender (DBSR), which takes a systematic, search-based approach in exploring the complex schema design space. The recommender takes as main inputs the application’s data model and its read workload, and outputs (i) the suggested document schema (featuring secondary indexing), (ii) query plan recommendations, and (iii) a document utility matrix that encodes insights on their respective costs and relative utility. We evaluate recommended schema in MongoDB using YCSB, and show significant benefits to read query performance.
- Research Article
5
- 10.1002/int.4550100902
- Jan 1, 1995
- International Journal of Intelligent Systems
This thesis focuses on the problem of designing a highly portable domain independent natural language interface for standard relational database systems. It is argued that a careful strategy for providing the natural language interface (NLI) with morphological, syntactic, and semantic knowledge about the subject of discourse and the database is needed to make the NLI portable from one subject area and database to another. There has been a great deal of interest recently in utilizing the database system to provide that knowledge. Previous approaches attempted to solve this challenging problem by capturing knowledge from the relational database (RDB) schema, but were unsatisfactory for the following reasons: 1.) RDB schemas contain referential ambiguities which seriously limit their usefulness as a knowledge representation strategy for NL understanding. 2.) Knowledge captured from the RDB schema is sensitive to arbitrary decisions made by the designer of the schema. In our work we provide a new solution by applying a conceptual model for database schema design to the design of a portable natural language interface. It has been our observation that the process used for adapting the natural language interface to a new subject area and database overlaps considerably with the process of designing the database schema. Based on this important observation, we design an enhanced natural language interface with the following significant features: complete independence of the linguistic component from the database component, economies in attaching the natural language and DB components, and sharing of knowledge about the relationships in the subject of discourse for database schema design and NL understanding.
- Conference Article
7
- 10.1145/2896921.2896933
- May 14, 2016
Relational databases are a vital component of many modern software applications. Key to the definition of the database schema --- which specifies what types of data will be stored in the database and the structure in which the data is to be organized --- are integrity constraints. Integrity constraints are conditions that protect and preserve the consistency and validity of data in the database, preventing data values that violate their rules from being admitted into database tables. They encode logic about the application concerned, and like any other component of a software application, need to be properly tested. Mutation analysis is a technique that has been successfully applied to integrity constraint testing, seeding database schema faults of both omission and commission. Yet, as for traditional mutation analysis for program testing, it is costly to perform, since the test suite under analysis needs to be run against each individual mutant to establish whether or not it exposes the fault. One overhead incurred by database schema mutation is the cost of communicating with the database management system (DBMS). In this paper, we seek to eliminate this cost by performing mutation analysis virtually on a local model of the DBMS, rather than on an actual, running instance hosting a real database. We present an empirical evaluation of our virtual technique revealing that, across all of the studied DBMSs and schemas, the virtual method yields an average time saving of 51% over the baseline.
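The "virtual" idea above, evaluating mutants against a local model of the constraints instead of a live DBMS, can be illustrated with a toy Python analogue in which each integrity constraint is a predicate over a row. The constraint encoding and the mutants below are hypothetical; the paper's technique models real DBMS behavior far more faithfully.

```python
def run_suite(constraints, test_rows):
    """Accept/reject each test row under a set of constraint predicates."""
    return [all(c(row) for c in constraints) for row in test_rows]

def mutation_score(original, mutants, test_rows):
    """Fraction of mutant schemas 'killed' by the test data, evaluated on
    a local in-memory model rather than a running DBMS instance.
    A mutant is killed when any row is classified differently.
    """
    baseline = run_suite(original, test_rows)
    killed = sum(run_suite(m, test_rows) != baseline for m in mutants)
    return killed / len(mutants)

# Original schema: a CHECK constraint requiring non-negative age.
original = [lambda r: r["age"] >= 0]
mutants = [
    [lambda r: r["age"] > 0],   # boundary mutant: >= weakened to >
    [lambda r: r["age"] >= 0],  # equivalent mutant: behavior unchanged
]
rows = [{"age": 0}, {"age": 5}]
score = mutation_score(original, mutants, rows)
```

The boundary row `{"age": 0}` kills the first mutant but can never kill the equivalent one, so the score here is 0.5; avoiding a round trip to the DBMS for each such evaluation is precisely where the reported time savings come from.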
- Conference Article
7
- 10.1109/itcc.2004.1286678
- Jan 1, 2004
Nowadays relational database schemas are designed by using well-known database design techniques such as the entity relationship model and the normalization process. The resulting schemas can be guaranteed to have minimum redundancies if the fifth normal form (5NF) is achieved. However, in more recent database schema design, such as the design of object database schemas, minimizing redundancy does not seem to be treated as an important issue. Functional dependencies may still appear in an object class of the class diagram, thus introducing update anomalies. This article first presents the use of NIAM, a well-established conceptual schema model, as a conceptual model for the design of object databases. A transformation from a NIAM schema to an OODB schema with minimum redundancy is presented. The conceptual schema can also be transformed into Extensible Markup Language (XML), which was originally a language for document management. However, it has now gained popularity in database representation and is particularly useful as a data format when an application must communicate with another application. This article also presents the NIAM conceptual schema model as a conceptual design tool for XML schema. A software tool that allows users to create NIAM schemas and generate object and XML schemas is developed.