Database Research Community Research Articles

Storage-compute disaggregation has recently emerged as a novel architecture in modern data centers, particularly in the cloud. By decoupling compute from storage, this architecture enables independent and elastic scaling of compute and storage resources, potentially increasing resource utilization and reducing overall costs. To best leverage the disaggregated architecture, a new breed of database systems termed storage-disaggregated databases has recently been developed, such as Amazon Aurora, Microsoft Socrates, Google AlloyDB, Alibaba PolarDB, and Huawei Taurus. However, little is known about the effectiveness of the design principles in these databases, since they are typically developed by industry giants and only overall performance results are presented, without detailing the impact of individual design principles. As a result, many critical research questions remain open, such as the performance impact of storage disaggregation, the log-as-the-database design, shared storage, and various log-replay methods. In this paper, we investigate, for the first time, the performance implications of the design principles that are widely adopted in storage-disaggregated databases. Because these databases are usually not open-sourced, we have made a significant effort to implement a storage-disaggregated database prototype based on PostgreSQL v13.0. By fully controlling and instrumenting the codebase, we are able to selectively enable and disable individual optimizations and techniques to evaluate their impact on performance in various scenarios. Furthermore, we open-source our storage-disaggregated database prototype for use by the broader database research community, fostering collaboration and innovation in this field.


Machine learning (ML) applications have been thriving recently, largely attributed to the increasing availability of data. However, inconsistency and incomplete information are ubiquitous in real-world datasets, and their impact on ML applications remains elusive. In this paper, we present a formal study of this impact by extending the notion of Certain Answers for Codd tables, which has been explored by the database research community for decades, into the field of machine learning. Specifically, we focus on classification problems and propose the notion of "Certain Predictions" (CP): a test data example can be certainly predicted (CP'ed) if all possible classifiers trained on top of all possible worlds induced by the incompleteness of data would yield the same prediction. We study two fundamental CP queries: (Q1) a checking query that determines whether a data example can be CP'ed; and (Q2) a counting query that computes the number of classifiers that support a particular prediction (i.e., label). Given that general solutions to CP queries are, not surprisingly, hard without assumptions about the type of classifier, we further present a case study in the context of nearest neighbor (NN) classifiers, where efficient solutions to CP queries can be developed: we show that it is possible to answer both queries in linear or polynomial time over exponentially many possible worlds. We demonstrate one example use case of CP in the important application of "data cleaning for machine learning (DC for ML)." We show that our proposed CPClean approach, built on CP, can often significantly outperform existing techniques, particularly on datasets with systematic missing values. For example, on 5 datasets with systematic missingness, CPClean (with early termination) closes 100% of the gap on average by cleaning 36% of the dirty data on average, while the best automatic cleaning approach, BoostClean, closes only 14% of the gap on average.
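To make the two CP queries concrete, here is a minimal brute-force sketch for a 1-nearest-neighbor classifier. It is not the efficient algorithm the abstract refers to: it simply enumerates every possible world, so its cost grows exponentially with the number of incomplete rows. The function names, the row representation (a list of candidate feature vectors per row), and the use of squared Euclidean distance are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def nearest_label(train, x):
    """Label of the training row closest to x (1-NN, squared Euclidean distance)."""
    best = min(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))
    return best[1]


def possible_worlds(rows):
    """Yield every completed training set.

    Each row is (candidates, label). `candidates` lists the feature vectors the
    row could take; a clean row has exactly one candidate. (Hypothetical format.)
    """
    for choice in product(*(cands for cands, _ in rows)):
        yield [(features, label) for features, (_, label) in zip(choice, rows)]


def count_predictions(rows, x):
    """Q2 (counting): how many possible worlds support each predicted label."""
    return Counter(nearest_label(world, x) for world in possible_worlds(rows))


def is_certainly_predicted(rows, x):
    """Q1 (checking): True if all possible worlds yield the same prediction."""
    return len(count_predictions(rows, x)) == 1


# Two clean rows and one row whose features have two candidate repairs.
rows = [
    ([(0.0, 0.0)], "a"),
    ([(5.0, 5.0)], "b"),
    ([(1.0, 1.0), (4.0, 4.0)], "a"),
]

print(is_certainly_predicted(rows, (0.5, 0.5)))  # both worlds predict "a"
print(count_predictions(rows, (4.4, 4.4)))       # the two worlds disagree
```

In this toy setup each possible world yields exactly one 1-NN classifier, so counting worlds and counting classifiers coincide. A test point near the clean "a" row is CP'ed regardless of how the incomplete row is repaired, while a point near the "b" row is not, because one repair places an "a" row closer to it.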


Articles published on Database Research Community

Legal Perspectives on Research Data Storage

Understanding the Performance Implications of the Design Principles in Storage-Disaggregated Databases

Ad Hoc Transactions through the Looking Glass: An Empirical Study of Application-Level Transactions in Web Applications

Technical Perspective: Query Answers - Fewer is Faster

Ad Hoc Transactions: What They Are and Why We Should Care

A Case for Graphics-Driven Query Processing

The World of Graph Databases from An Industry Perspective

Cloud data systems

Nearest neighbor classifiers over incomplete information

Winds from Seattle

Combating fake news

An in-depth comparison of s-t reliability algorithms over uncertain graphs

Monochromatic and bichromatic ranked reverse boolean spatial keyword nearest neighbors search

Efficient Answering of Why-Not Questions in Similar Graph Matching

On uncertain graphs modeling and queries

An efficient scheme for probabilistic skyline queries over distributed uncertain data

Approaches and Challenges in Database Intrusion Detection

The Beckman Report on Database Research

Aggregate nearest neighbor queries in uncertain graphs

Expanding Database Keyword Search for Database Exploration

