Abstract

Special support for quickly finding the first few answers of a query is already appearing in commercial database systems. This support is useful in active databases, when dealing with potentially unmanageable query results, and as a declarative alternative to navigational techniques. In this paper, we discuss query processing techniques for first-answer queries. We provide a method for predicting the cost of a first-answer query plan under an execution model that attempts to reduce wasted effort in join pipelining. We define new statistics necessary for accurate cost prediction, and discuss techniques for obtaining these statistics through traditional statistical measures (e.g., selectivity) and semantic data properties commonly specified through modern OODB and relational schemas. The proposed techniques also apply to all-answer query processing when optimizing for fast delivery of the initial query results.

1 Introduction

Traditional methods for query processing, primarily those based on the relational model, process queries with the goal of materializing the set of all answer tuples with minimal cost. Several applications instead require only the first answer or first few answers to particular queries, or require the first answers of a query to be delivered as quickly as possible. This is evidenced by increasing support for first-answer query optimization in modern relational systems [11, 16]. First-answer query support is also important in active databases based on production-system models, where fast match algorithms lazily enumerate answers to a query one at a time [15]. Object-oriented database systems and knowledge-representation systems support complex structures that allow data to be retrieved through navigation as well as querying. Navigation is often preferable to querying for locating a single object, since query engines, usually geared around set-oriented constructs, inevitably touch more data than necessary.
A declarative query language with first-answer support can enable more understandable code than navigation, and potentially faster retrieval due to cost-based optimization. Finally, there will always be cases when producing the entire query result is simply too costly. Various search engines (including those for the World Wide Web) provide functionality for lazily enumerating answers in case of overly general search criteria. In this domain one might argue that an all-answer query response may take infinitely long, or that an input "table" may be a stream with no known end; thus only depth-first, first-solution methods are applicable. This paper presents our work on query processing techniques specifically geared toward optimizing and executing first-answer join queries. The techniques also apply to optimizing all-answer queries when the goal is to minimize the latency of first-answer delivery rather than to maximize overall throughput. The analysis is independent of any storage model, and therefore applies whether the database is disk resident, main-memory resident, or distributed. We begin by providing a modified pipelined join algorithm that remedies performance problems sometimes exhibited by naive join pipelining. We then present a probabilistic technique for predicting query-plan cost under this modified pipelined-join execution model. Though the cost-estimation technique requires database statistics not typically maintained by traditional centralized database systems, these statistics are derivable from those commonly maintained by distributed query processors. We also show how they can often be derived or estimated from selectivity information and from semantic information commonly specified in the form of cardinality constraints (such as existence and functional dependencies) in modern relational, object-oriented, and knowledge-base systems.
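To make the core idea concrete, the following is a minimal sketch (not the paper's algorithm) of how a pipelined join can lazily enumerate answers one at a time, so that obtaining the first answer does only the work that answer requires. The table contents, the `pipelined_join` function, and the join predicate are all hypothetical illustrations.

```python
def pipelined_join(outer, inner_factory, predicate):
    """Pipelined nested-loop join as a generator: answers are yielded
    one at a time instead of materializing the full result set."""
    for o in outer:
        # inner_factory() returns a fresh iterator over the inner input
        for i in inner_factory():
            if predicate(o, i):
                yield (o, i)

# Hypothetical tables: employees (id, name, dept_id) and departments (id, name)
emps = [(1, "A", 10), (2, "B", 20), (3, "C", 10)]
depts = [(10, "Sales"), (20, "Eng")]

answers = pipelined_join(emps, lambda: iter(depts),
                         lambda e, d: e[2] == d[0])

# Only enough tuples are scanned to produce the first answer;
# the remaining answers stay unenumerated until requested.
first = next(answers)  # → ((1, "A", 10), (10, "Sales"))
```

A first-answer query processor must still choose a join order and execution strategy for such a pipeline, which is exactly where the cost model discussed in this paper applies: different plans can differ enormously in how much work precedes the first yielded tuple.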
