Bringing Together Physical Design and Fast Querying of Large Data Warehouses
Data partitioning is a well-known technique for optimizing decision-support query performance. In this paper, we present a horizontal data partitioning approach tailored to a large data warehouse queried by a large number of queries. The idea behind our approach is to horizontally partition only the large fact table, based on partitioning predicates elected from the set of selection predicates used by the analytic queries. The election of partitioning predicates depends on their number of occurrences, their access frequencies, and their selectivities. Using the Star Schema Benchmark under Oracle 12c, we demonstrate that our partitioning technique reduces both query response time and the number of fact partitions, the latter being the major drawback of existing partitioning techniques. We also show that our partitioning algorithm is around 66% faster than the primary and derived partitioning techniques based on the genetic algorithm.
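The predicate-election step this abstract describes can be sketched in Python; the scoring formula, the weights, and the SSB-style sample predicates below are illustrative assumptions, not the authors' exact method.

```python
# Sketch: electing partitioning predicates from query selection
# predicates. Scoring by occurrence count, access frequency, and
# selectivity follows the abstract; the exact combination and the
# SSB-style sample predicates are illustrative assumptions.

def elect_predicates(predicates, top_k=3):
    """predicates: dicts with 'name', 'occurrences' (count across the
    workload), 'frequency' (access frequency), and 'selectivity'
    (fraction of rows kept; lower = more selective)."""
    def score(p):
        # Favor predicates that occur often, are accessed frequently,
        # and filter out many rows.
        return p["occurrences"] * p["frequency"] * (1.0 - p["selectivity"])
    return sorted(predicates, key=score, reverse=True)[:top_k]

candidates = [
    {"name": "d_year = 1997",          "occurrences": 12, "frequency": 0.6, "selectivity": 0.14},
    {"name": "s_region = 'ASIA'",      "occurrences": 8,  "frequency": 0.4, "selectivity": 0.20},
    {"name": "p_category = 'MFGR#12'", "occurrences": 3,  "frequency": 0.1, "selectivity": 0.04},
]
elected = elect_predicates(candidates, top_k=2)
```

The elected predicates would then drive the fragmentation of the fact table into partitions, one per combination of predicate outcomes.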
- Book Chapter
4
- 10.1007/978-3-642-31552-7_53
- Jan 1, 2013
A data warehouse stores historical data for answering analytical queries. These analytical queries are long, complex, and exploratory in nature and, when processed against a large data warehouse, take a long time to process. As a result, the query response time is high. This time can be reduced by materializing views over the data warehouse. To improve query response time, these views must contain information relevant to answering future queries. In this paper, an approach is presented that identifies such relevant information from queries previously posed on the data warehouse. The approach first identifies subject-specific queries and then selects the frequent queries from among them. These selected frequent queries contain information that has been accessed frequently in the past and therefore has a high likelihood of being accessed by future queries. This results in improved query response time and thereby supports efficient decision making.
Keywords: Subject Area; Data Warehouse; Query Optimization; Dice Coefficient; Query Response Time
- Book Chapter
36
- 10.1007/978-3-642-32129-0_26
- Jan 1, 2012
A data warehouse stores historical information, integrated from several large heterogeneous data sources spread across the globe, for the purpose of supporting decision making. The queries for decision making are usually analytical and complex in nature, and their response time is high when processed against a large data warehouse. This query response time can be reduced by materializing views over the data warehouse. Since all views cannot be materialized due to space constraints, and the optimal selection of a subset of views is an NP-complete problem, appropriate subsets of views must be selected for materialization. An approach for selecting such subsets of views using a Genetic Algorithm is proposed in this paper. This approach computes the top-T views from a multidimensional lattice by exploring and exploiting the search space containing all possible views. Further, in comparison to the greedy algorithm, this approach is able to lower the total cost of evaluating all the views.
Keywords: Data Warehouse; Materialized View Selection; Genetic Algorithm
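A minimal genetic-algorithm sketch for top-T view selection over a small lattice, in the spirit of the approach above; the lattice, the linear cost model (each view answered from its cheapest materialized ancestor), and the GA parameters are illustrative assumptions rather than the paper's implementation.

```python
import random

# Sketch: genetic algorithm for top-T view selection over a small view
# lattice. The lattice, view sizes, and linear cost model are
# illustrative assumptions, not the paper's experimental setup.
random.seed(0)

VIEW_SIZES = [100, 50, 60, 30, 20, 10, 8, 1]      # view 0 = base cuboid
PARENTS = {1: [0], 2: [0], 3: [1, 2], 4: [1], 5: [3, 4], 6: [3], 7: [5, 6]}
T = 3  # number of views to materialize besides the base cuboid

def eval_cost(materialized):
    """Total cost of answering every view from its cheapest materialized
    ancestor; the base cuboid (view 0) is always available."""
    total = 0
    for v in range(len(VIEW_SIZES)):
        cost, frontier, seen = VIEW_SIZES[0], [v], set()
        while frontier:
            u = frontier.pop()
            if u in seen:
                continue
            seen.add(u)
            if u == 0 or u in materialized:
                cost = min(cost, VIEW_SIZES[u])
            frontier.extend(PARENTS.get(u, []))
        total += cost
    return total

def ga_select(pop_size=20, gens=30):
    views = list(range(1, len(VIEW_SIZES)))
    pop = [tuple(sorted(random.sample(views, T))) for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda ind: eval_cost(set(ind)))     # exploit: keep fittest
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:  # explore: recombine
            a, b = random.sample(survivors, 2)
            child = set(random.sample(sorted(set(a) | set(b)), T))
            if random.random() < 0.3:                     # mutate: swap one view
                child.pop()
                child.add(random.choice(views))
            while len(child) < T:                         # repair undersized child
                child.add(random.choice(views))
            children.append(tuple(sorted(child)))
        pop = survivors + children
    return min(pop, key=lambda ind: eval_cost(set(ind)))

best = ga_select()
```

Selection sorts by evaluation cost (exploitation) while crossover and mutation keep sampling new subsets (exploration), mirroring the explore/exploit framing in the abstract.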
- Book Chapter
- 10.4018/978-1-60566-010-3.ch142
- Jan 1, 2009
Decision-support applications require complex queries, e.g., multi-way joins defined over huge warehouses that are usually modelled using star schemas, i.e., a fact table and a set of dimension tables (Papadomanolakis & Ailamaki, 2004). Star schemas have an important property in terms of join operations between dimension tables and the fact table: the fact table contains a foreign key for each dimension, and there are no join operations between dimension tables. Joins in data warehouses (called star join queries) are particularly expensive because the fact table (by far the largest table in the warehouse) participates in every join, and multiple dimensions are likely to participate in each join. To speed up star join queries, many optimization structures have been proposed: redundant structures (materialized views and advanced index schemes) and non-redundant structures (data partitioning and parallel processing). Recently, data partitioning has come to be recognized as an important aspect of physical database design (Sanjay, Narasayya & Yang, 2004; Papadomanolakis & Ailamaki, 2004). Two types of data partitioning are available (Özsu & Valduriez, 1999): vertical and horizontal partitioning. Vertical partitioning decomposes tables into disjoint sets of columns. Horizontal partitioning decomposes tables, materialized views, and indexes into disjoint sets of rows that are physically stored and usually accessed separately. Contrary to redundant structures, data partitioning does not replicate data, thereby reducing storage requirements and minimizing maintenance overhead. In this paper, we concentrate only on horizontal data partitioning (HP).
HP can positively affect (1) query performance, by enabling partition elimination: if a query includes a partition key as a predicate in the WHERE clause, the query optimizer will automatically route the query to only the relevant partitions; and (2) database manageability: for instance, by allocating partitions to different machines or by splitting any access path: tables, materialized views, indexes, etc. Most database systems support three HP methods via the PARTITION statement: RANGE, HASH, and LIST (Sanjay, Narasayya & Yang, 2004). In range partitioning, an access path (table, view, or index) is split according to a range of values of a given set of columns. The hash mode decomposes the data according to a hash function (provided by the system) applied to the values of the partitioning columns. List partitioning splits a table according to the listed values of a column. These methods can be combined to generate composite partitioning. Oracle currently supports range-hash and range-list composite partitioning via the PARTITION - SUBPARTITION statement. The following SQL statement shows an example of fragmenting a Student table using range partitioning.
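The referenced statement does not appear in this excerpt; a minimal Oracle-style sketch, with an assumed Student schema and partition bounds, might look like:

```sql
-- Range partitioning of a Student table on enrollment year.
-- Table and column names are illustrative, not taken from the source.
CREATE TABLE Student (
  sid         NUMBER PRIMARY KEY,
  name        VARCHAR2(64),
  enroll_year NUMBER
)
PARTITION BY RANGE (enroll_year) (
  PARTITION p_before_2000 VALUES LESS THAN (2000),
  PARTITION p_2000s       VALUES LESS THAN (2010),
  PARTITION p_recent      VALUES LESS THAN (MAXVALUE)
);
```

A query with `WHERE enroll_year = 2005` in its predicate would then be routed to partition `p_2000s` only, illustrating partition elimination.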
- Conference Article
378
- 10.1145/1007568.1007609
- Jun 13, 2004
In addition to indexes and materialized views, horizontal and vertical partitioning are important aspects of physical design in a relational database system that significantly impact performance. Horizontal partitioning also provides manageability; database administrators often require indexes and their underlying tables to be partitioned identically so as to make common operations such as backup/restore easier. While partitioning is important, incorporating it makes the problem of automating physical design much harder since: (a) the choices of partitioning can strongly interact with the choices of indexes and materialized views; (b) a large new space of physical design alternatives must be considered; and (c) manageability requirements impose a new constraint on the problem. In this paper, we present novel techniques for designing a scalable solution to this integrated physical design problem that takes both performance and manageability into account. We have implemented our techniques and evaluated them on Microsoft SQL Server. Our experiments highlight (a) the importance of taking an integrated approach to automated physical design and (b) the scalability of our techniques.
- Research Article
27
- 10.1504/ijict.2010.034979
- Jan 1, 2010
- International Journal of Information and Communication Technology
A materialised view is constructed to improve response time for complex analytical queries posed on a large data warehouse. Most existing approaches use all the queries posed on the data warehouse for constructing materialised views. It is generally observed that, among all the queries posed on the data warehouse in the past, queries that are similar and more frequently posed have a high likelihood of being posed again in the future and are therefore appropriate for constructing materialised views. The approach presented in this paper attempts to select such frequently posed queries from among all the queries posed on the data warehouse. Further, since the materialised views are required to fit within the available storage space, the approach selects a subset of profitable frequent queries that conforms to the space constraint. The information accessed by these queries has a high likelihood of being accessed again by future queries. Furthermore, it is experimentally shown that using this information to construct materialised views reduces query response time. This in turn facilitates decision-making.
- Research Article
20
- 10.1504/ijvcm.2011.042071
- Jan 1, 2011
- International Journal of Value Chain Management
The queries for decision making are usually analytical and complex in nature, and their response times are high when processed against a large data warehouse. This problem of high response times can be addressed by materialising views over the data warehouse. Since all possible views cannot be materialised due to the space constraint, there is a need to select an appropriate subset of views that can improve query response time. One way to address this problem is by selecting views in a greedy manner. Most greedy-based view selection algorithms consider the size of the views to select the most beneficial views for materialisation. This paper presents a greedy-based approach that considers query frequency, along with size, to select the most profitable views for materialisation. These profitable views are likely to answer most future queries and thereby may reduce query response time.
- Research Article
25
- 10.1504/ijbis.2012.050172
- Jan 1, 2012
- International Journal of Business Information Systems
A data warehouse contains historical and summarised data that grows almost exponentially with time. It provides a uniform platform for posing decision-support queries. These queries are usually analytical and complex in nature and, when processed against a large data warehouse, consume a lot of processing time, resulting in an increased query response time. This time can be reduced by using materialised views, which pre-compute the most frequently accessed information and store it in the data warehouse. In this paper, an algorithm to construct materialised views using previously posed optimal user queries on the data warehouse is proposed. This algorithm defines a heuristic that maximally merges the optimal queries to construct a single materialised view. These materialised views are capable of providing meaningful information for a given future query. Further, experiments are performed to evaluate the effectiveness of the materialised views with respect to query response time. The experimental results show that the materialised views so constructed are capable of answering future user queries with a reduced response time. This enables effective and efficient decision-making.
- Book Chapter
2
- 10.1007/978-3-662-49784-5_3
- Jan 1, 2016
Selecting the optimal subset of views for materialization provides an effective way to reduce query evaluation time for real-time Online Analytical Processing (OLAP) queries posed against a data warehouse. However, materializing a large number of views may be counterproductive and may exceed storage thresholds, especially for very large data warehouses. Thus, an important concern is to find the best set of views to materialize in order to guarantee acceptable query response times. It further follows that this set of views may differ from user to user, based on personal preferences. In addition, the set of queries that a specific user poses also changes over time, which further impacts the view selection process. In this paper, we introduce the personalized Smart Cube algorithm, which combines vertical partitioning, partial materialization, and dynamic computation to address these issues. In our approach, we partition the search space into fragments and proceed to select the optimal subset of fragments to materialize. We dynamically adapt the set of materialized views that we store, based on query histories and user interests. The experimental evaluation of our personalized Smart Cube algorithm shows that our work compares favorably with the state of the art. The results indicate that our algorithm materializes a smaller number of views than other techniques, while yielding fast query response times.
- Conference Article
1
- 10.1109/aset.2017.7983684
- Jan 1, 2017
Hardware/software partitioning is a critical problem in the co-design methodology. It consists in deciding which processes of the embedded application should be executed on a specific hardware architecture and which ones can be implemented on a general-purpose processor (software architecture), taking into account a set of constraints. The hardware architecture is selected to increase embedded system performance (execution time, area, energy, etc.) and to speed up design. The software architecture, however, is more flexible and inexpensive; it is generally chosen to decrease design cost and complexity. Several significant research works on hardware/software partitioning techniques and algorithms exist. In this paper, we present a comparative study of different hardware/software partitioning algorithms. Performance analysis reveals that the Particle Swarm Optimization (PSO) algorithm outperforms the Simulated Annealing (SA) algorithm, the Ant Colony Optimization (ACO) algorithm, the Genetic Algorithm (GA), and the Fuzzy C-Means (FCM) algorithm.
- Book Chapter
26
- 10.1007/978-3-642-27872-3_7
- Jan 1, 2012
A data warehouse stores historical data to support analytical query processing. These analytical queries are long and complex, and processing them against a large data warehouse consumes a lot of time. As a result, the query response time is high. One way to reduce this time is to select views that are likely to answer a large number of future queries and store them in the data warehouse. This problem is referred to as view selection. Several view selection algorithms have been proposed, most of them centred around HRUA. HRUA considers the size of the views to select the most beneficial view for materialization. The views selected using HRUA, though beneficial with respect to size, may be unable to account for a large number of queries, making them an unnecessary overhead. The algorithm proposed in this paper attempts to address this problem by considering query frequency, along with size, to select the Top-K views for materialization. In each iteration, the proposed algorithm computes the profit, defined in terms of size and query frequency, and then selects the most profitable view for materialization. As a result, the selected views are beneficial with respect to size and are able to answer future queries. Further, experimental results show that, in comparison to HRUA, the proposed algorithm selects views capable of answering a larger number of queries at the cost of a slight increase in the total cost of evaluating all the views. This in turn results in efficient decision making.
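The frequency-weighted greedy selection described above can be sketched as follows; the tiny lattice, view sizes, and query frequencies are illustrative assumptions, with profit measured as the drop in total frequency-weighted evaluation cost.

```python
# Sketch: frequency-weighted greedy view selection (HRUA-style), per the
# profit definition above. The tiny lattice, view sizes, and query
# frequencies are illustrative assumptions.

SIZES = {"base": 100, "AB": 50, "A": 20, "B": 30, "none": 1}
# view -> the views it can be answered from (itself plus its ancestors)
ANCESTORS = {
    "base": ["base"],
    "AB":   ["AB", "base"],
    "A":    ["A", "AB", "base"],
    "B":    ["B", "AB", "base"],
    "none": ["none", "A", "B", "AB", "base"],
}
FREQ = {"base": 1, "AB": 4, "A": 10, "B": 2, "none": 3}  # query frequencies

def answer_cost(v, materialized):
    # Cost of answering view v from its cheapest materialized ancestor.
    return min(SIZES[a] for a in ANCESTORS[v] if a in materialized)

def total_cost(materialized):
    return sum(FREQ[v] * answer_cost(v, materialized) for v in SIZES)

def greedy_topk(k):
    materialized = {"base"}                # the base cuboid is always kept
    for _ in range(k):
        # Profit of v = reduction in frequency-weighted evaluation cost.
        best = max((v for v in SIZES if v not in materialized),
                   key=lambda v: total_cost(materialized) - total_cost(materialized | {v}))
        materialized.add(best)
    return materialized

picked = greedy_topk(2)                    # Top-K selection with K = 2
```

Because frequency weights the cost, the frequently queried small view "A" is picked before the larger but less-queried views, which is exactly the behaviour the abstract contrasts with plain size-based HRUA.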
- Book Chapter
2
- 10.1007/978-3-540-77226-2_53
- Dec 16, 2007
Proposing efficient techniques for the discovery of useful information and valuable knowledge from very large databases and data warehouses has attracted the attention of many researchers in the field of data mining. The well-known Association Rule Mining (ARM) algorithm, Apriori, searches for frequent itemsets (i.e., sets of items with an acceptable support) by scanning the whole database repeatedly to count the frequency of each candidate itemset. Most of the methods proposed to improve the efficiency of the Apriori algorithm attempt to count the frequency of each itemset without re-scanning the database. However, these methods rarely propose any solution to reduce the complexity of the inevitable enumerations inherent in the problem. In this paper, we propose a new algorithm for mining frequent itemsets and also association rules. The algorithm computes the frequency of itemsets in an efficient manner. Only a single scan of the database is required by this algorithm. The data is encoded into a compressed form and stored in main memory within a suitable data structure. The proposed algorithm works iteratively, and in each iteration the time required to measure the frequency of an itemset is reduced further (i.e., checking the frequency of n-dimensional candidate itemsets is much faster than that of (n-1)-dimensional ones). The efficiency of our algorithm is evaluated using artificial and real-life datasets. Experimental results indicate that our algorithm is more efficient than existing algorithms.
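A minimal sketch of the single-scan idea: one pass over the database builds an in-memory bitmap per item, after which any candidate itemset's frequency is a bitwise AND plus a popcount, with no re-scan. The dataset and the encoding details are illustrative assumptions, not the paper's exact data structure.

```python
from itertools import combinations

# Sketch: single-scan frequent-itemset mining. One database pass encodes
# each item's occurrences as a bitmap held in main memory; any candidate
# itemset's frequency is then a bitwise AND plus a popcount.
# Dataset and encoding are illustrative assumptions.

transactions = [
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"},
    {"b", "c"}, {"a", "b", "c"}, {"b"},
]
min_support = 3

# The single database scan: bit i of an item's bitmap marks transaction i.
bitmaps = {}
for i, t in enumerate(transactions):
    for item in t:
        bitmaps[item] = bitmaps.get(item, 0) | (1 << i)

def support(itemset):
    # Intersect bitmaps (AND), then count set bits (popcount).
    bm = (1 << len(transactions)) - 1
    for item in itemset:
        bm &= bitmaps[item]
    return bin(bm).count("1")

# Iterative levelwise search over the bitmaps only (no re-scan).
frequent = {frozenset([i]) for i in bitmaps if support([i]) >= min_support}
level = frequent
while level:
    candidates = {a | b for a, b in combinations(level, 2)
                  if len(a | b) == len(a) + 1}
    level = {c for c in candidates if support(c) >= min_support}
    frequent |= level
```

Each level reuses the AND results of ever-longer itemsets over ever-sparser bitmaps, which loosely mirrors the abstract's claim that frequency checks get cheaper as itemset dimension grows.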
- Research Article
1
- 10.1023/b:jmse.0000043449.85576.da
- Jan 1, 2005
- Journal of Materials Science: Materials in Electronics
It is desirable to design partitioning methods that minimize the I/O time incurred during query execution in spatial databases. This paper explores optimal partitioning for two-dimensional data for a class of queries and develops multi-disk allocation techniques that maximize the degree of I/O parallelism obtained in each case. We show that hexagonal partitioning has optimal I/O performance for circular queries among all partitioning methods that use convex non-overlapping regions. An analysis and extension of this result to all possible partitioning techniques is also given. For rectangular queries, we show that hexagonal partitioning has overall better I/O performance for a general class of range queries, except for rectilinear queries, in which case rectangular grid partitioning is superior. By using current algorithms for rectangular grid partitioning, parallel storage and retrieval algorithms for hexagonal partitioning can be constructed. Some of these results carry over to circular partitioning of the data—which is an example of a non-convex region.
- Conference Article
34
- 10.1109/ssdm.2000.869781
- Jul 26, 2000
Multidimensional data cubes are used in large data warehouses as a tool for online aggregation of information. As the number of dimensions increases, supporting efficient queries as well as updates to the data cube becomes difficult. Another problem that arises with increased dimensionality is the sparseness of the data space. In this paper we develop a new data structure referred to as the pCube (data cube for progressive querying), to support efficient querying and updating of multidimensional data cubes in large data warehouses. While the pCube concept is very general and can be applied to any type of query, we mainly focus on range queries that summarize the contents of regions of the data cube. pCube provides intermediate results with absolute error bounds (to allow trading accuracy for fast response time), efficient updates, scalability with increasing dimensionality, and pre-aggregation to support summarization of large ranges. We present both a general solution and an implementation of pCube and report the results of experimental evaluations.
- Research Article
8
- 10.1504/ijcat.2009.028042
- Jan 1, 2009
- International Journal of Computer Applications in Technology
This paper presents a novel multi-objective evolutionary algorithm for hardware software partitioning of embedded systems. Customised genetic algorithms have been effectively used for solving complex optimisation problems (NP Hard) but are mainly applied to optimise a particular solution with respect to a single objective. Many real world problems in embedded systems have multiple objective functions like area, performance, power, latency, etc., which are to be maximised or minimised at the early stage of the design process. Hence multi-objective formulations are realistic models for many complex engineering optimisation problems. A multi-objective optimisation problem usually has a set of Pareto-optimal solutions, instead of one single optimal solution. A method is put forward for generating Pareto solutions using elitist non-dominated sorting genetic algorithm (ENGA) whose complexity is only O(MN²), where M is the number of objectives and N is the population size. The algorithm is implemented using Visual C++ and the performance metrics for weighted-sum genetic algorithm (WSGA) and ENGA are compared. The results of extensive hardware/software partitioning technique on numerous benchmarks are also presented which can be used practically at the early stage of the design process. From the simulation results ENGA (NSGA-II) was found to perform better than WSGA.
- Book Chapter
- 10.4018/978-1-5225-5088-4.ch001
- Jan 1, 2018
We are moving towards digitization, connecting all our devices, such as sensors and cameras, to the internet and thereby producing big data. This big data comes in many varieties and has paved the way for the emergence of NoSQL databases, like Cassandra, for achieving scalability and availability. The Hadoop framework has been developed for storing and processing distributed data. In this chapter, the authors investigate the storage and retrieval of geospatial data by integrating Hadoop and Cassandra, using both a prefix-based partitioning technique and Cassandra's default partitioning algorithm (the Murmur3Partitioner). A geohash value is generated, which acts as the partition key and also enables effective search; hence, the time taken to retrieve data is optimized. When users issue spatial queries, such as finding the nearest locations, the Cassandra database is searched using both partitioning techniques, and their query response times are compared to determine which method is more effective. Results show that the prefix-based partitioning technique is more efficient than the Murmur3 partitioning technique.
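A minimal sketch of geohash-based prefix partitioning, assuming a standard geohash encoder and a 3-character prefix as the partition key; neither detail is taken from the chapter itself.

```python
# Sketch: geohash encoding and prefix-based partition keys. The encoder
# follows the standard geohash algorithm; the prefix length is an
# assumption for illustration, not the chapter's choice.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=6):
    """Standard geohash: interleave longitude/latitude range halvings,
    emitting one base-32 character per 5 bits."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits, ch, use_lon, out = 0, 0, True, []
    while len(out) < precision:
        rng, val = (lon_rng, lon) if use_lon else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            ch = (ch << 1) | 1
            rng[0] = mid
        else:
            ch = ch << 1
            rng[1] = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:
            out.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(out)

def partition_key(lat, lon, prefix_len=3):
    # Nearby points share geohash prefixes, so a short prefix groups
    # them into the same partition and narrows spatial searches.
    return geohash(lat, lon)[:prefix_len]
```

Rows for nearby coordinates land in the same partition, so a nearest-location query only needs to probe the partitions whose keys match the candidate prefixes, rather than hashing rows uniformly across the cluster as Murmur3Partitioner does.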