Rewriting the Infinite Chase for Guarded TGDs
This paper revisits Datalog rewriting methods for query answering over guarded tuple-generating dependencies, introducing algorithms that generate shortcut rules to efficiently bypass chase steps. Empirical results demonstrate their effectiveness on complex synthetic and real benchmarks, supporting practical applicability.
Guarded tuple-generating dependencies (GTGDs) are a natural extension of description logics and referential constraints. It has long been known that queries over GTGDs can be answered by a variant of the chase, a quintessential technique for reasoning with dependencies. However, there has been little work on concrete algorithms and even less on implementation. To address this gap, we revisit Datalog rewriting approaches to query answering, where a set of GTGDs is transformed to a Datalog program that entails the same base facts on each base instance. We show that a rewriting consists of “shortcut” rules that circumvent certain chase steps, we present several algorithms that compute a rewriting by deriving such “shortcuts” efficiently, and we discuss important implementation issues. Finally, we show empirically that our techniques can process complex GTGDs derived from synthetic and real benchmarks and are thus suitable for practical use.
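To make the “shortcut” idea concrete, here is a minimal worked example of our own devising (it does not appear in the abstract above):

```latex
% Two GTGDs (guard atoms: R(x,y) and S(y,z)) and a Datalog shortcut rule.
\tau_1 :\; R(x,y) \wedge A(x) \rightarrow \exists z\, S(y,z) \qquad
\tau_2 :\; S(y,z) \rightarrow B(y)
% On the base instance \{R(a,b),\, A(a)\}, the chase invents a labelled null
% z_1 with S(b,z_1) via \tau_1 and then derives the base fact B(b) via \tau_2.
% The shortcut rule below derives B(b) directly, with no existential step:
\rho :\; R(x,y) \wedge A(x) \rightarrow B(y)
```

A Datalog rewriting built from such shortcut rules entails the same base facts as the chase on every base instance, which is what makes plain Datalog evaluation sufficient for query answering.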
- Research Article
- 5
- 10.14778/3551793.3551851
- Jul 1, 2022
- Proceedings of the VLDB Endowment
- Conference Article
- 3
- 10.1109/ipdpsw.2017.108
- May 1, 2017
We explore the use of synthetic benchmarks for the training phase of machine-learning-based automatic performance tuning. We focus on the problem of predicting if the use of local memory on a GPU is beneficial for caching a single target array in a GPU kernel. We show that the use of only 13 real benchmarks leads to poor prediction accuracy (about 58%) of the 13 leave-one-out models trained using these benchmarks, even when the model features are sufficiently comprehensive. We define a metric, called the average vicinity density, to measure the quality of a training set. We then use it to demonstrate that the poor accuracy of the models built with the real benchmarks is indeed due to the limited size and coverage of the training set. In contrast, the use of a properly generated set of 90K synthetic benchmarks leads to significantly better accuracies, up to 87%. These results validate our approach of using synthetic benchmarks for training machine learning models. We describe a synthetic benchmark template for the local memory optimization. We then present two approaches to using this template and a seed set of real benchmarks to generate a large number of synthetic benchmarks. We also explore the impact of the number of synthetic benchmarks used in training.
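As a sketch of the leave-one-out setup described above (the features, labels, and model choice here are our own hypothetical stand-ins, not the paper's):

```python
# Train on 12 of the 13 real benchmarks, predict whether local-memory
# caching helps on the held-out one; repeat for every benchmark.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut

X = np.random.rand(13, 8)          # 8 hypothetical kernel features per benchmark
y = np.random.randint(0, 2, 13)    # 1 = local-memory caching is beneficial

correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    correct += int(model.predict(X[test_idx])[0] == y[test_idx][0])

print(f"leave-one-out accuracy: {correct / len(X):.0%}")
```

With only 13 training points, such a model is highly sensitive to training-set coverage, which is the gap the paper's 90K synthetic benchmarks are meant to close.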
- Conference Article
- 10.1109/dsd.2014.16
- Aug 1, 2014
We present a framework that automatically generates system-level synthetic benchmarks from traditional benchmarks. Synthetic benchmarks have performance behavior similar to that of the original benchmarks they are generated from, and they run faster. Synthetics can also be used as proxies where original applications are not available in source form. In experiments we observe that our system-level benchmarks are not only much smaller than the real benchmarks they are generated from but also much faster. For example, when we generate synthetic benchmarks from the well-known multicore benchmark suite PARSEC, our benchmarks have an average speedup of 149x over the PARSEC benchmarks. We also observe that the performance behavior of the synthetics has more than 85% similarity to the real benchmarks.
- Research Article
- 62
- 10.1093/logcom/9.3.319
- Jun 1, 1999
- Journal of Logic and Computation
Number restrictions are concept constructors that are available in almost all implemented Description Logic systems. However, they are mostly available only in a rather weak form, which considerably restricts their expressive power. On the one hand, the roles that may occur in number restrictions are usually of a very restricted type, namely atomic roles or complex roles built using either intersection or inversion. In the present paper, we increase the expressive power of Description Logics by allowing for more complex roles in number restrictions. As role constructors, we consider composition of roles (which will be present in all our logics) and intersection, union, and inversion of roles in different combinations. We will present two decidability results (for the basic logic that extends ALC by number restrictions on roles with composition, and for one extension of this logic), and three undecidability results for three other extensions of the basic logic. On the other hand, with the rather weak form of number restrictions available in implemented systems, the number of role successors of an individual can only be restricted by a fixed non-negative integer. To overcome this lack of expressiveness, we allow for variables ranging over the non-negative integers in place of the fixed numbers in number restrictions. The expressive power of this constructor is increased even further by introducing explicit quantifiers for the numerical variables. The Description Logic obtained this way turns out to have an undecidable satisfiability problem. For a restricted logic we show that concept satisfiability is decidable.
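As a hedged illustration of the two extensions described above (the concepts and notation are ours, written in the spirit of the paper, not examples taken from it):

```latex
% A number restriction over a role composition: individuals with at most
% two grandchildren.
C_1 \equiv (\le 2\;\; \mathsf{hasChild} \circ \mathsf{hasChild})
% A restriction with an explicit quantifier over a numeric variable n:
% individuals owning no more cars than they have garages, which no fixed
% non-negative integer bound can express.
C_2 \equiv \exists n.\, (\le n\;\, \mathsf{ownsCar}) \sqcap (\ge n\;\, \mathsf{hasGarage})
```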
- Dissertation
- 10.31274/etd-180810-2363
- May 30, 2013
This thesis proposes an integrated hardware-software solution for improving Single-Instruction Multiple-Thread (SIMT) branching efficiency. Unlike current SIMT hardware branching architectures, this hardware-software solution lets programmers fine-tune branching behavior for their application or lets the compiler implement a generic software solution. To support a wide range of SIMT applications with different control flow properties, three branching methods are implemented in hardware with configurable software instructions: the contemporary Post-Dominator Re-convergence method currently implemented in SIMT processors, a proposed Hyperthreaded SIMT processor core for maintaining statically allocated thread warps, and proposed Dynamic Micro-Kernels that modify thread warps during run-time execution. Each of the implemented branching methods has strengths and weaknesses and yields different performance improvements depending on the application. SIMT hyper-threading turns a single SIMT processor core into multiple virtual processors. These virtual processors run divergent control flow paths from threads in the same warp in parallel. How the virtual processor cores are created is controlled using a per-warp stack that is managed through software instructions. Dynamic Micro-Kernels create new threads at run time to execute divergent control flow paths instead of using branching instructions. A spawn instruction creates threads at run time; once created, they are placed into new warps with similar threads that follow the same control flow path. The thesis's integrated hardware-software branching architectures are evaluated using realistic benchmarks with varying control flow divergence. Synthetic benchmarks, designed to test specific branching conditions and isolate common branching behaviors, are also used for evaluation. Each of the hardware-implemented branching solutions is tested in isolation using different software algorithms, designed either for general-purpose use or to target specific types of branching conditions. Results show improved performance for divergent applications, and the choice of software algorithm affects performance.
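A minimal sketch of the Dynamic Micro-Kernels idea described above (our own simplification; the warp size and data layout are assumptions, not the thesis's design):

```python
# Instead of serializing divergent paths inside one warp, spawned threads
# are regrouped at run time into new warps of threads that took the same
# control-flow path.
from collections import defaultdict

WARP_SIZE = 4

def regroup(threads):
    """threads: list of (thread_id, branch_path) pairs -> list of warps."""
    by_path = defaultdict(list)
    for tid, path in threads:
        by_path[path].append(tid)
    warps = []
    for path, tids in by_path.items():
        for i in range(0, len(tids), WARP_SIZE):   # pack each path densely
            warps.append((path, tids[i:i + WARP_SIZE]))
    return warps

# 8 threads diverging on a 2-way branch: regrouping yields full warps per path
threads = [(t, "taken" if t % 2 else "not-taken") for t in range(8)]
for path, warp in regroup(threads):
    print(path, warp)
```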
- Research Article
- 10.1016/j.jpdc.2020.04.014
- May 21, 2020
- Journal of Parallel and Distributed Computing
Transparent speculation in geo-replicated transactional data stores
- Conference Article
- 6
- 10.1109/icicdt.2013.6563340
- May 1, 2013
This paper presents a bidirectional interconnect design that achieves a significant reduction in area and power by allowing simultaneous transmission and reception of signals on a single interconnect segment. The proposed interconnect design achieves twice the throughput with the same link width. We have modeled the bidirectional link on a 7×7 cycle-accurate NoC design and explored latency for synthetic and realistic SPLASH-2 benchmarks. Synthetic benchmark results show that the bidirectional design performs exceedingly well under high congestion. A combination of realistic benchmarks shows that the bidirectional design achieves much better latency whenever the injection level of the combined benchmark is higher.
- Conference Article
- 4
- 10.1145/3208040.3208055
- Jun 11, 2018
This work presents Speculative Transaction Replication (STR), a protocol that exploits transparent speculation techniques to enhance the performance of geo-distributed, partially replicated transactional data stores. In addition, we define a new consistency model, Speculative Snapshot Isolation (SPSI), that extends the semantics of Snapshot Isolation (SI) to shelter applications from the subtle anomalies that can arise from speculative transaction processing. SPSI extends SI in an intuitive and rigorous fashion by specifying desirable atomicity and isolation guarantees that must hold when using speculative execution. STR provides a form of speculation that is fully transparent to programmers (it does not expose the effects of misspeculation to clients). Since the speculation techniques employed by STR satisfy SPSI, application programs can leverage them transparently, without any source-code modification to applications designed to operate under SI. STR combines two key techniques: speculative reads, which allow transactions to observe pre-committed versions, reducing the 'effective duration' of pre-commit locks and enhancing throughput; and Precise Clocks, a novel timestamping mechanism that uses per-item timestamps with physical clocks, which together greatly enhance the probability of successful speculation. We assess STR's performance on up to nine geo-distributed Amazon EC2 data centers, using both synthetic benchmarks and realistic benchmarks (TPC-C and RUBiS). Our evaluation shows that STR achieves throughput gains of up to 11X and latency reductions of up to 10X in workloads characterized by low inter-data-center contention. Furthermore, thanks to a self-tuning mechanism that dynamically and transparently enables and disables speculation, STR offers robust performance even when faced with unfavourable workloads that suffer from high misspeculation rates.
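To illustrate the speculative-read idea in the abstract above, here is a toy sketch (the class names and API are hypothetical, not STR's actual implementation):

```python
# A speculative read may observe a PRE-COMMITTED version instead of blocking
# until the writer finishes its commit, shortening the "effective duration"
# of pre-commit locks. If the writer aborts, dependent readers must abort too.
import time

class Version:
    def __init__(self, value, ts, precommitted=False):
        self.value, self.ts, self.precommitted = value, ts, precommitted

class Store:
    def __init__(self):
        self.versions = {}                  # key -> list of Version, newest last

    def write_precommit(self, key, value):
        # Per-item timestamp taken from a physical clock, loosely in the
        # spirit of the paper's "Precise Clocks" mechanism.
        v = Version(value, time.time_ns(), precommitted=True)
        self.versions.setdefault(key, []).append(v)
        return v

    def finalize(self, version):
        version.precommitted = False        # commit completes

    def read(self, key, speculative=True):
        for v in reversed(self.versions.get(key, [])):
            if speculative or not v.precommitted:
                return v                    # speculative readers see pre-commits
        return None

store = Store()
pc = store.write_precommit("x", 42)
print(store.read("x").value)                # 42, observed before commit finishes
store.finalize(pc)
```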
- Research Article
- 13
- 10.1145/1453175.1453184
- Aug 31, 2008
- ACM SIGMETRICS Performance Evaluation Review
Benchmarks are crucial to understanding software systems and assessing their performance. In file-system research, synthetic benchmarks are accepted and widely used as substitutes for more realistic and complex workloads. However, synthetic benchmarks are largely based on the benchmark writer's interpretation of the real workload, and how it exercises the system API. This is insufficient since even a simple operation through the API may end up exercising the file system in very different ways due to effects of features such as caching and prefetching. In this paper, we describe our first steps in creating "realistic synthetic" benchmarks by building a tool, CodeMRI. CodeMRI leverages file-system domain knowledge and a small amount of system profiling in order to better understand how the benchmark is stressing the system and to deconstruct its workload.
- Conference Article
- 18
- 10.1109/itsc.2019.8916914
- Oct 1, 2019
Learning-based monocular depth estimation has become popular in recent years. However, training reliable depth estimators requires large volumes of ground-truth depth data, which is expensive to obtain. To overcome this challenge, we propose a novel monocular depth estimation system that adopts pixel-perfect synthetic image training. The model is trained with synthetic data but performs inference on realistic images by applying image domain adaptation. Considering realistic constraints such as shadow regions, which cause performance drops in depth estimation, the system adopts a dedicated module that removes such ambient shadows from images to guarantee premier performance on the task. Experimental results on both synthetic and realistic benchmarks indicate the efficacy and the advantages of the system compared to existing state-of-the-art technologies.
- Conference Article
- 6
- 10.1109/pdp.2012.70
- Feb 1, 2012
Three-dimensional (3D) integration is a viable design paradigm to overcome the existing interconnect bottleneck in integrated systems and enhance system power/performance characteristics. The stacked mesh 3D NoC architecture was proposed in order to exploit the intrinsic capability of 3D ICs to reduce wire length. However, this architecture suffers from a naive and straightforward hybridization of NoC and bus media. In this paper, an efficient hybridization scheme is presented to enhance the performance, power consumption, and area of stacked mesh 3D NoC architectures. By utilizing a routing rule called LastZ, the proposed hybridization scheme offers many advantages, which are investigated in detail. Our extensive simulations with synthetic and real benchmarks, including an integrated videoconference application, show that compared to a typical 3D NoC-Bus Hybrid Mesh architecture, our hybridization scheme achieves significant power, performance, and area improvements.
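A sketch of our reading of the LastZ rule (coordinates and the routing function are our own illustrative assumptions): in a stacked mesh, a packet routes X then Y within its source layer's 2D mesh and takes the vertical bus only as its final hop.

```python
def lastz_route(src, dst):
    """src, dst: (x, y, z) router coordinates. Returns the hop-by-hop path."""
    x, y, z = src
    path = [src]
    while x != dst[0]:                 # X first, within the source layer
        x += 1 if dst[0] > x else -1
        path.append((x, y, z))
    while y != dst[1]:                 # then Y, still within the source layer
        y += 1 if dst[1] > y else -1
        path.append((x, y, z))
    if z != dst[2]:                    # single vertical bus hop comes last
        path.append((x, y, dst[2]))
    return path

print(lastz_route((0, 0, 0), (2, 1, 3)))
```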
- Conference Article
- 9
- 10.1109/compsac.2011.65
- Jul 1, 2011
This paper presents a novel virtual-channel (VC) sharing technique for NoC architectures. The proposed architecture improves the utilization of resources to enhance performance with minimal overheads. A heuristic approach to choosing a proper VC sharing strategy is proposed, realized by an adaptive algorithm that configures VC sharing based on link load parameters. The architectural design that realizes adaptive VC sharing in a generic router is elaborated. The technique can be applied to any NoC architecture, including 3D NoCs. Extensive quantitative experiments with synthetic and real benchmarks, including an integrated video conference application, demonstrate considerable improvements in area and power efficiency compared to existing VC-based 2D/3D NoC architectures.
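To make "configures VC sharing based on link load" concrete, here is a toy heuristic of our own (the thresholds, counts, and function are hypothetical, not the paper's algorithm):

```python
# Lightly loaded links lend virtual channels to heavily loaded ones,
# improving buffer utilization with a fixed total VC budget.
def share_vcs(links, base_vcs=4, hi=0.75, lo=0.25):
    """links: {name: load in [0, 1]}. Returns {name: allocated VC count}."""
    alloc = {name: base_vcs for name in links}
    donors = [n for n, load in links.items() if load < lo]
    takers = sorted((n for n, load in links.items() if load > hi),
                    key=lambda n: -links[n])          # most loaded first
    for taker in takers:
        if not donors:
            break
        donor = donors.pop()
        alloc[donor] -= 1                 # each donor gives up one VC
        alloc[taker] += 1
    return alloc

print(share_vcs({"east": 0.9, "west": 0.1, "north": 0.5, "south": 0.8}))
```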
- Research Article
- 6
- 10.1145/3007898
- Jan 4, 2017
- ACM Transactions on Design Automation of Electronic Systems
The synchronous dataflow model of computation is widely used to design embedded stream-processing applications under strict quality-of-service requirements (e.g., buffering size, throughput, input-output latency). The required analyses can be performed either at compile time (for design space exploration) or at runtime (for resource management and reconfigurable systems). However, these analyses have an exponential time complexity, which may cause a huge runtime overhead or make design space exploration unacceptably slow. In this article, we argue that symbolic analyses are more appropriate since they express the system performance as a function of parameters (i.e., input and output rates, execution times). Such functions can be quickly evaluated for each different configuration or checked with respect to different quality-of-service requirements. We provide symbolic analyses for computing the maximal throughput of acyclic synchronous dataflow graphs, the minimum buffer sizes for which as-soon-as-possible (ASAP) scheduling achieves this throughput, and, finally, the corresponding input-output latency of the graph. The article first investigates these problems for a single parametric edge. The results are then extended to general acyclic graphs using linear approximation techniques. We assess the proposed analyses experimentally on both synthetic and real benchmarks.
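As a sketch of the kind of symbolic quantity such analyses manipulate, consider a single SDF edge with parametric production rate p and consumption rate q. The code below (ours, not the article's analysis) computes the repetition vector and the classical bound p + q - gcd(p, q), the minimum buffer size that admits a deadlock-free schedule of one edge; the article derives further symbolic expressions (throughput, latency) not reproduced here.

```python
from math import gcd

def repetition_vector(p, q):
    """Producer/consumer firing counts per graph iteration."""
    g = gcd(p, q)
    return q // g, p // g

def min_deadlock_free_buffer(p, q):
    """Classical minimum channel capacity for a single deadlock-free edge."""
    return p + q - gcd(p, q)

p, q = 6, 4
print(repetition_vector(p, q))         # (2, 3): 2 producer, 3 consumer firings
print(min_deadlock_free_buffer(p, q))  # 8 tokens of buffering
```

Because both functions are closed-form in p and q, they can be re-evaluated instantly for each configuration, which is exactly the advantage the article claims for symbolic over exhaustive analysis.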
- Conference Article
- 40
- 10.1109/wacv56688.2023.00236
- Jan 1, 2023
Multi-task learning has recently become a promising solution for comprehensive understanding of complex scenes. With an appropriate design, multi-task models can not only be memory-efficient but can also favour the exchange of complementary signals across tasks. In this work, we jointly address 2D semantic segmentation and two geometry-related tasks, namely dense depth and surface normal estimation, as well as edge estimation, showing their benefit on several datasets. We propose a novel multi-task learning architecture that exploits pair-wise cross-task exchange through correlation-guided attention and self-attention to enhance the average representation learning for all tasks. We conduct extensive experiments on three multi-task setups, showing the benefit of our proposal in comparison to competitive baselines on both synthetic and real benchmarks. We also extend our method to the novel multi-task unsupervised domain adaptation setting. Our code is available at https://github.com/cv-rits/DenseMTL
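A heavily simplified numpy sketch of pair-wise cross-task exchange via attention (the shapes, single-head formulation, and random projections are our assumptions, not the DenseMTL architecture):

```python
# Features of a source task (e.g. depth) re-weight features of a target
# task (e.g. segmentation) through standard scaled dot-product attention.
import numpy as np

def cross_task_attention(target, source, d=32, rng=np.random.default_rng(0)):
    """target, source: (tokens, channels) feature maps of two tasks."""
    Wq = rng.normal(size=(target.shape[1], d))
    Wk = rng.normal(size=(source.shape[1], d))
    Wv = rng.normal(size=(source.shape[1], d))
    Q, K, V = target @ Wq, source @ Wk, source @ Wv
    scores = Q @ K.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)      # row-wise softmax
    return attn @ V                               # source-informed target features

seg = np.random.rand(16, 64)    # 16 tokens of segmentation features
depth = np.random.rand(16, 64)  # 16 tokens of depth features
print(cross_task_attention(seg, depth).shape)     # (16, 32)
```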
- Conference Article
- 28
- 10.1145/3106237.3106261
- Aug 21, 2017
In this paper we present the design and implementation of a distributed, whole-program static analysis framework that is designed to scale with the size of the input. Our approach is based on the actor programming model and is deployed in the cloud. Our reliance on a cloud cluster provides a degree of elasticity for CPU, memory, and storage resources. To demonstrate the potential of our technique, we show how a typical call graph analysis can be implemented in a distributed setting. The vision that motivates this work is that every large-scale software repository such as GitHub, BitBucket, or Visual Studio Online will be able to perform static analysis on a large scale. We experimentally validate our implementation of the distributed call graph analysis using a combination of both synthetic and real benchmarks. To show scalability, we demonstrate how the analysis presented in this paper is able to handle inputs that are almost 10 million lines of code (LOC) in size, without running out of memory. Our results show that the analysis scales well in terms of memory pressure independently of the input size, as we add more virtual machines (VMs). As the number of worker VMs increases, we observe that the analysis time generally improves as well. Lastly, we demonstrate that querying the results can be performed with a median latency of 15 ms.
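A toy sketch of the actor idea behind this distributed analysis (our own simplification; the call graph, partitioning, and message loop are hypothetical, not the paper's system):

```python
# Each actor owns a partition of the program, resolves call edges for its
# methods, and forwards newly reachable methods to the owning actor as
# messages until a fixpoint is reached.
from collections import deque

CALLS = {                       # hypothetical call edges: method -> callees
    "main": ["parse", "run"],
    "run": ["step", "log"],
    "parse": ["log"],
    "step": [],
    "log": [],
}
OWNER = {m: hash(m) % 2 for m in CALLS}    # partition methods over 2 actors

mailboxes = {0: deque(["main"]), 1: deque()}
reachable, edges = set(), set()

while any(mailboxes.values()):             # message-driven fixpoint loop
    for actor, box in mailboxes.items():
        while box:
            m = box.popleft()
            if m in reachable:
                continue                   # already processed elsewhere
            reachable.add(m)
            for callee in CALLS[m]:
                edges.add((m, callee))
                mailboxes[OWNER[callee]].append(callee)   # send to owner

print(sorted(reachable))
print(sorted(edges))
```

In a real deployment the mailboxes would be network message queues between cloud VMs, which is what lets memory pressure stay flat as workers are added.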