Improving scalability of sequential task flow models with cache-friendly parallel dependency tracking
- Conference Article
2
- 10.1109/ross56639.2022.00009
- Nov 1, 2022
The sequential task flow (STF) model is the mainstream approach for interacting with task-based runtime systems, with StarPU and Dynamic Task Discovery (DTD) in PaRSEC being two implementations of this model. Compared with other approaches to submitting tasks into a runtime system, STF has interesting advantages centered around an easy-to-use API that allows users to express algorithms as a sequence of tasks (much as in OpenMP), while allowing the runtime to automatically identify and analyze task dependencies and scheduling. In this paper, we focus on the DTD interface in PaRSEC, highlight some of its lesser-known limitations, and implement two optimization techniques for DTD: support for user-level graph trimming, and a new API for broadcasting read-only data to remote tasks. We then analyze the benefits and limitations of these optimizations with benchmarks as well as on two common matrix factorization kernels, Cholesky and QR, on two different systems: Shaheen II from KAUST and Fugaku from RIKEN. We point out some potential for further improvements and provide valuable insights into the strengths and weaknesses of the STF model, hoping to guide the future development of task-based runtime systems.
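The core idea behind STF interfaces such as StarPU's and PaRSEC's DTD can be sketched in a few lines: tasks are submitted in program order with declared data accesses, and the runtime derives the DAG edges from read-after-write, write-after-read, and write-after-write hazards. The following is a toy illustration of that inference, not the actual PaRSEC or StarPU API; all names are hypothetical.

```python
# Toy sketch of STF dependency inference: tasks declared in sequential
# submission order with (reads, writes) sets; edges are derived from
# RAW, WAR, and WAW hazards on the accessed data.

def build_dag(tasks):
    """tasks: list of (name, reads, writes); returns the set of dependency edges."""
    last_writer = {}   # data -> task that last wrote it
    readers = {}       # data -> tasks that read it since the last write
    edges = set()
    for name, reads, writes in tasks:
        for d in reads:                          # read-after-write dependency
            if d in last_writer:
                edges.add((last_writer[d], name))
            readers.setdefault(d, []).append(name)
        for d in writes:
            for r in readers.get(d, []):         # write-after-read dependency
                if r != name:
                    edges.add((r, name))
            if d in last_writer and last_writer[d] != name:
                edges.add((last_writer[d], name))  # write-after-write dependency
            last_writer[d] = name
            readers[d] = []
    return edges

# Cholesky-like fragment: a factorization task followed by two solves
# that both read the factored block and can therefore run in parallel.
dag = build_dag([
    ("potrf", [], ["A00"]),
    ("trsm1", ["A00"], ["A10"]),
    ("trsm2", ["A00"], ["A20"]),
])
```

Because `trsm1` and `trsm2` only read `A00`, no edge is created between them, which is exactly the parallelism the runtime exploits.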
- Research Article
2
- 10.1177/1094342019888567
- Nov 29, 2019
- The International Journal of High Performance Computing Applications
We describe the parallelization of the solve phase in the sparse Cholesky solver SpLLT when using a sequential task flow model. In the context of direct methods, the solution of a sparse linear system is achieved through three main phases: the analyse, the factorization and the solve phases. In the last two phases, which involve numerical computation, the factorization corresponds to the most computationally costly phase, and it is therefore crucial to parallelize this phase in order to reduce the time-to-solution on modern architectures. As a consequence, the solve phase is often not as optimized as the factorization in state-of-the-art solvers, and opportunities for parallelism are often not exploited in this phase. However, in some applications, the time spent in the solve phase is comparable to or even greater than the time for the factorization, and the user could dramatically benefit from a faster solve routine. This is the case, for example, for a conjugate gradient (CG) solver using a block Jacobi preconditioner. The diagonal blocks are factorized once only, but their factors are used to solve subsystems at each CG iteration. In this study, we design and implement a parallel version of a task-based solve routine for an OpenMP version of the SpLLT solver. We show that we can obtain good scalability on a multicore architecture enabling a dramatic reduction of the overall time-to-solution in some applications.
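The task structure of a blocked forward solve can be sketched as follows; this is a dense toy illustration of the pattern, not SpLLT's sparse implementation. Each diagonal block yields a "solve" task, and the sub-diagonal blocks yield "update" tasks whose block-rows are mutually independent once the corresponding solve completes, which is where the parallelism comes from.

```python
# Minimal sketch of a blocked forward substitution L y = b, structured
# as alternating "solve" (diagonal block) and "update" (sub-diagonal
# block) tasks, mirroring the task decomposition of a parallel solve phase.

def blocked_forward_solve(L, b, nb):
    """L: dense lower-triangular matrix (list of lists), b: right-hand side, nb: block size."""
    n = len(b)
    y = b[:]
    for j in range(0, n, nb):                # loop over block-columns
        jb = min(nb, n - j)
        for c in range(j, j + jb):           # "solve" task on the diagonal block
            y[c] /= L[c][c]
            for r in range(c + 1, j + jb):
                y[r] -= L[r][c] * y[c]
        for r in range(j + jb, n):           # "update" tasks: rows are independent
            for c in range(j, j + jb):
                y[r] -= L[r][c] * y[c]
    return y
```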
- Research Article
40
- 10.1079/bjn2001317
- May 1, 2001
- British Journal of Nutrition
Large (>1600 microm), ingestively masticated particles of bermuda grass (Cynodon dactylon L. Pers.) leaf and stem labelled with 169Yb and 144Ce respectively were inserted into the rumen digesta raft of heifers grazing bermuda grass. The concentration of markers in digesta sampled from the raft and ventral rumen was monitored at regular intervals over approximately 144 h. The data from the two sampling sites were simultaneously fitted to two-pool (raft and ventral rumen-reticulum) models with either reversible or sequential flow between the two pools. The sequential flow model fitted the data as well as the reversible flow model, but the reversible flow model was used because of its greater applicability. The reversible flow model, hereafter called the raft model, had the following features: a relatively slow age-dependent transfer rate from the raft (means for a gamma-2 distributed rate parameter for leaf 0.0740 v. stem 0.0478 h(-1)), a very slow first-order reversible flow from the ventral rumen to the raft (mean for leaf and stem 0.010 h(-1)) and a very rapid first-order exit from the ventral rumen (mean of leaf and stem 0.44 h(-1)). The raft was calculated to occupy approximately 0.82 of the total rumen DM of the raft and ventral rumen pools. Fitting a sequential two-pool model or a single exponential model individually to values from each of the two sampling sites yielded similar parameter values for both sites and faster rate parameters for leaf as compared with stem, in agreement with the raft model. These results were interpreted as indicating that the raft forms a large, relatively inert pool within the rumen. Particles generated within the raft have difficulty escaping, but once in the ventral rumen pool they escape quickly with a low probability of return to the raft.
It was concluded that the raft model gave a good interpretation of the data and emphasized escape from and movement within the raft as important components of the residence time of leaf and stem particles within the rumen digesta of cattle.
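The raft model's dynamics can be reproduced numerically from the quoted leaf parameters (raft exit with a gamma-2 distributed rate of 0.0740 h(-1), ventral-to-raft return at 0.010 h(-1), ventral exit at 0.44 h(-1)). This is a rough sketch, not the authors' fitting code; the gamma-2 age dependence is approximated, as is conventional, by two sequential first-order sub-compartments.

```python
# Euler simulation of the reversible two-pool raft model for a unit
# marker dose. The raft is split into two sequential sub-compartments
# (r1, r2) to approximate the gamma-2 age-dependent exit.

def simulate_raft(k_raft=0.0740, k_return=0.010, k_exit=0.44,
                  dt=0.01, t_end=144.0):
    r1, r2, v, out = 1.0, 0.0, 0.0, 0.0   # raft stages, ventral rumen, escaped
    t = 0.0
    while t < t_end:
        f12 = k_raft * r1 * dt      # raft stage 1 -> stage 2
        f2v = k_raft * r2 * dt      # raft -> ventral rumen
        fvr = k_return * v * dt     # slow reversible return to the raft
        fvo = k_exit * v * dt       # rapid exit from the ventral rumen
        r1 += fvr - f12
        r2 += f12 - f2v
        v += f2v - fvo - fvr
        out += fvo
        t += dt
    return r1 + r2, v, out
```

Consistent with the paper's interpretation, after 144 h almost all of the marker has escaped: residence time is dominated by the slow raft exit, while the ventral pool drains quickly.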
- Conference Article
1
- 10.1109/pesgm.2016.7741896
- Jul 1, 2016
In view of the advance of dynamic wireless power transfer (DWPT) for electric vehicle (EV) charging, integrated DWPT systems can be expected to emerge. However, quasi-continuous and time-varying DWPT charging loads pose large challenges for the power flow analysis of integrated DWPT systems. In this paper, a sequential load model is proposed to describe the temporal dependency of DWPT charging demands. Furthermore, on the basis of the back/forward sweep method, a new power flow analysis method is proposed to deal with quasi-continuous and time-varying DWPT charging loads. Then a day-long sequential power flow simulation is conducted on a test system. Simulation results show that the fluctuation of branch flows varies significantly. Adequate power and energy management is necessary for integrated DWPT systems.
- Conference Article
- 10.1109/icassp39728.2021.9414580
- Jun 6, 2021
Stochastic recurrent neural networks have shown promising performance for modeling complex sequences. Nonetheless, existing methods adopt KL divergence as distribution regularizations in their latent spaces, which limits the choices of models for latent distribution construction. In this paper, we incorporate maximum mean discrepancy in the recurrent structure for distribution regularization. Maximum mean discrepancy is able to measure the difference between two distributions by just sampling from them, which enables us to construct more complicated latent distributions by neural networks. Therefore, our proposed algorithm is able to model more complex sequences. Experiments conducted on two different sequential modeling tasks show that our method outperforms the state-of-the-art sequential modeling algorithms.
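The key quantity here, the empirical maximum mean discrepancy, really can be estimated purely from samples, which is what frees the latent distribution from the closed-form requirements of a KL term. A minimal scalar sketch with a Gaussian kernel (the biased V-statistic estimator; not the paper's implementation):

```python
# Biased empirical estimate of MMD^2 between two sample sets:
# MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)], Gaussian kernel.

import math

def gaussian_kernel(a, b, sigma=1.0):
    return math.exp(-(a - b) ** 2 / (2 * sigma ** 2))

def mmd2(xs, ys, sigma=1.0):
    kxx = sum(gaussian_kernel(x1, x2, sigma) for x1 in xs for x2 in xs) / len(xs) ** 2
    kyy = sum(gaussian_kernel(y1, y2, sigma) for y1 in ys for y2 in ys) / len(ys) ** 2
    kxy = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy
```

Identical sample sets give an MMD of zero, while well-separated samples give a large value; used as a regularizer, the estimate is differentiable through the samples, so the latent distribution only needs to be samplable.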
- Book Chapter
4
- 10.1007/978-3-319-78024-5_18
- Jan 1, 2018
We describe the design of a sparse direct solver for symmetric positive-definite systems using the PaRSEC runtime system. In this approach the application is represented as a DAG of tasks and the runtime system runs the DAG on the target architecture. Portability of the code across different architectures is enabled by delegating to the runtime system the task scheduling and data management. Although runtime systems have been exploited widely in the context of dense linear algebra, the DAGs arising in sparse linear algebra algorithms remain a challenge for such tools because of their irregularity. In addition to overheads induced by the runtime system, the programming model used to describe the DAG impacts the performance and the scalability of the code. In this study we investigate the use of a Parametrized Task Graph (PTG) model for implementing a task-based supernodal method. We discuss the benefits and limitations of this model compared to the popular Sequential Task Flow model (STF) and conduct numerical experiments on a multicore system to assess our approach. We also validate the performance of our solver SpLLT by comparing it to the state-of-the-art solver MA87 from the HSL library.
- Research Article
35
- 10.1145/2898348
- Aug 16, 2016
- ACM Transactions on Mathematical Software
To face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism regained popularity in the high performance, scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. This article evaluates the usability and effectiveness of runtime systems based on the Sequential Task Flow model for complex applications, namely, sparse matrix multifrontal factorizations that feature extremely irregular workloads, with tasks of different granularities and characteristics and with a variable memory consumption. Most importantly, it shows how this parallel programming model eases the development of complex features that benefit the performance of sparse, direct solvers as well as their memory consumption. We illustrate our discussion with the multifrontal QR factorization running on top of the StarPU runtime system.
- Research Article
2
- 10.1016/0898-1221(94)90007-8
- Jan 1, 1994
- Computers and Mathematics with Applications
Sequential and data flow models of processing
- Book Chapter
1
- 10.1007/978-3-030-33966-1_1
- Nov 15, 2019
Natural Language Understanding is one of the essential tasks for building clinical text-based applications. Understanding of these clinical texts can be achieved through Vector Space Models and sequential modelling tasks. This paper focuses on sequential modelling, i.e. Named Entity Recognition and Part of Speech Tagging, attaining state-of-the-art performance of 93.8% F1 score on the i2b2 clinical corpus and 97.29% F1 score on the GENIA corpus. This paper also reports the performance of feature fusion, integrating word embeddings, feature embeddings, and character embeddings for sequential modelling tasks. We also propose a framework based on a sequential modelling architecture, named MedNLU, which is capable of performing Part of Speech Tagging, Chunking, and Entity Recognition on clinical texts. The sequence modeller in MedNLU is an integrated framework of a Convolutional Neural Network, Conditional Random Fields, and a Bi-directional Long Short-Term Memory network.
- Book Chapter
- 10.1007/978-981-10-8788-2_44
- Jan 1, 2018
Nowadays, with the huge volume of leaf data, the plant species recognition process becomes computationally expensive. Many computer scientists have suggested that parallel and distributed computing should be strongly considered for handling computationally intensive programs. The availability of high-performance multi-core architectures has made complex recognition systems popular in the parallel computing area. This paper emphasizes the computational flow design that enables the complex image processing tasks for Ficus deltoidea varietal recognition to be executed in a parallel computing environment. A multi-core computer is used, in which one core acts as the master processor and the remaining cores act as worker processors. The master processor is responsible for controlling the main system operations such as data partitioning, data allocation, and merging of the results from the worker processors. Experiments showed that a multi-core parallel environment is a very appropriate platform for pipelined image processing. The results show that the sequential complex image processing model and computational flow design are significantly improved when executed through the parallel model on a multi-core computer system. As the number of cores increases, the computation time taken by the parallel algorithm decreases.
- Research Article
- 10.1609/aaai.v39i20.35403
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
Among various branches of offline reinforcement learning (RL) methods, goal-conditioned supervised learning (GCSL) has gained increasing popularity as it formulates the offline RL problem as a sequential modeling task, thereby bypassing the notoriously difficult credit assignment challenge of value learning in the conventional RL paradigm. Sequential modeling, however, requires capturing accurate dynamics across long horizons in trajectory data to ensure reasonable policy performance. To meet this requirement, leveraging large, expressive models has become a popular choice in recent literature, which, however, comes at the cost of significantly increased computation and inference latency. Counterintuitive yet promising, we reveal that lightweight models, as simple as shallow 2-layer MLPs, can also enjoy accurate dynamics consistency and significantly reduced sequential modeling errors against large expressive models by adopting a simple recursive planning scheme: recursively planning coarse-grained future sub-goals based on current and target information, and then executing the action with a goal-conditioned policy learned from data relabeled with these sub-goal ground truths. We term our method Recursive Skip-Step Planning (RSP). Simple yet effective, RSP enjoys great efficiency improvements thanks to its lightweight structure, and substantially outperforms existing methods, reaching new SOTA performance on the D4RL benchmark, especially in multi-stage long-horizon tasks.
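The recursive planning scheme can be illustrated on a toy 1-D task; this is a hypothetical illustration of the idea, not the paper's implementation, with the learned sub-goal predictor replaced by a midpoint stand-in.

```python
# Toy recursive skip-step planning: recursively predict coarse mid-point
# sub-goals between state and target, then act toward each sub-goal in
# turn with a simple goal-conditioned policy.

def subgoal_model(state, goal):
    # stand-in for a learned sub-goal predictor: the midpoint
    return (state + goal) // 2

def plan_subgoals(state, goal, depth):
    """Recursively refine the state -> goal leg into a chain of sub-goals."""
    if depth == 0 or abs(goal - state) <= 1:
        return [goal]
    mid = subgoal_model(state, goal)
    return plan_subgoals(state, mid, depth - 1) + plan_subgoals(mid, goal, depth - 1)

def policy(state, subgoal):
    # goal-conditioned policy: one unit step toward the current sub-goal
    return 1 if subgoal > state else (-1 if subgoal < state else 0)

def rollout(state, goal, depth=3):
    for sg in plan_subgoals(state, goal, depth):
        while state != sg:
            state += policy(state, sg)
    return state
```

Each recursion level only has to predict a coarse waypoint over a shorter horizon, which is why a small model can stay dynamics-consistent where a flat long-horizon predictor would accumulate error.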
- Conference Article
17
- 10.1109/iccad51958.2021.9643435
- Nov 1, 2021
Modern designs are increasingly reliant on physical design (PD) tools to derive the full technology scaling benefits of Moore's Law. Designers often perform power, performance, and area (PPA) exploration through parallel PD runs with different tool configurations. Efficient exploration of PPA is mission-critical for chip designers who are working with stringent time-to-market constraints and finite compute resources. Therefore, a framework that can accurately predict a “doomed run” (i.e., one that will not meet the PPA targets) at early phases of the PD flow can provide a significant productivity boost by enabling early termination of such runs. Multiple QoR metrics can be leveraged to classify successful or doomed PD runs. In this paper, we specifically focus on the aspect of timing, where our goal is to identify the PD runs that cannot achieve end-of-flow timing results by predicting the post-route total negative slack (TNS) values in early PD phases. To achieve our goal, we develop an end-to-end machine learning (ML) framework that performs TNS prediction by modeling PD implementation as a sequential flow. In particular, our framework leverages graph neural networks (GNNs) to encode netlist graphs extracted from various PD phases, and utilizes long short-term memory (LSTM) networks to perform sequential modeling based on the GNN-encoded features. Experimental results on seven industrial designs with a 5:2 train/test split ratio demonstrate that our framework predicts post-route TNS values in high fidelity, within 5.2% normalized root mean squared error (NRMSE), in early design stages (e.g., placement, CTS) on the two validation designs that are unseen during training.
- Research Article
27
- 10.1177/002199839302701704
- Dec 1, 1993
- Journal of Composite Materials
Two well-established resin flow models for laminate processing by autoclave/vacuum degassing are compared. These are the sequential compaction and the squeezed sponge models, both based on viscous flow through porous media. A detailed examination of these models in this investigation shows that, contrary to what is commonly assumed, the fibre bed carries some of the applied pressure in the sequential compaction model. Equations for the implicit fibre bed compaction behaviour are derived, and using these equations, it is shown that the sequential compaction model is essentially a special case of the squeezed sponge model. The fibre bed pressure/compaction curve and the fibre bed permeability are identified as the two key relevant material properties. For one special case of these material properties, the squeezed sponge model predicts a sequential compaction sequence, with a roughly linear resin pressure profile, as assumed by the sequential compaction model. Using the fibre bed compaction and permeability as the primary parameters in the squeezed sponge flow model, it is shown that a range of laminate compaction behaviour, from sequential to uniform compaction, can be obtained. It is also shown that the permeability controls the compaction time and the fibre bed behaviour controls the shape of the laminate compaction response.
- Research Article
- 10.1088/1755-1315/300/4/042121
- Jul 1, 2019
- IOP Conference Series: Earth and Environmental Science
In the power flow calculation of AC/DC hybrid distribution networks, the loss and capacity of the converter station have a great influence on the result, and solving the AC and DC power flows alternately suffers from poor convergence. In this paper, a power flow model of an AC/DC hybrid distribution network with VSCs is established. The loss mechanism of the converter station is analyzed, and the capacity limit of the converter station is fully considered. A sequential power flow algorithm for the VSC-based AC/DC hybrid distribution network is proposed, with improved convergence performance. On an improved 33-node AC/DC hybrid distribution network with multiple types of DG, the correctness, effectiveness, and speed of the algorithm are verified.
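The back/forward sweep method that underlies this class of sequential distribution power flow can be sketched on a radial feeder; the values below are hypothetical per-unit quantities, and the sketch omits the VSC converter-station model the paper adds.

```python
# Backward/forward sweep power flow on a radial chain feeder:
# backward sweep accumulates branch currents from bus load currents;
# forward sweep updates bus voltages from the source outward; iterate.

def backward_forward_sweep(z, s_load, v_source=1.0 + 0j, iters=20):
    """z[k]: impedance of the branch feeding bus k+1; s_load[k]: complex load at bus k+1."""
    n = len(s_load)
    v = [v_source] * (n + 1)                     # bus 0 is the source/slack bus
    for _ in range(iters):
        # constant-power loads: I = (S / V)^*
        i_load = [(s_load[k] / v[k + 1]).conjugate() for k in range(n)]
        i_branch = [0j] * n
        for k in range(n - 1, -1, -1):           # backward: sum downstream currents
            downstream = i_branch[k + 1] if k + 1 < n else 0j
            i_branch[k] = i_load[k] + downstream
        for k in range(n):                       # forward: apply voltage drops
            v[k + 1] = v[k] - z[k] * i_branch[k]
    return v
```

Voltage magnitudes decrease monotonically along the feeder, and a sequential (time-series) power flow simply re-runs this solve for each load snapshot in the day-long profile.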
- Research Article
32
- 10.1016/j.geoderma.2021.115111
- May 11, 2021
- Geoderma
Investigation and prediction of water infiltration process in cracked soils based on a full-scale model test