CADAS: Communication-Aware Dynamic Scheduler on CGRAs for Large-Volume and Real-Time Processing

Abstract

Modern data-intensive applications demand accelerators that can adapt to dynamic and high-throughput workloads. Coarse-Grained Reconfigurable Arrays (CGRAs) have emerged as promising candidates for such workloads due to their spatial architecture and run-time reconfigurability. However, ad-hoc hardware configurations and traditional static compilation techniques struggle to cope with run-time irregularity and control-flow dynamism. This paper first presents a systematic design space exploration (DSE) to identify optimized hardware configurations tailored to application-specific constraints, such as area budget, throughput requirement, and throughput efficiency. Then, it proposes a communication-aware dynamic scheduling approach built on a hardware/software co-design that combines preloading and scoreboard mechanisms to minimize reconfiguration overhead while maximizing interconnect bandwidth utilization. Evaluated on the optimized configurations and the respective spectrum sensing benchmarks, the proposed scheduling method achieves up to 1.6× performance improvement over a baseline and 1.3× over an adapted state-of-the-art (SOTA) dynamic scheduling strategy.

Similar Papers
  • Research Article
  • Citations: 3
  • 10.4028/www.scientific.net/amr.403-408.2420
Research on Network Control System Using Improved EDF Dynamic Scheduling Algorithm
  • Nov 1, 2011
  • Advanced Materials Research
  • Zai Ping Chen + 1 more

In this paper, classic static and dynamic scheduling strategies are analyzed first, and then the basis for judging the schedulability of the communication network is given. An improved dynamic EDF scheduling algorithm is proposed to improve the real-time performance of task scheduling. The strategy changes task priority according to the transmission error of tasks that exceed their deadline when the dynamic EDF scheduling strategy is applied. The TrueTime tool is used to build a CAN network control system simulation platform. The dynamic EDF scheduling algorithm and the improved scheduling algorithm are simulated respectively. The effectiveness of the improved scheduling algorithm is verified by the simulation results. Keywords: Network control system; Scheduling algorithm; TrueTime toolbox.
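As a point of reference for the entry above, the following is a minimal sketch of classic preemptive earliest-deadline-first (EDF) scheduling on a single shared resource, simulated in unit time slots. It illustrates only the textbook baseline, not the improved algorithm from the paper; the function name `edf_schedule` and the task tuple layout are assumptions made for this example.

```python
import heapq

def edf_schedule(tasks, horizon):
    """Simulate preemptive EDF on one resource in unit time slots.

    tasks: list of (name, release, exec_time, deadline), all integers
           (hypothetical layout chosen for this sketch).
    Returns (timeline, missed): timeline[t] is the task run in slot t
    (None if idle); missed lists tasks that finished after their deadline.
    Tasks still unfinished at the end of the horizon are not reported.
    """
    pending = sorted(tasks, key=lambda task: task[1])  # order by release time
    ready = []  # min-heap of (deadline, name, remaining)
    timeline, missed = [], []
    i = 0
    for t in range(horizon):
        # Admit every task released at or before the current slot.
        while i < len(pending) and pending[i][1] <= t:
            name, _, exec_time, deadline = pending[i]
            heapq.heappush(ready, (deadline, name, exec_time))
            i += 1
        if not ready:
            timeline.append(None)
            continue
        # Run the ready task with the earliest deadline for one slot.
        deadline, name, remaining = heapq.heappop(ready)
        timeline.append(name)
        remaining -= 1
        if remaining > 0:
            heapq.heappush(ready, (deadline, name, remaining))
        elif t + 1 > deadline:
            missed.append(name)
    return timeline, missed

timeline, missed = edf_schedule(
    [("A", 0, 2, 5), ("B", 1, 1, 2), ("C", 0, 2, 7)], horizon=6
)
print(timeline, missed)
```

In this usage example, task B preempts A at slot 1 because its deadline is earlier, which is the behavior a priority-adjusting variant would modify.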

  • Research Article
  • Citations: 4
  • 10.1145/3656176
HierCGRA: A Novel Framework for Large-scale CGRA with Hierarchical Modeling and Automated Design Space Exploration
  • May 10, 2024
  • ACM Transactions on Reconfigurable Technology and Systems
  • Sichao Chen + 9 more

Coarse-grained reconfigurable arrays (CGRAs) are promising design choices in computation-intensive domains, since they can strike a balance between energy efficiency and flexibility. A typical CGRA comprises processing elements (PEs) that can execute operations in applications and interconnections between them. Nevertheless, most CGRAs suffer from the ineffectiveness of supporting flexible architecture design and solving large-scale mapping problems. To address these challenges, we introduce HierCGRA, a novel framework that integrates hierarchical CGRA modeling, Chisel-based Verilog generation, LLVM-based data flow graph (DFG) generation, DFG mapping, and design space exploration (DSE). With the graph homomorphism (GH) mapping algorithm, HierCGRA achieves a faster mapping speed and higher PE utilization rate compared with the existing state-of-the-art CGRA frameworks. The proposed hierarchical mapping strategy achieves 41× speedup on average compared with the ILP mapping algorithm in CGRA-ME. Furthermore, the automated DSE based on Bayesian optimization achieves a significant performance improvement by the heterogeneity of PEs and interconnections. With these features, HierCGRA enables the agile development for large-scale CGRA and accelerates the process of finding a better CGRA architecture.

  • Research Article
  • Citations: 11
  • 10.1016/j.vehcom.2023.100628
A comparative analysis of the semi-persistent and dynamic scheduling schemes in NR-V2X mode 2
  • Jun 1, 2023
  • Vehicular Communications
  • Luca Lusvarghi + 4 more

Over the last years, the evolution of Vehicle-to-Everything (V2X) services from basic safety-related to enhanced V2X (eV2X) applications prompted the development of the 5G New Radio (NR)-V2X technology. Standardized by the Third Generation Partnership Project (3GPP) in Release 16, NR-V2X features a distributed resource allocation mode, known as Mode 2, that allows vehicles to autonomously select their transmission resources employing a Semi-Persistent Scheduling (SPS) or a Dynamic Scheduling (DS) scheme. The SPS approach relies on the periodic reservation of resources, whereas the DS scheme is a reservation-less solution that forces the selection of new transmission resources for every generated message. 3GPP standards do not indicate under which conditions each scheduling scheme should be used. In this context, this study analyzes and compares the performance of SPS and DS under different traffic types and Packet Delay Budget (PDB) requirements. Simulation results demonstrate that the SPS scheme represents the best solution for serving fixed size periodic traffic, whereas DS is more adequate for aperiodic traffic (of fixed or variable size). The study shows that the superiority of DS over SPS becomes more evident when tighter PDB requirements are considered, and that the performance of the DS scheme is independent of the PDB. It is also demonstrated that an adaptive scheduling strategy, which allows vehicles to select the scheduling scheme that best suits the type of generated traffic, is the best solution in mixed traffic scenarios where fixed size periodic traffic and variable size aperiodic traffic sources coexist.

  • Conference Article
  • Citations: 20
  • 10.1109/asap52443.2021.00029
OpenCGRA: Democratizing Coarse-Grained Reconfigurable Arrays
  • Jul 1, 2021
  • Cheng Tan + 9 more

Reconfigurable architectures are today experiencing a renewed interest for their ability to provide specialization without sacrificing the capability to adapt to disparate workloads. Coarse-grained reconfigurable arrays (CGRAs) provide higher flexibility than application-specific integrated circuits (ASICs) while offering increased hardware efficiency with respect to field-programmable gate arrays (FPGAs). This makes CGRAs a promising alternative to enable power-/area-efficient acceleration across different application domains. Unfortunately, specializing and implementing a CGRA for a specific application domain requires exploration of a large design space (e.g., applying appropriate loop transformations to each application, specializing the reconfigurable processing elements of the CGRA, refining the network topology, deciding the size of the data memory, etc.) and involves enormous software/hardware engineering effort (e.g., modeling, testing, and evaluating the CGRA, mapping operations onto the CGRA, etc.). In this paper, we discuss a hardware/software co-design framework to automatically specialize and implement optimal CGRA designs given a set of applications of interest.

  • Conference Article
  • Citations: 3
  • 10.1109/cscn.2017.8088635
Dynamic pilot scheduling scheme for 5G outdoor ultra-dense network
  • Sep 1, 2017
  • Estifanos Yohannes Menta + 2 more

5G is envisioned to provide a 1000-fold increase in area capacity compared to today's networks. A large part of this promise is going to be fulfilled by ultra-dense networks (UDNs). However, dense networks suffer from interference challenges. In this work, we investigated the performance of an interference-aware dynamic pilot scheduling scheme for UDNs using a 3D map-based channel model. Initial pilot scheduling has been carried out using a location-aware scheduling scheme, which results in a minimum tolerable level of channel state information (CSI) error. Following the initial schedule, dynamic rescheduling has been carried out by letting 20% of users move at constant speed towards their respective destinations. The impact of different design criteria such as pilot reuse distance, deployment planning, and the frequency of rescheduling (as signaling overhead) for various user densities and mobility levels is assessed. Moreover, the probability of successfully rescheduling users within the coherence time has been studied for the dynamic scheduling scheme. Results show that the dynamic pilot scheduling scheme has a higher probability of successfully rescheduling users in high-mobility environments. Furthermore, this work demonstrates that location-aware pilot scheduling with a received signal strength (RSS) based rescheduling scheme can be taken as a promising solution to the CSI uncertainty problem so that the full advantage of the enabling technologies of 5G can be utilized.

  • Research Article
  • Citations: 3
  • 10.1007/s11265-015-0974-8
A Dynamic Modulo Scheduling with Binary Translation: Loop optimization with software compatibility
  • Feb 17, 2015
  • Journal of Signal Processing Systems
  • Ricardo Ferreira + 5 more

In the past years, many works have demonstrated the applicability of Coarse-Grained Reconfigurable Array (CGRA) accelerators to optimize loops by using software pipelining approaches. They are proven to be effective in reducing the total execution time of multimedia and signal processing applications. However, the run-time reconfigurability of CGRAs is hampered by overheads introduced by the needed translation and mapping steps. In this work, we present a novel run-time translation technique for the modulo scheduling approach that can convert binary code on-the-fly to run on a CGRA. We propose a greedy approach, since modulo scheduling for CGRAs is an NP-complete problem. In addition to read-after-write dependencies, dynamic modulo scheduling faces new challenges, such as register insertion to solve recurrence dependences and to balance the pipelining paths. Our results demonstrate that the greedy run-time algorithm can reach a near-optimal ILP rate, better than an off-line compiler approach for a 16-issue VLIW processor. The proposed mechanism ensures software compatibility as it supports different source ISAs. As a proof of concept of scaling, a change in the memory bandwidth has been evaluated. This analysis demonstrates that when changing from one memory access per cycle to two memory accesses per cycle, the modulo scheduling algorithm is able to exploit this increase in memory bandwidth and enhance performance accordingly. Additionally, to measure area and performance, the proposed CGRA was prototyped on an FPGA. The area comparisons show that a crossbar CGRA (with 16 processing elements and including a 4-issue VLIW host processor) is only 1.11× bigger than a standalone 8-issue VLIW softcore processor.

  • Research Article
  • Citations: 4
  • 10.1145/3663675
SAT-Based Exact Modulo Scheduling Mapping for Resource-Constrained CGRAs
  • Jul 31, 2024
  • ACM Journal on Emerging Technologies in Computing Systems
  • Cristian Tirelli + 8 more

Coarse-Grain Reconfigurable Arrays (CGRAs) represent emerging low-power architectures designed to accelerate Compute-Intensive Loops (CILs). The effectiveness of CGRAs in providing acceleration relies on the quality of mapping: how efficiently the CIL is compiled onto the platform. State-of-the-Art (SoA) compilation techniques utilize modulo scheduling to minimize the Iteration Interval (II) and use graph algorithms like Max-Clique Enumeration to address mapping challenges. Our work approaches the mapping problem through a satisfiability (SAT) formulation. We introduce the Kernel Mobility Schedule (KMS), an ad hoc schedule used with the Data Flow Graph and CGRA architectural information to generate Boolean statements that, when satisfied, yield a valid mapping. Experimental results demonstrate SAT-MapIt outperforming SoA alternatives in almost 50% of explored benchmarks. Additionally, we evaluated the mapping results in a synthesizable CGRA design and emphasized the trends in runtime metrics, i.e., energy efficiency and latency, across different CILs and CGRA sizes. We show that a hardware-agnostic analysis performed on compiler-level metrics can optimally prune the architectural design space, while still retaining Pareto-optimal configurations. Moreover, by exploring how implementation details impact cost and performance on real hardware, we highlight the importance of holistic software-to-hardware mapping flows, such as the one presented herein.
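To make the Iteration Interval (II) mentioned above concrete, here is a small sketch of the standard lower bound that modulo schedulers typically start from: the maximum of a resource bound and a recurrence bound. It is not taken from the paper. The function names are hypothetical, and the sketch assumes a homogeneous array in which every PE can execute any operation and issues one operation per cycle.

```python
import math

def res_mii(num_ops, num_pes):
    """Resource-constrained bound: the loop body's operations must fit
    into num_pes issue slots per cycle (homogeneous PEs assumed)."""
    return math.ceil(num_ops / num_pes)

def rec_mii(recurrences):
    """Recurrence-constrained bound.

    recurrences: list of (total_latency, iteration_distance) pairs, one per
    dependence cycle in the data flow graph. A loop without loop-carried
    dependence cycles yields the trivial bound of 1.
    """
    return max((math.ceil(lat / dist) for lat, dist in recurrences), default=1)

def min_ii(num_ops, num_pes, recurrences):
    """Lower bound on the II; a mapper would try II = min_ii, min_ii + 1, ..."""
    return max(res_mii(num_ops, num_pes), rec_mii(recurrences))

# 20 operations on a 4x4 array, one recurrence of latency 3 spanning 1 iteration.
print(min_ii(20, 16, [(3, 1)]))
```

A mapper, whether heuristic or exact, would usually attempt a mapping at this bound first and increase the II only when no valid mapping exists.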

  • Research Article
  • Citations: 2
  • 10.1007/s11433-014-5610-2
Efficient and flexible memory architecture to alleviate data and context bandwidth bottlenecks of coarse-grained reconfigurable arrays
  • Oct 21, 2014
  • Science China Physics, Mechanics & Astronomy
  • Chen Yang + 3 more

The computational capability of a coarse-grained reconfigurable array (CGRA) can be significantly restrained due to data and context memory bandwidth bottlenecks. Traditionally, two methods have been used to resolve this problem. One method loads the context into the CGRA at run time. This method occupies very small on-chip memory but induces very large latency, which leads to low computational efficiency. The other method adopts a multi-context structure. This method loads the context into the on-chip context memory at the boot phase. Broadcasting the pointer of a set of contexts changes the hardware configuration on a cycle-by-cycle basis. The size of the context memory induces a large area overhead in multi-context structures, which results in major restrictions on application complexity. This paper proposes a Predictable Context Cache (PCC) architecture to address the above context issues by buffering the context inside a CGRA. In this architecture, context is dynamically transferred into the CGRA. Utilizing a PCC significantly reduces the on-chip context memory and the complexity of the applications running on the CGRA is no longer restricted by the size of the on-chip context memory. Data preloading is the most frequently used approach to hide input data latency and speed up the data transmission process for the data bandwidth issue. Rather than fundamentally reducing the amount of input data, the transferred data and computations are processed in parallel. However, the data preloading method cannot work efficiently because data transmission becomes the critical path as the reconfigurable array scale increases. This paper also presents a Hierarchical Data Memory (HDM) architecture as a solution to the efficiency problem. In this architecture, high internal bandwidth is provided to buffer both reused input data and intermediate data. 
The HDM architecture relieves the external memory from the data transfer burden so that the performance is significantly improved. As a result of using PCC and HDM, experiments running mainstream video decoding programs achieved performance improvements of 13.57%–19.48% when there was a reasonable memory size. Therefore, 1080p@35.7fps for H.264 high profile video decoding can be achieved on PCC and HDM architecture when utilizing a 200 MHz working frequency. Further, the size of the on-chip context memory no longer restricted complex applications, which were efficiently executed on the PCC and HDM architecture.

  • Conference Article
  • Citations: 4
  • 10.2514/6.2011-6364
Dynamic Real-time Scheduling of Terminal Traffic
  • Jun 14, 2011
  • Heming Chen + 2 more

This paper presents dynamic strategies for the integrated scheduling and runway assignment of both arrival and departure traffic over an airport. While a static scheduling scheme handles traffic over a specified planning horizon simultaneously, sequential dynamic schedulers divide the planning horizon into a series of smaller scheduling windows and apply a static scheduling scheme sequentially over each window. Dynamic scheduling strategies are desirable for obtaining real-time solutions of continual traffic streams and for taking advantage of updated traffic information. In this paper, a multiple-point scheduling framework is used in which scheduling locations include runway thresholds as well as fixes over the terminal airspace and gates on the airport surface. Integrated static scheduling of both arrival and departure traffic is formulated as a mixed-integer linear programming (MILP) problem. Solution variables include scheduled times of arrival (STA) at the multiple scheduling locations and aircraft sequences at merge points. In addition, aircraft route assignments for both ground and airborne traffic are included as discrete solution variables, from which optimal runway assignments can be determined. Then, different dynamic strategies with either overlapping or non-overlapping scheduling windows are developed and compared. Induced constraints for ensuring sufficient separations among traffic in neighboring windows are discussed. Real traffic data from the JFK airport is used in extensive numerical solutions to evaluate the computational speeds and scheduling performances of different dynamic strategies.
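The window-by-window idea described above can be sketched in a few lines. This is an illustration only: the per-window static scheduler here is a simple first-come-first-served rule on a single runway, standing in for the MILP formulation of the paper, and the function names, the flight tuple layout, and the non-overlapping window choice are all assumptions. The `not_before` argument plays the role of an induced constraint that carries the required separation across neighboring windows.

```python
def static_schedule(flights, separation, not_before=0.0):
    """First-come-first-served stand-in for a static scheduler.

    flights: list of (name, earliest_time) for one runway (hypothetical layout).
    Returns (schedule, next_free), where next_free is the earliest time the
    runway can be used by traffic from the following window.
    """
    schedule = []
    next_free = not_before
    for name, earliest in sorted(flights, key=lambda f: f[1]):
        t = max(earliest, next_free)
        schedule.append((name, t))
        next_free = t + separation
    return schedule, next_free

def windowed_schedule(flights, separation, window):
    """Sequential dynamic scheduling over non-overlapping windows."""
    buckets = {}
    for flight in flights:
        # Group flights by the window containing their earliest time.
        buckets.setdefault(int(flight[1] // window), []).append(flight)
    result, next_free = [], 0.0
    for k in sorted(buckets):
        # Separation from the previous window is enforced via next_free.
        sched, next_free = static_schedule(buckets[k], separation, next_free)
        result.extend(sched)
    return result

print(windowed_schedule(
    [("F1", 0), ("F2", 1), ("F3", 12), ("F4", 9)], separation=3, window=10
))
```

Each window is solved with only the traffic known in that window, which is what allows updated information to be used as it arrives.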

  • Dissertation
  • Citations: 2
  • 10.31390/gradschool_theses.1901
Dynamic Scheduling, Allocation, and Compaction Scheme for Real-Time Tasks on FPGAs
  • Jan 1, 2001
  • Shobharani Tatineni

Run-time reconfiguration (RTR) is a method of computing on reconfigurable logic, typically FPGAs, changing hardware configurations from phase to phase of a computation at run-time. Recent research has expanded from a focus on a single application at a time to encompass a view of the reconfigurable logic as a resource shared among multiple applications or users. In real-time system design, task deadlines play an important role. Real-time multi-tasking systems not only need to support sharing of the resources in space, but also need to guarantee execution of the tasks. At the operating system level, sharing logic gates, wires, and I/O pins among multiple tasks needs to be managed. From the high level standpoint, access to the resources needs to be scheduled according to task deadlines. This thesis describes a task allocator for scheduling, placing, and compacting tasks on a shared FPGA under real-time constraints. Our consideration of task deadlines is novel in the setting of handling multiple simultaneous tasks in RTR. Software simulations have been conducted to evaluate the performance of the proposed scheme. The results indicate significant improvement by decreasing the number of tasks rejected.

  • Conference Article
  • Citations: 5
  • 10.23919/date51398.2021.9473971
Reducing Memory Access Conflicts with Loop Transformation and Data Reuse on Coarse-grained Reconfigurable Architecture
  • Feb 1, 2021
  • Yuge Chen + 5 more

Coarse-Grained Reconfigurable Arrays (CGRAs) are promising accelerators thanks to their low power consumption and high energy efficiency. In recent years, many research works have focused on improving the programmability of CGRAs by enabling fast reconfiguration during execution. The performance of these CGRAs critically hinges upon the scheduling power of the compiler. One of the critical challenges is to reduce memory access conflicts using static compilation techniques. Memory access conflicts bring synchronization overhead, which causes pipeline stalls and reduces CGRA performance. Existing compilers usually tackle this challenge by orchestrating the data placement of the on-chip global memory (OGM) in the CGRA so that parallel memory accesses avoid bank conflicts. However, we find that bank conflicts are not the only cause of memory access conflicts. In some CGRAs, the bandwidth of the data network between the OGM and the processing element array (PEA) is also limited due to the low-power design principle. Unbalanced network bandwidth load is another cause of memory access conflicts. Furthermore, redundant data access across iterations is one of the primary causes of memory access conflicts. Based on these observations, we provide a comprehensive and generalized compilation flow to reduce memory conflicts. Firstly, we develop a loop transformation model to maximize the inter-iteration data reuse of the loops and thereby reduce memory access operations under the software pipelining scheme. Secondly, we enhance the bandwidth utilization of the network between the OGM and the PEA and avoid bank conflicts by providing a conflict-aware spatial mapping algorithm which can be easily integrated into existing CGRA modulo scheduling compilation flows. Experimental results show our method is capable of improving performance by an average of 44% compared with a state-of-the-art CGRA compilation flow.

  • Research Article
  • Citations: 5
  • 10.1109/tce.2004.1362507
A dynamic scheduling algorithm for video-on-demand servers
  • Nov 1, 2004
  • IEEE Transactions on Consumer Electronics
  • Kyung Oh Lee + 2 more

An innovative dynamic scheduling scheme is proposed to improve the efficiency of video-on-demand servers. We first introduce a paged segment striping model that makes dynamic scheduling possible. Based on this striping scheme, we propose a dynamic scheduling scheme that adapts to frequently changing workloads. In particular, we can change the round length without any additional disk access so that it can be adapted to changing request trends with a negligible cost in performance. This dynamic scheduling scheme always shows better performance than the static scheduling scheme in simulation. Although the dynamical scheme introduces additional scheduling overhead, it is very small when compared with the performance degradation in the static scheme.

  • Book Chapter
  • 10.1007/978-3-540-30541-5_73
Paged Segment Striping Scheme for the Dynamic Scheduling of Continuous Media Servers
  • Jan 1, 2004
  • Kyungoh Lee + 2 more

An innovative dynamic scheduling scheme is proposed to improve the efficiency of video-on-demand servers. We first introduce a paged segment striping model that makes dynamic scheduling possible. Based on this striping scheme, we propose a dynamic scheduling scheme that adapts to frequently changing workloads. In particular, we can change the round length without any additional disk access so that it can be adapted to changing request trends with a negligible cost in performance. This dynamic scheduling scheme always shows better performance than the static scheduling scheme in simulation. Although the dynamical scheme introduces additional scheduling overhead, it is very small when compared with the performance degradation in the static scheme.

  • Research Article
  • Citations: 6
  • 10.1145/3447970
MC-DeF
  • Apr 14, 2021
  • ACM Transactions on Architecture and Code Optimization
  • George Charitopoulos + 2 more

Executing complex scientific applications on Coarse-Grain Reconfigurable Arrays (CGRAs) promises improvements in execution time and/or energy consumption compared to optimized software implementations or even fully customized hardware solutions. Typical CGRA architectures consist of multiple instances of the same compute module, which is built from simple and general hardware units such as ALUs and simple processors. However, generality in the cell contents, while convenient for serving a wide variety of applications, penalizes performance and energy efficiency. To that end, a few proposed CGRAs use custom logic tailored to a particular application’s specific characteristics in the compute module. This approach, while much more efficient, restricts the versatility of the array. To date, versatility at hardware speeds is only supported with field-programmable gate arrays (FPGAs), which are reconfigurable at a very fine grain. This work proposes MC-DeF, a novel Mixed-CGRA Definition Framework targeting a Mixed-CGRA architecture that leverages the advantages of CGRAs by utilizing a customized cell array, and those of FPGAs by incorporating a separate LUT array used for adaptability. The framework presented aims to develop a complete CGRA architecture. First, a cell structure and functionality definition phase creates highly customized application/domain-specific CGRA cells. Then, mapping and routing phases define the CGRA connectivity and cell-LUT array transactions. Finally, an energy and area estimation phase presents the user with area occupancy and energy consumption estimations of the final design. MC-DeF uses novel algorithms and cost functions driven by user-defined metrics, threshold values, and area/energy restrictions. The benefits of our framework, besides creating fast and efficient CGRA designs, include design space exploration capabilities offered to the user. 
The validity of the presented framework is demonstrated by evaluating and creating CGRA designs of nine applications. Additionally, we provide comparisons of MC-DeF with state-of-the-art related works, and show that MC-DeF offers competitive performance (in terms of internal bandwidth and processing throughput) even compared against much larger designs, and requires fewer physical resources to achieve this level of performance. Finally, MC-DeF is able to better utilize the underlying FPGA fabric and achieves the best efficiency (measured in LUT/GOPs).

  • Book Chapter
  • Citations: 2
  • 10.1007/978-1-4471-4796-1_6
Research on Dynamic Job Shop Scheduling
  • Dec 6, 2012
  • Jingmin Zhang + 1 more

This paper comprehensively summarizes the development of the dynamic job shop scheduling problem. It discusses the concept of dynamic job shop scheduling, dynamic events, evaluation indicators, dynamic scheduling strategies, and dynamic scheduling methods. The research methods are divided into two classes: precise methods and approximate methods. The characteristics of each method are analyzed. Finally, problems that need further investigation and possible research directions are pointed out.
