PointCore: An efficient framework for unsupervised point cloud anomaly detection using joint local-global features.
- Research Article
1
- 10.1007/s10617-010-9065-z
- Nov 30, 2010
- Design Automation for Embedded Systems
In High-Level Synthesis, Binary Synthesis is a method for synthesizing compiled applications for which the source code is not available. One of the advantages of FPGAs over processors is the availability of multiple internal and external memory banks. Binary synthesis tools use multiple memory banks if they are able to recover data-structures from the binary. In this work we improve the recovery of data-structures by introducing dynamic memory analysis and combining it with improved static memory analysis. We show that many applications can only be synthesized using dynamic memory analysis. We present two FPGA based architectures for implementing the bound-checking and recovery for the synthesized code. Our experiments show that the proposed technique accelerates the execution of applications which use multiple memory banks concurrently. We demonstrate that many binary applications indeed benefit from this technique.
- Conference Article
28
- 10.1145/1878961.1878989
- Oct 24, 2010
In high-level synthesis, pipelined designs are often restricted by the number of memory banks available to the synthesis system. Using multiple memory banks can improve the performance of accelerated applications. Currently, programmers must manually assign data structures to specific memory banks on the accelerator. This paper describes Automatic Memory Partitioning, a method for automatically partitioning data structures into multiple memory banks for increased parallelism and performance. We use source code instrumentation to collect memory traces in order to detect linear memory access patterns. The memory traces are used to split data structures into disjoint memory regions and determine which segments may benefit from parallel memory access. We present an ILP based algorithm for allocating memory segments into multiple memory banks. Experiments show significant improvements in performance while using a minimal number of memory banks.
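The trace-based detection of linear access patterns mentioned in this abstract can be illustrated with a minimal sketch. It assumes a trace is simply the ordered list of addresses touched by one memory reference, and that "linear" means a constant stride between consecutive accesses. The function name `detect_linear_pattern` and the sample traces are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence, Tuple


def detect_linear_pattern(addresses: Sequence[int]) -> Optional[Tuple[int, int]]:
    """Return (base, stride) if addresses[i] == base + i * stride for every i.

    Returns None when the trace is too short to infer a stride or when any
    access deviates from the constant-stride pattern.
    """
    if len(addresses) < 2:
        return None
    base = addresses[0]
    stride = addresses[1] - addresses[0]
    for i, addr in enumerate(addresses):
        if addr != base + i * stride:
            return None
    return base, stride


# Hypothetical traces: one walks an int array, the other jumps around.
trace_a = [0x1000, 0x1004, 0x1008, 0x100C]
trace_b = [0x2000, 0x2008, 0x2004]

pattern_a = detect_linear_pattern(trace_a)
pattern_b = detect_linear_pattern(trace_b)
```

References whose traces yield a (base, stride) pair are candidates for splitting into disjoint regions; how the paper actually performs the split and the ILP allocation is not shown here.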
- Conference Article
15
- 10.1109/icassp.2008.4517894
- Mar 1, 2008
A multiple-memory-bank design is employed in many high-performance DSP processors. This architectural feature supports higher memory bandwidth by allowing multiple data memory accesses to be executed in parallel. Dedicated address generation units (AGUs) are commonly present in DSPs to perform address arithmetic in parallel with the main datapath. Address assignment, the optimization of the memory layout of program variables to reduce address arithmetic instructions, has been studied extensively on single-memory architectures. Making effective use of AGUs on multiple memory banks is a great challenge for compiler design and has not been studied previously. In this paper, we exploit address assignment with variable partitioning for scheduling on DSP architectures with multiple memory banks and AGUs. Our approach is built on novel graph models which capture both parallelism and serialism demands. An efficient scheduling algorithm, Address Assignment Sensitive Variable Partitioning (AASVP), is proposed to best leverage both multiple memory banks and AGUs. Experimental results show significant improvement compared to existing methods.
- Conference Article
9
- 10.1109/fpl.2009.5272381
- Aug 1, 2009
High-level synthesis (HLS) is the field of transforming a high-level programming language, such as C, into a register transfer level (RTL) description of the design. In HLS, binary synthesis is a method for synthesizing existing compiled applications for which the source code is not available. One of the advantages of FPGAs over software is the availability of multiple memory banks. Until now, binary synthesis systems have not made use of the multiple memory banks on FPGAs. In our work, we decompile the binary executable into an intermediate representation, and we target architectures with multiple memory banks and multiple memory ports. We present methods for detecting memory regions and synthesis of the decompiled code. The proposed methods accelerate the execution time of applications which use multiple memory regions concurrently.
- Research Article
6
- 10.1145/2442116.2442118
- Mar 10, 2013
- ACM Transactions on Embedded Computing Systems
One of the main advantages of high-level synthesis (HLS) is the ability to synthesize circuits that can access multiple memory banks in parallel. Current HLS systems synthesize parallel memory references based on explicit array declarations in the source code. We consider the need to synthesize not only array references but also memory operations targeting pointers and dynamic data structures. This paper describes Automatic Memory Partitioning, a method for automatically synthesizing general data structures (arrays and pointers) into multiple memory banks for increased parallelism and performance. We use source code instrumentation to collect memory traces in order to detect linear memory access patterns. The memory traces are used to split data structures into disjoint memory regions and determine which segments may benefit from parallel memory access. We present an algorithm for allocating memory segments into multiple memory banks. Experiments show significant improvements in performance while conserving the number of memory banks.
- Book Chapter
11
- 10.1007/978-3-540-39920-9_25
- Jan 1, 2003
To improve the overall performance, many of the modern advanced digital signal processors (DSPs) are equipped with on-chip multiple data memory banks which can be accessed in parallel in one instruction. In order to effectively exploit this architectural feature, the compiler must partition program variables between the memory banks appropriately – two parallel memory accesses always must take place on different memory banks. There is some research work that addresses this issue, however, most of this has been proposed as a post-pass (machine dependent) optimization. We attempt to resolve this problem by applying an algorithm which operates on the high-level intermediate representation, independent of the target machine. The partitioning scheme is based on the concepts of the interference graph which is constructed utilizing the control flow, data flow, and alias information. Partitioning of the interference graph is modeled as a Max Cut problem. The variable partitioning algorithm has been designed as an optional optimization phase integrated in the C compiler for a digital signal processor. This paper describes our efforts. The experimental results demonstrate that our partitioning algorithm finds a fairly good assignment of variables to memory banks. For small kernels from the DSPstone benchmark suite the performance is improved from 10% to 20%, for FFT filters by about 10%.
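The idea of partitioning an interference graph as a Max Cut problem can be sketched as follows. This is a simple greedy heuristic for two banks, offered only as an illustration; the abstract does not say which Max Cut algorithm the authors use. The edge weights are assumed to count how often two variables could be accessed in parallel, and the names `greedy_max_cut` and `cut_weight` and the sample graph are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def greedy_max_cut(variables: List[str], weights: Dict[Edge, int]) -> Dict[str, int]:
    """Assign each variable to bank 0 or 1, greedily maximizing the cut weight."""
    bank: Dict[str, int] = {}
    for v in variables:
        # Total edge weight from v to already-placed neighbours in each bank.
        placed_weight = [0, 0]
        for (a, b), w in weights.items():
            other = b if a == v else a if b == v else None
            if other is not None and other in bank:
                placed_weight[bank[other]] += w
        # Put v opposite the heavier side so those edges cross the cut.
        bank[v] = 0 if placed_weight[1] >= placed_weight[0] else 1
    return bank


def cut_weight(bank: Dict[str, int], weights: Dict[Edge, int]) -> int:
    """Sum the weights of edges whose endpoints sit in different banks."""
    return sum(w for (a, b), w in weights.items() if bank[a] != bank[b])


# Hypothetical interference graph: weight = parallel-access opportunities.
example_weights = {("x", "y"): 3, ("y", "z"): 2, ("x", "z"): 1}
assignment = greedy_max_cut(["x", "y", "z"], example_weights)
total = cut_weight(assignment, example_weights)
```

On this small graph the heuristic separates y from both x and z, so the two heaviest edges cross the cut. A greedy pass gives no optimality guarantee on larger graphs.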
- Research Article
8
- 10.1109/tcad.2017.2648838
- Oct 1, 2017
- IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Parallelizing the memory accesses in a nested loop is a critical challenge to facilitate loop pipelining. An effective approach for high-level synthesis on field-programmable gate array is to map these accesses to multiple on-chip memory banks using a memory partitioning technique. In this paper, we propose an efficient memory partitioning algorithm with low overhead and low time complexity for parallel data access via data reuse. We find that for most applications in image and video processing, a large amount of data can be reused among different iterations of a loop nest. Motivated by this observation, we propose to cache reusable data using on-chip registers, organized as register chains. The nonreusable data are then separated into several memory banks by a memory partitioning algorithm. We revise the existing padding method to cover cases occurring frequently in our method wherein certain components of partition vector are zeros. Experimental results have demonstrated that compared with the state-of-the-art algorithms, the proposed method is efficient in terms of execution time, resource overhead, and power consumption across a wide range of access patterns extracted from applications in image and video processing. As for the testing patterns, the execution time is typically less than one millisecond. And the number of required memory banks is reduced by 59.7% on average, which leads to an average reduction of 78.2% in look-up tables, 65.5% in flip-flops, 37.1% in DSP48Es, and therefore 74.8% reduction in dynamic power consumption. Moreover, the storage overhead incurred by the proposed method is zero for most widely used access patterns in image filtering.
- Research Article
18
- 10.1145/966137.966140
- Jan 1, 2004
- ACM Transactions on Design Automation of Electronic Systems
Most vendors of digital signal processors (DSPs) support a Harvard architecture, which has two or more memory buses, one for program and one or more for data and allow the processor to access multiple words of data from memory in a single instruction cycle. Also, many existing fixed-point DSPs are known to have an irregular architecture with heterogeneous registers, which contains multiple register files that are distributed and dedicated to different sets of instructions. Although there have been several studies conducted to efficiently assign data to multimemory banks, most of them assumed processors with relatively simple, homogeneous general-purpose registers. Thus, several vendor-provided compilers for DSPs that we examined were unable to efficiently assign data to multiple data memory banks, thereby often failing to generate highly optimized code for their machines. As a consequence, programmers for these DSPs often manually assign program variables to memories so as to fully utilize multimemory banks in their code. This paper reports on our recent attempt to address this problem by presenting an algorithm that helps the compiler to efficiently assign data to multimemory banks. Our algorithm differs from previous work in that it assigns variables to memory banks in separate, decoupled code generation phases, instead of a single, tightly coupled phase. The experimental results have revealed that our decoupled algorithm greatly simplifies our code generation process; thus our compiler runs extremely fast, yet generates target code that is comparable in quality to the code generated by a coupled approach.
- Book Chapter
4
- 10.1007/978-3-540-71229-9_3
- Mar 26, 2007
This paper presents a compiler technique that reduces the energy consumption of the memory subsystem, for an off-chip partitioned memory architecture having multiple memory banks and various low-power operating modes for each of these banks. More specifically, we propose an efficient array allocation scheme to reduce the number of simultaneously active memory banks, so that the other memory banks that are inactive can be put to low power modes to reduce the energy. We model this problem as a graph partitioning problem, and use well known heuristics to solve the same. We also propose a simple Integer Linear Programming (ILP) formulation for the above problem. Our approach achieves, on an average, 20% energy reduction over the base scheme, and 8% to 10% energy reduction over previously suggested methods. Further, the results obtained using our heuristic are within 1% of optimal results obtained by using our ILP method.
- Research Article
1
- 10.1007/s00145-018-9301-4
- Aug 9, 2018
- Journal of Cryptology
Oblivious RAM (ORAM) is a cryptographic primitive that allows a trusted CPU to securely access untrusted memory, such that the access patterns reveal nothing about sensitive data. ORAM is known to have broad applications in secure processor design and secure multiparty computation for big data. Unfortunately, due to a logarithmic lower bound by Goldreich and Ostrovsky (J ACM 43(3):431–473, 1996), ORAM is bound to incur a moderate cost in practice. In particular, with the latest developments in ORAM constructions, we are quickly approaching this limit, and the room for performance improvement is small. In this paper, we consider new models of computation in which the cost of obliviousness can be fundamentally reduced in comparison with the standard ORAM model. We propose the oblivious network RAM model of computation, where a CPU communicates with multiple memory banks, such that the adversary observes only which bank the CPU is communicating with, but not the address offset within each memory bank. In other words, obliviousness within each bank comes for free—either because the architecture prevents a malicious party from observing the address accessed within a bank, or because another solution is used to obfuscate memory accesses within each bank—and hence we only need to obfuscate communication patterns between the CPU and the memory banks. We present new constructions for obliviously simulating general or parallel programs in the network RAM model. We describe applications of our new model in distributed storage applications with a network adversary.
- Book Chapter
10
- 10.1007/978-3-662-48797-6_15
- Jan 1, 2015
Oblivious RAM (ORAM) is a cryptographic primitive that allows a trusted CPU to securely access untrusted memory, such that the access patterns reveal nothing about sensitive data. ORAM is known to have broad applications in secure processor design and secure multi-party computation for big data. Unfortunately, due to a logarithmic lower bound by Goldreich and Ostrovsky (Journal of the ACM, ’96), ORAM is bound to incur a moderate cost in practice. In particular, with the latest developments in ORAM constructions, we are quickly approaching this limit, and the room for performance improvement is small. In this paper, we consider new models of computation in which the cost of obliviousness can be fundamentally reduced in comparison with the standard ORAM model. We propose the Oblivious Network RAM model of computation, where a CPU communicates with multiple memory banks, such that the adversary observes only which bank the CPU is communicating with, but not the address offset within each memory bank. In other words, obliviousness within each bank comes for free, either because the architecture prevents a malicious party from observing the address accessed within a bank, or because another solution is used to obfuscate memory accesses within each bank, and hence we only need to obfuscate communication patterns between the CPU and the memory banks. We present new constructions for obliviously simulating general or parallel programs in the Network RAM model. We describe applications of our new model in secure processor design and in distributed storage applications with a network adversary.
- Conference Article
55
- 10.1145/996566.996596
- Jun 7, 2004
Memory-related activity is one of the major sources of energy consumption in embedded systems. Many types of memories used in embedded systems allow multiple operating modes (e.g., active, standby, nap, power-down) to facilitate energy saving. Furthermore, it has been known that the potential energy saving increases when the embedded systems use multiple memory banks whose operating modes are controlled independently. In this paper, we propose a compiler-directed integrated approach to the problem of maximally utilizing the operating modes of multiple memory banks by solving three important tasks simultaneously: (1) assignment of variables to memory banks, (2) scheduling of memory access operations, and (3) determination of operating modes of banks. Specifically, for an instance of tasks 1 and 2, we formulate task 3 as a shortest path (SP) problem in a network and solve it optimally. We then develop an SP-based heuristic that solves tasks 2 and 3 efficiently in an integrated fashion. We then extend the proposed approach to address the limited register constraint in the processor. From experiments with a set of benchmark programs, we confirm that the proposed approach is able to reduce the energy consumption by 15.76% over that by the conventional greedy approach.
- Research Article
6
- 10.1109/tcsii.2016.2638472
- Sep 1, 2017
- IEEE Transactions on Circuits and Systems II: Express Briefs
This brief proposes a low-power low-density parity check convolutional code (LDPC-CC) decoder that is fully compatible with the IEEE 1901 standard. The proposed architecture merges multiple memory banks into one to make it consume much less power than the conventional architecture. Memory operations conducted by all the unit processors are synchronized in the proposed decoder to merge the memory and avoid any possible data hazard. The data hazard happens when a unit processor tries to read a log-likelihood ratio before a different processor updates it, degrading the error-correcting performance. Memory-access patterns appearing in a memory-based LDPC-CC decoder are formulated to determine the size of a sliding window adequate for decoding. Experimental results show that the decoding architecture employing the merged memory and the proper window size reduces the power consumption by up to 40% compared to the conventional architecture that employs multiple memory banks.
- Research Article
- 10.1049/ip-cdt:20050130
- Jan 1, 2006
- IEE Proceedings - Computers and Digital Techniques
Memory-related activity is one of the major sources of energy consumption in embedded systems. Many types of memories used in embedded systems allow multiple operating modes (e.g. active, standby, nap, power-down) to facilitate energy saving. Furthermore, it has been known that the potential energy saving increases when the embedded systems use multiple memory banks whose operating modes are controlled independently. The authors propose a compiler-directed integrated approach to the problem of maximally utilising the operating modes of multiple memory banks by solving three important tasks simultaneously: (1) assignment of variables to memory banks, (2) scheduling of memory access operations and (3) determination of operating modes of banks. Specifically, for an instance of tasks 1 and 2, the authors formulate task 3 as a shortest path (SP) problem in a network and solve it optimally. Then, an SP-based heuristic that solves tasks 2 and 3 efficiently in an integrated fashion is developed. The proposed approach is then extended to address the limited register constraint in the processor. From experiments with a set of benchmark programs, it is confirmed that the proposed approach is able to reduce the energy consumption by 15.76% over that by the conventional approach.
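The shortest-path view of operating-mode selection (task 3) can be illustrated for a single bank. The sketch assumes a layered network with one node per (time step, mode): each node costs the energy of holding that mode for one step, each edge costs the energy of switching modes, and a bank must be in the "active" mode whenever it is accessed. The function name `min_energy_modes`, the two-mode model and all energy figures are hypothetical; the abstract does not give the authors' network construction.

```python
from typing import Dict, List, Tuple


def min_energy_modes(
    accessed: List[bool],
    hold: Dict[str, float],
    switch: Dict[Tuple[str, str], float],
) -> Tuple[float, List[str]]:
    """Cheapest mode schedule for one bank, found as a shortest path in a layered DAG."""
    modes = list(hold)
    inf = float("inf")
    # best[m] = (energy, schedule) of the cheapest path ending in mode m.
    best = {m: (inf, []) for m in modes}
    for m in modes:
        if not accessed[0] or m == "active":
            best[m] = (hold[m], [m])
    for t in range(1, len(accessed)):
        nxt = {}
        for m in modes:
            if accessed[t] and m != "active":
                nxt[m] = (inf, [])  # the bank must be active when accessed
                continue
            # Relax every incoming edge; unlisted transitions cost nothing.
            cost, prev = min(
                (best[p][0] + switch.get((p, m), 0.0), p) for p in modes
            )
            nxt[m] = (cost + hold[m], best[prev][1] + [m])
        best = nxt
    return min(best.values(), key=lambda entry: entry[0])


# Hypothetical energy model: per-step holding cost and mode-switch cost.
hold_cost = {"active": 3.0, "standby": 1.0}
switch_cost = {("active", "standby"): 0.5, ("standby", "active"): 1.0}
energy, schedule = min_energy_modes(
    [True, False, False, False, True], hold_cost, switch_cost
)
```

With these figures, dropping to standby during the three idle steps costs 10.5 units against 15.0 for staying active throughout. Because the network is a layered DAG, a single forward pass suffices to find the shortest path.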
- Research Article
70
- 10.1145/335043.335047
- Apr 1, 2000
- ACM Transactions on Design Automation of Electronic Systems
We address the problem of code generation for DSP systems on a chip. In such systems, the amount of silicon devoted to program ROM is limited, so application software must be sufficiently dense. Additionally, the software must be written so as to meet various high-performance constraints, which may include hard real-time constraints. Unfortunately, current compiler technology is unable to generate high-quality code for DSPs, whose architectures are highly irregular. Thus, designers often resort to programming application software in assembly, a time-consuming task. In this paper, we focus on providing support for an architectural feature of DSPs that makes code generation difficult, namely multiple data memory banks. This feature increases memory bandwidth by permitting multiple data memory accesses to occur in parallel when the referenced variables belong to different data memory banks and the registers involved conform to a strict set of conditions. We present an algorithm that attempts to maximize the benefit of this architectural feature. While previous approaches have decoupled the phases of register allocation and memory bank assignment, thereby compromising code quality, our algorithm performs these two phases simultaneously. Experimental results demonstrate that our algorithm not only generates high-quality compiled code, but also improves the quality of completely-referenced code.