Practical aspects of fast matrix multiplication
The aim of this paper is to analyze the development of algorithms for Fast Matrix Multiplication (FMM) in both historical and technical contexts, as well as to compare available solutions on consumer-grade computer hardware. We review advancements in estimating the theoretical computational complexity of FMM and optimization techniques that are used in widely adopted algorithms, with a particular focus on optimal cache memory usage and leveraging Graphics Processing Units (GPU). The methodology of tests and their analysis highlight the performance differences of the considered algorithms depending on the matrix size and the nature of the data stored in them. Results indicate the significant role of tailoring the chosen algorithm to the available hardware and the specific application in which the algorithm is being performed. Also, we emphasize that the FMM algorithms can be applied not only to linear algebra problems but also to current problems in science and engineering, such as artificial intelligence, databases, parallel computations, computational biology, pattern recognition, and compiler construction, to mention just a few examples.
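The abstract names cache-aware optimization without showing it. As a rough illustration only, the sketch below shows loop tiling (blocking), one common way to improve cache reuse in the conventional algorithm. The function name `matmul_blocked` and the default tile size of 32 are assumptions made for this example and are not taken from the paper.

```python
def matmul_blocked(a, b, block=32):
    """Multiply a (n x k) by b (k x m), both lists of lists, tile by tile.

    Hypothetical helper: the tile size `block` is an assumed tuning knob.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    # The three outer loops walk over tiles; the three inner loops stay
    # inside one tile, so the same small pieces of a, b and c are reused
    # many times before the computation moves on.
    for i0 in range(0, n, block):
        for p0 in range(0, k, block):
            for j0 in range(0, m, block):
                for i in range(i0, min(i0 + block, n)):
                    row_a, row_c = a[i], c[i]
                    for p in range(p0, min(p0 + block, k)):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += a_ip * row_b[j]
    return c
```

In pure Python the interpreter overhead hides any cache effect, so this only shows the loop structure; a compiled implementation would be needed to measure a benefit.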
- Research Article
- 10.54254/2755-2721/31/20230154
- Jan 31, 2024
- Applied and Computational Engineering
This research mainly focuses on fast matrix multiplication algorithms. Fast matrix multiplication is one of the most fundamental problems in computer science. The fast matrix multiplication algorithm differs from conventional matrix multiplication in that it offers a faster computational approach that can perform the operation in less than O(n^3) time complexity. This algorithm provides a more efficient method for multiplying matrices, significantly reducing the computational requirements. The Laser method, developed by Coppersmith and Winograd, is an algorithm for matrix multiplication that does not involve direct computation. It establishes a relationship between matrix multiplication and tensors and simplifies the operation by finding an intermediate tensor that is computationally manageable. This method applies a series of simplification operations to determine an upper bound on the computational complexity of matrix multiplication. However, as matrices become larger, the computational and memory requirements increase, posing challenges for practical implementation. This research will present the main ideas and performance of the Laser method and discuss the improvements made to the Laser method, including refined analysis and asymmetric hashing techniques. Additionally, it highlights the need for further exploration, such as parallel computing and optimization strategies, to enhance the efficiency of matrix multiplication algorithms. Furthermore, this research will also provide a prospectus for the future of matrix multiplication algorithms, such as the practical implementation of the Laser method.
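To make "less than cubic" concrete, here is a minimal sketch of Strassen's recursion, the classic sub-cubic algorithm. It is not the Laser method, which bounds the exponent rather than giving a practical procedure. The sketch assumes square matrices whose size is a power of two; the helper names and the `cutoff` parameter are hypothetical.

```python
def _add(x, y):
    return [[p + q for p, q in zip(r, s)] for r, s in zip(x, y)]

def _sub(x, y):
    return [[p - q for p, q in zip(r, s)] for r, s in zip(x, y)]

def _naive(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def _split(m):
    """Return the four quadrants (11, 12, 21, 22) of a square matrix."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])

def strassen(a, b, cutoff=1):
    """Multiply two n x n matrices, n a power of two (assumed, not checked)."""
    n = len(a)
    if n <= cutoff:
        return _naive(a, b)
    a11, a12, a21, a22 = _split(a)
    b11, b12, b21, b22 = _split(b)
    # Seven recursive products instead of the eight a block algorithm needs.
    m1 = strassen(_add(a11, a22), _add(b11, b22), cutoff)
    m2 = strassen(_add(a21, a22), b11, cutoff)
    m3 = strassen(a11, _sub(b12, b22), cutoff)
    m4 = strassen(a22, _sub(b21, b11), cutoff)
    m5 = strassen(_add(a11, a12), b22, cutoff)
    m6 = strassen(_sub(a21, a11), _add(b11, b12), cutoff)
    m7 = strassen(_sub(a12, a22), _add(b21, b22), cutoff)
    c11 = _add(_sub(_add(m1, m4), m5), m7)
    c12 = _add(m3, m5)
    c21 = _add(m2, m4)
    c22 = _add(_add(_sub(m1, m2), m3), m6)
    # Stitch the four quadrants back into one matrix.
    top = [l + r for l, r in zip(c11, c12)]
    bottom = [l + r for l, r in zip(c21, c22)]
    return top + bottom
```

With `cutoff=1` and n = 2^k, the recursion performs 7^k scalar multiplications, against 8^k for the conventional algorithm, which is where the exponent log2(7) ≈ 2.81 comes from.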
- Conference Article
29
- 10.1145/1145768.1145772
- Jul 9, 2006
The exponent of matrix multiplication is the smallest real number ω such that for all ε > 0, O(n^(ω+ε)) arithmetic operations suffice to multiply two n×n matrices. The standard algorithm for matrix multiplication shows that ω≤3. Strassen's remarkable result [5] shows that ω≤2.81, and a sequence of further works culminating in the work of Coppersmith and Winograd [4] have improved this upper bound to ω≤2.376 (see [1] for a full history). Most researchers believe that in fact ω=2, but there have been no further improvements in the known upper bounds for the past fifteen years. It is known that several central linear algebra problems (for example, computing determinants, solving systems of equations, inverting matrices, computing LUP decompositions) have the same exponent as matrix multiplication, which makes ω a fundamental number for understanding algorithmic linear algebra. In addition, there are non-algebraic algorithms whose complexity is expressed in terms of ω. In this talk I will describe a new group-theoretic approach, proposed in [3], to devising algorithms for fast matrix multiplication. The basic idea is to reduce matrix multiplication to group algebra multiplication with respect to a suitable non-abelian group. The group algebra multiplication is performed in the Fourier domain, and then using this scheme recursively yields upper bounds on ω. This general framework produces nontrivial matrix multiplication algorithms if one can construct finite groups with certain properties. In particular, a very natural embedding of matrix multiplication into C[G]-multiplication is possible when group G has three subgroups H1, H2, H3 that satisfy the triple product property. I'll define this property and describe a construction that satisfies the triple product property with parameters that are necessary (but not yet sufficient) to achieve ω=2. In the next part of the talk I'll describe demands on the representation theory of the groups in order for the overall approach to yield non-trivial bounds on ω, namely, that the character degrees must be small. Constructing families of groups together with subgroups satisfying the triple product property and for which the character degrees are sufficiently small has turned out to be quite challenging. In [2], we succeed in constructing groups meeting both requirements, resulting in non-trivial algorithms for matrix multiplication in this framework. I'll outline the basic construction, together with more sophisticated variants that achieve the bounds ω
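The abstract mentions the triple product property without stating it. In its usual formulation, subsets H1, H2, H3 of a group satisfy it when q1·q2·q3 = 1, with each qi a quotient h·h'^(-1) of two elements of Hi, forces q1 = q2 = q3 = 1. The sketch below checks this by brute force in a toy abelian group, a product of cyclic groups written additively, purely to illustrate the definition. The framework in the talk relies on non-abelian groups, and the function names here are hypothetical.

```python
from itertools import product

def quotient_set(subset, moduli):
    """All differences s - t for s, t in subset, componentwise mod moduli."""
    return {tuple((x - y) % m for x, y, m in zip(s, t, moduli))
            for s in subset for t in subset}

def satisfies_tpp(h1, h2, h3, moduli):
    """Brute-force triple product property check in Z_m1 x Z_m2 x ... (additive)."""
    zero = tuple(0 for _ in moduli)
    q1, q2, q3 = (quotient_set(h, moduli) for h in (h1, h2, h3))
    for a, b, c in product(q1, q2, q3):
        total = tuple((x + y + z) % m for x, y, z, m in zip(a, b, c, moduli))
        # A zero sum is only allowed when all three quotients are zero.
        if total == zero and not (a == b == c == zero):
            return False
    return True

# Toy example: the three coordinate subgroups of Z_2 x Z_3 x Z_2.
MODULI = (2, 3, 2)
H1 = {(i, 0, 0) for i in range(2)}
H2 = {(0, j, 0) for j in range(3)}
H3 = {(0, 0, k) for k in range(2)}
```

Here `satisfies_tpp(H1, H2, H3, MODULI)` holds, while reusing the same subgroup twice, as in `satisfies_tpp(H1, H1, H3, MODULI)`, violates the property.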
- Research Article
4
- 10.5897/ijcer10.016
- Jan 1, 2012
- International Journal of Computer Engineering Research
On distributed memory electronic computers, the implementation and association of fast parallel matrix multiplication algorithms has yielded astounding results and insights. In this discourse, we use the tools of molecular biology to demonstrate the theoretical encoding of Strassen's fast matrix multiplication algorithm with DNA based on an n-moduli set in the residue number system, thereby demonstrating the viability of computational mathematics with DNA. As a result, a general scalable implementation of this model in the DNA computing paradigm is presented and can be generalized to the application of all fast matrix multiplication algorithms on a DNA computer. We also discuss the practical capabilities and issues of this scalable implementation. Fast methods of matrix computations with DNA are important because they also allow for the efficient implementation of other algorithms (e.g. inversion, computing determinants, and graph theory) with DNA.
- Conference Article
34
- 10.1109/ipdps.2017.56
- May 1, 2017
Matrix multiplication (GEMM) is a core operation to numerous scientific applications. Traditional implementations of Strassen-like fast matrix multiplication (FMM) algorithms often do not perform well except for very large matrix sizes, due to the increased cost of memory movement, which is particularly noticeable for non-square matrices. Such implementations also require considerable workspace and modifications to the standard BLAS interface. We propose a code generator framework to automatically implement a large family of FMM algorithms suitable for multiplications of arbitrary matrix sizes and shapes. By representing FMM with a triple of matrices [U, V, W] that capture the linear combinations of submatrices that are formed, we can use the Kronecker product to define a multi-level representation of Strassen-like algorithms. Incorporating the matrix additions that must be performed for Strassen-like algorithms into the inherent packing and micro-kernel operations inside GEMM avoids extra workspace and reduces the cost of memory movement. Adopting the same loop structures as high-performance GEMM implementations allows parallelization of all FMM algorithms with simple but efficient data parallelism without the overhead of task parallelism. We present a simple performance model for general FMM algorithms and compare actual performance of 20+ FMM algorithms to modeled predictions. Our implementations demonstrate a performance benefit over conventional GEMM on single core and multi-core systems. This study shows that Strassen-like fast matrix multiplication can be incorporated into libraries for practical use.
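The [U, V, W] representation can be illustrated with one-level Strassen. The sketch below stores the three coefficient tables one row per product, which is a transposed layout chosen for readability and not necessarily the paper's convention. It forms a two-level algorithm with a Kronecker product. All names are hypothetical, and operands are flattened in block-recursive order (quadrants 11, 12, 21, 22, then the same order inside each quadrant).

```python
# One row per Strassen product M1..M7.
# U, V: coefficients of (a11, a12, a21, a22) and (b11, b12, b21, b22).
U = [(1, 0, 0, 1), (0, 0, 1, 1), (1, 0, 0, 0), (0, 0, 0, 1),
     (1, 1, 0, 0), (-1, 0, 1, 0), (0, 1, 0, -1)]
V = [(1, 0, 0, 1), (1, 0, 0, 0), (0, 1, 0, -1), (-1, 0, 1, 0),
     (0, 0, 0, 1), (1, 1, 0, 0), (0, 0, 1, 1)]
# W: contribution of each product to (c11, c12, c21, c22).
W = [(1, 0, 0, 1), (0, 0, 1, -1), (0, 1, 0, 1), (1, 0, 1, 0),
     (-1, 1, 0, 0), (0, 0, 0, 1), (1, 0, 0, 0)]

def kron(t1, t2):
    """Row-wise Kronecker product of two coefficient tables."""
    return [tuple(x * y for x in p for y in q) for p in t1 for q in t2]

def apply_bilinear(u, v, w, a_vec, b_vec):
    """Evaluate the bilinear algorithm described by (u, v, w)."""
    products = [sum(x * a for x, a in zip(ur, a_vec)) *
                sum(y * b for y, b in zip(vr, b_vec))
                for ur, vr in zip(u, v)]
    return [sum(wr[k] * p for wr, p in zip(w, products))
            for k in range(len(w[0]))]

def to_block_vector(m):
    """Flatten a 2^L x 2^L matrix in block-recursive quadrant order."""
    n = len(m)
    if n == 1:
        return [m[0][0]]
    h = n // 2
    out = []
    for r0, c0 in ((0, 0), (0, h), (h, 0), (h, h)):
        out += to_block_vector([row[c0:c0 + h] for row in m[r0:r0 + h]])
    return out

# Two-level Strassen for 4 x 4 matrices: 49 products, 16 coefficients each.
U2, V2, W2 = kron(U, U), kron(V, V), kron(W, W)
```

For 2 x 2 inputs, `apply_bilinear(U, V, W, ...)` uses 7 products; for 4 x 4 inputs, `apply_bilinear(U2, V2, W2, ...)` uses 49, against 64 for the conventional algorithm.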
- Conference Article
4
- 10.1109/snpd-sawn.2005.2
- May 23, 2005
Fast matrix multiplication (FMM) algorithms for multiplying two n × n matrices reduce the asymptotic operation count from the O(n^3) of the traditional algorithm to O(n^2.38); on distributed memory computers, combining FMM algorithms with parallel matrix multiplication algorithms therefore gives remarkable results. Within this combination, applying FMM algorithms at the inter-processor level poses harder design problems, but it yields the most effective algorithms. In this paper, a general model of these algorithms is presented, and we also introduce a scalable method for implementing this model on distributed memory computers.
- Research Article
- 10.54230/delib.2023.2.117
- Jan 1, 2023
- Deliberationes
It is crucial to investigate pattern recognition in the quickly changing field of artificial intelligence. It is becoming more and more important to comprehend the nuances of AI-supported pattern recognition as we move through a time of unparalleled technological growth. This study undertakes a thorough investigation of the topic, exploring the fundamental elements that underpin AI-supported pattern recognition, looking into its various application areas, and casting a glance ahead to show upcoming developments. The ubiquitous influence of pattern recognition across multiple areas highlights the significance of AI in the current technological environment. The ability of AI systems to recognize and understand patterns in data is revolutionary in a variety of fields. The complexities of pattern recognition are becoming a focus for scholars, practitioners, and technologists alike as we approach a new era where AI is incorporated into our daily lives. In my research, I put forward three hypotheses and look for possible answers to them. H1: Artificial intelligence-supported pattern recognition will keep developing, allowing machines to mimic and even exceed human abilities in seeing and processing complicated data. H2: Artificial intelligence will increasingly integrate with and improve a variety of industries as it develops. H3: The European Union has committed significant financial resources to support the development of AI in recognition of the technology's strategic importance. The main objective of this research is to get a deeper understanding of AI-supported pattern recognition by dissecting its complex components and illuminating its significant consequences for both the present and the future.
- Conference Instance
10
- 10.1145/3357254
- Aug 16, 2019
The theme of AIPR2019 reflects the vital role of artificial intelligence and pattern recognition in many evolving areas of basic and applied research. We sincerely hope it will be a helpful guide for scientists and researchers working or planning to work in the domains of artificial intelligence and pattern recognition.
- Conference Article
7
- 10.1145/3341105.3373852
- Mar 30, 2020
Recent advances in deep neural networks have enabled impressive performance in computer vision, natural language processing, and other fields, yet they remain computationally very intensive to train or use. We consider the use of Winograd's Algorithm for fast matrix multiplication in feedforward neural networks and we find that speedups of 10% to 30% are possible for fully connected layers in large networks.
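The abstract does not say which algorithm attributed to Winograd is meant. As one possibility, and only as an assumption, the sketch below shows Winograd's inner-product scheme, which roughly halves the number of scalar multiplications by precomputing one correction term per row of the first matrix and per column of the second. The function name is hypothetical, and the inner dimension is assumed to be even.

```python
def winograd_matmul(a, b):
    """Multiply a (n x k) by b (k x m) using Winograd's inner-product identity.

    Assumption for this sketch: k is even.
    """
    n, k, m = len(a), len(b), len(b[0])
    if k % 2:
        raise ValueError("inner dimension must be even in this sketch")
    half = k // 2
    # Correction terms, computed once per row of a and once per column of b.
    row_term = [sum(r[2 * j] * r[2 * j + 1] for j in range(half)) for r in a]
    col_term = [sum(b[2 * j][c] * b[2 * j + 1][c] for j in range(half))
                for c in range(m)]
    c = [[0] * m for _ in range(n)]
    for i in range(n):
        for col in range(m):
            # (x1 + y2)(x2 + y1) = x1*x2 + x1*y1 + x2*y2 + y1*y2, so removing
            # the two correction terms leaves the dot product x1*y1 + x2*y2.
            s = sum((a[i][2 * j] + b[2 * j + 1][col]) *
                    (a[i][2 * j + 1] + b[2 * j][col]) for j in range(half))
            c[i][col] = s - row_term[i] - col_term[col]
    return c
```

The main loop uses n·m·k/2 multiplications plus (n + m)·k/2 for the correction terms, compared with n·m·k for the conventional algorithm, at the cost of extra additions.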
- Conference Article
17
- 10.1117/12.2279088
- Jul 13, 2017
- Proceedings of SPIE, the International Society for Optical Engineering/Proceedings of SPIE
Two types of processors exist in the market: the conventional CPU and the Graphics Processing Unit (GPU). A typical CPU is composed of 1 to 8 cores, while a GPU has thousands of cores. The CPU is good for sequential processing, while the GPU is good at accelerating software with heavily parallel execution. The GPU was initially dedicated to 3D graphics. However, from 2006, when GPUs started to adopt general-purpose cores, it was noticed that this architecture can be used as a general-purpose massively parallel processor. NVIDIA developed a software framework, Compute Unified Device Architecture (CUDA), that makes it possible to easily program the GPU for these applications. With CUDA, GPUs started to be widely used in workstations and supercomputers. Recently, two key technologies have been highlighted in the industry: Artificial Intelligence (AI) and autonomous driving cars. AI requires massively parallel operations to train many-layered neural networks. With the CPU alone, it was impossible to finish the training in a practical time. The latest multi-GPU system with P100 makes it possible to finish the training in a few hours. For autonomous driving cars, TOPS-class performance is required to implement perception, localization, and path planning processing, and again an SoC with an integrated GPU will play a key role there. In this paper, the evolution of the GPU, which is one of the biggest commercial devices requiring state-of-the-art fabrication technology, will be introduced. An overview of key GPU-demanding applications like the ones described above will also be introduced.
- Front Matter
6
- 10.1155/2011/840181
- Jan 1, 2011
- International Journal of Biomedical Imaging
There is currently a rapidly growing interest in applying parallel computation in various medical imaging and image processing fields. This trend is expected to continue as more sophisticated and challenging medical imaging, image processing, and high-order data visualization problems are addressed. The ongoing drop in the cost of computational tools and their wide accessibility play a central role as well. Given its short history, this area is still not a well-defined scientific discipline. The selected topics and papers for this special issue shed more light on various aspects of this expanding field and its potential in accelerating medical imaging applications.
- Conference Article
5
- 10.1109/mysurucon55714.2022.9972737
- Oct 16, 2022
In recent years, the world of high-performance computing has been developing rapidly with enormous efforts in the integration of information technology and research. The emergence of CPU-GPU platform computing has made this possible in a very efficient manner. Nowadays, the graphics processing unit (GPU) delivers much better performance than the CPU, because the CPU has only a few cores, with lots of cache memory, that can handle a few software threads at a time. In contrast, a GPU is composed of hundreds of cores that can handle thousands of threads simultaneously. The CPU-GPU hybrid platform is becoming increasingly important in high-performance computing (HPC) domains such as deep learning and artificial intelligence because of its tremendous computing power. In this work, we propose a performance model to accelerate HPC applications on a hybrid CPU-GPU platform. We have tested and analyzed the proposed performance model using different HPC benchmark applications, such as merge sort and matrix multiplication, on different platforms: sequential, OpenMP, MPI in a single system, MPI in a cluster, and CUDA. We have observed that parallel computing in shared and distributed memory architectures gives better performance than sequential computing. After the analysis, we present the results as graphs for easier viewing. Index Terms: hybrid computing, parallel computing, sequential computing, CUDA, MPI, OpenMP, CPU, GPU.
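As a rough illustration of the shared-memory decomposition such benchmarks compare against sequential code, the sketch below splits the rows of the first matrix into blocks and hands each block to a worker. It is not the paper's benchmark code, and the function names and `workers` parameter are hypothetical. In CPython, threads running pure-Python arithmetic do not execute in parallel because of the global interpreter lock, so this shows only the partitioning and will not reproduce any speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def multiply_rows(rows, b):
    """Sequential product of a block of rows with the full matrix b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(r, c)) for c in cols] for r in rows]

def matmul_parallel(a, b, workers=4):
    """Row-partitioned product: each worker computes a contiguous row block."""
    chunk = max(1, -(-len(a) // workers))  # ceiling division
    blocks = [a[i:i + chunk] for i in range(0, len(a), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves block order, so the results can be concatenated.
        parts = list(pool.map(multiply_rows, blocks, [b] * len(blocks)))
    return [row for part in parts for row in part]
```

Because each output row depends only on one row of the first matrix and all of the second, the blocks need no synchronization beyond the final concatenation.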
- Research Article
- 10.62056/abhey76bm
- Apr 8, 2025
- IACR Communications in Cryptology
Plaintext-ciphertext matrix multiplication (PC-MM) is an indispensable tool in privacy-preserving computations such as secure machine learning and encrypted signal processing. While there are many established algorithms for plaintext-plaintext matrix multiplication, efficiently computing plaintext-ciphertext (and ciphertext-ciphertext) matrix multiplication is an active area of research which has received a lot of attention. Recent literature has explored various techniques for privacy-preserving matrix multiplication using fully homomorphic encryption (FHE) schemes with ciphertext packing and Single Instruction Multiple Data (SIMD) processing. On the other hand, there hasn't been any attempt to speed up PC-MM using unpacked additively homomorphic encryption (AHE) schemes beyond the schoolbook method and Strassen's algorithm for matrix multiplication. In this work, we propose an efficient PC-MM from unpacked AHE, which applies Cussen's compression-reconstruction algorithm for plaintext-plaintext matrix multiplication in the encrypted setting. We experimentally validate our proposed technique using a concrete instantiation with the additively homomorphic elliptic curve ElGamal encryption scheme and its software implementation on a Raspberry Pi 5 edge computing platform. Our proposed approach achieves up to an order of magnitude speedup compared to the state of the art for large matrices with relatively small element bit-widths. Extensive measurement results demonstrate that our fast PC-MM is an excellent candidate for efficient privacy-preserving computation even in resource-constrained environments.
- Conference Article
1
- 10.1117/12.824449
- May 1, 2009
- Proceedings of SPIE, the International Society for Optical Engineering/Proceedings of SPIE
High computing requirements for the synchronous impulse reconstruction (SIRE) radar algorithm present a challenge for near real-time processing, particularly the calculations involved in output image formation. Forming an image requires a large number of parallel and independent floating-point computations. To reduce the processing time and exploit the abundant parallelism of image processing, a graphics processing unit (GPU) architecture is considered for the imaging algorithm. Widely available off the shelf, high-end GPUs offer inexpensive technology that exhibits great capacity of computing power in one card. To address the parallel nature of graphics processing, the GPU architecture is designed for high computational throughput realized through multiple computing resources to target data parallel applications. Due to a leveled or in some cases reduced clock frequency in mainstream single and multi-core general-purpose central processing units (CPUs), GPU computing is becoming a competitive option for compute-intensive radar imaging algorithm prototyping. We describe the translation and implementation of the SIRE radar backprojection image formation algorithm on a GPU platform. The programming model for GPU's parallel computing and hardware-specific memory optimizations are discussed in the paper. A considerable level of speedup is available from the GPU implementation resulting in processing at real-time acquisition speeds.
- Research Article
9
- 10.1109/tcbb.2018.2814570
- Mar 12, 2018
- IEEE/ACM transactions on computational biology and bioinformatics
In computational biology, the hierarchy of biological systems requires the development of flexible and powerful computational tools. Graphics processing unit (GPU) architecture has been a suitable device for parallel computing in simulating multi-cellular systems. However, in modeling complex biological systems, scientists often face two tasks, mathematical formulation and skillful programming. In particular, specific programming skills are needed for GPU programming. Therefore, the development of an easy-to-use computational architecture, which utilizes GPU for parallel computing and provides intuitive interfaces for simple implementation, is needed so that general scientists can perform GPU simulations without knowing much about the GPU architecture. Here, we introduce ParaCells, a cell-centered GPU simulation architecture for NVIDIA compute unified device architecture (CUDA). ParaCells was designed as a versatile architecture that connects the user logic (in C++) with NVIDIA CUDA runtime and is specific to the modeling of multi-cellular systems. An advantage of ParaCells is its object-oriented model declaration, which allows it to be widely applied to many biological systems through the combination of basic biological concepts. We test ParaCells with two applications. Both applications are significantly faster when compared with sequential as well as parallel OpenMP and OpenACC implementations. Moreover, the simulation programs based on ParaCells are cleaner and more readable than other versions.
- Book Chapter
39
- 10.1016/s1076-5670(00)80020-8
- Jan 1, 2000
- Advances in Imaging and Electron Physics
Artificial intelligence and pattern recognition techniques in microscope image processing and analysis