Cost-Effective Optimization and Implementation of the CRT-Paillier Decryption Algorithm for Enhanced Performance
To address the information leak problem in cloud computing, privacy protection techniques are receiving widespread attention. Among them, the Paillier homomorphic encryption algorithm is an effective one, since it allows addition and scalar multiplication while the data remain encrypted. However, its computational efficiency is limited by complex modulo operations caused by the ciphertext expansion that follows encryption. To accelerate decryption, the Chinese Remainder Theorem (CRT) is often used to optimize these modulo operations, which in turn makes the decryption chain undesirably long. To address this issue, we propose an eCRT-Paillier decryption algorithm that shortens the decryption computation chain by combining precomputed parameters and eliminating the extra judgment operations introduced by Montgomery modular multiplications. These two improvements remove 50% of the modular multiplications and 60% of the judgment operations in the postprocessing of the CRT-Paillier decryption algorithm. Based on these improvements, we propose a highly parallel full-pipeline architecture to remove stalls caused by multiplier reuse in traditional modular exponentiation operations. This architecture also adopts further optimizations, such as simplifying modular exponentiation units by dividing the exponent into segments and parallelizing data flow by multi-core instantiation. Finally, a high-throughput and efficient Paillier accelerator named MESA is implemented on a Xilinx Virtex-7 FPGA for evaluation. The experimental results show that it completes a decryption within 0.577 ms under a 100 MHz clock when using a 2048-bit key. Compared with previous works under identical conditions, MESA achieves a 1.16× to 313.21× increase in throughput, as well as a 2.59% to 96.04% improvement in the Area Time Product (ATP).
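For orientation, the textbook CRT-based Paillier decryption that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration of the standard scheme, not the eCRT variant or the MESA hardware; the function names, the generator choice g = n + 1, and the toy primes are assumptions made for readability, and `pow(x, -1, m)` requires Python 3.8 or later.

```python
# Toy CRT-based Paillier decryption (textbook form, NOT the paper's eCRT variant).
# Tiny primes are used for readability only; real keys use primes of about 1024 bits.

def l_func(x, d):
    """Paillier's L function: (x - 1) / d, exact when x = 1 (mod d)."""
    return (x - 1) // d

def keygen(p, q):
    n = p * q
    g = n + 1  # common generator choice (an assumption of this sketch)
    # Per-prime constants, precomputed once per key.
    h_p = pow(l_func(pow(g, p - 1, p * p), p), -1, p)
    h_q = pow(l_func(pow(g, q - 1, q * q), q), -1, q)
    q_inv = pow(q, -1, p)  # used by the CRT recombination
    return (n, g), (p, q, h_p, h_q, q_inv)

def encrypt(pub, m, r):
    """Encrypt m with randomness r, where r must be coprime to n."""
    n, g = pub
    n2 = n * n
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt_crt(priv, c):
    p, q, h_p, h_q, q_inv = priv
    # Two half-size exponentiations replace one exponentiation modulo n^2.
    m_p = (l_func(pow(c, p - 1, p * p), p) * h_p) % p
    m_q = (l_func(pow(c, q - 1, q * q), q) * h_q) % q
    # Postprocessing: recombine the two residues (Garner's formula).
    return m_q + q * (((m_p - m_q) * q_inv) % p)

pub, priv = keygen(61, 53)
n2 = pub[0] ** 2
c1 = encrypt(pub, 42, r=17)
c2 = encrypt(pub, 100, r=23)
print(decrypt_crt(priv, c1))               # recovers the plaintext
print(decrypt_crt(priv, (c1 * c2) % n2))   # homomorphic addition
print(decrypt_crt(priv, pow(c1, 3, n2)))   # scalar multiplication by 3
```

The recombination step at the end of `decrypt_crt` corresponds to the postprocessing stage whose modular multiplications and judgment operations the abstract reports reducing.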
- Research Article
5
- 10.1142/s0218126619502293
- Feb 18, 2019
- Journal of Circuits, Systems and Computers
This paper presents an FPGA implementation of the most critical operations of Public Key Cryptography (PKC), namely the Modular Exponentiation (ME) and the Modular Multiplication (MM). Both operations are integrated as a Programmable System on Chip (PSoC), where the Xilinx MicroBlaze processor is used for flexibility. Our objective is to achieve the best trade-off between execution time, occupied area and flexibility. Implementing these operations in such an environment requires taking several criteria into account. The data bus of the Hardware (HW) architecture should be smaller than the input data length. The design must be scalable to support different security levels. The implementation should achieve optimal execution time and HW resource usage. In order to satisfy these constraints, the Montgomery Power Ladder (MPL) and Montgomery Modular Multiplication (MMM) algorithms are used as HW accelerators for the ME and the MM, respectively. Our implementation approach is based on the digit-serial method for performing the basic arithmetic operations. Efficient parallel and pipeline strategies are developed at the digit level to optimize the execution times. The application to a 1024-bit data length shows that the MMM runs in 6.24 µs and requires 647 slices. The ME is executed in 6.75 ms using 2881 slices.
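The Montgomery Power Ladder named in this abstract can be illustrated with a short software sketch. It shows only the ladder's control structure, with one multiplication and one squaring per exponent bit; the function name is hypothetical, and ordinary `%` reduction stands in for the Montgomery multiplier used in the hardware design.

```python
def montgomery_ladder_pow(base, exponent, modulus):
    """Compute base**exponent % modulus with the Montgomery power ladder.

    Invariant after each step: r1 == r0 * base (mod modulus).
    Every exponent bit costs exactly one multiplication and one squaring,
    which is what makes the ladder attractive for regular hardware schedules.
    """
    r0, r1 = 1 % modulus, base % modulus
    for i in reversed(range(exponent.bit_length())):  # scan bits MSB first
        if (exponent >> i) & 1:
            r0 = (r0 * r1) % modulus
            r1 = (r1 * r1) % modulus
        else:
            r1 = (r0 * r1) % modulus
            r0 = (r0 * r0) % modulus
    return r0

print(montgomery_ladder_pow(7, 65537, 3233) == pow(7, 65537, 3233))
```

Note that a Python `if` on the exponent bit is not constant-time; the sketch conveys the arithmetic pattern only.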
- Research Article
10
- 10.1002/sec.1511
- Jul 25, 2016
- Security and Communication Networks
It is widely recognized that public-key cryptosystems play a pivotal role in providing security services, not only in the electronic world but also for secure key exchange in military applications. Modular exponentiation, which relies on repeated modular multiplications, is the crucial but expensive operation in several public-key cryptosystems. The performance of public-key cryptography is therefore highly influenced by how efficiently modular exponentiation is implemented. To speed up the entire process, it is essential to develop modular exponentiation algorithms that reduce both the number of modular multiplications and the time required for each modular multiplication. In this paper, we present bit forwarding (BFW) techniques that reduce the count of modular multiplications in hardware implementations of modular exponentiation. The Montgomery multiplication method is customized to the needs of the BFW techniques and implemented with radix-2, named adaptable Montgomery multiplication, and with high radix, named adaptable high-radix Montgomery multiplication. It has been computed that the BFW1, BFW2, and BFW3 algorithms can reduce the number of modular multiplications by 11.16%, 15.14% and 18.20%, respectively, for a 1024-bit exponent in comparison with prevailing techniques. This reduction results in increased throughput and decreased power consumption. Copyright © 2016 John Wiley & Sons, Ltd.
- Research Article
7
- 10.1109/tc.2018.2811466
- Jan 1, 2018
- IEEE Transactions on Computers
Modular multiplication forms the basis of many cryptographic functions such as RSA, Diffie-Hellman key exchange, and ElGamal encryption. For large RSA moduli, combining the fast Fourier transform (FFT) with McLaughlin's Montgomery modular multiplication (MLM) has been validated to offer cost-effective implementation results. However, the conditional selections in McLaughlin's algorithm are considered to be inefficient and vulnerable to timing attacks, since extra long additions or subtractions may take place and the running time of MLM varies. In this work, we restrict the parameters of MLM by a set of new bounds and present a modified MLM algorithm involving no conditional selection. Compared to the original MLM algorithm, we avoid the extra operations caused by the conditional selections and achieve constant running time for modular multiplications with different inputs. As a result, we improve both area-time efficiency and security against timing attacks. Based on the proposed algorithm, efficient FFT-based modular multiplication and exponentiation are derived. Exponentiation architectures with dual FFT-based multipliers are designed, yielding area-latency efficient solutions. The results show that our work offers better efficiency than state-of-the-art works for operand sizes of 2048 bits and above. For single FFT-based modular multiplication, we have achieved constant running time and obtained area-latency efficiency improvements of up to 24.3 percent for 1,024-bit and 35.5 percent for 4,096-bit operands, respectively.
- Book Chapter
7
- 10.1007/978-3-319-99136-8_14
- Jan 1, 2018
Security protocols using public-key cryptography often require a large number of costly modular exponentiations (MEs). With the proliferation of resource-constrained (mobile) devices and advancements in cloud computing, delegation of such expensive computations to powerful server providers has gained a lot of attention. In this paper, we address the problem of verifiably secure delegation of MEs using two servers, at most one of which is assumed to be malicious (the OMTUP-model). We first show verifiability issues of two recent schemes: we show that a scheme from IndoCrypt 2016 does not offer full verifiability, and that a scheme for n simultaneous MEs from AsiaCCS 2016 is verifiable only with probability 0.5909, rather than the probability 0.9955 claimed by its authors, for \(n=10\). Then, we propose the first non-interactive fully verifiable secure delegation scheme by hiding the modulus via the Chinese Remainder Theorem (CRT). Our scheme also considerably improves the computational efficiency of the previous schemes. Hence, we provide a lightweight delegation that enables weak clients to securely and verifiably delegate MEs without any expensive local computation (neither online nor offline). The proposed scheme is highly useful for devices having (a) only ultra-lightweight memory, and (b) limited computational power (e.g. sensor nodes, RFID tags).
- Conference Article
3
- 10.2991/eeic-13.2013.11
- Jan 1, 2013
Modular multiplication is a basic operation in public key cryptosystems such as RSA and elliptic curve cryptography (ECC). There are many algorithms to speed up its calculation. Among them, the Montgomery algorithm is the most efficient method for avoiding expensive divisions. Recently, due to the increasing use of diverse embedded systems, variable-precision modular multiplication with scalable architectures has been gaining more and more attention. In this paper, we propose a new word-based implementation of Montgomery modular multiplication. A prediction policy is incorporated into a scalable architecture to reduce area cost and time latency. Compared with other scalable designs, our area-time product is the best, with little memory overhead. Keywords: Montgomery modular multiplication, word-based version of Montgomery algorithm, scalable architecture.
- Research Article
7
- 10.3390/electronics11223712
- Nov 13, 2022
- Electronics
The Montgomery modular multiplication is an integral operation unit in public key cryptographic systems. Previous work achieved good performance at low input widths by combining Redundant Binary Representation (RBR) with Montgomery modular multiplication, but it is difficult to strike a good balance between area and time as input bit widths increase. To solve this problem, we propose a flexible, pipelined hardware implementation of Montgomery modular multiplication based on the redundant Montgomery modular multiplication. The proposed structure guarantees a single-cycle delay between two-stage pipeline units and shortens the critical path by redistributing the data paths between the pipelines and preprocessing the input in the loop. Structural analysis and comparison with related work show that our structure achieves a lower area-time product while keeping area consumption small and controllable. The comprehensive results under different Taiwan Semiconductor Manufacturing Company (TSMC) processes demonstrate the advantages of our structure in terms of flexibility and area-time product.
- Conference Article
3
- 10.1109/cesys.2017.8321286
- Oct 1, 2017
Secure transmission of data over a network makes use of various cryptographic techniques. A good and efficient cryptosystem plays a crucial role in providing security services such as data integrity, confidentiality, and authenticity. The security of a cryptosystem depends on the computational difficulty of solving the mathematical problems underlying the cryptographic technique. Modular exponentiation, which makes use of repeated modular multiplication, is a tedious process and is the core operation in cryptosystems. The performance of a cryptosystem therefore depends on the performance of modular multiplication and exponentiation. Montgomery multiplication is considered a method for performing fast modular multiplication. In this paper, a comparative study of the Montgomery multiplier for various bit widths is carried out by implementing it on a Spartan 3E FPGA board, and its different parameters are analyzed by category.
- Book Chapter
1
- 10.1007/978-3-030-96305-7_50
- Jan 1, 2022
RSA public-key cryptography requires modular exponentiation and modular multiplication on large numbers. Montgomery Modular Multiplication is a fast method for performing modular multiplication. The modular exponentiation can be performed by repeatedly calling Montgomery Modular Multiplication. Transformations to the Montgomery Domain are required before the calculations, and a transformation back to the normal domain is also required to get the final result. The domain transformations require a value that is calculated by costly modular arithmetic. Many hardware RSA implementations use precomputed values for such domain transformations. As a result, the flexibility to use different public keys is lost. This paper introduces a Shift-Sub Modular Multiplication (SSMM) algorithm for calculating such values in fields. The algorithm does not require modular arithmetic and precomputed values. Instead, it uses shift and addition/subtraction calculations. The SSMM algorithm can also be used directly for RSA public-key cryptography. We give the hardware implementation of RSA public-key cryptography using SSMM in Verilog HDL and compare the cost and performance to that of the RSA public-key cryptography implementation using Montgomery Modular Multiplication. The results show that the performance of the two implementations is about the same, but the implementation using SSMM uses fewer hardware resources (55% to 59% adaptive logic modules and 69% to 85% flip-flops) because it does not require domain transformations.
Keywords: RSA public-key cryptography, Montgomery Modular Multiplication, Hardware security circuit, Clock frequency, Hardware resources
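The domain transformations described in this abstract can be made concrete with a small software sketch of Montgomery reduction (REDC). It is an illustrative model of the conventional Montgomery approach, not the SSMM algorithm; the helper names and the choice R = 2^k with k equal to the bit length of the modulus are assumptions. The value `r2` (R² mod n) is the precomputed constant the abstract refers to.

```python
def mont_setup(n):
    """Precompute Montgomery constants for an odd modulus n (Python 3.8+)."""
    if n % 2 == 0:
        raise ValueError("Montgomery reduction needs an odd modulus")
    k = n.bit_length()            # R = 2**k > n
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r  # n * n_prime = -1 (mod R)
    r2 = (r * r) % n                # costly value used for domain entry
    return k, n_prime, r2

def redc(t, n, k, n_prime):
    """Return t * R^-1 mod n for 0 <= t < n * R, using only shifts and masks."""
    mask = (1 << k) - 1
    m = ((t & mask) * n_prime) & mask
    u = (t + m * n) >> k          # exact division by R
    return u - n if u >= n else u  # conditional final subtraction

def mont_pow(base, exponent, n):
    k, n_prime, r2 = mont_setup(n)
    x_bar = redc((base % n) * r2, n, k, n_prime)  # into the Montgomery domain
    acc = redc(r2, n, k, n_prime)                 # 1 in Montgomery form (R mod n)
    for i in reversed(range(exponent.bit_length())):
        acc = redc(acc * acc, n, k, n_prime)
        if (exponent >> i) & 1:
            acc = redc(acc * x_bar, n, k, n_prime)
    return redc(acc, n, k, n_prime)               # back to the normal domain

print(mont_pow(65, 17, 3233) == pow(65, 17, 3233))
```

The conditional subtraction at the end of `redc` is the kind of data-dependent comparison that several of the listed works try to eliminate.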
- Research Article
5
- 10.58346/jowua.2023.i2.009
- Jun 30, 2023
- Journal of Wireless Mobile Networks, Ubiquitous Computing, and Dependable Applications
This research proposes a way to choose the best two equations for completing the RSA decryption process. Four strategies that have been suggested to accelerate this procedure are compared. The Chinese Remainder Theorem (CRT) is one of the four; the remaining three are improved algorithms adapted from CRT. CRT is the primary building block of these algorithms, but the sub-exponent of CRT is replaced with a new value. Assuming the modulus is the product of two prime numbers, two modular exponentiations must be performed before the results are combined. Three factors are used to determine the optimal equation: the numbers of modular multiplications, modular squarings, and modular inversions. The proposed method always wins, since it selects the optimal equation from among the four methods. The test results show that the proposed technique is consistently 10-30% faster than CRT.
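The baseline that this abstract compares against, plain RSA decryption with CRT, performs two half-size exponentiations and then combines the results. The sketch below shows that baseline only, not the improved sub-exponent variants; the function names and toy parameters are illustrative assumptions, and `pow(x, -1, m)` requires Python 3.8 or later.

```python
def rsa_crt_keygen(p, q, e):
    """Build a toy RSA key together with its CRT parameters."""
    n = p * q
    d = pow(e, -1, (p - 1) * (q - 1))  # private exponent
    crt_key = (p, q, d % (p - 1), d % (q - 1), pow(q, -1, p))
    return n, d, crt_key

def rsa_crt_decrypt(c, crt_key):
    p, q, d_p, d_q, q_inv = crt_key
    m_p = pow(c, d_p, p)   # first modular exponentiation, modulo p
    m_q = pow(c, d_q, q)   # second modular exponentiation, modulo q
    h = (q_inv * (m_p - m_q)) % p
    return m_q + h * q     # combine the two residues (Garner's formula)

n, d, crt_key = rsa_crt_keygen(61, 53, 17)
c = pow(65, 17, n)
print(rsa_crt_decrypt(c, crt_key) == pow(c, d, n))
```

Because `d_p` and `d_q` are about half the length of `d` and the moduli are half the length of `n`, each exponentiation is much cheaper than the single full-size one.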
- Book Chapter
5
- 10.1007/978-981-13-1328-8_56
- Nov 29, 2018
In public key cryptography such as RSA, modular exponentiation is the most time-consuming operation. RSA's modular exponentiation can be computed by repeated modular multiplication. Fast modular multiplication algorithms have been proposed to speed up decryption and encryption. The Montgomery algorithm, commonly used for modular multiplication, is limited by the carry propagation delay from the addition of long operands. In this paper, we propose a hardware structure that simplifies the operation of the Q logic in the Montgomery multiplier. The resulting design was applied to modular exponentiation for lightweight applications of RSA. Synthesis results showed that the new multiplier design achieved a reduced hardware area and, consequently, an area-efficient modular exponentiation design. A frequency of 452.49 MHz was achieved for modular exponentiation with 85 K gates using 130 nm technology.
- Conference Article
2
- 10.1109/fpl.2010.99
- Aug 1, 2010
Modular exponentiation with large modulus and exponent has been widely used in public key cryptosystems. Montgomery's modular multiplication algorithm is normally used since no trial division is necessary and the critical path is reduced by using carry-save addition (CSA). In this paper, the Montgomery multiplication is greatly optimized and architectures are proposed to perform the Least-Significant-Bit (LSB) first and the Most-Significant-Bit (MSB) first algorithms. The architecture presented here has the following distinctive characteristics: 1) Use of a digit-serial approach for Montgomery multiplication. 2) Conversion of the CSA representation of the intermediate multiplication result using carry-skip addition, which reduces the critical path with a small area-speed penalty. 3) Precomputation of the quotient value in each Montgomery iteration in order to increase the operating frequency. Implementation results on Xilinx Virtex 5 and Virtex 2 are reported. Experimental results show that the proposed modular exponentiation and modular multiplication design obtains the best delay performance compared with previously published works and outperforms them in terms of area-time complexity.
- Research Article
- 10.37231/myjcam.2020.3.1.37
- Jun 30, 2020
- Malaysian Journal of Computing and Applied Mathematics
This paper presents an FPGA implementation of the most critical operations of Public Key Cryptography (PKC), namely the Modular Exponentiation (ME) and the Modular Multiplication (MM). Both operations are integrated in Hardware (HW) as a Programmable System on Chip (PSoC). The Xilinx MicroBlaze processor is used for flexibility. Our objective is to achieve the best trade-off between execution time, occupied area and flexibility. In order to satisfy this constraint, the Montgomery Power Ladder and Montgomery Modular Multiplication (MMM) algorithms are used as HW accelerators for the ME and the MM, respectively. Our implementation approach is based on the digit-serial method for performing the basic arithmetic operations. Efficient parallel and pipeline strategies are developed at the digit level to optimize the execution time. The application to a 1024-bit data length shows that the MMM runs in 6.24 µs and requires 647 slices. The ME is executed in 6.75 ms, using 2881 slices.
- Research Article
- 10.6459/jcm.200903_6(1).0007
- Mar 1, 2009
The motivation for studying high-speed and space-efficient algorithms for modular exponentiation (ME) comes from applications in cryptography. In this paper, a new parallel Montgomery binary exponentiation algorithm is proposed. The algorithm is based on the Montgomery modular reduction technique, the binary method, the common-multiplicand-multiplication (CMM) algorithm, and the least-signed-digit (LSD) recoding technique. The CMM algorithm computes the part shared by two modular multiplications once rather than twice, which improves the efficiency of the binary exponentiation algorithm by decreasing the number of modular multiplications. Furthermore, with the proposed parallel CMM-LSD Montgomery binary exponentiation algorithm, the total number of single-precision multiplications can be reduced by about 66.7% and 30% compared with the original Montgomery algorithm and Ha-Moon's improved Montgomery algorithm, respectively.
- Conference Article
20
- 10.23919/fpl.2017.8056769
- Sep 1, 2017
Large number multiplication has always been an essential operation in cryptographic algorithms. In this paper, we propose Broken-Karatsuba multiplication, which applies the non-least-positive form to represent large numbers and exploits the parallelism hidden in conventional Karatsuba multiplication. Further, we modify the Montgomery modular multiplication algorithm with Broken-Karatsuba multiplication to make it suitable for pipeline implementation with fewer hardware resources. Based on this modified algorithm, a 256-bit two-stage modular multiplier is constructed. There is no stall in the pipeline when performing consecutive modular multiplications, and the delay of a modular multiplication is reduced significantly. Implemented on Virtex-6 FPGA platforms, our design outperforms most previous works in terms of modular multiplication latency and area-time product, which makes it suitable for server-side applications.
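As background for this abstract, conventional Karatsuba multiplication (the starting point that the Broken-Karatsuba variant modifies) can be sketched as below. This is the classic recursive form for non-negative integers, not the paper's algorithm; the function name and the 64-bit base-case threshold are arbitrary choices for illustration.

```python
def karatsuba(x, y, threshold_bits=64):
    """Multiply non-negative integers with conventional Karatsuba recursion."""
    # Small operands: fall back to the native multiplier.
    if x.bit_length() <= threshold_bits or y.bit_length() <= threshold_bits:
        return x * y
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask
    # Three half-size products instead of four; z2 and z0 are independent
    # of each other, which is the parallelism a hardware design can use.
    z2 = karatsuba(x_hi, y_hi, threshold_bits)
    z0 = karatsuba(x_lo, y_lo, threshold_bits)
    z1 = karatsuba(x_hi + x_lo, y_hi + y_lo, threshold_bits) - z2 - z0
    return (z2 << (2 * half)) + (z1 << half) + z0

a, b = 3 ** 400, 7 ** 300
print(karatsuba(a, b) == a * b)
```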
- Research Article
9
- 10.1016/j.procs.2020.04.087
- Jan 1, 2020
- Procedia Computer Science
A Low Latency Montgomery Modular Exponentiation