High-Performance Modular Multiplication on the Cell Processor

The Cell Broadband Engine architecture is a revolutionary processor architecture well suited for many scientific codes. This paper reports on an effort to implement several traditional high-performance scientific computing applications on the Cell Broadband Engine processor, including molecular dynamics, quantum chromodynamics and quantum chemistry codes. The paper discusses data and code restructuring strategies necessary to adapt the applications to the intrinsic properties of the Cell processor and demonstrates performance improvements achieved on the Cell architecture. It concludes with the lessons learned and provides practical recommendations on optimization techniques that are believed to be most appropriate.

High-performance, low-power architecture for scalable radix-2 Montgomery modular multiplication algorithm

Canadian Journal of Electrical and Computer Engineering ◽

10.1109/cjece.2009.5599422 ◽

2009 ◽

Vol 34 (4) ◽

pp. 152-157 ◽

Cited By ~ 9

Author(s):

Atef Ibrahim ◽

Fayez Gebali ◽

Hamed El-Simary ◽

Amin Nassar

Keyword(s):

Low Power ◽

High Performance ◽

Modular Multiplication ◽

Multiplication Algorithm ◽

Montgomery Modular Multiplication ◽

Power Architecture

High-performance scalable architecture for modular multiplication using a new digit-serial computation

Microelectronics Journal ◽

10.1016/j.mejo.2016.07.012 ◽

2016 ◽

Vol 55 ◽

pp. 169-178 ◽

Cited By ~ 5

Author(s):

Abdalhossein Rezai ◽

Parviz Keshavarzi

Keyword(s):

High Performance ◽

Modular Multiplication ◽

Scalable Architecture

High-Performance VLSI Architecture for SCS Based Montgomery Modular Multiplication

IOSR Journal of VLSI and Signal processing ◽

10.9790/4200-0605014852 ◽

2016 ◽

Vol 06 (05) ◽

pp. 48-52

Author(s):

B. Vaisalini ◽

M. Pradeep

Keyword(s):

High Performance ◽

Vlsi Architecture ◽

Modular Multiplication ◽

Montgomery Modular Multiplication

Parallelization and Performance Evaluation of an Edge Detection Algorithm on a Streaming Multi-Core Engine

Journal of Information Technology Research ◽

10.4018/jitr.2009062906 ◽

2009 ◽

Vol 2 (4) ◽

pp. 81-91 ◽

Cited By ~ 1

Author(s):

Hashir Karim Kidwai ◽

Fadi N. Sibai ◽

Tamer Rabie

Keyword(s):

Performance Evaluation ◽

Edge Detection ◽

High Performance ◽

Detection Algorithm ◽

Cell Processor ◽

Edge Detector ◽

Processing Application ◽

Image Processing Application ◽

Host Processor ◽

And Performance

Among multi-core processors, the STI Cell Broadband Engine (BE) stands out as a heterogeneous 9-core processor with a PowerPC host processor (PPE) and 8 synergistic processing elements (SPEs). The Cell BE architecture is designed to improve upon conventional processors in graphics and related areas by integrating 8 computation engines, each with multiple execution units and large register sets, to achieve high performance per unit area. In this paper, we discuss the parallelization, implementation and performance evaluation of an edge detection image processing application based on the Roberts edge detector on the Cell BE. We report the edge detection performance measured on a computer with one Cell processor and with varying numbers of SPEs enabled. These results are compared to the results obtained on the Cell's single PPE with all 8 SPEs disabled. The results indicate that edge detection performs 10 times faster on the Cell BE than on modern RISC processors.
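
For readers unfamiliar with the Roberts edge detector mentioned in this abstract, the following is a minimal pure-Python sketch of the Roberts cross operator, not the authors' Cell implementation. The function names `roberts_band` and `roberts_edges` and the `workers` parameter are hypothetical. The image is split into row bands only to illustrate that each band can be computed independently, which is the property a multi-core port would presumably exploit; here the bands run sequentially.

```python
import math


def roberts_band(image, y_start, y_end):
    """Roberts cross gradient magnitude for output rows y_start..y_end-1.

    Each output row y reads only input rows y and y + 1, so bands are
    independent of one another.
    """
    width = len(image[0])
    band = []
    for y in range(y_start, y_end):
        row = []
        for x in range(width - 1):
            # Differences along the two diagonals of the 2x2 neighbourhood.
            gx = image[y][x] - image[y + 1][x + 1]
            gy = image[y][x + 1] - image[y + 1][x]
            row.append(math.hypot(gx, gy))
        band.append(row)
    return band


def roberts_edges(image, workers=8):
    """Split the output rows into up to `workers` bands and process each.

    `workers` is an illustrative stand-in for the number of compute
    engines; this sketch processes the bands one after another.
    """
    rows = len(image) - 1
    step = max(1, -(-rows // workers))  # ceiling division
    out = []
    for y0 in range(0, rows, step):
        out.extend(roberts_band(image, y0, min(y0 + step, rows)))
    return out
```

The output is one row and one column smaller than the input because the operator works on 2x2 neighbourhoods. The result does not depend on how many bands are used.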

Low-Cost High-Performance VLSI Architecture for Montgomery Modular Multiplication

IEEE Transactions on Very Large Scale Integration (VLSI) Systems ◽

10.1109/tvlsi.2015.2409113 ◽

2016 ◽

Vol 24 (2) ◽

pp. 434-443 ◽

Cited By ~ 20

Author(s):

Shiann-Rong Kuang ◽

Kun-Yi Wu ◽

Ren-Yao Lu

Keyword(s):

High Performance ◽

Low Cost ◽

Vlsi Architecture ◽

Modular Multiplication ◽

Montgomery Modular Multiplication

High-performance modular exponentiation algorithm by using a new modified modular multiplication algorithm and common-multiplicand-multiplication method

2011 World Congress on Internet Security (WorldCIS-2011) ◽

10.1109/worldcis17046.2011.5749849 ◽

2011 ◽

Cited By ~ 1

Author(s):

Abdalhossin Rezai ◽

Parviz Keshavarzi

Keyword(s):

High Performance ◽

Modular Exponentiation ◽

Modular Multiplication ◽

Multiplication Algorithm

A high performance FPGA implementation of 256-bit Modular multiplication processor over GF(p)

2016 2nd IEEE International Conference on Computer and Communications (ICCC) ◽

10.1109/compcomm.2016.7924847 ◽

2016 ◽

Cited By ~ 1

Author(s):

Xiuze Dong ◽

Xiaonan Zhang

Keyword(s):

High Performance ◽

Fpga Implementation ◽

Modular Multiplication

High Performance Montgomery Modular Multiplier with a New Recoding Method

Journal of Circuits, Systems and Computers ◽

10.1142/s0218126611007438 ◽

2011 ◽

Vol 20 (03) ◽

pp. 531-548 ◽

Cited By ~ 1

Author(s):

KOOROUSH MANOCHEHRI ◽

BABAK SADEGHIYAN ◽

SAADAT POURMOZAFARI

Keyword(s):

High Performance ◽

Hardware Implementation ◽

Computer Arithmetic ◽

Public Key Cryptography ◽

Modular Exponentiation ◽

Modular Multiplication ◽

Area Reduction ◽

Montgomery Modular Multiplication ◽

Modular Multiplier ◽

Reduction In Area

Modular calculations are widely used in many applications, especially in public key cryptography. Such operations are very time consuming because of their long operands. Many methods have been introduced to improve their performance. Montgomery modular multiplication is one such method; it speeds up both modular multiplication and modular exponentiation. The radix-2 version of this method is simple and fast in hardware, but it requires multi-operand adders. So far, the carry-save adder (CSA) has given the best performance for multi-operand addition. In this paper, we propose a new recoding method for the Montgomery modular multiplier to enhance its performance. This is done by replacing CSA blocks with new blocks that perform better than CSA in multi-operand addition. With this replacement, we can theoretically obtain a reduction of up to 40% in gate area. In our experiments, we obtained a 5.8% area reduction and a 3% speed improvement in a hardware implementation. The idea behind our proposed method is the use of a bitwise subtraction operator, which needs no carry propagation. This recoding of operands can also be used in other areas of computer arithmetic, algorithms and computational hardware, such as multiplication and exponentiation, to enhance their performance.
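
As background for the radix-2 Montgomery multiplication this abstract refers to, here is a minimal software sketch of the textbook bit-serial algorithm. It is not the recoding method or the CSA-based hardware proposed in the paper; the function names `montgomery_multiply_radix2` and `modmul` are hypothetical, and Python integers stand in for the long operands.

```python
def montgomery_multiply_radix2(a: int, b: int, n: int, k: int) -> int:
    """Return a * b * 2^(-k) mod n, assuming n is odd and 0 <= a, b < n < 2^k."""
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    s = 0
    for i in range(k):
        # Add b when bit i of a is set.
        if (a >> i) & 1:
            s += b
        # Make s even by adding n (n is odd), then halve exactly.
        if s & 1:
            s += n
        s >>= 1
    # s < 2n holds here, so one conditional subtraction suffices.
    if s >= n:
        s -= n
    return s


def modmul(a: int, b: int, n: int) -> int:
    """Compute (a * b) mod n for odd n through the Montgomery domain."""
    k = n.bit_length()            # R = 2^k > n
    r2 = pow(2, 2 * k, n)         # R^2 mod n, used to enter the domain
    a_m = montgomery_multiply_radix2(a % n, r2, n, k)   # a * R mod n
    b_m = montgomery_multiply_radix2(b % n, r2, n, k)   # b * R mod n
    p_m = montgomery_multiply_radix2(a_m, b_m, n, k)    # a * b * R mod n
    return montgomery_multiply_radix2(p_m, 1, n, k)     # leave the domain
```

Each loop iteration performs the multi-operand addition the abstract describes (the running sum, an optional `b`, and an optional `n`) followed by a one-bit shift; in hardware this addition is where carry-save adders are typically used. The conversions in `modmul` are only worthwhile when many multiplications share a modulus, as in modular exponentiation.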

Design of N-Term Scalable High-Performance Modular Multiplication Operator on GF(2^m)

2021 IEEE 4th International Conference on Electronics Technology (ICET) ◽

10.1109/icet51757.2021.9451142 ◽

2021 ◽

Author(s):

Benjun Zhang ◽

Ning Wu ◽

Fang Zhou ◽

Fen Ge ◽

Caixian Fei

Keyword(s):

High Performance ◽

Multiplication Operator ◽

Modular Multiplication
