HIGH PRECISION INTEGER MULTIPLICATION WITH A GPU USING STRASSEN'S ALGORITHM WITH MULTIPLE FFT SIZES

2011 ◽  
Vol 21 (03) ◽  
pp. 359-375 ◽  
Author(s):  
NIALL EMMART ◽  
CHARLES C. WEEMS

We have improved our prior implementation of Strassen's algorithm for high-performance multiplication of very large integers on a general-purpose graphics processor (GPU). A combination of algorithmic and implementation optimizations results in a speed improvement of up to a factor of 13.9 over our previous work, running on an NVIDIA 295. We have also reoptimized the implementation for an NVIDIA 480, obtaining a speedup of up to a factor of 19 in comparison with a Core i7 processor core of the same technology generation. To provide a fairer chip-to-chip comparison, we also measured total GPU throughput on a set of multiplications relative to all of the cores of a multicore chip running in parallel. We find that the GTX 480 provides six times the throughput of all four cores (eight threads) of the Core i7. This paper discusses how we adapted the algorithm to operate within the limitations of the GPU and how we dealt with other issues encountered in the implementation process, including details of the memory layout of our FFTs. Compared with our earlier work, which used Karatsuba's algorithm to guide multiplication of different operand sizes built on top of Strassen's algorithm applied to fixed-size segments of the operands, we are now able to apply Strassen's algorithm directly to operands ranging in size from 255K bits to 16,320K bits.
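
As a rough illustration of the FFT-based multiplication that underlies Strassen's large-integer algorithm, the following minimal Python sketch splits the operands into limbs, convolves them with floating-point FFTs, and reassembles the product. The limb size and the use of double-precision transforms are illustrative assumptions and do not reflect the paper's GPU implementation.

```python
# Minimal sketch of FFT-based integer multiplication (the principle behind
# Strassen's large-integer algorithm). Double-precision FFTs limit the operand
# sizes this toy handles correctly; a real implementation chooses limb size
# and rounding far more carefully.
import numpy as np

def fft_multiply(a: int, b: int, base: int = 1 << 16) -> int:
    def limbs(x):
        out = []
        while x:
            out.append(x % base)
            x //= base
        return out or [0]

    la, lb = limbs(a), limbs(b)
    n = 1
    while n < len(la) + len(lb):
        n *= 2  # pad to a power-of-two transform length

    # Forward transforms, pointwise product, inverse transform = convolution.
    coeffs = np.rint(np.fft.ifft(np.fft.fft(la, n) * np.fft.fft(lb, n)).real)

    # A production implementation would now run carry propagation over the
    # limbs; here we simply evaluate the convolution as a polynomial in `base`.
    return sum(int(c) * base**i for i, c in enumerate(coeffs))

assert fft_multiply(123456789, 987654321) == 123456789 * 987654321
```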

2014 ◽  
Vol 2014 ◽  
pp. 1-13 ◽  
Author(s):  
Mouna Baklouti ◽  
Mohamed Abid

To meet the high performance demands of embedded multimedia applications, embedded systems are integrating multiple processing units. However, most are still built with custom-logic design methodologies. Designing parallel multicore systems from available standard intellectual property (IP) cores while maintaining high performance is also a challenging issue. Softcore processors and field programmable gate arrays (FPGAs) are a cheap and fast option for developing and testing such systems. This paper describes an FPGA-based design methodology to implement a rapid prototype of parametric multicore systems. A study of the viability of building the SoC around Altera's NIOS II soft processor core is also presented. The NIOS II features a general-purpose RISC CPU architecture designed to address a wide range of applications. The performance of the implemented architecture is discussed, and several parallel applications are used to measure the speedup and efficiency of the system. Experimental results demonstrate the performance of the proposed multicore system, which achieves better speedup than the GPU (29.5% faster for the FIR filter and 23.6% faster for the matrix-matrix multiplication).
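
For reference, speedup and parallel efficiency as used in evaluations like this one are typically computed from the single-core and p-core execution times. The small helper below is a generic illustration, not code from the paper.

```python
def speedup_and_efficiency(t_serial: float, t_parallel: float, cores: int):
    """Classic parallel metrics: speedup = T1 / Tp, efficiency = speedup / p."""
    speedup = t_serial / t_parallel
    return speedup, speedup / cores

# e.g. an 8-core run that cuts a 4.0 s job to 0.8 s:
# speedup = 5.0x, efficiency = 62.5%
print(speedup_and_efficiency(4.0, 0.8, 8))
```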


1996 ◽  
Vol 06 (01) ◽  
pp. 3-12 ◽  
Author(s):  
BRIAN GRAYSON ◽  
ROBERT VAN DE GEIJN

In this paper, we give a practical high-performance parallel implementation of Strassen's algorithm for matrix multiplication. We show how, under restricted conditions, this algorithm can be implemented to be plug-compatible with standard parallel matrix multiplication algorithms. Results obtained on a large Intel Paragon system show a 10–20% reduction in execution time compared to what we believe to be the fastest standard parallel matrix multiplication implementation available at this time.
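
For readers unfamiliar with the recursion being parallelized, the sketch below shows the classical sequential form of Strassen's algorithm (seven recursive products per level) for square matrices whose dimension is a power of two. It is a generic illustration, not the Paragon implementation; the cutoff value is an arbitrary choice.

```python
import numpy as np

def strassen(A: np.ndarray, B: np.ndarray, cutoff: int = 64) -> np.ndarray:
    """Multiply square matrices whose dimension is a power of two."""
    n = A.shape[0]
    if n <= cutoff:                      # fall back to the standard product
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # Seven half-size products instead of eight.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)

    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)
```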


1995 ◽  
Vol 4 (4) ◽  
pp. 275-289 ◽  
Author(s):  
B. Kumar ◽  
C.-H. Huang ◽  
P. Sadayappan ◽  
R.W. Johnson

In this article, we present a program generation strategy for Strassen's matrix multiplication algorithm using a programming methodology based on tensor product formulas. In this methodology, block recursive programs such as the fast Fourier transform and Strassen's matrix multiplication algorithm are expressed as algebraic formulas involving tensor products and other matrix operations. Such formulas can be systematically translated into high-performance parallel/vector code for various architectures. In this article, we present a nonrecursive implementation of Strassen's algorithm for shared memory vector processors such as the Cray Y-MP. A previous implementation of Strassen's algorithm synthesized from tensor product formulas required working storage of size O(7^n) for multiplying 2^n × 2^n matrices. We present a modified formulation in which the working storage requirement is reduced to O(4^n). The modified formulation exhibits sufficient parallelism for efficient implementation on a shared memory multiprocessor. Performance results on a Cray Y-MP8/64 are presented.
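
The tensor product formalism rests on identities such as (A ⊗ B) vec(X) = vec(A X Bᵀ) (with row-major vectorization), which is what allows a block recursive formula to be rewritten as loops over ordinary matrix operations. The short NumPy check below illustrates the identity; it is not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 4))
B = rng.random((2, 5))
X = rng.random((4, 5))   # conformable with A on the left and B.T on the right

# (A kron B) vec(X) = vec(A X B^T), with vec() taken row-major as in ravel().
lhs = np.kron(A, B) @ X.ravel()
rhs = (A @ X @ B.T).ravel()
assert np.allclose(lhs, rhs)
```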


2014 ◽  
Vol 36 (4) ◽  
pp. 790-798
Author(s):  
Kai ZHANG ◽  
Shu-Ming CHEN ◽  
Yao-Hua WANG ◽  
Xi NING

2011 ◽  
Vol 28 (1) ◽  
pp. 1-14 ◽  
Author(s):  
W. van Straten ◽  
M. Bailes

dspsr is a high-performance, open-source, object-oriented, digital signal processing software library and application suite for use in radio pulsar astronomy. Written primarily in C++, the library implements an extensive range of modular algorithms that can optionally exploit both multiple-core processors and general-purpose graphics processing units. After over a decade of research and development, dspsr is now stable and in widespread use in the community. This paper presents a detailed description of its functionality, justification of major design decisions, analysis of phase-coherent dispersion removal algorithms, and demonstration of performance on some contemporary microprocessor architectures.
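
As a rough illustration of what phase-coherent dispersion removal involves, the sketch below applies the textbook approach: Fourier transform the complex baseband voltages, multiply by the inverse of the interstellar dispersion transfer function, and transform back. The dispersion constant and sign convention here are standard values from the pulsar literature, not details taken from dspsr itself.

```python
import numpy as np

K_DM = 4.148808e15  # dispersion constant in Hz^2 s per (pc cm^-3), textbook value

def coherent_dedisperse(voltage: np.ndarray, dm: float, f0_hz: float, bw_hz: float):
    """Remove interstellar dispersion from complex baseband samples.

    voltage : complex samples of a band of width bw_hz centred on f0_hz
    dm      : dispersion measure in pc cm^-3
    """
    n = voltage.size
    f = np.fft.fftfreq(n, d=1.0 / bw_hz)   # frequency offset from band centre, Hz
    # Phase of the interstellar transfer function H(f0 + f); dedispersion applies
    # its inverse (conjugate). The sign depends on the baseband conventions.
    phase = 2.0 * np.pi * K_DM * dm * f**2 / (f0_hz**2 * (f0_hz + f))
    return np.fft.ifft(np.fft.fft(voltage) * np.exp(-1j * phase))
```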


Author(s):  
Alexandra Hain ◽  
Arash E. Zaghi

Corrosion at steel beam ends is one of the most pressing challenges in the maintenance of aging bridges. To tackle this challenge, the Connecticut Department of Transportation (DOT) has partnered with the University of Connecticut to develop a repair method that benefits from the superior mechanical and durability characteristics of ultra-high performance concrete (UHPC) material. The repair involves welding shear studs to the intact portions of the web and encasing the beam end with UHPC. This provides an alternate load path for bearing forces that bypasses the corroded regions of the beam. The structural viability of the repair has been extensively proven through small- and full-scale experiments and comprehensive finite element simulations. Connecticut DOT implemented the repair for the first time in the field on a heavily trafficked four-span bridge in 2019. The UHPC beam end repair was chosen because of the access constraints and geometric complexities of the bridge that limited the viable repair options. Four of the repaired beam ends were fully instrumented to collect data on the performance of the repaired locations before casting, during curing, and for approximately 6 months following the application of the repair. This paper provides an overview of the successful repair implementation and presents the lessons learned during construction. Select data from the monitored beam ends are presented. It is expected that this information will provide engineers with a better understanding of the repair implementation process, and thus provide an additional repair option for states to enhance the safety of aging steel bridges.


Author(s):  
Sheng Kang ◽  
Guofeng Chen ◽  
Chun Wang ◽  
Ruiquan Ding ◽  
Jiajun Zhang ◽  
...  

With the advent of big data and cloud computing solutions, enterprise demand for servers is increasing, with especially high growth for Intel-based x86 server platforms. Today's datacenters are in constant pursuit of high-performance, high-availability computing coupled with low power consumption, low heat generation, and the ability to manage all of this through advanced telemetry data gathering. This paper showcases one such solution: an updated rack and server architecture that promises these improvements. The ability to manage server and data center power consumption and cooling more completely is critical to controlling datacenter costs and reducing PUE. Traditional Intel-based 1U and 2U form factor servers have existed in the data center for decades, and the general-purpose x86 server designs from the major OEMs are, for all practical purposes, very similar in their power consumption and thermal output. Server power supplies and thermal designs have historically not been optimized for high efficiency. In addition, IT managers need more information about servers in order to optimize data center cooling and power use, so an improved server/rack design needs to take advantage of more efficient power supplies or PDUs and more efficient means of cooling server compute resources than traditional internal server fans. This is the constant pursuit of corporations looking for new ways to improve efficiency and gain a competitive advantage. One way to optimize power consumption and improve cooling is a complete redesign of the traditional server rack: extracting the internal server power supplies and fans and centralizing them within the rack. This design reaches an entirely new low-power target by using centralized, high-efficiency PDUs to power all servers within the rack. Cooling is improved by using large, efficient rack-level fans to supply airflow to all servers, and opening up the server design allows greater airflow across server components. The centralized power supply breaks through traditional server power limits: rack-based PDUs can be operated closer to their optimum efficiency point, and combining online and offline modes within a single power supply, with cold backup, lets the data center achieve optimal power efficiency. In addition, unifying the mechanical structure and the thermal definitions for server cooling and PSU information within the rack solution allows IT to collect all server power and thermal information centrally, making it easier to analyze and process.
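
Since the abstract frames the gains partly in terms of PUE, the standard definition is worth stating: total facility power divided by IT equipment power. The tiny helper below illustrates it with made-up numbers; it is not data from the paper.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    return total_facility_kw / it_equipment_kw

# Example: 500 kW of total facility draw for 400 kW of IT load gives PUE = 1.25;
# centralizing PSUs and fans at the rack level aims to push this ratio down.
print(pue(500.0, 400.0))
```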


2014 ◽  
Vol 596 ◽  
pp. 276-279
Author(s):  
Xiao Hui Pan

Graph component labeling, which is a subset of the general graph coloring problem, is a computationally expensive operation in many important applications and simulations. A number of data-parallel algorithmic variations on the component labeling problem are possible, and we explore their use with general-purpose graphics processing units (GPGPUs) and the CUDA GPU programming language. We discuss implementation issues and performance results on CPUs and GPUs using CUDA, and we evaluate our system with real-world graphs. We show how taking the different architectural features of the GPU and the host CPUs into account allows us to achieve high performance.
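
One common data-parallel formulation of component labeling is iterative label propagation: every vertex starts with its own index as a label and repeatedly adopts the minimum label among its neighbors until nothing changes. The NumPy sketch below captures that idea on the CPU as a stand-in for the GPU kernels discussed in the paper.

```python
import numpy as np

def label_components(num_vertices: int, edges: np.ndarray) -> np.ndarray:
    """Connected component labels via iterative min-label propagation.

    edges: (m, 2) array of undirected edges (u, v).
    """
    labels = np.arange(num_vertices)
    changed = True
    while changed:
        # Each edge proposes the smaller endpoint label to both endpoints;
        # on a GPU, each edge (or vertex) would be handled by one thread.
        u, v = edges[:, 0], edges[:, 1]
        proposal = np.minimum(labels[u], labels[v])
        new_labels = labels.copy()
        np.minimum.at(new_labels, u, proposal)
        np.minimum.at(new_labels, v, proposal)
        changed = not np.array_equal(new_labels, labels)
        labels = new_labels
    return labels

# Two components: {0, 1, 2} and {3, 4}
edges = np.array([[0, 1], [1, 2], [3, 4]])
print(label_components(5, edges))   # -> [0 0 0 3 3]
```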


Author(s):  
Mário Pereira Vestias

High-performance reconfigurable computing systems integrate reconfigurable technology into the computing architecture to improve performance. Besides performance, reconfigurable hardware devices also achieve lower power consumption than general-purpose processors. Better performance and lower power consumption could be achieved with application-specific integrated circuit (ASIC) technology; however, ASICs are not reconfigurable, which makes them application-specific. Reconfigurable logic becomes a major advantage when hardware flexibility permits the same hardware module to speed up whatever application is run. The first and most common devices utilized for reconfigurable computing are fine-grained FPGAs, which offer large hardware flexibility. To reduce the performance and area overhead associated with reconfigurability, coarse-grained reconfigurable solutions have been proposed as a way to achieve better performance and lower power consumption. In this chapter, the authors provide a description of reconfigurable hardware for high-performance computing.

