Multi-Softcore Architecture on FPGA

2014 ◽  
Vol 2014 ◽  
pp. 1-13 ◽  
Author(s):  
Mouna Baklouti ◽  
Mohamed Abid

To meet the high performance demands of embedded multimedia applications, embedded systems are integrating multiple processing units. However, these are mostly based on a custom-logic design methodology. Designing parallel multicore systems from available standard intellectual property (IP) blocks while maintaining high performance is also a challenging issue. Softcore processors and field-programmable gate arrays (FPGAs) are a cheap and fast option for developing and testing such systems. This paper describes an FPGA-based design methodology to implement a rapid prototype of parametric multicore systems. A study of the viability of building the SoC around Altera's NIOS II soft-processor core is also presented. The NIOS II features a general-purpose RISC CPU architecture designed to address a wide range of applications. The performance of the implemented architecture is discussed, and several parallel applications are used to test the speedup and efficiency of the system. Experimental results demonstrate the performance of the proposed multicore system, which achieves better speedup than the GPU (29.5% faster for the FIR filter and 23.6% faster for the matrix-matrix multiplication).
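The abstract does not include the benchmark kernels themselves; the following minimal Python sketch (function names and the contiguous output partitioning are our assumptions, not the authors' code) illustrates how the FIR filter, one of the two benchmarks above, decomposes into independent output chunks that map naturally onto the softcores.

```python
# Minimal reference sketch (our assumption, not the authors' benchmark code):
# a direct-form FIR filter whose output range is split into contiguous chunks,
# each chunk being independent and thus mappable to one NIOS II softcore.

def fir_chunk(x, h, start, stop):
    """Compute y[n] = sum_k h[k] * x[n-k] for n in [start, stop)."""
    y = []
    for n in range(start, stop):
        acc = 0.0
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * x[n - k]
        y.append(acc)
    return y

def fir_parallel(x, h, n_cores=4):
    """Emulate the data-parallel split: each 'core' gets one output chunk."""
    chunk = (len(x) + n_cores - 1) // n_cores
    y = []
    for c in range(n_cores):                      # one iteration per softcore
        y += fir_chunk(x, h, c * chunk, min((c + 1) * chunk, len(x)))
    return y

print(fir_parallel([1, 2, 3, 4, 5, 6, 7, 8], [0.5, 0.5], n_cores=4))
# [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5]
```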

2010 ◽  
Vol 20 (02) ◽  
pp. 103-121 ◽  
Author(s):  
MOSTAFA I. SOLIMAN ◽  
ABDULMAJID F. Al-JUNAID

Technological advances in IC manufacturing provide us with the capability to integrate more and more functionality into a single chip. Today's modern processors have nearly one billion transistors on a single chip. With the increasing complexity of today's systems, designs have to be modeled at a high level of abstraction before partitioning into hardware and software components for final implementation. This paper explains in detail the implementation and performance evaluation of a matrix processor called Mat-Core using SystemC (a system-level modeling language). Mat-Core is a research processor aiming to exploit the increasing number of transistors per IC to improve the performance of a wide range of applications. It extends a general-purpose scalar processor with a matrix unit. To hide memory latency, the extended matrix unit is decoupled into two components, address generation and data computation, which communicate through data queues. Like vector architectures, the data computation unit is organized in parallel lanes. However, on its parallel lanes, Mat-Core can execute matrix-scalar, matrix-vector, and matrix-matrix instructions in addition to vector-scalar and vector-vector instructions. To control the execution of vector/matrix instructions on the matrix core, this paper extends the well-known scoreboard technique. Furthermore, the performance of Mat-Core is evaluated on vector and matrix kernels. Our results show that a four-lane Mat-Core with matrix registers of 4 × 4 (16) elements each, a queue size of 10, a start-up time of 6 clock cycles, and a memory latency of 10 clock cycles achieves about 0.94, 1.3, 2.3, 1.6, 2.3, and 5.5 FLOPs per clock cycle on scalar-vector multiplication, SAXPY, Givens, rank-1 update, vector-matrix multiplication, and matrix-matrix multiplication, respectively.
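As a rough illustration of the decoupling described above (a toy model of our own, not the Mat-Core hardware or its SystemC description), the sketch below streams element indices from an address-generation stage into a bounded queue and lets a data-computation stage drain it, so memory-access latency overlaps with arithmetic; the queue depth plays the role of the queue size of 10 mentioned above.

```python
# Toy model of a decoupled address-generation / data-computation pipeline
# communicating through a bounded queue (an assumption for illustration only).

from queue import Queue
from threading import Thread

def saxpy_decoupled(a, x, y, queue_size=10):
    """y <- a*x + y with address generation and computation decoupled."""
    q = Queue(maxsize=queue_size)

    def address_generation():
        for i in range(len(x)):          # stream of operand indices
            q.put(i)
        q.put(None)                      # end-of-stream marker

    def data_computation():
        while True:
            i = q.get()
            if i is None:
                break
            y[i] = a * x[i] + y[i]       # the actual FLOPs

    t = Thread(target=address_generation)
    t.start()
    data_computation()
    t.join()
    return y

print(saxpy_decoupled(2.0, [1, 2, 3], [4, 5, 6]))   # [6.0, 9.0, 12.0]
```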


VLSI Design ◽  
1995 ◽  
Vol 2 (4) ◽  
pp. 305-314 ◽  
Author(s):  
Peter W. Thompson ◽  
Julian D. Lewis

High-performance parallel systems demand a high-performance interconnect so that their component parts can exchange data and synchronise efficiently. The interconnect must be cheap, and must also scale well in both performance and cost relative to the system size. In this paper we describe the rationale, architecture and operation of the STC104, the first commercially available, general-purpose interconnect chip. The serial protocols used by the device are described, followed by an overview of the microarchitecture. The operation of the fundamental block is outlined, including its response to error conditions. Chip-wide design issues and design methodology are discussed, and finally various aspects of performance are calculated.


2017 ◽  
Vol 2 (8) ◽  
Author(s):  
Cristina Acebo ◽  
Xavier Ramis ◽  
Angels Serra

Abstract Epoxy resins are commonly used as thermosetting materials due to their excellent mechanical properties, high adhesion to many substrates, and good heat and chemical resistance. These thermosets are used extensively in a wide range of fields as fiber-reinforced materials, general-purpose adhesives, high-performance coatings and encapsulating materials. They are formed by the chemical reaction of multifunctional epoxy monomers, which irreversibly produces a polymer network. In this article, the improvement of the characteristics of epoxy thermosets using different hyperbranched poly(ethyleneimine) (PEI) derivatives is explained.


2011 ◽  
Vol 21 (03) ◽  
pp. 359-375 ◽  
Author(s):  
NIALL EMMART ◽  
CHARLES C. WEEMS

We have improved our prior implementation of Strassen's algorithm for high-performance multiplication of very large integers on a general-purpose graphics processor (GPU). A combination of algorithmic and implementation optimizations results in a speed improvement of up to a factor of 13.9 over our previous work, running on an NVIDIA 295. We have also reoptimized the implementation for an NVIDIA 480, obtaining a speedup of up to a factor of 19 in comparison with a Core i7 processor core of the same technology generation. To provide a fairer chip-to-chip comparison, we also determined total GPU throughput on a set of multiplications relative to all of the cores of a multicore chip running in parallel. We find that the GTX 480 provides a factor of six higher throughput than all four cores/eight threads of the Core i7. This paper discusses how we adapted the algorithm to operate within the limitations of the GPU and how we dealt with other issues encountered during implementation, including details of the memory layout of our FFTs. Compared with our earlier work, which used Karatsuba's algorithm to handle different operand sizes on top of Strassen's algorithm applied to fixed-size segments of the operands, we are now able to apply Strassen's algorithm directly to operands ranging in size from 255K bits to 16,320K bits.
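The GPU kernels themselves are not reproduced in the abstract; as a hedged CPU-side reference for the FFT-based large-integer multiplication that this line of work builds on, the sketch below (digit size, helper names, and the floating-point FFT are our choices, not the authors') splits each operand into small digits, convolves the digit sequences with an FFT, and propagates carries.

```python
# CPU reference sketch (assumption, not the authors' GPU implementation) of
# FFT-based large-integer multiplication: base-2^16 digits, floating-point
# FFT convolution, then carry propagation. Small digit sizes keep the
# floating-point rounding exact for modest operand sizes.

import numpy as np

def fft_multiply(a, b, base_bits=16):
    base = 1 << base_bits

    def digits(v):
        out = []
        while v:
            out.append(v & (base - 1))
            v >>= base_bits
        return out or [0]

    da, db = digits(a), digits(b)
    n = 1
    while n < len(da) + len(db):
        n <<= 1
    fa = np.fft.rfft(np.array(da, dtype=float), n)
    fb = np.fft.rfft(np.array(db, dtype=float), n)
    conv = np.rint(np.fft.irfft(fa * fb, n)).astype(np.int64)
    result, carry = 0, 0
    for i, c in enumerate(conv):                 # carry propagation
        total = int(c) + carry
        result |= (total & (base - 1)) << (i * base_bits)
        carry = total >> base_bits
    return result | (carry << (len(conv) * base_bits))

x, y = 12345678901234567890, 98765432109876543210
print(fft_multiply(x, y) == x * y)   # True for small inputs like these
```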


2014 ◽  
Vol 23 (08) ◽  
pp. 1430002 ◽  
Author(s):  
SPARSH MITTAL

Initially introduced as special-purpose accelerators for graphics applications, graphics processing units (GPUs) have now emerged as general-purpose computing platforms for a wide range of applications. To address the requirements of these applications, modern GPUs include sizable hardware-managed caches. However, several factors, such as the unique architecture of GPUs and the rise of CPU–GPU heterogeneous computing, demand effective cache management to achieve high performance and energy efficiency. Recently, several techniques have been proposed for this purpose. In this paper, we survey architectural and system-level techniques proposed for managing and leveraging GPU caches. We also discuss the importance and challenges of cache management in GPUs. The aim of this paper is to provide readers with insights into cache management techniques for GPUs and to motivate them to propose even better techniques for leveraging the full potential of caches in the GPUs of tomorrow.


2019 ◽  
Vol 3 (1) ◽  
pp. 205
Author(s):  
Mahmoud M. Abdelrahman ◽  
Ahmed Mohamed Yousef Toutou

In this paper, we present an approach for combining machine learning (ML) techniques with building performance simulation by introducing four ways in which ML can be effectively involved in this field: classification, regression, clustering, and model selection. The Rhino3D/Grasshopper SDK was used to develop a new plugin, written in Python, that brings machine learning into the design process. It makes use of the scikit-learn module, a Python library that gives nonspecialist users a general-purpose, high-level interface to a wide range of supervised and unsupervised learning algorithms, with high performance, ease of use, and well-documented features. The ANT plugin makes these modules available inside Rhino/Grasshopper in a form that is handy for designers. The tool is open source and released under the simplified BSD license. The approach shows promising results for using data to automate building performance development and could be widely applied. Future work includes providing parallel computation via the PyOpenCL module as well as computer vision integration using scikit-image.
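A minimal sketch of how the roles listed above might look when driven through scikit-learn; the feature names, the synthetic data, and the chosen estimators are hypothetical placeholders and not ANT's actual component interface.

```python
# Hypothetical illustration of regression (surrogate modelling), model
# selection (cross-validation) and clustering with scikit-learn; the design
# variables and target are invented placeholders, not ANT's data model.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score

# Hypothetical design variables: [window_ratio, wall_u_value, orientation_deg]
X = np.random.rand(200, 3) * [1.0, 2.0, 360.0]
# Hypothetical simulated annual energy use for each design option (kWh/m2).
y = 120 - 40 * X[:, 0] + 15 * X[:, 1] + 0.01 * X[:, 2]

# Regression: learn a fast surrogate of the building performance simulation.
surrogate = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Model selection: cross-validated score of the surrogate.
print("R^2 (5-fold CV):", cross_val_score(surrogate, X, y, cv=5).mean())

# Clustering: group design candidates into families for the designer.
labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)
print("cluster sizes:", np.bincount(labels))
```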


2011 ◽  
Vol 2011 ◽  
pp. 1-13 ◽  
Author(s):  
Mateus B. Rutzig ◽  
Antonio C. S. Beck ◽  
Felipe Madruga ◽  
Marco A. Alves ◽  
Henrique C. Freitas ◽  
...  

Limits of instruction-level parallelism and higher transistor density sustain the increasing need for multiprocessor systems: they are rapidly taking over both general-purpose and embedded processor domains. Current multiprocessing systems are composed either of many homogeneous and simple cores or of complex superscalar, simultaneous-multithreading processing elements. As parallel applications become increasingly common in the embedded and general-purpose domains and multiprocessing systems must handle a wide range of application classes, there is no consensus over which hardware solutions best exploit instruction-level parallelism (ILP) and thread-level parallelism (TLP) together. Therefore, in this work, we have extended the DIM (dynamic instruction merging) technique to a multiprocessing scenario, demonstrating the need for adaptable ILP exploitation even in TLP-oriented architectures. We have successfully coupled a dynamic reconfigurable system to a SPARC-based multiprocessor and obtained performance gains of up to 40%, even for applications that exhibit a great deal of parallelism at the thread level.
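As a back-of-the-envelope sanity check of why ILP and TLP exploitation compose (a simplified model of our own, not the paper's evaluation methodology, and it assumes the per-core gain applies uniformly), the per-core speedup from a reconfigurable ILP accelerator can be multiplied by an Amdahl-style multicore speedup:

```python
# Simplified speedup model (an assumption for illustration): per-core ILP gain
# times the Amdahl's-law multicore speedup for a given parallel fraction.

def combined_speedup(ilp_speedup, n_cores, parallel_fraction):
    tlp_speedup = 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)
    return ilp_speedup * tlp_speedup

# Example: a 1.4x per-core boost (the ~40% gain reported above) on top of
# 4 cores running a 90%-parallel workload gives roughly 4.3x overall.
print(combined_speedup(1.4, 4, 0.90))
```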


Symmetry ◽  
2020 ◽  
Vol 12 (9) ◽  
pp. 1555
Author(s):  
Ahmed Mohammed Alghamdi ◽  
Fathy Elbouraey Eassa ◽  
Maher Ali Khamakhem ◽  
Abdullah Saad AL-Malaise AL-Ghamdi ◽  
Ahmed S. Alfakeeh ◽  
...  

The importance of high-performance computing is increasing, and Exascale systems will be feasible in a few years. These systems can be achieved by enhancing the hardware's capability as well as the parallelism in the application by integrating more than one programming model. One such dual-programming-model combination is Message Passing Interface (MPI) + OpenACC, which offers several benefits, including increased system parallelism, support for different platforms with better performance, higher productivity, and less programming effort. Several testing tools target parallel applications built with these programming models, but more effort is needed, especially for high-level Graphics Processing Unit (GPU)-related programming models. Owing to the integration of different programming models, errors become more frequent and unpredictable. Testing techniques are required to detect these errors, especially runtime errors resulting from the integration of MPI and OpenACC; studying their behavior is also important, particularly for some OpenACC runtime errors that no compiler can detect. In this paper, we enhance the capabilities of ACC_TEST to test programs built with the dual programming models MPI + OpenACC and to detect their related errors. ACC_TEST integrates both static and dynamic testing techniques, allowing us to benefit from the advantages of both: reduced overheads, improved execution time, and coverage of a wide range of errors. Finally, ACC_TEST is a parallel testing tool that creates testing threads based on the number of application threads in order to detect runtime errors.
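ACC_TEST itself is not shown in the abstract; the toy Python harness below only illustrates the final point, one watchdog ("testing") thread per application thread flagging runtime misbehaviour such as a hang, with the thread counts, the timeout, and the checked error all being our assumptions.

```python
# Toy per-thread watchdog harness (an illustration only, not ACC_TEST):
# one monitor thread per application thread reports workers that have not
# finished within a deadline, a stand-in for runtime-error detection.

import threading, time

def run_with_watchdogs(workers, timeout=1.0):
    errors = []

    def watch(name, thread):
        thread.join(timeout)
        if thread.is_alive():
            errors.append(f"{name}: still running after {timeout}s (possible hang)")

    app_threads = [threading.Thread(target=fn, name=name, daemon=True)
                   for name, fn in workers]
    for t in app_threads:
        t.start()
    watchdogs = [threading.Thread(target=watch, args=(t.name, t))
                 for t in app_threads]          # one watchdog per app thread
    for w in watchdogs:
        w.start()
    for w in watchdogs:
        w.join()
    return errors

# Example: one well-behaved worker and one that never finishes in time.
print(run_with_watchdogs([("ok", lambda: time.sleep(0.1)),
                          ("stuck", lambda: time.sleep(10))]))
```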


2014 ◽  
Vol 2014 ◽  
pp. 1-14 ◽  
Author(s):  
Keng-Mao Cho ◽  
Chun-Wei Tsai ◽  
Yi-Shiuan Chiu ◽  
Chu-Sing Yang

Finding ways to distribute workloads to each processor core and efficiently reduce power consumption is of vital importance, especially for real-time systems. In this paper, a novel scheduling algorithm is proposed for real-time multicore systems to balance the computation load and save power. The algorithm, called power and deadline-aware multicore scheduling (PDAMS), simultaneously considers multiple criteria, including a novel factor and the task deadline. Experimental results show that the proposed algorithm can reduce energy consumption by up to 54.2% and also reduce the number of missed deadlines, compared to the other scheduling algorithms discussed in this paper.
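As an illustration of the two goals, deadline awareness and load balancing (this is a generic earliest-deadline-first plus least-loaded-core sketch, not the PDAMS algorithm, whose novel factor the abstract does not spell out):

```python
# Generic deadline- and load-aware placement sketch (our assumption, not PDAMS):
# order ready tasks by deadline, place each on the least-loaded core, and
# record any task whose predicted finish time exceeds its deadline.

def schedule(tasks, n_cores):
    """tasks: list of (name, exec_time, deadline). Returns (per-core plan, missed)."""
    core_load = [0.0] * n_cores
    plan = [[] for _ in range(n_cores)]
    missed = []
    for name, exec_time, deadline in sorted(tasks, key=lambda t: t[2]):
        core = min(range(n_cores), key=lambda c: core_load[c])
        finish = core_load[core] + exec_time
        if finish > deadline:
            missed.append(name)
        core_load[core] = finish
        plan[core].append((name, finish))
    return plan, missed

plan, missed = schedule([("t1", 2, 4), ("t2", 1, 2), ("t3", 3, 5), ("t4", 2, 6)], 2)
print(plan, missed)   # all four tasks meet their deadlines on two cores
```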


Author(s):  
N. VENKATESWARAN ◽  
S. PATTABIRAMAN ◽  
R. DEVANATHAN ◽  
B. KUMARAN ◽  
ASHRAF AHMED ◽  
...  

Very Large Array Processors (VLAPs) will be needed in the future for solving the computationally intense Very Large Problems (VLPs) common in pattern recognition, image processing, and other related areas of digital signal processing. The design methodology of such VLAPs for massively parallel dedicated/general-purpose applications is highly complex. Two companion papers (Part 1 and Part 2) on VLAPs are presented in this issue. In Part 1, we propose a VLAP called the Reconfigurable GIPOP Processor Array (RGPA). The RGPA is built from high-performance processing elements called Generalized Inner Product Outer Product (GIPOP) processors. Unlike traditional special- and general-purpose processors, the GIPOP processor has a new architecture and organization based on higher-level functional units that match the complex computational structures of numeric algorithms and are suitable for massively parallel processing. We also present a strategy for mapping VLPs onto VLAPs. In Part 2, we propose a novel VLSI design methodology for implementing cost-effective, very high performance processors intended for special-purpose applications and, in particular, for VLAPs.
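The inner-product and outer-product primitives that give the GIPOP processor its name can be stated compactly; the sketch below is purely mathematical and says nothing about the hardware organisation, for example accumulating a matrix product as a sum of outer products.

```python
# The two primitives behind the GIPOP name, shown mathematically: an inner
# product reduces two vectors to a scalar, an outer product expands them to a
# matrix, and C = A*B can be accumulated as a sum of outer products of A's
# columns with B's rows.

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def outer(u, v):
    return [[a * b for b in v] for a in u]

def matmul_outer(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for p in range(k):                      # one outer-product update per step
        col = [A[i][p] for i in range(n)]
        op = outer(col, B[p])
        for i in range(n):
            for j in range(m):
                C[i][j] += op[i][j]
    return C

print(matmul_outer([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]
print(inner([1, 2, 3], [4, 5, 6]))                        # 32
```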

