A coarse-grained FPGA architecture for high-performance FIR filtering

Coarse-grained reconfigurable architecture (CGRA) mapping involves three main steps: placement, routing, and timing. The mapping is an NP-complete problem, and a common strategy is to decouple the process into independent steps. This work focuses on the placement step and aims to propose a technique that is both reasonably fast and leads to high-performance solutions. Furthermore, a near-optimal placement simplifies the subsequent routing and timing steps. Exact methods cannot find placements in a reasonable execution time as input designs increase in size. Heuristic solutions include meta-heuristics, such as Simulated Annealing (SA), and fast, straightforward greedy heuristics based on graph traversal. However, as these approaches are probabilistic and have a large design space, it is not easy to provide both run-time efficiency and good solution quality. We propose a graph traversal heuristic that provides the best of both: high-quality placements similar to SA and the execution time of graph traversal approaches. Our placement introduces novel ideas based on the “you only traverse twice” (YOTT) approach, which performs a two-step graph traversal. The first traversal generates annotated data to guide the second step, which greedily performs the placement, node by node, aided by the annotated data and target architecture constraints. We introduce three new concepts to implement this technique: I/O and reconvergence annotation, degree matching, and look-ahead placement. Our analysis of this approach explores the trade-offs between placement execution time and quality. We point out insights on how to analyze graph properties during dataflow mapping. Our results show that YOTT is 60.6×, 9.7×, and 2.3× faster than a high-quality SA, bounding box SA VPR, and multi-single traversal placements, respectively. Furthermore, YOTT reduces the average wire length and the maximal FIFO size (an additional timing requirement on CGRAs) to avoid delay mismatches in fully pipelined architectures.


Algorithmic implementation of low-power high performance FIR filtering IP cores

18th International Conference on VLSI Design held jointly with 4th International Conference on Embedded Systems Design ◽ 10.1109/icvd.2005.44 ◽ 2005 ◽ Cited By ~ 2

Author(s): C.H. Wang ◽ A.T. Erdogan ◽ T. Arslan

Keyword(s): Low Power ◽ High Performance ◽ FIR Filtering ◽ IP Cores


A new coarse-grained FPGA architecture exploration environment

2008 International Conference on Field-Programmable Technology ◽ 10.1109/fpt.2008.4762399 ◽ 2008

Author(s): Husain Parvez ◽ Zied Marrakchi ◽ Umer Farooq ◽ Habib Mehrez

Keyword(s): Coarse Grained ◽ Architecture Exploration ◽ FPGA Architecture ◽ Exploration Environment


High-Performance Reconfigurable Computing

Advances in Computer and Electrical Engineering - Advanced Methodologies and Technologies in Network Architecture, Mobile Computing, and Data Analytics ◽ 10.4018/978-1-5225-7598-6.ch053 ◽ 2019 ◽ pp. 731-744

Author(s): Mário Pereira Vestias

Keyword(s): Power Consumption ◽ Integrated Circuit ◽ Reconfigurable Computing ◽ High Performance ◽ General Purpose ◽ Reconfigurable Hardware ◽ Coarse Grained ◽ Lower Power ◽ Fine Grained ◽ Application Specific

High-performance reconfigurable computing systems integrate reconfigurable technology into the computing architecture to improve performance. Besides performance, reconfigurable hardware devices also achieve lower power consumption than general-purpose processors. Even better performance and lower power consumption could be achieved using application-specific integrated circuit (ASIC) technology. However, ASICs are not reconfigurable, which ties them to a specific application. Reconfigurable logic becomes a major advantage when hardware flexibility makes it possible to speed up any application with the same hardware module. The first and most common devices used for reconfigurable computing are fine-grained FPGAs, which offer high hardware flexibility. To reduce the performance and area overhead associated with reconfigurability, coarse-grained reconfigurable solutions have been proposed as a way to achieve better performance and lower power consumption. In this chapter, the authors provide a description of reconfigurable hardware for high-performance computing.


High-Performance Reconfigurable Computing

Encyclopedia of Information Science and Technology, Fourth Edition ◽ 10.4018/978-1-5225-2255-3.ch348 ◽ 2018 ◽ pp. 4018-4029

Author(s): Mário Pereira Vestias

Keyword(s): Power Consumption ◽ Integrated Circuit ◽ Reconfigurable Computing ◽ High Performance ◽ General Purpose ◽ Reconfigurable Hardware ◽ Coarse Grained ◽ Lower Power ◽ Fine Grained ◽ Application Specific

High-Performance Reconfigurable Computing systems integrate reconfigurable technology into the computing architecture to improve performance. Besides performance, reconfigurable hardware devices also achieve lower power consumption than General-Purpose Processors. Even better performance and lower power consumption could be achieved using Application-Specific Integrated Circuit (ASIC) technology. However, ASICs are not reconfigurable, which ties them to a specific application. Reconfigurable logic becomes a major advantage when hardware flexibility makes it possible to speed up any application with the same hardware module. The first and most common devices used for reconfigurable computing are fine-grained FPGAs, which offer high hardware flexibility. To reduce the performance and area overhead associated with reconfigurability, coarse-grained reconfigurable solutions have been proposed as a way to achieve better performance and lower power consumption. This chapter provides a description of reconfigurable hardware for high-performance computing.


High Performance CGM-based Parallel Algorithms for the Optimal Binary Search Tree Problem

International Journal of Grid and High Performance Computing ◽ 10.4018/ijghpc.2016100104 ◽ 2016 ◽ Vol 8 (4) ◽ pp. 55-77 ◽ Cited By ~ 2

Author(s): Vianney Kengne Tchendji ◽ Jean Frederic Myoupo ◽ Gilles Dequen

Keyword(s): Parallel Algorithms ◽ Execution Time ◽ High Performance ◽ Scheduling Algorithm ◽ Search Tree ◽ Binary Search ◽ Coarse Grained ◽ Binary Search Tree ◽ Parameter Dependent ◽ The Impact

In this paper, the authors highlight the close relations between the execution time, efficiency, and number of communication rounds in a family of CGM-based parallel algorithms for the optimal binary search tree (OBST) problem. In this case, these three parameters cannot be simultaneously improved. The family of CGM (Coarse Grained Multicomputer) algorithms they derive is based on Knuth's sequential solution running in O(n^2) time and space, where n is the size of the problem. These CGM algorithms use p processors, each with local memory. In general, the authors show that each algorithm runs in with communication rounds. is the granularity of their model, and is a parameter that depends on and . The special case of yields a load-balanced CGM-based parallel algorithm with communication rounds and execution steps. Alternatively, if , they obtain another algorithm with better execution time, say , the absence of any load-balancing and communication rounds, i.e., not better than the first algorithm. The authors show that the granularity plays a crucial role in the different techniques they use to partition the problem, and they study the impact of each scheduling algorithm. To the best of their knowledge, this is the first unified method to derive a set of parameter-dependent CGM-based parallel algorithms for the OBST problem.
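
For orientation, the sequential baseline mentioned in this abstract can be sketched as below. This is a minimal illustration and not the authors' CGM algorithm: it assumes the simplified OBST variant in which only successful-search frequencies are given, and the function name `optimal_bst_cost` is hypothetical. It uses Knuth's root-monotonicity observation to restrict the candidate roots, which is what brings the dynamic program down to quadratic time.

```python
def optimal_bst_cost(freq):
    """Minimum weighted search cost for sorted keys with access frequencies `freq`.

    Assumption: only successful searches are modelled (no gap probabilities).
    """
    n = len(freq)
    if n == 0:
        return 0

    # prefix[k] = freq[0] + ... + freq[k-1], so any range sum is O(1).
    prefix = [0]
    for f in freq:
        prefix.append(prefix[-1] + f)

    # cost[i][j]: optimal cost for keys i..j (inclusive).
    # root[i][j]: index of an optimal root for keys i..j.
    cost = [[0] * n for _ in range(n)]
    root = [[0] * n for _ in range(n)]
    for i in range(n):
        cost[i][i] = freq[i]
        root[i][i] = i

    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            weight = prefix[j + 1] - prefix[i]
            best = None
            # Knuth's bound: root[i][j-1] <= root[i][j] <= root[i+1][j].
            for r in range(root[i][j - 1], root[i + 1][j] + 1):
                left = cost[i][r - 1] if r > i else 0
                right = cost[r + 1][j] if r < j else 0
                total = left + right + weight
                if best is None or total < best:
                    best = total
                    root[i][j] = r
            cost[i][j] = best

    return cost[0][n - 1]
```

The table is filled diagonal by diagonal, so each entry depends only on shorter key ranges; that dependency structure is what a parallel scheme has to partition across processors.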


MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language

Information ◽ 10.3390/info10100317 ◽ 2019 ◽ Vol 10 (10) ◽ pp. 317 ◽ Cited By ~ 1

Author(s): Karol Nowakowski ◽ Michal Ptaszynski ◽ Fumito Masui

Keyword(s): Language Processing ◽ High Performance ◽ Computational Cost ◽ Neural Model ◽ Word Segmentation ◽ Coarse Grained ◽ Endangered Language ◽ Modelling Techniques ◽ Series Of Experiments ◽ N Gram

Word segmentation is an essential task in automatic language processing for languages where there are no explicit word boundary markers, or where space-delimited orthographic words are too coarse-grained. In this paper we introduce the MiNgMatch Segmenter—a fast word segmentation algorithm, which reduces the problem of identifying word boundaries to finding the shortest sequence of lexical n-grams matching the input text. In order to validate our method in a low-resource scenario involving extremely sparse data, we tested it with a small corpus of text in the critically endangered language of the Ainu people living in northern parts of Japan. Furthermore, we performed a series of experiments comparing our algorithm with systems utilizing state-of-the-art lexical n-gram-based language modelling techniques (namely, Stupid Backoff model and a model with modified Kneser-Ney smoothing), as well as a neural model performing word segmentation as character sequence labelling. The experimental results we obtained demonstrate the high performance of our algorithm, comparable with the other best-performing models. Given its low computational cost and competitive results, we believe that the proposed approach could be extended to other languages, and possibly also to other Natural Language Processing tasks, such as speech recognition.
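
The core idea stated in this abstract, finding the shortest sequence of lexical n-grams that matches the input text, can be illustrated with a small dynamic program. This is only a sketch under assumptions, not the MiNgMatch implementation: the lexicon format (an unspaced string mapped to its segmented form), the function name `segment`, and the toy English data are all hypothetical.

```python
def segment(text, ngrams):
    """Segment `text` using the fewest lexicon n-grams that exactly cover it.

    `ngrams` maps an unspaced string to its segmented form, e.g.
    "thecat" -> "the cat" (a 2-gram). Returns None if no cover exists.
    """
    n = len(text)
    inf = float("inf")
    best = [inf] * (n + 1)   # best[i]: fewest n-grams covering text[:i]
    back = [None] * (n + 1)  # back[i]: (previous position, segmented n-gram)
    best[0] = 0
    max_len = max((len(key) for key in ngrams), default=0)

    for i in range(n):
        if best[i] == inf:
            continue  # text[:i] cannot be covered, so nothing starts here
        for j in range(i + 1, min(n, i + max_len) + 1):
            piece = text[i:j]
            if piece in ngrams and best[i] + 1 < best[j]:
                best[j] = best[i] + 1
                back[j] = (i, ngrams[piece])

    if best[n] == inf:
        return None

    # Walk the back-pointers from the end to recover the chosen n-grams.
    parts = []
    pos = n
    while pos > 0:
        pos, seg = back[pos]
        parts.append(seg)
    return " ".join(reversed(parts))


# Toy lexicon (hypothetical data, English for readability).
lexicon = {"the": "the", "cat": "cat", "sat": "sat", "thecat": "the cat"}
result = segment("thecatsat", lexicon)
```

Here the 2-gram "the cat" plus the 1-gram "sat" covers the input with two matches, which beats the three-match cover built from unigrams alone.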


The Combinatorial BLAS: design, implementation, and applications

The International Journal of High Performance Computing Applications ◽ 10.1177/1094342011403516 ◽ 2011 ◽ Vol 25 (4) ◽ pp. 496-509 ◽ Cited By ~ 187

Author(s): Aydın Buluç ◽ John R Gilbert

Keyword(s): Data Mining ◽ High Performance ◽ Web Search ◽ Sparse Matrix ◽ Ease Of Use ◽ Coarse Grained ◽ Matrix Methods ◽ The Right ◽ Traditional Approaches ◽ Combinatorial Graphs

This paper presents a scalable high-performance software library to be used for graph analysis and data mining. Large combinatorial graphs appear in many applications of high-performance computing, including computational biology, informatics, analytics, web search, dynamical systems, and sparse matrix methods. Graph computations are difficult to parallelize using traditional approaches due to their irregular nature and low operational intensity. Many graph computations, however, contain sufficient coarse-grained parallelism for thousands of processors, which can be uncovered by using the right primitives. We describe the parallel Combinatorial BLAS, which consists of a small but powerful set of linear algebra primitives specifically targeting graph and data mining applications. We provide an extensible library interface and some guiding principles for future development. The library is evaluated using two important graph algorithms, in terms of both performance and ease-of-use. The scalability and raw performance of the example applications, using the Combinatorial BLAS, are unprecedented on distributed memory clusters.
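
To make the notion of linear algebra primitives for graph computations concrete, breadth-first search can be written as a repeated product of a sparse adjacency structure with a sparse frontier vector over a Boolean semiring. The sketch below is plain Python and does not use or reproduce the Combinatorial BLAS interface; the function name `bfs_levels` and the dictionary-based adjacency format are assumptions made for illustration.

```python
def bfs_levels(adj, source):
    """Return {vertex: BFS depth} for every vertex reachable from `source`.

    `adj` maps each vertex to an iterable of its out-neighbours, i.e. the
    non-zero columns of that vertex's row in a sparse adjacency matrix.
    """
    level = {source: 0}
    frontier = {source}  # sparse vector: the set of currently active vertices
    depth = 0
    while frontier:
        depth += 1
        # One sparse matrix-vector step over the Boolean (OR, AND) semiring:
        # a vertex is reached if any frontier vertex has an edge to it.
        reached = set()
        for u in frontier:
            reached.update(adj.get(u, ()))
        # Mask out vertices that already have a level assigned.
        frontier = {v for v in reached if v not in level}
        for v in frontier:
            level[v] = depth
    return level


# Small hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
example = {0: [1, 2], 1: [3], 2: [3], 3: []}
levels = bfs_levels(example, 0)
```

Each loop iteration touches only the edges leaving the current frontier, so the work is driven by the sparsity of the graph and is expressed as one bulk operation per level.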


A High-Performance FPGA Architecture Using One-Level RRAM-Based Multiplexers

IEEE Transactions on Emerging Topics in Computing ◽ 10.1109/tetc.2016.2630121 ◽ 2017 ◽ Vol 5 (2) ◽ pp. 210-222 ◽ Cited By ~ 3

Author(s): Xifan Tang ◽ Giovanni De Micheli ◽ Pierre-Emmanuel Gaillardon

Keyword(s): High Performance ◽ FPGA Architecture
