NCOR: An FPGA-Friendly Nonblocking Data Cache for Soft Processors with Runahead Execution

2012 ◽  
Vol 2012 ◽  
pp. 1-12 ◽  
Author(s):  
Kaveh Aasaraai ◽  
Andreas Moshovos

Soft processors often use data caches to reduce the gap between processor and main memory speeds. To keep the design efficient, simple blocking caches are typically used. Such caches are not appropriate for processor designs such as Runahead and out-of-order execution, which require nonblocking caches to tolerate main memory latencies and extract memory-level parallelism. However, conventional nonblocking cache designs are expensive and slow on FPGAs because they use content-addressable memories (CAMs). This work proposes NCOR, an FPGA-friendly nonblocking cache that exploits the key properties of Runahead execution. NCOR does not require CAMs, relying instead on a smart cache controller. A 4 KB NCOR operates at 329 MHz on Stratix III FPGAs while using only 270 logic elements. A 32 KB NCOR operates at 278 MHz and uses 269 logic elements.
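The abstract's central claim is that Runahead execution lets the cache track outstanding misses without CAM-based miss handlers. The following is a minimal C sketch of that idea, not the paper's RTL: it assumes a direct-mapped cache with illustrative sizes and field names (frame_t, miss_pending, access_cache are not from the paper), and keeps pending-miss state in the cache frame itself so it can be found by direct indexing rather than associative search.

```c
/* Minimal sketch (not the paper's RTL): tracking outstanding misses without a CAM.
 * Conventional nonblocking caches search CAM-based MSHRs on every access; here the
 * miss state lives in the cache frame and is found by direct indexing. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_FRAMES 256           /* assumed direct-mapped cache for simplicity */

typedef struct {
    uint32_t tag;
    bool     valid;
    bool     miss_pending;       /* set while a fill for this frame is in flight */
} frame_t;

static frame_t cache[NUM_FRAMES];

/* Returns true on hit; on a miss, marks the frame pending instead of
 * allocating a CAM-searched MSHR entry. */
bool access_cache(uint32_t addr)
{
    uint32_t idx = (addr >> 6) & (NUM_FRAMES - 1);   /* 64-byte blocks assumed */
    uint32_t tag = addr >> 14;
    frame_t *f = &cache[idx];

    if (f->miss_pending)         /* secondary access to an in-flight block:   */
        return false;            /* treated as a miss; Runahead just moves on */

    if (f->valid && f->tag == tag)
        return true;             /* ordinary hit */

    f->miss_pending = true;      /* primary miss: remember it in the frame;   */
    f->tag = tag;                /* the fill response indexes back by `idx`   */
    return false;
}

/* Called when the memory fill for frame `idx` returns. */
void fill_complete(uint32_t idx)
{
    cache[idx].miss_pending = false;
    cache[idx].valid = true;
}
```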


2021 ◽  
Author(s):  
Bashar Romanous ◽  
Skyler Windh ◽  
Ildar Absalyamov ◽  
Prerna Budhkar ◽  
Robert Halstead ◽  
...  

The join and group-by aggregation are two memory-intensive operators that affect the performance of relational databases. Hashing is a common approach used to implement both operators. Recent paradigm shifts in multi-core processor architectures have reinvigorated research into how the join and group-by aggregation operators can leverage these advances. However, the poor spatial locality of the hashing approach has hindered performance on multi-core processor architectures, which rely on large cache hierarchies for latency mitigation. Multithreaded architectures can better cope with poor spatial locality by masking memory latency with many outstanding requests. Nevertheless, the number of parallel threads, even in the most advanced multithreaded processors, such as UltraSPARC, is not enough to fully cover the main memory access latency. In this paper, we explore the hardware re-configurability of FPGAs to enable deeper execution pipelines that maintain hundreds (instead of tens) of outstanding memory requests across four FPGAs, drastically increasing concurrency and throughput. We present two end-to-end in-memory accelerators for the join and group-by aggregation operators using FPGAs. Both accelerators use massive multithreading to mask the long memory delays of traversing linked-list data structures, while concurrently managing hundreds of thread states across four FPGAs locally. We explore how content-addressable memories (CAMs) can be intermixed within our multithreaded designs to act as a synchronizing cache, which enforces locks and merges jobs together before they are written to memory. Throughput results for our hash-join operator accelerator show a speedup between 2× and 3.4× over the best multi-core approaches with comparable memory bandwidths on uniform and skewed datasets. The accelerator for the hash-based group-by aggregation operator demonstrates that leveraging CAMs achieves an average speedup of 3.3×, with a best case of 9.4×, in terms of throughput over CPU implementations across five types of data distributions.
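For orientation, here is a minimal software sketch of the hash-join pattern the accelerators target: build a chained hash table on the inner relation, then probe it by walking linked lists. Names, sizes, and the hash function are illustrative, not from the paper; the FPGA design differs in that it keeps hundreds of such probes in flight to hide the latency of each pointer chase.

```c
/* Chained hash join: build on inner relation, probe with outer keys.
 * Every `->next` dereference is a dependent memory access, which is what
 * the paper's massive multithreading overlaps. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NUM_BUCKETS 1024

typedef struct node {
    uint32_t key;
    uint32_t payload;
    struct node *next;           /* chaining: the pointer chase that stalls CPUs */
} node_t;

static node_t *buckets[NUM_BUCKETS];

static uint32_t hash(uint32_t key) { return (key * 2654435761u) & (NUM_BUCKETS - 1); }

/* Build phase: insert each inner tuple at the head of its bucket's list. */
void build(uint32_t key, uint32_t payload)
{
    node_t *n = malloc(sizeof *n);
    n->key = key; n->payload = payload;
    uint32_t b = hash(key);
    n->next = buckets[b];
    buckets[b] = n;
}

/* Probe phase: each outer key walks one linked list looking for matches. */
void probe(uint32_t key)
{
    for (node_t *n = buckets[hash(key)]; n != NULL; n = n->next)
        if (n->key == key)
            printf("match: key=%u payload=%u\n", (unsigned)key, (unsigned)n->payload);
}

int main(void)
{
    build(7, 100); build(7, 200); build(42, 300);
    probe(7);                    /* prints both matching inner tuples */
    probe(42);
    return 0;
}
```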



2011 ◽  
Vol 130-134 ◽  
pp. 2907-2910
Author(s):  
Hong Yan Li

An important approach to studying cipher coprocessors is to focus on the processor's system architecture in combination with reconfigurable design techniques, and improving the coprocessor's performance is a central concern. Based on a very long instruction word (VLIW) structure and reconfigurable design techniques, a specific-instruction cipher coprocessor is designed. This paper studies an instruction-level parallelism compilation technique for the cipher coprocessor, enhancing its performance by increasing the instruction-level parallelism it exploits.



2021 ◽  
Author(s):  
Suzhen Wu ◽  
Jiapeng Wu ◽  
Zhirong Shen ◽  
Zhihao Zhang ◽  
Zuocheng Wang ◽  
...  


1998 ◽  
Vol 08 (02) ◽  
pp. 301-314
Author(s):  
SILVIA M. MUELLER ◽  
WOLFGANG J. PAUL

Hardware scheduling mechanisms are commonly used in current processors in order to make better use of instruction-level parallelism. So far, such a mechanism has been considered correct if it avoids the standard structural and data hazards. However, based on two classical scheduling mechanisms, it is shown that this condition is neither sufficient nor necessary for the correctness of such a mechanism, and that deadlocks are a serious concern in out-of-order execution as well. In addition, the paper provides sufficient conditions for the correctness of scheduling mechanisms.
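To make the "standard" condition concrete, here is an illustrative scoreboard-style issue check in C, assumed for exposition and not taken from the paper's formal model: an instruction issues only if its functional unit is free (no structural hazard) and no pending write targets its source or destination registers (no RAW/WAW data hazard). The paper's point is that checks of this kind alone do not guarantee a correct, deadlock-free scheduler.

```c
/* Illustrative scoreboard-style issue check (names and structure assumed). */
#include <stdbool.h>

#define NUM_REGS  32
#define NUM_UNITS 4

typedef struct {
    int unit;                        /* functional unit the instruction needs */
    int src1, src2, dst;             /* register operands                     */
} instr_t;

static bool unit_busy[NUM_UNITS];    /* structural state                      */
static bool reg_pending[NUM_REGS];   /* a write to this register is in flight */

/* True if the instruction may issue under the standard hazard checks. */
bool can_issue(const instr_t *in)
{
    if (unit_busy[in->unit])                            return false; /* structural */
    if (reg_pending[in->src1] || reg_pending[in->src2]) return false; /* RAW        */
    if (reg_pending[in->dst])                           return false; /* WAW        */
    return true;
}

void issue(const instr_t *in)
{
    unit_busy[in->unit]  = true;
    reg_pending[in->dst] = true;
}
```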



2013 ◽  
Vol 347-350 ◽  
pp. 2850-2855
Author(s):  
Zhen Feng Wu ◽  
Xuan Qin ◽  
Xiao Li Ding

Based on the concept and organization mode of CSCW (computer-supported cooperative work), this paper describes research on establishing an organizational structure with one center and multiple cooperative work nodes. By employing advanced multimedia processing technologies and methods, it builds a cooperative work platform that provides remote video conferencing. Moreover, it focuses on key points including node management, operation information sharing, and operation control of remote nodes, which improve the platform's flexibility, ease of integration, and operational efficiency.



2020 ◽  
Vol 10 (18) ◽  
pp. 6287
Author(s):  
Jian Qin ◽  
Li Liu ◽  
Hui Shen ◽  
Dewen Hu

The graph convolution network has received a lot of attention because it extends convolution to non-Euclidean domains. However, graph pooling, which can learn coarse graph embeddings to facilitate graph classification, has received far less attention. Previous pooling methods assign a score to each node and then pool only the highest-scoring nodes, which may throw away whole neighborhoods of nodes and therefore information. Here, we propose a novel pooling method, UGPool, with a new point of view on selecting nodes. UGPool learns node scores based on node features and uniformly pools neighboring nodes instead of the top nodes in the score space, resulting in a uniformly coarsened graph. In multiple graph classification tasks, including protein graphs, biological graphs, and brain connectivity graphs, we demonstrate that UGPool outperforms other graph pooling methods while maintaining high efficiency. Moreover, we show that UGPool can be integrated with multiple graph convolution networks to effectively improve performance compared to no pooling.
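As a rough sketch of the node-selection step the abstract contrasts, the C fragment below shows baseline top-k pooling (keep only the highest-scoring nodes) next to one plausible reading of "uniform" pooling, namely keeping every r-th node of the score-sorted order so the retained nodes span the whole score range. The uniform rule, the function names, and the data layout are assumptions for illustration, not UGPool's exact algorithm.

```c
/* Illustrative node selection: baseline top-k vs. an assumed uniform variant. */
#include <stdlib.h>

typedef struct { int id; double score; } scored_node_t;

static int by_score_desc(const void *a, const void *b)
{
    double d = ((const scored_node_t *)b)->score - ((const scored_node_t *)a)->score;
    return (d > 0) - (d < 0);
}

/* Baseline: keep the k highest-scoring nodes (may drop whole neighborhoods). */
int select_topk(scored_node_t *nodes, int n, int k, int *keep)
{
    qsort(nodes, n, sizeof *nodes, by_score_desc);
    int m = (k < n) ? k : n;
    for (int i = 0; i < m; i++) keep[i] = nodes[i].id;
    return m;
}

/* Assumed uniform variant: keep every r-th node of the sorted order, so the
 * kept set covers high-, mid-, and low-scoring regions of the graph. */
int select_uniform(scored_node_t *nodes, int n, int r, int *keep)
{
    qsort(nodes, n, sizeof *nodes, by_score_desc);
    int m = 0;
    for (int i = 0; i < n; i += r) keep[m++] = nodes[i].id;
    return m;
}
```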



2015 ◽  
Vol 46 ◽  
pp. 95-111 ◽  
Author(s):  
Manuel F. Dolz ◽  
Francisco D. Igual ◽  
Thomas Ludwig ◽  
Luis Piñuel ◽  
Enrique S. Quintana-Ortí


Micromachines ◽  
2021 ◽  
Vol 12 (5) ◽  
pp. 560
Author(s):  
Zhun Zhang ◽  
Xiang Wang ◽  
Qiang Hao ◽  
Dongdong Xu ◽  
Jinlei Zhang ◽  
...  

Dynamic data security in embedded systems is raising more and more concern in numerous safety-critical applications. In particular, data exchanges between embedded Systems-on-Chip (SoCs) and main memory expose many security vulnerabilities to external attacks, which can cause confidential information leakage and program execution failures for SoCs at key points. Therefore, this paper presents a secure SoC architecture that integrates a four-parallel Advanced Encryption Standard-Galois/Counter Mode (AES-GCM) cryptographic accelerator for high-efficiency data processing, guaranteeing the security of data exchanged between the SoC and main memory against bus monitoring, off-line analysis, and data tampering attacks. The architecture has been implemented and verified on a Xilinx Virtex-5 Field Programmable Gate Array (FPGA) platform. Based on an evaluation of the cryptographic accelerator in terms of performance overhead, security capability, processing efficiency, and resource consumption, experimental results show that the parallel cryptographic accelerator does not incur significant performance overhead while providing confidentiality and integrity protection for exchanged data; its average performance overhead is as low as 2.65% with typical 8-KB I/D-caches, and its data processing efficiency is around 3 times that of a pipelined AES-GCM construction. Under data-tampering attacks and benchmark tests, the reinforced SoC confirms its effectiveness against external physical attacks and achieves a good trade-off between efficiency and hardware overhead.
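For readers unfamiliar with AES-GCM, the sketch below shows the protection it provides in software terms: one pass produces both ciphertext (confidentiality) and a 16-byte authentication tag (integrity), so a later tag mismatch reveals off-line tampering of external memory. This is a host-side example using OpenSSL's EVP API, not the paper's hardware accelerator; the key, IV, and buffer contents are illustrative placeholders. Compile with -lcrypto.

```c
/* Software reference for AES-GCM authenticated encryption of a memory block. */
#include <openssl/evp.h>
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t key[16] = {0};           /* AES-128 key (illustrative)           */
    uint8_t iv[12]  = {0};           /* 96-bit IV, the GCM default length    */
    uint8_t pt[64]  = "cache line written back to external memory";
    uint8_t ct[80], tag[16];
    int len, ct_len;

    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    EVP_EncryptInit_ex(ctx, EVP_aes_128_gcm(), NULL, key, iv);

    EVP_EncryptUpdate(ctx, ct, &len, pt, sizeof pt);   /* encrypt the block   */
    ct_len = len;
    EVP_EncryptFinal_ex(ctx, ct + ct_len, &len);
    ct_len += len;

    /* The tag travels with the ciphertext; verifying it on read-back detects
     * data tampering in external memory. */
    EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, sizeof tag, tag);
    EVP_CIPHER_CTX_free(ctx);

    printf("ciphertext bytes: %d, tag[0]=0x%02x\n", ct_len, tag[0]);
    return 0;
}
```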



2014 ◽  
Vol 23 (05) ◽  
pp. 1450068
Author(s):  
LIBO HUANG ◽  
LI SHEN ◽  
YASHUAI LV ◽  
ZHIYING WANG ◽  
KUI DAI

Multicore designs have become the dominant organization for future high-performance microprocessors. Instead of increasing cache sizes, clock frequencies, pipeline depths, or register file (RF) ports, multicore designs tend to make each processor core simple but highly efficient. This new dimension for improving performance and power efficiency in multicores requires us to rethink processor architecture. The multiply-accumulate (MAC) operation is one such performance-improvement technique that needs to be revisited. MAC operations are fundamental to many DSP and multimedia applications, but they tend to be awkward to implement in an orthogonal instruction set architecture (ISA) because of operand bandwidth, instruction encoding, and hardware cost problems. A key question, then, is whether MAC should be supported in high-efficiency processor designs. This paper presents a comparative study of this question and introduces data-bandwidth relaxing techniques that work around the narrow operand bandwidth of two-port RFs. Trade-offs are also made to address the instruction encoding and hardware cost problems. The resulting design wisdom is: if you support the multiply (MUL) operation, then also support the MAC operation.
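The appeal of MAC is easiest to see in the inner loop of DSP kernels such as dot products and FIR filters: each iteration is a multiply feeding a dependent add, which a MAC instruction collapses into one operation. It also shows the operand-bandwidth problem the abstract mentions, since a MAC reads three source operands (two multiplicands plus the accumulator) while a two-port RF supplies only two. A minimal C sketch follows (compile with -lm); the kernel is illustrative and not from the paper.

```c
/* Dot product: acc += x[i] * y[i] maps naturally to a MAC each iteration. */
#include <math.h>
#include <stdio.h>

double dot(const double *x, const double *y, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc = fma(x[i], y[i], acc);   /* fused multiply-add from <math.h>;
                                         three source operands per iteration */
    return acc;
}

int main(void)
{
    double x[4] = {1, 2, 3, 4}, y[4] = {5, 6, 7, 8};
    printf("%f\n", dot(x, y, 4));     /* prints 70.000000 */
    return 0;
}
```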



2012 ◽  
Vol 21 (02) ◽  
pp. 1240002
Author(s):  
SANTHOSH VERMA ◽  
DAVID M. KOPPELMAN

A major performance limiter in modern processors is the long latencies caused by data cache misses. Both compiler-based and hardware-based prefetching schemes help hide these latencies and so improve performance. Compiler techniques infer memory access patterns through code analysis and insert appropriate prefetch instructions. Hardware prefetching techniques work independently of the compiler by monitoring an access stream, detecting patterns in this stream, and issuing prefetches based on these patterns. This paper looks at the interplay between compiler-based and hardware-based prefetching techniques. Does either technique make the other unnecessary? First, the compiler's ability to achieve good results without extreme expertise is evaluated by preparing binaries with no prefetch, one-flag prefetch (no tuning), and expertly tuned prefetch. From runs of SPEC CPU2006 binaries, we find that expertise avoids minor slowdowns in a few benchmarks and provides substantial speedups in others. We compare software schemes to hardware prefetching schemes, and our simulations show that software alone substantially outperforms hardware alone on about half of a selection of benchmarks. While hardware matches or exceeds software in a few cases, software is better on average. Analysis reveals that in many cases hardware is not prefetching access patterns that it is capable of recognizing, due to irregularities in the observed miss sequence. Hardware outperforms software on address sequences that the compiler would not guess. In general, while software is better at prefetching individual loads, hardware partly compensates by identifying more loads to prefetch. Using the two schemes together provides further benefits, but less than the sum of the contributions of each alone.
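As a concrete illustration of the software side, the sketch below shows a prefetch inserted a fixed distance ahead of the current element so the data arrives before it is used, the kind of instruction a compiler or an expert programmer adds. __builtin_prefetch is the GCC/Clang builtin; the distance of 16 elements is an illustrative guess, and tuning it per loop and per machine is exactly the expertise the abstract evaluates.

```c
/* Software-prefetched reduction over a large array (illustrative). */
#include <stddef.h>

#define PREFETCH_DIST 16          /* elements ahead; needs per-machine tuning */

double sum_array(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 1); /* read access,
                                                                low temporal locality */
        s += a[i];
    }
    return s;
}
```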


