NCOR: An FPGA-Friendly Nonblocking Data Cache for Soft Processors with Runahead Execution

2012 ◽  
Vol 2012 ◽  
pp. 1-12 ◽  
Author(s):  
Kaveh Aasaraai ◽  
Andreas Moshovos

Soft processors often use data caches to reduce the gap between processor and main memory speeds. To keep the design efficient, simple blocking caches are typically used. Such caches are not appropriate for processor designs such as Runahead and out-of-order execution, which require nonblocking caches to tolerate main memory latencies and extract memory-level parallelism. However, conventional nonblocking cache designs are expensive and slow on FPGAs because they use content-addressable memories (CAMs). This work proposes NCOR, an FPGA-friendly nonblocking cache that exploits the key properties of Runahead execution. NCOR does not require CAMs, relying instead on a smart cache controller. A 4 KB NCOR operates at 329 MHz on Stratix III FPGAs while using only 270 logic elements. A 32 KB NCOR operates at 278 MHz and uses 269 logic elements.
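The abstract's central claim is that Runahead execution lets the cache track outstanding misses without CAM-based miss handlers. The following is a minimal C sketch of that idea, not the paper's RTL: it assumes a direct-mapped cache with illustrative sizes and field names (frame_t, miss_pending, access_cache are not from the paper), and keeps pending-miss state in the cache frame itself so it can be found by direct indexing rather than associative search.

```c
/* Minimal sketch (not the paper's RTL): tracking outstanding misses without a CAM.
 * Conventional nonblocking caches search CAM-based MSHRs on every access; here the
 * miss state lives in the cache frame and is found by direct indexing. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_FRAMES 256           /* assumed direct-mapped cache for simplicity */

typedef struct {
    uint32_t tag;
    bool     valid;
    bool     miss_pending;       /* set while a fill for this frame is in flight */
} frame_t;

static frame_t cache[NUM_FRAMES];

/* Returns true on hit; on a miss, marks the frame pending instead of
 * allocating a CAM-searched MSHR entry. */
bool access_cache(uint32_t addr)
{
    uint32_t idx = (addr >> 6) & (NUM_FRAMES - 1);   /* 64-byte blocks assumed */
    uint32_t tag = addr >> 14;
    frame_t *f = &cache[idx];

    if (f->miss_pending)         /* secondary access to an in-flight block:   */
        return false;            /* treated as a miss; Runahead just moves on */

    if (f->valid && f->tag == tag)
        return true;             /* ordinary hit */

    f->miss_pending = true;      /* primary miss: remember it in the frame;   */
    f->tag = tag;                /* the fill response indexes back by `idx`   */
    return false;
}

/* Called when the memory fill for frame `idx` returns. */
void fill_complete(uint32_t idx)
{
    cache[idx].miss_pending = false;
    cache[idx].valid = true;
}
```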


2021 ◽  
Author(s):  
Bashar Romanous ◽  
Skyler Windh ◽  
Ildar Absalyamov ◽  
Prerna Budhkar ◽  
Robert Halstead ◽  
...  

The join and group-by aggregation are two memory-intensive operators that affect the performance of relational databases. Hashing is a common approach used to implement both operators. Recent paradigm shifts in multi-core processor architectures have reinvigorated research into how the join and group-by aggregation operators can leverage these advances. However, the poor spatial locality of the hashing approach has hindered performance on multi-core processor architectures, which rely on large cache hierarchies for latency mitigation. Multithreaded architectures can better cope with poor spatial locality by masking memory latency with many outstanding requests. Nevertheless, the number of parallel threads, even in the most advanced multithreaded processors, such as UltraSPARC, is not enough to fully cover the main memory access latency. In this paper, we explore the hardware re-configurability of FPGAs to enable deeper execution pipelines that maintain hundreds (instead of tens) of outstanding memory requests across four FPGAs, drastically increasing concurrency and throughput. We present two end-to-end in-memory accelerators for the join and group-by aggregation operators using FPGAs. Both accelerators use massive multithreading to mask the long memory delays of traversing linked-list data structures, while concurrently managing hundreds of thread states across four FPGAs locally. We explore how content-addressable memories (CAMs) can be intermixed within our multithreaded designs to act as a synchronizing cache, which enforces locks and merges jobs together before they are written to memory. Throughput results for our hash-join operator accelerator show a speedup between 2× and 3.4× over the best multi-core approaches with comparable memory bandwidths on uniform and skewed datasets. The accelerator for the hash-based group-by aggregation operator demonstrates that leveraging CAMs achieves an average speedup of 3.3×, with a best case of 9.4×, in terms of throughput over CPU implementations across five types of data distributions.
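For orientation, here is a minimal software sketch of the hash-join pattern the accelerators target: build a chained hash table on the inner relation, then probe it by walking linked lists. Names, sizes, and the hash function are illustrative, not from the paper; the FPGA design differs in that it keeps hundreds of such probes in flight to hide the latency of each pointer chase.

```c
/* Chained hash join: build on inner relation, probe with outer keys.
 * Every `->next` dereference is a dependent memory access, which is what
 * the paper's massive multithreading overlaps. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NUM_BUCKETS 1024

typedef struct node {
    uint32_t key;
    uint32_t payload;
    struct node *next;           /* chaining: the pointer chase that stalls CPUs */
} node_t;

static node_t *buckets[NUM_BUCKETS];

static uint32_t hash(uint32_t key) { return (key * 2654435761u) & (NUM_BUCKETS - 1); }

/* Build phase: insert each inner tuple at the head of its bucket's list. */
void build(uint32_t key, uint32_t payload)
{
    node_t *n = malloc(sizeof *n);
    n->key = key; n->payload = payload;
    uint32_t b = hash(key);
    n->next = buckets[b];
    buckets[b] = n;
}

/* Probe phase: each outer key walks one linked list looking for matches. */
void probe(uint32_t key)
{
    for (node_t *n = buckets[hash(key)]; n != NULL; n = n->next)
        if (n->key == key)
            printf("match: key=%u payload=%u\n", (unsigned)key, (unsigned)n->payload);
}

int main(void)
{
    build(7, 100); build(7, 200); build(42, 300);
    probe(7);                    /* prints both matching inner tuples */
    probe(42);
    return 0;
}
```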



2011 ◽  
Vol 130-134 ◽  
pp. 2907-2910
Author(s):  
Hong Yan Li

An important approach to studying cipher coprocessors is to focus on the processor's system architecture in combination with reconfigurable design techniques, and improving the coprocessor's performance is a central concern. Based on a very long instruction word (VLIW) structure and reconfigurable design techniques, a specific-instruction cipher coprocessor is designed. This paper studies an instruction-level parallelism compilation technique for the cipher coprocessor, enhancing its performance by increasing the instruction-level parallelism it exploits.



2021 ◽  
Author(s):  
Suzhen Wu ◽  
Jiapeng Wu ◽  
Zhirong Shen ◽  
Zhihao Zhang ◽  
Zuocheng Wang ◽  
...  


1998 ◽  
Vol 08 (02) ◽  
pp. 301-314
Author(s):  
SILVIA M. MUELLER ◽  
WOLFGANG J. PAUL

Hardware scheduling mechanisms are commonly used in current processors in order to make better use of instruction-level parallelism. So far, such a mechanism has been considered correct if it avoids the standard structural and data hazards. However, based on two classical scheduling mechanisms, it is shown that this condition is neither sufficient nor necessary for the correctness of such a mechanism, and that deadlocks are a serious concern in out-of-order execution as well. In addition, the paper provides sufficient conditions for the correctness of scheduling mechanisms.
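To make the "standard" condition concrete, here is an illustrative scoreboard-style issue check in C, assumed for exposition and not taken from the paper's formal model: an instruction issues only if its functional unit is free (no structural hazard) and no pending write targets its source or destination registers (no RAW/WAW data hazard). The paper's point is that checks of this kind alone do not guarantee a correct, deadlock-free scheduler.

```c
/* Illustrative scoreboard-style issue check (names and structure assumed). */
#include <stdbool.h>

#define NUM_REGS  32
#define NUM_UNITS 4

typedef struct {
    int unit;                        /* functional unit the instruction needs */
    int src1, src2, dst;             /* register operands                     */
} instr_t;

static bool unit_busy[NUM_UNITS];    /* structural state                      */
static bool reg_pending[NUM_REGS];   /* a write to this register is in flight */

/* True if the instruction may issue under the standard hazard checks. */
bool can_issue(const instr_t *in)
{
    if (unit_busy[in->unit])                            return false; /* structural */
    if (reg_pending[in->src1] || reg_pending[in->src2]) return false; /* RAW        */
    if (reg_pending[in->dst])                           return false; /* WAW        */
    return true;
}

void issue(const instr_t *in)
{
    unit_busy[in->unit]  = true;
    reg_pending[in->dst] = true;
}
```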



2013 ◽  
Vol 347-350 ◽  
pp. 2850-2855
Author(s):  
Zhen Feng Wu ◽  
Xuan Qin ◽  
Xiao Li Ding

Based on the concept and organization mode of CSCW (computer-supported cooperative work), this paper describes research on establishing an organizational structure with one center and multiple cooperative work nodes. By employing advanced multimedia processing technologies and methods, it builds a cooperative work platform that provides remote video conferencing. Moreover, it focuses on key points including node management, operation information sharing, and operation control of remote nodes, which improve the platform's flexibility, ease of integration, and operational efficiency.



2020 ◽  
Vol 10 (18) ◽  
pp. 6287
Author(s):  
Jian Qin ◽  
Li Liu ◽  
Hui Shen ◽  
Dewen Hu

The graph convolution network has received a lot of attention because it extends convolution to non-Euclidean domains. However, graph pooling, which can learn coarse graph embeddings to facilitate graph classification, has received far less attention. Previous pooling methods assign a score to each node and then pool only the highest-scoring nodes, which may throw away whole neighborhoods of nodes and therefore information. Here, we propose a novel pooling method, UGPool, with a new point of view on selecting nodes. UGPool learns node scores based on node features and uniformly pools neighboring nodes instead of the top nodes in the score space, resulting in a uniformly coarsened graph. In multiple graph classification tasks, including protein graphs, biological graphs, and brain connectivity graphs, we demonstrate that UGPool outperforms other graph pooling methods while maintaining high efficiency. Moreover, we show that UGPool can be integrated with multiple graph convolution networks to effectively improve performance compared to no pooling.
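As a rough sketch of the node-selection step the abstract contrasts, the C fragment below shows baseline top-k pooling (keep only the highest-scoring nodes) next to one plausible reading of "uniform" pooling, namely keeping every r-th node of the score-sorted order so the retained nodes span the whole score range. The uniform rule, the function names, and the data layout are assumptions for illustration, not UGPool's exact algorithm.

```c
/* Illustrative node selection: baseline top-k vs. an assumed uniform variant. */
#include <stdlib.h>

typedef struct { int id; double score; } scored_node_t;

static int by_score_desc(const void *a, const void *b)
{
    double d = ((const scored_node_t *)b)->score - ((const scored_node_t *)a)->score;
    return (d > 0) - (d < 0);
}

/* Baseline: keep the k highest-scoring nodes (may drop whole neighborhoods). */
int select_topk(scored_node_t *nodes, int n, int k, int *keep)
{
    qsort(nodes, n, sizeof *nodes, by_score_desc);
    int m = (k < n) ? k : n;
    for (int i = 0; i < m; i++) keep[i] = nodes[i].id;
    return m;
}

/* Assumed uniform variant: keep every r-th node of the sorted order, so the
 * kept set covers high-, mid-, and low-scoring regions of the graph. */
int select_uniform(scored_node_t *nodes, int n, int r, int *keep)
{
    qsort(nodes, n, sizeof *nodes, by_score_desc);
    int m = 0;
    for (int i = 0; i < n; i += r) keep[m++] = nodes[i].id;
    return m;
}
```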



2015 ◽  
Vol 46 ◽  
pp. 95-111 ◽  
Author(s):  
Manuel F. Dolz ◽  
Francisco D. Igual ◽  
Thomas Ludwig ◽  
Luis Piñuel ◽  
Enrique S. Quintana-Ortí


Micromachines ◽  
2021 ◽  
Vol 12 (5) ◽  
pp. 560
Author(s):  
Zhun Zhang ◽  
Xiang Wang ◽  
Qiang Hao ◽  
Dongdong Xu ◽  
Jinlei Zhang ◽  
...  

Dynamic data security in embedded systems is raising more and more concern in numerous safety-critical applications. In particular, data exchanges between embedded Systems-on-Chip (SoCs) and main memory expose many security vulnerabilities to external attacks, which can cause confidential information leakage and program execution failures for SoCs at key points. Therefore, this paper presents a secure SoC architecture that integrates a four-parallel Advanced Encryption Standard-Galois/Counter Mode (AES-GCM) cryptographic accelerator for high-efficiency data processing, guaranteeing the security of data exchanged between the SoC and main memory against bus monitoring, off-line analysis, and data tampering attacks. The architecture has been implemented and verified on a Xilinx Virtex-5 Field Programmable Gate Array (FPGA) platform. Based on an evaluation of the cryptographic accelerator in terms of performance overhead, security capability, processing efficiency, and resource consumption, experimental results show that the parallel cryptographic accelerator does not incur significant performance overhead while providing confidentiality and integrity protection for exchanged data; its average performance overhead is as low as 2.65% with typical 8-KB I/D-caches, and its data processing efficiency is around 3 times that of a pipelined AES-GCM construction. Under data-tampering attacks and benchmark tests, the reinforced SoC confirms its effectiveness against external physical attacks and achieves a good trade-off between efficiency and hardware overhead.
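For readers unfamiliar with AES-GCM, the sketch below shows the protection it provides in software terms: one pass produces both ciphertext (confidentiality) and a 16-byte authentication tag (integrity), so a later tag mismatch reveals off-line tampering of external memory. This is a host-side example using OpenSSL's EVP API, not the paper's hardware accelerator; the key, IV, and buffer contents are illustrative placeholders. Compile with -lcrypto.

```c
/* Software reference for AES-GCM authenticated encryption of a memory block. */
#include <openssl/evp.h>
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t key[16] = {0};           /* AES-128 key (illustrative)           */
    uint8_t iv[12]  = {0};           /* 96-bit IV, the GCM default length    */
    uint8_t pt[64]  = "cache line written back to external memory";
    uint8_t ct[80], tag[16];
    int len, ct_len;

    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    EVP_EncryptInit_ex(ctx, EVP_aes_128_gcm(), NULL, key, iv);

    EVP_EncryptUpdate(ctx, ct, &len, pt, sizeof pt);   /* encrypt the block   */
    ct_len = len;
    EVP_EncryptFinal_ex(ctx, ct + ct_len, &len);
    ct_len += len;

    /* The tag travels with the ciphertext; verifying it on read-back detects
     * data tampering in external memory. */
    EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, sizeof tag, tag);
    EVP_CIPHER_CTX_free(ctx);

    printf("ciphertext bytes: %d, tag[0]=0x%02x\n", ct_len, tag[0]);
    return 0;
}
```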



2014 ◽  
Vol 23 (05) ◽  
pp. 1450068
Author(s):  
LIBO HUANG ◽  
LI SHEN ◽  
YASHUAI LV ◽  
ZHIYING WANG ◽  
KUI DAI

Multicore designs have become the dominant organization for future high-performance microprocessors. Instead of increasing cache sizes, clock frequencies, pipeline depths, or register file (RF) ports, multicore designs tend to make each processor core simple but highly efficient. This new dimension for improving performance and power efficiency in multicores requires us to rethink processor architecture. The multiply-accumulate (MAC) operation is one such performance-improvement technique that needs to be revisited. MAC operations are fundamental to many DSP and multimedia applications, but they tend to be awkward to implement in an orthogonal instruction set architecture (ISA) because of operand bandwidth, instruction encoding, and hardware cost problems. A key question, then, is whether MAC should be supported in high-efficiency processor designs. This paper presents a comparative study of this question and introduces data-bandwidth relaxing techniques that work around the narrow operand bandwidth of two-port RFs. Trade-offs are also made to address the instruction encoding and hardware cost problems. The resulting design wisdom is: if you support the multiply (MUL) operation, then also support the MAC operation.
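The appeal of MAC is easiest to see in the inner loop of DSP kernels such as dot products and FIR filters: each iteration is a multiply feeding a dependent add, which a MAC instruction collapses into one operation. It also shows the operand-bandwidth problem the abstract mentions, since a MAC reads three source operands (two multiplicands plus the accumulator) while a two-port RF supplies only two. A minimal C sketch follows (compile with -lm); the kernel is illustrative and not from the paper.

```c
/* Dot product: acc += x[i] * y[i] maps naturally to a MAC each iteration. */
#include <math.h>
#include <stdio.h>

double dot(const double *x, const double *y, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc = fma(x[i], y[i], acc);   /* fused multiply-add from <math.h>;
                                         three source operands per iteration */
    return acc;
}

int main(void)
{
    double x[4] = {1, 2, 3, 4}, y[4] = {5, 6, 7, 8};
    printf("%f\n", dot(x, y, 4));     /* prints 70.000000 */
    return 0;
}
```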



2012 ◽  
Vol 21 (02) ◽  
pp. 1240002
Author(s):  
SANTHOSH VERMA ◽  
DAVID M. KOPPELMAN

A major performance limiter in modern processors is the long latencies caused by data cache misses. Both compiler-based and hardware-based prefetching schemes help hide these latencies and so improve performance. Compiler techniques infer memory access patterns through code analysis and insert appropriate prefetch instructions. Hardware prefetching techniques work independently of the compiler by monitoring an access stream, detecting patterns in this stream, and issuing prefetches based on these patterns. This paper looks at the interplay between compiler-based and hardware-based prefetching techniques. Does either technique make the other unnecessary? First, the compiler's ability to achieve good results without extreme expertise is evaluated by preparing binaries with no prefetch, one-flag prefetch (no tuning), and expertly tuned prefetch. From runs of SPEC CPU2006 binaries, we find that expertise avoids minor slowdowns in a few benchmarks and provides substantial speedups in others. We compare software schemes to hardware prefetching schemes, and our simulations show that software alone substantially outperforms hardware alone on about half of a selection of benchmarks. While hardware matches or exceeds software in a few cases, software is better on average. Analysis reveals that in many cases hardware is not prefetching access patterns that it is capable of recognizing, due to irregularities in the observed miss sequence. Hardware outperforms software on address sequences that the compiler would not guess. In general, while software is better at prefetching individual loads, hardware partly compensates by identifying more loads to prefetch. Using the two schemes together provides further benefits, but less than the sum of the contributions of each alone.
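As a concrete illustration of the software side, the sketch below shows a prefetch inserted a fixed distance ahead of the current element so the data arrives before it is used, the kind of instruction a compiler or an expert programmer adds. __builtin_prefetch is the GCC/Clang builtin; the distance of 16 elements is an illustrative guess, and tuning it per loop and per machine is exactly the expertise the abstract evaluates.

```c
/* Software-prefetched reduction over a large array (illustrative). */
#include <stddef.h>

#define PREFETCH_DIST 16          /* elements ahead; needs per-machine tuning */

double sum_array(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 1); /* read access,
                                                                low temporal locality */
        s += a[i];
    }
    return s;
}
```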


