Augmenting loop tiling with data alignment for improved cache performance

Embedded systems are designed for a variety of applications, ranging from hard real-time applications to mobile computing, and these demand different cache designs for good performance. Since real-time applications place stringent requirements on performance, the cache subsystem plays a significant role. Reconfigurable caches can meet performance requirements in this context. Existing reconfigurable caches tend to vary associativity and size to maximize cache performance. This article proposes a novel approach: a reconfigurable and intelligent L1 data cache based on replacement algorithms. An intelligent embedded data cache and a dynamically reconfigurable intelligent embedded data cache have been implemented in Verilog 2001 and tested for cache performance. Data collected by enabling the cache with two different replacement strategies show that the hit rate improves by 40% compared to LRU and by 21% compared to MRU for sequential applications, which will significantly improve the performance of embedded real-time applications.
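To illustrate why the choice of replacement algorithm can matter for sequential access patterns, here is a minimal Python sketch. It is not the article's Verilog design: the fully associative cache model, the function name `simulate`, the trace, and the capacity are all assumptions made for illustration. It compares plain LRU and MRU eviction on a cyclic scan that is larger than the cache.

```python
from collections import OrderedDict


def simulate(trace, capacity, policy):
    """Return the hit rate of a hypothetical fully associative cache.

    policy is "lru" (evict least recently used) or "mru" (evict most
    recently used). This is an illustrative model, not the article's cache.
    """
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                # last=False pops the LRU entry, last=True pops the MRU entry
                cache.popitem(last=(policy == "mru"))
            cache[block] = True
    return hits / len(trace)


# Assumed workload: 10 sequential passes over 6 blocks with a 4-block cache.
trace = list(range(6)) * 10
lru_rate = simulate(trace, 4, "lru")
mru_rate = simulate(trace, 4, "mru")
print(f"LRU hit rate: {lru_rate:.2f}, MRU hit rate: {mru_rate:.2f}")
```

On this assumed trace LRU never hits, because each block is evicted just before it is reused, while MRU keeps part of the working set resident. The sketch only shows that hit rate depends on the replacement policy; it does not reproduce the figures reported in the abstract.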


Loop Transforming for Reducing Data Alignment on Multi-Core SIMD Processors

Journal of Signal Processing Systems ◽ 10.1007/s11265-013-0754-2 ◽ 2013 ◽ Vol 74 (2) ◽ pp. 137-150

Author(s): Yi Wang ◽ Linfeng Pan ◽ Zili Shao ◽ Yong Guan ◽ Minyi Guo

Keyword(s): Data Alignment ◽ SIMD Processors


Combining Retiming and Scheduling Techniques for Loop Parallelization and Loop Tiling

Parallel Processing Letters ◽ 10.1142/s0129626497000383 ◽ 1997 ◽ Vol 07 (04) ◽ pp. 379-392 ◽ Cited By ~ 20

Author(s): Alain Darte ◽ Georges-André Silber ◽ Frédéric Vivien

Keyword(s): Dependence Graph ◽ Loop Tiling ◽ Loop Parallelization ◽ Fine Grain ◽ Loop Body ◽ Nested Loops ◽ Single Block ◽ The Way

Tiling is a technique for exploiting medium-grain parallelism in nested loops. It relies on a first step that detects sets of permutable nested loops. All algorithms developed so far treat the statements of the loop body as a single block; in other words, they cannot take advantage of the structure of dependences between different statements. In this paper, we overcome this limitation by showing how the structure of the reduced dependence graph can be taken into account to detect more permutable loops. Our method combines graph retiming techniques and graph scheduling techniques. It can be viewed as an extension of Wolf and Lam's algorithm to the case of loops with multiple statements. Loop-independent dependences play a particular role in our study, and we show how the way we handle them can also be useful for fine-grain loop parallelization.
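For readers unfamiliar with tiling, the following Python sketch shows the basic transformation on a fully permutable loop nest. It does not implement the paper's retiming and scheduling method. The matrix-multiplication kernel, the function names `matmul` and `matmul_tiled`, and the tile size are assumptions chosen only to illustrate what tiling a set of permutable loops looks like.

```python
def matmul(a, b, n):
    """Reference triple loop: c = a * b for n x n integer matrices."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile):
    """Same computation with all three loops tiled by `tile`.

    The i, j, k loops are permutable here, so the iteration space can be
    cut into tile x tile x tile blocks and the blocks visited in turn.
    """
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):          # loops over tiles
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # loops inside one tile; min() handles partial edge tiles
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        for k in range(kk, min(kk + tile, n)):
                            c[i][j] += a[i][k] * b[k][j]
    return c


# Assumed example input: 5 x 5 integer matrices, tile size 2.
n = 5
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j + 1 for j in range(n)] for i in range(n)]
same = matmul(a, b, n) == matmul_tiled(a, b, n, 2)
print("tiled result matches reference:", same)
```

The tiled version is legal only because the loops are permutable, which is why the detection step described in the abstract comes first. Different tiles can then be assigned to different processors or sized to fit a cache level.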


Optimizing software cache performance of packet processing applications

ACM SIGPLAN Notices ◽ 10.1145/1273444.1254808 ◽ 2007 ◽ Vol 42 (7) ◽ pp. 227-236

Author(s): Qin Wang ◽ Junpu Chen ◽ Weihua Zhang ◽ Min Yang ◽ Binyu Zang

Keyword(s): Packet Processing ◽ Cache Performance ◽ Software Cache
