Study on Explicit Memory Management for CBEA Green Computing Architecture

Heterogeneous multi-core processors are attractive for power-efficient green computing because they can meet varied resource requirements. The multi-level memory hierarchy of the Cell Broadband Engine Architecture (CBEA), which must be managed explicitly by software, poses significant challenges for both performance and programming. In this paper, based on an analysis of the architecture's characteristics, we implement four access methods and a corresponding access library with a uniform memory access interface. Besides improving performance over existing techniques, the library can collect profile information about memory management for further performance optimization. Experimental results show that the proposed method outperforms related work and that the profile information it provides helps programmers optimize application performance.
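
As an illustration only, the idea of a uniform access interface that also gathers profile information might be sketched as follows. The names (`ProfiledMemory`, `read`, `write`, `profile`), the first-in-first-out eviction policy, and the counters are all hypothetical and are not taken from the paper; the sketch models a software-managed local store in plain Python rather than on CBEA hardware.

```python
from collections import Counter


class ProfiledMemory:
    """Toy uniform access interface over a backing store with a small local buffer.

    Hypothetical sketch: names and policy are assumptions, not the paper's library.
    """

    def __init__(self, main_memory, local_capacity):
        self.main = main_memory          # stands in for main memory
        self.capacity = local_capacity   # entries the local store can hold
        self.local = {}                  # stands in for the software-managed local store
        self.stats = Counter()           # profile information gathered per access

    def _fetch(self, addr):
        # Hit: the data is already in the local store.
        if addr in self.local:
            self.stats["local_hits"] += 1
            return
        # Miss: model an explicit transfer from main memory.
        self.stats["transfers"] += 1
        if len(self.local) >= self.capacity:
            # Evict the oldest entry (dicts keep insertion order) and write it back.
            victim = next(iter(self.local))
            self.main[victim] = self.local.pop(victim)
        self.local[addr] = self.main[addr]

    def read(self, addr):
        self.stats["reads"] += 1
        self._fetch(addr)
        return self.local[addr]

    def write(self, addr, value):
        self.stats["writes"] += 1
        self._fetch(addr)
        self.local[addr] = value

    def profile(self):
        # Snapshot of the counters collected so far.
        return dict(self.stats)


mem = ProfiledMemory(main_memory=list(range(8)), local_capacity=2)
total = sum(mem.read(i % 3) for i in range(6))
print(total, mem.profile())
```

In this toy run the working set (three addresses) is larger than the local buffer (two entries), so the profile reports a transfer for every read. That is the kind of signal a programmer could use to resize buffers or reorder accesses.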

Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided

Scientific Programming ◽

10.1155/2014/571902 ◽

2014 ◽

Vol 22 (2) ◽

pp. 75-91 ◽

Cited By ~ 12

Author(s):

Robert Gerstenberger ◽

Maciej Besta ◽

Torsten Hoefler

Keyword(s):

Message Passing ◽

Direct Memory Access ◽

Memory Access ◽

Remote Memory ◽

Memory Consumption ◽

Performance Models ◽

Application Performance ◽

Performance Improvements ◽

Programming Interface ◽

Better Than

Modern interconnects offer remote direct memory access (RDMA) features. Yet most applications rely on explicit message passing for communication despite its unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly; however, its scalability and practicability have yet to be demonstrated in practice. In this work, we develop scalable bufferless protocols that implement the MPI-3.0 specification. Our protocols support scaling to millions of cores with negligible memory consumption while providing the highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for all critical functions and demonstrate the usability of our library and models in several application studies with up to half a million processes. We show that our design is comparable to, or better than, UPC and Fortran Coarrays in terms of latency, bandwidth, and message rate. We also demonstrate application performance improvements with comparable programming complexity.

On the Applicability of PEBS based Online Memory Access Tracking for Heterogeneous Memory Management at Scale

Proceedings of the Workshop on Memory Centric High Performance Computing ◽

10.1145/3286475.3286477 ◽

2018 ◽

Author(s):

Aleix Roca Nonell ◽

Balazs Gerofi ◽

Leonardo Bautista-Gomez ◽

Dominique Martinet ◽

Vicenç Beltran Querol ◽

...

Keyword(s):

Memory Management ◽

Memory Access

Online Thread and Data Mapping Using a Sharing-Aware Memory Management Unit

ACM Transactions on Modeling and Performance Evaluation of Computing Systems ◽

10.1145/3433687 ◽

2021 ◽

Vol 5 (4) ◽

pp. 1-28

Author(s):

Eduardo H. M. Cruz ◽

Matthias Diener ◽

Laércio L. Pilla ◽

Philippe O. A. Navaux

Keyword(s):

Energy Efficiency ◽

Memory Management ◽

Substantial Reduction ◽

Management Unit ◽

Memory Access ◽

Parallel Applications ◽

Data Mapping ◽

Wide Range ◽

Memory Accesses ◽

Level Parallelism

Current and future architectures rely on thread-level parallelism to sustain performance growth. These architectures have introduced a complex memory hierarchy, consisting of several cores organized hierarchically with multiple cache levels and NUMA nodes. These memory hierarchies can have an impact on the performance and energy efficiency of parallel applications as the importance of memory access locality is increased. In order to improve locality, the analysis of the memory access behavior of parallel applications is critical for mapping threads and data. Nevertheless, most previous work relies on indirect information about the memory accesses, or does not combine thread and data mapping, resulting in less accurate mappings. In this paper, we propose the Sharing-Aware Memory Management Unit (SAMMU), an extension to the memory management unit that allows it to detect the memory access behavior in hardware. With this information, the operating system can perform online mapping without any previous knowledge about the behavior of the application. In the evaluation with a wide range of parallel applications (NAS Parallel Benchmarks and PARSEC Benchmark Suite), performance was improved by up to 35.7% (10.0% on average) and energy efficiency was improved by up to 11.9% (4.1% on average). These improvements happened due to a substantial reduction of cache misses and interconnection traffic.
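
To make the notion of sharing-aware mapping concrete, here is a minimal software sketch of building a sharing matrix from a memory access trace. It is not the SAMMU hardware mechanism described in the abstract; the function names, the trace format of `(thread id, byte address)` pairs, and the 4096-byte page size are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def sharing_matrix(trace, num_threads, page_size=4096):
    """Count, for every pair of threads, how many pages both of them touched."""
    threads_per_page = defaultdict(set)
    for thread_id, address in trace:
        threads_per_page[address // page_size].add(thread_id)

    matrix = [[0] * num_threads for _ in range(num_threads)]
    for sharers in threads_per_page.values():
        # Every pair of threads that touched this page shares it once.
        for a, b in combinations(sorted(sharers), 2):
            matrix[a][b] += 1
            matrix[b][a] += 1
    return matrix


def best_partner(matrix, thread_id):
    """Return the thread that shares the most pages with thread_id."""
    row = matrix[thread_id]
    candidates = [t for t in range(len(row)) if t != thread_id]
    return max(candidates, key=lambda t: row[t])


# Hypothetical trace of (thread id, byte address) pairs.
trace = [(0, 0), (1, 100), (0, 4096), (1, 5000), (2, 8192), (3, 8200), (2, 0)]
matrix = sharing_matrix(trace, num_threads=4)
print(matrix)
print(best_partner(matrix, 0), best_partner(matrix, 3))
```

A mapping policy could then place each thread close to its best partner, for example on cores that share a cache or a NUMA node, which is the kind of decision the abstract attributes to the operating system.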

Impact of CC-NUMA memory management policies on the application performance of multistage switching networks

IEEE Transactions on Parallel and Distributed Systems ◽

10.1109/71.841740 ◽

2000 ◽

Vol 11 (3) ◽

pp. 230-246 ◽

Cited By ~ 7

Author(s):

L.N. Bhuyan ◽

R. Iyer ◽

H.-J. Wang ◽

A. Kumar

Keyword(s):

Memory Management ◽

Switching Networks ◽

Application Performance ◽

Management Policies

Experimental Demonstration of Disaggregating Optical Interconnection Network to Enable Application Performance Optimization

2018 Photonics in Switching and Computing (PSC) ◽

10.1109/ps.2018.8751263 ◽

2018 ◽

Author(s):

Cen Wang ◽

Xiong Gao ◽

Takehiro Tsuritani ◽

Seiya Sumita ◽

Hongxiang Guo ◽

...

Keyword(s):

Performance Optimization ◽

Interconnection Network ◽

Optical Interconnection ◽

Experimental Demonstration ◽

Application Performance

Application Performance of Super Ferrite Stainless Steel on Condenser in Power Plant

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.651-653.3 ◽

2014 ◽

Vol 651-653 ◽

pp. 3-6

Author(s):

Xiao Rui Guan ◽

Da Lei Zhang ◽

You Hai Jin

Keyword(s):

Stainless Steel ◽

Corrosion Resistance ◽

Power Plant ◽

Mechanical Test ◽

Cooling Water ◽

Electrochemical Test ◽

Application Performance ◽

Shock Resistance ◽

Rate Of Cooling ◽

Better Than

The application performance of TA2 titanium, S44660 super ferrite stainless steel, and B30 cupronickel, which are widely used in power plants, was studied using electrochemical and mechanical tests. The results show that the corrosion resistance of B30 is significantly lower than that of TA2 and S44660, and that the corrosion resistance of S44660 is superior to that of TA2. The yield strength and tensile strength of S44660 are higher than those of TA2 and B30. When the thickness of the cooling tubes, the flow rate of the cooling water, and the clean coefficient are taken into account, the thermal conductivities of the three materials differ little. The shock resistance of S44660 is better than that of TA2 and B30. S44660 contains a small amount of Ni, which greatly improves the anti-cracking ability of the base metal and the weld bead.

Learning “How to Learn”: Super Declarative Motor Learning Is Impaired in Parkinson’s Disease

Neural Plasticity ◽

10.1155/2017/3162087 ◽

2017 ◽

Vol 2017 ◽

pp. 1-8 ◽

Cited By ~ 2

Author(s):

Lucio Marinelli ◽

Carlo Trompetto ◽

Stefania Canneva ◽

Laura Mori ◽

Flavio Nobili ◽

...

Keyword(s):

Parkinson’s Disease ◽

Performance Optimization ◽

Learning Task ◽

Daily Activities ◽

Learning Capacity ◽

Learning Rates ◽

New Information ◽

Drug Naïve ◽

Better Than

Learning new information is crucial in daily activities and occurs continuously during a subject’s lifetime. Retention of learned material is required for later recall and reuse, although learning capacity is limited and interference between consecutively learned information may occur. Learning processes are impaired in Parkinson’s disease (PD); however, little is known about the processes related to retention and interference. The aim of this study is to investigate the retention and anterograde interference using a declarative sequence learning task in drug-naive patients in the disease’s early stages. Eleven patients with PD and eleven age-matched controls learned a visuomotor sequence, SEQ1, during Day1; the following day, retention of SEQ1 was assessed and, immediately after, a new sequence of comparable complexity, SEQ2, was learned. The comparison of the learning rates of SEQ1 on Day1 and SEQ2 on Day2 assessed the anterograde interference of SEQ1 on SEQ2. We found that SEQ1 performance improved in both patients and controls on Day2. Surprisingly, controls learned SEQ2 better than SEQ1, suggesting the absence of anterograde interference and the occurrence of learning optimization, a process that we defined as “learning how to learn.” Patients with PD lacked such improvement, suggesting defective performance optimization processes.

FPL: fast Presburger arithmetic through transprecision

Proceedings of the ACM on Programming Languages ◽

10.1145/3485539 ◽

2021 ◽

Vol 5 (OOPSLA) ◽

pp. 1-26

Author(s):

Arjun Pitchanathan ◽

Christian Ulmann ◽

Michel Weber ◽

Torsten Hoefler ◽

Tobias Grosser

Keyword(s):

Performance Optimization ◽

Memory Management ◽

State Of The Art ◽

Computational Cost ◽

End User ◽

Loop Optimization ◽

Presburger Arithmetic ◽

Benchmark Suite ◽

Compilation Techniques ◽

High Computational Cost

Presburger arithmetic provides the mathematical core for the polyhedral compilation techniques that drive analytical cache models, loop optimization for ML and HPC, formal verification, and even hardware design. Polyhedral compilation is widely regarded as being slow due to the potentially high computational cost of the underlying Presburger libraries. Researchers typically use these libraries as powerful black-box tools, but the perceived internal complexity of these libraries, caused by the use of C as the implementation language and a focus on end-user-facing documentation, holds back broader performance-optimization efforts. With FPL, we introduce a new library for Presburger arithmetic built from the ground up in modern C++. We carefully document its internal algorithmic foundations, use lightweight C++ data structures to minimize memory management costs, and deploy transprecision computing across the entire library to effectively exploit machine integers and vector instructions. On a newly-developed comprehensive benchmark suite for Presburger arithmetic, we show a 5.4x speedup in total runtime over the state-of-the-art library isl in its default configuration and 3.6x over a variant of isl optimized with element-wise transprecision computing. We expect that the availability of a well-documented and fast Presburger library will accelerate the adoption of polyhedral compilation techniques in production compilers.
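
The general idea behind transprecision arithmetic (attempt a computation in cheap fixed-width integers and switch to arbitrary precision only when an overflow is detected) can be sketched as below. This is not FPL's implementation, which the abstract describes as written in C++; Python integers never overflow, so the 64-bit range check is simulated, and all names here (`checked`, `dot_fast`, `dot_transprecision`) are hypothetical.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


class Overflow(Exception):
    """Raised when a value no longer fits in the simulated machine integer."""


def checked(value):
    # Simulate a 64-bit machine integer: reject anything outside its range.
    if not INT64_MIN <= value <= INT64_MAX:
        raise Overflow
    return value


def dot_fast(xs, ys):
    """Dot product where every intermediate result must fit in 64 bits."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = checked(acc + checked(x * y))
    return acc


def dot_transprecision(xs, ys):
    """Try the fixed-width path first; fall back to big integers on overflow."""
    try:
        return dot_fast(xs, ys), "int64"
    except Overflow:
        return sum(x * y for x, y in zip(xs, ys)), "bigint"


small = dot_transprecision([1, 2, 3], [4, 5, 6])
large = dot_transprecision([2**40, 1], [2**40, 1])
print(small, large)
```

The design choice illustrated is that the common case pays only for a range check, while the rare overflowing case is recomputed exactly, so correctness never depends on the narrow type being wide enough.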

Seismic Isolation Performance Evaluation for a Class of Inerter-Based Low-Complexity Isolators

Shock and Vibration ◽

10.1155/2020/8837822 ◽

2020 ◽

Vol 2020 ◽

pp. 1-12

Author(s):

Fei Cao ◽

Michael Z. Q. Chen ◽

Yinlong Hu

Keyword(s):

Performance Optimization ◽

Seismic Isolation ◽

Base Isolation ◽

Low Complexity ◽

Mass Ratio ◽

Mass Damper ◽

H2 Performance ◽

Seismic Base Isolation ◽

Isolation Performance ◽

Better Than

In this paper, the seismic base isolation problem for all low-complexity networks containing one inerter, one spring, and one damper is studied based on a multi-degree-of-freedom model. The analytical solutions for the H2 performance optimization are derived, and the traditional tuned mass damper (TMD) is employed for comparison. Extensive numerical simulations are performed to verify the effectiveness of the obtained results. The results show that, for different seismic wave excitations, some isolators are better than the TMD at controlling the displacement of the main structure. Moreover, as the TMD mass ratio increases, the isolation performance of the inerter-based isolators becomes increasingly better than that of the TMD.

Efficient memory access methods for framebuffer-less video processing applications

2013 IEEE International Symposium on Circuits and Systems (ISCAS2013) ◽

10.1109/iscas.2013.6572516 ◽

2013 ◽

Cited By ~ 1

Author(s):

Chao-Yang Chang ◽

Chung-Hsun Huang ◽

Yuan-Sun Chu

Keyword(s):

Video Processing ◽

Memory Access ◽

Access Methods ◽

Efficient Memory
