Implementing and Optimizing DES on Stream Processor

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.532-533.714 ◽

2012 ◽

Vol 532-533 ◽

pp. 714-718

Author(s):

Liu Yang ◽

Xiao Qiang Ni ◽

Heng Zhu Liu

Keyword(s):

Programming Model ◽

Data Locality ◽

The Novel ◽

Stream Processor ◽

Stream Programming ◽

Stream Architecture ◽

On Chip

Processors based on a stream architecture can make good use of on-chip resources and exploit data locality and parallelism. DES is one of the most popular cipher algorithms. This paper proposes a novel implementation of DES on a stream architecture, based on both the stream programming model and the DES algorithm; the reported speedup is 1.27 times.

Simba

Communications of the ACM ◽

10.1145/3460227 ◽

2021 ◽

Vol 64 (6) ◽

pp. 107-116

Author(s):

Yakun Sophia Shao ◽

Jason Clemons ◽

Rangharajan Venkatesan ◽

Brian Zimmer ◽

Matthew Fojtik ◽

...

Keyword(s):

Deep Learning ◽

Large Scale ◽

Data Locality ◽

Coarse Grained ◽

Batch Size ◽

Peak Performance ◽

Large Scale Systems ◽

High Area ◽

On Chip ◽

And Storage

Package-level integration using multi-chip-modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically only contain a handful of coarse-grained large chiplets due to the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep learning inference, an application domain with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep-learning inference. Each chiplet achieves 4 TOPS peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve up to 16% speedup compared to the baseline layer mapping. Our evaluation shows that Simba can process 1988 images/s running ResNet-50 with a batch size of one, delivering an inference latency of 0.50 ms.
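
The throughput and latency figures quoted at the end of this abstract are consistent with each other if images are processed strictly one after another at batch size one. The short sketch below checks that arithmetic. The function name `latency_ms` is hypothetical, and the no-overlap assumption is ours, not a statement about how Simba schedules work.

```python
# Figures quoted in the abstract above.
IMAGES_PER_SECOND = 1988  # ResNet-50 throughput at batch size one
BATCH_SIZE = 1


def latency_ms(throughput_per_s: float, batch_size: int = 1) -> float:
    """Per-batch latency in milliseconds.

    Assumes batches are processed one after another with no overlap,
    so latency is simply batch_size / throughput.
    """
    if throughput_per_s <= 0:
        raise ValueError("throughput must be positive")
    return batch_size / throughput_per_s * 1000.0


if __name__ == "__main__":
    print(f"{latency_ms(IMAGES_PER_SECOND, BATCH_SIZE):.2f} ms")
```

Under that assumption, 1 / 1988 s rounds to the 0.50 ms stated in the abstract.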

PARALLEL FFT ALGORITHMS ON NETWORK-ON-CHIPS

Journal of Circuits System and Computers ◽

10.1142/s0218126609005046 ◽

2009 ◽

Vol 18 (02) ◽

pp. 255-269 ◽

Cited By ~ 1

Author(s):

JUN HO BAHN ◽

JUNG SOOK YANG ◽

WEN-HSIANG HU ◽

NADER BAGHERZADEH

Keyword(s):

Data Communication ◽

Digital Signal ◽

Variable Number ◽

Data Locality ◽

Communication Traffic ◽

On Chip ◽

Fft Algorithms ◽

Signal Processors ◽

Parallel Fft ◽

Parallel Fft Algorithm

This paper presents parallel FFT algorithms with different degrees of computation and communication overhead for multiprocessors in a Network-on-Chip (NoC) environment. Of the three parallel FFT algorithms presented, two are proposed for a 2-D NoC that can contain a variable number of processing elements (PEs), and one is a reference parallel FFT algorithm for comparison. The first proposed algorithm increases performance by assigning well-balanced computation tasks to PEs. Execution times are reduced because the algorithm uses data locality to avoid unnecessary data exchanges among PEs and removes idle periods through balanced task scheduling. An enhanced version of this algorithm is suggested in which communication traffic is reduced: the step of returning transformed data to the original PE after one computation stage, before sending them to the next PE for the following stage, is removed. Instead, we propose a method that keeps the data communication and the computations with twiddle factors regular. According to simulation results from our cycle-accurate SystemC NoC model with a parametrizable 2-D mesh architecture, and an analysis of the algorithms in time and complexity, the proposed algorithms outperform the reference parallel FFT algorithm and FFT implementations on TI Digital Signal Processors (DSPs) that have similar specifications to our simulation environment.
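
To illustrate why data locality matters for a parallel FFT, the sketch below checks, for a generic radix-2 FFT, which butterfly stages can run without any inter-PE exchange when the points are split into contiguous blocks across PEs. This is not the algorithm proposed in the paper. The block distribution and the function name `stage_locality` are assumptions made for illustration only.

```python
def stage_locality(n_points: int, n_pes: int) -> list[bool]:
    """For each radix-2 butterfly stage, report whether every butterfly
    has both of its inputs on the same PE.

    Assumes a contiguous block distribution: point i lives on PE
    i // (n_points // n_pes). Both arguments must be powers of two.
    """
    if n_points < 1 or n_pes < 1 or n_pes > n_points:
        raise ValueError("need 1 <= n_pes <= n_points")
    if n_points & (n_points - 1) or n_pes & (n_pes - 1):
        raise ValueError("n_points and n_pes must be powers of two")

    block = n_points // n_pes
    stages = n_points.bit_length() - 1  # log2(n_points)
    local = []
    for s in range(stages):
        span = 1 << s  # butterfly partners differ only in bit s
        same_pe = all(i // block == (i ^ span) // block
                      for i in range(n_points))
        local.append(same_pe)
    return local


if __name__ == "__main__":
    flags = stage_locality(16, 4)
    print(flags)
    print("local stages:", sum(flags), "of", len(flags))
```

With this distribution, the stages whose partner distance is smaller than the block size stay on one PE, and the remaining log2(n_pes) stages need data from another PE. Reducing or regularizing the traffic of those remaining stages is the kind of saving the abstract describes.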

Fully Distributed On-chip Instruction Memory Design for Stream Architecture Based on Field-Divided VLIW Compression

2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems ◽

10.1109/hpcc.2012.14 ◽

2012 ◽

Author(s):

Yi He ◽

Maolin Guan ◽

Chunyuan Zhang ◽

Tian Tian ◽

Qianming Yang

Keyword(s):

Memory Design ◽

Stream Architecture ◽

On Chip ◽

Instruction Memory

A Transparent Runtime Data Distribution Engine for OpenMP

Scientific Programming ◽

10.1155/2000/417570 ◽

2000 ◽

Vol 8 (3) ◽

pp. 143-162 ◽

Cited By ~ 4

Author(s):

Dimitrios S. Nikolopoulos ◽

Theodore S. Papatheodorou ◽

Constantine D. Polychronopoulos ◽

Jesús Labarta ◽

Eduard Ayguadé

Keyword(s):

High Performance ◽

Programming Model ◽

Data Distribution ◽

Data Locality ◽

Remote Memory ◽

Main Body ◽

Performance Loss ◽

Page Migration ◽

Runtime Environment ◽

Memory Accesses

This paper makes two important contributions. First, the paper investigates the performance implications of data placement in OpenMP programs running on modern NUMA multiprocessors. Data locality and minimization of the rate of remote memory accesses are critical for sustaining high performance on these systems. We show that due to the low remote-to-local memory access latency ratio of contemporary NUMA architectures, reasonably balanced page placement schemes, such as round-robin or random distribution, incur modest performance losses. Second, the paper presents a transparent, user-level page migration engine with an ability to gain back any performance loss that stems from suboptimal placement of pages in iterative OpenMP programs. The main body of the paper describes how our OpenMP runtime environment uses page migration for implementing implicit data distribution and redistribution schemes without programmer intervention. Our experimental results verify the effectiveness of the proposed framework and provide a proof of concept that it is not necessary to introduce data distribution directives in OpenMP and warrant the simplicity or the portability of the programming model.
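
The effect of page placement on remote accesses can be sketched with a toy model. It assumes one thread per NUMA node, each touching a contiguous chunk of pages (as a statically scheduled loop would), and compares round-robin placement with a placement that puts each page on the node that uses it. All names and the latency ratio of 2.0 are illustrative assumptions, not values from the paper.

```python
import math
from typing import Callable


def remote_fraction(n_pages: int, n_nodes: int,
                    placement: Callable[[int], int]) -> float:
    """Fraction of pages whose home node differs from the node that
    accesses them, with one thread per node and contiguous chunks."""
    chunk = math.ceil(n_pages / n_nodes)
    remote = sum(1 for page in range(n_pages)
                 if placement(page) != page // chunk)
    return remote / n_pages


def round_robin(n_nodes: int) -> Callable[[int], int]:
    """Page p is placed on node p mod n_nodes."""
    return lambda page: page % n_nodes


def block_placement(n_pages: int, n_nodes: int) -> Callable[[int], int]:
    """Page p is placed on the node whose thread accesses it."""
    chunk = math.ceil(n_pages / n_nodes)
    return lambda page: page // chunk


def relative_memory_latency(remote_frac: float, ratio: float) -> float:
    """Average access latency relative to all-local access, given a
    remote-to-local latency ratio (an assumed, illustrative input)."""
    return (1.0 - remote_frac) + remote_frac * ratio


if __name__ == "__main__":
    rr = remote_fraction(64, 4, round_robin(4))
    blk = remote_fraction(64, 4, block_placement(64, 4))
    print("round-robin remote fraction:", rr)
    print("block remote fraction:", blk)
    print("relative latency at ratio 2.0:", relative_memory_latency(rr, 2.0))
```

In this model the cost of a poor placement scales with the remote-to-local latency ratio, which matches the abstract's point that a low ratio keeps the loss from round-robin or random placement modest.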

An Auto-Scaling Framework for Heterogeneous Hadoop Systems

International Journal of Cooperative Information Systems ◽

10.1142/s0218843017500046 ◽

2017 ◽

Vol 26 (04) ◽

pp. 1750004 ◽

Cited By ~ 2

Author(s):

J. V. Benifa Bibal ◽

D. Dejey

Keyword(s):

Large Scale ◽

Performance Metrics ◽

Programming Model ◽

Current System ◽

Data Locality ◽

Time Data ◽

System Load ◽

Average Completion Time ◽

On Demand ◽

Auto Scaling

The scalability of the cloud infrastructure is essential for large-scale data processing with the MapReduce programming model, since resources must be provisioned and de-provisioned automatically on demand. The existing MapReduce model shows performance degradation when adapted to heterogeneous environments, because sufficient techniques are not available to scale the resources on demand and the scheduling algorithms do not cooperate when resources are configured dynamically. An Auto-Scaling Framework (ASF) is presented in this article to configure the resources automatically based on the current system load in a heterogeneous Hadoop environment. The scheduling of data and tasks is done in a data-local manner that adapts when new resources are configured or existing resources are removed. A monitoring module is integrated with the JobTracker to observe the status of physical machines, compute the system load, and provide automated provisioning of the resources. A Replica Tracker is then used to track the replica objects for efficient scheduling of tasks on the physical machines. The experiments are conducted in a commercial cloud environment using diverse workload characteristics, and the observations show that the proposed framework outperforms the existing scheduling mechanisms on performance metrics such as average completion time, scheduling time, data locality, resource utilization, and throughput.

Simplified Thermal Model of Silicon Thermoelectric Microcooler for On-Chip Hot Spot Remediation

ASME 2007 InterPACK Conference, Volume 2 ◽

10.1115/ipack2007-33940 ◽

2007 ◽

Cited By ~ 1

Author(s):

Peng Wang ◽

Avram Bar-Cohen

Keyword(s):

Thermal Management ◽

Material Properties ◽

Hot Spots ◽

Hot Spot ◽

Design Tool ◽

The Novel ◽

Thermo Electric ◽

Design And Optimization ◽

Conventional Cooling ◽

On Chip

Thermal management of on-chip hot spots has become an increasing challenge in recent years because such localized high-flux hot spots cannot be effectively removed by conventional cooling techniques. The authors have recently explored the novel use of the silicon chip itself as a solid-state thermoelectric microcooler (μTEC) for hot spot thermal management. This paper describes the development and application of a thermoelectric design tool based on closed-form equations for the primary variables. This tool can be used to effectively reduce the complexity and required time for the design and optimization of the silicon microcooler geometry and material properties for on-chip hot spot remediation.

Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors?

Lecture Notes in Computer Science - Compiler Construction ◽

10.1007/978-3-642-11970-5_15 ◽

2010 ◽

pp. 264-282 ◽

Cited By ~ 43

Author(s):

Yunlian Jiang ◽

Eddy Z. Zhang ◽

Kai Tian ◽

Xipeng Shen

Keyword(s):

Chip Multiprocessors ◽

Data Locality ◽

Reuse Distance ◽

On Chip

Cell Processing for Two Scientific Computing Kernels

Handbook of Research on Scalable Computing Technologies ◽

10.4018/978-1-60566-661-7.ch014 ◽

2010 ◽

pp. 312-336

Author(s):

Meilian Xu ◽

Parimala Thulasiraman ◽

Ruppa K. Thulasiram

Keyword(s):

High Speed ◽

Scientific Computing ◽

Building Blocks ◽

Data Locality ◽

Data Mapping ◽

Single Chip ◽

Data Intensive ◽

Synchronization Overhead ◽

Simd Processing ◽

On Chip

This chapter uses two scientific computing kernels to illustrate the challenges of designing parallel algorithms for a heterogeneous multi-core processor, the Cell Broadband Engine processor (Cell/B.E.). It describes the limitations of current parallel systems that use single-core processors as building blocks. These limitations degrade the performance of applications with data-intensive and computation-intensive kernels such as Finite Difference Time Domain (FDTD) and Fast Fourier Transform (FFT). FDTD is a regular problem with a nearest-neighbour communication pattern under a synchronization constraint. FFT based on an indirect swap network (ISN) modifies the data mapping of the traditional Cooley-Tukey butterfly network to improve data locality, hence reducing the communication and synchronization overhead. The authors hope to unleash the Cell/B.E. and design parallel FDTD and parallel FFT based on ISN by taking into account unique features of the Cell/B.E. such as its eight SIMD processing units on a single chip and its high-speed on-chip bus.

Design Aspects of Self-Organizing Heterogeneous Multi-Core Architectures (Entwurfsaspekte selbstorganisierender, heterogener Multicore-Architekturen)

it - Information Technology ◽

10.1524/itit.2008.0498 ◽

2008 ◽

Vol 50 (5) ◽

Cited By ~ 1

Author(s):

Rainer Buchty ◽

Wolfgang Karl

Keyword(s):

Real Time ◽

High Performance ◽

Data Locality ◽

Self Organization ◽

System Management ◽

Event Correlation ◽

System Complexity ◽

Application Mapping ◽

Current Production ◽

On Chip

Already today we face architectures featuring up to several hundreds of processors, able to manage several thousand concurrent threads. Future architectures, however, will not only see an increase in parallelism but also feature an increase in heterogeneity and reconfigurability. Judging from current production and prototype architectures, we also see that such systems will be tiled, i.e., built from individual cores with local memory interconnected through some means of on-chip communication. Current discussions show that existing approaches to application mapping, parallelization, data locality optimization, and system management do not match these upcoming architectures well, thus hampering rather than harnessing the power of future systems. We will therefore outline the requirements of upcoming architectures and demonstrate how self-organization techniques, including bio-inspired ones, may help to manage system complexity. Key to these techniques is a sophisticated decentralized, hierarchical monitoring approach suitable for sustained real-time monitoring and event correlation for current and future high-performance architectures.

A theoretical analysis of the energetic costs and consequences of parental care decisions

Philosophical Transactions of the Royal Society B Biological Sciences ◽

10.1098/rstb.2001.0934 ◽

2002 ◽

Vol 357 (1419) ◽

pp. 331-340 ◽

Cited By ~ 55

Author(s):

J. N. Webb ◽

T. Székely ◽

A. I. Houston ◽

J. M. McNamara

Keyword(s):

Parental Care ◽

Breeding Season ◽

Programming Model ◽

Average Duration ◽

The Novel ◽

Trade Offs ◽

Consistent Manner ◽

Future Reproduction ◽

Dynamic Programming Model ◽

Energetic Reserves

Should a parent care for its young or abandon them before they reach independence? We consider parental care behaviour as an adaptive decision, involving trade-offs between current and future reproduction. The condition of the parent is expected to influence these trade-offs. Using a dynamic programming model we explore how changes in the levels of energetic reserves, and time in the season, determine changes in parental care decisions. The novel feature of our model is that we have included the possibility of remating within the current breeding season in a consistent manner by explicitly modelling the behaviour of unmated animals. We show that there may be several fluctuations in the average duration of care during the breeding season. We also show that, because of the dependence of parental care behaviour on both the condition of the parent and time during the breeding season, changing some of the costs of care may increase the duration of care during one part of the season and decrease it at another. The model also shows that the conditions prevailing for animals with dependent offspring can affect the way in which an unmated animal behaves. For example, the behaviour of unmated animals may change to compensate (partly) for increases in the costs of raising offspring, which are produced at a later date (for example, by increasing the duration of foraging between breeding attempts). Overall, the model provides a good framework for understanding how various ecological and life-history variables should influence parental care behaviour during a breeding season.
