Implementing and Optimizing DES on Stream Processor

2012 ◽  
Vol 532-533 ◽  
pp. 714-718
Author(s):  
Liu Yang ◽  
Xiao Qiang Ni ◽  
Heng Zhu Liu

Processors based on a stream architecture can make good use of on-chip resources and exploit data locality and parallelism. DES is one of the most popular cipher algorithms. This paper proposes a novel implementation of DES on a stream architecture, designed around both the stream programming model and the structure of the DES algorithm; the reported speedup is 1.27 times.

2021 ◽  
Vol 64 (6) ◽  
pp. 107-116
Author(s):  
Yakun Sophia Shao ◽  
Jason Clemons ◽  
Rangharajan Venkatesan ◽  
Brian Zimmer ◽  
Matthew Fojtik ◽  
...  

Package-level integration using multi-chip-modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically only contain a handful of coarse-grained large chiplets due to the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep learning inference, an application domain with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep-learning inference. Each chiplet achieves 4 TOPS peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve up to 16% speedup compared to the baseline layer mapping. Our evaluation shows that Simba can process 1988 images/s running ResNet-50 with a batch size of one, delivering an inference latency of 0.50 ms.
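The relation between the quoted throughput and latency can be checked with a short calculation. The sketch below assumes that, at a batch size of one, images are processed strictly one after another, so latency is simply the reciprocal of throughput; the function name is hypothetical and not taken from the paper.

```python
def latency_ms_from_throughput(images_per_second: float, batch_size: int = 1) -> float:
    """Return the time per batch in milliseconds.

    Assumption (not stated in the abstract): batches are processed
    back to back with no overlap, so latency = batch_size / throughput.
    """
    if images_per_second <= 0:
        raise ValueError("throughput must be positive")
    return batch_size / images_per_second * 1000.0


# Throughput quoted in the abstract: 1988 images/s at batch size one.
latency = latency_ms_from_throughput(1988.0)
print(f"{latency:.2f} ms")  # prints "0.50 ms"
```

Under that assumption, 1988 images/s corresponds to roughly 0.503 ms per image, which agrees with the 0.50 ms figure in the abstract.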


2009 ◽  
Vol 18 (02) ◽  
pp. 255-269 ◽  
Author(s):  
JUN HO BAHN ◽  
JUNG SOOK YANG ◽  
WEN-HSIANG HU ◽  
NADER BAGHERZADEH

This paper presents parallel FFT algorithms with different degrees of computation and communication overhead for multiprocessors in a Network-on-Chip (NoC) environment. Of the three parallel FFT algorithms presented, two are proposed for a 2D NoC that can contain a variable number of processing elements (PEs), and one is a reference parallel FFT algorithm for comparison. The first proposed algorithm increases performance by assigning well-balanced computation tasks to PEs. Execution times are reduced because the algorithm exploits data locality to avoid unnecessary data exchanges among PEs and removes overall idle periods through balanced task scheduling. An enhanced version of this algorithm reduces communication traffic by eliminating the step in which transformed data are returned to the original PE after one computation stage before being sent to the next PE for the following stage. Instead, we propose a method that preserves the regularity of the data communication and of the computations with twiddle factors. According to simulation results from our cycle-accurate SystemC NoC model with a parametrizable 2-D mesh architecture, and an analysis of the algorithms in terms of time and complexity, our proposed algorithms outperform the reference parallel FFT algorithm and FFT implementations on TI Digital Signal Processors (DSPs) that have similar specifications to our simulation environment.


2000 ◽  
Vol 8 (3) ◽  
pp. 143-162 ◽  
Author(s):  
Dimitrios S. Nikolopoulos ◽  
Theodore S. Papatheodorou ◽  
Constantine D. Polychronopoulos ◽  
Jesús Labarta ◽  
Eduard Ayguadé

This paper makes two important contributions. First, the paper investigates the performance implications of data placement in OpenMP programs running on modern NUMA multiprocessors. Data locality and minimization of the rate of remote memory accesses are critical for sustaining high performance on these systems. We show that due to the low remote-to-local memory access latency ratio of contemporary NUMA architectures, reasonably balanced page placement schemes, such as round-robin or random distribution, incur modest performance losses. Second, the paper presents a transparent, user-level page migration engine with an ability to gain back any performance loss that stems from suboptimal placement of pages in iterative OpenMP programs. The main body of the paper describes how our OpenMP runtime environment uses page migration for implementing implicit data distribution and redistribution schemes without programmer intervention. Our experimental results verify the effectiveness of the proposed framework and provide a proof of concept that it is not necessary to introduce data distribution directives in OpenMP and warrant the simplicity or the portability of the programming model.
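As a rough illustration of the balanced placement schemes mentioned above, the following sketch assigns pages to NUMA nodes round-robin or at random and reports what fraction of pages is remote to a given node. It is a toy model under stated assumptions (uniform access to all pages, a single accessing node), not the authors' page migration engine; all function names are hypothetical.

```python
import random


def round_robin_placement(num_pages: int, num_nodes: int) -> list[int]:
    """Place page i on node i mod num_nodes."""
    return [page % num_nodes for page in range(num_pages)]


def random_placement(num_pages: int, num_nodes: int, seed: int = 0) -> list[int]:
    """Place each page on a uniformly random node (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.randrange(num_nodes) for _ in range(num_pages)]


def remote_fraction(placement: list[int], accessing_node: int) -> float:
    """Fraction of pages that live on a node other than accessing_node.

    Assumes every page is accessed equally often, which real programs
    generally do not satisfy.
    """
    remote = sum(1 for node in placement if node != accessing_node)
    return remote / len(placement)


if __name__ == "__main__":
    pages, nodes = 1024, 4
    rr = round_robin_placement(pages, nodes)
    rnd = random_placement(pages, nodes)
    print("round-robin remote fraction:", remote_fraction(rr, accessing_node=0))
    print("random remote fraction:", remote_fraction(rnd, accessing_node=0))
```

With N nodes, round-robin placement leaves about (N-1)/N of the pages remote to any single node, so how much this costs depends on the remote-to-local latency ratio the abstract refers to.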


2017 ◽  
Vol 26 (04) ◽  
pp. 1750004 ◽  
Author(s):  
J. V. Benifa Bibal ◽  
D. Dejey

The scalability of the cloud infrastructure is essential for large-scale data processing with the MapReduce programming model, which relies on automatically provisioning and de-provisioning resources on demand. The existing MapReduce model shows performance degradation when adapted to heterogeneous environments, since sufficient techniques are not available to scale the resources on demand and the scheduling algorithms do not cooperate when the resources are configured dynamically. An Auto-Scaling Framework (ASF) is presented in this article to configure the resources automatically based on the current system load in a heterogeneous Hadoop environment. Data and tasks are scheduled in a data-local manner that adapts as new resources are configured or existing resources are removed. A monitoring module is integrated with the JobTracker to observe the status of physical machines, compute the system load and provide automated provisioning of the resources. A Replica Tracker is then used to track the replica objects for efficient scheduling of tasks on the physical machines. The experiments are conducted in a commercial cloud environment using diverse workload characteristics, and the observations show that the proposed framework outperforms the existing scheduling mechanisms on performance metrics such as average completion time, scheduling time, data locality, resource utilization and throughput.


Author(s):  
Peng Wang ◽  
Avram Bar-Cohen

Thermal management of on-chip hot spots has become an increasing challenge in recent years because such localized high-flux hot spots cannot be effectively removed by conventional cooling techniques. The authors have recently explored the novel use of the silicon chip itself as a solid-state thermoelectric microcooler (μTEC) for hot spot thermal management. This paper describes the development and application of a thermoelectric design tool based on closed-form equations for the primary variables. This tool can be used to effectively reduce the complexity and required time for the design and optimization of the silicon microcooler geometry and material properties for on-chip hot spot remediation.


Author(s):  
Meilian Xu ◽  
Parimala Thulasiraman ◽  
Ruppa K. Thulasiram

This chapter uses two scientific computing kernels to illustrate the challenges of designing parallel algorithms for one heterogeneous multi-core processor, the Cell Broadband Engine processor (Cell/B.E.). It describes the limitation of current parallel systems that use single-core processors as building blocks. This limitation degrades the performance of applications that have data-intensive and computation-intensive kernels such as Finite Difference Time Domain (FDTD) and Fast Fourier Transform (FFT). FDTD is a regular problem with a nearest-neighbour communication pattern under a synchronization constraint. FFT based on an indirect swap network (ISN) modifies the data mapping in the traditional Cooley-Tukey butterfly network to improve data locality, hence reducing the communication and synchronization overhead. The authors hope to unleash the Cell/B.E. and design parallel FDTD and parallel FFT based on ISN by taking into account unique features of the Cell/B.E. such as its eight SIMD processing units on a single chip and its high-speed on-chip bus.
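For orientation, the traditional Cooley-Tukey butterfly network that the ISN approach modifies can be sketched as a sequential, iterative radix-2 FFT. This is only the textbook baseline, written under the assumption of a power-of-two input length; it does not implement the ISN data mapping or any Cell/B.E.-specific parallelization, and the function names are illustrative.

```python
import cmath


def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_radix2(values):
    """Iterative radix-2 decimation-in-time FFT (Cooley-Tukey)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    bits = n.bit_length() - 1

    # Reorder the input so each butterfly stage works on adjacent blocks.
    a = [0j] * n
    for i, v in enumerate(values):
        a[bit_reverse(i, bits)] = complex(v)

    # log2(n) butterfly stages; block size doubles each stage.
    size = 2
    while size <= n:
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)  # twiddle-factor increment
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                u = a[start + k]
                t = w * a[start + k + half]
                a[start + k] = u + t          # upper butterfly output
                a[start + k + half] = u - t   # lower butterfly output
                w *= w_step
        size *= 2
    return a
```

In each stage, element `start + k` is paired with element `start + k + half`; in a parallel setting, pairs that live on different processing units are what generate the communication and synchronization overhead the chapter aims to reduce.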


2008 ◽  
Vol 50 (5) ◽  
Author(s):  
Rainer Buchty ◽  
Wolfgang Karl

Already today we face architectures featuring up to several hundreds of processors, being able to manage several thousand concurrent threads. Future architectures, however, will not only see an increase in parallelism but also feature an increase in heterogeneity and reconfigurability. Judging from current production and prototype architectures, we also see that such systems will be tiled, i.e., individual cores with local memory interconnected through some means of on-chip communication. Current discussions show that existing approaches to application mapping, parallelization, data locality optimization, and system management do not match these upcoming architectures well, thus hampering rather than harnessing the power of future systems. We will therefore outline the requirements of upcoming architectures and demonstrate how self-organization techniques, including bio-inspired ones, may help to manage system complexity. Key to these techniques is a sophisticated decentralized, hierarchical monitoring approach suitable for sustained real-time monitoring and event correlation for current and future high-performance architectures.


2002 ◽  
Vol 357 (1419) ◽  
pp. 331-340 ◽  
Author(s):  
J. N. Webb ◽  
T. Székely ◽  
A. I. Houston ◽  
J. M. McNamara

Should a parent care for its young or abandon them before they reach independence? We consider parental care behaviour as an adaptive decision, involving trade-offs between current and future reproduction. The condition of the parent is expected to influence these trade-offs. Using a dynamic programming model we explore how changes in the levels of energetic reserves, and time in the season, determine changes in parental care decisions. The novel feature of our model is that we have included the possibility of remating within the current breeding season in a consistent manner by explicitly modelling the behaviour of unmated animals. We show that there may be several fluctuations in the average duration of care during the breeding season. We also show that, because of the dependence of parental care behaviour on both the condition of the parent and time during the breeding season, changing some of the costs of care may increase the duration of care during one part of the season and decrease it at another. The model also shows that the conditions prevailing for animals with dependent offspring can affect the way in which an unmated animal behaves. For example, the behaviour of unmated animals may change to compensate (partly) for increases in the costs of raising offspring, which are produced at a later date (for example, by increasing the duration of foraging between breeding attempts). Overall, the model provides a good framework for understanding how various ecological and life-history variables should influence parental care behaviour during a breeding season.

