heterogeneous processing
Recently Published Documents


TOTAL DOCUMENTS: 79 (five years: 12)

H-INDEX: 13 (five years: 0)

Author(s):  
David Broneske ◽  
Anna Drewes ◽  
Bala Gurumurthy ◽  
Imad Hajjar ◽  
Thilo Pionteck ◽  
...  

Abstract Classical database systems now face the challenge of processing high-volume data feeds at unprecedented rates as efficiently as possible while also minimizing power consumption. Since CPU-only machines are hitting their limits, database system designers are investigating co-processors such as GPUs and FPGAs for their distinct capabilities. As a result, database systems over heterogeneous processing architectures are on the rise. To better understand their potential and limitations, in-depth performance analyses are vital. This paper provides performance data by benchmarking a portable operator set for column-based systems on CPU, GPU, and FPGA – all processing devices available within the same system. We consider TPC-H query Q6 and additionally a hash join to profile execution across the systems. We show that system memory access and/or buffer management remains the main bottleneck for device integration, and that architecture-specific execution engines and operators offer significantly higher performance.
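To illustrate the column-at-a-time operator style benchmarked in work like this, the following sketch composes a Q6-style selection-and-aggregation pipeline over plain Python lists acting as columns. The operator names, the row-id selection-vector design, and the predicate constants are illustrative assumptions, not the paper's actual operator set.

```python
def select_between(col, lo, hi, sel=None):
    # Selection operator: keep row ids where lo <= value <= hi.
    rows = sel if sel is not None else range(len(col))
    return [i for i in rows if lo <= col[i] <= hi]

def select_less(col, bound, sel):
    # Selection operator: keep row ids where value < bound.
    return [i for i in sel if col[i] < bound]

def dot_sum(col_a, col_b, sel):
    # Aggregation operator: sum of elementwise products over selected rows.
    return sum(col_a[i] * col_b[i] for i in sel)

def q6_style(shipdate, discount, quantity, extendedprice,
             date_lo, date_hi, disc_lo, disc_hi, qty_max):
    # Q6-style pipeline: three selections narrow a row-id vector,
    # then one aggregation runs over the survivors.
    sel = select_between(shipdate, date_lo, date_hi)
    sel = select_between(discount, disc_lo, disc_hi, sel)
    sel = select_less(quantity, qty_max, sel)
    return dot_sum(extendedprice, discount, sel)
```

Each operator consumes and produces a row-id vector, which is what makes the set portable: a GPU or FPGA backend can swap in its own implementation of each operator while keeping the same pipeline shape.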


2021 ◽  
Vol 20 (5) ◽  
pp. 1-31
Author(s):  
Sanjit Kumar Roy ◽  
Rajesh Devaraj ◽  
Arnab Sarkar ◽  
Debabrata Senapati

Continuous demands for higher performance and reliability within stringent resource budgets are driving a shift from homogeneous to heterogeneous processing platforms for the implementation of today's cyber-physical systems (CPSs). These CPSs are typically represented as directed acyclic task graphs (DTGs) due to the complex interactions between their functional components, which are often distributed in nature. In this article, we consider the problem of scheduling a real-time application modelled as a single DTG, where tasks may have multiple implementations designated as quality-levels, with higher quality-levels producing more accurate results and contributing higher rewards/Quality-of-Service to the system. First, we introduce an optimal solution using Integer Linear Programming (ILP) for a DTG with multiple quality-levels, to be executed on a heterogeneous distributed platform. However, this ILP-based optimal solution exhibits high computational complexity and does not scale to moderately large problem sizes. Hence, we propose two low-overhead heuristic algorithms, the Global Slack Aware Quality-level Allocator (G-SLAQA) and the Total Slack Aware Quality-level Allocator (T-SLAQA), which produce efficient solutions within reasonable time. G-SLAQA, the baseline heuristic, is greedier and faster than its counterpart T-SLAQA, which is at least as efficient as G-SLAQA. The efficiency of the proposed schemes has been extensively evaluated through simulation-based experiments using benchmark and randomly generated DTGs. Through a case study of a real-world automotive traction controller, we generate schedules using the proposed schemes to demonstrate their practical applicability.
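The core idea of slack-aware quality-level allocation can be sketched as a greedy loop: start every task at its lowest quality-level, then spend the remaining deadline slack on the upgrades that buy the most reward per unit of extra execution time. This is a hypothetical single-processor simplification for intuition only; it is not the G-SLAQA or T-SLAQA algorithm from the paper, which operate on DTGs over heterogeneous platforms.

```python
def greedy_quality_alloc(tasks, deadline):
    """Greedy slack-aware sketch. `tasks` is a list (executed back-to-back on
    one processor) where each entry lists quality-level options as
    (exec_time, reward) tuples in increasing quality order. Returns
    (chosen levels, total reward), or None if even the lowest levels miss
    the deadline."""
    level = [0] * len(tasks)
    used = sum(t[0][0] for t in tasks)
    if used > deadline:
        return None  # infeasible even at the lowest quality-levels
    while True:
        best, best_ratio = None, 0.0
        for i, opts in enumerate(tasks):
            if level[i] + 1 < len(opts):
                dt = opts[level[i] + 1][0] - opts[level[i]][0]
                dr = opts[level[i] + 1][1] - opts[level[i]][1]
                # Consider upgrades that fit in the remaining slack and
                # improve the reward-per-time ratio the most.
                if used + dt <= deadline and dt > 0 and dr / dt > best_ratio:
                    best, best_ratio = i, dr / dt
        if best is None:
            return level, sum(tasks[i][level[i]][1] for i in range(len(tasks)))
        used += tasks[best][level[best] + 1][0] - tasks[best][level[best]][0]
        level[best] += 1
```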


2021 ◽  
Author(s):  
Usman Ahmed

The hardware-software co-synthesis problem consists of finding an architecture, subject to certain constraints, for a given set of tasks related through data dependencies. The architecture comprises a set of heterogeneous processing elements and a communication structure connecting them. In this thesis, a new co-synthesis algorithm is presented that targets distributed memory architectures. The algorithm consists of four distinct phases: processing element selection, pipelined task allocation, scheduling, and best topology selection. The selected processing elements are finally mapped to a regular distributed memory architecture with a mesh, hypercube, or quad-tree topology. The co-synthesis method is demonstrated by applying it to an MPEG encoder application and to large random graphs of various sizes.
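The allocation-and-scheduling phases of a co-synthesis flow are commonly built on list scheduling over heterogeneous execution costs. The sketch below assigns each task, in topological order, to the processing element that finishes it earliest; it is a generic simplification (communication delays and topology mapping are omitted), not the thesis's actual four-phase algorithm.

```python
def map_tasks(order, deps, exec_time, n_pes):
    """List-scheduling sketch over heterogeneous PEs. `order` is a
    topological order of tasks, `deps[t]` lists t's predecessors, and
    exec_time[t][p] is t's cost on PE p. Returns (placement, makespan)."""
    pe_free = [0.0] * n_pes      # time each PE becomes available
    finish = {}                  # finish time per task
    placement = {}               # chosen PE per task
    for t in order:
        # A task is ready once all of its data dependencies have finished.
        ready = max((finish[d] for d in deps.get(t, [])), default=0.0)
        # Greedily pick the PE with the earliest finish time for this task.
        best_pe = min(range(n_pes),
                      key=lambda p: max(pe_free[p], ready) + exec_time[t][p])
        start = max(pe_free[best_pe], ready)
        finish[t] = start + exec_time[t][best_pe]
        pe_free[best_pe] = finish[t]
        placement[t] = best_pe
    return placement, max(finish.values())
```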


2021 ◽  
Author(s):  
Di Yang ◽  
Jinquan Ma ◽  
Chunsheng Yue ◽  
Zhichong Shen ◽  
Xiaolong Shen



2021 ◽  
Vol 18 (1) ◽  
pp. 1-27
Author(s):  
Sooraj Puthoor ◽  
Mikko H. Lipasti

Sequential consistency (SC) is the most intuitive memory consistency model and the easiest for programmers and hardware designers to reason about. However, the strict memory ordering restrictions imposed by SC make it less attractive from a performance standpoint. Additionally, prior high-performance SC implementations required complex hardware structures to support speculation and recovery. In this article, we introduce the lockstep SC consistency model (LSC), a new memory model based on SC but carefully defined to accommodate the data parallel lockstep execution paradigm of GPUs. We also describe an efficient LSC implementation for an APU system-on-chip (SoC) and show that our implementation performs close to the baseline relaxed model. Evaluation of our implementation shows that the geometric mean performance cost for lockstep SC is just 0.76% for GPU execution and 6.11% for the entire APU SoC compared to a baseline with a weaker memory consistency model. Adoption of LSC in future APU and SoC designs will reduce the burden on programmers trying to write correct parallel programs, while also simplifying the implementation and verification of systems with heterogeneous processing elements and complex memory hierarchies.
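What makes SC "easy to reason about" can be shown with the classic store-buffering litmus test: under SC, an execution is some interleaving of each thread's operations in program order, so the outcome where both threads read the old value is impossible (weaker models can permit it). The sketch below enumerates all SC interleavings of two two-operation threads; the encoding of operations is my own illustration, not from the article.

```python
from itertools import permutations

def sc_outcomes():
    """Enumerate store-buffering outcomes allowed under SC.
    T0: x = 1; r0 = y        T1: y = 1; r1 = x
    Every SC execution is an interleaving that preserves each thread's
    program order, so we permute the multiset [0, 0, 1, 1] of thread ids."""
    t0 = [('w', 'x', 1), ('r', 'y', 'r0')]
    t1 = [('w', 'y', 1), ('r', 'x', 'r1')]
    outcomes = set()
    for order in set(permutations([0, 0, 1, 1])):
        mem = {'x': 0, 'y': 0}
        regs, idx = {}, [0, 0]
        for t in order:
            kind, loc, arg = (t0, t1)[t][idx[t]]
            idx[t] += 1
            if kind == 'w':
                mem[loc] = arg          # store to shared memory
            else:
                regs[arg] = mem[loc]    # load into a register
        outcomes.add((regs['r0'], regs['r1']))
    return outcomes
```

Running this yields {(0, 1), (1, 0), (1, 1)}: the forbidden (0, 0) outcome never appears, which is exactly the guarantee SC (and LSC) gives that relaxed models do not.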


2020 ◽  
Author(s):  
Yuluan Wang ◽  
Charlotte Soneson ◽  
Anna L Malinowska ◽  
Artur Laski ◽  
Souvik Ghosh ◽  
...  

Abstract Many microRNAs regulate gene expression via atypical mechanisms, which are difficult to discern using native cross-linking methods. To ascertain the scope of non-canonical miRNA targeting, methods are needed that identify all targets of a given miRNA. We designed a new class of miR-CLIP probe, whereby psoralen is conjugated to the 3p arm of a pre-microRNA to capture targetomes of miR-124 and miR-132 in HEK293T cells. Processing of pre-miR-124 yields miR-124 and a 5′-extended isoform, iso-miR-124. Using miR-CLIP, we identified overlapping targetomes from both isoforms. From a set of 16 targets, 13 were differentially inhibited at mRNA/protein levels by the isoforms. Moreover, delivery of pre-miR-124 into cells repressed these targets more strongly than individual treatments with miR-124 and iso-miR-124, suggesting that isomirs from one pre-miRNA may function synergistically. By mining the miR-CLIP targetome, we identified nine G-bulged target-sites that are regulated at the protein level by miR-124 but not iso-miR-124. Using structural data, we propose a model involving AGO2 helix-7 that suggests why only miR-124 can engage these sites. In summary, access to the miR-124 targetome via miR-CLIP revealed for the first time how heterogeneous processing of miRNAs combined with non-canonical targeting mechanisms expand the regulatory range of a miRNA.
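For background, canonical miRNA targeting hinges on complementarity between the miRNA seed (nucleotides 2-8) and the target site, which is why a 5′-extended isomiR has a shifted seed and a different targetome. The sketch below finds canonical 7mer seed matches; the sequences in the test are made-up toys, not the real miR-124 sequence.

```python
def revcomp(rna):
    # Reverse complement of an RNA string (A-U, G-C pairing).
    comp = {'A': 'U', 'U': 'A', 'G': 'C', 'C': 'G'}
    return ''.join(comp[b] for b in reversed(rna))

def seed_sites(mirna, utr):
    """Return positions in a target sequence (5'->3') that are perfectly
    complementary to the miRNA seed, nucleotides 2-8 (a 7mer match).
    A 5'-extended isomiR shifts this window, changing which sites match."""
    seed_match = revcomp(mirna[1:8])
    return [i for i in range(len(utr) - 6) if utr[i:i + 7] == seed_match]
```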


2020 ◽  
Author(s):  
Marcelo Brandalero ◽  
Luigi Carro ◽  
Antonio Carlos Schneider Beck

With recent changes in transistor scaling trends, the design of all types of processing systems has become increasingly constrained by power consumption. At the same time, driven by the need for fast response times, many applications are migrating from the cloud to the edge, raising the challenge of increasing the performance of these already power-constrained devices. The key to addressing this problem is to design application-specific processors that match the application's requirements exactly and avoid unnecessary energy consumption. However, such dedicated platforms require significant design time and are thus unable to match the pace of the fast-evolving applications deployed in the Internet-of-Things (IoT) every day. Motivated by the need for high energy efficiency and high flexibility in hardware platforms, this thesis paves the way to a new class of low-power adaptive processors that achieve these goals by automatically modifying their structure at run time to match different applications' resource requirements. The proposed Multi-Target Adaptive Reconfigurable Architecture (MuTARe) is based upon a Coarse-Grained Reconfigurable Architecture (CGRA) that can transparently accelerate already-deployed applications, and incorporates novel compute paradigms such as Approximate Computing (AxC) and Near-Threshold Voltage Computing (NTC) to improve its efficiency. Compared to a traditional system of heterogeneous processing cores (similar to ARM's big.LITTLE), the base MuTARe architecture can (without any change to the existing software) improve execution time by up to 1.3×, adapt to the same task deadline with 1.6× lower energy consumption, or adapt to the same low energy budget with 2.3× better performance. When extended for AxC, MuTARe's power savings can be improved by up to a further 50% in error-tolerant applications, and when extended for NTC, MuTARe can save a further 30% of energy in memory-intensive workloads.
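The deadline-versus-energy adaptation described above boils down to a selection problem: among the operating modes available (for example, a nominal-voltage mode and a slower but more efficient near-threshold mode), pick the cheapest one that still meets the deadline. The mode names and numbers below are invented for illustration and do not come from the thesis.

```python
def pick_mode(modes, work, deadline):
    """Pick the operating mode that meets the deadline with the least energy.
    Each mode is (name, throughput in ops/s, power in watts); with fixed
    `work` in ops, energy = power * (work / throughput).
    Returns the mode name, or None if no mode meets the deadline."""
    feasible = [(watts * work / rate, name)
                for name, rate, watts in modes
                if work / rate <= deadline]
    return min(feasible)[1] if feasible else None
```

With slack to spare, the slow low-power mode wins on energy; as the deadline tightens, the selector falls back to the faster, hungrier mode: the same trade-off MuTARe automates at run time.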

