Scalable Phylogeny Reconstruction with Disaggregated Near-memory Processing

2022
Vol 15 (3)
pp. 1-32
Author(s):
Nikolaos Alachiotis
Panagiotis Skrimponis
Manolis Pissadakis
Dionisios Pnevmatikatos

Disaggregated computer architectures eliminate resource fragmentation in next-generation datacenters by enabling virtual machines to employ resources such as CPUs, memory, and accelerators that are physically located on different servers. While this paves the way for highly compute- and/or memory-intensive applications to potentially deploy all CPU and/or memory resources in a datacenter, it poses a major challenge to the efficient deployment of hardware accelerators: input/output data can reside on different servers than the ones hosting accelerator resources, thereby requiring time- and energy-consuming remote data transfers that diminish the gains of hardware acceleration. Targeting a disaggregated datacenter architecture similar to the IBM dReDBox prototype, the present work explores the potential of deploying custom acceleration units adjacent to the FPGA-implemented disaggregated-memory controller on memory bricks (in dReDBox terminology) to reduce data movement and improve performance and energy efficiency when reconstructing large phylogenies (evolutionary relationships among organisms). A fundamental computational kernel is the Phylogenetic Likelihood Function (PLF), which dominates the total execution time (up to 95%) of widely used maximum-likelihood methods. Numerous efforts to boost PLF performance over the years have focused on accelerating computation; however, since the PLF is a data-intensive, memory-bound operation, performance remains limited by data movement, and memory disaggregation only exacerbates the problem. We describe two near-memory processing models: one that addresses the distribution of the workload to memory bricks and is particularly tailored toward larger genomes (e.g., plants and mammals), and one that reduces overall memory requirements through memory-side data interpolation, transparently to the application, thereby allowing the phylogeny size to scale to a larger number of organisms without requiring additional memory.
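
The PLF referenced above is, at its core, the conditional-likelihood recursion of Felsenstein's pruning algorithm. As a rough illustration of why it is memory-bound, the sketch below (ours, not the authors' kernel) shows the per-site update at one internal node for four DNA states, assuming precomputed branch transition matrices Pl and Pr (names ours): each alignment site streams large likelihood vectors through only a handful of multiply-accumulates.

```c
#include <stddef.h>

/* Per-site conditional-likelihood update at an internal node with two
 * children (Felsenstein pruning), 4 DNA states. Pl/Pr are the branch
 * transition-probability matrices; ll/lr the children's likelihoods. */
static void plf_site_update(const double Pl[4][4], const double Pr[4][4],
                            const double ll[4], const double lr[4],
                            double out[4])
{
    for (int i = 0; i < 4; i++) {
        double a = 0.0, b = 0.0;
        for (int j = 0; j < 4; j++) {
            a += Pl[i][j] * ll[j];   /* contribution of left child  */
            b += Pr[i][j] * lr[j];   /* contribution of right child */
        }
        out[i] = a * b;
    }
}

/* An alignment applies this at every inner node for millions of sites, so
 * the vectors dominate traffic: roughly 68 flops per site against 64 bytes
 * read and 32 written (the small matrices stay cached). */
void plf_update(size_t nsites, const double Pl[4][4], const double Pr[4][4],
                const double (*ll)[4], const double (*lr)[4], double (*out)[4])
{
    for (size_t s = 0; s < nsites; s++)
        plf_site_update(Pl, Pr, ll[s], lr[s], out[s]);
}
```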

2020
Author(s):
Ameer Haj-Ali
Nimrod Wald
Ronny Ronen
Shahar Kvatinsky
Rotem Ben-Hur

Data movement between processing and memory is the root cause of the limited performance and energy efficiency in modern von Neumann systems. To overcome the data-movement bottleneck, we present the memristive Memory Processing Unit (mMPU), a real processing-in-memory system in which the computation is done directly in the memory cells, thus eliminating the necessity for data transfer. Furthermore, with its enormous inner parallelism, this system is ideal for data-intensive applications that are based on single instruction, multiple data (SIMD), providing high throughput and energy efficiency.
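
The mMPU line of work builds on stateful in-array logic gates such as MAGIC NOR, applied across entire memory rows at once. The fragment below is only a functional model of that row-parallel behavior, written as ordinary C over 64-bit words; in the real system each bit position is a memristor cell and the NOR happens inside the array.

```c
#include <stdint.h>
#include <stddef.h>

/* Functional model of a row-parallel NOR: every bit of 'out' is computed
 * from the corresponding bits of rows 'a' and 'b' in one logical step. */
void row_nor(const uint64_t *a, const uint64_t *b, uint64_t *out, size_t words)
{
    for (size_t w = 0; w < words; w++)
        out[w] = ~(a[w] | b[w]);
}

/* NOR is functionally complete, so wider operations compose from it; e.g.,
 * a row-parallel AND in three NOR steps (t1/t2 model scratch rows). */
void row_and_via_nor(const uint64_t *a, const uint64_t *b, uint64_t *out,
                     uint64_t *t1, uint64_t *t2, size_t words)
{
    row_nor(a, a, t1, words);    /* t1 = ~a                  */
    row_nor(b, b, t2, words);    /* t2 = ~b                  */
    row_nor(t1, t2, out, words); /* out = ~(~a | ~b) = a & b */
}
```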


Author(s):  
Daqi Lin
Elena Vasiou
Cem Yuksel
Daniel Kopta
Erik Brunvand

Bounding volume hierarchies (BVHs) are the most widely used acceleration structures for ray tracing due to their high construction and traversal performance. However, the bounding planes shared between parent and child bounding boxes are an inherent storage redundancy that limits further performance improvement because of the memory cost of reading these redundant planes. Dual-split trees can create a space partitioning identical to that of BVHs, but in a compact form that uses less memory by eliminating the redundancies of the BVH representation. This reduction in storage and data movement translates to faster ray traversal and better energy efficiency. Yet the performance benefits of dual-split trees are undermined by the processing required to extract the necessary information from their compact representation, which involves bit manipulations and branching instructions that are inefficient in software. We introduce hardware acceleration for dual-split trees and show that their performance advantages over BVHs are emphasized in a hardware ray tracing context that can take advantage of such acceleration. We provide details on how the operations needed for decoding dual-split tree nodes can be implemented in hardware, and present path-tracing experiments in a number of scenes of different sizes. In our experiments, we observe up to 31% reduction in render time and 38% energy saving using dual-split trees as compared to binary BVHs representing identical space partitioning.
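
For context on the memory cost at stake: standard BVH traversal performs a slab test against each child box, reading six bounding planes per box, as in the textbook sketch below (ours, not the paper's hardware unit). A dual-split tree encodes the same partitioning with roughly two planes plus a few tag bits per node, which is where the bandwidth saving comes from.

```c
#include <stdbool.h>

typedef struct { float o[3], d[3]; } Ray;   /* origin, direction */

/* Classic slab test: intersect a ray with an axis-aligned box given its
 * six bounding planes (bmin/bmax), over the parametric range [tmin, tmax]. */
bool hit_aabb(const Ray *r, const float bmin[3], const float bmax[3],
              float tmin, float tmax)
{
    for (int a = 0; a < 3; a++) {
        float inv = 1.0f / r->d[a];          /* precomputed in practice */
        float t0 = (bmin[a] - r->o[a]) * inv;
        float t1 = (bmax[a] - r->o[a]) * inv;
        if (inv < 0.0f) { float t = t0; t0 = t1; t1 = t; }
        if (t0 > tmin) tmin = t0;
        if (t1 < tmax) tmax = t1;
        if (tmin > tmax) return false;       /* slabs do not overlap */
    }
    return true;
}
```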


Author(s):  
Valentin Tablan
Ian Roberts
Hamish Cunningham
Kalina Bontcheva

Cloud computing is increasingly regarded as a key enabler of the 'democratization of science', because on-demand, highly scalable cloud computing facilities enable researchers anywhere to carry out data-intensive experiments. In the context of natural language processing (NLP), algorithms tend to be complex, which makes their parallelization and deployment on cloud platforms a non-trivial task. This study presents a new, unique, cloud-based platform for large-scale NLP research, GATECloud.net. It enables researchers to carry out data-intensive NLP experiments by harnessing the vast, on-demand compute power of the Amazon cloud. Important infrastructural issues are dealt with by the platform, completely transparently for the researcher: load balancing, efficient data upload and storage, deployment on virtual machines, security, and fault tolerance. We also include a cost–benefit analysis and usage evaluation.


Author(s):  
Kurmachalam Ajay Kumar
Saritha Vemuri
Ralla Suresh

High-speed bulk data transfer is an important part of many data-intensive scientific applications. TCP fails to sustain transfers of large amounts of data over long-distance, high-speed dedicated network links, because the end-system hardware cannot saturate the bandwidth the network supports, which leads to buffer overflow and packet loss. To overcome this, we propose a Performance Adaptive UDP (PA-UDP) protocol that dynamically maximizes performance across different systems. A mathematical model and accompanying algorithms are used for effective buffer and CPU management. PA-UDP outperforms other protocols by managing memory, packet loss, and CPU utilization, sustaining high-speed bulk data transfer over dedicated network links.
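
The buffer-management idea can be made concrete with a toy version of the rate constraint (our reconstruction, not the paper's exact model): if the receiver drains its buffer to disk at rate d and the buffer holds B bytes, a transfer of S bytes avoids overflow only when the sending rate r satisfies (r - d) * S / r <= B.

```c
#include <math.h>

/* Toy loss-avoidance bound in the spirit of PA-UDP's buffer management:
 * backlog grows at (r - d) for the S/r seconds the transfer lasts, so the
 * final backlog S - d*S/r must fit in the receive buffer B. Solving for r
 * gives r <= d / (1 - B/S) whenever S > B. Rates in bytes per second. */
double max_loss_free_rate(double buffer_bytes, double disk_rate_bps,
                          double transfer_bytes)
{
    if (transfer_bytes <= buffer_bytes)
        return INFINITY;   /* the whole transfer fits in the buffer */
    return disk_rate_bps / (1.0 - buffer_bytes / transfer_bytes);
}
```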


Cloud computing, one of the fastest growing fields, is the delivery of computing resources and services. Load balancing is a key problem in cloud computing (CC) that deals with the even distribution of workload across multiple virtual machines, ensuring that no machine is overloaded or underutilized during task computation. Load-balancing optimization is an NP-hard problem; hence, for the optimal usage of available resources, we propose a new, efficient user-priority multi-agent genetic algorithm (GA). Our algorithm takes the users' priority and earliest job finishing time into consideration to minimize response time and energy. We simulate our algorithm using Cloud-Analyst and show that it outperforms existing load-balancing algorithms.
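
To make the GA concrete, a chromosome can be a task-to-VM assignment, and the fitness can reward both a small makespan and early completion of high-priority users' jobs. The sketch below is a hypothetical fitness function under that encoding; all names and the weighting scheme are ours, not the paper's.

```c
#define MAX_VMS 64

/* Hypothetical GA fitness: assign[i] is the VM running task i; len[i] its
 * length (instructions), prio[i] the user's priority weight, vm_mips[v]
 * the VM's speed. Tasks run in the given order on their assigned VM. */
double fitness(const int *assign, const double *len, const double *prio,
               const double *vm_mips, int ntasks, int nvms)
{
    double finish[MAX_VMS] = { 0.0 };  /* per-VM completion time so far */
    double prio_cost = 0.0;

    for (int i = 0; i < ntasks; i++) {
        int v = assign[i];
        finish[v] += len[i] / vm_mips[v];   /* task ends when its VM frees up */
        prio_cost += prio[i] * finish[v];   /* penalize late high-priority jobs */
    }
    double makespan = 0.0;
    for (int v = 0; v < nvms; v++)
        if (finish[v] > makespan) makespan = finish[v];

    return 1.0 / (makespan + prio_cost);    /* higher fitness is better */
}
```

A GA would then evolve a population of such assignments through selection, crossover, and mutation, keeping the highest-fitness schedules.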


2022
Vol 15 (2)
pp. 1-31
Author(s):
Joel Mandebi Mbongue
Danielle Tchuinkou Kwadjo
Alex Shuping
Christophe Bobda

Cloud deployments now increasingly exploit Field-Programmable Gate Array (FPGA) accelerators as part of virtual instances. While cloud FPGAs are still essentially single-tenant, the growing demand for efficient hardware acceleration paves the way to FPGA multi-tenancy. It then becomes necessary to explore architectures, design flows, and resource management features that expose multi-tenant FPGAs to cloud users. In this article, we discuss a hardware/software architecture that supports provisioning space-shared FPGAs in Kernel-based Virtual Machine (KVM) clouds. The proposed architecture introduces an FPGA organization that improves hardware consolidation and supports hardware elasticity with minimal data movement overhead. It also relies on VirtIO to decrease communication latency between the hardware and software domains. Prototyping the architecture on a Virtex UltraScale+ FPGA demonstrated near-specification maximum frequency for on-chip data movement and high throughput for virtual-instance access to hardware accelerators. We demonstrate performance similar to single-tenant deployment while increasing FPGA utilization, which is one of the goals of virtualization. Overall, our FPGA design achieved about 2× higher maximum frequency than the state of the art and a bandwidth of up to 28 Gbps at a 32-bit data width.
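
The VirtIO link between the software and hardware domains is essentially a shared-memory descriptor queue. The fragment below sketches what a guest-side submission path could look like under that model; the layout and field names are ours for illustration, not the VirtIO specification or the authors' driver.

```c
#include <stdint.h>

#define RING_SIZE 256u   /* must be a power of two */

/* Hypothetical guest-to-host request descriptor for an FPGA accelerator. */
typedef struct {
    uint64_t dma_addr;   /* guest-physical address of the input buffer  */
    uint32_t len;        /* buffer length in bytes                      */
    uint32_t accel_id;   /* which space-shared accelerator slot to use  */
} Desc;

typedef struct {
    Desc ring[RING_SIZE];
    volatile uint32_t head;   /* producer index, written by the guest */
    volatile uint32_t tail;   /* consumer index, written by the host  */
} Queue;

/* Publish one request; a real transport also needs memory barriers and a
 * doorbell/kick to notify the host side. Returns -1 when the ring is full. */
int submit(Queue *q, Desc d)
{
    uint32_t h = q->head;
    if (h - q->tail == RING_SIZE)
        return -1;                     /* ring full, try again later  */
    q->ring[h % RING_SIZE] = d;        /* write the descriptor first  */
    q->head = h + 1;                   /* ... then advance the index  */
    return 0;
}
```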


2020
Vol 10 (4)
pp. 30
Author(s):
Kamil Khan
Sudeep Pasricha
Ryan Gary Kim

Due to the amount of data involved in emerging deep learning and big data applications, operations related to data movement have quickly become a bottleneck. Data-centric computing (DCC), as enabled by processing-in-memory (PIM) and near-memory processing (NMP) paradigms, aims to accelerate these types of applications by moving the computation closer to the data. Over the past few years, researchers have proposed various memory architectures that enable DCC systems, such as logic layers in 3D-stacked memories or charge-sharing-based bitwise operations in dynamic random-access memory (DRAM). However, application-specific memory access patterns, power and thermal concerns, memory technology limitations, and inconsistent performance gains complicate the offloading of computation in DCC systems. Therefore, designing intelligent resource management techniques for computation offloading is vital for leveraging the potential offered by this new paradigm. In this article, we survey the major trends in managing PIM- and NMP-based DCC systems and review the landscape of resource management techniques employed by system designers for such systems. Additionally, we discuss future challenges and opportunities in DCC management.
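
A recurring theme in these management techniques is a cost model that decides when offloading pays off. A first-order version (ours, purely illustrative) compares host execution time, dominated by moving the data over the memory bus, against slower but movement-free near-memory execution.

```c
#include <stdbool.h>

/* First-order offload test: run the kernel near memory only if avoiding
 * the data movement outweighs the (typically lower) PIM/NMP compute rate.
 * Bandwidth in bytes/s, compute rates in ops/s; values come from profiling. */
bool should_offload(double bytes_touched, double host_mem_bw,
                    double kernel_ops, double host_ops_rate,
                    double pim_ops_rate)
{
    double host_time = bytes_touched / host_mem_bw   /* move data to CPU */
                     + kernel_ops / host_ops_rate;   /* then compute     */
    double pim_time  = kernel_ops / pim_ops_rate;    /* data stays put   */
    return pim_time < host_time;
}
```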

