Not in Name Alone: A Memristive Memory Processing Unit for Real In-Memory Processing

2020 ◽
Author(s):
Ameer Haj-Ali ◽  
Nimrod Wald ◽  
Ronny Ronen ◽  
Shahar Kvatinsky ◽  
Rotem Ben-Hur

Data movement between processing and memory is the root cause of the limited performance and energy efficiency in modern von Neumann systems. To overcome the data-movement bottleneck, we present the memristive Memory Processing Unit (mMPU)—a real processing-in-memory system in which the computation is done directly in the memory cells, thus eliminating the necessity for data transfer. Furthermore, with its enormous inner parallelism, this system is ideal for data-intensive applications that are based on single instruction, multiple data (SIMD)—providing high throughput and energy efficiency.
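
To make the idea concrete, here is a minimal Python sketch (not the authors' mMPU design) of stateful in-memory logic: a toy crossbar in which a single column-wise NOR instruction is applied to every row at once, so the array's height becomes the SIMD width. The `Crossbar` class and its method names are hypothetical.

```python
import numpy as np

class Crossbar:
    """Toy model of a memristive crossbar where each cell stores one bit.

    A hypothetical stand-in for the mMPU's array: real designs compute
    NOR electrically across memristor pairs (e.g., MAGIC-style stateful
    logic) rather than with NumPy."""

    def __init__(self, rows, cols, seed=0):
        rng = np.random.default_rng(seed)
        self.cells = rng.integers(0, 2, size=(rows, cols), dtype=np.uint8)

    def nor_columns(self, col_a, col_b, col_out):
        # A single "instruction" targets whole columns, so every row
        # computes simultaneously (the SIMD parallelism the abstract
        # describes), and no operand ever leaves the array.
        a = self.cells[:, col_a]
        b = self.cells[:, col_b]
        self.cells[:, col_out] = 1 - np.bitwise_or(a, b)

xbar = Crossbar(rows=1024, cols=8)
xbar.nor_columns(col_a=0, col_b=1, col_out=7)  # 1024 NOR gates in one step
```

Since NOR is functionally complete, any Boolean function can in principle be composed from such column operations without moving the operands out of the array.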


Author(s):  
Kurmachalam Ajay Kumar ◽  
Saritha Vemuri ◽  
Ralla Suresh

High-speed bulk data transfer is an important part of many data-intensive scientific applications. TCP fails when transferring large amounts of data over long distances across high-speed dedicated network links, because the system hardware is incapable of saturating the bandwidth supported by the network, giving rise to buffer overflow and packet loss. To overcome this, there is a need for a Performance Adaptive-UDP (PA-UDP) protocol that dynamically maximizes performance on different systems. A mathematical model and algorithms are used for effective buffer and CPU management. PA-UDP outperforms other protocols in memory management, packet-loss handling, and CPU utilization. Based on this protocol, bulk data transfer proceeds at high speed over dedicated network links.
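
As an illustration of the rate-adaptation idea, here is a toy sender sketch; it is not the published PA-UDP algorithm, and the loss-feedback hook below is a hypothetical stand-in for its control channel.

```python
import socket
import time

CHUNK = 1400                    # stay under a typical Ethernet MTU
INITIAL_RATE_BPS = 800_000_000  # assumed starting rate, not from the paper
BACKOFF = 0.8                   # multiplicative rate cut on reported loss

def receiver_reported_loss():
    # Hypothetical stand-in for PA-UDP's feedback channel; a real
    # implementation would read loss reports over a TCP control socket.
    return False

def send_file(path, host, port):
    """Pace UDP datagrams toward a target rate, backing off when the
    receiver reports loss -- a toy version of buffer-aware rate
    adaptation, not PA-UDP's actual mathematical model."""
    rate = INITIAL_RATE_BPS
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    seq = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            # An 8-byte sequence number lets the receiver detect loss.
            sock.sendto(seq.to_bytes(8, "big") + chunk, (host, port))
            seq += 1
            time.sleep(len(chunk) * 8 / rate)  # pacing avoids bursts that
                                               # overflow receive buffers
            if receiver_reported_loss():
                rate *= BACKOFF

# Usage (hypothetical endpoint): send_file("data.bin", "10.0.0.2", 9000)
```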


Science ◽  
2019 ◽  
Vol 366 (6462) ◽  
pp. 210-215 ◽  
Author(s):  
Keyuan Ding ◽  
Jiangjing Wang ◽  
Yuxing Zhou ◽  
He Tian ◽  
Lu Lu ◽  
...  

Artificial intelligence and other data-intensive applications have escalated the demand for data storage and processing. New computing devices, such as phase-change random access memory (PCRAM)–based neuro-inspired devices, are promising options for breaking the von Neumann barrier by unifying storage with computing in memory cells. However, current PCRAM devices have considerable noise and drift in electrical resistance that erode the precision and consistency of these devices. We designed a phase-change heterostructure (PCH) that consists of alternately stacked phase-change and confinement nanolayers to suppress the noise and drift, allowing reliable iterative RESET and cumulative SET operations for high-performance neuro-inspired computing. Our PCH architecture is amenable to industrial production as an intrinsic materials solution, without complex manufacturing procedures or significantly increased fabrication cost.
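
A toy model of the cumulative SET operation that the heterostructure makes reliable might look as follows; the conductance range, level count, and noise magnitude here are illustrative assumptions, not device data from the paper.

```python
import numpy as np

def program_weight(target_g, n_levels=16, g_min=1.0, g_max=50.0,
                   noise_sigma=0.02, seed=0):
    """Apply cumulative SET pulses until the cell conductance reaches the
    target level -- a toy model of analog weight programming for
    neuro-inspired computing; all numbers are illustrative."""
    rng = np.random.default_rng(seed)
    step = (g_max - g_min) / n_levels
    g, pulses = g_min, 0
    while g < target_g and pulses < n_levels:
        # Each pulse nudges conductance upward; noise_sigma stands in for
        # the resistance fluctuations the heterostructure suppresses.
        g += step * (1 + rng.normal(0.0, noise_sigma))
        pulses += 1
    return g, pulses

g, n = program_weight(target_g=30.0)   # smaller noise_sigma -> more
                                       # reliable multilevel storage
```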


2020 ◽  
Author(s):  
Navjot Kukreja ◽  
Jan Hückelheim ◽  
Mathias Louboutin ◽  
John Washbourne ◽  
Paul H. J. Kelly ◽  
...  

This paper proposes a new method that combines checkpointing methods with error-controlled lossy compression for large-scale high-performance Full-Waveform Inversion (FWI), an inverse problem commonly used in geophysical exploration. This combination can significantly reduce data movement, allowing a reduction in run time as well as peak memory. In the Exascale computing era, frequent data transfer (e.g., memory bandwidth, PCIe bandwidth for GPUs, or network) is the performance bottleneck rather than the peak FLOPS of the processing unit. Like many other adjoint-based optimization problems, FWI is costly in terms of the number of floating-point operations, large memory footprint during backpropagation, and data transfer overheads. Past work for adjoint methods has developed checkpointing methods that reduce the peak memory requirements during backpropagation at the cost of additional floating-point computations. Combining this traditional checkpointing with error-controlled lossy compression, we explore the three-way tradeoff between memory, precision, and time to solution. We investigate how approximation errors introduced by lossy compression of the forward solution impact the objective function gradient and final inverted solution. Empirical results from these numerical experiments indicate that high lossy-compression rates (compression factors ranging up to 100) have a relatively minor impact on convergence rates and the quality of the final solution.
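
The structure of the combined scheme can be sketched in a few lines. In this sketch the float16 cast is a stand-in for an error-controlled compressor such as ZFP (which it is not), `forward_step` is a placeholder for the actual wave-equation kernel, and every step is stored compressed rather than interleaving Revolve-style recomputation as the paper's full scheme does.

```python
import numpy as np

def compress(field):
    # Stand-in for an error-controlled compressor such as ZFP; a float16
    # cast is NOT error-controlled, it only illustrates trading precision
    # for a ~4x smaller checkpoint.
    return field.astype(np.float16)

def decompress(blob):
    return blob.astype(np.float64)

def forward_step(u, dt=1e-3):
    # Placeholder for the wave-equation time step of the real FWI kernel.
    return np.roll(u, 1) * (1.0 - dt)

def adjoint_pass(n_steps, u0):
    """Forward solve storing only compressed checkpoints, then a reverse
    sweep that consumes them -- the structure (not the physics) of
    checkpointed backpropagation over lossy storage."""
    u, checkpoints = u0, []
    for _ in range(n_steps):
        u = forward_step(u)
        checkpoints.append(compress(u))   # peak memory shrinks here
    grad = np.zeros_like(u0)
    for blob in reversed(checkpoints):
        grad += decompress(blob)          # the adjoint sees lossy fields
    return grad

grad = adjoint_pass(n_steps=100, u0=np.random.rand(1 << 16))
```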


2022 ◽  
Vol 15 (3) ◽  
pp. 1-32
Author(s):  
Nikolaos Alachiotis ◽  
Panagiotis Skrimponis ◽  
Manolis Pissadakis ◽  
Dionisios Pnevmatikatos

Disaggregated computer architectures eliminate resource fragmentation in next-generation datacenters by enabling virtual machines to employ resources such as CPUs, memory, and accelerators that are physically located on different servers. While this paves the way for highly compute- and/or memory-intensive applications to potentially deploy all CPU and/or memory resources in a datacenter, it poses a major challenge to the efficient deployment of hardware accelerators: input/output data can reside on different servers than the ones hosting accelerator resources, thereby requiring time- and energy-consuming remote data transfers that diminish the gains of hardware acceleration. Targeting a disaggregated datacenter architecture similar to the IBM dReDBox prototype, the present work explores the potential of deploying custom acceleration units, implemented in FPGA technology, adjacent to the disaggregated-memory controller on memory bricks (in dReDBox terminology) to reduce data movement and improve performance and energy efficiency when reconstructing large phylogenies (evolutionary relationships among organisms). A fundamental computational kernel is the Phylogenetic Likelihood Function (PLF), which dominates the total execution time (up to 95%) of widely used maximum-likelihood methods. Numerous efforts to boost PLF performance over the years have focused on accelerating computation; since the PLF is a data-intensive, memory-bound operation, however, performance remains limited by data movement, and memory disaggregation only exacerbates the problem. We describe two near-memory processing models: one that addresses the problem of workload distribution to memory bricks and is particularly tailored toward larger genomes (e.g., plants and mammals), and one that reduces overall memory requirements through memory-side data interpolation, transparently to the application, thereby allowing the phylogeny size to scale to a larger number of organisms without requiring additional memory.
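
The PLF kernel itself is a simple post-order contraction (Felsenstein's pruning algorithm). A minimal NumPy sketch of one tree-node step, with toy transition matrices, shows why it is bandwidth-bound rather than compute-bound:

```python
import numpy as np

def plf_node(left_cl, right_cl, p_left, p_right):
    """One post-order step of Felsenstein's pruning algorithm: combine the
    conditional likelihoods (sites x 4 DNA states) of two children through
    their branch transition matrices. Two tall-skinny matrix products and
    an element-wise multiply per node give low arithmetic intensity, so
    memory bandwidth dominates -- the property near-memory placement
    exploits."""
    return (left_cl @ p_left.T) * (right_cl @ p_right.T)

rng = np.random.default_rng(1)
sites = 100_000
left = rng.random((sites, 4))
right = rng.random((sites, 4))
P = np.full((4, 4), 0.05) + 0.8 * np.eye(4)   # toy transition matrix
parent = plf_node(left, right, P, P)
```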


2017 ◽  
Author(s):  
Fenglai Liu ◽  
Jing Kong

In this work, we present an efficient semi-numerical integral implementation specially designed for the Intel Phi processor to calculate the Hartree-Fock exchange matrix and energy. Compared with an implementation for the CPU platform, a productive implementation must focus on efficient utilization of the SIMD (single instruction, multiple data) processing units and maximal cache usage on the Phi processor. To evaluate the efficiency of the implementation, we performed benchmark calculations on the buckyball molecules C60, C100, C180, and C240. For calculations with the 6-311G(2df) and cc-pVTZ basis sets, the benchmarks show a speedup of 7-12 times on the Knights Landing Phi 7250 processor in comparison with a traditional four-center electron repulsion integral calculation performed on a six-core Xeon E5-1650 CPU.
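
The contraction pattern of a semi-numerical exchange build can be sketched as below. This follows the generic pseudospectral factorization rather than the authors' Phi-specific code, and the analytic Coulomb-potential integrals are replaced by random placeholders.

```python
import numpy as np

def seminumerical_exchange(D, X, A):
    """Pseudospectral-style exchange build in three contractions:
        F[g, l] = sum_s X[g, s] * D[l, s]     (numerical half, on the grid)
        G[g, n] = sum_l A[g, n, l] * F[g, l]  (analytic half)
        K[m, n] = sum_g X[g, m] * G[g, n]
    X holds weighted basis-function values at grid points; A holds the
    analytic Coulomb-potential integrals (random placeholders below).
    These dense, regular loops are what map onto wide SIMD units."""
    F = X @ D.T                          # (ngrid, nbf)
    G = np.einsum('gnl,gl->gn', A, F)    # (ngrid, nbf)
    return X.T @ G                       # (nbf, nbf)

rng = np.random.default_rng(0)
nbf, ngrid = 40, 2000
D = rng.random((nbf, nbf)); D = 0.5 * (D + D.T)   # symmetric toy density
X = rng.random((ngrid, nbf))
A = rng.random((ngrid, nbf, nbf))
K = seminumerical_exchange(D, X, A)
```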


Sensors ◽  
2021 ◽  
Vol 21 (5) ◽  
pp. 1709
Author(s):  
Agbotiname Lucky Imoize ◽  
Oluwadara Adedeji ◽  
Nistha Tandiya ◽  
Sachin Shetty

The 5G wireless communication network currently faces the challenge of limited data speed, exacerbated by the proliferation of billions of data-intensive applications. To address this problem, researchers are developing cutting-edge technologies for the envisioned 6G wireless communication standards to satisfy the escalating demand for wireless services. Though some of the candidate technologies in the 5G standards will apply to 6G wireless networks, key disruptive technologies that will guarantee the desired quality of physical experience to achieve ubiquitous wireless connectivity are expected in 6G. This article first provides a foundational background on the evolution of different wireless communication standards to give proper insight into the vision and requirements of 6G. Second, we provide a panoramic view of the enabling technologies proposed to facilitate 6G and introduce emerging 6G applications such as multi-sensory extended reality, digital replicas, and more. Next, the technology-driven challenges, the social, psychological, health, and commercialization issues involved in actualizing 6G, and probable solutions to these challenges are discussed extensively. Additionally, we present new use cases of 6G technology in agriculture, education, media and entertainment, logistics and transportation, and tourism. Furthermore, we discuss the multi-faceted communication capabilities of 6G that will contribute significantly to global sustainability, and how 6G will bring about a dramatic change in the business arena. Finally, we highlight research trends, open research issues, and key take-away lessons for future research exploration in 6G wireless communication.


2021 ◽  
Vol 55 (1) ◽  
pp. 88-98
Author(s):  
Mohammed Islam Naas ◽  
François Trahay ◽  
Alexis Colin ◽  
Pierre Olivier ◽  
Stéphane Rubini ◽  
...  

Tracing is a popular method for evaluating, investigating, and modeling the performance of today's storage systems. Tracing has become crucial with the increasing complexity of modern storage applications and systems, which manipulate an ever-increasing amount of data and are subject to extreme performance requirements. There exist many tracing tools focusing on either the user level or the kernel level; however, we observe the lack of a unified tracer targeting both levels, which prevents a comprehensive understanding of modern applications' storage performance profiles. In this paper, we present EZIOTracer, a unified I/O tracer for both (Linux) kernel and user space, targeting data-intensive applications. EZIOTracer is composed of a userland tracer and a kernel-space tracer, complemented by a trace analysis framework able to merge the output of the two tracers and, in particular, to relate user-level events to kernel-level ones and vice versa. On the kernel side, EZIOTracer relies on eBPF to offer safe, low-overhead, low-memory-footprint, and flexible tracing capabilities. We demonstrate, using the FIO benchmark, the ability of EZIOTracer to track down I/O performance issues by relating events recorded at both the kernel and user levels. We show that this can be achieved with a relatively low overhead, ranging from 2% to 26% depending on the I/O intensity.
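
As a toy illustration of the user-level half of such a tracer (not EZIOTracer's actual record format), one can timestamp I/O calls and emit records keyed so that a merge tool could later correlate them with kernel-side eBPF events:

```python
import json
import os
import sys
import time

def traced_read(fd, size, log=sys.stderr):
    """Wrap os.read with a timestamped trace record -- a toy user-level
    tracer only. A unified tracer like the paper's would emit records a
    merge tool can correlate with kernel events, e.g. by (pid, fd,
    timestamp); this JSON record format is invented for illustration."""
    t0 = time.monotonic_ns()
    data = os.read(fd, size)
    log.write(json.dumps({
        "event": "read",
        "pid": os.getpid(),
        "fd": fd,
        "bytes": len(data),
        "t_start_ns": t0,
        "latency_ns": time.monotonic_ns() - t0,
    }) + "\n")
    return data

fd = os.open("/etc/hostname", os.O_RDONLY)  # any readable file works
buf = traced_read(fd, 4096)
os.close(fd)
```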

