Data Locality: Latest Research Papers

ReuseTracker: Fast Yet Accurate Multicore Reuse Distance Analyzer

One widely used metric for data locality is reuse distance: the number of unique memory locations accessed between two consecutive accesses to a particular memory location. State-of-the-art techniques that measure reuse distance in parallel applications rely on simulators or binary instrumentation tools that incur large performance and memory overheads. Moreover, existing sampling-based tools are limited to measuring the reuse distances of a single thread and discard interactions among threads in multi-threaded programs. In this work, we propose ReuseTracker, a fast and accurate reuse distance analyzer that leverages existing hardware features in commodity CPUs. ReuseTracker is designed for multi-threaded programs and takes cache-coherence effects into account. By utilizing hardware features like performance monitoring units and debug registers, ReuseTracker can accurately profile reuse distance in parallel applications with much lower overheads than existing tools. It introduces only 2.9× runtime and 2.8× memory overheads. Our tool achieves 92% accuracy when verified against a newly developed configurable benchmark that can generate a variety of different reuse distance patterns. We demonstrate the tool’s functionality with two use-case scenarios using the PARSEC, Rodinia, and Synchrobench benchmark suites. In the first, ReuseTracker guides code refactoring in these benchmarks by detecting spatial reuses in shared caches that are also false sharing; in the second, it successfully predicts whether some benchmarks in these suites can benefit from adjacent cache line prefetch optimization.
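
The definition of reuse distance given in the abstract can be illustrated with a small sketch. This is a naive, single-threaded, offline computation over a recorded trace, not ReuseTracker's hardware-assisted sampling approach; the function name `reuse_distances` and the trace format are assumptions made for illustration only.

```python
from typing import Hashable, Iterable, List, Optional


def reuse_distances(trace: Iterable[Hashable]) -> List[Optional[int]]:
    """For each access, count the distinct locations touched since the
    previous access to the same location (None for a first access)."""
    last_seen = {}   # location -> index of its most recent access
    history = []     # accesses seen so far, in order
    result = []
    for i, loc in enumerate(trace):
        if loc in last_seen:
            # Accesses strictly between the previous use and this one.
            between = history[last_seen[loc] + 1:i]
            result.append(len(set(between)))
        else:
            result.append(None)
        last_seen[loc] = i
        history.append(loc)
    return result


if __name__ == "__main__":
    print(reuse_distances(["a", "b", "c", "a", "b", "b"]))
```

The quadratic cost of this version is acceptable only for short traces; it is meant to make the metric concrete, not to be used as a profiler.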

Big Data Workflows: Locality-Aware Orchestration Using Software Containers

Sensors ◽ 10.3390/s21248212 ◽ 2021 ◽ Vol 21 (24) ◽ pp. 8212

Author(s): Andrei-Alin Corodescu ◽ Nikolay Nikolov ◽ Akif Quddus Khan ◽ Ahmet Soylu ◽ Mihhail Matskin ◽ ...

Keyword(s): Big Data ◽ Data Processing ◽ Data Locality ◽ Computing Paradigm ◽ Limited Support ◽ Execution Speed ◽ Significant Performance ◽ Geographically Distributed ◽ Workflow Orchestration ◽ Data Centres

The emergence of the edge computing paradigm has shifted data processing from centralised infrastructures to heterogeneous and geographically distributed infrastructures. Therefore, data processing solutions must consider data locality to reduce the performance penalties from data transfers among remote data centres. Existing big data processing solutions provide limited support for handling data locality and are inefficient in processing small and frequent events specific to the edge environments. This article proposes a novel architecture and a proof-of-concept implementation for software container-centric big data workflow orchestration that puts data locality at the forefront. The proposed solution considers the available data locality information, leverages long-lived containers to execute workflow steps, and handles the interaction with different data sources through containers. We compare the proposed solution with Argo workflows and demonstrate a significant performance improvement in the execution speed for processing the same data units. Finally, we carry out experiments with the proposed solution under different configurations and analyze individual aspects affecting the performance of the overall solution.
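
The idea of letting data locality drive where a workflow step runs can be sketched as follows. This is a hypothetical placement rule, not the architecture proposed in the article: the function `pick_node`, the `holdings` map (node to data units stored there), and the `load` map (node to number of running steps) are all assumptions introduced for illustration.

```python
from typing import Dict, Optional, Set


def pick_node(data_unit: str,
              holdings: Dict[str, Set[str]],
              load: Dict[str, int]) -> Optional[str]:
    """Prefer the least-loaded node that already stores data_unit;
    otherwise fall back to the least-loaded node overall."""
    if not load:
        return None
    local = [n for n in load if data_unit in holdings.get(n, set())]
    candidates = local or list(load)
    # Ties on load are broken by node name to keep the choice deterministic.
    return min(candidates, key=lambda n: (load[n], n))


if __name__ == "__main__":
    holdings = {"edge-1": {"x"}, "cloud-1": {"y"}}
    load = {"edge-1": 5, "cloud-1": 1}
    print(pick_node("x", holdings, load))  # node that holds "x"
    print(pick_node("z", holdings, load))  # no local copy, least loaded wins
```

A real orchestrator would also weigh transfer cost against queueing delay; this sketch always favours locality when a local copy exists.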

LolliRAM: A Cross-Layer Design to Exploit Data Locality in Oblivious RAM

10.1109/dac18074.2021.9586126 ◽ 2021

Author(s): Yi Wang ◽ Weixuan Chen ◽ Xianhua Wang ◽ Rui Mao

Keyword(s): Data Locality ◽ Cross Layer ◽ Cross Layer Design ◽ Oblivious Ram

Improving the Performance of Deduplication-Based Backup Systems via Container Utilization Based Hot Fingerprint Entry Distilling

ACM Transactions on Storage ◽ 10.1145/3459626 ◽ 2021 ◽ Vol 17 (4) ◽ pp. 1-23

Author(s): Datong Zhang ◽ Yuhui Deng ◽ Yi Zhou ◽ Yifeng Zhu ◽ Xiao Qin

Keyword(s): High Efficiency ◽ False Positive Rate ◽ Bloom Filter ◽ Data Locality ◽ Data Deduplication ◽ Backup System ◽ Memory Overhead ◽ Data Fragmentation ◽ Positive Rate ◽ Salient Features

Data deduplication techniques construct an index consisting of fingerprint entries to identify and eliminate duplicated copies of repeating data. The bottleneck of disk-based index lookup and the data fragmentation caused by eliminating duplicated chunks are two challenging issues in data deduplication. Deduplication-based backup systems generally employ containers that store contiguous chunks together with their fingerprints to preserve data locality and alleviate the two issues, but this is still inadequate. To address these two issues, we propose a container utilization based hot fingerprint entry distilling strategy to improve the performance of deduplication-based backup systems. We divide the index into three parts: hot fingerprint entries, fragmented fingerprint entries, and useless fingerprint entries. A container with utilization smaller than a given threshold is called a sparse container. Fingerprint entries that point to non-sparse containers are hot fingerprint entries. For the remaining fingerprint entries, if a fingerprint entry matches any fingerprint of forthcoming backup chunks, it is classified as a fragmented fingerprint entry. Otherwise, it is classified as a useless fingerprint entry. We observe that hot fingerprint entries account for a small part of the index, whereas the remaining fingerprint entries account for the majority of the index. This intriguing observation inspires us to develop a hot fingerprint entry distilling approach named HID. HID segregates useless fingerprint entries from the index to improve memory utilization and bypass disk accesses. In addition, HID separates fragmented fingerprint entries to make a deduplication-based backup system directly rewrite fragmented chunks, thereby alleviating adverse fragmentation. Moreover, HID introduces a feature to treat fragmented chunks as unique chunks. This feature compensates for the shortcoming that a Bloom filter cannot directly identify certain duplicated chunks (i.e., the fragmented chunks). To take full advantage of the preceding feature, we propose an evolved HID strategy called EHID. EHID incorporates a Bloom filter, to which only hot fingerprints are mapped. In doing so, EHID exhibits two salient features: (i) EHID avoids disk accesses to identify unique chunks and the fragmented chunks; (ii) EHID slashes the false positive rate of the integrated Bloom filter. These salient features push EHID into the high-efficiency mode. Our experimental results show that our approach reduces the average memory overhead of the index by 34.11% and 25.13% when using the Linux dataset and the FSL dataset, respectively. Furthermore, compared with the state-of-the-art method HAR, EHID boosts the average backup throughput by up to a factor of 2.25 with the Linux dataset, and EHID reduces the average disk I/O traffic by up to 66.21% with the FSL dataset. EHID also marginally improves the system's restore performance.
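
The three-way split of the index described in the abstract can be written down as a short sketch. It follows only the classification rule stated above; the function name `classify_entries`, the data layout, and the default threshold of 0.5 are assumptions for illustration and are not taken from the HID or EHID implementation.

```python
from typing import Dict, Set


def classify_entries(index: Dict[str, str],
                     container_utilization: Dict[str, float],
                     upcoming_fingerprints: Set[str],
                     threshold: float = 0.5) -> Dict[str, str]:
    """Label each fingerprint entry as 'hot', 'fragmented', or 'useless'.

    index maps a fingerprint to the container that stores its chunk.
    A container whose utilization is below threshold counts as sparse.
    """
    labels = {}
    for fingerprint, container in index.items():
        if container_utilization[container] >= threshold:
            # Entry points to a non-sparse container.
            labels[fingerprint] = "hot"
        elif fingerprint in upcoming_fingerprints:
            # Sparse container, but the chunk recurs in the next backup.
            labels[fingerprint] = "fragmented"
        else:
            labels[fingerprint] = "useless"
    return labels


if __name__ == "__main__":
    index = {"f1": "c1", "f2": "c2", "f3": "c2"}
    utilization = {"c1": 0.9, "c2": 0.2}
    print(classify_entries(index, utilization, {"f2"}))
```

In this reading, only the entries labelled "hot" would be mapped into the Bloom filter that EHID adds.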

Revisiting active object stores: Bringing data locality to the limit with NVM

Future Generation Computer Systems ◽ 10.1016/j.future.2021.10.025 ◽ 2021

Author(s): Alex Barcelo ◽ Anna Queralt ◽ Toni Cortes

Keyword(s): Data Locality ◽ Active Object

Locality-Based Cache Management and Warp Scheduling for Reducing Cache Contention in GPU

Micromachines ◽ 10.3390/mi12101262 ◽ 2021 ◽ Vol 12 (10) ◽ pp. 1262

Author(s): Juan Fang ◽ Zelin Wei ◽ Huijing Yang

Keyword(s): High Performance ◽ Data Locality ◽ Streaming Data ◽ Utilization Rate ◽ Cache Management ◽ L1 Data ◽ Long Latency ◽ Multiple Threads ◽ Cache Contention ◽ Acceleration Component

GPGPUs have gradually become a mainstream acceleration component in high-performance computing. The long latency of memory operations is the bottleneck of GPU performance. In the GPU, multiple threads are grouped into one warp for scheduling and execution. The L1 data caches have little capacity, and multiple warps share one small cache, so the cache suffers heavy contention and pipeline stalls. We propose Locality-Based Cache Management (LCM), combined with Locality-Based Warp Scheduling (LWS), to reduce cache contention and improve GPU performance. Each load instruction can be classified into one of three types according to locality: data used only once has streaming data locality, data accessed multiple times in the same warp has intra-warp locality, and data accessed in different warps has inter-warp locality. According to the locality of the load instruction, LCM applies cache bypass to streaming locality requests to improve the cache utilization rate, extends inter-warp memory request coalescing to make full use of inter-warp locality, and combines with LWS to alleviate cache contention. LCM and LWS can effectively improve cache performance, thereby improving overall GPU performance. In our experimental evaluation, LCM and LWS obtain an average performance improvement of 26% over the baseline GPU.
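
The three locality types named in the abstract can be made concrete with an offline sketch over a recorded access trace. This is only an illustration of the classification rule: the paper classifies load instructions inside the GPU, whereas this hypothetical `classify_lines` function labels cache lines from a list of `(warp_id, cache_line)` pairs, a trace format assumed here for simplicity.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def classify_lines(accesses: Iterable[Tuple[int, int]]) -> Dict[int, str]:
    """Label each cache line by the locality its accesses exhibit."""
    counts = defaultdict(int)   # cache line -> total number of accesses
    warps = defaultdict(set)    # cache line -> warps that touched it
    for warp_id, line in accesses:
        counts[line] += 1
        warps[line].add(warp_id)

    labels = {}
    for line, count in counts.items():
        if len(warps[line]) > 1:
            labels[line] = "inter-warp"   # shared across warps
        elif count > 1:
            labels[line] = "intra-warp"   # reused within one warp
        else:
            labels[line] = "streaming"    # touched exactly once
    return labels


if __name__ == "__main__":
    trace = [(0, 0x100), (0, 0x100), (1, 0x200), (2, 0x200), (3, 0x300)]
    print(classify_lines(trace))
```

Under the scheme described above, lines labelled "streaming" are the candidates for cache bypass.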

Sparbit: a new logarithmic-cost and data locality-aware MPI Allgather algorithm

10.1109/sbac-pad53543.2021.00028 ◽ 2021

Author(s): Wilton Jaciel Loch ◽ Guilherme Piegas Koslovski

Keyword(s): Data Locality

Inter-database validation of a deep learning approach for automatic sleep scoring

PLoS ONE ◽ 10.1371/journal.pone.0256111 ◽ 2021 ◽ Vol 16 (8) ◽ pp. e0256111

Author(s): Diego Alvarez-Estevez ◽ Roselyne M. Rijsman

Keyword(s): Deep Learning ◽ Network Architecture ◽ Data Locality ◽ Learning Approach ◽ Local Models ◽ Sleep Staging ◽ Patient Privacy ◽ Kappa Score ◽ Wide Range ◽ Scalable Design

Study objectives: Development of inter-database generalizable sleep staging algorithms represents a challenge due to increased data variability across different datasets. Sharing data between different centers is also a problem because of potential restrictions due to patient privacy protection. In this work, we describe a new deep learning approach for automatic sleep staging, and address its generalization capabilities on a wide range of public sleep staging databases. We also examine the suitability of a novel approach that uses an ensemble of individual local models and evaluate its impact on the resulting inter-database generalization performance. Methods: A general deep learning network architecture for automatic sleep staging is presented. Different preprocessing and architectural variant options are tested. The resulting prediction capabilities are evaluated and compared on a heterogeneous collection of six public sleep staging datasets. Validation is carried out in the context of independent local and external dataset generalization scenarios. Results: Best results were achieved using the CNN_LSTM_5 neural network variant. Average prediction capabilities on independent local testing sets achieved 0.80 kappa score. When individual local models predict data from external datasets, average kappa score decreases to 0.54. Using the proposed ensemble-based approach, average kappa performance on the external dataset prediction scenario increases to 0.62. To our knowledge this is the largest study by the number of datasets so far on validating the generalization capabilities of an automatic sleep staging algorithm using external databases. Conclusions: Validation results show good general performance of our method, as compared with the expected levels of human agreement, as well as to state-of-the-art automatic sleep staging methods. The proposed ensemble-based approach enables flexible and scalable design, allowing dynamic integration of local models into the final ensemble, preserving data locality, and increasing generalization capabilities of the resulting system at the same time.
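
For readers unfamiliar with the agreement measure reported above, here is a minimal sketch that assumes "kappa score" refers to Cohen's kappa, $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. The abstract does not state which kappa variant was used, and the function name `cohen_kappa` and the stage labels are illustrative assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa between two equally long label sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of positions with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from the two marginal label distributions.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    scorer = ["W", "N1", "N2", "N2"]
    model = ["W", "N2", "N2", "N2"]
    print(cohen_kappa(scorer, model))
```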

Federated Learning with Sparsification-Amplified Privacy and Adaptive Optimization

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽ 10.24963/ijcai.2021/202 ◽ 2021

Author(s): Rui Hu ◽ Yanmin Gong ◽ Yuanxiong Guo

Keyword(s): Deep Neural Networks ◽ Differential Privacy ◽ Random Noise ◽ Data Locality ◽ Adaptive Optimization ◽ Distributed Agents ◽ Acceleration Techniques ◽ Communication Efficiency ◽ Benchmark Datasets ◽ Model Size

Federated learning (FL) enables distributed agents to collaboratively learn a centralized model without sharing their raw data with each other. However, data locality does not provide sufficient privacy protection, and it is desirable to equip FL with a rigorous differential privacy (DP) guarantee. Existing DP mechanisms introduce random noise with magnitude proportional to the model size, which can be quite large in deep neural networks. In this paper, we propose a new FL framework with sparsification-amplified privacy. Our approach integrates random sparsification with gradient perturbation on each agent to amplify the privacy guarantee. Since sparsification increases the number of communication rounds required to achieve a certain target accuracy, which is unfavorable for the DP guarantee, we further introduce acceleration techniques to help reduce the privacy cost. We rigorously analyze the convergence of our approach and utilize Rényi DP to tightly account for the end-to-end DP guarantee. Extensive experiments on benchmark datasets validate that our approach outperforms previous differentially-private FL approaches in both privacy guarantee and communication efficiency.
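
The combination of random sparsification and gradient perturbation on one agent can be sketched roughly as below. This is a generic illustration, not the paper's mechanism: the order of the steps (clip, keep a random subset of coordinates, add Gaussian noise to the kept ones), the function name `sparsify_and_perturb`, and all parameters are assumptions, and the sketch carries no privacy accounting.

```python
import math
import random
from typing import List


def sparsify_and_perturb(grad: List[float], keep: int, clip_norm: float,
                         noise_std: float, rng: random.Random) -> List[float]:
    """Clip a gradient, keep `keep` random coordinates, and add noise to them."""
    # Scale the gradient so its L2 norm is at most clip_norm.
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [g * scale for g in grad]
    # Choose the surviving coordinates uniformly at random.
    kept = set(rng.sample(range(len(grad)), keep))
    # Noise is added only to kept coordinates; the rest are zeroed.
    return [clipped[i] + rng.gauss(0.0, noise_std) if i in kept else 0.0
            for i in range(len(grad))]


if __name__ == "__main__":
    rng = random.Random(0)
    print(sparsify_and_perturb([3.0, 4.0, 0.0, 0.0], keep=2,
                               clip_norm=1.0, noise_std=0.1, rng=rng))
```

Because only the kept coordinates are transmitted and perturbed, the total injected noise grows with `keep` rather than with the full model size, which is the intuition behind the claim in the abstract.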

Bandwidth-Aware Rescheduling Mechanism in SDN-Based Data Center Networks

Electronics ◽ 10.3390/electronics10151774 ◽ 2021 ◽ Vol 10 (15) ◽ pp. 1774

Author(s): Ming-Chin Chuang ◽ Chia-Cheng Yen ◽ Chia-Jui Hung

Keyword(s): Data Center ◽ Completion Time ◽ Network Performance ◽ Data Locality ◽ Task Completion ◽ Data Center Networks ◽ Task Completion Time ◽ Data Packets ◽ Network Bandwidth ◽ Data Process

Recently, with the increase in network bandwidth, various cloud computing applications have become popular. A large number of network data packets will be generated in such a network. However, most existing network architectures cannot effectively handle big data, thereby necessitating an efficient mechanism to reduce task completion time when large amounts of data are processed in data center networks. Unfortunately, achieving the minimum task completion time in the Hadoop system is an NP-complete problem. Although many studies have proposed schemes for improving network performance, they have shortcomings that degrade their performance. For this reason, in this study, we propose a centralized solution, called the bandwidth-aware rescheduling (BARE) mechanism for software-defined network (SDN)-based data center networks. BARE improves network performance by employing a prefetching mechanism and a centralized network monitor to collect global information, sorting out the locality data process, splitting tasks, and executing a rescheduling mechanism with a scheduler to reduce task completion time. Finally, we used simulations to demonstrate our scheme’s effectiveness. Simulation results show that our scheme outperforms other existing schemes in terms of task completion time and the ratio of data locality.

data locality
Recently Published Documents

ReuseTracker: Fast Yet Accurate Multicore Reuse Distance Analyzer

Big Data Workflows: Locality-Aware Orchestration Using Software Containers

LolliRAM: A Cross-Layer Design to Exploit Data Locality in Oblivious RAM

Improving the Performance of Deduplication-Based Backup Systems via Container Utilization Based Hot Fingerprint Entry Distilling

Revisiting active object stores: Bringing data locality to the limit with NVM

Locality-Based Cache Management and Warp Scheduling for Reducing Cache Contention in GPU

Sparbit: a new logarithmic-cost and data locality-aware MPI Allgather algorithm

Inter-database validation of a deep learning approach for automatic sleep scoring

Federated Learning with Sparsification-Amplified Privacy and Adaptive Optimization

Bandwidth-Aware Rescheduling Mechanism in SDN-Based Data Center Networks
