Mix and Match: Reorganizing Tasks for Enhancing Data Locality

Author(s):  
Xulong Tang ◽  
Mahmut Taylan Kandemir ◽  
Mustafa Karakoy

Application programs that exhibit strong locality of reference lead to fewer cache misses and better performance on different architectures. However, to maximize the performance of multithreaded applications running on emerging manycore systems, data movement in the on-chip network should also be minimized. Unfortunately, the way many multithreaded programs are written does not lend itself well to minimal data movement. Motivated by this observation, in this paper, we target task-based programs (which cover a large set of available multithreaded programs), and propose a novel compiler-based approach that consists of four complementary steps. First, we partition the original tasks in the target application into sub-tasks and build a data reuse graph at a sub-task granularity. Second, based on the intensity of temporal and spatial data reuses among sub-tasks, we generate new tasks where each such (new) task includes a set of sub-tasks that exhibit high data reuse among them. Third, we assign the newly-generated tasks to cores in an architecture-aware fashion with the knowledge of data location. Finally, we re-schedule the execution order of sub-tasks within new tasks such that sub-tasks that belong to different tasks but share data among them are executed in close proximity in time. The detailed experiments show that, when targeting a state-of-the-art manycore system, our proposed compiler-based approach improves the performance of 10 multithreaded programs by 23.4% on average, and it also outperforms two state-of-the-art data access optimizations for all the benchmarks tested. Our results also show that the proposed approach i) improves the performance of multiprogrammed workloads, and ii) generates results that are close to the maximum savings that could be achieved with perfect profiling information. Overall, our experimental results emphasize the importance of dividing an original set of tasks of an application into sub-tasks and constructing new tasks from the resulting sub-tasks in a data movement- and locality-aware fashion.
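
The first two steps described in the abstract (building a data reuse graph over sub-tasks, then forming new tasks from sub-tasks with high reuse) can be illustrated with a minimal sketch. This is not the authors' compiler algorithm: the representation of a sub-task as a set of data block IDs, the edge weight (number of shared blocks), the greedy merge heuristic, and the names `build_reuse_graph`, `group_subtasks`, and `max_size` are all assumptions made for illustration.

```python
from itertools import combinations


def build_reuse_graph(subtasks):
    """Return {(a, b): weight}, where weight is the number of data blocks
    that sub-tasks a and b both access (an assumed reuse measure)."""
    graph = {}
    for a, b in combinations(sorted(subtasks), 2):
        shared = len(subtasks[a] & subtasks[b])
        if shared:
            graph[(a, b)] = shared
    return graph


def group_subtasks(subtasks, max_size):
    """Greedily merge groups connected by the heaviest reuse edges.

    Hypothetical heuristic: visit edges from heaviest to lightest and merge
    the two groups if the merged group stays within max_size sub-tasks.
    """
    group_of = {name: {name} for name in subtasks}
    graph = build_reuse_graph(subtasks)
    for (a, b), _ in sorted(graph.items(), key=lambda kv: (-kv[1], kv[0])):
        ga, gb = group_of[a], group_of[b]
        if ga is not gb and len(ga) + len(gb) <= max_size:
            merged = ga | gb
            for name in merged:
                group_of[name] = merged
    # Collect each distinct group once and return them in a stable order.
    groups = []
    for g in group_of.values():
        if g not in groups:
            groups.append(g)
    return sorted(sorted(g) for g in groups)


if __name__ == "__main__":
    # Two original tasks (t0, t1), each split into two sub-tasks.
    subtasks = {
        "t0.a": {1, 2, 3},
        "t0.b": {7, 8},
        "t1.a": {2, 3, 4},
        "t1.b": {8, 9},
    }
    print(group_subtasks(subtasks, max_size=2))
```

In this toy input, the sub-tasks that share data come from different original tasks, so the regrouping pairs them across task boundaries. Core assignment and re-scheduling (the third and fourth steps) are not modeled here.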


Author(s):  
Ibrahim Al-Kharusi ◽  
David W Walker

Application performance on graphics processing units (GPUs), in terms of execution speed and memory usage, depends on the efficient use of hierarchical memory. It is expected that enhancing data locality in molecular dynamics simulations will lower the cost of data movement across the GPU memory hierarchy. The work presented in this article analyses the spatial data locality and data reuse characteristics for row-major, Hilbert and Morton orderings and the impact these have on the performance of molecular dynamics simulations. A simple cache model is presented, and this is found to give results that are consistent with the timing results for the particle force computation obtained on NVidia GeForce GTX960 and Tesla P100 GPUs. Further analysis of the observed memory use, in terms of cache hits and the number of memory transactions, provides a more detailed explanation of execution behaviour for the different orderings. To the best of our knowledge, this is the first study to investigate memory analysis and data locality issues for molecular dynamics simulations of Lennard-Jones fluids on NVidia’s Maxwell and Tesla architectures.
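
As a rough illustration of one of the orderings compared above, the sketch below computes a 2D Morton (Z-order) index by interleaving the bits of the cell coordinates and uses it to order the cells of a small grid. The 2D setting, the 16-bit coordinate width, and the names `morton2d` and `morton_order` are assumptions for illustration; the abstract does not describe the article's own implementation.

```python
def morton2d(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Morton index.

    Bit i of x goes to position 2*i, bit i of y to position 2*i + 1.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def morton_order(nx, ny):
    """Return the (x, y) cells of an nx-by-ny grid sorted by Morton index."""
    cells = [(x, y) for y in range(ny) for x in range(nx)]
    return sorted(cells, key=lambda c: morton2d(c[0], c[1]))


if __name__ == "__main__":
    # Row-major order visits a full row before moving on; Morton order
    # visits each 2x2 block of cells before moving to the next block.
    print(morton_order(4, 4))
```

Particles sorted by the Morton index of their cell tend to sit near their spatial neighbours in memory, which is the locality property such orderings are meant to provide.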



2021 ◽  
Vol 18 (3) ◽  
pp. 1-22
Author(s):  
Ricardo Alves ◽  
Stefanos Kaxiras ◽  
David Black-Schaffer

Achieving low load-to-use latency with low energy and storage overheads is critical for performance. Existing techniques either prefetch into the pipeline (via address prediction and validation) or provide data reuse in the pipeline (via register sharing or L0 caches). These techniques provide a range of tradeoffs between latency, reuse, and overhead. In this work, we present a pipeline prefetching technique that achieves state-of-the-art performance and data reuse without additional data storage, data movement, or validation overheads by adding address tags to the register file. Our addition of register file tags allows us to forward (reuse) load data from the register file with no additional data movement, keep the data alive in the register file beyond the instruction’s lifetime to increase temporal reuse, and coalesce prefetch requests to achieve spatial reuse. Further, we show that we can use the existing memory order violation detection hardware to validate prefetches and data forwards without additional overhead. Our design achieves the performance of existing pipeline prefetching while also forwarding 32% of the loads from the register file (compared to 15% in state-of-the-art register sharing), delivering a 16% reduction in L1 dynamic energy (1.6% total processor energy), with an area overhead of less than 0.5%.



2020 ◽  
Vol 28 (1) ◽  
pp. 181-195
Author(s):  
Quentin Vanhaelen

Computational approaches have been proven to be complementary tools of interest in identifying potential candidates for drug repurposing. However, although the methods developed so far offer interesting opportunities and could contribute to solving issues faced by the pharmaceutical sector, they also come with their constraints. Indeed, specific challenges ranging from data access, standardization and integration to the implementation of reliable and coherent validation methods must be addressed to allow systematic use at a larger scale. In this mini-review, we cover computational tools recently developed for addressing some of these challenges. This includes specific databases providing access to a large set of curated data with standardized annotations, web-based tools integrating flexible user interfaces to perform fast computational repurposing experiments and standardized datasets specifically annotated and balanced for validating new computational drug repurposing methods. Interestingly, these new databases combined with the growing amount of information about the outcomes of drug repurposing studies can be used to perform a meta-analysis to identify key properties associated with successful drug repurposing cases. This information could further be used to design estimation methods to compute a priori assessment of the repurposing possibilities.



2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Yue Weng ◽  
Xi Zhang ◽  
Xiaohu Guo ◽  
Xianwei Zhang ◽  
Yutong Lu ◽  
...  

In the unstructured finite volume method, loops over different mesh components such as cells, faces, and nodes are widely used to traverse data. A mesh loop results in direct or indirect data access, which significantly affects data locality. When looping over the mesh, many threads accessing the same data leads to data dependence. Both data locality and data dependence play an important part in the performance of GPU simulations. To optimize a GPU-accelerated unstructured finite volume Computational Fluid Dynamics (CFD) program, the performance of hot spots under different loops over cells, faces, and nodes is evaluated on Nvidia Tesla V100 and K80 GPUs. Numerical tests at different mesh scales show that the mesh loop modes affect data locality and data dependence differently. Specifically, the face loop gives the best data locality, so long as kernels access face data. The cell loop incurs the smallest overhead from non-coalesced data access when both cell and node data are used in computing without face data. The cell loop performs best when only indirect access to cell data exists in kernels. Atomic operations greatly reduce kernel performance on the K80, an effect that is not obvious on the V100. With a suitable mesh loop mode in all kernels, the overall performance of GPU simulations can be increased by 15%-20%. Finally, the program on a single V100 GPU achieves a maximum speedup of 21.7 and an average speedup of 14.1 compared with 28 MPI tasks on two Intel Xeon Gold 6132 CPUs.
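
The difference between a face loop and a cell loop can be sketched on a toy mesh. The example below is sequential Python, not GPU code, and is not taken from the paper: the mesh layout (each face stores its left and right cell), the sign convention, and the names `face_loop` and `cell_loop` are assumptions. It only illustrates why a face loop scatters to shared cell data (which would need atomic updates when faces are processed by parallel threads), while a cell loop gathers through an indirect cell-to-face map.

```python
def face_loop(n_cells, faces, flux):
    """Loop over faces and scatter each flux to its two neighbouring cells.

    Face data is read directly, but two faces may update the same cell,
    so a parallel version would need atomics or colouring.
    """
    residual = [0.0] * n_cells
    for (left, right), f in zip(faces, flux):
        residual[left] -= f   # flux leaves the left cell
        residual[right] += f  # and enters the right cell
    return residual


def cell_loop(n_cells, faces, flux):
    """Loop over cells and gather fluxes from the faces of each cell.

    Each cell is written by exactly one iteration, but face data is
    reached indirectly through the cell-to-face connectivity.
    """
    cell_faces = [[] for _ in range(n_cells)]
    for i, (left, right) in enumerate(faces):
        cell_faces[left].append((i, -1.0))
        cell_faces[right].append((i, 1.0))
    residual = [0.0] * n_cells
    for c in range(n_cells):
        for i, sign in cell_faces[c]:
            residual[c] += sign * flux[i]
    return residual


if __name__ == "__main__":
    # Three cells in a row with two interior faces between them.
    faces = [(0, 1), (1, 2)]
    flux = [1.0, 2.0]
    print(face_loop(3, faces, flux))
    print(cell_loop(3, faces, flux))
```

Both loops produce the same residual; they differ only in which data is accessed directly and which cell updates can collide.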



2021 ◽  
Vol 20 (5s) ◽  
pp. 1-20
Author(s):  
Hyungmin Cho

Depthwise convolutions are widely used in convolutional neural networks (CNNs) targeting mobile and embedded systems. Depthwise convolution layers reduce the computation loads and the number of parameters compared to the conventional convolution layers. Many deep neural network (DNN) accelerators adopt an architecture that exploits the high data-reuse factor of DNN computations, such as a systolic array. However, depthwise convolutions have low data-reuse factor and under-utilize the processing elements (PEs) in systolic arrays. In this paper, we present a DNN accelerator design called RiSA, which provides a novel mechanism that boosts the PE utilization for depthwise convolutions on a systolic array with minimal overheads. In addition, the PEs in systolic arrays can be efficiently used only if the data items (tensors) are arranged in the desired layout. Typical DNN accelerators provide various types of PE interconnects or additional modules to flexibly rearrange the data items and manage data movements during DNN computations. RiSA provides a lightweight set of tensor management tasks within the PE array itself that eliminates the need for an additional module for tensor reshaping tasks. Using this embedded tensor reshaping, RiSA supports various DNN models, including convolutional neural networks and natural language processing models while maintaining a high area efficiency. Compared to Eyeriss v2, RiSA improves the area and energy efficiency for MobileNet-V1 inference by 1.91× and 1.31×, respectively.
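
The claim that depthwise layers have fewer parameters, less computation, and lower data reuse than conventional convolution layers can be checked with a small counting sketch. It uses the standard cost formulas for a stride-1 layer with a k-by-k kernel and ignores border effects and bias terms; the function names and the example layer shape are assumptions for illustration and are not taken from the RiSA paper.

```python
def standard_conv_cost(h, w, c_in, c_out, k):
    """Return (parameters, MACs) for a standard k x k convolution
    producing an h x w output with c_out channels."""
    params = k * k * c_in * c_out
    macs = h * w * params
    return params, macs


def depthwise_conv_cost(h, w, c, k):
    """Return (parameters, MACs) for a depthwise k x k convolution:
    one k x k filter per channel, with no mixing across channels."""
    params = k * k * c
    macs = h * w * params
    return params, macs


def input_reuse(k, c_out=None):
    """Approximate number of MACs that consume one input activation.

    Standard convolution: k*k window positions times c_out output channels.
    Depthwise convolution (c_out is None): only the k*k window positions.
    """
    return k * k * (c_out if c_out is not None else 1)


if __name__ == "__main__":
    h, w, c, k = 56, 56, 64, 3  # hypothetical layer shape
    std_params, std_macs = standard_conv_cost(h, w, c, c, k)
    dw_params, dw_macs = depthwise_conv_cost(h, w, c, k)
    print("parameter ratio:", std_params // dw_params)
    print("MAC ratio:", std_macs // dw_macs)
    print("input reuse:", input_reuse(k, c), "vs", input_reuse(k))
```

With equal input and output channel counts, all three ratios equal the channel count, which is why an array sized to exploit the reuse of standard convolutions can be left under-utilized by depthwise layers.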



Author(s):  
Subhra Prosun Paul ◽  
Dr. Shruti Aggarwal

In today’s world, sensor networks offer various opportunities for data management applications because of their low cost, reliability, scalability, high-speed data processing, and other advantages. It is a great challenge to organize data effectively and to retrieve the appropriate data from the large volume of varied data sets in ad hoc network databases, mobile databases, etc. The sensor network is necessary for routing data, analyzing the performance of data management activities, and incorporating data into the right application. Data management involves intranet and extranet query handling, data access mechanisms, data modeling, different data movement algorithms, data warehousing, and data mining of network databases. Additionally, connectivity, design, and lifetime are important issues for sensor networks to perform all data management activities smoothly. In this paper, we survey research trends in sensor network data management over the last two decades, considering the challenges and issues of both sensor network databases and data management functions, using the Scopus and Web of Science databases. To analyze the data, different assessments are carried out on parameters such as author, time, publication and citation counts, place, source, and document type, separately for the Web of Science and Scopus databases, from a global perspective. We observe significant growth in research on data management for sensor networks because of the popularity of this topic.



Author(s):  
Wensong Li ◽  
Fan Yang ◽  
Hengliang Zhu ◽  
Xuan Zeng ◽  
Dian Zhou


2004 ◽  
Vol 61 (3) ◽  
pp. 476-486
Author(s):  
Delphine Danancher ◽  
Jacques Labonne ◽  
Roger Pradel ◽  
Philippe Gaudin

In this study, capture–mark–recapture statistics were applied to spatial recapture histories to assess the intensity of restricted fish movements along the longitudinal axis of a river, using a previously described model for survival and recruitment analysis. Adapting the stopover estimation method to spatial data, movement probabilities were then used to estimate space used at the population scale. This capture–recapture estimates of space used in streams (CRESUS) method may thus be seen as a complement to classic home range methods and should be used to explore the consequences of behavioural strategies on population mechanisms. We propose a methodological example where movements and space use strategies of a Zingel asper (percid) population in the Beaume River (Ardèche, France) were directly estimated at the population scale, taking into account the effects of different biotic or abiotic factors. Results showed differences in Z. asper space use patterns among sexes, periods of the biological cycle (growing and spawning periods), and types of mesohabitat. Downstream movements were more important during the spawning period, and the riffle was more intensively used.



2014 ◽  
Vol 2014 ◽  
pp. 1-19
Author(s):  
Mark J. van der Laan ◽  
Richard J. C. M. Starmans

This outlook paper reviews the research of van der Laan’s group on Targeted Learning, a subfield of statistics that is concerned with the construction of data adaptive estimators of user-supplied target parameters of the probability distribution of the data and corresponding confidence intervals, aiming to rely only on realistic statistical assumptions. Targeted Learning fully utilizes the state of the art in machine learning tools, while still preserving the important identity of statistics as a field that is concerned with both accurate estimation of the true target parameter value and assessment of uncertainty in order to make sound statistical conclusions. We also provide a philosophical and historical perspective on Targeted Learning and relate it to the new developments in Big Data. We conclude with some remarks explaining the immediate relevance of Targeted Learning to the current Big Data movement.


