Modelling and simulation of GPU processing in the MERPSYS environment

2018
Vol 19 (4)
pp. 401-422
Author(s):
Tomasz Gajger
Pawel Czarnul

In this work, we evaluate an analytical GPU performance model based on Little's law that expresses kernel execution time in terms of a latency bound, a throughput bound, and achieved occupancy. We then combine it with the results of several research papers, introduce equations for estimating data transfer time, and finally incorporate it into the MERPSYS framework, a general-purpose simulator for parallel and distributed systems. The resulting solution enables the user to express a CUDA application in the MERPSYS editor using an extended Java language and then conveniently evaluate its performance for various launch configurations and hardware units. We also provide a systematic methodology for extracting the kernel characteristics that serve as input parameters of the model. The model was evaluated using kernels representing different traits and a large variety of launch configurations. We found it to be very accurate for computation-bound kernels and realistic workloads, whilst for memory-throughput-bound kernels and uncommon scenarios the results were still within acceptable limits. We have also demonstrated its portability between two devices of the same hardware architecture but different processing power. Consequently, MERPSYS with the embedded theoretical models can be used to evaluate application performance on various GPUs, for performance prediction, and to support decisions such as hardware purchases.
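To make the shape of such a model concrete, the following sketch (illustrative only, not the MERPSYS implementation; the parameter names and the linear occupancy scaling are assumptions) applies Little's law to estimate kernel execution time as the worse of a latency bound and a throughput bound:

```python
def kernel_cycles(work, latency, peak_rate, occupancy):
    """Estimate kernel execution time in cycles (illustrative model only).

    work      -- operations (or memory transactions) issued by the kernel
    latency   -- mean latency of one operation, in cycles
    peak_rate -- device peak throughput, in operations per cycle
    occupancy -- achieved fraction of the concurrency needed to hide latency
    """
    # Little's law: concurrency needed to saturate the device.
    required_concurrency = latency * peak_rate
    achieved_concurrency = occupancy * required_concurrency
    # The same law applied to what is actually in flight gives the sustained rate.
    sustained_rate = max(achieved_concurrency / latency, 1e-12)  # == occupancy * peak_rate
    latency_bound = work / sustained_rate     # limited by parallelism in flight
    throughput_bound = work / peak_rate       # limited by the hardware peak
    return max(latency_bound, throughput_bound)
```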

Author(s):  
Xiaohan Tao
Jianmin Pang
Jinlong Xu
Yu Zhu

Abstract The heterogeneous many-core architecture plays an important role in the fields of high-performance computing and scientific computing. It uses accelerator cores with on-chip memories to improve performance and reduce energy consumption. Scratchpad memory (SPM) is a kind of fast on-chip memory with lower energy consumption than a hardware cache. However, data transfer between SPM and off-chip memory can be managed only by a programmer or compiler. In this paper, we propose a compiler-directed multithreaded SPM data transfer model (MSDTM) to optimize the process of data transfer in a heterogeneous many-core architecture. We use compile-time analysis to classify data accesses, check dependences and determine the allocation of data transfer operations. We further present a data transfer performance model to derive the optimal granularity of data transfer and select the most profitable data transfer strategy. We implement the proposed MSDTM in the GCC compiler and evaluate it on Sunway TaihuLight with selected test cases from benchmarks and scientific computing applications. The experimental results show that the proposed MSDTM speeds up application execution by 5.49× and achieves an energy saving of 5.16× on average.
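A minimal sketch of the granularity-selection idea, assuming a simple per-block cost of startup plus bandwidth-limited transfer (the paper's actual MSDTM cost model and constants are not reproduced here; all names are illustrative):

```python
def best_granularity(total_bytes, spm_capacity, startup_cycles, bytes_per_cycle):
    """Return (block_size, estimated_cycles) minimizing a simple transfer-cost model."""
    best = None
    size = 64                                   # smallest candidate block, in bytes
    while size <= spm_capacity:
        blocks = -(-total_bytes // size)        # ceil(total_bytes / size)
        cost = blocks * (startup_cycles + size / bytes_per_cycle)
        if best is None or cost < best[1]:
            best = (size, cost)
        size *= 2                               # try power-of-two block sizes
    return best

# Example (hypothetical numbers): move 1 MiB through a 64 KiB scratchpad.
# best_granularity(1 << 20, spm_capacity=64 * 1024, startup_cycles=500, bytes_per_cycle=8)
```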


2021
Author(s):  
Jonathan B. Chan

System on Programmable Chip (SoPC) based embedded system development has been increasing, aiming for improved system design, testing, and cost savings in the workflow for Application-Specific ICs (ASICs). We examine the development of Smart Home embedded systems, which have traditionally been based on a fixed processor and memory with an inflexible configuration. We investigate how more capability can be added by updating firmware, without the burden of updating hardware or of using a full (but dedicated) general-purpose computer system. Our development and implementation of the smart home controller is based on the SoPC development environment from Altera. The development board includes all the necessary parts, such as the processor, memory, and various communication interfaces. The initial implementation includes a simple protocol for communication between home appliances or devices and the controller. This protocol allows data transfer between home appliances or devices and the controller, in turn allowing both to support more features. We have investigated and developed a home resource management application. The main resources managed in this project are hot and cold water, electricity, and gas, and we have introduced a number of expert rules to manage them. Additionally, we have developed a home simulator, with virtual appliances and devices, that communicates with the home controller. The simulator interacts with the SoPC-based smart home embedded system developed in this project by generating messages representing a number of smart appliances in the home, providing a useful testing environment for verifying the design goals of the smart home embedded system.
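A hypothetical sketch of what such a compact appliance-to-controller message might look like (the actual field layout used in the project is not specified above; every field here is an assumption):

```python
import struct

MSG_STATUS, MSG_COMMAND = 0x01, 0x02   # hypothetical message types

def pack_message(device_id: int, msg_type: int, value: int) -> bytes:
    # 2-byte device id, 1-byte message type, 4-byte value, little-endian.
    return struct.pack("<HBI", device_id, msg_type, value)

def unpack_message(raw: bytes):
    device_id, msg_type, value = struct.unpack("<HBI", raw)
    return device_id, msg_type, value

# A simulator of the kind described above would simply generate such messages
# for its virtual appliances and send them to the controller over a serial or TCP link.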


2011
Vol 19 (2-3)
pp. 133-145
Author(s):
Gabriela Turcu
Ian Foster
Svetlozar Nestorov

Text analysis tools are nowadays required to process increasingly large corpora which are often organized as many small files (abstracts, news articles, etc.). Cloud computing offers a convenient, on-demand, pay-as-you-go computing environment for solving such problems. We investigate provisioning on the Amazon EC2 cloud from the user perspective, attempting to provide a scheduling strategy that is both timely and cost-effective. We derive an execution plan using an empirically determined application performance model. A first goal of our performance measurements is to determine an optimal file size for our application to consume. Using the subset-sum first-fit heuristic, we reshape the input data by merging files in order to match the desired file size as closely as possible. This also speeds up retrieval of our application's results, since the output is less segmented. Using predictions of the application's performance based on measurements on small data sets, we devise an execution plan that meets a user-specified deadline while minimizing cost.
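The file-merging step can be illustrated with a first-fit (decreasing) packing sketch that groups small files toward a target merged-file size (names and the target value are illustrative, not the authors' code):

```python
def first_fit_merge(file_sizes, target_size):
    """Group file sizes into bins whose totals stay at or below target_size."""
    bins = []   # each bin is [current_total, [indices of merged files]]
    for idx, size in sorted(enumerate(file_sizes), key=lambda p: -p[1]):
        for b in bins:
            if b[0] + size <= target_size:    # first existing bin it fits into
                b[0] += size
                b[1].append(idx)
                break
        else:
            bins.append([size, [idx]])        # open a new merged file
    return [indices for _, indices in bins]

# Example: merge many ~50 KB abstracts toward a hypothetical 64 MB chunk size.
# groups = first_fit_merge(sizes_in_bytes, target_size=64 * 2**20)
```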


2021
Author(s):  
Hendrik Suryadi
Haifeng Li
Diego Medina
Alex Celis

Abstract Drilling wells with minimum risk and optimizing well placement at the least possible cost are key goals that companies strive to achieve. The major contributor to the successful execution of a well is the quality of the drilling program. Well design is a complex process that requires full collaboration among multiple domain roles and areas of expertise working together to integrate various well-planning data. Many design challenges are encountered, such as risk assessments, domain-specific workflows, geological concerns, technology selection, cost and time estimation, and environmental and safety concerns. Design process efficiency depends on effective communication between parties, adapting quickly to changes, reducing the number of changes, and reducing complicated and manual processes. Existing workflows and tools do not promote a collaborative environment among the different roles involved. Engineers use multiple engineering applications, which involve many manual data transfers and inputs. The different parties still work in silos and share designs via email or other manual data transfers. Any change to the design causes manual rework, leading to inconsistency, incoherence, slow decision making and optimization, and failure to identify all potential risks, which increases well planning time. The new digital planning solution, based on cloud technology, allows the design team to maximize results by giving them access to all the data and science they need in a single, standard system. It is a radically new way of working that gives engineers quicker and better-quality drilling programs by automating repetitive tasks and validation workflows to ensure the entire plan is coherent. This planning solution allows multiple roles and domains to collaborate, breaking down silos, increasing team productivity through task assignment, and sharing all data. Automated trajectory design changes the way engineers design trajectories: instead of manually connecting the path from the surface location to the target reservoir location, the system automatically calculates and proposes multiple options with various KPIs, allowing the engineer to select the best trajectory. The system reinforces drilling program quality through automated engineering analysis, which provides quick feedback on any design change, and provides an integrated workflow from trajectory design to operational activity planning and the AFE. The automation of repetitive tasks, such as multiple manual inputs, frees domain experts to spend more time creating new engineering insights while maintaining design traceability, so that updates can be reviewed over the life of a project and the effect of design changes on the drilling program can be seen. This new solution solves some of the significant challenges in the current well-planning workflow.


Author(s):  
Ming Mao
Marty Humphrey

It is a challenge to provision and allocate resources in the Cloud so as to meet both the performance and cost goals of Cloud users. For a Cloud consumer, the ability to acquire and release resources dynamically and trivially, while powerful and useful, complicates the resource provisioning and allocation task. On the one hand, resource under-provisioning may hurt application performance and deteriorate service quality; on the other hand, resource over-provisioning may cost users more and offset the Cloud's advantages. Although resource management and job scheduling have been studied extensively in Grid environments, and the Cloud shares many common features with the Grid, mapping user objectives to resource provisioning and allocation in the Cloud raises many challenges due to the seemingly unlimited resource pools, virtualization, and isolation features provided by the Cloud. This chapter surveys research trends in resource provisioning in the Cloud along several factors, such as the type of workload, VM heterogeneity, data transfer requirements, solution methods, and optimization goals and constraints, and attempts to provide guidelines for future research.


Mission Performance Models (MPMs) are important to the design of modern digital avionic systems because flight deck information is no longer obvious. In large-scale dynamic systems, the necessary responses should correspond directly to the incoming information model. A Mission Performance Model is an abstract representation of the activity clusters necessary to achieve mission success. The three core activity clusters, trajectory management, energy management, and attitude control, are covered in detail. Their combined performance characteristics highlight the vehicle's kinematic attributes, which in turn helps anticipate unstable conditions. Six MPMs are necessary for the effective design and employment of a modern mission-ready flight deck. We describe MPMs and their structure, purpose, and operational application. Performance models have many important uses, including training system definition and design, avionic system design, and safety programs.


Author(s):  
Yao Yuan
Dalin Zhang
Lin Tian
Jinglin Shi

As a promising candidate for a general-purpose transport layer protocol, the Stream Control Transmission Protocol (SCTP) offers new features such as multi-homing and multi-streaming. The multi-homing feature makes an SCTP association an appealing basis for concurrent multi-path transfer to satisfy the ever-increasing user demand for bandwidth, while multiple streams provide an aggregation mechanism to accommodate heterogeneous objects that belong to the same application but may require different QoS from the network. In this paper, the authors introduce WM2-SCTP (Wireless Multi-path Multi-flow - Stream Control Transmission Protocol), a transport-layer solution for concurrent multi-path transfer with parallel sub-flows. WM2-SCTP aims to exploit SCTP's multi-homing and multi-streaming capabilities by grouping SCTP streams into sub-flows based on their required QoS and selecting the best paths for each sub-flow to improve data transfer rates. The results show that under different scenarios WM2-SCTP is able to support QoS among the SCTP streams and achieves better throughput.
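A schematic sketch of the grouping-and-path-selection idea (not the WM2-SCTP implementation; the QoS classes, path metrics, and scoring rule below are all assumptions):

```python
from collections import defaultdict

def build_subflows(streams):
    """streams: list of dicts with 'id' and 'qos' (e.g. 'low_delay', 'bulk')."""
    subflows = defaultdict(list)
    for s in streams:
        subflows[s["qos"]].append(s["id"])   # group streams by required QoS
    return subflows

def assign_paths(subflows, paths):
    """paths: list of dicts with 'name', 'rtt_ms', 'bandwidth_mbps'."""
    assignment = {}
    for qos, stream_ids in subflows.items():
        if qos == "low_delay":
            best = min(paths, key=lambda p: p["rtt_ms"])          # favour low latency
        else:
            best = max(paths, key=lambda p: p["bandwidth_mbps"])  # favour bandwidth
        assignment[qos] = (best["name"], stream_ids)
    return assignment
```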


Author(s):  
Masaki Iwasawa
Daisuke Namekata
Keigo Nitadori
Kentaro Nomura
Long Wang
...  

Abstract We describe algorithms implemented in FDPS (Framework for Developing Particle Simulators) to make efficient use of accelerator hardware such as GPGPUs (general-purpose computing on graphics processing units). We have developed FDPS to make it possible for researchers to develop their own high-performance parallel particle-based simulation programs without spending large amounts of time on parallelization and performance tuning. FDPS provides a high-performance implementation of parallel algorithms for particle-based simulations in a “generic” form, so that researchers can define their own particle data structure and interparticle interaction functions. FDPS compiled with user-supplied data types and interaction functions provides all the necessary functions for parallelization, and researchers can thus write their programs as though they are writing simple non-parallel code. It has previously been possible to use accelerators with FDPS by writing an interaction function that uses the accelerator. However, the efficiency was limited by the latency and bandwidth of communication between the CPU and the accelerator, and also by the mismatch between the available degree of parallelism of the interaction function and that of the hardware parallelism. We have modified the interface of the user-provided interaction functions so that accelerators are used more efficiently. We also implemented new techniques which reduce the amount of work on the CPU side and the amount of communication between the CPU and accelerators. We have measured the performance of N-body simulations on a system with an NVIDIA Volta GPGPU using FDPS, and the achieved performance is around 27% of the theoretical peak limit. We have constructed a detailed performance model and found that the current implementation can achieve good performance on systems with much smaller memory and communication bandwidth. Thus, our implementation will be applicable to future generations of accelerator systems.
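The division of labour between the framework and the user-supplied pieces can be sketched as follows (FDPS itself is a C++ template library; this Python sketch only mirrors the pattern of a user-defined interaction function handed to a framework-owned loop, and is not the real FDPS API):

```python
import numpy as np

def user_gravity_kernel(xi, xj, mj, eps2=1e-6):
    # User-supplied interaction: softened Newtonian gravity on particle i
    # from all particles j (G = 1 in code units).
    dx = xj - xi                              # (N, 3) displacements
    r2 = np.sum(dx * dx, axis=1) + eps2       # softened squared distances
    return np.sum((mj / r2**1.5)[:, None] * dx, axis=0)

def framework_compute_forces(pos, mass, interaction):
    # Framework side: in FDPS this loop is domain-decomposed, tree-accelerated
    # and, with the revised interface, batched onto the accelerator.
    acc = np.empty_like(pos)
    for i in range(len(pos)):
        acc[i] = interaction(pos[i], pos, mass)
    return acc
```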


Author(s):  
Driss En-Nejjary
Francois Pinet
Myoung-Ah Kang

Recently, in the field of information systems, the acquisition of geo-referenced data has made a huge leap forward in terms of technology. Optimizing the processing of these data is a real issue, and different research works have proposed analyzing large geo-referenced datasets with multi-core approaches. In this article, different methods based on general-purpose computing on graphics processing units (GPGPU) are modelled and compared to parallelize overlapping aggregations of raster sequences. Our methods are tested on a sequence of rasters representing the evolution of temperature over time for the same region. Each raster corresponds to a different data acquisition time period, and each geo-referenced raster cell is associated with a temperature value. This article proposes optimized methods to calculate the average temperature of the region for all the possible raster subsequences of a determined length, i.e., to calculate overlapping aggregated data summaries. In these aggregations, the same subsets of values are aggregated several times. This type of aggregation can be useful in different environmental data analyses, e.g., to pre-calculate all the average temperatures in a database. The present article highlights a significant increase in performance and shows that GPGPU parallel processing enabled us to run the aggregations more than 50 times faster than the sequential method when data transfer cost is included, and more than 200 times faster when it is excluded.
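The overlapping-aggregation pattern can be sketched with a prefix sum over per-raster means, so every length-k window reuses shared partial sums instead of re-summing them (a CPU sketch of the computation the article parallelizes on the GPU; shapes and names are illustrative):

```python
import numpy as np

def window_means(rasters, k):
    """rasters: array of shape (T, H, W); returns the (T-k+1,) per-window means."""
    # Mean of the region for each raster, then a prefix sum over time.
    per_raster = rasters.reshape(rasters.shape[0], -1).mean(axis=1)
    csum = np.concatenate(([0.0], np.cumsum(per_raster)))
    # Mean over rasters t..t+k-1 is (csum[t+k] - csum[t]) / k, for every window t.
    return (csum[k:] - csum[:-k]) / k

# Example: means of every 7-raster window in a (365, 512, 512) temperature cube.
# means = window_means(temperature_cube, k=7)
```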


Author(s):  
Calvin M. Stewart ◽  
Erik A. Hogan ◽  
Ali P. Gordon

Directionally solidified (DS) Ni-base superalloys have become a commonly used material in gas turbine components. Controlled solidification during the manufacturing process leads to a special alignment of the grain boundaries within the material. This alignment results in material properties that depend on the orientation of the material. When used in gas turbine applications, the direction of the first principal stress experienced by a component is aligned with the enhanced grain orientation, leading to enhanced impact strength, high-temperature creep and fatigue resistance, and improved corrosion resistance compared with off-axis orientations. Of particular importance is the creep response of these DS materials. In the current study, the classical Kachanov-Rabotnov model for tertiary creep damage is implemented in general-purpose finite element analysis (FEA) software. Creep deformation and rupture experiments are conducted on samples from a representative DS Ni-base superalloy tested at temperatures between 649 and 982°C and in two orientations (longitudinally- and transversely-oriented). The secondary creep constants are analytically determined from experimental data available in the literature. A simulated annealing optimization routine is utilized to determine the tertiary creep constants. Using regression analysis, the creep constants are characterized for temperature and stress dependence. A rupture time estimation model derived from the Kachanov-Rabotnov model is then parametrically exercised and compared with available experimental data.
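For reference, one common statement of the classical Kachanov-Rabotnov coupled creep-damage formulation and the constant-stress rupture-time estimate that follows from it (symbol and exponent conventions vary between authors; the calibrated constants of this study are not reproduced here):

```latex
% One common form of the Kachanov-Rabotnov equations; conventions vary by author.
\begin{align}
  \dot{\varepsilon}_{cr} &= A \left( \frac{\sigma}{1 - \omega} \right)^{n}
      && \text{(creep strain rate, damage-coupled Norton law)} \\
  \dot{\omega} &= \frac{M \, \sigma^{\chi}}{(1 - \omega)^{\phi}}
      && \text{(damage evolution, } 0 \le \omega < 1\text{)} \\
  t_r &= \frac{1}{(\phi + 1) \, M \, \sigma^{\chi}}
      && \text{(rupture time at constant stress, from } \omega \to 1\text{)}
\end{align}
```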

