VFC: The Vienna Fortran Compiler

1999 ◽  
Vol 7 (1) ◽  
pp. 67-81 ◽  
Author(s):  
Siegfried Benkner

High Performance Fortran (HPF) offers an attractive high-level language interface for programming scalable parallel architectures, providing the user with directives for the specification of data distribution and delegating to the compiler the task of generating an explicitly parallel program. Available HPF compilers can handle regular codes quite efficiently, but dramatic performance losses may be encountered for applications that are based on highly irregular, dynamically changing data structures and access patterns. In this paper we introduce the Vienna Fortran Compiler (VFC), a new source-to-source parallelization system for HPF+, an optimized version of HPF which addresses the requirements of irregular applications. In addition to extended data distribution and work distribution mechanisms, HPF+ provides the user with language features for specifying certain information that decisively influences a program's performance. This comprises data locality assertions, non-local access specifications and the possibility of reusing runtime-generated communication schedules of irregular loops. Performance measurements of kernels from advanced applications demonstrate that with a high-level data-parallel language such as HPF+ a performance close to that of hand-written message-passing programs can be achieved even for highly irregular codes.
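
The schedule-reuse idea can be illustrated with the classic inspector/executor pattern that such runtime techniques build on. The following C/MPI sketch is an illustration under assumed conventions (a simple block distribution, invented variable names), not VFC's actual runtime interface: an inspector analyzes the indirection array once and records a communication schedule, which the executor then reuses on every sweep of the irregular loop.

```c
/* Minimal inspector/executor sketch (C + MPI). Not VFC's runtime API:
 * the block distribution and all names are assumptions for illustration. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N_LOCAL 4                    /* owned elements per process */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double x[N_LOCAL];               /* block-distributed array */
    for (int i = 0; i < N_LOCAL; i++) x[i] = rank * N_LOCAL + i;

    /* Irregular access pattern: this process reads x[ind[i]] (global index). */
    int n_need = N_LOCAL;
    int ind[N_LOCAL];
    for (int i = 0; i < n_need; i++)
        ind[i] = (rank * N_LOCAL + 3 * i + 1) % (nprocs * N_LOCAL);

    /* --- Inspector: run once; its result is the reusable schedule. --- */
    int *recvcnt = calloc(nprocs, sizeof(int));  /* values wanted per owner */
    int *sendcnt = calloc(nprocs, sizeof(int));  /* values owed to each peer */
    for (int i = 0; i < n_need; i++) recvcnt[ind[i] / N_LOCAL]++;
    MPI_Alltoall(recvcnt, 1, MPI_INT, sendcnt, 1, MPI_INT, MPI_COMM_WORLD);

    int *rdis = calloc(nprocs, sizeof(int)), *sdis = calloc(nprocs, sizeof(int));
    for (int p = 1; p < nprocs; p++) {
        rdis[p] = rdis[p - 1] + recvcnt[p - 1];
        sdis[p] = sdis[p - 1] + sendcnt[p - 1];
    }
    int tot_send = sdis[nprocs - 1] + sendcnt[nprocs - 1];

    int *req  = malloc(n_need * sizeof(int));    /* requests grouped by owner */
    int *pos  = malloc(nprocs * sizeof(int));
    int *perm = malloc(n_need * sizeof(int));    /* ghost slot of ind[i] */
    for (int p = 0; p < nprocs; p++) pos[p] = rdis[p];
    for (int i = 0; i < n_need; i++) {
        int p = ind[i] / N_LOCAL;
        perm[i] = pos[p];
        req[pos[p]++] = ind[i];
    }
    /* Count arrays swap roles here: we send our requests, receive others'. */
    int *send_idx = malloc(tot_send * sizeof(int));
    MPI_Alltoallv(req, recvcnt, rdis, MPI_INT,
                  send_idx, sendcnt, sdis, MPI_INT, MPI_COMM_WORLD);

    /* --- Executor: cheap, repeated every iteration of the outer loop. --- */
    double *sendbuf = malloc(tot_send * sizeof(double));
    double *ghost   = malloc(n_need * sizeof(double));
    for (int iter = 0; iter < 10; iter++) {
        for (int k = 0; k < tot_send; k++)       /* pack owned values */
            sendbuf[k] = x[send_idx[k] % N_LOCAL];
        MPI_Alltoallv(sendbuf, sendcnt, sdis, MPI_DOUBLE,
                      ghost, recvcnt, rdis, MPI_DOUBLE, MPI_COMM_WORLD);
        double s = 0.0;                          /* the irregular loop body */
        for (int i = 0; i < n_need; i++) s += ghost[perm[i]];
        for (int i = 0; i < N_LOCAL; i++) x[i] += s * 1e-3;
    }
    if (rank == 0) printf("done\n");
    MPI_Finalize();
    return 0;
}
```

Because the executor's per-iteration cost is only a pack and one MPI_Alltoallv, amortizing the inspector over many iterations is what makes schedule reuse pay off for irregular loops.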

1997 ◽  
Vol 6 (1) ◽  
pp. 41-58 ◽  
Author(s):  
T. Kamachi ◽  
A. Müller ◽  
R. Rühl ◽  
Y. Seo ◽  
K. Suehiro ◽  
...  

We have developed a compilation system which extends High Performance Fortran (HPF) in various aspects. We support the parallelization of well-structured problems with loop distribution and alignment directives similar to HPF's data distribution directives. Such directives both give the user additional control and simplify the compilation process. For the support of unstructured problems, we provide directives for dynamic data distribution through user-defined mappings. The compiler also allows integration of Message Passing Interface (MPI) primitives. The system is part of a complete programming environment which also comprises a parallel debugger and a performance monitor and analyzer. After an overview of the compiler, we describe the language extensions and related compilation mechanisms in detail. Performance measurements demonstrate the compiler's applicability to a variety of application classes.
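
To make the effect of a loop-distribution directive concrete, here is a hedged C/MPI sketch of the code a compiler might emit for a block-distributed loop; the balanced block formula is standard, and nothing here is taken from this particular compilation system.

```c
/* Sketch of compiler-generated block loop distribution (C + MPI).
 * The partition formula is the standard balanced block split. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p, n = 1000;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Balanced block partition: the first n % p ranks get one extra iteration. */
    int base = n / p, rem = n % p;
    int lo = rank * base + (rank < rem ? rank : rem);
    int hi = lo + base + (rank < rem ? 1 : 0);

    double local = 0.0, total;
    for (int i = lo; i < hi; i++)    /* each rank executes only its block */
        local += (double)i * i;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum of squares below %d: %.0f\n", n, total);
    MPI_Finalize();
    return 0;
}
```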


2000 ◽  
Vol 8 (2) ◽  
pp. 73-93 ◽  
Author(s):  
T. Fahringer ◽  
A. Požgaj

Developing distributed and parallel programs on today's multiprocessor architectures is still a challenging task. Particularly distressing is the lack of effective performance tools that support the programmer in evaluating changes in code, problem and machine sizes, and target architectures. In this paper we introduce P3T+, a performance estimator for mostly regular HPF (High Performance Fortran) programs that also partially covers message-passing (MPI) programs. P3T+ is unique in modeling programs, compiler code transformations, and parallel and distributed architectures. It computes at compile time a variety of performance parameters including work distribution, number of transfers, amount of data transferred, transfer times, computation times, and number of cache misses. Several novel technologies are employed to compute these parameters: loop iteration spaces, array access patterns, and data distributions are modeled by employing highly effective symbolic analysis. Communication is estimated by simulating the behavior of the communication library used by the underlying compiler. Computation times are predicted through pre-measured kernels on every target architecture of interest. We carefully model the most critical architecture-specific factors such as cache line sizes, number of cache lines available, startup times, message transfer time per byte, etc. P3T+ has been implemented and is closely integrated with the Vienna High Performance Compiler (VFC) to support programmers in developing parallel and distributed applications. Experimental results for realistic kernel codes taken from real-world applications are presented to demonstrate both the accuracy and the usefulness of P3T+.
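
As a rough illustration of the style of compile-time estimation described here, the toy C model below combines a startup-plus-per-byte transfer cost with a one-miss-per-cache-line estimate; all machine and program parameters are invented placeholders rather than values or formulas from P3T+.

```c
/* Toy analytical cost model in the spirit of compile-time estimation.
 * All numbers are made-up placeholders, not measurements from the paper. */
#include <stdio.h>

int main(void) {
    /* Assumed machine parameters (placeholders). */
    double startup_s  = 20e-6;       /* message startup time */
    double per_byte_s = 2e-9;        /* transfer time per byte */
    int    line_bytes = 64;          /* cache line size */

    /* Assumed program parameters for one parallel loop (placeholders). */
    long n_transfers = 128;          /* number of messages */
    long msg_bytes   = 8192;         /* bytes per message */
    long array_bytes = 4L << 20;     /* data touched once, streaming access */

    double t_comm = n_transfers * (startup_s + per_byte_s * msg_bytes);
    long   misses = array_bytes / line_bytes;  /* one miss per line touched */

    printf("estimated transfer time: %.3f ms\n", t_comm * 1e3);
    printf("estimated cache misses:  %ld\n", misses);
    return 0;
}
```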


Author(s):  
JOST BERTHOLD ◽  
HANS-WOLFGANG LOIDL ◽  
KEVIN HAMMOND

Over time, several competing approaches to parallel Haskell programming have emerged. Different approaches support parallelism at a variety of scales, ranging from small multicores to massively parallel high-performance computing systems. They also provide varying degrees of control, ranging from completely implicit approaches to ones providing full programmer control. Most current designs assume a shared memory model at the programmer, implementation and hardware levels. This is, however, becoming increasingly divorced from the reality at the hardware level. It also imposes significant unwanted runtime overheads in the form of garbage collection synchronisation, etc. What is needed is an easy way to abstract over the implementation and hardware levels, while presenting a simple parallelism model to the programmer. The PArallEl shAred Nothing runtime system design aims to provide a portable and high-level shared-nothing implementation platform for parallel Haskell dialects. It abstracts over major issues such as work distribution and data serialisation, consolidating existing, successful designs into a single framework. It also provides an optional virtual shared-memory programming abstraction for (possibly) shared-nothing parallel machines, such as modern multicore/manycore architectures or cluster/cloud computing systems. It builds on, unifies, and extends existing well-developed support for shared-memory parallelism that is provided by the widely used GHC Haskell compiler. This paper summarises the state of the art in shared-nothing parallel Haskell implementations, introduces the PArallEl shAred Nothing abstractions, shows how they can be used to implement three distinct parallel Haskell dialects, and demonstrates that good scalability can be obtained on recent parallel machines.


2000 ◽  
Vol 8 (3) ◽  
pp. 143-162 ◽  
Author(s):  
Dimitrios S. Nikolopoulos ◽  
Theodore S. Papatheodorou ◽  
Constantine D. Polychronopoulos ◽  
Jesús Labarta ◽  
Eduard Ayguadé

This paper makes two important contributions. First, it investigates the performance implications of data placement in OpenMP programs running on modern NUMA multiprocessors. Data locality and minimization of the rate of remote memory accesses are critical for sustaining high performance on these systems. We show that due to the low remote-to-local memory access latency ratio of contemporary NUMA architectures, reasonably balanced page placement schemes, such as round-robin or random distribution, incur only modest performance losses. Second, the paper presents a transparent, user-level page migration engine able to recover any performance loss that stems from suboptimal placement of pages in iterative OpenMP programs. The main body of the paper describes how our OpenMP runtime environment uses page migration to implement implicit data distribution and redistribution schemes without programmer intervention. Our experimental results verify the effectiveness of the proposed framework and provide a proof of concept that data distribution directives need not be introduced in OpenMP, preserving the simplicity and portability of the programming model.
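
The placement effect the paper studies can be seen in miniature with OpenMP and the common first-touch allocation policy of NUMA operating systems: initializing data in parallel places each page near the thread that later uses it. The sketch below assumes first-touch behavior and is not code from the paper's migration engine.

```c
/* First-touch page placement sketch (C + OpenMP). Assumes the OS places a
 * page on the node of the thread that first writes it; this is a common
 * policy, not a mechanism from the paper. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)

int main(void) {
    double *a = malloc(N * sizeof(double));

    /* Parallel first touch: each thread faults in the pages of its own chunk. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) a[i] = 1.0;

    double s = 0.0;                  /* same static schedule -> local accesses */
    #pragma omp parallel for schedule(static) reduction(+:s)
    for (long i = 0; i < N; i++) s += a[i];

    printf("sum = %.0f (threads: %d)\n", s, omp_get_max_threads());
    free(a);
    return 0;
}
```

If the initialization were done serially, every page would land on one node and the compute loop would pay remote-access penalties; transparent page migration, as proposed in the paper, recovers locality in exactly such cases without programmer intervention.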


2007 ◽  
Vol 15 (1) ◽  
pp. 45-65 ◽  
Author(s):  
Hans P. Zima

When the first specification of the FORTRAN language was released in 1956, the goal was to provide an "automatic programming system" that would enhance the economy of programming by replacing assembly language with a notation closer to the domain of scientific programming. A key issue in this context, explicitly recognized by the authors of the language, was the requirement to produce efficient object programs that could compete with their hand-coded counterparts. More than 50 years later, a similar situation exists with respect to finding the right programming paradigm for high performance computing systems. FORTRAN, as the traditional language for scientific programming, has played a major role in the quest for high-productivity programming languages that satisfy very strict performance constraints. This paper focuses on high-level support for locality awareness, one of the most important requirements in this context. The discussion centers on the High Performance Fortran (HPF) family of languages and their influence on current language developments for petascale computing. HPF is a data-parallel language that was designed to provide the user with a high-level interface for programming scientific applications, while delegating to the compiler the task of generating an explicitly parallel message-passing program. We outline the developments that led to HPF, explain its major features, identify a set of weaknesses, and discuss subsequent languages that address these problems. The final part of the paper deals with Chapel, a modern object-oriented language developed in the High Productivity Computing Systems (HPCS) program sponsored by DARPA. A salient property of Chapel is its general framework for the support of user-defined distributions, which is related in many ways to ideas first described in Vienna Fortran. This framework is general enough to allow a concise specification of sparse data distributions. The paper concludes with an outlook on future research in this area.


1999 ◽  
Vol 7 (3-4) ◽  
pp. 261-273 ◽  
Author(s):  
Sotiris Ioannidis ◽  
Umit Rencuzogullari ◽  
Robert Stets ◽  
Sandhya Dwarkadas

Clusters of workstations provide a cost-effective, high performance parallel computing environment. These environments, however, are often shared by multiple users, or may consist of heterogeneous machines. As a result, parallel applications executing in these environments must operate despite unequal computational resources. For maximum performance, applications should automatically adapt their execution to maximize use of the available resources. Ideally, this adaptation should be transparent to the application programmer. In this paper, we present CRAUL (Compiler and Run-Time Integration for Adaptation Under Load), a system that dynamically balances computational load in a parallel application. Our target run-time is software-based distributed shared memory (SDSM). SDSM is a good target for parallelizing compilers since it reduces compile-time complexity by providing data caching and other support for dynamic load balancing. CRAUL combines compile-time support to identify data access patterns with a run-time system that uses the access information to intelligently distribute the parallel workload in loop-based programs. The distribution is chosen according to the relative power of the processors, so as to minimize SDSM overhead and maximize locality. We have evaluated the resulting load distribution in the presence of different types of load: computational, computational and memory intensive, and network load. CRAUL performs within 5-23% of ideal in the presence of load, and is able to improve on naive compiler-based work distribution that does not take locality into account, even in the absence of load.
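
A minimal sketch of the core sizing rule, assuming invented relative-speed measurements: iterations are handed out in proportion to processor power, with the last processor absorbing rounding slack. The real system additionally weighs SDSM overhead and locality, which this toy version ignores.

```c
/* Proportional loop partitioning sketch (plain C). The speed numbers are
 * invented; CRAUL's actual choice also accounts for SDSM overhead and locality. */
#include <stdio.h>

int main(void) {
    double speed[] = {1.0, 0.5, 2.0, 1.5};  /* measured relative power (assumed) */
    int p = 4, n = 1000;

    double total = 0.0;
    for (int i = 0; i < p; i++) total += speed[i];

    int assigned = 0;
    for (int i = 0; i < p; i++) {
        /* Last processor takes the remainder so all n iterations are covered. */
        int chunk = (i == p - 1) ? n - assigned
                                 : (int)(n * speed[i] / total + 0.5);
        printf("processor %d: iterations %d..%d\n",
               i, assigned, assigned + chunk - 1);
        assigned += chunk;
    }
    return 0;
}
```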


2019 ◽  
Vol 214 ◽  
pp. 01035 ◽  
Author(s):  
Matthias Richter ◽  
Mikolaj Krzewicki ◽  
Giulio Eulisse

The ALICE experiment at the Large Hadron Collider (LHC) at CERN is planned to be operated in a continuous data-taking mode in Run 3. This will allow the inspection of data from all Pb-Pb collisions at a rate of 50 kHz, giving access to rare physics signals embedded in a large background. Based on experience with real-time reconstruction of particle trajectories and event properties in the ALICE High Level Trigger, the ALICE O2 facility is currently being designed and developed to support processing of a continuous, triggerless stream of data segmented into entities referred to as timeframes. Both raw data input into the ALICE O2 system and the actual processing of aggregated timeframes are distributed among multiple processes on a many-node cluster. Process communication is based on the asynchronous message-passing paradigm. This paper presents the basic concept for identification of data in the distributed system, together with prototype implementations and performance measurements.
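
A hypothetical C sketch of the kind of self-describing header such a system can use to identify data segments in flight; the field names and layout below are invented for illustration and are not taken from the O2 data model.

```c
/* Hypothetical message header for identifying data segments in a distributed,
 * message-passing system. Field names and layout are invented for this sketch. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t origin;        /* detector / producer that created the payload */
    uint32_t data_kind;     /* what the payload contains (raw, clusters, ...) */
    uint64_t timeframe_id;  /* which timeframe this segment belongs to */
    uint32_t sub_spec;      /* sub-specification, e.g. link or sector */
    uint64_t payload_bytes; /* size of the payload that follows the header */
} DataHeader;

int main(void) {
    DataHeader h = { .origin = 1, .data_kind = 2,
                     .timeframe_id = 42, .sub_spec = 0, .payload_bytes = 1024 };
    /* A receiver aggregates all segments sharing a timeframe_id before
     * processing the complete timeframe. */
    printf("timeframe %llu: %llu payload bytes\n",
           (unsigned long long)h.timeframe_id,
           (unsigned long long)h.payload_bytes);
    return 0;
}
```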


2020 ◽  
Author(s):  
James McDonagh ◽  
William Swope ◽  
Richard L. Anderson ◽  
Michael Johnston ◽  
David J. Bray

Digitization offers significant opportunities for the formulated product industry to transform the way it works and develop new methods of business. R&D is one area of operation where taking advantage of these technologies is challenging, due to its high level of domain specialisation and creativity, but the benefits could be significant. Recent developments in base-level technologies such as artificial intelligence (AI)/machine learning (ML), robotics and high performance computing (HPC), to name a few, present disruptive and transformative technologies which could offer new insights, discovery methods and enhanced chemical control when combined in a digital ecosystem of connectivity, distributive services and decentralisation. At the fundamental level, research in these technologies has shown that new physical and chemical insights can be gained, which in turn can augment experimental R&D approaches through physics-based chemical simulation, data-driven models and hybrid approaches. In all of these cases, high-quality data is required to build and validate models, in addition to the skills and expertise to exploit such methods. In this article we give an overview of some of the digital technology demonstrators we have developed for formulated product R&D. We discuss the challenges in building and deploying these demonstrators.


2020 ◽  
Vol 15 ◽  
Author(s):  
Weiwen Zhang ◽  
Long Wang ◽  
Theint Theint Aye ◽  
Juniarto Samsudin ◽  
Yongqing Zhu

Background: Genotype imputation as a service is developed to enable researchers to estimate genotypes on haplotyped data without performing whole genome sequencing. However, genotype imputation is computation-intensive, and it thus remains a challenge to satisfy the high performance requirements of genome-wide association studies (GWAS). Objective: In this paper, we propose a high performance computing solution for genotype imputation on supercomputers to enhance its execution performance. Method: We design and implement a multi-level parallelization that includes job-level, process-level and thread-level parallelization, enabled by job scheduling management, the Message Passing Interface (MPI) and OpenMP, respectively. It involves job distribution, chunk partition and execution, parallelized iteration for imputation, and data concatenation. Owing to this multi-level design, we can exploit multi-machine/multi-core architectures to improve the performance of genotype imputation. Results: Experimental results show that our proposed method outperforms a Hadoop-based implementation of genotype imputation. Moreover, we conduct experiments on supercomputers to evaluate the performance of the proposed method. The evaluation shows that it can significantly shorten the execution time, thus improving the performance of genotype imputation. Conclusion: The proposed multi-level parallelization, when deployed as an imputation service, will help bioinformatics researchers in Singapore conduct genotype imputation and enhance association studies.
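
The process- and thread-level tiers of such a design can be sketched with a hybrid MPI + OpenMP skeleton like the one below; the chunk counts and the impute_position stand-in are invented for illustration, and the job-level tier (handled by the cluster's scheduler) sits outside the code.

```c
/* Schematic multi-level parallelization sketch (C, MPI + OpenMP): chunks are
 * dealt across MPI processes, positions within a chunk across OpenMP threads.
 * The "impute" step is a placeholder, not the actual imputation algorithm. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N_CHUNKS 16
#define CHUNK_LEN 10000

static double impute_position(int chunk, int pos) {
    return (double)(chunk + pos) * 1e-6;  /* stand-in for real work */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local = 0.0, total;
    /* Process level: chunks are dealt out round-robin across MPI ranks. */
    for (int c = rank; c < N_CHUNKS; c += nprocs) {
        /* Thread level: positions within a chunk run in parallel via OpenMP. */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < CHUNK_LEN; i++)
            local += impute_position(c, i);
    }
    /* Stands in for the concatenation step that merges per-chunk output. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("aggregate result: %f\n", total);
    MPI_Finalize();
    return 0;
}
```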


2020 ◽  
Vol 12 (2) ◽  
pp. 19-50 ◽  
Author(s):  
Muhammad Siddique ◽  
Shandana Shoaib ◽  
Zahoor Jan

A key aspect of work processes in service-sector firms is the interconnection between tasks and performance. Relational coordination can play an important role in addressing the issues of coordinating organizational activities, owing to the high level of interdependence complexity in service-sector firms. Research has primarily supported the view that well-devised high performance work systems (HPWS) can intensify organizational performance. There is a growing debate, however, with regard to understanding the "mechanism" linking HPWS and performance outcomes. Using relational coordination theory, this study examines a model of the effects of subsets of HPWS, such as motivation-, skill- and opportunity-enhancing HR practices, on relational coordination among employees working in reciprocally interdependent job settings. Data were gathered from multiple sources, including managers and employees at the individual, functional and unit levels, to capture their understanding of HPWS and relational coordination (RC) in 218 bank branches in Pakistan. Analysing the data via structural equation modelling, the results suggest that HPWS predicted RC among officers at the unit level. The findings of the study contribute to both theory and practice.

