Implicitly threaded parallelism in Manticore

2010 ◽  
Vol 20 (5-6) ◽  
pp. 537-576 ◽  
Author(s):  
MATTHEW FLUET ◽  
MIKE RAINEY ◽  
JOHN REPPY ◽  
ADAM SHAW

The increasing availability of commodity multicore processors is making parallel computing ever more widespread. In order to exploit its potential, programmers need languages that make the benefits of parallelism accessible and understandable. Previous parallel languages have traditionally been intended for large-scale scientific computing, and they tend not to be well suited to programming the applications one typically finds on a desktop system. Thus, we need new parallel-language designs that address a broader spectrum of applications. The Manticore project is our effort to address this need. At its core is Parallel ML, a high-level functional language for programming parallel applications on commodity multicore hardware. Parallel ML provides a diverse collection of parallel constructs for different granularities of work. In this paper, we focus on the implicitly threaded parallel constructs of the language, which support fine-grained parallelism. We concentrate on those elements that distinguish our design from related ones, namely, a novel parallel binding form, a nondeterministic parallel case form, and the treatment of exceptions in the presence of data parallelism. These features differentiate the present work from related work on functional data-parallel language designs, which have focused largely on parallel problems with regular structure and the compiler transformations—most notably, flattening—that make such designs feasible. We present detailed examples utilizing various mechanisms of the language and give a formal description of our implementation.
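The paper's constructs belong to Parallel ML, whose syntax is not reproduced here. As a rough, hedged analogue only, the sketch below models a parallel binding (a computation started eagerly and demanded later) and a nondeterministic parallel case (take whichever branch finishes first) with Python futures; the names pval and pcase are hypothetical and do not reflect the Manticore API, and the real constructs also cancel losing or unused computations, which this sketch omits.

```python
# Illustrative analogue of Manticore-style implicit threading using Python
# futures; names here are hypothetical and do not reflect the Parallel ML API.
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

pool = ThreadPoolExecutor()

def pval(thunk):
    """Analogue of a parallel binding: start the computation eagerly,
    but let the consumer demand the result later (or never)."""
    return pool.submit(thunk)

def pcase(*futures):
    """Analogue of a nondeterministic parallel case: return the result of
    whichever computation finishes first (losers are not cancelled here)."""
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    return next(iter(done)).result()

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

if __name__ == "__main__":
    a = pval(lambda: fib(25))            # runs in parallel with the main thread
    b = pval(lambda: sum(range(10**6)))
    print("first ready:", pcase(a, b))
    print("both:", a.result(), b.result())
```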

2010 ◽  
Vol 638-642 ◽  
pp. 3123-3127
Author(s):  
V.A. Malyshevsky ◽  
E.I. Khlusova ◽  
V.V. Orlov

The metallurgical industry can be considered one of the fields best positioned to adopt nanotechnologies, which in the near future should enable large-scale production and a high return on investment. Of particular note are the physical and mechanical properties of nanostructured steels and alloys (strength, plasticity, toughness, and so on), which will substantially exceed those of comparable materials produced with conventional technologies. Investigations have shown that basic principles for tailoring the structure of low-carbon, low-alloy steels down to the nano-level can be formulated: (1) morphological similarity of structural components, with a predominance of globular-type structures achieved through reduced carbon content and rational alloying; (2) formation of a finely dispersed carbide phase of globular morphology; (3) elimination of extended interphase boundaries; and (4) formation of a fragmented structure with boundaries close to high-angle ones, inheriting the structure of fine-grained deformed austenite.


2015 ◽  
Vol 2015 ◽  
pp. 1-20 ◽  
Author(s):  
Xiaowen Chen ◽  
Zhonghai Lu ◽  
Axel Jantsch ◽  
Shuming Chen ◽  
Yang Guo ◽  
...  

On-chip computing platforms are evolving from single-core bus-based systems to many-core network-based systems, which this paper refers to as On-chip Large-scale Parallel Computing Architectures (OLPCs). Homogeneous OLPCs feature strong regularity and scalability due to their identical cores and routers. In data-parallel applications, disjoint data subsets are processed independently by the same program running on different cores. Consequently, such applications can achieve good speedup on homogeneous OLPCs. The paper addresses modeling the speedup performance of homogeneous OLPCs for data-parallel applications. In establishing the speedup model, the network communication latency and the data-storage patterns of data-parallel applications are modeled and analyzed in detail. Two abstract concepts (equivalent serial packet and equivalent serial communication) are proposed to construct the network communication latency model. The uniform and hotspot traffic models are adopted to reflect the data-storage patterns. Several practical suggestions emerge from the analysis of the model. Finally, three data-parallel applications are run on our cycle-accurate homogeneous OLPC experimental platform to validate the analytical results and to demonstrate that the study provides a feasible way to estimate and evaluate the performance of data-parallel applications on homogeneous OLPCs.
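The closed-form model itself is not given in the abstract; the sketch below is only a generic Amdahl-style estimate with an added per-core communication term, under assumed parameters, to show the shape such a speedup model takes once network latency and traffic patterns (uniform versus hotspot) are folded in.

```python
# A generic speedup estimate for a data-parallel workload on an N-core
# network-on-chip; the communication-latency term is an assumption,
# not the model derived in the paper.
def speedup(n_cores, serial_frac, comp_cycles, comm_cycles_per_core):
    """Serial time / parallel time, with a per-core communication overhead."""
    t_serial = comp_cycles
    t_parallel = (serial_frac * comp_cycles
                  + (1 - serial_frac) * comp_cycles / n_cores
                  + comm_cycles_per_core)
    return t_serial / t_parallel

if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        # hotspot-like traffic: communication cost grows with core count
        comm = 50 * n ** 0.5
        print(n, round(speedup(n, serial_frac=0.05,
                               comp_cycles=1e6,
                               comm_cycles_per_core=comm), 2))
```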


2001 ◽  
Vol 11 (02n03) ◽  
pp. 363-374 ◽  
Author(s):  
MINYI GUO

It is important for programmers to understand the semantics of a programming language. However, little work has been done on semantic descriptions of HPF-like data-parallel languages. In this paper, we first define a simple language [Formula: see text], which includes the principal facilities of a data-parallel language such as HPF. Then we present a denotational semantic model of [Formula: see text]. It is useful for understanding the components of an HPF-like language, such as data alignment and distribution directives, and forall data-parallel statements.
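As a hedged illustration of what a denotational treatment of a forall statement can look like (not the paper's actual semantics or its language, whose name is elided above), the toy interpreter below gives the meaning of a block-distributed forall as a function from an old global state to a new one.

```python
# Toy denotation of a forall statement over a block-distributed array.
# The language, directives, and distribution here are illustrative only.
def distribute_block(array, n_procs):
    """Model a BLOCK distribution: split the index space among processors."""
    size = -(-len(array) // n_procs)  # ceiling division
    return [list(range(p * size, min((p + 1) * size, len(array))))
            for p in range(n_procs)]

def forall(array, body, n_procs=4):
    """Denotation of `forall i: a[i] = body(i)`: each processor updates its
    own block, and every body reads the *old* state, as forall requires."""
    new = list(array)
    for owned in distribute_block(array, n_procs):
        for i in owned:                 # conceptually parallel across procs
            new[i] = body(i, array)
    return new

if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    print(forall(a, lambda i, old: old[i] * 10))
```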


2012 ◽  
Vol 2012 ◽  
pp. 1-15 ◽  
Author(s):  
Ilia Lebedev ◽  
Christopher Fletcher ◽  
Shaoyi Cheng ◽  
James Martin ◽  
Austin Doupnik ◽  
...  

We present a highly productive approach to hardware design based on a many-core microarchitectural template used to implement compute-bound applications expressed in a high-level data-parallel language such as OpenCL. The template is customized on a per-application basis via a range of high-level parameters such as the interconnect topology or processing element architecture. The key benefits of this approach are that it (i) allows programmers to express parallelism through an API defined in a high-level programming language, (ii) supports coarse-grained multithreading and fine-grained threading while permitting bit-level resource control, and (iii) reduces the effort required to repurpose the system for different algorithms or different applications. We compare template-driven design to both full-custom and programmable approaches by studying implementations of a compute-bound data-parallel Bayesian graph inference algorithm across several candidate platforms. Specifically, we examine a range of template-based implementations on both FPGA and ASIC platforms and compare each against full-custom designs. Throughout this study, we use a general-purpose graphics processing unit (GPGPU) implementation as a performance and area baseline. We show that our approach, similar in productivity to programmable approaches such as GPGPU applications, yields implementations with performance approaching that of full-custom designs on both FPGA and ASIC platforms.
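As a hedged sketch of the kind of per-application customization described above, the configuration object below names a few plausible template parameters (processing-element count, interconnect topology, datapath width, hardware threads); the field names and the area figures are assumptions, not the template's actual interface.

```python
# Illustrative configuration for a many-core microarchitectural template;
# the fields and numbers are assumptions, not the paper's actual parameters.
from dataclasses import dataclass

@dataclass
class TemplateConfig:
    num_pes: int            # number of processing elements
    interconnect: str       # e.g. "ring", "mesh", "crossbar"
    pe_datapath_bits: int   # bit-level resource control per PE
    hw_threads_per_pe: int  # coarse-grained multithreading depth

def estimate_area(cfg: TemplateConfig, pe_area_mm2=0.05, router_area_mm2=0.02):
    """Back-of-the-envelope area estimate used to compare template instances."""
    routers = cfg.num_pes if cfg.interconnect in ("ring", "mesh") else 1
    return cfg.num_pes * pe_area_mm2 + routers * router_area_mm2

if __name__ == "__main__":
    fpga_like = TemplateConfig(num_pes=16, interconnect="ring",
                               pe_datapath_bits=32, hw_threads_per_pe=4)
    print(f"estimated area: {estimate_area(fpga_like):.2f} mm^2")
```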


2005 ◽  
Vol 15 (03) ◽  
pp. 289-304 ◽  
Author(s):  
FRÉDÉRIC GAVA ◽  
FRÉDÉRIC LOULERGUE

We have designed a functional data-parallel language called BSML for programming bulk synchronous parallel (BSP) algorithms. Deadlocks and indeterminism are avoided, and execution time can then be estimated. Very large-scale applications may need more than one parallel machine; this setting is known as metacomputing. A major problem in programming applications for such architectures is their hierarchical network structure: the latency and bandwidth of the network between parallel nodes can be orders of magnitude worse than those inside a parallel node. Here we consider how to extend both the BSP model and BSML, which are well suited to parallel computing, in order to obtain a model and a functional language suitable for metacomputing.
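For context, the cost of a flat BSP superstep is commonly written as w + h·g + l; the sketch below, with assumed parameter values, naively splits communication into intra-node and inter-node parts to show why the hierarchical network structure mentioned above breaks the flat model and motivates an extended one.

```python
# Flat BSP superstep cost versus a naive two-level (metacomputing) variant.
# The parameter values are assumptions chosen only to illustrate the gap.
def bsp_cost(w, h, g, l):
    """Standard BSP superstep: computation w, h-relation h, gap g, latency l."""
    return w + h * g + l

def two_level_cost(w, h_intra, h_inter, g_intra, g_inter, l_intra, l_inter):
    """Split communication into intra-node and inter-node traffic, each with
    its own bandwidth gap and synchronization latency."""
    return w + h_intra * g_intra + l_intra + h_inter * g_inter + l_inter

if __name__ == "__main__":
    print("flat BSP: ", bsp_cost(w=10_000, h=200, g=4, l=1_000))
    # inter-cluster gap and latency are orders of magnitude worse than intra-node
    print("two-level:", two_level_cost(10_000, 150, 50, 4, 400, 1_000, 100_000))
```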


2005 ◽  
Vol 15 (04) ◽  
pp. 407-414
Author(s):  
MOSTAFA I. SOLIMAN ◽  
STANISLAV G. SEDUKHIN

Different subtasks of an application usually have different computational, memory, and I/O requirements, and hence place different demands on computer capabilities. A more appropriate approach for achieving both high performance and a simple programming model is therefore to design a processor with a multi-level instruction set architecture (ISA), which yields high performance and minimal executable code size. Since the fundamental data structures of a wide variety of existing applications are scalars, vectors, and matrices, our research Trident processor has a three-level ISA operating on zero-, one-, and two-dimensional arrays of data. These levels let programs express a large amount of fine-grained data parallelism directly to the processor, instead of extracting it dynamically with complicated logic or statically with compilers. This reduces design complexity and provides a high-level programming interface to the hardware. In this paper, the performance of the Trident processor is evaluated on the BLAS, which represent the kernel operations of many data-parallel applications. We show that the Trident processor reduces the number of clock cycles per floating-point operation in proportion to the number of execution datapaths.
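A hedged sketch of the three instruction levels follows: scalar, vector, and matrix operations, plus a back-of-the-envelope cycles-per-FLOP figure in the spirit of the BLAS evaluation. The cycle counts are assumptions for illustration, not Trident measurements.

```python
# Illustrative scalar/vector/matrix "instruction levels" and a rough
# cycles-per-FLOP figure; the cycle counts are assumptions, not Trident data.
def scalar_fma(a, x, y):                       # level 0: operates on scalars
    return a * x + y

def vector_axpy(a, x, y):                      # level 1: whole-vector operation
    return [scalar_fma(a, xi, yi) for xi, yi in zip(x, y)]

def matrix_vector(A, x):                       # level 2: whole-matrix operation
    return [sum(ai * xi for ai, xi in zip(row, x)) for row in A]

def cycles_per_flop(total_cycles, flops):
    return total_cycles / flops

if __name__ == "__main__":
    n = 4
    A = [[1.0] * n for _ in range(n)]
    x = [2.0] * n
    print(matrix_vector(A, x))
    print(vector_axpy(3.0, x, x))
    # with p parallel datapaths, cycles per FLOP should drop roughly as 1/p
    for p in (1, 2, 4, 8):
        print(p, cycles_per_flop(total_cycles=2 * n * n / p, flops=2 * n * n))
```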


2012 ◽  
Vol 2012 ◽  
pp. 1-11 ◽  
Author(s):  
Ghislain Roquier ◽  
Endri Bezati ◽  
Marco Mattavelli

The new generation of multicore processors and reconfigurable hardware platforms provides a dramatic increase in available parallelism and processing capability. However, one obstacle to exploiting the full promise of such platforms is deeply rooted in sequential thinking: the sequential programming model does not naturally expose the potential parallelism needed to build parallel applications that can be mapped efficiently onto different kinds of platforms. A paradigm shift is necessary at all levels of application development to yield portable and scalable implementations on the widest range of heterogeneous platforms. This paper presents a design flow for the hardware and software synthesis of heterogeneous systems that automatically generates hardware components, software components, and the appropriate interfaces between them from a single high-level, dataflow-based description of the application, targeting heterogeneous architectures composed of reconfigurable hardware units and multicore processors. Experimental results from implementing several video coding algorithms on heterogeneous platforms are also provided to show the effectiveness of the approach in terms of both portability and scalability.
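As a minimal sketch of the dataflow paradigm that such a design flow starts from, the toy network below consists of actors that fire when tokens are available on their input queues; the same description could, in principle, be mapped to software threads or hardware units. The Actor API is illustrative and is not the input language of the paper's tools.

```python
# Minimal dataflow network: actors fire when input tokens are available.
# The API is illustrative; real flows use a dedicated dataflow language.
from collections import deque

class Actor:
    def __init__(self, fn, inputs, output):
        self.fn, self.inputs, self.output = fn, inputs, output

    def fire(self):
        """Consume one token per input and emit one output token, if possible."""
        if all(q for q in self.inputs):
            self.output.append(self.fn(*[q.popleft() for q in self.inputs]))
            return True
        return False

if __name__ == "__main__":
    src, doubled, result = deque([1, 2, 3, 4]), deque(), deque()
    scale = Actor(lambda x: 2 * x, [src], doubled)
    accum = Actor(lambda x: x, [doubled], result)
    # naive scheduler: keep firing until no actor can make progress
    while any(a.fire() for a in (scale, accum)):
        pass
    print(list(result))   # [2, 4, 6, 8]
```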


2021 ◽  
Vol 5 (OOPSLA) ◽  
pp. 1-31
Author(s):  
Wolf Honoré ◽  
Jieung Kim ◽  
Ji-Yong Shin ◽  
Zhong Shao

Despite recent advances, guaranteeing the correctness of large-scale distributed applications without compromising performance remains a challenging problem. Network and node failures are inevitable and, for some applications, careful control over how they are handled is essential. Unfortunately, existing approaches either completely hide these failures behind an atomic state machine replication (SMR) interface, or expose all of the network-level details, sacrificing atomicity. We propose a novel, compositional, atomic distributed object (ADO) model for strongly consistent distributed systems that combines the best of both options. The object-oriented API abstracts over protocol-specific details and decouples high-level correctness reasoning from implementation choices. At the same time, it intentionally exposes an abstract view of certain key distributed failure cases, thus allowing for more fine-grained control over them than SMR-like models. We demonstrate that proving properties even of composite distributed systems can be straightforward with our Coq verification framework, Advert, thanks to the ADO model. We also show that a variety of common protocols including multi-Paxos and Chain Replication refine the ADO semantics, which allows one to freely choose among them for an application's implementation without modifying ADO-level correctness proofs.
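As a hedged, client-side illustration of the ADO idea (not Advert's Coq definitions), the sketch below shows an object whose operations appear atomic yet can visibly fail or remain unresolved, instead of hiding every failure behind an SMR interface; all names and failure cases are assumptions.

```python
# Illustrative client-side view of an atomic distributed object (ADO);
# method names and failure cases here are assumptions, not Advert's API.
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class Result:
    status: str                 # "committed" or "unresolved"
    value: Optional[Any] = None

@dataclass
class AtomicDistributedObject:
    """An object whose methods appear atomic, but whose invocations can
    visibly fail or stay unresolved instead of being hidden behind SMR."""
    state: int = 0
    log: list = field(default_factory=list)

    def invoke(self, op, healthy_quorum: bool) -> Result:
        if not healthy_quorum:
            # the failure is exposed to the caller, not silently retried
            return Result("unresolved")
        self.state = op(self.state)
        self.log.append(op.__name__)
        return Result("committed", self.state)

if __name__ == "__main__":
    counter = AtomicDistributedObject()
    def increment(s): return s + 1
    print(counter.invoke(increment, healthy_quorum=True))
    print(counter.invoke(increment, healthy_quorum=False))  # exposed failure
```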

