Implicitly threaded parallelism in Manticore

2010 ◽  
Vol 20 (5-6) ◽  
pp. 537-576 ◽  
Author(s):  
MATTHEW FLUET ◽  
MIKE RAINEY ◽  
JOHN REPPY ◽  
ADAM SHAW

The increasing availability of commodity multicore processors is making parallel computing ever more widespread. In order to exploit its potential, programmers need languages that make the benefits of parallelism accessible and understandable. Previous parallel languages have traditionally been intended for large-scale scientific computing, and they tend not to be well suited to programming the applications one typically finds on a desktop system. Thus, we need new parallel-language designs that address a broader spectrum of applications. The Manticore project is our effort to address this need. At its core is Parallel ML, a high-level functional language for programming parallel applications on commodity multicore hardware. Parallel ML provides a diverse collection of parallel constructs for different granularities of work. In this paper, we focus on the implicitly threaded parallel constructs of the language, which support fine-grained parallelism. We concentrate on those elements that distinguish our design from related ones, namely, a novel parallel binding form, a nondeterministic parallel case form, and the treatment of exceptions in the presence of data parallelism. These features differentiate the present work from related work on functional data-parallel language designs, which have focused largely on parallel problems with regular structure and the compiler transformations—most notably, flattening—that make such designs feasible. We present detailed examples utilizing various mechanisms of the language and give a formal description of our implementation.
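The paper's constructs belong to Parallel ML, whose syntax is not reproduced here. As a rough, hedged analogue only, the sketch below models a parallel binding (a computation started eagerly and demanded later) and a nondeterministic parallel case (take whichever branch finishes first) with Python futures; the names pval and pcase are hypothetical and do not reflect the Manticore API, and the real constructs also cancel losing or unused computations, which this sketch omits.

```python
# Illustrative analogue of Manticore-style implicit threading using Python
# futures; names here are hypothetical and do not reflect the Parallel ML API.
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

pool = ThreadPoolExecutor()

def pval(thunk):
    """Analogue of a parallel binding: start the computation eagerly,
    but let the consumer demand the result later (or never)."""
    return pool.submit(thunk)

def pcase(*futures):
    """Analogue of a nondeterministic parallel case: return the result of
    whichever computation finishes first (losers are not cancelled here)."""
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    return next(iter(done)).result()

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

if __name__ == "__main__":
    a = pval(lambda: fib(25))            # runs in parallel with the main thread
    b = pval(lambda: sum(range(10**6)))
    print("first ready:", pcase(a, b))
    print("both:", a.result(), b.result())
```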

2010 ◽  
Vol 638-642 ◽  
pp. 3123-3127
Author(s):  
V.A. Malyshevsky ◽  
E.I. Khlusova ◽  
V.V. Orlov

The metallurgical industry can be considered one of the fields best positioned to adopt nanotechnologies, which in the near future should enable large-scale production and a high return on investment. Of particular note are the physical and mechanical properties of nanostructured steels and alloys (strength, plasticity, toughness, and so on), which will substantially exceed those of comparable materials produced with conventional technologies. Investigations have shown that basic principles for tailoring the structure of low-carbon, low-alloy steels down to the nano-level can be formulated: (1) morphological similarity of structural components, with a predominance of globular-type structures achieved through reduced carbon content and rational alloying; (2) formation of a finely dispersed carbide phase of globular morphology; (3) elimination of extended interphase boundaries; and (4) formation of a fragmented structure with boundaries close to high-angle ones, inheriting the structure of fine-grained deformed austenite.


2015 ◽  
Vol 2015 ◽  
pp. 1-20 ◽  
Author(s):  
Xiaowen Chen ◽  
Zhonghai Lu ◽  
Axel Jantsch ◽  
Shuming Chen ◽  
Yang Guo ◽  
...  

On-chip computing platforms are evolving from single-core bus-based systems to many-core network-based systems, which this paper refers to as On-chip Large-scale Parallel Computing Architectures (OLPCs). Homogeneous OLPCs feature strong regularity and scalability due to their identical cores and routers. In data-parallel applications, disjoint data subsets are processed independently by the same program running on different cores. Consequently, such applications can achieve good speedup on homogeneous OLPCs. The paper addresses modeling the speedup performance of homogeneous OLPCs for data-parallel applications. In establishing the speedup model, the network communication latency and the data-storage patterns of data-parallel applications are modeled and analyzed in detail. Two abstract concepts (equivalent serial packet and equivalent serial communication) are proposed to construct the network communication latency model. The uniform and hotspot traffic models are adopted to reflect the data-storage patterns. Several practical suggestions emerge from the analysis of the model. Finally, three data-parallel applications are run on our cycle-accurate homogeneous OLPC experimental platform to validate the analytical results and to demonstrate that the study provides a feasible way to estimate and evaluate the performance of data-parallel applications on homogeneous OLPCs.
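The closed-form model itself is not given in the abstract; the sketch below is only a generic Amdahl-style estimate with an added per-core communication term, under assumed parameters, to show the shape such a speedup model takes once network latency and traffic patterns (uniform versus hotspot) are folded in.

```python
# A generic speedup estimate for a data-parallel workload on an N-core
# network-on-chip; the communication-latency term is an assumption,
# not the model derived in the paper.
def speedup(n_cores, serial_frac, comp_cycles, comm_cycles_per_core):
    """Serial time / parallel time, with a per-core communication overhead."""
    t_serial = comp_cycles
    t_parallel = (serial_frac * comp_cycles
                  + (1 - serial_frac) * comp_cycles / n_cores
                  + comm_cycles_per_core)
    return t_serial / t_parallel

if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        # hotspot-like traffic: communication cost grows with core count
        comm = 50 * n ** 0.5
        print(n, round(speedup(n, serial_frac=0.05,
                               comp_cycles=1e6,
                               comm_cycles_per_core=comm), 2))
```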


2001 ◽  
Vol 11 (02n03) ◽  
pp. 363-374 ◽  
Author(s):  
MINYI GUO

It is important for programmers to understand the semantics of a programming language. However, little work has been done on semantic descriptions of HPF-like data-parallel languages. In this paper, we first define a simple language [Formula: see text], which includes the principal facilities of a data-parallel language such as HPF. Then we present a denotational semantic model of [Formula: see text]. It is useful for understanding the components of an HPF-like language, such as data alignment and distribution directives, and forall data-parallel statements.
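As a hedged illustration of what a denotational treatment of a forall statement can look like (not the paper's actual semantics or its language, whose name is elided above), the toy interpreter below gives the meaning of a block-distributed forall as a function from an old global state to a new one.

```python
# Toy denotation of a forall statement over a block-distributed array.
# The language, directives, and distribution here are illustrative only.
def distribute_block(array, n_procs):
    """Model a BLOCK distribution: split the index space among processors."""
    size = -(-len(array) // n_procs)  # ceiling division
    return [list(range(p * size, min((p + 1) * size, len(array))))
            for p in range(n_procs)]

def forall(array, body, n_procs=4):
    """Denotation of `forall i: a[i] = body(i)`: each processor updates its
    own block, and every body reads the *old* state, as forall requires."""
    new = list(array)
    for owned in distribute_block(array, n_procs):
        for i in owned:                 # conceptually parallel across procs
            new[i] = body(i, array)
    return new

if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    print(forall(a, lambda i, old: old[i] * 10))
```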


2012 ◽  
Vol 2012 ◽  
pp. 1-15 ◽  
Author(s):  
Ilia Lebedev ◽  
Christopher Fletcher ◽  
Shaoyi Cheng ◽  
James Martin ◽  
Austin Doupnik ◽  
...  

We present a highly productive approach to hardware design based on a many-core microarchitectural template used to implement compute-bound applications expressed in a high-level data-parallel language such as OpenCL. The template is customized on a per-application basis via a range of high-level parameters such as the interconnect topology or processing element architecture. The key benefits of this approach are that it (i) allows programmers to express parallelism through an API defined in a high-level programming language, (ii) supports coarse-grained multithreading and fine-grained threading while permitting bit-level resource control, and (iii) reduces the effort required to repurpose the system for different algorithms or different applications. We compare template-driven design to both full-custom and programmable approaches by studying implementations of a compute-bound data-parallel Bayesian graph inference algorithm across several candidate platforms. Specifically, we examine a range of template-based implementations on both FPGA and ASIC platforms and compare each against full-custom designs. Throughout this study, we use a general-purpose graphics processing unit (GPGPU) implementation as a performance and area baseline. We show that our approach, similar in productivity to programmable approaches such as GPGPU applications, yields implementations with performance approaching that of full-custom designs on both FPGA and ASIC platforms.
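As a hedged sketch of the kind of per-application customization described above, the configuration object below names a few plausible template parameters (processing-element count, interconnect topology, datapath width, hardware threads); the field names and the area figures are assumptions, not the template's actual interface.

```python
# Illustrative configuration for a many-core microarchitectural template;
# the fields and numbers are assumptions, not the paper's actual parameters.
from dataclasses import dataclass

@dataclass
class TemplateConfig:
    num_pes: int            # number of processing elements
    interconnect: str       # e.g. "ring", "mesh", "crossbar"
    pe_datapath_bits: int   # bit-level resource control per PE
    hw_threads_per_pe: int  # coarse-grained multithreading depth

def estimate_area(cfg: TemplateConfig, pe_area_mm2=0.05, router_area_mm2=0.02):
    """Back-of-the-envelope area estimate used to compare template instances."""
    routers = cfg.num_pes if cfg.interconnect in ("ring", "mesh") else 1
    return cfg.num_pes * pe_area_mm2 + routers * router_area_mm2

if __name__ == "__main__":
    fpga_like = TemplateConfig(num_pes=16, interconnect="ring",
                               pe_datapath_bits=32, hw_threads_per_pe=4)
    print(f"estimated area: {estimate_area(fpga_like):.2f} mm^2")
```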


2005 ◽  
Vol 15 (03) ◽  
pp. 289-304 ◽  
Author(s):  
FRÉDÉRIC GAVA ◽  
FRÉDÉRIC LOULERGUE

We have designed a functional data-parallel language called BSML for programming bulk synchronous parallel (BSP) algorithms. Deadlocks and indeterminism are avoided, and execution time can then be estimated. Very large-scale applications may need more than one parallel machine; this setting is known as metacomputing. A major problem in programming applications for such architectures is their hierarchical network structure: the latency and bandwidth of the network between parallel nodes can be orders of magnitude worse than those inside a parallel node. Here we consider how to extend both the BSP model and BSML, which are well suited to parallel computing, in order to obtain a model and a functional language suitable for metacomputing.
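For context, the cost of a flat BSP superstep is commonly written as w + h·g + l; the sketch below, with assumed parameter values, naively splits communication into intra-node and inter-node parts to show why the hierarchical network structure mentioned above breaks the flat model and motivates an extended one.

```python
# Flat BSP superstep cost versus a naive two-level (metacomputing) variant.
# The parameter values are assumptions chosen only to illustrate the gap.
def bsp_cost(w, h, g, l):
    """Standard BSP superstep: computation w, h-relation h, gap g, latency l."""
    return w + h * g + l

def two_level_cost(w, h_intra, h_inter, g_intra, g_inter, l_intra, l_inter):
    """Split communication into intra-node and inter-node traffic, each with
    its own bandwidth gap and synchronization latency."""
    return w + h_intra * g_intra + l_intra + h_inter * g_inter + l_inter

if __name__ == "__main__":
    print("flat BSP: ", bsp_cost(w=10_000, h=200, g=4, l=1_000))
    # inter-cluster gap and latency are orders of magnitude worse than intra-node
    print("two-level:", two_level_cost(10_000, 150, 50, 4, 400, 1_000, 100_000))
```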


2005 ◽  
Vol 15 (04) ◽  
pp. 407-414
Author(s):  
MOSTAFA I. SOLIMAN ◽  
STANISLAV G. SEDUKHIN

Different subtasks of an application usually have different computational, memory, and I/O requirements, and hence place different demands on computer capabilities. A more appropriate approach for achieving both high performance and a simple programming model is therefore to design a processor with a multi-level instruction set architecture (ISA), which yields high performance and minimal executable code size. Since the fundamental data structures of a wide variety of existing applications are scalars, vectors, and matrices, our research Trident processor has a three-level ISA operating on zero-, one-, and two-dimensional arrays of data. These levels let programs express a large amount of fine-grained data parallelism directly to the processor, instead of extracting it dynamically with complicated logic or statically with compilers. This reduces design complexity and provides a high-level programming interface to the hardware. In this paper, the performance of the Trident processor is evaluated on the BLAS, which represent the kernel operations of many data-parallel applications. We show that the Trident processor reduces the number of clock cycles per floating-point operation in proportion to the number of execution datapaths.
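A hedged sketch of the three instruction levels follows: scalar, vector, and matrix operations, plus a back-of-the-envelope cycles-per-FLOP figure in the spirit of the BLAS evaluation. The cycle counts are assumptions for illustration, not Trident measurements.

```python
# Illustrative scalar/vector/matrix "instruction levels" and a rough
# cycles-per-FLOP figure; the cycle counts are assumptions, not Trident data.
def scalar_fma(a, x, y):                       # level 0: operates on scalars
    return a * x + y

def vector_axpy(a, x, y):                      # level 1: whole-vector operation
    return [scalar_fma(a, xi, yi) for xi, yi in zip(x, y)]

def matrix_vector(A, x):                       # level 2: whole-matrix operation
    return [sum(ai * xi for ai, xi in zip(row, x)) for row in A]

def cycles_per_flop(total_cycles, flops):
    return total_cycles / flops

if __name__ == "__main__":
    n = 4
    A = [[1.0] * n for _ in range(n)]
    x = [2.0] * n
    print(matrix_vector(A, x))
    print(vector_axpy(3.0, x, x))
    # with p parallel datapaths, cycles per FLOP should drop roughly as 1/p
    for p in (1, 2, 4, 8):
        print(p, cycles_per_flop(total_cycles=2 * n * n / p, flops=2 * n * n))
```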


2012 ◽  
Vol 2012 ◽  
pp. 1-11 ◽  
Author(s):  
Ghislain Roquier ◽  
Endri Bezati ◽  
Marco Mattavelli

The new generation of multicore processors and reconfigurable hardware platforms provides a dramatic increase in available parallelism and processing capability. However, one obstacle to exploiting the full promise of such platforms is deeply rooted in sequential thinking: the sequential programming model does not naturally expose the potential parallelism needed to build parallel applications that can be mapped efficiently onto different kinds of platforms. A paradigm shift is necessary at all levels of application development to yield portable and scalable implementations on the widest range of heterogeneous platforms. This paper presents a design flow for the hardware and software synthesis of heterogeneous systems that automatically generates hardware components, software components, and the appropriate interfaces between them from a single high-level, dataflow-based description of the application, targeting heterogeneous architectures composed of reconfigurable hardware units and multicore processors. Experimental results from implementing several video coding algorithms on heterogeneous platforms are also provided to show the effectiveness of the approach in terms of both portability and scalability.
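As a minimal sketch of the dataflow paradigm that such a design flow starts from, the toy network below consists of actors that fire when tokens are available on their input queues; the same description could, in principle, be mapped to software threads or hardware units. The Actor API is illustrative and is not the input language of the paper's tools.

```python
# Minimal dataflow network: actors fire when input tokens are available.
# The API is illustrative; real flows use a dedicated dataflow language.
from collections import deque

class Actor:
    def __init__(self, fn, inputs, output):
        self.fn, self.inputs, self.output = fn, inputs, output

    def fire(self):
        """Consume one token per input and emit one output token, if possible."""
        if all(q for q in self.inputs):
            self.output.append(self.fn(*[q.popleft() for q in self.inputs]))
            return True
        return False

if __name__ == "__main__":
    src, doubled, result = deque([1, 2, 3, 4]), deque(), deque()
    scale = Actor(lambda x: 2 * x, [src], doubled)
    accum = Actor(lambda x: x, [doubled], result)
    # naive scheduler: keep firing until no actor can make progress
    while any(a.fire() for a in (scale, accum)):
        pass
    print(list(result))   # [2, 4, 6, 8]
```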


2021 ◽  
Vol 5 (OOPSLA) ◽  
pp. 1-31
Author(s):  
Wolf Honoré ◽  
Jieung Kim ◽  
Ji-Yong Shin ◽  
Zhong Shao

Despite recent advances, guaranteeing the correctness of large-scale distributed applications without compromising performance remains a challenging problem. Network and node failures are inevitable and, for some applications, careful control over how they are handled is essential. Unfortunately, existing approaches either completely hide these failures behind an atomic state machine replication (SMR) interface, or expose all of the network-level details, sacrificing atomicity. We propose a novel, compositional, atomic distributed object (ADO) model for strongly consistent distributed systems that combines the best of both options. The object-oriented API abstracts over protocol-specific details and decouples high-level correctness reasoning from implementation choices. At the same time, it intentionally exposes an abstract view of certain key distributed failure cases, thus allowing for more fine-grained control over them than SMR-like models. We demonstrate that proving properties even of composite distributed systems can be straightforward with our Coq verification framework, Advert, thanks to the ADO model. We also show that a variety of common protocols including multi-Paxos and Chain Replication refine the ADO semantics, which allows one to freely choose among them for an application's implementation without modifying ADO-level correctness proofs.
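As a hedged, client-side illustration of the ADO idea (not Advert's Coq definitions), the sketch below shows an object whose operations appear atomic yet can visibly fail or remain unresolved, instead of hiding every failure behind an SMR interface; all names and failure cases are assumptions.

```python
# Illustrative client-side view of an atomic distributed object (ADO);
# method names and failure cases here are assumptions, not Advert's API.
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class Result:
    status: str                 # "committed" or "unresolved"
    value: Optional[Any] = None

@dataclass
class AtomicDistributedObject:
    """An object whose methods appear atomic, but whose invocations can
    visibly fail or stay unresolved instead of being hidden behind SMR."""
    state: int = 0
    log: list = field(default_factory=list)

    def invoke(self, op, healthy_quorum: bool) -> Result:
        if not healthy_quorum:
            # the failure is exposed to the caller, not silently retried
            return Result("unresolved")
        self.state = op(self.state)
        self.log.append(op.__name__)
        return Result("committed", self.state)

if __name__ == "__main__":
    counter = AtomicDistributedObject()
    def increment(s): return s + 1
    print(counter.invoke(increment, healthy_quorum=True))
    print(counter.invoke(increment, healthy_quorum=False))  # exposed failure
```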

