Oracle-guided scheduling for controlling granularity in implicitly parallel languages

Author(s):  
UMUT A. ACAR ◽  
ARTHUR CHARGUÉRAUD ◽  
MIKE RAINEY

Abstract: A classic problem in parallel computing is determining whether to execute a thread in parallel or sequentially. If small threads are executed in parallel, the overheads due to thread creation can overwhelm the benefits of parallelism, resulting in suboptimal efficiency and performance. If large threads are executed sequentially, processors may sit idle, again resulting in suboptimal efficiency and performance. This “granularity problem” is especially important in implicitly parallel languages, where the programmer expresses all potential for parallelism, leaving it to the system to exploit parallelism by creating threads as necessary. Although granularity control has been identified as an important problem, it is not well understood, and broadly applicable solutions remain elusive. In this paper, we propose techniques for automatically controlling granularity in implicitly parallel programming languages to achieve parallel efficiency and performance. To this end, we first extend a classic result, Brent's theorem (a.k.a. the work-time principle), to include thread-creation overheads. Using a cost semantics for a general-purpose language in the style of the lambda calculus with parallel tuples, we then present a precise accounting of thread-creation overheads and bound their impact on efficiency and performance. To reduce these overheads, we propose an oracle-guided semantics that uses estimates of the sizes of parallel threads. We show that, if the oracle provides accurate estimates in constant time, the oracle-guided semantics reduces thread-creation overheads for a reasonably large class of parallel computations. We describe how to approximate the oracle-guided semantics in practice by combining static and dynamic techniques: we require the programmer to provide the asymptotic complexity cost for each parallel thread and use runtime profiling to determine hardware-specific constant factors.
We present an implementation of the proposed approach as an extension of the Manticore compiler for Parallel ML. Our empirical evaluation shows that our techniques can reduce thread-creation overheads, leading to good efficiency and performance.
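The core idea of oracle-guided granularity control can be illustrated with a small sketch. Here `KAPPA`, `estimated_work`, and `psum` are hypothetical names invented for this example; the cutoff value and the linear cost model stand in for the programmer-supplied complexity annotation and the profiled constant factor described in the abstract.

```python
import threading

# Hypothetical spawn-overhead cutoff: below this estimated work,
# creating a thread costs more than it saves.
KAPPA = 1000

def estimated_work(lo, hi):
    # Oracle sketch: programmer-supplied asymptotic cost (linear here)
    # times a profiled constant factor (assumed to be 1).
    return hi - lo

def psum(xs, lo, hi, out, idx):
    # Sequentialize when the oracle predicts the subtask is too small.
    if estimated_work(lo, hi) <= KAPPA:
        out[idx] = sum(xs[lo:hi])
        return
    mid = (lo + hi) // 2
    sub = [0, 0]
    t = threading.Thread(target=psum, args=(xs, mid, hi, sub, 1))
    t.start()                   # fork: right half runs in parallel
    psum(xs, lo, mid, sub, 0)   # left half runs in this thread
    t.join()                    # join before combining results
    out[idx] = sub[0] + sub[1]

result = [0]
psum(list(range(10000)), 0, 10000, result, 0)
```

The point of the cutoff is that only subproblems whose predicted work exceeds the spawn overhead pay the cost of thread creation; everything smaller runs inline.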

1987 ◽  
Vol 14 (3) ◽  
pp. 134-140 ◽  
Author(s):  
K.A. Clarke

Practical classes in neurophysiology reinforce and complement the theoretical background in a number of ways, including demonstration of concepts, practice in planning and performance of experiments, and the production and maintenance of viable neural preparations. The balance of teaching objectives will depend upon the particular group of students involved. A technique is described which allows the embedding of real compound action potentials from one of the most basic introductory neurophysiology experiments (the frog sciatic nerve) into interactive programs for student use. These retain all the elements of the “real experiment” in terms of appearance, presentation, experimental management, and measurement by the student. Laboratory reports by the students show that the experiments are carefully and enthusiastically performed and the material is well absorbed. Three groups of students derive most benefit from their use. First, students whose future careers will not involve animal experiments do not spend time developing dissecting skills they will not use, but more time fulfilling the other teaching objectives. Second, relatively inexperienced students who, struggling to produce viable neural material and master complicated laboratory equipment, are often left with little time or motivation to take accurate readings or ponder neurophysiological concepts. Third, students in institutions where neurophysiology is taught with difficulty because of the high cost of equipment and lack of specific expertise, who may well have access to a low-cost general-purpose microcomputer system.


Author(s):  
Hatem Abou-Senna ◽  
Mohamed El-Agroudy ◽  
Mustapha Mouloua ◽  
Essam Radwan

The use of express lanes (ELs) in freeway traffic management has seen increasing popularity throughout the United States, particularly in Florida. These lanes aim to be the most efficient transportation system management and operations tool for providing a more reliable trip. An important component of ELs is the channelizing devices used to delineate the separation between the ELs and the general-purpose lanes. With the upcoming changes to the FHWA Manual on Uniform Traffic Control Devices, this study provided an opportunity to recommend changes affecting safety and efficiency on a nationwide level. It was important to understand the impacts on driver perception and performance in response to the color of the EL delineators. It was also valuable to understand the differences between demographics in responding to delineator colors under different driving conditions. A driving simulator was used to test the responses of several demographic groups to changes in marker color and driving conditions. Furthermore, participants were tested for several factors relevant to driving performance, including visual and subjective responses to the changes in colors and driving conditions. Impacts on driver perception were observed via eye-tracking technology with changes to time of day, visibility, traffic density, roadway surface type, and, crucially, color of the delineating devices. The analyses concluded that white was the optimal and most significant color for notice of delineators across the majority of subjective and performance measures, followed by yellow, with black being the least desirable.


2010 ◽  
Vol 20 (02) ◽  
pp. 103-121 ◽  
Author(s):  
MOSTAFA I. SOLIMAN ◽  
ABDULMAJID F. Al-JUNAID

Technological advances in IC manufacturing provide us with the capability to integrate more and more functionality into a single chip. Today's modern processors have nearly one billion transistors on a single chip. With the increasing complexity of today's systems, designs have to be modeled at a high level of abstraction before partitioning into hardware and software components for final implementation. This paper explains in detail the implementation and performance evaluation of a matrix processor called Mat-Core with SystemC (a system-level modeling language). Mat-Core is a research processor aiming at exploiting the increasing number of transistors per IC to improve the performance of a wide range of applications. It extends a general-purpose scalar processor with a matrix unit. To hide memory latency, the extended matrix unit is decoupled into two components, address generation and data computation, which communicate through data queues. Like vector architectures, the data computation unit is organized in parallel lanes. However, on parallel lanes, Mat-Core can execute matrix-scalar, matrix-vector, and matrix-matrix instructions in addition to vector-scalar and vector-vector instructions. For controlling the execution of vector/matrix instructions on the matrix core, this paper extends the well-known scoreboard technique. Furthermore, the performance of Mat-Core is evaluated on vector and matrix kernels. Our results show that a four-lane Mat-Core with matrix registers of size 4 × 4 (16 elements) each, a queue size of 10, a start-up time of 6 clock cycles, and a memory latency of 10 clock cycles achieves about 0.94, 1.3, 2.3, 1.6, 2.3, and 5.5 FLOPs per clock cycle on scalar-vector multiplication, SAXPY, Givens, rank-1 update, vector-matrix multiplication, and matrix-matrix multiplication, respectively.
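The decoupled access/execute idea described above can be sketched as a simple cycle-by-cycle model. The names `QUEUE_DEPTH` and `run_decoupled` are hypothetical, invented for this illustration; the queue depth of 10 echoes the evaluation parameters, and the model only shows how a bounded data queue lets the address-generation stage run ahead of the computation stage.

```python
from collections import deque

QUEUE_DEPTH = 10  # bounded data queue between the two stages

def run_decoupled(memory, addresses):
    queue = deque()
    fetched = consumed = total = 0
    # One loop iteration models one clock cycle.
    while consumed < len(addresses):
        # Address-generation stage: fetches ahead while the queue has room.
        if fetched < len(addresses) and len(queue) < QUEUE_DEPTH:
            queue.append(memory[addresses[fetched]])
            fetched += 1
        # Data-computation stage: consumes one operand per cycle if available.
        if queue:
            total += queue.popleft()
            consumed += 1
    return total

memory = [2 * a for a in range(64)]
result = run_decoupled(memory, list(range(64)))
```

In hardware the two stages run concurrently; the queue absorbs memory latency so the compute lanes rarely stall, which is the effect the decoupling is designed to achieve.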


2021 ◽  
Vol 31 ◽  
Author(s):  
BHARGAV SHIVKUMAR ◽  
JEFFREY MURPHY ◽  
LUKASZ ZIAREK

Abstract: There is a growing interest in leveraging functional programming languages in real-time and embedded contexts. Functional languages are appealing as many are strictly typed, amenable to formal methods, have limited mutation, and have simple but powerful concurrency control mechanisms. Although there have been many recent proposals for specialized domain-specific languages for embedded and real-time systems, there has been relatively little progress on adapting more general-purpose functional languages for programming embedded and real-time systems. In this paper, we present our current work on leveraging Standard ML (SML) in the embedded and real-time domains. Specifically, we detail our experiences in modifying MLton, a whole-program optimizing compiler for SML, for use in such contexts. We focus primarily on the language runtime, reworking the threading subsystem, object model, and garbage collector. We provide preliminary results for a radar-based aircraft collision detector ported to SML.


Author(s):  
Robert Bourque

An external combustion engine design using steam is described that has good efficiency at full power and even better efficiency at the low power settings common for passenger vehicles. The engine is compact, with low weight per unit power. All of its components fit in the engine compartment of a front-wheel-drive vehicle despite the space occupied by the transaxle. It readily fits in a rear-drive vehicle. Calculated net efficiencies, after accounting for all losses, range, depending on engine size, from 28–32% at full power, increasing to 33–36% at normal road power settings. A two-stage burner, 100% excess air, and a combustion temperature below 1500°C assure complete combustion of the fuel and negligible NOx. The engine can burn a variety of fuels and fuel mixes, which should encourage the development of new fuels. Extensive software has been written that calculates full-power and part-load energy balances, structural analysis and heat transfer, and performance in specified vehicles, including using SAE driving cycles. Engines have been sized from 30 to 3200 hp. In general, fuel consumption should be at least 1.5 times lower than that of gasoline engines and about the same as that of diesels operating at low to moderate load settings. Based on this analysis, a prototype, when built, should perform as expected.


2014 ◽  
Vol 596 ◽  
pp. 276-279
Author(s):  
Xiao Hui Pan

Graph component labeling, which is a subset of the general graph coloring problem, is a computationally expensive operation in many important applications and simulations. A number of data-parallel algorithmic variations of the component labeling problem are possible, and we explore their use with general-purpose graphics processing units (GPGPUs) and the CUDA GPU programming model. We discuss implementation issues and performance results on CPUs and GPUs using CUDA. We evaluated our system with real-world graphs. We show how to take different architectural features of the GPU and the host CPUs into account and achieve high performance.
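One common data-parallel formulation of component labeling is iterative label propagation, which can be sketched as follows. This is not the paper's specific algorithm, just an illustrative variant: each vertex starts with its own index as its label, and on every pass each edge pulls the smaller label across; the per-edge body is what a CUDA kernel would execute with one thread per edge. The function name `label_components` is invented for this sketch.

```python
def label_components(num_vertices, edges):
    # Initial labels: each vertex is its own component representative.
    labels = list(range(num_vertices))
    changed = True
    while changed:
        changed = False
        for u, v in edges:          # on a GPU: one thread per edge
            m = min(labels[u], labels[v])
            if labels[u] != m or labels[v] != m:
                labels[u] = labels[v] = m
                changed = True
    return labels

# Two components: {0, 1, 2} and {3, 4}
labels = label_components(5, [(0, 1), (1, 2), (3, 4)])
```

At convergence every vertex in a component carries that component's minimum vertex index, so equal labels identify connected components.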

