multithreaded programs
Recently Published Documents



2021 ◽  
Vol 5 (OOPSLA) ◽  
pp. 1-31
Author(s):  
Guy L. Steele Jr. ◽  
Sebastiano Vigna

In 2014, Steele, Lea, and Flood presented SplitMix, an object-oriented pseudorandom number generator (PRNG) that is quite fast (9 64-bit arithmetic/logical operations per 64 bits generated) and also splittable. A conventional PRNG object provides a generate method that returns one pseudorandom value and updates the state of the PRNG; a splittable PRNG object also has a second operation, split, that replaces the original PRNG object with two (seemingly) independent PRNG objects, by creating and returning a new such object and updating the state of the original object. Splittable PRNG objects make it easy to organize the use of pseudorandom numbers in multithreaded programs structured using fork-join parallelism. This overall strategy still appears to be sound, but the specific arithmetic calculation used for generate in the SplitMix algorithm has some detectable weaknesses, and the period of any one generator is limited to 2^64. Here we present the LXM family of PRNG algorithms. The idea is an old one: combine the outputs of two independent PRNG algorithms, then (optionally) feed the result to a mixing function. An LXM algorithm uses a linear congruential subgenerator and an F_2-linear subgenerator; the examples studied in this paper use a linear congruential generator (LCG) of period 2^16, 2^32, 2^64, or 2^128 with one of the multipliers recommended by L’Ecuyer or by Steele and Vigna, and an F_2-linear xor-based generator (XBG) of the xoshiro family or xoroshiro family as described by Blackman and Vigna. For mixing functions we study the MurmurHash3 finalizer function; variants by David Stafford, Doug Lea, and degski; and the null (identity) mixing function. Like SplitMix, LXM provides both a generate operation and a split operation. Also like SplitMix, LXM requires no locking or other synchronization (other than the usual memory fence after instance initialization), and is suitable for use with SIMD instruction sets because it has no branches or loops.
We analyze the period and equidistribution properties of LXM generators, and present the results of thorough testing of specific members of this family, using the TestU01 and PractRand test suites, not only on single instances of the algorithm but also for collections of instances, used in parallel, ranging in size from 2 to 2^24. Single instances of LXM that include a strong mixing function appear to have no major weaknesses, and LXM is significantly more robust than SplitMix against accidental correlation in a multithreaded setting. We believe that LXM, like SplitMix, is suitable for “everyday” scientific and machine-learning applications (but not cryptographic applications), especially when concurrent threads or distributed processes are involved.
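The LCG + XBG + mixer structure described in this abstract can be illustrated with a small sketch. This is not the authors' reference code: the class name `LxmSketch` is hypothetical, the LCG multiplier is Knuth's MMIX constant (swapped in for illustration rather than one of the multipliers the paper recommends), the XBG step is a xoroshiro128-style update, the mixer is the MurmurHash3 64-bit finalizer, and the `split` shown here is a simplified seeding scheme assumed for the example.

```python
MASK64 = (1 << 64) - 1  # emulate 64-bit wraparound arithmetic


def rotl64(x, k):
    """Rotate a 64-bit value left by k bits."""
    return ((x << k) | (x >> (64 - k))) & MASK64


def fmix64(z):
    """MurmurHash3 64-bit finalizer, used here as the mixing function."""
    z ^= z >> 33
    z = (z * 0xFF51AFD7ED558CCD) & MASK64
    z ^= z >> 33
    z = (z * 0xC4CEB9FE1A85EC53) & MASK64
    z ^= z >> 33
    return z


class LxmSketch:
    """Illustrative LXM-style generator: 64-bit LCG + 128-bit XBG + mixer."""

    # Assumption: Knuth's MMIX multiplier, chosen only for illustration.
    LCG_MULTIPLIER = 6364136223846793005

    def __init__(self, a, s, x0, x1):
        self.a = (a | 1) & MASK64      # LCG additive parameter, forced odd
        self.s = s & MASK64            # LCG state
        self.x0 = x0 & MASK64          # XBG state, word 0
        self.x1 = x1 & MASK64          # XBG state, word 1
        if self.x0 == 0 and self.x1 == 0:
            self.x0 = 0x9E3779B97F4A7C15  # XBG state must not be all zero

    def generate(self):
        # Combine the two subgenerators, then advance each one.
        z = (self.s + self.x0) & MASK64
        self.s = (self.LCG_MULTIPLIER * self.s + self.a) & MASK64
        q0, q1 = self.x0, self.x1
        q1 ^= q0
        q0 = rotl64(q0, 24) ^ q1 ^ ((q1 << 16) & MASK64)
        q1 = rotl64(q1, 37)
        self.x0, self.x1 = q0, q1
        return fmix64(z)

    def split(self):
        # Simplified assumption: seed the child from four parent outputs.
        return LxmSketch(self.generate(), self.generate(),
                         self.generate(), self.generate())
```

In a fork-join program, each forked task would call `split()` on its parent's generator and then use the child without any locking, since no state is shared between the two objects.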


2021 ◽  
Vol 43 (3) ◽  
pp. 1-50
Author(s):  
Lun Liu ◽  
Todd Millstein ◽  
Madanlal Musuvathi

Modern “safe” programming languages follow a design principle that we call safety by default and performance by choice. By default, these languages enforce important programming abstractions, such as memory and type safety, but they also provide mechanisms that allow expert programmers to explicitly trade some safety guarantees for increased performance. However, these same languages have adopted the inverse design principle in their support for multithreading. By default, multithreaded programs violate important abstractions, such as program order and atomic access to individual memory locations to admit compiler and hardware optimizations that would otherwise need to be restricted. Not only does this approach conflict with the design philosophy of safe languages, but very little is known about the practical performance cost of providing a stronger default semantics. In this article, we propose a safe-by-default and performance-by-choice multithreading semantics for safe languages, which we call volatile-by-default. Under this semantics, programs have sequential consistency (SC) by default, which is the natural “interleaving” semantics of threads. However, the volatile-by-default design also includes annotations that allow expert programmers to avoid the associated overheads in performance-critical code. We describe the design, implementation, optimization, and evaluation of the volatile-by-default semantics for two different safe languages: Java and Julia. First, we present VBD-HotSpot and VBDA-HotSpot, modifications of Oracle’s HotSpot JVM that enforce the volatile-by-default semantics on Intel x86-64 hardware and ARM-v8 hardware. Second, we present SC-Julia, a modification to the just-in-time compiler within the standard Julia implementation that provides best-effort enforcement of the volatile-by-default semantics on x86-64 hardware for the purpose of performance evaluation.
We also detail two different implementation techniques: a baseline approach that simply reuses existing mechanisms in the compilers for handling atomic accesses, and a speculative approach that avoids the overhead of enforcing the volatile-by-default semantics until there is the possibility of an SC violation. Our results show that the cost of enforcing SC is significant but arguably still acceptable for some use cases today. Further, we demonstrate that compiler optimizations as well as programmer annotations can reduce the overhead considerably.


2021 ◽  
Vol 12 (5) ◽  
pp. 267-273
Author(s):  
P. M. Nikolaev ◽  

The use of parallel computing tools can significantly reduce the execution time of calculations in many engineering tasks. One of the main difficulties in developing multithreaded programs remains organizing simultaneous access to shared data from different threads. The most common solution to this problem is to use locks when accessing shared data. There are, however, a number of tasks where no data sharing is needed, but access to a limited resource, such as a temporary buffer, must still be synchronized. In such tasks, there is no data exchange between threads, but there is an object that can be used by the code of only one thread at any given time. One such task is calculating the value of a B-spline. Software implementations of B-spline evaluation functions that follow the classical algorithms require locking objects when different threads access the common array of intermediate data. This reduces the degree of parallelism and the efficiency of computational programs that use B-splines on multiprocessor computing systems. The article discusses a way to improve the efficiency of calculating B-splines in parallel programming tasks by eliminating locks when accessing shared modifiable data. A software implementation is presented in the form of a C++ class template that places the temporary array used for calculating a B-spline in a local buffer of a given size, with the possibility of enlarging it if necessary. Using the developed template in conjunction with the thread_local qualifier reduces the number of requests to enlarge the buffer for high-degree B-splines (those that need more than the initially specified buffer size). It is also possible to implement this scheme using the std::vector template of the C++ Standard Library.
The article presents the results of applying the developed class to the calculation of B-spline values in a multithreaded environment; they show the calculation time decreasing in proportion to the number of processors used. The methods of specifying arrays for storing intermediate calculation results considered in this article can be used in other parallel programming tasks.
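The per-thread scratch buffer idea can be sketched as follows. The article's implementation is a C++ class template used with `thread_local`; this sketch swaps in Python's `threading.local` as a stand-in, and every name in it (`_get_scratch`, `find_span`, `basis_functions`, `evaluate_many`, the initial capacity of 16) is hypothetical. The basis-function recurrence is the classical de Boor/Cox scheme, which needs two temporary arrays of `degree + 1` entries.

```python
import threading
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor

_scratch = threading.local()  # one independent buffer per thread, no locks


def _get_scratch(size, initial_capacity=16):
    """Return this thread's buffer, growing it only when it is too small."""
    buf = getattr(_scratch, "buf", None)
    if buf is None or len(buf) < size:
        buf = [0.0] * max(size, initial_capacity)
        _scratch.buf = buf
    return buf


def find_span(u, degree, knots):
    """Index i with knots[i] <= u < knots[i+1], clamped to the valid range."""
    last = len(knots) - degree - 2
    return max(degree, min(bisect_right(knots, u) - 1, last))


def basis_functions(span, u, degree, knots):
    """Nonzero B-spline basis values at u (classical recurrence)."""
    buf = _get_scratch(2 * (degree + 1))
    left, right = 0, degree + 1  # offsets of the two temporaries in buf
    n = [0.0] * (degree + 1)
    n[0] = 1.0
    for j in range(1, degree + 1):
        buf[left + j] = u - knots[span + 1 - j]
        buf[right + j] = knots[span + j] - u
        saved = 0.0
        for r in range(j):
            temp = n[r] / (buf[right + r + 1] + buf[left + j - r])
            n[r] = saved + buf[right + r + 1] * temp
            saved = buf[left + j - r] * temp
        n[j] = saved
    return n


def evaluate_many(us, degree, knots, workers=4):
    """Evaluate basis functions for many parameters on a thread pool."""
    def one(u):
        return basis_functions(find_span(u, degree, knots), u, degree, knots)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one, us))
```

Because each worker thread owns its buffer, no synchronization is needed around the temporaries. In CPython the global interpreter lock limits the actual speedup, so this shows the pattern rather than the performance reported in the article.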


Author(s):  
Xulong Tang ◽  
Mahmut Taylan Kandemir ◽  
Mustafa Karakoy

Application programs that exhibit strong locality of reference incur fewer cache misses and perform better on different architectures. However, to maximize the performance of multithreaded applications running on emerging manycore systems, data movement in the on-chip network should also be minimized. Unfortunately, the way many multithreaded programs are written does not lend itself well to minimal data movement. Motivated by this observation, in this paper, we target task-based programs (which cover a large set of available multithreaded programs), and propose a novel compiler-based approach that consists of four complementary steps. First, we partition the original tasks in the target application into sub-tasks and build a data reuse graph at a sub-task granularity. Second, based on the intensity of temporal and spatial data reuses among sub-tasks, we generate new tasks where each such (new) task includes a set of sub-tasks that exhibit high data reuse among them. Third, we assign the newly generated tasks to cores in an architecture-aware fashion with the knowledge of data location. Finally, we re-schedule the execution order of sub-tasks within new tasks such that sub-tasks that belong to different tasks but share data among them are executed in close proximity in time. The detailed experiments show that, when targeting a state-of-the-art manycore system, our proposed compiler-based approach improves the performance of 10 multithreaded programs by 23.4% on average, and it also outperforms two state-of-the-art data access optimizations for all the benchmarks tested. Our results also show that the proposed approach i) improves the performance of multiprogrammed workloads, and ii) generates results that are close to the maximum savings that could be achieved with perfect profiling information.
Overall, our experimental results emphasize the importance of dividing an original set of tasks of an application into sub-tasks and constructing new tasks from the resulting sub-tasks in a data movement- and locality-aware fashion.
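The first two steps of the approach (building a sub-task reuse graph, then forming new tasks from sub-tasks with high reuse) can be illustrated with a toy sketch. It is not the paper's compiler algorithm: the reuse weight (number of shared data blocks), the greedy merge order, the `max_group_size` limit, and all function names are assumptions made for this example.

```python
from itertools import combinations


def build_reuse_graph(accesses):
    """Weighted reuse graph: edge weight = number of data blocks shared.

    accesses maps a sub-task name to the set of data block ids it touches.
    """
    edges = {}
    for a, b in combinations(sorted(accesses), 2):
        shared = len(accesses[a] & accesses[b])
        if shared:
            edges[(a, b)] = shared
    return edges


def group_subtasks(accesses, max_group_size):
    """Greedily merge sub-tasks along the heaviest reuse edges first."""
    parent = {t: t for t in accesses}
    size = {t: 1 for t in accesses}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    edges = build_reuse_graph(accesses)
    # Heaviest edges first; ties broken by name for determinism.
    for (a, b), _weight in sorted(edges.items(),
                                  key=lambda kv: (-kv[1], kv[0])):
        ra, rb = find(a), find(b)
        if ra != rb and size[ra] + size[rb] <= max_group_size:
            parent[rb] = ra
            size[ra] += size[rb]

    groups = {}
    for t in sorted(accesses):
        groups.setdefault(find(t), []).append(t)
    return sorted(groups.values())
```

Each resulting group would then play the role of a new task to be placed on a core near its data. That placement step depends on the target architecture and is not modeled here.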


This chapter considers algebra-dynamic models of parallel programs, which are based on concepts from transition systems theory and the algebra of algorithms. Models of sequential and parallel multithreaded programs for multicore processors, and program models for graphics processing units, are constructed. The authors describe program transformations aimed at moving from sequential to parallel versions (parallelization) and at improving the performance of parallel programs with respect to execution time (optimization). The transformations are based on the rewriting-rules technique. A formal model of program auto-tuning as an evolutionary extension of transition systems is proposed, and some properties of programs are considered.


Author(s):  
Sergey Andreevich Polyakov ◽  
Alexey Evgenevich Borodin

The paper describes an extension to summary-based static program analysis for finding deadlock errors. Summary-based analysis is a popular approach to detecting bugs in programs because of its high performance and scalability. At the same time, implementing deadlock detectors in such an analysis is nontrivial, because no information about the locks held higher in the call stack is available during the intraprocedural analysis of a function. A lock graph, which is built during the main analysis, is used to model the semantics of multithreaded programs. The lock graph is a modification of the call graph that contains additional information about held locks. After the lock graph is built, the deadlock detector is launched. Neither the construction of the lock graph nor the deadlock detection algorithm requires significant processor time. In the measurements performed, the total analysis time increased by 4%. Based on the results of analyzing 8 open source projects in C/C++/Java with a total size of more than 14 million lines of code, the proposed algorithm showed a high level of true positives. The described algorithms were implemented in the Svace tool.
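The general idea of detecting deadlocks from information about held locks can be illustrated with the classic lock-order graph: add an edge A -> B whenever lock B is acquired while A is held, then look for a cycle. This is a simplified stand-in, not the call-graph-based lock graph or the algorithm implemented in Svace; the trace format and the function names are assumptions made for the example.

```python
from collections import defaultdict


def build_lock_order_graph(acquisition_sequences):
    """Edge A -> B means some code path acquires B while holding A.

    Each sequence is a list of ("acquire" | "release", lock_name) pairs.
    """
    graph = defaultdict(set)
    for sequence in acquisition_sequences:
        held = []
        for op, lock in sequence:
            if op == "acquire":
                for outer in held:
                    graph[outer].add(lock)
                held.append(lock)
            elif op == "release":
                held.remove(lock)
    return dict(graph)


def find_cycle(graph):
    """Return one cycle as a list of locks (first == last), or None."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)  # every node starts WHITE
    stack = []

    def dfs(node):
        color[node] = GREY
        stack.append(node)
        for nxt in sorted(graph.get(node, ())):
            if color[nxt] == GREY:
                # Back edge: the path from nxt to node closes a cycle.
                return stack[stack.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                cycle = dfs(nxt)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = BLACK
        return None

    for node in sorted(graph):
        if color[node] == WHITE:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None
```

A cycle indicates only a potential deadlock. A real analysis would also need to check that the conflicting acquisitions can actually happen in different threads at the same time.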

