Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading

The limits of instruction-level parallelism and rising transistor density sustain the growing need for multiprocessor systems, which are rapidly taking over both the general-purpose and embedded processor domains. Current multiprocessing systems are composed either of many simple, homogeneous cores or of complex superscalar, simultaneous multithreading processing elements. As parallel applications become increasingly common in embedded and general-purpose domains, and multiprocessing systems must handle a wide range of application classes, there is no consensus on which hardware solutions best exploit instruction-level parallelism (ILP) and thread-level parallelism (TLP) together. Therefore, in this work, we have extended the DIM (dynamic instruction merging) technique for use in a multiprocessing scenario, demonstrating the need for adaptable ILP exploitation even in TLP architectures. We have successfully coupled a dynamic reconfigurable system to a SPARC-based multiprocessor and obtained performance gains of up to 40%, even for applications that show a high degree of thread-level parallelism.

Techniques for the optimization and efficient mapping of parallel codes onto computing nodes with multithreaded and multicore microprocessor architectures

10.12681/eadd/18839 ◽

2010 ◽

Author(s):

Νικόλαος Αναστόπουλος

Keyword(s):

Transactional Memory ◽

Simultaneous Multithreading ◽

Speculative Parallelization ◽

Thread Level Parallelism ◽

Level Parallelism

Multicore and multithreaded architectures have been steadily gaining ground in recent years and are now the norm in processor design across a wide range of applications. For user programs to take advantage of their capabilities, a broader shift is needed toward exploiting the thread-level parallelism (TLP) that can be extracted from those programs. This new environment therefore poses a series of significant challenges for the programmer, such as identifying, expressing, and mapping parallelism, synchronizing threads, and efficiently managing the resources of the underlying architecture. Conventional parallelization and synchronization techniques proposed in the literature are applicable to the new architectures in theory, but they either cover specific kinds of applications with obvious and directly exploitable parallelism, or they ignore the particularities of each architecture's resource management and consequently lead to reduced performance. In this thesis we examine techniques aimed at identifying and mapping parallelism, and at efficient synchronization, on processor architectures with Simultaneous Multithreading (SMT) and Chip-level Multiprocessing (CMP). We investigate alternative parallelization techniques that are based on the idea of helper threading and are intended mainly for applications with unclear, irregular, or even no inherent parallelism. Such applications could not benefit significantly if they were executed on a traditional multiprocessing system or with a traditional parallelization technique. On SMT architectures we use helper threading to offload time-consuming memory access operations from an application's main thread. In several cases we achieve notable results; however, contention among the executing threads for shared processor resources makes larger speedups difficult to achieve. To address this, we propose a framework for implementing efficient synchronization operations which, compared with other implementations, offer the best trade-off between efficient resource management and low latency. On CMP architectures we use helper threading to offload actual computation from the main thread, exploiting an advanced hardware synchronization mechanism, transactional memory (TM). We present a speculative parallelization scheme through which we manage to speed up an application case for which every conventional parallelization scheme had so far yielded negative results.

Microarchitectural Characterization on a Mobile Workload

Applied Sciences ◽

10.3390/app11031225 ◽

2021 ◽

Vol 11 (3) ◽

pp. 1225

Author(s):

Woohyong Lee ◽

Jiyoung Lee ◽

Bo Kyung Park ◽

R. Young Chul Kim

Keyword(s):

Performance Monitoring ◽

Performance Metrics ◽

Performance Comparison ◽

Instruction Level Parallelism ◽

Data Set ◽

Performance Events ◽

Hardware Performance Counters ◽

On Chip ◽

The Comparative Study ◽

Level Parallelism

Geekbench is one of the most referenced cross-platform benchmarks in the mobile world. Most of its workloads are synthetic, but some aim to simulate real-world behavior. Although Geekbench is a popular mobile performance workload, its microarchitectural behavior on mobile devices has rarely been reported, because hardware profiling features are not generally available to the public. This paper reports a thorough experimental characterization of Geekbench performance with detailed performance metrics. The study also identifies the impact of mobile system on chip (SoC) microarchitecture, including the cache subsystem, instruction-level parallelism, and branch performance. The study revealed the bottlenecks of the workloads, especially in the cache subsystem. This means that a change in data set size directly and significantly affects the performance score on some systems and undermines the fairness of the CPU benchmark. In the experiment, Samsung’s Exynos9820-based platform was used as the device under test, with binaries built with the Android Native Development Kit (NDK). The Exynos9820 is a superscalar processor capable of dual-issuing some instructions. To support performance analysis, we enabled the collection of performance events through performance monitoring unit (PMU) registers. The PMU is a set of hardware performance counters built into microprocessors to store counts of hardware-related activities. Throughout the experiment, functional and microarchitectural performance profiles were studied in full, and this paper describes the details of these mobile performance studies. The ARM DS5 tool was used to collect runtime PMU profiles, including OS-level performance data. The comparative study gives users a better understanding of mobile architecture behavior and helps them evaluate which benchmark is preferable for fair performance comparison.
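
As a rough illustration of how raw PMU counts turn into the kinds of metrics this abstract mentions (cache subsystem, instruction-level parallelism, branch performance), the sketch below derives a few standard ratios. The dictionary keys and the function name are hypothetical and are not taken from the paper; which events the authors actually collected is not stated here.

```python
def derive_metrics(counts):
    """Turn raw event counts into common microarchitectural ratios.

    `counts` is a dict of raw counter values. The key names used here are
    illustrative assumptions, not the event names used in the paper.
    """
    instructions = counts["instructions"]
    cycles = counts["cycles"]
    return {
        # Instructions per cycle: a coarse indicator of achieved ILP.
        "ipc": instructions / cycles,
        # Fraction of L1 data cache accesses that miss.
        "l1d_miss_rate": counts["l1d_misses"] / counts["l1d_accesses"],
        # L1 data cache misses per thousand instructions.
        "l1d_mpki": 1000 * counts["l1d_misses"] / instructions,
        # Fraction of branches that were mispredicted.
        "branch_mispredict_rate": counts["branch_mispredicts"] / counts["branches"],
    }
```

Ratios such as misses per thousand instructions are convenient when comparing runs with different data set sizes, because they normalize for the amount of work executed.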

UltraSynth: Insights of a CGRA Integration into a Control Engineering Environment

Journal of Signal Processing Systems ◽

10.1007/s11265-021-01641-7 ◽

2021 ◽

Author(s):

Dennis Wolf ◽

Andreas Engel ◽

Tajas Ruschke ◽

Andreas Koch ◽

Christian Hochberger

Keyword(s):

Computing System ◽

Coarse Grained ◽

Instruction Level Parallelism ◽

Control Engineering ◽

Processing Elements ◽

Actual Application ◽

Reconfigurable Arrays ◽

Engineering Environment ◽

On Chip ◽

Level Parallelism

Coarse Grained Reconfigurable Arrays (CGRAs) or Architectures are a concept for hardware accelerators based on the idea of distributing workload over Processing Elements. These processors exploit instruction level parallelism, while being energy efficient due to their simplistic internal structure. However, the incorporation into a complete computing system raises severe challenges at the hardware and software level. This article evaluates a CGRA integrated into a control engineering environment targeting a Xilinx Zynq System on Chip (SoC) in detail. Besides the actual application execution performance, the practicability of the configuration toolchain is validated. Challenges of the real-world integration are discussed and practical insights are highlighted.

An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

Proceedings of the 36th annual international symposium on Computer architecture - ISCA '09 ◽

10.1145/1555754.1555775 ◽

2009 ◽

Cited By ~ 256

Author(s):

Sunpyo Hong ◽

Hyesoon Kim

Keyword(s):

Analytical Model ◽

Thread Level Parallelism ◽

Level Parallelism ◽

Gpu Architecture ◽

With Memory

Thread partitioning and value prediction for exploiting speculative thread-level parallelism

IEEE Transactions on Computers ◽

10.1109/tc.2004.1261823 ◽

2004 ◽

Vol 53 (2) ◽

pp. 114-125 ◽

Cited By ~ 11

Author(s):

P. Marcuello ◽

A. Gonzalez ◽

J. Tubella

Keyword(s):

Value Prediction ◽

Thread Level Parallelism ◽

Thread Partitioning ◽

Level Parallelism

GPU Performance vs. Thread-Level Parallelism

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3177964 ◽

2018 ◽

Vol 15 (1) ◽

pp. 1-21 ◽

Cited By ~ 4

Author(s):

Zhen Lin ◽

Michael Mantor ◽

Huiyang Zhou

Keyword(s):

Thread Level Parallelism ◽

Level Parallelism

Available instruction-level parallelism for superscalar and superpipelined machines

Proceedings of the third international conference on Architectural support for programming languages and operating systems - ASPLOS-III ◽

10.1145/70082.68207 ◽

1989 ◽

Cited By ~ 165

Author(s):

N. P. Jouppi ◽

D. W. Wall

Keyword(s):

Instruction Level Parallelism ◽

Level Parallelism

Topic 8 Parallel Computer Architecture and Instruction-Level Parallelism

Euro-Par 2003 Parallel Processing - Lecture Notes in Computer Science ◽

10.1007/978-3-540-45209-6_78 ◽

2003 ◽

pp. 541-542

Author(s):

Stamatis Vassiliadis ◽

Nikitas Dimopoulos ◽

Jean-Francois Collard ◽

Arndt Bode

Keyword(s):

Computer Architecture ◽

Parallel Computer ◽

Instruction Level Parallelism ◽

Level Parallelism

CaLRS: A Critical-Aware Shared LLC Request Scheduling Algorithm on GPGPU

The Scientific World JOURNAL ◽

10.1155/2015/848416 ◽

2015 ◽

Vol 2015 ◽

pp. 1-10

Author(s):

Jianliang Ma ◽

Jinglei Meng ◽

Tianzhou Chen ◽

Minghui Wu

Keyword(s):

Scheduling Algorithm ◽

Global Memory ◽

Request Sequence ◽

Thread Level Parallelism ◽

Level Parallelism ◽

Memory Request ◽

Request Service

Ultra-high thread-level parallelism in modern GPUs usually generates numerous simultaneous memory requests, so many requests are always waiting at each bank of the shared LLC (L2 in this paper) and at global memory. For global memory, various schedulers have already been developed to adjust the request sequence, but we find that little work has focused on the service sequence at the shared LLC. We measured that a large number of GPU applications always queue at the LLC banks for service, which provides an opportunity to optimize the service order at the LLC. By adjusting the service order of GPU memory requests, we can improve the schedulability of the SM. We therefore propose a critical-aware shared LLC request scheduling algorithm (CaLRS) in this paper. How the priority of a memory request is represented is critical for CaLRS. We use the number of memory requests that originate from the same warp but have not been serviced when they arrive at the shared LLC bank to represent the criticality of each warp. Experiments show that the proposed scheme effectively boosts SM schedulability by promoting the scheduling priority of memory requests with high criticality, and thereby indirectly improves GPU performance.
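
The criticality idea in this abstract can be sketched as a toy priority queue for a single LLC bank. This is only an illustration under stated assumptions, not the CaLRS algorithm as published: the class and method names are hypothetical, criticality is assumed to be the number of same-warp requests already waiting at the bank when a request arrives, and ties are assumed to be broken in arrival order.

```python
import heapq
from collections import Counter
from itertools import count


class CriticalityBankQueue:
    """Toy request queue for one shared-LLC bank (illustrative only)."""

    def __init__(self):
        self._heap = []            # entries: (-criticality, arrival_seq, warp_id, address)
        self._pending = Counter()  # warp_id -> requests currently waiting at this bank
        self._seq = count()        # arrival order, used to break ties FIFO

    def arrive(self, warp_id, address):
        # Assumption: criticality = number of requests from the same warp
        # that are already waiting (unserviced) when this request arrives.
        criticality = self._pending[warp_id]
        self._pending[warp_id] += 1
        heapq.heappush(self._heap, (-criticality, next(self._seq), warp_id, address))
        return criticality

    def service(self):
        # Serve the highest-criticality request; ties go to the oldest one.
        neg_crit, _, warp_id, address = heapq.heappop(self._heap)
        self._pending[warp_id] -= 1
        return warp_id, address, -neg_crit

    def __len__(self):
        return len(self._heap)


def drain(queue):
    """Service every waiting request and return the resulting order."""
    order = []
    while len(queue):
        order.append(queue.service())
    return order
```

Compared with plain first-come, first-served ordering, this sketch moves requests from warps with many outstanding requests ahead of older requests from warps with few, which is the prioritization the abstract describes in general terms.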
