A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines

This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped together in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facility. As well as the standard types of single and double precision, we also include half and quadruple precision in the standard. In particular, half precision is used in many very large scale applications, such as those associated with machine learning.

Download Full-text

Statistical and machine learning models for optimizing energy in parallel applications

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019842915 ◽

2019 ◽

Vol 33 (6) ◽

pp. 1079-1097 ◽

Cited By ~ 2

Author(s):

Mark Endrei ◽

Chao Jin ◽

Minh Ngoc Dinh ◽

David Abramson ◽

Heidi Poxon ◽

...

Keyword(s):

Machine Learning ◽

Energy Efficiency ◽

High Performance ◽

Large Scale ◽

Energy Use ◽

Parallel Applications ◽

Learning Models ◽

Trade Off ◽

Time Required ◽

Machine Learning Models

Rising power costs and constraints are driving a growing focus on the energy efficiency of high performance computing systems. The unique characteristics of a particular system and workload and their effect on performance and energy efficiency are typically difficult for application users to assess and to control. Settings for optimum performance and energy efficiency can also diverge, so we need to identify trade-off options that guide a suitable balance between energy use and performance. We present statistical and machine learning models that only require a small number of runs to make accurate Pareto-optimal trade-off predictions using parameters that users can control. We study model training and validation using several parallel kernels and more complex workloads, including Algebraic Multigrid (AMG), Large-scale Atomic Molecular Massively Parallel Simulator, and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. We demonstrate that we can train the models using as few as 12 runs, with prediction error of less than 10%. Our AMG results identify trade-off options that provide up to 45% improvement in energy efficiency for around 10% performance loss. We reduce the sample measurement time required for AMG by 90%, from 13 h to 74 min.

Download Full-text

A Novel Rounding Algorithm for a High Performance IEEE 754 Double-Precision Floating-Point Multiplier

2020 IEEE 38th International Conference on Computer Design (ICCD) ◽

10.1109/iccd50377.2020.00081 ◽

2020 ◽

Author(s):

S. Ross Thompson ◽

James E. Stine

Keyword(s):

High Performance ◽

Floating Point ◽

Double Precision ◽

Rounding Algorithm

Download Full-text

High-performance Machine Learning in Enabling Large-scale Load Analysis Considering Class Imbalance and Frequency Domain Characteristics

2020 IEEE Sustainable Power and Energy Conference (iSPEC) ◽

10.1109/ispec50848.2020.9350922 ◽

2020 ◽

Author(s):

Xi Wang ◽

Quan Tang ◽

Haiyan Wang ◽

Ruiguang Ma ◽

Zizhuo Tang

Keyword(s):

Machine Learning ◽

Frequency Domain ◽

High Performance ◽

Large Scale ◽

Class Imbalance ◽

Load Analysis

Download Full-text

Machine learning of serum metabolic patterns encodes early-stage lung adenocarcinoma

10.21203/rs.3.pex-963/v1 ◽

2021 ◽

Author(s):

Lin Huang ◽

Kun Qian

Keyword(s):

Machine Learning ◽

Lung Adenocarcinoma ◽

Cancer Detection ◽

High Performance ◽

Large Scale ◽

Early Cancer ◽

Early Stage ◽

Early Cancer Detection ◽

Ionization Mass ◽

Efficient Test

Abstract Early cancer detection greatly increases the chances for successful treatment, but available diagnostics for some tumours, including lung adenocarcinoma (LA), are limited. An ideal early-stage diagnosis of LA for large-scale clinical use must address quick detection, low invasiveness, and high performance. Here, we conduct machine learning of serum metabolic patterns to detect early-stage LA. We extract direct metabolic patterns by the optimized ferric particle-assisted laser desorption/ionization mass spectrometry within 1 second using only 50 nL of serum. We define a metabolic range of 100-400 Da with 143 m/z features. We diagnose early-stage LA with sensitivity~70-90% and specificity~90-93% through the sparse regression machine learning of patterns. We identify a biomarker panel of seven metabolites and relevant pathways to distinguish early-stage LA from controls (p < 0.05). Our approach advances the design of metabolic analysis for early cancer detection and holds promise as an efficient test for low-cost rollout to clinics.

Download Full-text

High Performance and Fault Tolerance Double Precision Floating Point Arithmetic Units

Journal of Artificial Intelligence ◽

10.3923/jai.2013.154.160 ◽

2013 ◽

Vol 6 (2) ◽

pp. 154-160

Author(s):

N. Vinothkuma ◽

M.S. Ravi ◽

Kittur Harish Maillikarj

Keyword(s):

Fault Tolerance ◽

High Performance ◽

Floating Point ◽

Double Precision ◽

Floating Point Arithmetic ◽

Arithmetic Units ◽

Point Arithmetic

Download Full-text

High performance and energy efficient single‐precision and double‐precision merged floating‐point adder on FPGA

IET Computers & Digital Techniques ◽

10.1049/iet-cdt.2016.0200 ◽

2017 ◽

Vol 12 (1) ◽

pp. 20-29 ◽

Cited By ~ 4

Author(s):

Hao Zhang ◽

Dongdong Chen ◽

Seok‐Bum Ko

Keyword(s):

Energy Efficient ◽

High Performance ◽

Floating Point ◽

Double Precision ◽

Single Precision

Download Full-text

Implementation of Embedded Floating Point Arithmetic Units on FPGA

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.550.126 ◽

2014 ◽

Vol 550 ◽

pp. 126-136

Author(s):

N. Ramya Rani

Keyword(s):

High Speed ◽

High Performance ◽

Floating Point ◽

Double Precision ◽

Embedded Computing ◽

Floating Point Arithmetic ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Arithmetic Units ◽

Point Arithmetic

:Floating point arithmetic plays a major role in scientific and embedded computing applications. But the performance of field programmable gate arrays (FPGAs) used for floating point applications is poor due to the complexity of floating point arithmetic. The implementation of floating point units on FPGAs consumes a large amount of resources and that leads to the development of embedded floating point units in FPGAs. Embedded applications like multimedia, communication and DSP algorithms use floating point arithmetic in processing graphics, Fourier transformation, coding, etc. In this paper, methodologies are presented for the implementation of embedded floating point units on FPGA. The work is focused with the aim of achieving high speed of computations and to reduce the power for evaluating expressions. An application that demands high performance floating point computation can achieve better speed and density by incorporating embedded floating point units. Additionally this paper describes a comparative study of the design of single precision and double precision pipelined floating point arithmetic units for evaluating expressions. The modules are designed using VHDL simulation in Xilinx software and implemented on VIRTEX and SPARTAN FPGAs.

Download Full-text

Large-scale machine learning based on functional networks for biomedical big data with high performance computing platforms

Journal of Computational Science ◽

10.1016/j.jocs.2015.09.008 ◽

2015 ◽

Vol 11 ◽

pp. 69-81 ◽

Cited By ~ 32

Author(s):

Emad Elsebakhi ◽

Frank Lee ◽

Eric Schendel ◽

Anwar Haque ◽

Nagarajan Kathireason ◽

...

Keyword(s):

Machine Learning ◽

Big Data ◽

High Performance Computing ◽

High Performance ◽

Large Scale ◽

Functional Networks ◽

Computing Platforms ◽

Performance Computing

Download Full-text

Splotch

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016652713 ◽

2016 ◽

Vol 31 (6) ◽

pp. 550-563

Author(s):

Timothy Dykes ◽

Claudio Gheller ◽

Marzia Rivi ◽

Mel Krokos

Keyword(s):

High Performance ◽

Large Scale ◽

Graphics Processing Unit ◽

Processing Unit ◽

Xeon Phi ◽

The Many ◽

Many Core ◽

Performance Results ◽

Graphics Processing ◽

Performance Computing

With the increasing size and complexity of data produced by large-scale numerical simulations, it is of primary importance for scientists to be able to exploit all available hardware in heterogenous high-performance computing environments for increased throughput and efficiency. We focus on the porting and optimization of Splotch, a scalable visualization algorithm, to utilize the Xeon Phi, Intel’s coprocessor based upon the new many integrated core architecture. We discuss steps taken to offload data to the coprocessor and algorithmic modifications to aid faster processing on the many-core architecture and make use of the uniquely wide vector capabilities of the device, with accompanying performance results using multiple Xeon Phi. Finally we compare performance against results achieved with the Graphics Processing Unit (GPU) based implementation of Splotch.

Download Full-text

Design of a Reconfigurable Coprocessor for Double Precision Floating Point Matrix Algorithms

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.58-60.1037 ◽

2011 ◽

Vol 58-60 ◽

pp. 1037-1042

Author(s):

Sheng Long Li ◽

Zhao Lin Li ◽

Qing Wei Zheng

Keyword(s):

High Performance ◽

Cmos Technology ◽

General Purpose ◽

Floating Point ◽

Double Precision ◽

Synthesis Time ◽

Matrix Algorithms ◽

Matrix Operations ◽

On Chip ◽

Software Execution

Double precision floating point matrix operations are wildly used in a variety of engineering and scientific computing applications. However, it’s inefficient to achieve these operations using software approaches on general purpose processors. In order to reduce the processing time and satisfy the real-time demand, a reconfigurable coprocessor for double precision floating point matrix algorithms is proposed in this paper. The coprocessor is embedded in a Multi-Processor System on Chip (MPSoC), cooperates with an ARM core and a DSP core for high-performance control and calculation. One algorithm in GPS applications is taken for example to illustrate the efficiency of the coprocessor proposed in this paper. The experiment result shows that the coprocessor can achieve speedup a factor of 50 for the quaternion algorithm of attitude solution in inertial navigation application compare with software execution time of a TI C6713 DSP. The coprocessor is implemented in SMIC 0.13μm CMOS technology, the synthesis time delay is 9.75ns, and the power consumption is 63.69 mW when it works at 100MHz.

Download Full-text