High-performance Machine Learning in Enabling Large-scale Load Analysis Considering Class Imbalance and Frequency Domain Characteristics

Currently, data classification is one of the most important ways to analysis data. However, along with the development of data collection, transmission, and storage technologies, the scale of the data has been sharply increased. Additionally, due to multiple classes and imbalanced data distribution in the dataset, the class imbalance issue is also gradually highlighted. The traditional machine learning algorithms lack of abilities for handling the aforementioned issues so that the classification efficiency and precision may be significantly impacted. Therefore, this paper presents an improved artificial neural network in enabling the high-performance classification for the imbalanced large volume data. Firstly, the Borderline-SMOTE (synthetic minority oversampling technique) algorithm is employed to balance the training dataset, which potentially aims at improving the training of the back propagation neural network (BPNN), and then, zero-mean, batch-normalization, and rectified linear unit (ReLU) are further employed to optimize the input layer and hidden layers of BPNN. At last, the ensemble learning-based parallelization of the improved BPNN is implemented using the Hadoop framework. Positive conclusions can be summarized according to the experimental results. Benefitting from Borderline-SMOTE, the imbalanced training dataset can be balanced, which improves the training performance and the classification accuracy. The improvements for the input layer and hidden layer also enhance the training performances in terms of convergence. The parallelization and the ensemble learning techniques enable BPNN to implement the high-performance large-scale data classification. The experimental results show the effectiveness of the presented classification algorithm.

Download Full-text

Statistical and machine learning models for optimizing energy in parallel applications

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019842915 ◽

2019 ◽

Vol 33 (6) ◽

pp. 1079-1097 ◽

Cited By ~ 2

Author(s):

Mark Endrei ◽

Chao Jin ◽

Minh Ngoc Dinh ◽

David Abramson ◽

Heidi Poxon ◽

...

Keyword(s):

Machine Learning ◽

Energy Efficiency ◽

High Performance ◽

Large Scale ◽

Energy Use ◽

Parallel Applications ◽

Learning Models ◽

Trade Off ◽

Time Required ◽

Machine Learning Models

Rising power costs and constraints are driving a growing focus on the energy efficiency of high performance computing systems. The unique characteristics of a particular system and workload and their effect on performance and energy efficiency are typically difficult for application users to assess and to control. Settings for optimum performance and energy efficiency can also diverge, so we need to identify trade-off options that guide a suitable balance between energy use and performance. We present statistical and machine learning models that only require a small number of runs to make accurate Pareto-optimal trade-off predictions using parameters that users can control. We study model training and validation using several parallel kernels and more complex workloads, including Algebraic Multigrid (AMG), Large-scale Atomic Molecular Massively Parallel Simulator, and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. We demonstrate that we can train the models using as few as 12 runs, with prediction error of less than 10%. Our AMG results identify trade-off options that provide up to 45% improvement in energy efficiency for around 10% performance loss. We reduce the sample measurement time required for AMG by 90%, from 13 h to 74 min.

Download Full-text

Machine learning of serum metabolic patterns encodes early-stage lung adenocarcinoma

10.21203/rs.3.pex-963/v1 ◽

2021 ◽

Author(s):

Lin Huang ◽

Kun Qian

Keyword(s):

Machine Learning ◽

Lung Adenocarcinoma ◽

Cancer Detection ◽

High Performance ◽

Large Scale ◽

Early Cancer ◽

Early Stage ◽

Early Cancer Detection ◽

Ionization Mass ◽

Efficient Test

Abstract Early cancer detection greatly increases the chances for successful treatment, but available diagnostics for some tumours, including lung adenocarcinoma (LA), are limited. An ideal early-stage diagnosis of LA for large-scale clinical use must address quick detection, low invasiveness, and high performance. Here, we conduct machine learning of serum metabolic patterns to detect early-stage LA. We extract direct metabolic patterns by the optimized ferric particle-assisted laser desorption/ionization mass spectrometry within 1 second using only 50 nL of serum. We define a metabolic range of 100-400 Da with 143 m/z features. We diagnose early-stage LA with sensitivity~70-90% and specificity~90-93% through the sparse regression machine learning of patterns. We identify a biomarker panel of seven metabolites and relevant pathways to distinguish early-stage LA from controls (p < 0.05). Our approach advances the design of metabolic analysis for early cancer detection and holds promise as an efficient test for low-cost rollout to clinics.

Download Full-text

Large-scale machine learning based on functional networks for biomedical big data with high performance computing platforms

Journal of Computational Science ◽

10.1016/j.jocs.2015.09.008 ◽

2015 ◽

Vol 11 ◽

pp. 69-81 ◽

Cited By ~ 32

Author(s):

Emad Elsebakhi ◽

Frank Lee ◽

Eric Schendel ◽

Anwar Haque ◽

Nagarajan Kathireason ◽

...

Keyword(s):

Machine Learning ◽

Big Data ◽

High Performance Computing ◽

High Performance ◽

Large Scale ◽

Functional Networks ◽

Computing Platforms ◽

Performance Computing

Download Full-text

BrainIAK tutorials: User-friendly learning materials for advanced fMRI analysis

10.31219/osf.io/j4sbc ◽

2019 ◽

Cited By ~ 2

Author(s):

Manoj Kumar ◽

Cameron Thomas Ellis ◽

Qihong Lu ◽

Hejia Zhang ◽

Mihai Capota ◽

...

Keyword(s):

Machine Learning ◽

Functional Connectivity ◽

Open Source ◽

Programming Languages ◽

High Performance ◽

Large Scale ◽

Markov Models ◽

Matrix Analysis ◽

Fmri Analysis ◽

User Friendly

Advanced brain imaging analysis methods, including multivariate pattern analysis (MVPA), functional connectivity, and functional alignment, have become powerful tools in cognitive neuroscience over the past decade. These tools are implemented in custom code and separate packages, often requiring different software and language proficiencies. Although usable by expert researchers, novice users face a steep learning curve. These difficulties stem from the use of new programming languages (e.g., Python), learning how to apply machine-learning methods to high-dimensional fMRI data, and minimal documentation and training materials. Furthermore, most standard fMRI analysis packages (e.g., AFNI, FSL, SPM) focus on preprocessing and univariate analyses, leaving a gap in how to integrate with advanced tools. To address these needs, we developed BrainIAK (brainiak.org), an open-source Python software package that seamlessly integrates several cutting-edge, computationally efficient techniques with other Python packages (e.g., Nilearn, Scikit-learn) for file handling, visualization, and machine learning. To disseminate these powerful tools, we developed user-friendly tutorials (in Jupyter format; https://brainiak.org/tutorials/) for learning BrainIAK and advanced fMRI analysis in Python more generally. These materials cover techniques including: MVPA (pattern classification and representational similarity analysis); parallelized searchlight analysis; background connectivity; full correlation matrix analysis; inter-subject correlation; inter-subject functional connectivity; shared response modeling; event segmentation using hidden Markov models; and real-time fMRI. For long-running jobs or large memory needs we provide detailed guidance on high-performance computing clusters. These notebooks were successfully tested at multiple sites, including as problem sets for courses at Yale and Princeton universities and at various workshops and hackathons. These materials are freely shared, with the hope that they become part of a pool of open-source software and educational materials for large-scale, reproducible fMRI analysis and accelerated discovery.

Download Full-text

A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines

ACM Transactions on Mathematical Software ◽

10.1145/3431921 ◽

2021 ◽

Vol 47 (3) ◽

pp. 1-23

Author(s):

Ahmad Abdelfattah ◽

Timothy Costa ◽

Jack Dongarra ◽

Mark Gates ◽

Azzam Haidar ◽

...

Keyword(s):

Machine Learning ◽

Linear Algebra ◽

High Performance ◽

Large Scale ◽

Floating Point ◽

Equal Size ◽

Hardware Accelerators ◽

Double Precision ◽

Basic Linear Algebra Subprograms ◽

Many Core

This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped together in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facility. As well as the standard types of single and double precision, we also include half and quadruple precision in the standard. In particular, half precision is used in many very large scale applications, such as those associated with machine learning.

Download Full-text

LLAMA: a robust and scalable machine learning pipeline for analysis of large scale 4D microscopy data: analysis of cell ruffles and filopodia

BMC Bioinformatics ◽

10.1186/s12859-021-04324-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

James G. Lefevre ◽

Yvette W. H. Koh ◽

Adam A. Wall ◽

Nicholas D. Condon ◽

Jennifer L. Stow ◽

...

Keyword(s):

Machine Learning ◽

Statistical Analysis ◽

High Speed ◽

High Performance ◽

Large Scale ◽

Immune Surveillance ◽

Complex Surface ◽

Semantic Segmentation ◽

Systematic Analysis ◽

4D Microscopy

Abstract Background With recent advances in microscopy, recordings of cell behaviour can result in terabyte-size datasets. The lattice light sheet microscope (LLSM) images cells at high speed and high 3D resolution, accumulating data at 100 frames/second over hours, presenting a major challenge for interrogating these datasets. The surfaces of vertebrate cells can rapidly deform to create projections that interact with the microenvironment. Such surface projections include spike-like filopodia and wave-like ruffles on the surface of macrophages as they engage in immune surveillance. LLSM imaging has provided new insights into the complex surface behaviours of immune cells, including revealing new types of ruffles. However, full use of these data requires systematic and quantitative analysis of thousands of projections over hundreds of time steps, and an effective system for analysis of individual structures at this scale requires efficient and robust methods with minimal user intervention. Results We present LLAMA, a platform to enable systematic analysis of terabyte-scale 4D microscopy datasets. We use a machine learning method for semantic segmentation, followed by a robust and configurable object separation and tracking algorithm, generating detailed object level statistics. Our system is designed to run on high-performance computing to achieve high throughput, with outputs suitable for visualisation and statistical analysis. Advanced visualisation is a key element of LLAMA: we provide a specialised tool which supports interactive quality control, optimisation, and output visualisation processes to complement the processing pipeline. LLAMA is demonstrated in an analysis of macrophage surface projections, in which it is used to i) discriminate ruffles induced by lipopolysaccharide (LPS) and macrophage colony stimulating factor (CSF-1) and ii) determine the autonomy of ruffle morphologies. Conclusions LLAMA provides an effective open source tool for running a cell microscopy analysis pipeline based on semantic segmentation, object analysis and tracking. Detailed numerical and visual outputs enable effective statistical analysis, identifying distinct patterns of increased activity under the two interventions considered in our example analysis. Our system provides the capacity to screen large datasets for specific structural configurations. LLAMA identified distinct features of LPS and CSF-1 induced ruffles and it identified a continuity of behaviour between tent pole ruffling, wave-like ruffling and filopodia deployment.

Download Full-text

High Performance Machine Learning Models of Large Scale Air Pollution Data in Urban Area

Cybernetics and Information Technologies ◽

10.2478/cait-2020-0060 ◽

2020 ◽

Vol 20 (6) ◽

pp. 49-60

Author(s):

Snezhana G. Gocheva-Ilieva ◽

Atanas V. Ivanov ◽

Ioannis E. Livieris

Keyword(s):

Machine Learning ◽

Urban Areas ◽

Air Pollutants ◽

High Performance ◽

Large Scale ◽

Moving Average ◽

Measurement Data ◽

Hourly Data ◽

Auto Regressive ◽

Time Variables

AbstractPreserving the air quality in urban areas is crucial for the health of the population as well as for the environment. The availability of large volumes of measurement data on the concentrations of air pollutants enables their analysis and modelling to establish trends and dependencies in order to forecast and prevent future pollution. This study proposes a new approach for modelling air pollutants data using the powerful machine learning method Random Forest (RF) and Auto-Regressive Integrated Moving Average (ARIMA) methodology. Initially, a RF model of the pollutant is built and analysed in relation to the meteorological variables. This model is then corrected through subsequent modelling of its residuals using the univariate ARIMA. The approach is demonstrated for hourly data on seven air pollutants (O3, NOx, NO, NO2, CO, SO2, PM10) in the town of Dimitrovgrad, Bulgaria over 9 years and 3 months. Six meteorological and three time variables are used as predictors. High-performance models are obtained explaining the data with R2 = 90%-98%.

Download Full-text

dislib: Large Scale High Performance Machine Learning in Python

2019 15th International Conference on eScience (eScience) ◽

10.1109/escience.2019.00018 ◽

2019 ◽

Cited By ~ 1

Author(s):

Javier Alvarez Cid-Fuentes ◽

Salvi Sola ◽

Pol Alvarez ◽

Alfred Castro-Ginard ◽

Rosa M. Badia

Keyword(s):

Machine Learning ◽

High Performance ◽

Large Scale

Download Full-text

Pandemic drugs at pandemic speed: infrastructure for accelerating COVID-19 drug discovery with hybrid machine learning- and physics-based simulations on high-performance computers

Interface Focus ◽

10.1098/rsfs.2021.0018 ◽

2021 ◽

Vol 11 (6) ◽

Cited By ~ 1

Author(s):

Agastya P. Bhati ◽

Shunzhou Wan ◽

Dario Alfè ◽

Austin R. Clyde ◽

Mathis Bode ◽

...

Keyword(s):

Machine Learning ◽

Drug Discovery ◽

High Performance ◽

Large Scale ◽

Antiviral Drug ◽

Linear Accelerators ◽

Vast Number ◽

Global Pandemic ◽

Major Bottleneck ◽

High Performance Computers

The race to meet the challenges of the global pandemic has served as a reminder that the existing drug discovery process is expensive, inefficient and slow. There is a major bottleneck screening the vast number of potential small molecules to shortlist lead compounds for antiviral drug development. New opportunities to accelerate drug discovery lie at the interface between machine learning methods, in this case, developed for linear accelerators, and physics-based methods. The two in silico methods, each have their own advantages and limitations which, interestingly, complement each other. Here, we present an innovative infrastructural development that combines both approaches to accelerate drug discovery. The scale of the potential resulting workflow is such that it is dependent on supercomputing to achieve extremely high throughput. We have demonstrated the viability of this workflow for the study of inhibitors for four COVID-19 target proteins and our ability to perform the required large-scale calculations to identify lead antiviral compounds through repurposing on a variety of supercomputers.

Download Full-text