Formal semantics and high performance in declarative machine learning using Datalog

AbstractWith an escalating arms race to adopt machine learning (ML) in diverse application domains, there is an urgent need to support declarative machine learning over distributed data platforms. Toward this goal, a new framework is needed where users can specify ML tasks in a manner where programming is decoupled from the underlying algorithmic and system concerns. In this paper, we argue that declarative abstractions based on Datalog are natural fits for machine learning and propose a purely declarative ML framework with a Datalog query interface. We show that using aggregates in recursive Datalog programs entails a concise expression of ML applications, while providing a strictly declarative formal semantics. This is achieved by introducing simple conditions under which the semantics of recursive programs is guaranteed to be equivalent to that of aggregate-stratified ones. We further provide specialized compilation and planning techniques for semi-naive fixpoint computation in the presence of aggregates and optimization strategies that are effective on diverse recursive programs and distributed data platforms. To test and demonstrate these research advances, we have developed a powerful and user-friendly system on top of Apache Spark. Extensive evaluations on large-scale datasets illustrate that this approach will achieve promising performance gains while improving both programming flexibility and ease of development and deployment for ML applications.

Download Full-text

BrainIAK tutorials: User-friendly learning materials for advanced fMRI analysis

10.31219/osf.io/j4sbc ◽

2019 ◽

Cited By ~ 2

Author(s):

Manoj Kumar ◽

Cameron Thomas Ellis ◽

Qihong Lu ◽

Hejia Zhang ◽

Mihai Capota ◽

...

Keyword(s):

Machine Learning ◽

Functional Connectivity ◽

Open Source ◽

Programming Languages ◽

High Performance ◽

Large Scale ◽

Markov Models ◽

Matrix Analysis ◽

Fmri Analysis ◽

User Friendly

Advanced brain imaging analysis methods, including multivariate pattern analysis (MVPA), functional connectivity, and functional alignment, have become powerful tools in cognitive neuroscience over the past decade. These tools are implemented in custom code and separate packages, often requiring different software and language proficiencies. Although usable by expert researchers, novice users face a steep learning curve. These difficulties stem from the use of new programming languages (e.g., Python), learning how to apply machine-learning methods to high-dimensional fMRI data, and minimal documentation and training materials. Furthermore, most standard fMRI analysis packages (e.g., AFNI, FSL, SPM) focus on preprocessing and univariate analyses, leaving a gap in how to integrate with advanced tools. To address these needs, we developed BrainIAK (brainiak.org), an open-source Python software package that seamlessly integrates several cutting-edge, computationally efficient techniques with other Python packages (e.g., Nilearn, Scikit-learn) for file handling, visualization, and machine learning. To disseminate these powerful tools, we developed user-friendly tutorials (in Jupyter format; https://brainiak.org/tutorials/) for learning BrainIAK and advanced fMRI analysis in Python more generally. These materials cover techniques including: MVPA (pattern classification and representational similarity analysis); parallelized searchlight analysis; background connectivity; full correlation matrix analysis; inter-subject correlation; inter-subject functional connectivity; shared response modeling; event segmentation using hidden Markov models; and real-time fMRI. For long-running jobs or large memory needs we provide detailed guidance on high-performance computing clusters. These notebooks were successfully tested at multiple sites, including as problem sets for courses at Yale and Princeton universities and at various workshops and hackathons. These materials are freely shared, with the hope that they become part of a pool of open-source software and educational materials for large-scale, reproducible fMRI analysis and accelerated discovery.

Download Full-text

Statistical and machine learning models for optimizing energy in parallel applications

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019842915 ◽

2019 ◽

Vol 33 (6) ◽

pp. 1079-1097 ◽

Cited By ~ 2

Author(s):

Mark Endrei ◽

Chao Jin ◽

Minh Ngoc Dinh ◽

David Abramson ◽

Heidi Poxon ◽

...

Keyword(s):

Machine Learning ◽

Energy Efficiency ◽

High Performance ◽

Large Scale ◽

Energy Use ◽

Parallel Applications ◽

Learning Models ◽

Trade Off ◽

Time Required ◽

Machine Learning Models

Rising power costs and constraints are driving a growing focus on the energy efficiency of high performance computing systems. The unique characteristics of a particular system and workload and their effect on performance and energy efficiency are typically difficult for application users to assess and to control. Settings for optimum performance and energy efficiency can also diverge, so we need to identify trade-off options that guide a suitable balance between energy use and performance. We present statistical and machine learning models that only require a small number of runs to make accurate Pareto-optimal trade-off predictions using parameters that users can control. We study model training and validation using several parallel kernels and more complex workloads, including Algebraic Multigrid (AMG), Large-scale Atomic Molecular Massively Parallel Simulator, and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. We demonstrate that we can train the models using as few as 12 runs, with prediction error of less than 10%. Our AMG results identify trade-off options that provide up to 45% improvement in energy efficiency for around 10% performance loss. We reduce the sample measurement time required for AMG by 90%, from 13 h to 74 min.

Download Full-text

High-performance Machine Learning in Enabling Large-scale Load Analysis Considering Class Imbalance and Frequency Domain Characteristics

2020 IEEE Sustainable Power and Energy Conference (iSPEC) ◽

10.1109/ispec50848.2020.9350922 ◽

2020 ◽

Author(s):

Xi Wang ◽

Quan Tang ◽

Haiyan Wang ◽

Ruiguang Ma ◽

Zizhuo Tang

Keyword(s):

Machine Learning ◽

Frequency Domain ◽

High Performance ◽

Large Scale ◽

Class Imbalance ◽

Load Analysis

Download Full-text

Machine learning of serum metabolic patterns encodes early-stage lung adenocarcinoma

10.21203/rs.3.pex-963/v1 ◽

2021 ◽

Author(s):

Lin Huang ◽

Kun Qian

Keyword(s):

Machine Learning ◽

Lung Adenocarcinoma ◽

Cancer Detection ◽

High Performance ◽

Large Scale ◽

Early Cancer ◽

Early Stage ◽

Early Cancer Detection ◽

Ionization Mass ◽

Efficient Test

Abstract Early cancer detection greatly increases the chances for successful treatment, but available diagnostics for some tumours, including lung adenocarcinoma (LA), are limited. An ideal early-stage diagnosis of LA for large-scale clinical use must address quick detection, low invasiveness, and high performance. Here, we conduct machine learning of serum metabolic patterns to detect early-stage LA. We extract direct metabolic patterns by the optimized ferric particle-assisted laser desorption/ionization mass spectrometry within 1 second using only 50 nL of serum. We define a metabolic range of 100-400 Da with 143 m/z features. We diagnose early-stage LA with sensitivity~70-90% and specificity~90-93% through the sparse regression machine learning of patterns. We identify a biomarker panel of seven metabolites and relevant pathways to distinguish early-stage LA from controls (p < 0.05). Our approach advances the design of metabolic analysis for early cancer detection and holds promise as an efficient test for low-cost rollout to clinics.

Download Full-text

Large-scale machine learning based on functional networks for biomedical big data with high performance computing platforms

Journal of Computational Science ◽

10.1016/j.jocs.2015.09.008 ◽

2015 ◽

Vol 11 ◽

pp. 69-81 ◽

Cited By ~ 32

Author(s):

Emad Elsebakhi ◽

Frank Lee ◽

Eric Schendel ◽

Anwar Haque ◽

Nagarajan Kathireason ◽

...

Keyword(s):

Machine Learning ◽

Big Data ◽

High Performance Computing ◽

High Performance ◽

Large Scale ◽

Functional Networks ◽

Computing Platforms ◽

Performance Computing

Download Full-text

A comprehensive resource for retrieving, visualizing, and integrating functional genomics data

Life Science Alliance ◽

10.26508/lsa.201900546 ◽

2019 ◽

Vol 3 (1) ◽

pp. e201900546

Author(s):

Matthias Blum ◽

Pierre-Etienne Cholley ◽

Valeriya Malysheva ◽

Samuel Nicaise ◽

Julien Moehlin ◽

...

Keyword(s):

Functional Genomics ◽

High Performance ◽

Large Scale ◽

Quality Data ◽

Multidimensional Data ◽

Online Resource ◽

Public Data ◽

User Friendly ◽

Quality Assessments ◽

Multidimensional Data Integration

The enormous amount of freely accessible functional genomics data is an invaluable resource for interrogating the biological function of multiple DNA-interacting players and chromatin modifications by large-scale comparative analyses. However, in practice, interrogating large collections of public data requires major efforts for (i) reprocessing available raw reads, (ii) incorporating quality assessments to exclude artefactual and low-quality data, and (iii) processing data by using high-performance computation. Here, we present qcGenomics, a user-friendly online resource for ultrafast retrieval, visualization, and comparative analysis of tens of thousands of genomics datasets to gain new functional insight from global or focused multidimensional data integration.

Download Full-text

A Multi-resolution Approach for Content-based Image Retrieval on the Grid

Methods of Information in Medicine ◽

10.1055/s-0038-1633949 ◽

2005 ◽

Vol 44 (02) ◽

pp. 211-214 ◽

Cited By ~ 3

Author(s):

T. Tweed ◽

S. Miguet ◽

K. Hassan

Keyword(s):

Image Retrieval ◽

High Performance ◽

Large Scale ◽

Breast Cancer Diagnosis ◽

Content Based Image Retrieval ◽

Distributed Data ◽

Promising Tool ◽

Medical Centers ◽

Image Retrieval System ◽

Processing Resources

Summary Objectives: Hospitals and medical centers are producing more and more data that need to be processed. Those data are confidential, heterogeneous, and limited to the geographic site where they have been produced. Unless properly anonymized, they cannot be distributed on wide area networks. Methods: Grid technologies allow the globalization of storage and processing resources, and enable large-scale experimentations on distributed data. They constitute a promising tool to treat the different data and analyze the knowledge they contain, while offering secured access and high-performance computing capacities to the different users. Our aim is to evaluate the possibilities of grid technologies for handling medical data. Results and Conclusions: In this paper, we focus on a breast cancer diagnosis assistance tool, based on distributed and incremental knowledge construction and a content-based image retrieval system. We analyze the different scenarios of uses of such a tool. We further propose an algorithm that indexes mammographic images for content-based query purposes. This algorithm is tested on images of different resolutions in order to reduce the indexation time and we analyze its performance with experiments on the grid.

Download Full-text

scBFA: modeling detection patterns to mitigate technical noise in large-scale single-cell genomics data

Genome Biology ◽

10.1186/s13059-019-1806-0 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 6

Author(s):

Ruoxin Li ◽

Gerald Quon

Keyword(s):

Gene Expression ◽

Single Cell ◽

Large Scale ◽

Feature Detection ◽

State Of The Art ◽

Cell Type ◽

Technical Noise ◽

Single Cell Genomics ◽

Performance Gains ◽

New Framework

Abstract Technical variation in feature measurements, such as gene expression and locus accessibility, is a key challenge of large-scale single-cell genomic datasets. We show that this technical variation in both scRNA-seq and scATAC-seq datasets can be mitigated by analyzing feature detection patterns alone and ignoring feature quantification measurements. This result holds when datasets have low detection noise relative to quantification noise. We demonstrate state-of-the-art performance of detection pattern models using our new framework, scBFA, for both cell type identification and trajectory inference. Performance gains can also be realized in one line of R code in existing pipelines.

Download Full-text

A Set of Batched Basic Linear Algebra Subprograms and LAPACK Routines

ACM Transactions on Mathematical Software ◽

10.1145/3431921 ◽

2021 ◽

Vol 47 (3) ◽

pp. 1-23

Author(s):

Ahmad Abdelfattah ◽

Timothy Costa ◽

Jack Dongarra ◽

Mark Gates ◽

Azzam Haidar ◽

...

Keyword(s):

Machine Learning ◽

Linear Algebra ◽

High Performance ◽

Large Scale ◽

Floating Point ◽

Equal Size ◽

Hardware Accelerators ◽

Double Precision ◽

Basic Linear Algebra Subprograms ◽

Many Core

This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped together in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facility. As well as the standard types of single and double precision, we also include half and quadruple precision in the standard. In particular, half precision is used in many very large scale applications, such as those associated with machine learning.

Download Full-text

LLAMA: a robust and scalable machine learning pipeline for analysis of large scale 4D microscopy data: analysis of cell ruffles and filopodia

BMC Bioinformatics ◽

10.1186/s12859-021-04324-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

James G. Lefevre ◽

Yvette W. H. Koh ◽

Adam A. Wall ◽

Nicholas D. Condon ◽

Jennifer L. Stow ◽

...

Keyword(s):

Machine Learning ◽

Statistical Analysis ◽

High Speed ◽

High Performance ◽

Large Scale ◽

Immune Surveillance ◽

Complex Surface ◽

Semantic Segmentation ◽

Systematic Analysis ◽

4D Microscopy

Abstract Background With recent advances in microscopy, recordings of cell behaviour can result in terabyte-size datasets. The lattice light sheet microscope (LLSM) images cells at high speed and high 3D resolution, accumulating data at 100 frames/second over hours, presenting a major challenge for interrogating these datasets. The surfaces of vertebrate cells can rapidly deform to create projections that interact with the microenvironment. Such surface projections include spike-like filopodia and wave-like ruffles on the surface of macrophages as they engage in immune surveillance. LLSM imaging has provided new insights into the complex surface behaviours of immune cells, including revealing new types of ruffles. However, full use of these data requires systematic and quantitative analysis of thousands of projections over hundreds of time steps, and an effective system for analysis of individual structures at this scale requires efficient and robust methods with minimal user intervention. Results We present LLAMA, a platform to enable systematic analysis of terabyte-scale 4D microscopy datasets. We use a machine learning method for semantic segmentation, followed by a robust and configurable object separation and tracking algorithm, generating detailed object level statistics. Our system is designed to run on high-performance computing to achieve high throughput, with outputs suitable for visualisation and statistical analysis. Advanced visualisation is a key element of LLAMA: we provide a specialised tool which supports interactive quality control, optimisation, and output visualisation processes to complement the processing pipeline. LLAMA is demonstrated in an analysis of macrophage surface projections, in which it is used to i) discriminate ruffles induced by lipopolysaccharide (LPS) and macrophage colony stimulating factor (CSF-1) and ii) determine the autonomy of ruffle morphologies. Conclusions LLAMA provides an effective open source tool for running a cell microscopy analysis pipeline based on semantic segmentation, object analysis and tracking. Detailed numerical and visual outputs enable effective statistical analysis, identifying distinct patterns of increased activity under the two interventions considered in our example analysis. Our system provides the capacity to screen large datasets for specific structural configurations. LLAMA identified distinct features of LPS and CSF-1 induced ruffles and it identified a continuity of behaviour between tent pole ruffling, wave-like ruffling and filopodia deployment.

Download Full-text