Designing production-friendly machine learning

Building production ML applications is difficult because of their resource cost and complex failure modes. I will discuss these challenges from two perspectives: research at the Stanford DAWN Lab and experience with large-scale commercial ML users at Databricks. I will then present two emerging ideas to help address these challenges. The first is "ML platforms", a class of software systems that standardize the interfaces used in ML applications to make them easier to build and maintain. I will give a few examples, including the open-source MLflow system from Databricks [3]. The second idea is models that are more "production-friendly" by design. As a concrete example, I will discuss retrieval-based NLP models such as Stanford's ColBERT [1, 2], which query documents from an updateable corpus to perform tasks such as question-answering. This design offers multiple practical advantages, including low computational cost, high interpretability, and very fast updates to the model's "knowledge". These models are an exciting alternative to large language models such as GPT-3.
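
To make the "updateable corpus" idea concrete, here is a toy sketch of retrieval in which new knowledge is added by appending a document rather than retraining anything. It is not ColBERT: the `ToyRetriever` class and its word-overlap scoring are hypothetical stand-ins chosen only to keep the example self-contained.

```python
# Toy illustration of retrieval over an updateable corpus.
# ToyRetriever and its word-overlap score are illustrative assumptions,
# not ColBERT's scoring function or API.

def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())


class ToyRetriever:
    def __init__(self):
        self.docs = []

    def add(self, text):
        """Updating the system's 'knowledge' is just appending a document."""
        self.docs.append(text)

    def search(self, query):
        """Return the document sharing the most words with the query."""
        query_tokens = tokenize(query)
        return max(self.docs, key=lambda d: len(query_tokens & tokenize(d)))


retriever = ToyRetriever()
retriever.add("The capital of France is Paris")
retriever.add("MLflow tracks experiments")
first_hit = retriever.search("which tool tracks experiments")

# A new fact becomes searchable immediately, with no retraining step.
new_doc = "ColBERT retrieves documents from a corpus"
retriever.add(new_doc)
second_hit = retriever.search("what retrieves documents")
```

The point of the sketch is the `add` method: the answer to the second query changes as soon as the corpus changes.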

Open-source Δ-quantum machine learning for medicinal chemistry

10.33774/chemrxiv-2021-fz6v7 ◽

2021 ◽

Author(s):

Kenneth Atz ◽

Clemens Isert ◽

Markus N. A. Böcker ◽

José Jiménez-Luna ◽

Gisbert Schneider

Keyword(s):

Machine Learning ◽

Open Source ◽

Density Functional ◽

Large Scale ◽

Molecular Design ◽

State Of The Art ◽

Computational Cost ◽

Quantum Mechanical ◽

Quantum Observables ◽

Graph Neural Networks

Certain molecular design tasks benefit from fast and accurate calculations of quantum-mechanical (QM) properties. However, the computational cost of QM methods applied to drug-like compounds currently makes large-scale applications of quantum chemistry challenging. In order to mitigate this problem, we developed DelFTa, an open-source toolbox for predicting small-molecule electronic properties at the density functional (DFT) level of theory, using the Δ-machine learning principle. DelFTa employs state-of-the-art E(3)-equivariant graph neural networks that were trained on the QMugs dataset of QM properties. It provides access to a wide array of quantum observables by predicting approximations to ωB97X-D/def2-SVP values from a GFN2-xTB semiempirical baseline. Δ-learning with DelFTa was shown to outperform direct DFT learning for most of the considered QM endpoints. The software is provided as open-source code with fully-documented command-line and Python APIs.
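
The Δ-machine learning principle named in this abstract can be sketched in a few lines: instead of learning the expensive quantity directly, a model learns the difference between the expensive reference and a cheap baseline, and that correction is added back to the baseline at prediction time. The data, the single descriptor, and the linear model below are illustrative assumptions; this is not DelFTa's API or its graph neural network.

```python
# Minimal sketch of delta-learning: fit the correction (reference - baseline),
# then predict reference ~ baseline + learned correction.
# All values and names are hypothetical; DelFTa itself uses graph neural networks.

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data: one descriptor per molecule, a cheap baseline value
# (stand-in for a semiempirical result) and an expensive reference
# value (stand-in for a DFT-level result).
descriptor = [1.0, 2.0, 3.0, 4.0]
baseline = [10.0, 19.0, 31.0, 40.0]
reference = [10.5, 20.0, 32.5, 42.0]

# The training target is the difference, not the reference itself.
delta = [ref - base for ref, base in zip(reference, baseline)]
slope, intercept = fit_line(descriptor, delta)


def predict_reference(x, baseline_value):
    """Add the learned correction to a freshly computed baseline value."""
    return baseline_value + (slope * x + intercept)


estimate = predict_reference(5.0, 50.0)
```

The correction is usually smaller and smoother than the full target, which is the usual motivation for learning it instead.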

Machine learning-based feature importance approach for sensitivity analysis of steel frames

10.31224/osf.io/mvkf3 ◽

2021 ◽

Author(s):

Hyeyoung Koh ◽

Hannah Beth Blum

Keyword(s):

Machine Learning ◽

Sensitivity Analysis ◽

Feature Selection ◽

Large Scale ◽

Failure Modes ◽

Model Development ◽

Predictive Performance ◽

Computational Effort ◽

Structural Systems ◽

Feature Importance

This study presents a machine learning-based approach for sensitivity analysis to examine how parameters affect a given structural response while accounting for uncertainty. Reliability-based sensitivity analysis involves repeated evaluations of the performance function incorporating uncertainties to estimate the influence of a model parameter, which can lead to prohibitive computational costs. This challenge is exacerbated for large-scale engineering problems, which often carry a large number of uncertain parameters. The proposed approach is based on feature selection algorithms that rank feature importance and remove redundant predictors during model development, which improves model generality and training performance by focusing only on the significant features. The approach allows performing sensitivity analysis of structural systems by providing feature rankings with reduced computational effort. The proposed approach is demonstrated with two designs of a two-bay, two-story planar steel frame with different failure modes: inelastic instability of a single member and progressive yielding. The feature variables in the data are uncertainties including material yield strength, Young’s modulus, frame sway imperfection, and residual stress. The Monte Carlo sampling method is utilized to generate random realizations of the frames from published distributions of the feature parameters, and the response variable is the frame ultimate strength obtained from finite element analyses. Decision trees are trained to identify important features. Feature rankings are derived by four feature selection techniques: impurity-based, permutation, SHAP, and Spearman's correlation. The predictive performance of the model including the important features is discussed using the Matthews correlation coefficient, an evaluation metric for imbalanced datasets. Finally, the results are compared with those from reliability-based sensitivity analysis on the same example frames to show the validity of the feature selection approach. As the proposed machine learning-based approach produces the same results as the reliability-based sensitivity analysis with improved computational efficiency and accuracy, it could be extended to other structural systems.
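
For readers unfamiliar with the Matthews correlation coefficient mentioned in this abstract, the following is a minimal sketch of the standard binary definition, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The labels are made-up toy data, and returning 0 when the denominator vanishes is a common convention assumed here, not something stated in the abstract.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Binary Matthews correlation coefficient from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Imbalanced toy example: 2 positives, 6 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
mcc = matthews_corrcoef(y_true, y_pred)  # TP=1, TN=5, FP=1, FN=1 -> 4/12
```

Plain accuracy on this toy example is 6/8 = 0.75, while the MCC is only 1/3, which illustrates why the metric is preferred when classes are imbalanced.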

Logging Analysis and Prediction in Open Source Java Project

Research Anthology on Usage and Development of Open Source Software ◽

10.4018/978-1-7998-9158-1.ch038 ◽

2021 ◽

pp. 733-761

Author(s):

Sangeeta Lal ◽

Neetu Sardana ◽

Ashish Sureka

Keyword(s):

Machine Learning ◽

Content Analysis ◽

Software Development ◽

Anomaly Detection ◽

Open Source ◽

Large Scale ◽

Source Code ◽

Scale Analysis ◽

Large Scale Analysis ◽

Research Questions

Log statements present in source code provide important information to the software developers because they are useful in various software development activities such as debugging, anomaly detection, and remote issue resolution. Most of the previous studies on logging analysis and prediction provide insights and results after analyzing only a few code constructs. In this chapter, the authors perform an in-depth, focused, and large-scale analysis of logging code constructs at two levels: the file level and catch-blocks level. They answer several research questions related to statistical and content analysis. Statistical and content analysis reveals the presence of differentiating properties among logged and nonlogged code constructs. Based on these findings, the authors propose a machine-learning-based model for catch-blocks logging prediction. The machine-learning-based model is found to be effective in catch-blocks logging prediction.

Machine learning issues and opportunities in ultrafast particle classification for label-free microflow cytometry

Scientific Reports ◽

10.1038/s41598-020-77765-w ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Alessio Lugnan ◽

Emmanuel Gooskens ◽

Jeremy Vatin ◽

Joni Dambre ◽

Peter Bienstman

Keyword(s):

Machine Learning ◽

Computational Cost ◽

Particle Analysis ◽

Label Free ◽

Machine Learning Approach ◽

Microflow Cytometer ◽

Learning Machine ◽

Learning Issues ◽

Low Computational Cost

Machine learning offers promising solutions for high-throughput single-particle analysis in label-free imaging microflow cytometry. However, the throughput of online operations such as cell sorting is often limited by the large computational cost of the image analysis, while offline operations may require the storage of an exceedingly large amount of data. Moreover, the training of machine learning systems can be easily biased by slight drifts of the measurement conditions, giving rise to a significant but difficult-to-detect degradation of the learned operations. We propose a simple and versatile machine learning approach to perform microparticle classification at an extremely low computational cost, showing good generalization over large variations in particle position. We present proof-of-principle classification of interference patterns projected by flowing transparent PMMA microbeads with diameters of 15.2 μm and 18.6 μm. To this end, a simple, cheap and compact label-free microflow cytometer is employed. We also discuss in detail the detection and prevention of machine learning bias in training and testing due to slight drifts of the measurement conditions. Moreover, we investigate the implications of modifying the projected particle pattern by means of a diffraction grating, in the context of optical extreme learning machine implementations.

Prediction of vascular aging based on smartphone acquired PPG signals

Scientific Reports ◽

10.1038/s41598-020-76816-6 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Lorenzo Dall’Olio ◽

Nico Curti ◽

Daniel Remondini ◽

Yosef Safi Harb ◽

Folkert W. Asselbergs ◽

...

Keyword(s):

Machine Learning ◽

Large Scale ◽

Screening Tool ◽

Computational Cost ◽

Penalized Regression ◽

Prediction Performance ◽

Second Derivative ◽

Vascular Aging ◽

Non Invasive ◽

Potential Biomarkers

Photoplethysmography (PPG) measured by smartphone has potential as a large-scale, non-invasive, and easy-to-use screening tool. Vascular aging is linked to increased arterial stiffness, which can be measured by PPG. We investigate the feasibility of using PPG to predict healthy vascular aging (HVA) based on two approaches: machine learning (ML) and deep learning (DL). We performed data preprocessing, including detrending, demodulating, and denoising on the raw PPG signals. For ML, ridge penalized regression has been applied to 38 features extracted from PPG, whereas for DL several convolutional neural networks (CNNs) have been applied to the whole PPG signals as input. The analysis has been conducted using the crowd-sourced Heart for Heart data. The prediction performance of ML using two features (AUC of 94.7%) – the a wave of the second derivative PPG and tpr, including four covariates, sex, height, weight, and smoking – was similar to that of the best performing CNN, 12-layer ResNet (AUC of 95.3%). Without having the heavy computational cost of DL, ML might be advantageous in finding potential biomarkers for HVA prediction. The whole workflow of the procedure is clearly described, and open software has been made available to facilitate replication of the results.
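
The AUC figures quoted in this abstract can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison; the labels and scores are invented toy values, not data from the study, and the function name is illustrative.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: two positives, two negatives.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)  # 3 of 4 positive/negative pairs are ordered correctly
```

The pairwise loop is quadratic in the number of samples, so it suits small illustrations; rank-based formulations give the same value more efficiently on large datasets.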

Extracting and studying the Logging-Code-Issue-Introducing changes in Java-based large-scale open source software systems

Empirical Software Engineering ◽

10.1007/s10664-019-09690-0 ◽

2019 ◽

Vol 24 (4) ◽

pp. 2285-2322

Author(s):

Boyuan Chen ◽

Zhen Ming Jiang

Keyword(s):

Open Source ◽

Open Source Software ◽

Large Scale ◽

Software Systems

A large-scale study of architectural evolution in open-source software systems

Empirical Software Engineering ◽

10.1007/s10664-016-9466-0 ◽

2016 ◽

Vol 22 (3) ◽

pp. 1146-1193 ◽

Cited By ~ 13

Author(s):

Pooyan Behnamghader ◽

Duc Minh Le ◽

Joshua Garcia ◽

Daniel Link ◽

Arman Shahbazian ◽

...

Keyword(s):

Open Source ◽

Open Source Software ◽

Large Scale ◽

Software Systems ◽

Large Scale Study

BrainIAK tutorials: User-friendly learning materials for advanced fMRI analysis

10.31219/osf.io/j4sbc ◽

2019 ◽

Cited By ~ 2

Author(s):

Manoj Kumar ◽

Cameron Thomas Ellis ◽

Qihong Lu ◽

Hejia Zhang ◽

Mihai Capota ◽

...

Keyword(s):

Machine Learning ◽

Functional Connectivity ◽

Open Source ◽

Programming Languages ◽

High Performance ◽

Large Scale ◽

Markov Models ◽

Matrix Analysis ◽

Fmri Analysis ◽

User Friendly

Advanced brain imaging analysis methods, including multivariate pattern analysis (MVPA), functional connectivity, and functional alignment, have become powerful tools in cognitive neuroscience over the past decade. These tools are implemented in custom code and separate packages, often requiring different software and language proficiencies. Although usable by expert researchers, novice users face a steep learning curve. These difficulties stem from the use of new programming languages (e.g., Python), learning how to apply machine-learning methods to high-dimensional fMRI data, and minimal documentation and training materials. Furthermore, most standard fMRI analysis packages (e.g., AFNI, FSL, SPM) focus on preprocessing and univariate analyses, leaving a gap in how to integrate with advanced tools. To address these needs, we developed BrainIAK (brainiak.org), an open-source Python software package that seamlessly integrates several cutting-edge, computationally efficient techniques with other Python packages (e.g., Nilearn, Scikit-learn) for file handling, visualization, and machine learning. To disseminate these powerful tools, we developed user-friendly tutorials (in Jupyter format; https://brainiak.org/tutorials/) for learning BrainIAK and advanced fMRI analysis in Python more generally. These materials cover techniques including: MVPA (pattern classification and representational similarity analysis); parallelized searchlight analysis; background connectivity; full correlation matrix analysis; inter-subject correlation; inter-subject functional connectivity; shared response modeling; event segmentation using hidden Markov models; and real-time fMRI. For long-running jobs or large memory needs we provide detailed guidance on high-performance computing clusters. These notebooks were successfully tested at multiple sites, including as problem sets for courses at Yale and Princeton universities and at various workshops and hackathons. These materials are freely shared, with the hope that they become part of a pool of open-source software and educational materials for large-scale, reproducible fMRI analysis and accelerated discovery.

Detecting The Speaker Language Using CNN Deep Learning Algorithm

Iraqi Journal for Computer Science and Mathematics ◽

10.52866/ijcsm.2022.01.01.005 ◽

2022 ◽

pp. 43-52

Author(s):

Fawziya M. Rammo ◽

Mohammed N. Al-Hamdani

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Deep Learning ◽

Open Source ◽

Convolutional Neural Networks ◽

Learning Algorithm ◽

Language Models ◽

Mel Frequency Cepstral Coefficients ◽

Deep Learning Algorithm ◽

Time Frames

Many language identification (LID) systems rely on language models that use machine learning (ML) approaches, and they typically need rather long recordings to achieve satisfactory accuracy. This study aims to extract enough information from short recording intervals to successfully classify the spoken languages under test. The classification process is based on frames of 2-18 seconds, whereas most previous LID systems were based on much longer time frames (from 3 seconds to 2 minutes). This research defined and implemented many low-level features using Mel-frequency cepstral coefficients (MFCC). The data source is voxforge.org, an open-source corpus of user-submitted audio clips in various languages, from which speech files in five languages (English, French, German, Italian, Spanish) were used. A convolutional neural network (CNN) was applied for classification: binary language classification reached an accuracy of 100%, and five-language classification with six languages reached an accuracy of 99.8%.

Comparison of different machine learning models for the prediction of forces in copper and silicon dioxide

Physical Chemistry Chemical Physics ◽

10.1039/c8cp04508a ◽

2018 ◽

Vol 20 (47) ◽

pp. 30006-30020 ◽

Cited By ~ 8

Author(s):

Wenwen Li ◽

Yasunobu Ando

Keyword(s):

Machine Learning ◽

Silicon Dioxide ◽

Force Field ◽

Computational Cost ◽

High Accuracy ◽

Simulation Approach ◽

Learning Models ◽

Atomic Simulation ◽

Low Computational Cost ◽

Machine Learning Models

Recently, the machine learning (ML) force field has emerged as a powerful atomic simulation approach because of its high accuracy and low computational cost.
