Designing production-friendly machine learning

2021 ◽  
Vol 14 (13) ◽  
pp. 3420-3420
Author(s):  
Matei Zaharia

Building production ML applications is difficult because of their resource cost and complex failure modes. I will discuss these challenges from two perspectives: the Stanford DAWN Lab and experience with large-scale commercial ML users at Databricks. I will then present two emerging ideas to help address these challenges. The first is "ML platforms", an emerging class of software systems that standardize the interfaces used in ML applications to make them easier to build and maintain. I will give a few examples, including the open-source MLflow system from Databricks [3]. The second idea is models that are more "production-friendly" by design. As a concrete example, I will discuss retrieval-based NLP models such as Stanford's ColBERT [1, 2] that query documents from an updateable corpus to perform tasks such as question-answering, which gives multiple practical advantages, including low computational cost, high interpretability, and very fast updates to the model's "knowledge". These models are an exciting alternative to large language models such as GPT-3.

2021 ◽  
Author(s):  
Kenneth Atz ◽  
Clemens Isert ◽  
Markus N. A. Böcker ◽  
José Jiménez-Luna ◽  
Gisbert Schneider

Certain molecular design tasks benefit from fast and accurate calculations of quantum-mechanical (QM) properties. However, the computational cost of QM methods applied to drug-like compounds currently makes large-scale applications of quantum chemistry challenging. In order to mitigate this problem, we developed DelFTa, an open-source toolbox for predicting small-molecule electronic properties at the density functional (DFT) level of theory, using the Δ-machine learning principle. DelFTa employs state-of-the-art E(3)-equivariant graph neural networks that were trained on the QMugs dataset of QM properties. It provides access to a wide array of quantum observables by predicting approximations to ωB97X-D/def2-SVP values from a GFN2-xTB semiempirical baseline. Δ-learning with DelFTa was shown to outperform direct DFT learning for most of the considered QM endpoints. The software is provided as open-source code with fully-documented command-line and Python APIs.


2021 ◽  
Author(s):  
Hyeyoung Koh ◽  
Hannah Beth Blum

This study presents a machine learning-based approach for sensitivity analysis to examine how parameters affect a given structural response while accounting for uncertainty. Reliability-based sensitivity analysis involves repeated evaluations of the performance function incorporating uncertainties to estimate the influence of a model parameter, which can lead to prohibitive computational costs. This challenge is exacerbated for large-scale engineering problems which often carry a large quantity of uncertain parameters. The proposed approach is based on feature selection algorithms that rank feature importance and remove redundant predictors during model development which improve model generality and training performance by focusing only on the significant features. The approach allows performing sensitivity analysis of structural systems by providing feature rankings with reduced computational effort. The proposed approach is demonstrated with two designs of a two-bay, two-story planar steel frame with different failure modes: inelastic instability of a single member and progressive yielding. The feature variables in the data are uncertainties including material yield strength, Young’s modulus, frame sway imperfection, and residual stress. The Monte Carlo sampling method is utilized to generate random realizations of the frames from published distributions of the feature parameters, and the response variable is the frame ultimate strength obtained from finite element analyses. Decision trees are trained to identify important features. Feature rankings are derived by four feature selection techniques including impurity-based, permutation, SHAP, and Spearman's correlation. Predictive performance of the model including the important features are discussed using the evaluation metric for imbalanced datasets, Matthews correlation coefficient. Finally, the results are compared with those from reliability-based sensitivity analysis on the same example frames to show the validity of the feature selection approach. As the proposed machine learning-based approach produces the same results as the reliability-based sensitivity analysis with improved computational efficiency and accuracy, it could be extended to other structural systems.


Author(s):  
Sangeeta Lal ◽  
Neetu Sardana ◽  
Ashish Sureka

Log statements present in source code provide important information to the software developers because they are useful in various software development activities such as debugging, anomaly detection, and remote issue resolution. Most of the previous studies on logging analysis and prediction provide insights and results after analyzing only a few code constructs. In this chapter, the authors perform an in-depth, focused, and large-scale analysis of logging code constructs at two levels: the file level and catch-blocks level. They answer several research questions related to statistical and content analysis. Statistical and content analysis reveals the presence of differentiating properties among logged and nonlogged code constructs. Based on these findings, the authors propose a machine-learning-based model for catch-blocks logging prediction. The machine-learning-based model is found to be effective in catch-blocks logging prediction.


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Alessio Lugnan ◽  
Emmanuel Gooskens ◽  
Jeremy Vatin ◽  
Joni Dambre ◽  
Peter Bienstman

AbstractMachine learning offers promising solutions for high-throughput single-particle analysis in label-free imaging microflow cytomtery. However, the throughput of online operations such as cell sorting is often limited by the large computational cost of the image analysis while offline operations may require the storage of an exceedingly large amount of data. Moreover, the training of machine learning systems can be easily biased by slight drifts of the measurement conditions, giving rise to a significant but difficult to detect degradation of the learned operations. We propose a simple and versatile machine learning approach to perform microparticle classification at an extremely low computational cost, showing good generalization over large variations in particle position. We present proof-of-principle classification of interference patterns projected by flowing transparent PMMA microbeads with diameters of $${15.2}\,\upmu \text {m}$$ 15.2 μ m and $${18.6}\,\upmu \text {m}$$ 18.6 μ m . To this end, a simple, cheap and compact label-free microflow cytometer is employed. We also discuss in detail the detection and prevention of machine learning bias in training and testing due to slight drifts of the measurement conditions. Moreover, we investigate the implications of modifying the projected particle pattern by means of a diffraction grating, in the context of optical extreme learning machine implementations.


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Lorenzo Dall’Olio ◽  
Nico Curti ◽  
Daniel Remondini ◽  
Yosef Safi Harb ◽  
Folkert W. Asselbergs ◽  
...  

AbstractPhotoplethysmography (PPG) measured by smartphone has the potential for a large scale, non-invasive, and easy-to-use screening tool. Vascular aging is linked to increased arterial stiffness, which can be measured by PPG. We investigate the feasibility of using PPG to predict healthy vascular aging (HVA) based on two approaches: machine learning (ML) and deep learning (DL). We performed data preprocessing, including detrending, demodulating, and denoising on the raw PPG signals. For ML, ridge penalized regression has been applied to 38 features extracted from PPG, whereas for DL several convolutional neural networks (CNNs) have been applied to the whole PPG signals as input. The analysis has been conducted using the crowd-sourced Heart for Heart data. The prediction performance of ML using two features (AUC of 94.7%) – the a wave of the second derivative PPG and tpr, including four covariates, sex, height, weight, and smoking – was similar to that of the best performing CNN, 12-layer ResNet (AUC of 95.3%). Without having the heavy computational cost of DL, ML might be advantageous in finding potential biomarkers for HVA prediction. The whole workflow of the procedure is clearly described, and open software has been made available to facilitate replication of the results.


2016 ◽  
Vol 22 (3) ◽  
pp. 1146-1193 ◽  
Author(s):  
Pooyan Behnamghader ◽  
Duc Minh Le ◽  
Joshua Garcia ◽  
Daniel Link ◽  
Arman Shahbazian ◽  
...  

2019 ◽  
Author(s):  
Manoj Kumar ◽  
Cameron Thomas Ellis ◽  
Qihong Lu ◽  
Hejia Zhang ◽  
Mihai Capota ◽  
...  

Advanced brain imaging analysis methods, including multivariate pattern analysis (MVPA), functional connectivity, and functional alignment, have become powerful tools in cognitive neuroscience over the past decade. These tools are implemented in custom code and separate packages, often requiring different software and language proficiencies. Although usable by expert researchers, novice users face a steep learning curve. These difficulties stem from the use of new programming languages (e.g., Python), learning how to apply machine-learning methods to high-dimensional fMRI data, and minimal documentation and training materials. Furthermore, most standard fMRI analysis packages (e.g., AFNI, FSL, SPM) focus on preprocessing and univariate analyses, leaving a gap in how to integrate with advanced tools. To address these needs, we developed BrainIAK (brainiak.org), an open-source Python software package that seamlessly integrates several cutting-edge, computationally efficient techniques with other Python packages (e.g., Nilearn, Scikit-learn) for file handling, visualization, and machine learning. To disseminate these powerful tools, we developed user-friendly tutorials (in Jupyter format; https://brainiak.org/tutorials/) for learning BrainIAK and advanced fMRI analysis in Python more generally. These materials cover techniques including: MVPA (pattern classification and representational similarity analysis); parallelized searchlight analysis; background connectivity; full correlation matrix analysis; inter-subject correlation; inter-subject functional connectivity; shared response modeling; event segmentation using hidden Markov models; and real-time fMRI. For long-running jobs or large memory needs we provide detailed guidance on high-performance computing clusters. These notebooks were successfully tested at multiple sites, including as problem sets for courses at Yale and Princeton universities and at various workshops and hackathons. These materials are freely shared, with the hope that they become part of a pool of open-source software and educational materials for large-scale, reproducible fMRI analysis and accelerated discovery.


Author(s):  
Fawziya M. Rammo ◽  
Mohammed N. Al-Hamdani

Many languages identification (LID) systems rely on language models that use machine learning (ML) approaches, LID systems utilize rather long recording periods to achieve satisfactory accuracy. This study aims to extract enough information from short recording intervals in order to successfully classify the spoken languages under test. The classification process is based on frames of (2-18) seconds where most of the previous LID systems were based on much longer time frames (from 3 seconds to 2 minutes). This research defined and implemented many low-level features using MFCC (Mel-frequency cepstral coefficients), containing speech files in five languages (English. French, German, Italian, Spanish), from voxforge.org an open-source corpus that consists of user-submitted audio clips in various languages, is the source of data used in this paper. A CNN (convolutional Neural Networks) algorithm applied in this paper for classification and the result was perfect, binary language classification had an accuracy of 100%, and five languages classification with six languages had an accuracy of 99.8%.


2018 ◽  
Vol 20 (47) ◽  
pp. 30006-30020 ◽  
Author(s):  
Wenwen Li ◽  
Yasunobu Ando

Recently, the machine learning (ML) force field has emerged as a powerful atomic simulation approach because of its high accuracy and low computational cost.


Sign in / Sign up

Export Citation Format

Share Document