Towards an optimized GROUP BY abstraction for large-scale machine learning

2021 ◽  
Vol 14 (11) ◽  
pp. 2327-2340
Author(s):  
Side Li ◽  
Arun Kumar

Many applications that use large-scale machine learning (ML) increasingly prefer different models for subgroups (e.g., countries) to improve accuracy, fairness, or other desiderata. We call this emerging popular practice learning over groups, analogizing to GROUP BY in SQL, albeit for ML training instead of SQL aggregates. From the systems standpoint, this practice compounds the already data-intensive workload of ML model selection (e.g., hyperparameter tuning). Often, thousands of models may need to be trained, necessitating high-throughput parallel execution. Alas, most ML systems today focus on training one model at a time or, at best, parallelizing hyperparameter tuning. This status quo leads to resource wastage, low throughput, and high runtimes. In this work, we take the first step towards enabling and optimizing learning over groups from the data systems standpoint for three popular classes of ML: linear models, neural networks, and gradient-boosted decision trees. Analytically and empirically, we compare the standard approaches used to execute this workload today: task-parallelism and data-parallelism. We find neither is universally dominant. We put forth a novel hybrid approach we call grouped learning that avoids redundancy in communications and I/O using a novel form of parallel gradient descent we call Gradient Accumulation Parallelism (GAP). We prototype our ideas in a system we call Kingpin, built on top of existing ML tools and the flexible massively parallel runtime Ray. An extensive empirical evaluation on large ML benchmark datasets shows that Kingpin matches or is 4x to 14x faster than state-of-the-art ML systems, including Ray's native execution and PyTorch DDP.
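
The abstract contrasts task-parallelism and data-parallelism for learning over groups. As a point of reference, here is a minimal task-parallel sketch using Ray's public API (one training task per group); the group names, data, and toy logistic-regression model are illustrative assumptions, not Kingpin's actual implementation.

```python
# Minimal task-parallel sketch of "learning over groups" with Ray.
# One training task per group; all names and the toy model are illustrative.
import ray
import numpy as np
from sklearn.linear_model import LogisticRegression

ray.init(ignore_reinit_error=True)

@ray.remote
def train_group_model(X, y):
    # Train one model on the rows belonging to a single group.
    return LogisticRegression(max_iter=200).fit(X, y)

rng = np.random.default_rng(0)
groups = {g: (rng.normal(size=(500, 8)), rng.integers(0, 2, 500))
          for g in ["US", "FR", "JP"]}

# Launch all per-group trainings in parallel and collect the fitted models.
futures = {g: train_group_model.remote(X, y) for g, (X, y) in groups.items()}
models = {g: ray.get(f) for g, f in futures.items()}
```

Pure task-parallelism like this duplicates I/O when groups share input files, which is one of the redundancies the grouped-learning approach above is designed to avoid.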

2021 ◽  
Author(s):  
Aurore Lafond ◽  
Maurice Ringer ◽  
Florian Le Blay ◽  
Jiaxu Liu ◽  
Ekaterina Millan ◽  
...  

Abstract. Abnormal surface pressure is typically the first indicator of a number of problematic events, including kicks, losses, washouts and stuck pipe. These events account for 60–70% of all drilling-related nonproductive time, so their early and accurate detection has the potential to save the industry billions of dollars. Detecting these events today requires an expert user watching multiple curves, which can be costly and subject to human error. The solution presented in this paper aims to augment traditional models with new machine learning techniques that detect these events automatically and help monitor the drilling well. Today’s real-time monitoring systems employ complex physical models to estimate surface standpipe pressure while drilling. These require many inputs and are difficult to calibrate. Machine learning is an alternative method to predict pump pressure, but on its own it needs significant labelled training data, which is often lacking in the drilling world. The new system combines these approaches: a machine learning framework enables automated learning while the physical models compensate for any gaps in the training data. The system uses only standard surface measurements, is fully automated, and is continuously retrained while drilling to ensure the most accurate pressure prediction. In addition, a stochastic (Bayesian) machine learning technique is used, which yields not only a prediction of the pressure but also the uncertainty and confidence of that prediction. Lastly, the new system includes a data quality control workflow. It discards periods of low data quality from the pressure anomaly detection, enabling smarter real-time event analysis. The new system has been tested on historical wells using a new test and validation framework. The framework runs the system automatically on large volumes of both historical and simulated data, enabling cross-referencing of the results with observations. In this paper, we show the results of the automated test framework as well as the capabilities of the new system in two specific case studies, one on land and one offshore. Moreover, large-scale statistics highlight the reliability and efficiency of this new detection workflow. The new system builds on the trend in our industry to better capture and utilize digital data for optimizing drilling.
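
To make the "prediction plus uncertainty" idea concrete, here is a hedged sketch using a generic Bayesian linear model from scikit-learn; the feature set, synthetic data, and thresholding rule are assumptions for illustration, not the paper's hybrid physics/ML system.

```python
# Hedged sketch of the stochastic (Bayesian) prediction idea: predict standpipe
# pressure from surface measurements together with an uncertainty estimate.
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(1)
# Toy surface measurements: flow rate, RPM, hook load, bit depth (all invented).
X = rng.normal(size=(1000, 4))
pressure = 50 + X @ np.array([12.0, 3.0, -1.5, 8.0]) + rng.normal(scale=2.0, size=1000)

model = BayesianRidge().fit(X[:800], pressure[:800])
mean, std = model.predict(X[800:], return_std=True)

# Flag samples where observed pressure falls outside the predicted confidence
# band: the anomaly-detection trigger described in the abstract above.
anomalous = np.abs(pressure[800:] - mean) > 3 * std
```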


2021 ◽  
Author(s):  
Sanjay Giri ◽  
Amin Shakya ◽  
Mohamed Nabi ◽  
Suleyman Naqshband ◽  
Toshiki Iwasaki ◽  
...  

Evolution and transition of bedforms in lowland rivers are micro-scale morphological processes that influence river management decisions. This work builds upon our past efforts, which include physics-based modelling, physical experiments and machine learning (ML) approaches to predict bedform features and states as well as the associated flow resistance. We revisit our past work on developing and applying numerical models, from simple to sophisticated, starting with a multi-scale shallow-water model with a dual-grid technique. The model incorporates an adjustment of the local bed shear stress by a slope effect and an additional term that influences bedform features. Furthermore, we review our work on a vertical two-dimensional model with a free-surface flow condition. We explore the effects of different sediment transport approaches, such as equilibrium transport with bed slope correction and non-equilibrium transport with pick-up and deposition. We revisit a sophisticated three-dimensional Large Eddy Simulation (LES) model with an improved sediment transport approach that includes sliding, rolling, and jumping based on a Lagrangian framework. Finally, we discuss bedform states and transitions that are studied using laboratory experiments as well as a theory-guided data science approach, which assures logical reasoning when analyzing physical phenomena with large amounts of data. A theoretical evaluation of parameters that influence bedform development is carried out, followed by classification of bedform type using a neural network model.

In the second part, we focus on practical application and discuss the large-scale numerical models that are applied in river engineering and management practice. Such models are found to have noticeable inaccuracies and uncertainties arising from various physical and non-physical causes. A key physical problem of these large-scale numerical models is the prediction of the evolution and transition of micro-scale bedforms and the associated flow resistance. The evolution and transition of bedforms during the rising and falling stages of a flood wave have a noticeable impact on morphology and flow levels in lowland alluvial rivers. The interaction between flow and micro-scale bedforms cannot be treated in a physics-based manner in large-scale numerical models because of the incompatibility between the resolution of the models and the scale of the morphological changes, so the dynamics of bedforms and the corresponding changes in flow resistance are not captured. As a way forward, we propose a hybrid approach: the CFD models mentioned above are used to generate large amounts of data, complemented by field and laboratory observations and an analysis of their reliability, on the basis of which an ML model is developed. The CFD models can replicate bedform evolution and transition processes as well as the associated flow resistance in a physics-based manner under steady and varying flow conditions. The hybrid approach of using CFD and ML models can offer a better prediction of flow resistance, which can be coupled with large-scale numerical models to improve their performance. The research is in progress.
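
As a concrete illustration of the final step in the first part (classifying bedform type with a neural network), here is a minimal sketch; the hydraulic predictors, class labels, and data are assumed for demonstration and do not come from the authors' experiments.

```python
# Illustrative sketch: classify bedform type with a small neural network from
# hydraulic parameters. Predictors and labels are placeholder assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(size=(600, 3))      # toy Froude number, Shields stress, roughness
y = rng.integers(0, 3, 600)         # 0 = ripples, 1 = dunes, 2 = upper plane bed

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```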


2021 ◽  
pp. 1-10
Author(s):  
Lei Shu ◽  
Kun Huang ◽  
Wenhao Jiang ◽  
Wenming Wu ◽  
Hongling Liu

Using real-world data directly in machine learning tasks easily leads to poor generalization, since such data is usually high-dimensional yet limited in quantity. By learning low-dimensional representations of high-dimensional data, feature selection can retain the features useful for machine learning tasks, and these features in turn allow models to be trained effectively. Feature selection from high-dimensional data is therefore a challenge. To address this issue, this paper proposes a novel feature selection method: a hybrid approach consisting of an autoencoder and Bayesian methods. First, Bayesian methods are embedded in the proposed autoencoder as a special hidden layer; this is done to increase precision when selecting non-redundant features. Then, the other hidden layers of the autoencoder are used for non-redundant feature selection. Finally, the proposed method is compared with mainstream feature selection approaches and outperforms them. We find that combining autoencoders with probabilistic correction methods is more effective for feature selection than stacking architectures or adding constraints to autoencoders. We also demonstrate that stacked autoencoders are more suitable for large-scale feature selection, whereas sparse autoencoders are beneficial when selecting a smaller number of features. The proposed method thus provides a theoretical reference for analyzing the optimality of feature selection.
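
For orientation, a minimal autoencoder-based feature-selection sketch in PyTorch is given below; the paper's Bayesian hidden layer is not reproduced, and features are ranked here by a common encoder-weight heuristic instead.

```python
# Minimal autoencoder feature-selection sketch (not the paper's exact method).
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(1000, 20)               # toy high-dimensional data

model = nn.Sequential(
    nn.Linear(20, 8), nn.ReLU(),        # encoder (bottleneck of 8 units)
    nn.Linear(8, 20),                   # decoder
)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for _ in range(200):                    # reconstruction training loop
    opt.zero_grad()
    loss = loss_fn(model(X), X)
    loss.backward()
    opt.step()

# Score each input feature by its total weight into the bottleneck and keep
# the top-k as the "selected" features; a common heuristic, not a Bayesian one.
scores = model[0].weight.abs().sum(dim=0)
selected = torch.topk(scores, k=10).indices
```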


2017 ◽  
Author(s):  
Jérémy Chardon ◽  
Benoit Hingray ◽  
Anne-Catherine Favre

Abstract. Statistical Downscaling Methods (SDMs) are often used to produce local weather scenarios from large-scale atmospheric information. SDMs include transfer functions which are based on a statistical link, identified from observations, between local weather and a set of large-scale predictors. As the physical processes generating surface weather vary in time, the most relevant predictors and the regression link are likely to vary in time as well. This is well known for precipitation, for instance, and the link is thus often estimated after some seasonal stratification of the data. In this study, we present a hybrid model where the regression link is estimated from atmospheric analogs of the current prediction day. Atmospheric analogs are first identified from geopotential fields at 1000 and 500 hPa. For the regression stage, two Generalized Linear Models are further used to model the probability of precipitation occurrence and the distribution of non-zero precipitation amounts, respectively. The hybrid model is evaluated for the probabilistic prediction of local precipitation over France. It noticeably improves the skill of the prediction for both precipitation occurrence and quantity. As the analog days vary from one prediction day to another, the atmospheric predictors selected in the regression stage and the values of the corresponding regression coefficients also vary from day to day. The hybrid approach thus allows for day-to-day adaptive and tailored downscaling. It can also reveal specific predictors for peculiar and infrequent weather configurations.
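
A minimal sketch of the analog-conditioned regression idea follows: for each prediction day, the occurrence model is refit on that day's nearest atmospheric analogs, so the link adapts day by day. The synthetic data, the Euclidean analog metric, and the use of plain logistic regression for the occurrence GLM are illustrative assumptions.

```python
# Sketch: refit the occurrence regression on the nearest analog days only.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
Z = rng.normal(size=(3000, 50))                       # toy geopotential predictors per day
rain = (rng.uniform(size=3000) < 0.3).astype(int)     # precipitation occurrence

analogs = NearestNeighbors(n_neighbors=100).fit(Z)

def predict_occurrence(z_today):
    # Identify the 100 closest analog days, then fit the regression link
    # on those days alone, so coefficients vary from one day to the next.
    _, idx = analogs.kneighbors(z_today.reshape(1, -1))
    glm = LogisticRegression(max_iter=200).fit(Z[idx[0]], rain[idx[0]])
    return glm.predict_proba(z_today.reshape(1, -1))[0, 1]

print(predict_occurrence(Z[0]))
```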


2021 ◽  
Vol 4 (1) ◽  
Author(s):  
Tingting Sun ◽  
Yuting Chen ◽  
Yuhao Wen ◽  
Zefeng Zhu ◽  
Minghui Li

Abstract. Resistance to small-molecule drugs is the main cause of the failure of therapeutic drugs in clinical practice. Missense mutations altering the binding of ligands to proteins are one of the critical mechanisms behind genetic disease and drug resistance. Computational methods have made considerable progress in predicting binding affinity changes and identifying resistance mutations, but their prediction accuracy and speed are still not satisfactory and need further improvement. To address these issues, we introduce a structure-based machine learning method, named PremPLI, for quantitatively estimating the effects of single mutations on ligand binding affinity changes. A comprehensive comparison of the predictive performance of PremPLI with other available methods on two benchmark datasets confirms that our approach performs robustly and achieves similar or even higher predictive accuracy than approaches relying on first-principle statistical mechanics and mixed physics- and knowledge-based potentials, while requiring far fewer computational resources. PremPLI can be used for guiding the design of ligand-binding proteins, identifying and understanding disease driver mutations, and finding potential resistance mutations for different drugs. PremPLI is freely available at https://lilab.jysw.suda.edu.cn/research/PremPLI/ and supports large-scale mutational scanning.
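
As a generic illustration of the prediction task PremPLI addresses (regressing the change in binding affinity on features of a single mutation), here is a hedged sketch; the features and the random-forest model are placeholders, not PremPLI's actual design.

```python
# Generic sketch of per-mutation affinity-change regression; placeholder only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
# Toy per-mutation features: e.g. change in hydrophobicity, change in residue
# volume, distance to ligand, local solvent accessibility (all invented).
X = rng.normal(size=(500, 4))
ddg = X @ np.array([0.8, 0.3, -0.5, 0.2]) + rng.normal(scale=0.3, size=500)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, ddg)
print(model.predict(X[:3]))             # predicted binding affinity changes
```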


2017 ◽  
Author(s):  
Dmitry Petrov ◽  
Boris A. Gutman ◽  
Shih-Hua (Julie) Yu ◽  
Theo G.M. van Erp ◽  
Jessica A. Turner ◽  
...  

Abstract. As very large studies of complex neuroimaging phenotypes become more common, human quality assessment of MRI-derived data remains one of the last major bottlenecks. Few attempts have so far been made to address this issue with machine learning. In this work, we optimize predictive models of quality for meshes representing deep brain structure shapes. We use standard vertex-wise and global shape features computed homologously across 19 cohorts and over 7500 human-rated subjects, training kernelized Support Vector Machine and Gradient Boosted Decision Tree classifiers to detect meshes of failing quality. Our models generalize across datasets and diseases, reducing human workload by 30-70%, or equivalently hundreds of human rater hours for datasets of comparable size, with recall rates approaching inter-rater reliability.
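
A minimal sketch of the two classifier families named above, applied to stand-in shape features, might look like the following; the feature matrix and labels are random placeholders rather than the study's mesh data.

```python
# Sketch: kernelized SVM and gradient-boosted trees for mesh quality detection.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(800, 30))          # stand-in global + vertex-wise shape features
y = rng.integers(0, 2, 800)             # 1 = failing-quality mesh (toy labels)

for name, clf in [("SVM", SVC(kernel="rbf", class_weight="balanced")),
                  ("GBDT", GradientBoostingClassifier())]:
    print(name, cross_val_score(clf, X, y, cv=3).mean())
```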


2021 ◽  
Vol 14 (10) ◽  
pp. 1769-1782
Author(s):  
Yuhao Zhang ◽  
Frank McQuillan ◽  
Nandish Jayaram ◽  
Nikhil Kak ◽  
Ekta Khanna ◽  
...  

Deep learning (DL) is growing in popularity for many data analytics applications, including among enterprises. Large business-critical datasets in such settings typically reside in RDBMSs or other data systems. The DB community has long aimed to bring machine learning (ML) to DBMS-resident data. Given past lessons from in-DBMS ML and recent advances in scalable DL systems, DBMS and cloud vendors are increasingly interested in adding more DL support for DB-resident data. Recently, a new parallel DL model selection execution approach called Model Hopper Parallelism (MOP) was proposed. In this paper, we characterize the particular suitability of MOP for DL on data systems, but in bringing MOP-based DL to DB-resident data, we show that there is no single "best" approach; rather, an interesting tradeoff space of approaches exists. We explain four canonical approaches, build prototypes on top of Greenplum Database, compare them analytically on multiple criteria (e.g., runtime efficiency and ease of governance), and compare them empirically with large-scale DL workloads. Our experiments and analyses show that it is non-trivial to meet all practical desiderata well and that a Pareto frontier exists; for instance, some approaches are 3x-6x faster but fare worse on governance and portability. Our results and insights can help DBMS and cloud vendors design better DL support for DB users. All of our source code, data, and other artifacts are available at https://github.com/makemebitter/cerebro-ds.
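
To convey the scheduling idea behind MOP, here is a conceptual sketch: each worker pins one data partition, and per "sub-epoch" every model trains on one partition and then hops, so that after P sub-epochs each model has visited all P partitions exactly once. The train() stub stands in for real per-partition SGD; none of this is the paper's Greenplum code.

```python
# Conceptual sketch of Model Hopper Parallelism (MOP) scheduling.
def train(model, partition):
    # Placeholder for running SGD on one partition; returns the updated model.
    return model + [partition]

num_partitions = 4
models = [[] for _ in range(3)]         # 3 model configs under selection

for sub_epoch in range(num_partitions):
    for m, model in enumerate(models):
        # Round-robin hop: model m visits partition (m + sub_epoch) mod P, so
        # no two models share a worker within the same sub-epoch.
        p = (m + sub_epoch) % num_partitions
        models[m] = train(model, p)

# Each model's visit order covers all partitions once per epoch.
print(models[0])   # e.g. [0, 1, 2, 3]
```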


2020 ◽  
Vol 32 (2) ◽  
Author(s):  
Mark De Lancey ◽  
Inger Fabris-Rotelli

The Discrete Pulse Transform (DPT) decomposes a signal into pulses, the most recent and effective implementation being a graph-based algorithm called the Roadmaker's Pavage. Although the implementation is efficient, its theoretical structure results in a slow, deterministic algorithm. This paper examines the use of the spectral domain of graphs and designs graph filter banks to downsample within the algorithm, investigating the extent to which this speeds it up. Converting graph signals to the spectral domain is costly, so estimation for the filter banks is examined, as well as the design of a reusable filter bank. The sampled version requires hyperparameters to reconstruct the same textures of the image as the original algorithm, preventing a large-scale study. Here, an objective and efficient way of obtaining similar results from the original algorithm and our proposed Filtered Roadmaker's Pavage is provided. The method makes use of the Ht-index, separating the distribution of information at scale intervals. Empirical evaluation on benchmark datasets shows that the proposed algorithm consistently runs faster and uses fewer computational resources while achieving a positive SSIM with low variance. This provides an informative and faster approximation to the nonlinear DPT, a property not standardly achievable.
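
The SSIM comparison mentioned above can be sketched as follows; the "approximation" here is a stand-in Gaussian blur rather than the output of the Filtered Roadmaker's Pavage.

```python
# Sketch of the evaluation step: compare original and approximate outputs
# via SSIM. The approximation is a placeholder, not the paper's algorithm.
import numpy as np
from skimage.metrics import structural_similarity
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(6)
original = rng.uniform(size=(128, 128))
approximation = gaussian_filter(original, sigma=1.0)   # placeholder output

score = structural_similarity(original, approximation, data_range=1.0)
print("SSIM:", score)
```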


2019 ◽  
Vol 11 (3) ◽  
pp. 23-45 ◽  
Author(s):  
Khyati Ahlawat ◽  
Anuradha Chug ◽  
Amit Prakash Singh

Imbalanced datasets are those with an uneven distribution of classes, which deteriorates a classifier's performance. In this paper, an SVM classifier is combined with a K-Means clustering approach, and a hybrid method, Hy_SVM_KM, is introduced. The performance of the proposed method is empirically evaluated using Accuracy and FN Rate measures and compared with existing methods such as SMOTE. The results show that the proposed hybrid technique outperforms the traditional machine learning classifier SVM on most datasets and performs better than the well-known pre-processing technique SMOTE on all datasets. The goal of this article is to extend the capabilities of popular machine learning algorithms and adapt them to meet the challenges of imbalanced big data classification. This article can provide a baseline study for future research on imbalanced big data classification; it provides an efficient mechanism to deal with imbalanced big datasets using a modified SVM classifier and improves the overall performance of the model.
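
One plausible reading of the Hy_SVM_KM combination, sketched under stated assumptions below, is K-Means compression of the majority class before fitting the SVM, with SMOTE as the baseline; the paper's exact pipeline may differ.

```python
# Hedged sketch: K-Means undersampling of the majority class + SVM, with a
# SMOTE-oversampled SVM baseline. Data and cluster counts are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(7)
X_maj = rng.normal(0, 1, size=(2000, 5))       # majority class (toy data)
X_min = rng.normal(2, 1, size=(100, 5))        # minority class

# K-Means undersampling: replace 2000 majority samples with 100 centroids.
centroids = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X_maj).cluster_centers_
X_bal = np.vstack([centroids, X_min])
y_bal = np.array([0] * 100 + [1] * 100)
svm_km = SVC().fit(X_bal, y_bal)

# SMOTE baseline: oversample the minority class instead.
X_all = np.vstack([X_maj, X_min])
y_all = np.array([0] * 2000 + [1] * 100)
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X_all, y_all)
svm_smote = SVC().fit(X_sm, y_sm)
```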

