An extended comparison study of large-scale data-driven prediction methods based on variable selection, latent variables, penalized regression and machine learning

Advanced predictive methods for wine age prediction: Part I – A comparison study of single-block regression approaches based on variable selection, penalized regression, latent variables and tree-based ensemble methods

Talanta ◽

10.1016/j.talanta.2016.10.062 ◽

2017 ◽

Vol 171 ◽

pp. 341-350 ◽

Cited By ~ 11

Author(s):

Ricardo Rendall ◽

Ana Cristina Pereira ◽

Marco S. Reis

Keyword(s):

Variable Selection ◽

Latent Variables ◽

Ensemble Methods ◽

Penalized Regression ◽

Comparison Study ◽

Predictive Methods ◽

Single Block ◽

Age Prediction


Prediction of vascular aging based on smartphone acquired PPG signals

Scientific Reports ◽

10.1038/s41598-020-76816-6 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Lorenzo Dall’Olio ◽

Nico Curti ◽

Daniel Remondini ◽

Yosef Safi Harb ◽

Folkert W. Asselbergs ◽

...

Keyword(s):

Machine Learning ◽

Large Scale ◽

Screening Tool ◽

Computational Cost ◽

Penalized Regression ◽

Prediction Performance ◽

Second Derivative ◽

Vascular Aging ◽

Non Invasive ◽

Potential Biomarkers

Abstract: Photoplethysmography (PPG) measured by smartphone has the potential to serve as a large-scale, non-invasive, and easy-to-use screening tool. Vascular aging is linked to increased arterial stiffness, which can be measured by PPG. We investigate the feasibility of using PPG to predict healthy vascular aging (HVA) based on two approaches: machine learning (ML) and deep learning (DL). We performed data preprocessing, including detrending, demodulating, and denoising, on the raw PPG signals. For ML, ridge penalized regression has been applied to 38 features extracted from PPG, whereas for DL several convolutional neural networks (CNNs) have been applied to the whole PPG signals as input. The analysis has been conducted using the crowd-sourced Heart for Heart data. The prediction performance of ML using two features (AUC of 94.7%) – the a wave of the second derivative PPG and tpr, including four covariates, sex, height, weight, and smoking – was similar to that of the best performing CNN, 12-layer ResNet (AUC of 95.3%). Without having the heavy computational cost of DL, ML might be advantageous in finding potential biomarkers for HVA prediction. The whole workflow of the procedure is clearly described, and open software has been made available to facilitate replication of the results.
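
The AUC figures quoted in this abstract can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that computation in plain Python, assuming binary labels and real-valued scores; the function name `roc_auc` and the toy data are illustrative assumptions and are not taken from the study or its software.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    labels: sequence of 0/1 values (1 = positive class)
    scores: sequence of floats, higher = more likely positive
    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data for illustration only.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
```

The quadratic pairwise loop is chosen for readability; for large samples a rank-based formulation would be the usual choice.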


Large-scale comparison of machine learning methods for drug target prediction on ChEMBL

Chemical Science ◽

10.1039/c8sc00148k ◽

2018 ◽

Vol 9 (24) ◽

pp. 5441-5451 ◽

Cited By ~ 109

Author(s):

Andreas Mayr ◽

Günter Klambauer ◽

Thomas Unterthiner ◽

Marvin Steijaert ◽

Jörg K. Wegner ◽

...

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Comparative Study ◽

Drug Target ◽

Large Scale ◽

State Of The Art ◽

Target Prediction ◽

Prediction Methods ◽

Machine Learning Methods ◽

Drug Target Prediction

The largest comparative study to date of nine state-of-the-art drug target prediction methods finds that deep learning outperforms all other competitors. The results are based on a benchmark of 1300 assays and half a million compounds.


High-dimensional regression in practice: an empirical study of finite-sample prediction, variable selection and ranking

Statistics and Computing ◽

10.1007/s11222-019-09914-9 ◽

2019 ◽

Vol 30 (3) ◽

pp. 697-719 ◽

Cited By ~ 1

Author(s):

Fan Wang ◽

Sach Mukherjee ◽

Sylvia Richardson ◽

Steven M. Hill

Keyword(s):

Variable Selection ◽

Large Scale ◽

Penalized Regression ◽

Adaptive Lasso ◽

High Dimensional ◽

Finite Sample ◽

Dantzig Selector ◽

Regression Methods ◽

High Dimensional Regression ◽

Selection And Ranking

Abstract: Penalized likelihood approaches are widely used for high-dimensional regression. Although many methods have been proposed and the associated theory is now well developed, the relative efficacy of different approaches in finite-sample settings, as encountered in practice, remains incompletely understood. There is therefore a need for empirical investigations in this area that can offer practical insight and guidance to users. In this paper, we present a large-scale comparison of penalized regression methods. We distinguish between three related goals: prediction, variable selection and variable ranking. Our results span more than 2300 data-generating scenarios, including both synthetic and semisynthetic data (real covariates and simulated responses), allowing us to systematically consider the influence of various factors (sample size, dimensionality, sparsity, signal strength and multicollinearity). We consider several widely used approaches (Lasso, Adaptive Lasso, Elastic Net, Ridge Regression, SCAD, the Dantzig Selector and Stability Selection). We find considerable variation in performance between methods. Our results support a “no panacea” view, with no unambiguous winner across all scenarios or goals, even in this restricted setting where all data align well with the assumptions underlying the methods. The study allows us to make some recommendations as to which approaches may be most (or least) suitable given the goal and some data characteristics. Our empirical results complement existing theory and provide a resource to compare methods across a range of scenarios and metrics.


Large-scale benchmark study of survival prediction methods using multi-omics data

Briefings in Bioinformatics ◽

10.1093/bib/bbaa167 ◽

2020 ◽

Author(s):

Moritz Herrmann ◽

Philipp Probst ◽

Roman Hornung ◽

Vindi Jurinovic ◽

Anne-Laure Boulesteix

Keyword(s):

Survival Time ◽

Large Scale ◽

Cox Model ◽

Penalized Regression ◽

Supplementary Information ◽

Survival Prediction ◽

Prediction Methods ◽

Omics Data ◽

Clinical Variables ◽

Benchmark Study

Abstract: Multi-omics data, that is, datasets containing different types of high-dimensional molecular variables, are increasingly often generated for the investigation of various diseases. Nevertheless, questions remain regarding the usefulness of multi-omics data for the prediction of disease outcomes such as survival time. It is also unclear which methods are most appropriate to derive such prediction models. We aim to give some answers to these questions through a large-scale benchmark study using real data. Different prediction methods from machine learning and statistics were applied on 18 multi-omics cancer datasets (35 to 1000 observations, up to 100 000 variables) from the database ‘The Cancer Genome Atlas’ (TCGA). The considered outcome was the (censored) survival time. Eleven methods based on boosting, penalized regression and random forest were compared, comprising both methods that do and that do not take the group structure of the omics variables into account. The Kaplan–Meier estimate and a Cox model using only clinical variables were used as reference methods. The methods were compared using several repetitions of 5-fold cross-validation. Uno’s C-index and the integrated Brier score served as performance metrics. The results indicate that methods taking into account the multi-omics structure have a slightly better prediction performance. Taking this structure into account can protect the predictive information in low-dimensional groups, especially clinical variables, from not being exploited during prediction. Moreover, only the block forest method outperformed the Cox model on average, and only slightly. This indicates, as a by-product of our study, that in the considered TCGA studies the utility of multi-omics data for prediction purposes was limited. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online. All analyses are reproducible using R code freely available on GitHub.
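
To make the idea of a concordance metric for censored survival times concrete, here is a minimal sketch in plain Python. It implements Harrell's C-index, a simpler relative of the Uno's C-index named in the abstract; Uno's version additionally reweights pairs by inverse probability of censoring, which this sketch does not do. The function name `harrell_c_index` and the toy data are illustrative assumptions and are not taken from the study's R code.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times: observed time (event or censoring) for each subject
    events: 1 if the event was observed, 0 if the subject was censored
    risk_scores: higher score = higher predicted risk (shorter survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's observed time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data for illustration only.
toy_times = [2, 4, 6, 8]
toy_events = [1, 0, 1, 1]
toy_risk = [0.9, 0.2, 0.1, 0.7]
toy_c = harrell_c_index(toy_times, toy_events, toy_risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of the comparable pairs.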


Prediction of vascular aging based on smartphone acquired PPG signals

10.1101/2020.05.26.116186 ◽

2020 ◽

Author(s):

Lorenzo Dall’Olio ◽

Nico Curti ◽

Daniel Remondini ◽

Yosef Safi Harb ◽

Folkert W. Asselbergs ◽

...

Keyword(s):

Machine Learning ◽

Large Scale ◽

Screening Tool ◽

Computational Cost ◽

Penalized Regression ◽

Prediction Performance ◽

Second Derivative ◽

Vascular Aging ◽

Non Invasive ◽

Potential Biomarkers

Abstract: Photoplethysmography (PPG) measured by smartphone has the potential to serve as a large-scale, non-invasive, and easy-to-use screening tool. Vascular aging is linked to increased arterial stiffness, which can be measured by PPG. We investigate the feasibility of using PPG to predict healthy vascular aging (HVA) based on two approaches: machine learning (ML) and deep learning (DL). We performed data preprocessing, including detrending, demodulating, and denoising, on the raw PPG signals. For ML, ridge penalized regression has been applied to 38 features extracted from PPG, whereas for DL several convolutional neural networks (CNNs) have been applied to the whole PPG signals as input. The analysis has been conducted using the crowd-sourced Heart for Heart data. The prediction performance of ML using two features (AUC of 94.7%) – the a wave of the second derivative PPG and tpr, including four covariates, sex, height, weight, and smoking – was similar to that of the best performing CNN, 12-layer ResNet (AUC of 95.3%). Without having the heavy computational cost of DL, ML might be advantageous in finding potential biomarkers for HVA prediction. The whole workflow of the procedure is clearly described, and open software has been made available to facilitate replication of the results.


SNR: Sub-Network Routing for Flexible Parameter Sharing in Multi-Task Learning

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.3301216 ◽

2019 ◽

Vol 33 ◽

pp. 216-223 ◽

Cited By ~ 1

Author(s):

Jiaqi Ma ◽

Zhe Zhao ◽

Jilin Chen ◽

Ang Li ◽

Lichan Hong ◽

...

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Latent Variables ◽

Large Scale ◽

Network Routing ◽

Low Level ◽

Task Learning ◽

Computation Efficiency ◽

Flexible Architectures ◽

Parameter Sharing

Machine learning applications, such as object detection and content recommendation, often require training a single model to predict multiple targets at the same time. Multi-task learning through neural networks became popular recently, because it not only helps improve the accuracy of many prediction tasks when they are related, but also saves computation cost by sharing model architectures and low-level representations. The latter is critical for real-time large-scale machine learning systems. However, classic multi-task neural networks may degenerate significantly in accuracy when tasks are less related. Previous works (Misra et al. 2016; Yang and Hospedales 2016; Ma et al. 2018) showed that having more flexible architectures in multi-task models, either manually-tuned or soft-parameter-sharing structures like gating networks, helps improve the prediction accuracy. However, manual tuning is not scalable, and the previous soft-parameter-sharing models are either not flexible enough or computationally expensive. In this work, we propose a novel framework called Sub-Network Routing (SNR) to achieve more flexible parameter sharing while maintaining the computational advantage of the classic multi-task neural-network model. SNR modularizes the shared low-level hidden layers into multiple layers of sub-networks, and controls the connection of sub-networks with learnable latent variables to achieve flexible parameter sharing. We demonstrate the effectiveness of our approach on a large-scale dataset, YouTube8M. We show that the proposed method improves the accuracy of multi-task models while maintaining their computation efficiency.


Large-Scale Data Learning Method for Anomaly Detection using Machine Learning for Monitoring Vibration in Vehicle Equipment

IEEJ Transactions on Industry Applications ◽

10.1541/ieejias.140.480 ◽

2020 ◽

Vol 140 (6) ◽

pp. 480-487

Author(s):

Minoru Kondo

Keyword(s):

Machine Learning ◽

Anomaly Detection ◽

Large Scale ◽

Learning Method ◽

Large Scale Data ◽

Scale Data


Coded Computing: Mitigating Fundamental Bottlenecks in Large-Scale Distributed Computing and Machine Learning

10.1561/9781680837056 ◽

2020 ◽

Author(s):

Songze Li ◽

Salman Avestimehr

Keyword(s):

Machine Learning ◽

Distributed Computing ◽

Large Scale


Evolution of Metastable Structures in Bimetallic Catalysts from Microscopy and Machine-Learning Molecular Dynamics

10.26434/chemrxiv.11811660.v1 ◽

2020 ◽

Author(s):

Jin Soo Lim ◽

Jonathan Vandermause ◽

Matthijs A. van Spronsen ◽

Albert Musaelian ◽

Christopher R. O’Connor ◽

...

Keyword(s):

Machine Learning ◽

Molecular Dynamics ◽

Large Scale ◽

Materials Science ◽

Complete Characterization ◽

Layer By Layer ◽

Surface Restructuring ◽

Metastable Structures ◽

Mechanistic Investigation ◽

Underlying Mechanisms

Restructuring of interfaces plays a crucial role in materials science and heterogeneous catalysis. Bimetallic systems, in particular, often adopt very different composition and morphology at surfaces compared to the bulk. For the first time, we reveal a detailed atomistic picture of the long-timescale restructuring of Pd deposited on Ag, using microscopy, spectroscopy, and novel simulation methods. Encapsulation of Pd by Ag always precedes layer-by-layer dissolution of Pd, resulting in significant Ag migration out of the surface and extensive vacancy pits. These metastable structures are of vital catalytic importance, as Ag-encapsulated Pd remains much more accessible to reactants than bulk-dissolved Pd. The underlying mechanisms are uncovered by performing fast and large-scale machine-learning molecular dynamics, followed by our newly developed method for complete characterization of atomic surface restructuring events. Our approach is broadly applicable to other multimetallic systems of interest and enables the previously impractical mechanistic investigation of restructuring dynamics.
