Machine learning RF shimming: Prediction by iteratively projected ridge regression

2018 ◽  
Vol 80 (5) ◽  
pp. 1871-1881 ◽  
Author(s):  
Julianna D. Ianni ◽  
Zhipeng Cao ◽  
William A. Grissom

2019 ◽  
Vol 82 (6) ◽  
pp. 2016-2031
Author(s):  
Raphaël Tomi‐Tricot ◽  
Vincent Gras ◽  
Bertrand Thirion ◽  
Franck Mauconduit ◽  
Nicolas Boulant ◽  
...  

2020 ◽  
Author(s):  
Peer Nowack ◽  
Lev Konstantinovskiy ◽  
Hannah Gardiner ◽  
John Cant

Abstract. Air pollution is a key public health issue in urban areas worldwide. The development of low-cost air pollution sensors is consequently a major research priority. However, low-cost sensors often fail to attain sufficient measurement performance compared to state-of-the-art measurement stations, and typically require calibration procedures in expensive laboratory settings. As a result, there has been much debate about calibration techniques that could make their performance more reliable while also allowing calibration to be carried out without access to advanced laboratories. One repeatedly proposed strategy is low-cost sensor calibration through co-location with public measurement stations. The idea is that, using a regression function, the low-cost sensor signals can be calibrated against the station reference signal, so that the sensors can then be deployed separately with performance similar to that of the original stations. Here we test the idea of using machine learning algorithms for such regression tasks using hourly-averaged co-location data for nitrogen dioxide (NO2) and particulate matter of particle sizes smaller than 10 μm (PM10) at three different locations in the urban area of London, UK. Specifically, we compare the performance of Ridge regression, a linear statistical learning algorithm, to two non-linear algorithms in the form of Random Forest (RF) regression and Gaussian Process regression (GPR). We further benchmark the performance of all three machine learning methods against the more common Multiple Linear Regression (MLR). We obtain very good out-of-sample R2 scores (coefficient of determination) > 0.7, frequently exceeding 0.8, for the machine-learning-calibrated low-cost sensors. In contrast, the performance of MLR is more dependent on random variations in the sensor hardware and co-located signals, and is also more sensitive to the length of the co-location period. We find that, subject to certain conditions, GPR is typically the best-performing method in our calibration setting, followed by Ridge regression and RF regression. However, we also highlight several key limitations of the machine learning methods, which will be crucial to consider in any co-location calibration. In particular, none of the methods is able to extrapolate to pollution levels well outside those encountered at the training stage. Ultimately, this is one of the key limiting factors when sensors are deployed away from the co-location site itself. Consequently, we find that the linear Ridge method, which best mitigates such extrapolation effects, typically performs as well as, or even better than, GPR after sensor relocation. Overall, our results highlight the potential of co-location methods paired with machine learning calibration techniques to reduce the cost of air pollution measurements, subject to careful consideration of the co-location training conditions, the choice of calibration variables, and the features of the calibration algorithm.
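
As a rough illustration of the co-location calibration idea described above, the sketch below fits Ridge, Random Forest, Gaussian Process, and multiple linear regression models to a synthetic stand-in for hourly co-location data. The column names, the data-generating process, and all hyperparameters are invented for illustration and are not taken from the study.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 1000  # roughly six weeks of hourly co-location data (synthetic)
df = pd.DataFrame({
    "sensor_no2": rng.normal(40, 15, n),    # raw low-cost sensor output (hypothetical)
    "temperature": rng.normal(12, 6, n),
    "rel_humidity": rng.uniform(30, 95, n),
})
# Synthetic "reference station" signal with a mild humidity interaction.
df["ref_no2"] = (0.8 * df["sensor_no2"]
                 + 0.1 * df["sensor_no2"] * (df["rel_humidity"] / 100)
                 + rng.normal(0, 3, n))

X = df[["sensor_no2", "temperature", "rel_humidity"]]
y = df["ref_no2"]
# Time-ordered split: calibrate on the first part of the record, evaluate on the rest.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False)

models = {
    "MLR": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
    "GPR": GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: out-of-sample R2 = {r2_score(y_te, model.predict(X_te)):.2f}")
```

With real co-location data, the extrapolation limits noted in the abstract would appear as degraded out-of-sample R2 whenever pollution levels leave the range seen during training.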


2019 ◽  
Vol 35 (22) ◽  
pp. 4656-4663 ◽  
Author(s):  
Oliver P Watson ◽  
Isidro Cortes-Ciriano ◽  
Aimee R Taylor ◽  
James A Watson

Abstract Motivation Artificial intelligence, trained via machine learning (e.g. neural nets, random forests) or computational statistical algorithms (e.g. support vector machines, ridge regression), holds much promise for the improvement of small-molecule drug discovery. However, small-molecule structure-activity data are high dimensional with low signal-to-noise ratios, and proper validation of predictive methods is difficult. It is poorly understood which, if any, of the currently available machine learning algorithms will best predict new candidate drugs. Results The quantile-activity bootstrap is proposed as a new model validation framework using quantile splits on the activity distribution function to construct training and testing sets. In addition, we propose two novel rank-based loss functions which penalize only the out-of-sample predicted ranks of high-activity molecules. The combination of these methods was used to assess the performance of neural nets, random forests, support vector machines (regression) and ridge regression applied to 25 diverse high-quality structure-activity datasets publicly available on ChEMBL. Model validation based on random partitioning of available data favours models that overfit and ‘memorize’ the training set, namely random forests and deep neural nets. Partitioning based on quantiles of the activity distribution correctly penalizes extrapolation of models onto structurally different molecules outside of the training data. Simpler, traditional statistical methods such as ridge regression can outperform state-of-the-art machine learning methods in this setting. In addition, our new rank-based loss functions give considerably different results from mean squared error, highlighting the necessity to define model optimality with respect to the decision task at hand. Availability and implementation All software and data are available as Jupyter notebooks found at https://github.com/owatson/QuantileBootstrap. Supplementary information Supplementary data are available at Bioinformatics online.
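
A minimal sketch of the quantile-split idea is given below. It is not the authors' implementation (that is available in the linked Jupyter notebooks) and uses randomly generated descriptors and activities purely to show the mechanics of training below an activity quantile and testing on the high-activity tail.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
features = rng.normal(size=(500, 50))   # placeholder molecular descriptors
activities = features @ rng.normal(size=50) + rng.normal(scale=2.0, size=500)

# Train on molecules below the q-th activity quantile, test on the rest,
# so the held-out set probes extrapolation toward high-activity compounds.
q = 0.8
threshold = np.quantile(activities, q)
train_idx = activities <= threshold
test_idx = ~train_idx

model = Ridge(alpha=1.0).fit(features[train_idx], activities[train_idx])
pred = model.predict(features[test_idx])
print("MSE on high-activity tail:", mean_squared_error(activities[test_idx], pred))

# A rank-based loss in the spirit of the paper would instead score the
# predicted ranks of the top-activity molecules rather than their squared errors.
```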


2019 ◽  
Vol 37 (15_suppl) ◽  
pp. e14138-e14138
Author(s):  
Beung-Chul AHN ◽  
Kyoung Ho Pyo ◽  
Dongmin Jung ◽  
Chun-Feng Xin ◽  
Chang Gon Kim ◽  
...  

e14138 Background: Immune checkpoint inhibitors have become a breakthrough therapy for various types of cancer. However, with an overall response rate of around 20% in clinical trials, there is no established way to accurately predict the aPD-1 response for an individual patient. PD-L1 expression or tumor-infiltrating lymphocytes may be used as indicators of response, but they are of limited value. We developed models using machine learning methods to predict the aPD-1 response. Methods: A total of 126 advanced NSCLC patients treated with aPD-1 therapy were enrolled. Their clinical characteristics, treatment outcomes, and adverse events were collected. The total clinical dataset (n = 126), consisting of 15 variables, was divided into two subsets: a discovery set (n = 63) and a test set (n = 63). Thirteen supervised learning algorithms, including support vector machine and regularized regression (lasso, ridge, elastic net), were applied to the discovery set for model development and to the test set for validation. Each model was evaluated using the ROC curve and cross-validation. The same methods were applied to the subset with additional flow cytometry data (n = 40). Results: The median age was 64 and 69.8% were male. Adenocarcinoma was predominant (69.8%) and twenty patients (15.1%) were driver-mutation positive. On the clinical dataset (n = 126), Ridge regression (AUC: 0.79) was the best-performing model. Of the 15 clinical variables, tumor burden, age, ECOG PS, and PD-L1 were the most important based on the random forest algorithm. When we merged the clinical and flow cytometry data, the Ridge regression model (AUC: 0.82) showed better performance than with clinical data alone. Among the 52 variables in the merged set, the most important immune markers were as follows: CD3+CD8+CD25+/Teff-CD28, CD3+CD8+CD25-/Teff-Ki-67, and CD3+CD8+CD25+/Teff-NY-ESO/Teff-PD-1, which indicate an activated tumor-specific T cell subset. Conclusions: Our machine learning-based models are beneficial for predicting aPD-1 responses. After further validation in an independent patient cohort, a supervised-learning-based, non-invasive predictive score could be established to predict the aPD-1 response.
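
The sketch below illustrates the general recipe reported here, a cross-validated ROC AUC for an L2-penalized (ridge-style) classifier, on simulated stand-ins for the clinical variables. The feature names, the choice of L2-penalized logistic regression as the ridge-type classifier, and all values are assumptions for illustration only; no patient data are used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 126
X = np.column_stack([
    rng.normal(64, 10, n),   # age
    rng.normal(50, 20, n),   # tumor burden (arbitrary units)
    rng.integers(0, 3, n),   # ECOG performance status
    rng.uniform(0, 100, n),  # PD-L1 expression (%)
])
y = rng.integers(0, 2, n)    # responder / non-responder label (random here)

# Standardize, then fit an L2-penalized logistic model; score by ROC AUC with 5-fold CV.
clf = make_pipeline(StandardScaler(), LogisticRegression(penalty="l2", C=1.0))
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print("cross-validated AUC: %.2f +/- %.2f" % (auc.mean(), auc.std()))
```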


2019 ◽  
Author(s):  
Sumukwo Chesang ◽  
Thomas K. Muasya ◽  
Kiplangat Ngeno

Abstract Genomic selection is a new breeding strategy which is rapidly becoming the method of choice for selection. It is useful in predicting the phenotypes of quantitative traits based on genome-wide markers of genotypes using conventional predictive models such as ridge regression BLUP. However, these conventional predictive models face a statistical challenge related to the high dimensionality of marker data and inter- and intra-allelic interactions, and they typically make strong assumptions. Machine learning models can be used as an alternative for the prediction of phenotypes due to their ability to address these challenges. Therefore, the aim of this study was to compare the predictive ability of machine learning models, a deep convolutional neural network (DeepGS) and a conventional artificial neural network (ANN), with the conventional statistical predictive model ridge regression best linear unbiased prediction (RR-BLUP) and a combination of DeepGS and RR-BLUP (Ensemble model) in predicting body weight (BW) of indigenous chicken based on genome-wide markers. The Pearson correlation coefficient (PCC) results for the four models were 0.891, 0.889, 0.892, and 0.812 for DeepGS, RR-BLUP, Ensemble, and ANN, respectively. This showed that DeepGS did not differ significantly (p > 0.05) from the other models; therefore, it can be used as a complement to the commonly used conventional models. For individuals with higher phenotypic values, the PCC results showed a drastic decrease in the performance of DeepGS, RR-BLUP, Ensemble, and ANN from 0.891, 0.889, 0.892, 0.845 to 0.315, 0.466, 0.342, 0.518, respectively. Therefore, more effort should be directed toward individuals with higher phenotypic values.
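
For readers unfamiliar with RR-BLUP, the sketch below shows the closely related formulation as ridge regression on a marker matrix, applied to simulated genotypes. The genotypes, marker effects, and shrinkage value are placeholders and are not taken from the indigenous chicken dataset.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
n_animals, n_snps = 500, 2000
# Simulated 0/1/2 genotype matrix and additive marker effects.
M = rng.integers(0, 3, size=(n_animals, n_snps)).astype(float)
true_effects = rng.normal(0, 0.05, n_snps)
bw = M @ true_effects + rng.normal(0, 1.0, n_animals)   # simulated body weight

# Ridge regression on markers with heavy shrinkage plays the role of RR-BLUP here;
# the penalty strength is an arbitrary illustrative choice.
train, test = slice(0, 400), slice(400, None)
model = Ridge(alpha=float(n_snps)).fit(M[train], bw[train])

pcc, _ = pearsonr(bw[test], model.predict(M[test]))
print(f"Pearson correlation on held-out animals: {pcc:.3f}")
```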


2020 ◽  
Author(s):  
Bao-Xin Xue ◽  
Mario Barbatti ◽  
Pavlo O. Dral

We present a machine learning (ML) method to accelerate the nuclear ensemble approach (NEA) for computing absorption cross sections. ML-NEA is used to calculate cross sections on vast ensembles of nuclear geometries to reduce the error due to insufficient statistical sampling. The electronic properties — excitation energies and oscillator strengths — are calculated with a reference electronic structure method only for relatively few points in the ensemble. Kernel-ridge-regression-based ML combined with the RE descriptor, as implemented in MLatom, is used to predict these properties for the remaining tens of thousands of points in the ensemble without incurring much additional computational cost. We demonstrate for two examples, benzene and a 9-dicyanomethylene derivative of acridine, that ML-NEA can produce statistically converged cross sections even for very challenging cases and with as few as several hundred training points.
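
A rough sketch of the ML-NEA workflow is shown below, using scikit-learn's KernelRidge in place of MLatom's kernel ridge regression with the RE descriptor. The geometry descriptors and excitation energies are synthetic placeholders; only the train-on-few, predict-on-many pattern is meant to be representative.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(4)
n_ensemble, n_ref, n_feat = 50000, 500, 100
descriptors = rng.normal(size=(n_ensemble, n_feat))   # placeholder geometry descriptors
w = rng.normal(size=n_feat)
excitation_e = descriptors @ w * 0.01 + 4.5           # synthetic excitation energies (eV)

# Reference electronic-structure calculations are only "run" for a few hundred points.
ref = rng.choice(n_ensemble, n_ref, replace=False)
krr = KernelRidge(kernel="rbf", alpha=1e-3, gamma=1e-2)
krr.fit(descriptors[ref], excitation_e[ref])

# Cheap ML predictions for the remaining tens of thousands of geometries,
# which would then enter the nuclear-ensemble cross-section sum.
pred_e = krr.predict(descriptors)
print("predicted excitation energies (first 5):", pred_e[:5])
```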


2020 ◽  
Vol 7 (Supplement_1) ◽  
pp. S281-S281
Author(s):  
Chengbo Zeng ◽  
Yunyu Xiao

Abstract Background More than 360,000 people were infected with COVID-19 in New York State (NYS) by the end of May 2020. Although expanded testing could effectively control the statewide COVID-19 outbreak, the county-level factors predicting the number of tests are unknown. Accurately identifying the county-level predictors of testing may contribute to more effective testing allocation across counties in NYS. This study leveraged multiple public datasets and machine learning algorithms to construct and compare county-level prediction models of COVID-19 testing in NYS. Methods Testing data through May 15th were extracted from the Department of Health in NYS. A total of 28 county-level predictors derived from multiple public datasets (e.g., American Community Survey and US Health Data) were used to construct the prediction models. Three machine learning algorithms, including generalized linear regression with the least absolute shrinkage and selection operator (LASSO), ridge regression, and regression tree, were used to identify the most important county-level predictors, adjusting for prevalence and incidence. Model performance was assessed using the mean square error (MSE), with a smaller MSE indicating better model performance. Results The testing rate was 70.3 per 1,000 people in NYS. Counties close to the epicenter (Rockland and Westchester) had high testing rates, while counties at the boundary of NYS and far from the epicenter (Chautauqua and Clinton) had low testing rates. The MSEs of linear regression with the LASSO penalty, ridge regression, and regression tree were 123.60, 40.59, and 298.0, respectively. Ridge regression was selected as the final model and revealed that the mental health provider rate was positively associated with testing (β=5.11, p=.04), while the proportion of religious adherents (β=-3.91, p=.05) was inversely related to the variation of testing rates across counties. Conclusion This study identified healthcare resources and the religious environment as the strongest predictors of spatial variations in COVID-19 testing across NYS. Structural or policy efforts should address the spatial variations and target the relevant county-level predictors to promote statewide testing. Disclosures All Authors: No reported disclosures
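
The sketch below mirrors the reported model comparison (LASSO, ridge, and a regression tree scored by MSE) on a synthetic county-level predictor matrix. The data-generating process, hyperparameter grids, and tree depth are assumptions for illustration, not the study's public datasets.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n_counties, n_predictors = 62, 28   # NYS has 62 counties; 28 predictors as in the study
X = rng.normal(size=(n_counties, n_predictors))
testing_rate = X @ rng.normal(size=n_predictors) + rng.normal(scale=5, size=n_counties)

models = {
    "LASSO": LassoCV(cv=5),
    "Ridge": RidgeCV(alphas=np.logspace(-2, 3, 20)),
    "Regression tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}
# Compare models by cross-validated mean squared error (smaller is better).
for name, model in models.items():
    mse = -cross_val_score(model, X, testing_rate, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"{name}: CV MSE = {mse:.1f}")
```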

