Robust Wasserstein profile inference and applications to machine learning

2019 ◽  
Vol 56 (3) ◽  
pp. 830-857 ◽  
Author(s):  
Jose Blanchet ◽  
Yang Kang ◽  
Karthyek Murthy

Abstract We show that several machine learning estimators, including the square-root least absolute shrinkage and selection operator (square-root lasso) and regularized logistic regression, can be represented as solutions to distributionally robust optimization problems. The associated uncertainty regions are based on suitably defined Wasserstein distances. Hence, our representations allow us to view regularization as a result of introducing an artificial adversary that perturbs the empirical distribution to account for out-of-sample effects in loss estimation. In addition, we introduce RWPI (robust Wasserstein profile inference), a novel inference methodology which extends methods inspired by empirical likelihood to the setting of optimal transport costs (of which Wasserstein distances are a particular case). We use RWPI to show how to optimally select the size of the uncertainty regions, and as a consequence we are able to choose regularization parameters for these machine learning estimators without the use of cross-validation. Numerical experiments are also given to validate our theoretical findings.
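A minimal numerical sketch of the regularization-as-robustification viewpoint described above (our illustration, not the paper's code): the square-root lasso objective is minimized with the penalty weight tied to the square root of a Wasserstein-ball radius, so fixing the radius fixes the regularization parameter without cross-validation. The data, the radius value, and the optimizer are placeholder assumptions rather than the RWPI-prescribed choices.

```python
# Hypothetical sketch: the square-root lasso seen through the DRO lens.
# The penalty weight is the square root of an (illustrative) Wasserstein
# radius delta, so choosing delta replaces cross-validation for lambda.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
beta_true = np.array([1.5, 0.0, -2.0, 0.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

delta = 1.0 / n          # illustrative Wasserstein-ball radius (assumption)
lam = np.sqrt(delta)     # radius -> regularization strength

def sqrt_lasso_objective(beta):
    # sqrt of empirical MSE plus l1 penalty: the square-root lasso objective,
    # which the paper identifies with a worst-case expected loss over a
    # Wasserstein ball around the empirical distribution.
    rmse = np.sqrt(np.mean((y - X @ beta) ** 2))
    return rmse + lam * np.sum(np.abs(beta))

beta_hat = minimize(sqrt_lasso_objective, np.zeros(d), method="Powell").x
print(np.round(beta_hat, 3))
```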

2018 ◽  
Vol 124 (5) ◽  
pp. 1284-1293 ◽  
Author(s):  
Alexander H. K. Montoye ◽  
Bradford S. Westgate ◽  
Morgan R. Fonley ◽  
Karin A. Pfeiffer

Wrist-worn accelerometers are gaining popularity for measurement of physical activity. However, few methods for predicting physical activity intensity from wrist-worn accelerometer data have been tested on data not used to create the methods (out-of-sample data). This study utilized two previously collected data sets [Ball State University (BSU) and Michigan State University (MSU)] in which participants wore a GENEActiv accelerometer on the left wrist while performing sedentary, lifestyle, ambulatory, and exercise activities in simulated free-living settings. Activity intensity was determined via direct observation. Four machine learning models (plus 2 combination methods) and six feature sets were used to predict activity intensity (30-s intervals) with the accelerometer data. Leave-one-out cross-validation and out-of-sample testing were performed to evaluate accuracy in activity intensity prediction, and classification accuracies were used to determine differences among feature sets and machine learning models. In out-of-sample testing, the random forest model (77.3–78.5%) had higher accuracy than other machine learning models (70.9–76.4%) and accuracy similar to combination methods (77.0–77.9%). Feature sets utilizing frequency-domain features had improved accuracy over other feature sets in leave-one-out cross-validation (92.6–92.8% vs. 87.8–91.9% in MSU data set; 79.3–80.2% vs. 76.7–78.4% in BSU data set) but similar or worse accuracy in out-of-sample testing (74.0–77.4% vs. 74.1–79.1% in MSU data set; 76.1–77.0% vs. 75.5–77.3% in BSU data set). All machine learning models outperformed the Euclidean norm minus one/GGIR method in out-of-sample testing (69.5–78.5% vs. 53.6–70.6%). From these results, we recommend out-of-sample testing to confirm generalizability of machine learning models. Additionally, random forest models and feature sets with only time-domain features provided the best accuracy for activity intensity prediction from a wrist-worn accelerometer. NEW & NOTEWORTHY This study includes in-sample and out-of-sample cross-validation of an alternate method for deriving meaningful physical activity outcomes from accelerometer data collected with a wrist-worn accelerometer. This method uses machine learning to directly predict activity intensity. By so doing, this study provides a classification model that may avoid high errors present with energy expenditure prediction while still allowing researchers to assess adherence to physical activity guidelines.
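A rough sketch of the evaluation protocol described above (not the authors' pipeline): a scikit-learn random forest scored with leave-one-participant-out splits on window-level features. The feature matrix, intensity labels, and participant count are synthetic placeholders.

```python
# Minimal sketch: leave-one-participant-out evaluation of a random forest
# that predicts activity intensity from 30-s window features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 12))            # one row per 30-s window of time-domain features
y = rng.integers(0, 3, size=600)          # sedentary / light / MVPA (toy labels)
groups = np.repeat(np.arange(20), 30)     # 20 hypothetical participants

model = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, groups=groups, cv=LeaveOneGroupOut())
print(f"leave-one-participant-out accuracy: {scores.mean():.3f}")
```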


Author(s):  
Burak Kocuk

In this paper, we consider a Kullback-Leibler divergence constrained distributionally robust optimization model. This model considers an ambiguity set that consists of all distributions whose Kullback-Leibler divergence to an empirical distribution is bounded. Utilizing the fact that this divergence measure has an exponential cone representation, we obtain the robust counterpart of the Kullback-Leibler divergence constrained distributionally robust optimization problem as a dual exponential cone constrained program under mild assumptions on the underlying optimization problem. The resulting conic reformulation of the original optimization problem can be directly solved by a commercial conic programming solver. We specialize our generic formulation to two classical optimization problems, namely, the Newsvendor Problem and the Uncapacitated Facility Location Problem. Our computational study in an out-of-sample analysis shows that the solutions obtained via the distributionally robust optimization approach yield significantly better performance in terms of the dispersion of the cost realizations while the central tendency deteriorates only slightly compared to the solutions obtained by stochastic programming.
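As an illustration of what a Kullback-Leibler-constrained worst-case expectation looks like in practice, the sketch below evaluates it for a toy newsvendor decision via the standard one-dimensional dual, min over alpha > 0 of alpha*rho + alpha*log(mean_i exp(cost_i/alpha)), rather than the paper's exponential cone reformulation. Demand scenarios, prices, and the radius rho are made-up values.

```python
# Illustrative sketch (not the paper's conic program): worst-case expected
# newsvendor cost over a KL ball of radius rho around the empirical distribution.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

rng = np.random.default_rng(0)
demand = rng.poisson(lam=20, size=100)           # empirical demand scenarios
c, p = 1.0, 3.0                                  # unit cost and unit selling price

def newsvendor_cost(order, d):
    return c * order - p * np.minimum(order, d)  # negative profit

def worst_case_cost(order, rho):
    costs = newsvendor_cost(order, demand)
    # dual of sup over {P : KL(P || empirical) <= rho} of E_P[cost]
    dual = lambda a: a * rho + a * (logsumexp(costs / a) - np.log(len(costs)))
    return minimize_scalar(dual, bounds=(1e-3, 1e3), method="bounded").fun

orders = np.arange(5, 41)
best = min(orders, key=lambda q: worst_case_cost(q, rho=0.05))
print("distributionally robust order quantity:", best)
```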


2021 ◽  
Author(s):  
Sebastian Böttcher ◽  
Elisa Bruno ◽  
Nikolay V Manyakov ◽  
Nino Epitashvili ◽  
Kasper Claes ◽  
...  

BACKGROUND Video electroencephalography recordings, routinely used in epilepsy monitoring units, are the gold standard for monitoring epileptic seizures. However, monitoring is also needed in the day-to-day lives of people with epilepsy, where video electroencephalography is not feasible. Wearables could fill this gap by providing patients with an accurate log of their seizures. OBJECTIVE Although there are already systems available that provide promising results for the detection of tonic-clonic seizures (TCSs), research in this area is often limited to detection from 1 biosignal modality or only during the night when the patient is in bed. The aim of this study is to provide evidence that supervised machine learning can detect TCSs from multimodal data in a new data set during daytime and nighttime. METHODS An extensive data set of biosignals from a multimodal watch worn by people with epilepsy was recorded during their stay in the epilepsy monitoring unit at 2 European clinical sites. From a larger data set of 243 enrolled participants, those who had data recorded during TCSs were selected, amounting to 10 participants with 21 TCSs. Accelerometry and electrodermal activity recorded by the wearable device were used for analysis, and seizure manifestation was annotated in detail by clinical experts. Ten accelerometry and 3 electrodermal activity features were calculated for sliding windows of variable size across the data. A gradient tree boosting algorithm was used for seizure detection, and the optimal parameter combination was determined in a leave-one-participant-out cross-validation on a training set of 10 seizures from 8 participants. The model was then evaluated on an out-of-sample test set of 11 seizures from the remaining 2 participants. To assess specificity, we additionally analyzed data from up to 29 participants without TCSs during the model evaluation. RESULTS In the leave-one-participant-out cross-validation, the model optimized for sensitivity could detect all 10 seizures with a false alarm rate of 0.46 per day in 17.3 days of data. In a test set of 11 out-of-sample TCSs, amounting to 8.3 days of data, the model could detect 10 seizures and produced no false positives. Increasing the test set to include data from 28 more participants without additional TCSs resulted in a false alarm rate of 0.19 per day in 78 days of wearable data. CONCLUSIONS We show that a gradient tree boosting machine can robustly detect TCSs from multimodal wearable data in an original data set and that even with very limited training data, supervised machine learning can achieve a high sensitivity and low false-positive rate. This methodology may offer a promising way to approach wearable-based nonconvulsive seizure detection.
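A purely structural sketch (synthetic features and labels, not the study data) of the detection step: a gradient tree boosting classifier trained on windowed accelerometry and electrodermal activity features from training participants and scored on held-out windows, reporting sensitivity and false alarms per day.

```python
# Toy sketch of the detection step only; window length and class balance are assumptions.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(1)
window_s = 10                                   # hypothetical window length in seconds
n_train, n_test = 50_000, 20_000
X_train = rng.normal(size=(n_train, 13))        # 10 ACC + 3 EDA features per window
y_train = rng.random(n_train) < 0.001           # rare seizure windows (synthetic)
X_test = rng.normal(size=(n_test, 13))
y_test = rng.random(n_test) < 0.001

clf = HistGradientBoostingClassifier(max_iter=200).fit(X_train, y_train)
pred = clf.predict(X_test)

sensitivity = pred[y_test].mean() if y_test.any() else float("nan")
false_alarms = pred[~y_test].sum()
days = n_test * window_s / 86_400
print(f"sensitivity {sensitivity:.2f}, {false_alarms / days:.2f} false alarms/day")
```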


2020 ◽  
Author(s):  
Bart P. G. Van Parys ◽  
Peyman Mohajerin Esfahani ◽  
Daniel Kuhn

We study stochastic programs where the decision maker cannot observe the distribution of the exogenous uncertainties but has access to a finite set of independent samples from this distribution. In this setting, the goal is to find a procedure that transforms the data to an estimate of the expected cost function under the unknown data-generating distribution, that is, a predictor, and an optimizer of the estimated cost function that serves as a near-optimal candidate decision, that is, a prescriptor. As functions of the data, predictors and prescriptors constitute statistical estimators. We propose a meta-optimization problem to find the least conservative predictors and prescriptors subject to constraints on their out-of-sample disappointment. The out-of-sample disappointment quantifies the probability that the actual expected cost of the candidate decision under the unknown true distribution exceeds its predicted cost. Leveraging tools from large deviations theory, we prove that this meta-optimization problem admits a unique solution: The best predictor-prescriptor-pair is obtained by solving a distributionally robust optimization problem over all distributions within a given relative entropy distance from the empirical distribution of the data. This paper was accepted by Chung Piaw Teo, optimization.
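In our own notation (a paraphrase of the abstract, not the paper's exact statement), the out-of-sample disappointment of a predictor and prescriptor built from the data, and the distributionally robust predictor that the paper shows to be optimal, can be written as follows; the orientation of the divergence and the precise exponential decay-rate constraint follow the paper and are not pinned down by the abstract.

```latex
% Our notation: loss \ell, i.i.d. samples \xi_1,\dots,\xi_N \sim \mathbb{P},
% prescriptor \hat{x}, predictor \hat{c}.  Out-of-sample disappointment:
\[
  \mathbb{P}^{N}\!\Bigl[\,
    \mathbb{E}_{\mathbb{P}}\bigl[\ell\bigl(\hat{x}(\xi_{1:N}),\,\xi\bigr)\bigr]
    \;>\; \hat{c}(\xi_{1:N})
  \,\Bigr].
\]
% The least conservative pair whose disappointment decays exponentially in N
% is the distributionally robust one,
\[
  \hat{c}(\xi_{1:N})
  \;=\;
  \sup_{\mathbb{Q}\in\mathcal{B}_{r}(\hat{\mathbb{P}}_{N})}
  \mathbb{E}_{\mathbb{Q}}\bigl[\ell(\hat{x},\xi)\bigr],
  \qquad
  \mathcal{B}_{r}(\hat{\mathbb{P}}_{N})
  =\bigl\{\mathbb{Q}:\ \text{relative entropy distance from }
  \hat{\mathbb{P}}_{N}\ \text{at most } r\bigr\},
\]
% where \hat{\mathbb{P}}_N is the empirical distribution of the data.
```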


2021 ◽  
Author(s):  
Rosa Lavelle-Hill ◽  
Anjali Mazumder ◽  
James Goulding ◽  
Gavin Smith ◽  
Todd Landman

Abstract An estimated 40 million people are in some form of modern slavery across the globe. Understanding the factors that make any particular individual or geographical region vulnerable to such abuse is essential for the development of effective interventions and policy. Efforts to isolate and assess the importance of individual drivers statistically are impeded by two key challenges: data scarcity and high dimensionality. The hidden nature of modern slavery restricts the available data points, and the large number of candidate variables that are potentially predictive of slavery inflates the feature space exponentially. The result is a highly problematic "small-n, large-p" setting, where overfitting and multi-collinearity can render more traditional statistical approaches inapplicable. Recent advances in non-parametric computational methods, however, offer scope to overcome such challenges. We present an approach that combines non-linear machine learning models and strict cross-validation methods with novel variable importance techniques, emphasising the stability of model explanations via Rashomon-set analysis. This approach is used to model the prevalence of slavery in 48 countries, with results bringing to light important predictive factors, such as a country's capacity to protect the physical security of women, which has previously been under-emphasised in the literature. Out-of-sample estimates of slavery prevalence are then made for countries where no survey data currently exists.
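One way to picture the Rashomon-set stability idea mentioned above (a conceptual sketch, not the authors' procedure): fit a family of models, keep those whose cross-validated score is within a tolerance of the best, and examine how a feature's permutation importance varies across that set. The data, model family, and tolerance below are synthetic placeholders.

```python
# Conceptual sketch of Rashomon-set importance stability on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(48, 20))                   # 48 countries, 20 candidate drivers
y = X[:, 0] * 0.8 + rng.normal(scale=0.5, size=48)

candidates = [RandomForestRegressor(max_depth=d, random_state=s)
              for d in (2, 4, 8) for s in range(5)]
scores = [cross_val_score(m, X, y, cv=5).mean() for m in candidates]
rashomon = [m for m, s in zip(candidates, scores) if s >= max(scores) - 0.05]

# Spread of feature 0's permutation importance across the near-optimal models.
importances = [permutation_importance(m.fit(X, y), X, y, random_state=0)
               .importances_mean[0] for m in rashomon]
print(f"feature-0 importance across Rashomon set: "
      f"{np.mean(importances):.3f} +/- {np.std(importances):.3f}")
```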


Author(s):  
Edward Anderson ◽  
Andy Philpott

Sample average approximation is a popular approach to solving stochastic optimization problems. It has been widely observed that some form of robustification of these problems often improves the out-of-sample performance of the solution estimators. In estimation problems, this improvement boils down to a trade-off between the opposing effects of bias and shrinkage. This paper aims to characterize the features of more general optimization problems that exhibit this behaviour when a distributionally robust version of the sample average approximation problem is used. The paper restricts attention to quadratic problems for which sample average approximation solutions are unbiased and shows that expected out-of-sample performance can be calculated for small amounts of robustification and depends on the type of distributionally robust model used and properties of the underlying ground-truth probability distribution of random variables. The paper was written as part of a New Zealand funded research project that aimed to improve stochastic optimization methods in the electric power industry. The authors of the paper have worked together in this domain for the past 25 years.
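A toy Monte Carlo illustration of the bias/shrinkage trade-off described above (our example, not one from the paper): for min_x E[(x - xi)^2], the sample average approximation solution is the sample mean, and a small amount of shrinkage, standing in here for robustification, lowers the expected out-of-sample cost despite introducing bias.

```python
# Monte Carlo sketch: SAA vs. a slightly shrunk (robustified) estimator on a
# quadratic problem with known ground truth.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 1.0, 2.0, 10, 20_000

def out_of_sample_cost(x):
    # E[(x - xi)^2] = (x - mu)^2 + sigma^2 under the ground-truth distribution
    return (x - mu) ** 2 + sigma ** 2

saa_cost, robust_cost = [], []
for _ in range(reps):
    sample = rng.normal(mu, sigma, size=n)
    x_saa = sample.mean()
    x_rob = sample.mean() / 1.1           # illustrative 10% shrinkage toward zero
    saa_cost.append(out_of_sample_cost(x_saa))
    robust_cost.append(out_of_sample_cost(x_rob))

print(f"SAA    mean out-of-sample cost: {np.mean(saa_cost):.4f}")
print(f"robust mean out-of-sample cost: {np.mean(robust_cost):.4f}")
```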


2018 ◽  
Vol 71 ◽  
pp. 00009 ◽  
Author(s):  
Maciej Bodlak ◽  
Jan Kudełko ◽  
Andrzej Zibrow

In order to develop a method for forecasting the costs generated by rock and gas outbursts for the hard coal deposit "Nowa Ruda Pole Piast Wacław-Lech", the analyses presented in this paper focused on the key factors influencing the discussed phenomenon. Part of this research consisted in developing a prediction model of the extent of rock and gas outbursts with regard to the most probable mass of rock [Mg] and volume of gas [m³] released in an outburst and to the length of collapsed and/or damaged workings [running meters, rm]. For this purpose, a machine learning method was used, i.e. the "random forests method" with the "XGBoost" machine learning algorithm. After performing the machine learning process using cross-validation with five iterations, the lowest possible values of the root-mean-square prediction error (RMSE) were achieved. The obtained model and the program written in the programming language "R" were verified on the basis of the RMSE values, prediction matching graphs, an out-of-sample analysis, the importance ranking of input parameters and the sensitivity of the model in forecasts for hypothetical conditions.
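A schematic Python counterpart of the modelling step (the authors' implementation is in R and is not reproduced here): an XGBoost regressor scored by cross-validated RMSE on placeholder inputs and a placeholder target.

```python
# Sketch: cross-validated RMSE of an XGBoost regressor on synthetic data.
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))          # placeholder geological / mining inputs
y = rng.gamma(2.0, 50.0, size=120)     # placeholder target, e.g. mass of rock [Mg]

model = XGBRegressor(n_estimators=300, max_depth=3, learning_rate=0.05)
rmse = -cross_val_score(model, X, y, cv=5,
                        scoring="neg_root_mean_squared_error").mean()
print(f"cross-validated RMSE: {rmse:.1f} Mg")
```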


2020 ◽  
Vol 25 (40) ◽  
pp. 4296-4302 ◽  
Author(s):  
Yuan Zhang ◽  
Zhenyan Han ◽  
Qian Gao ◽  
Xiaoyi Bai ◽  
Chi Zhang ◽  
...  

Background: β thalassemia is a common monogenic genetic disease that is very harmful to human health. The disease arises from deletion of or defects in the β-globin gene, which reduces synthesis of the β-globin chain and leaves a relative excess of α-chains. The excess α-chains form inclusion bodies that deposit on the red cell membrane, decreasing the cells' deformability and leading to their massive destruction in the spleen, which produces a group of hereditary haemolytic diseases. Methods: In this work, machine learning algorithms were employed to build a prediction model for inhibitors against K562 cells based on 117 inhibitors and 190 non-inhibitors. Results: The overall accuracies (ACC) of a 10-fold cross-validation test and an independent set test using AdaBoost were 83.1% and 78.0%, respectively, surpassing Bayes Net, Random Forest, Random Tree, C4.5, SVM, KNN and Bagging. Conclusion: This study indicates that AdaBoost can be applied to build a learning model for the prediction of inhibitors against K562 cells.
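A minimal sketch of the stated evaluation protocol (synthetic molecular descriptors, not the paper's data set): an AdaBoost classifier separating inhibitors from non-inhibitors, assessed by 10-fold cross-validation.

```python
# Sketch: 10-fold cross-validated accuracy of AdaBoost on a 117 vs. 190 split.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(307, 50))                 # placeholder descriptors
y = np.r_[np.ones(117), np.zeros(190)]         # 1 = inhibitor, 0 = non-inhibitor

clf = AdaBoostClassifier(n_estimators=200, random_state=0)
acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
print(f"10-fold cross-validated accuracy: {acc:.3f}")
```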


2021 ◽  
Vol 14 (3) ◽  
pp. 119
Author(s):  
Fabian Waldow ◽  
Matthias Schnaubelt ◽  
Christopher Krauss ◽  
Thomas Günter Fischer

In this paper, we demonstrate how a well-established machine learning-based statistical arbitrage strategy can be successfully transferred from equity to futures markets. First, we preprocess futures time series composed of front months to render them suitable for our returns-based trading framework and compile a data set of 60 futures covering nearly 10 trading years. Next, we train several machine learning models to predict whether the h-day-ahead return of each future out- or underperforms the corresponding cross-sectional median return. Finally, we enter long/short positions in the top/flop-k futures for a duration of h days and assess the financial performance of the resulting portfolio in an out-of-sample testing period. We find that the machine learning models yield statistically significant out-of-sample break-even transaction costs of 6.3 bp, a clear challenge to the semi-strong form of market efficiency. We close by discussing sources of profitability and the robustness of our findings.
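A stylised sketch of the trading logic (not the authors' full pipeline): each period, futures are ranked by a model's predicted probability of beating the cross-sectional median h-day return, the top-k are bought and the flop-k sold short, and positions are held for h days. Returns and predicted probabilities below are random placeholders; transaction costs are ignored.

```python
# Sketch of the ranking-based long/short futures portfolio with holding period h.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=250, freq="B")
futures = [f"F{i:02d}" for i in range(60)]
returns = pd.DataFrame(rng.normal(0, 0.01, (250, 60)), index=dates, columns=futures)
proba = pd.DataFrame(rng.random((250, 60)), index=dates, columns=futures)  # placeholder model output

h, k = 5, 5
pnl = []
for t in range(0, 250 - h, h):
    ranks = proba.iloc[t].rank()
    longs = ranks.nlargest(k).index            # top-k predicted outperformers
    shorts = ranks.nsmallest(k).index          # flop-k predicted underperformers
    fwd = (1 + returns.iloc[t + 1:t + 1 + h]).prod() - 1   # h-day forward return
    pnl.append(fwd[longs].mean() - fwd[shorts].mean())
print(f"mean round-trip return per trade: {np.mean(pnl):.4%}")
```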


Energies ◽  
2021 ◽  
Vol 14 (4) ◽  
pp. 1055
Author(s):  
Qian Sun ◽  
William Ampomah ◽  
Junyu You ◽  
Martha Cather ◽  
Robert Balch

Machine-learning technologies have exhibited robust competence in solving many petroleum engineering problems. Their accurate predictions and fast computational speed make it practical to run large volumes of time-consuming engineering processes such as history matching and field development optimization. The Southwest Regional Partnership on Carbon Sequestration (SWP) project requires rigorous history-matching and multi-objective optimization processes, which plays to the strengths of machine-learning approaches. Although the machine-learning proxy models are trained and validated before being used to solve practical problems, their error margin inevitably introduces uncertainty into the results. In this paper, a hybrid numerical/machine-learning workflow for solving various optimization problems is presented. By coupling expert machine-learning proxies with a global optimizer, the workflow successfully solves the history-matching and CO2 water-alternating-gas (WAG) design problems with low computational overhead. The history-matching work considers the heterogeneity of multiphase relative permeability characteristics, and the CO2-WAG injection design takes multiple techno-economic objective functions into account. This work trained an expert response surface, a support vector machine, and a multi-layer neural network as proxy models to effectively learn the high-dimensional nonlinear data structure. The proposed workflow suggests revisiting the high-fidelity numerical simulator for validation purposes. The experience gained from this work provides valuable guiding insights for similar CO2 enhanced oil recovery (EOR) projects.
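A conceptual sketch of coupling a trained proxy with a global optimizer (the model choice, design variables, bounds, and objective are placeholders, not the SWP models): the proxy stands in for the reservoir simulator inside the optimization loop, and the high-fidelity simulator is revisited only to validate the suggested design.

```python
# Sketch: a neural-network proxy coupled with a global optimizer.
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 4))          # placeholder design variables (e.g. WAG ratio, cycle time, rates)
y = -np.sum((X - 0.6) ** 2, axis=1)           # stand-in techno-economic objective

proxy = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                     random_state=0).fit(X, y)

# Maximize the proxy's prediction (minimize its negative) over the design space.
result = differential_evolution(lambda z: -proxy.predict(z.reshape(1, -1))[0],
                                bounds=[(0, 1)] * 4, seed=0)
print("suggested design:", np.round(result.x, 2))
# A final run of the high-fidelity simulator at result.x would validate the design.
```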

