Stable Prediction with Model Misspecification and Agnostic Distribution Shift

2020 ◽  
Vol 34 (04) ◽  
pp. 4485-4492
Author(s):  
Kun Kuang ◽  
Ruoxuan Xiong ◽  
Peng Cui ◽  
Susan Athey ◽  
Bo Li

For many machine learning algorithms, two main assumptions are required to guarantee performance. One is that the test data are drawn from the same distribution as the training data, and the other is that the model is correctly specified. In real applications, however, we often have little prior knowledge of the test data and of the underlying true model. Under model misspecification, agnostic distribution shift between training and test data leads to inaccurate parameter estimation and unstable prediction on unknown test data. To address these problems, we propose a novel Decorrelated Weighting Regression (DWR) algorithm which jointly optimizes a variable decorrelation regularizer and a weighted regression model. The variable decorrelation regularizer estimates a weight for each sample such that the variables are decorrelated on the weighted training data. These weights are then used in the weighted regression to improve the accuracy of the estimated effect of each variable, thus helping to improve the stability of prediction across unknown test data. Extensive experiments clearly demonstrate that our DWR algorithm can significantly improve the accuracy of parameter estimation and the stability of prediction under model misspecification and agnostic distribution shift.
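As an illustration of the two components, a minimal sketch follows (NumPy/SciPy only; helper names such as dwr_fit are hypothetical). Sample weights are learned so that the covariates become decorrelated under reweighting, and those weights are then used in a weighted least-squares fit. The paper optimizes the two terms jointly; this sketch runs them sequentially for brevity and is not the authors' released implementation.

```python
import numpy as np
from scipy.optimize import minimize

def decorrelation_loss(v, X):
    """Sum of squared pairwise weighted covariances between distinct covariates."""
    w = np.exp(v)                      # positive sample weights
    w = w / w.sum()
    mu = X.T @ w                       # weighted means
    Xc = X - mu                        # centered covariates
    C = (Xc * w[:, None]).T @ Xc       # weighted covariance matrix
    off_diag = C - np.diag(np.diag(C))
    return np.sum(off_diag ** 2)

def dwr_fit(X, y):
    # Step 1: learn sample weights that decorrelate the covariates.
    res = minimize(decorrelation_loss, np.zeros(len(X)), args=(X,),
                   method="L-BFGS-B", options={"maxiter": 200})
    w = np.exp(res.x)
    w = w / w.sum()
    # Step 2: weighted least squares (with intercept) using the learned weights.
    sw = np.sqrt(w)[:, None]
    Xw = np.hstack([X, np.ones((len(X), 1))]) * sw
    beta, *_ = np.linalg.lstsq(Xw, y * np.sqrt(w), rcond=None)
    return beta, w

# Toy usage: strongly correlated covariates and a noisy linear outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = 0.8 * X[:, 0] + 0.2 * rng.normal(size=200)
y = 2 * X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=200)
beta, w = dwr_fit(X, y)
print("coefficients (last entry is the intercept):", beta)
```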

Author(s):  
Michael Auer ◽  
Mark D. Griffiths

Abstract: Player protection and harm minimization have become increasingly important in the gambling industry, along with the promotion of responsible gambling (RG). Among the most widespread RG tools that gaming operators provide are limit-setting tools that help players restrict the amount of time and/or money they spend gambling. Research suggests that limit-setting significantly reduces the amount of money that players spend. If limit-setting is to be encouraged as a way of facilitating responsible gambling, it is important to know which variables are important in getting individuals to set and change limits in the first place. In the present study, 33 variables describing the behavior of Norsk Tipping players (N = 70,789) from January to March 2017 were computed. These 33 behavioral variables were then used to predict the likelihood of gamblers changing their monetary limit between April and June 2017. The 70,789 players were randomly split into a training dataset of 56,532 players and an evaluation set of 14,157 players (corresponding to an 80/20 split). The results demonstrated that it is possible to predict future limit-setting based on player behavior. The random forest algorithm appeared to predict limit-changing behavior much better than the other algorithms; however, on the independent test data its accuracy dropped significantly. The gradient boosting machine algorithm delivered the best performance on the test data, with only a small decrease in accuracy compared to the training data. The most important variables for predicting future limit-setting with the gradient boosting machine algorithm were receiving feedback that 80% of the personal monthly global loss limit had been reached, the personal monthly loss limit, the amount bet, theoretical loss, and whether players had increased their limits in the past. With the help of predictive analytics, players with a high likelihood of changing their limits can be proactively approached.
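For readers who want to reproduce the general workflow, the sketch below uses synthetic data in place of the Norsk Tipping records (the feature count, sample size and class balance are stand-ins, not the study's data) together with scikit-learn's gradient boosting classifier and the same 80/20 split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 33 behavioral variables and a binary
# "changed monetary limit in the following quarter" target.
X, y = make_classification(n_samples=10_000, n_features=33, n_informative=8,
                           weights=[0.85, 0.15], random_state=0)

# 80/20 split into training and evaluation sets, mirroring the study design.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=0)

gbm = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("hold-out AUC:", roc_auc_score(y_test, gbm.predict_proba(X_test)[:, 1]))

# Feature importances give a rough analogue of the variable ranking reported above.
print("top-5 variable indices:", np.argsort(gbm.feature_importances_)[::-1][:5])
```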


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Burak Cankaya ◽  
Berna Eren Tokgoz ◽  
Ali Dag ◽  
K.C. Santosh

Purpose: This paper aims to propose a machine learning-based automatic labeling methodology for chemical tanker activities that can be applied to any port with any number of active tankers, and to identify the important predictors. The methodology can be applied to any type of activity tracking that is based on automatically generated geospatial data.
Design/methodology/approach: The proposed methodology uses three machine learning algorithms (artificial neural networks, support vector machines (SVMs) and random forest) along with information fusion (IF)-based sensitivity analysis to classify chemical tanker activities. The data set is split into training and test data by vessel, with two vessels in the training data and one in the test data set. Important predictors were identified using a receiver operating characteristic comparative approach, and overall variable importance was calculated using IF from the top models.
Findings: Results show that an SVM model has the best balance between sensitivity and specificity, at 93.5% and 91.4%, respectively. Speed, acceleration and change in course over ground are identified as the most important predictors for classifying vessel activity.
Research limitations/implications: The study evaluates vessel movements while waiting between different terminals within the same port, but not movements between different ports for tank-cleaning activities.
Practical implications: The findings in this study can be used by port authorities, shipping companies, vessel operators and other stakeholders for decision support, performance tracking and automated alerts.
Originality/value: This analysis makes an original contribution to the existing literature by defining and demonstrating a methodology that can automatically label vehicle activity based on location data and identify characteristics of the activity by finding important location-based predictors that effectively classify the activity status.
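A hedged sketch of the vessel-based train/test split and an SVM classifier of the kind described above, with synthetic features standing in for speed, acceleration and change in course over ground (this is not the authors' pipeline, and the IF-based sensitivity analysis is omitted).

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic AIS-like records: [speed, acceleration, change_in_course_over_ground],
# plus a vessel id and a binary activity label.
X = rng.normal(size=(3000, 3))
vessel = rng.integers(0, 3, size=3000)            # three vessels
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=3000) > 0).astype(int)

# Split by vessel, as in the paper: two vessels for training, one held out for testing.
train, test = vessel != 2, vessel == 2

clf = make_pipeline(StandardScaler(), SVC())
clf.fit(X[train], y[train])
pred = clf.predict(X[test])
print("sensitivity:", recall_score(y[test], pred))
print("specificity:", recall_score(y[test], pred, pos_label=0))
```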


Author(s):  
João P. de Almeida Martins ◽  
Markus Nilsson ◽  
Björn Lampinen ◽  
Marco Palombo ◽  
Peter T. While ◽  
...  

ABSTRACT: Specific features of white-matter microstructure can be investigated by using biophysical models to interpret relaxation-diffusion MRI brain data. Although more intricate models have the potential to reveal more details of the tissue, they also incur time-consuming parameter estimation that may converge to inaccurate solutions due to a prevalence of local minima in a degenerate fitting landscape. Machine-learning fitting algorithms have been proposed to accelerate the parameter estimation and increase the robustness of the attained estimates. So far, learning-based fitting approaches have been restricted to lower-dimensional microstructural models where dense sets of training data are easy to generate. Moreover, the degree to which machine learning can alleviate the degeneracy problem is poorly understood. For conventional least-squares solvers, it has been shown that degeneracy can be avoided by acquisition with optimized relaxation-diffusion-correlation protocols that include tensor-valued diffusion encoding; whether machine-learning techniques can offset these acquisition requirements remains to be tested. In this work, we employ deep neural networks to vastly accelerate the fitting of a recently introduced high-dimensional relaxation-diffusion model of tissue microstructure. We also develop strategies for assessing the accuracy and sensitivity of function-fitting networks and use those strategies to explore the impact of acquisition protocol design on the performance of the network. The developed learning-based fitting pipelines were tested on relaxation-diffusion data acquired with optimized and sub-sampled acquisition protocols. We found no evidence that machine-learning algorithms can by themselves replace a careful design of the acquisition protocol or correct for a degenerate fitting landscape.
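The following sketch illustrates the general learning-based fitting idea with a toy mono-exponential forward model and a small fully connected network; the actual relaxation-diffusion model, network architecture and acquisition protocols are far more elaborate and are not reproduced here.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def forward_model(params, b_values):
    """Toy mono-exponential stand-in for the relaxation-diffusion signal model."""
    s0, adc = params[:, :1], params[:, 1:2]
    return s0 * np.exp(-adc * b_values)

# Simulate a dense training set of (signal, parameter) pairs over a fixed protocol.
b_values = np.linspace(0.0, 3.0, 16)[None, :]
params = np.column_stack([rng.uniform(0.5, 1.5, 20_000),
                          rng.uniform(0.1, 3.0, 20_000)])
signals = forward_model(params, b_values) + rng.normal(scale=0.01, size=(20_000, 16))

# The network learns the inverse mapping: measured signal -> model parameters.
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=50).fit(signals, params)

# Fitting new data is then a single forward pass instead of per-voxel least squares.
test_params = np.array([[1.0, 1.2]])
test_signal = forward_model(test_params, b_values)
print("true:", test_params, "estimated:", net.predict(test_signal))
```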


2021 ◽  
Vol 2096 (1) ◽  
pp. 012099
Author(s):  
A P Chukhnov ◽  
Y S Ivanov

Abstract: Machine learning algorithms can be vulnerable to many forms of attack aimed at causing machine learning systems to make deliberate errors. The article provides an overview of attack techniques against models and training datasets that are intended to have a destructive (poisoning) effect. Experiments were carried out implementing the existing attacks against various models. A comparative analysis of the resistance of the models most frequently used in applied problems to such destructive information influences was prepared. The models are shown to remain stable when up to 50% of the training data is poisoned.
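The article does not reproduce its attack code, so the sketch below shows one common poisoning attack, label flipping, on a synthetic dataset, and measures how test accuracy changes as the poisoned fraction grows to 50%.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
for frac in [0.0, 0.1, 0.25, 0.5]:
    y_poisoned = y_tr.copy()
    # Label-flipping attack on a randomly chosen fraction of the training set.
    idx = rng.choice(len(y_tr), size=int(frac * len(y_tr)), replace=False)
    y_poisoned[idx] = 1 - y_poisoned[idx]
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_poisoned)
    acc = accuracy_score(y_te, model.predict(X_te))
    print(f"poisoned fraction {frac:.0%}: test accuracy {acc:.3f}")
```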


2019 ◽  
Author(s):  
Andrew Medford ◽  
Shengchun Yang ◽  
Fuzhu Liu

Understanding the interaction of multiple types of adsorbate molecules on solid surfaces is crucial to establishing the stability of catalysts under various chemical environments. Computational studies on the high coverage and mixed coverages of reaction intermediates are still challenging, especially for transition-metal compounds. In this work, we present a framework to predict differential adsorption energies and identify low-energy structures under high- and mixed-adsorbate coverages on oxide materials. The approach uses Gaussian process machine-learning models with quantified uncertainty in conjunction with an iterative training algorithm to actively identify the training set. The framework is demonstrated for the mixed adsorption of CHx, NHx and OHx species on the oxygen vacancy and pristine rutile TiO2(110) surface sites. The results indicate that the proposed algorithm is highly efficient at identifying the most valuable training data, and is able to predict differential adsorption energies with a mean absolute error of ~0.3 eV based on <25% of the total DFT data. The algorithm is also used to identify 76% of the low-energy structures based on <30% of the total DFT data, enabling construction of surface phase diagrams that account for high and mixed coverage as a function of the chemical potential of C, H, O, and N. Furthermore, the computational scaling indicates the algorithm scales nearly linearly (N^1.12) as the number of adsorbates increases. This framework can be directly extended to metals, metal oxides, and other materials, providing a practical route toward the investigation of the behavior of catalysts under high-coverage conditions.
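A schematic sketch of the uncertainty-driven iterative training loop described above, using a generic Gaussian process regressor on toy configurations; the descriptors, target values and stopping rule here are placeholders, not the authors' adsorption-energy framework.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
# Toy "candidate configurations" and a hidden target standing in for DFT adsorption energies.
X_pool = rng.uniform(-3, 3, size=(500, 4))
y_pool = np.sin(X_pool).sum(axis=1) + rng.normal(scale=0.05, size=500)

labeled = list(rng.choice(500, size=10, replace=False))   # small initial training set
gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)

for _ in range(40):
    gp.fit(X_pool[labeled], y_pool[labeled])
    mean, std = gp.predict(X_pool, return_std=True)
    # Query the unlabeled configuration whose prediction is most uncertain.
    candidates = [i for i in np.argsort(-std) if i not in labeled]
    labeled.append(candidates[0])

gp.fit(X_pool[labeled], y_pool[labeled])
print(f"trained on {len(labeled)} of {len(X_pool)} configurations")
print("mean absolute error on the full pool:",
      np.mean(np.abs(gp.predict(X_pool) - y_pool)))
```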


2018 ◽  
Vol 6 (2) ◽  
pp. 283-286
Author(s):  
M. Samba Siva Rao ◽  
M. Yaswanth ◽  
K. Raghavendra Swamy ◽  
...  

2020 ◽  
Vol 8 (Suppl 3) ◽  
pp. A62-A62
Author(s):  
Dattatreya Mellacheruvu ◽  
Rachel Pyke ◽  
Charles Abbott ◽  
Nick Phillips ◽  
Sejal Desai ◽  
...  

Background: Accurately identified neoantigens can be effective therapeutic agents in both adjuvant and neoadjuvant settings. A key challenge for neoantigen discovery has been the availability of accurate prediction models for MHC peptide presentation. We have shown previously that our proprietary model based on (i) large-scale, in-house mono-allelic data, (ii) custom features that model antigen processing, and (iii) advanced machine learning algorithms has strong performance. We have extended this work by systematically integrating large quantities of high-quality, publicly available data, implementing new modelling algorithms, and rigorously testing our models. These extensions lead to substantial improvements in performance and generalizability. Our algorithm, named Systematic HLA Epitope Ranking Pan Algorithm (SHERPA™), is integrated into the ImmunoID NeXT Platform®, our immuno-genomics and transcriptomics platform specifically designed to enable the development of immunotherapies.
Methods: In-house immunopeptidomic data were generated using stably transfected HLA-null K562 cell lines that express a single HLA allele of interest, followed by immunoprecipitation using the W6/32 antibody and LC-MS/MS. Public immunopeptidomics data were downloaded from repositories such as MassIVE and processed uniformly using in-house pipelines to generate peptide lists filtered at a 1% false discovery rate. Other metrics (features) were either extracted from source data or generated internally by re-processing samples on the ImmunoID NeXT Platform.
Results: We generated large-scale, high-quality immunopeptidomics data using approximately 60 mono-allelic cell lines that unambiguously assign peptides to their presenting alleles to create our primary models. Briefly, our primary 'binding' algorithm models MHC-peptide binding using the peptide and binding pockets, while our primary 'presentation' model uses additional features to model antigen processing and presentation. Both primary models have significantly higher precision across all recall values in multiple test data sets, including mono-allelic cell lines and multi-allelic tissue samples. To further improve the performance of our model, we expanded the diversity of our training set using high-quality, publicly available mono-allelic immunopeptidomics data. Furthermore, multi-allelic data were integrated by resolving peptide-to-allele mappings using our primary models. We then trained a new model using the expanded training data and a new composite machine learning architecture. The resulting secondary model further improves performance and generalizability across several tissue samples.
Conclusions: Improving technologies for neoantigen discovery is critical for many therapeutic applications, including personalized neoantigen vaccines and neoantigen-based biomarkers for immunotherapies. Our new and improved algorithm (SHERPA) has significantly higher performance than a state-of-the-art public algorithm and furthers this objective.
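SHERPA itself is proprietary, so the sketch below only illustrates the kind of precision-recall comparison referred to above, using synthetic presentation labels and scores for a baseline and an extended model.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.default_rng(0)
# Synthetic "presented vs. not presented" labels and scores from two hypothetical models.
y = rng.integers(0, 2, size=5000)
scores_baseline = 0.3 * y + rng.random(5000)      # weaker separation
scores_extended = 0.6 * y + rng.random(5000)      # stronger separation

for name, s in [("baseline", scores_baseline), ("extended", scores_extended)]:
    precision, recall, _ = precision_recall_curve(y, s)
    print(f"{name}: average precision = {average_precision_score(y, s):.3f}, "
          f"best precision at recall >= 0.5 = {precision[recall >= 0.5].max():.3f}")
```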


2020 ◽  
Vol 12 (7) ◽  
pp. 1218
Author(s):  
Laura Tuşa ◽  
Mahdi Khodadadzadeh ◽  
Cecilia Contreras ◽  
Kasra Rafiezadeh Shahi ◽  
Margret Fuchs ◽  
...  

Due to the extensive drilling performed every year in exploration campaigns for the discovery and evaluation of ore deposits, drill-core mapping is becoming an essential step. While valuable mineralogical information is extracted during core logging by on-site geologists, the process is time-consuming and dependent on the observer's individual background. Hyperspectral short-wave infrared (SWIR) data are used in the mining industry as a tool to complement traditional logging techniques and to provide a rapid and non-invasive analytical method for mineralogical characterization. Additionally, Scanning Electron Microscopy-based image analyses using a Mineral Liberation Analyser (SEM-MLA) provide exhaustive high-resolution mineralogical maps, but can only be performed on small areas of the drill-cores. We propose to use machine learning algorithms to combine the two data types and upscale the quantitative SEM-MLA mineralogical data to drill-core scale. In this way, quasi-quantitative maps over entire drill-core samples are obtained. Our upscaling approach increases result transparency and reproducibility by employing physically based data acquisition (hyperspectral imaging) combined with mathematical models (machine learning). The procedure is tested on five drill-core samples with varying training data using random forest, support vector machine and neural network regression models. The obtained mineral abundance maps are further used for the extraction of mineralogical parameters such as mineral association.
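An illustrative sketch of the upscaling step with synthetic spectra and mineral abundances in place of the SWIR and SEM-MLA data (band counts, mixing model and sample sizes are placeholders, not the authors' processing chain): a regressor is trained on the co-registered pixels and then applied to every pixel of the drill-core scan.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_bands, n_minerals = 200, 4

# Synthetic stand-ins: SWIR spectra for the whole drill-core scan and, for a small
# co-registered area, quantitative SEM-MLA mineral abundances used as labels.
core_spectra = rng.random((20_000, n_bands))
train_idx = rng.choice(20_000, size=1_000, replace=False)
mixing = rng.random((n_bands, n_minerals))
train_abundances = core_spectra[train_idx] @ mixing
train_abundances /= train_abundances.sum(axis=1, keepdims=True)

# Train on the small SEM-MLA area, then upscale to every pixel of the core scan.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(core_spectra[train_idx], train_abundances)
abundance_map = model.predict(core_spectra)   # quasi-quantitative mineral map, one row per pixel
print(abundance_map.shape)
```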


2021 ◽  
Author(s):  
Octavian Dumitru ◽  
Gottfried Schwarz ◽  
Mihai Datcu ◽  
Dongyang Ao ◽  
Zhongling Huang ◽  
...  

During the last years, much progress has been made with machine learning algorithms. Typical application fields of machine learning include many technical and commercial applications as well as Earth science analyses, where most often indirect and distorted detector data have to be converted to well-calibrated scientific data that are a prerequisite for a correct understanding of the desired physical quantities and their relationships.

However, the provision of sufficient calibrated data is not enough for the testing, training, and routine processing of most machine learning applications. In principle, one also needs a clear strategy for the selection of necessary and useful training data and an easily understandable quality control of the finally desired parameters.

At first glance, one could guess that this problem could be solved by a careful selection of representative test data covering many typical cases as well as some counterexamples. Then these test data can be used for the training of the internal parameters of a machine learning application. At second glance, however, many researchers have found that simply stacking up plain examples is not the best choice for many scientific applications.

To get improved machine learning results, we concentrated on the analysis of satellite images depicting the Earth's surface under various conditions such as the selected instrument type, spectral bands, and spatial resolution. In our case, such data are routinely provided by the freely accessible European Sentinel satellite products (e.g., Sentinel-1 and Sentinel-2). Our basic work then included investigations of how some additional processing steps, linked with the selected training data, can provide better machine learning results.

To this end, we analysed and compared three different approaches to find machine learning strategies for the joint selection and processing of training data for our Earth observation images:
- One can optimize the training data selection by adapting the data selection to the specific instrument, target, and application characteristics [1].
- As an alternative, one can dynamically generate new training parameters by Generative Adversarial Networks. This is comparable to the role of a sparring partner in boxing [2].
- One can also use a hybrid semi-supervised approach for Synthetic Aperture Radar images with limited labelled data. The method is split into polarimetric scattering classification, topic modelling for scattering labels, unsupervised constraint learning, and supervised label prediction with constraints [3].

We applied these strategies in the ExtremeEarth sea-ice monitoring project (http://earthanalytics.eu/). As a result, we can demonstrate for which application cases these three strategies provide a promising alternative to a simple conventional selection of available training data.

[1] C.O. Dumitru et al., "Understanding Satellite Images: A Data Mining Module for Sentinel Images", Big Earth Data, 2020, 4(4), pp. 367-408.
[2] D. Ao et al., "Dialectical GAN for SAR Image Translation: From Sentinel-1 to TerraSAR-X", Remote Sensing, 2018, 10(10), pp. 1-23.
[3] Z. Huang et al., "HDEC-TFA: An Unsupervised Learning Approach for Discovering Physical Scattering Properties of Single-Polarized SAR Images", IEEE Transactions on Geoscience and Remote Sensing, 2020, pp. 1-18.


Author(s):  
Yanxiang Yu ◽  
Chicheng Xu ◽  
Siddharth Misra ◽  
Weichang Li ◽  
...  

Compressional and shear sonic traveltime logs (DTC and DTS, respectively) are crucial for subsurface characterization and seismic-well tie. However, these two logs are often missing or incomplete in many oil and gas wells. Therefore, many petrophysical and geophysical workflows include sonic log synthetization or pseudo-log generation based on multivariate regression or rock physics relations. The SPWLA PDDA SIG hosted a contest, which started on March 1, 2020, and concluded on May 7, 2020, aiming to predict the DTC and DTS logs from seven "easy-to-acquire" conventional logs using machine-learning methods (GitHub, 2020). In the contest, a total of 20,525 data points with half-foot resolution from three wells were collected to train regression models using machine-learning techniques. Each data point had seven features, consisting of the conventional "easy-to-acquire" logs: caliper, neutron porosity, gamma ray (GR), deep resistivity, medium resistivity, photoelectric factor, and bulk density, as well as two sonic logs (DTC and DTS) as the target. A separate data set of 11,089 samples from a fourth well was then used as the blind test data set. The prediction performance of the model was evaluated using root mean square error (RMSE) as the metric, defined as: RMSE = \sqrt{\frac{1}{2m}\sum_{i=1}^{m}\left[\left(DTC_{pred}^{i}-DTC_{true}^{i}\right)^{2}+\left(DTS_{pred}^{i}-DTS_{true}^{i}\right)^{2}\right]}. In the benchmark model (Yu et al., 2020), we used a Random Forest regressor and applied minimal preprocessing to the training data set; an RMSE score of 17.93 was achieved on the test data set. The top five models from the contest, on average, beat the performance of our benchmark model by 27% in the RMSE score. In this paper, we review these five solutions, including the preprocessing techniques and the different machine-learning models used, including neural networks, long short-term memory (LSTM), and ensemble trees. We found that data cleaning and clustering were critical for improving the performance of all models.
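A small sketch of the contest metric as defined above, applied to hypothetical DTC/DTS arrays; any regression model producing the two predicted logs can be scored this way.

```python
import numpy as np

def contest_rmse(dtc_pred, dtc_true, dts_pred, dts_true):
    """Combined RMSE over the predicted DTC and DTS logs, following the contest definition."""
    m = len(dtc_true)
    sq_err = (dtc_pred - dtc_true) ** 2 + (dts_pred - dts_true) ** 2
    return np.sqrt(sq_err.sum() / (2 * m))

# Example with random stand-in logs (the real blind test set had 11,089 samples).
rng = np.random.default_rng(0)
dtc_true = rng.normal(70, 10, 11_089)
dts_true = rng.normal(120, 15, 11_089)
print(contest_rmse(dtc_true + rng.normal(0, 5, 11_089), dtc_true,
                   dts_true + rng.normal(0, 5, 11_089), dts_true))
```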

