BIAS-VARIANCE CONTROL VIA HARD POINTS SHAVING

Author(s):  
STEFANO MERLER ◽  
BRUNO CAPRILE ◽  
CESARE FURLANELLO

In this paper, we propose a regularization technique for AdaBoost. The method implements a bias-variance control strategy in order to avoid overfitting in classification tasks on noisy data. The method is based on a notion of easy and hard training patterns as emerging from analysis of the dynamical evolution of AdaBoost weights. The procedure consists of sorting the training data points by a hardness measure and progressively eliminating the hardest, stopping at an automatically selected threshold. The effectiveness of the method is tested and discussed on synthetic as well as real data.
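A minimal sketch of the idea (not the authors' exact procedure, and with a fixed shaving fraction assumed in place of the automatic threshold): run an AdaBoost loop with decision stumps, record each sample's weight trajectory, use its mean weight as a "hardness" score, and retrain after removing the hardest fraction of the training set.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def hardness_scores(X, y, n_rounds=50):
    """Mean AdaBoost weight per sample over the boosting rounds (a hardness proxy)."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    traj = []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        miss = (stump.predict(X) != y).astype(float)
        err = np.clip(np.dot(w, miss), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(alpha * (2 * miss - 1))   # up-weight misclassified points
        w /= w.sum()
        traj.append(w.copy())
    return np.mean(traj, axis=0)              # hard points keep large weights

def shave_hard_points(X, y, keep_fraction=0.9):
    """Keep only the easiest fraction of the training set (fraction is an assumption)."""
    order = np.argsort(hardness_scores(X, y))  # easiest first
    keep = order[: int(keep_fraction * len(y))]
    return X[keep], y[keep]
```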

1996 ◽  
Vol 8 (3) ◽  
pp. 595-609 ◽  
Author(s):  
Richard Rohwer ◽  
John C. van der Rest

Relationships between clustering, description length, and regularization are pointed out, motivating the introduction of a cost function with a description length interpretation and the unusual and useful property of having its minimum approximated by the densest mode of a distribution. A simple inverse kinematics example is used to demonstrate that this property can be used to select and learn one branch of a multivalued mapping. This property is also used to develop a method for setting regularization parameters according to the scale on which structure is exhibited in the training data. The regularization technique is demonstrated on two real data sets, a classification problem and a regression problem.
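A minimal illustration of the "densest mode" property (not the paper's cost function): given several candidate targets for the same input of a multivalued mapping, keep the one lying at the densest mode of their distribution, estimated here with a Gaussian kernel density estimate whose bandwidth plays the role of a scale parameter.

```python
import numpy as np
from scipy.stats import gaussian_kde

def densest_mode(candidates, grid_size=512):
    """Return the location of the highest-density mode of the candidate values."""
    candidates = np.asarray(candidates, dtype=float)
    kde = gaussian_kde(candidates)   # bandwidth acts as the scale/regularization parameter
    grid = np.linspace(candidates.min(), candidates.max(), grid_size)
    return grid[np.argmax(kde(grid))]

# e.g. two branches of an inverse mapping: values near 0.2 form the denser branch
print(densest_mode([0.19, 0.21, 0.20, 0.22, 0.18, 0.80, 0.82]))
```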


2019 ◽  
Author(s):  
Liwei Cao ◽  
Danilo Russo ◽  
Vassilios S. Vassiliadis ◽  
Alexei Lapkin

A mixed-integer nonlinear programming (MINLP) formulation for symbolic regression was proposed to identify physical models from noisy experimental data. The formulation was tested using numerical models and was found to be more efficient than the previous literature example with respect to the number of predictor variables and training data points. The globally optimal search was extended to identify physical models and to cope with noise in the predictor variables of the experimental data. The methodology was coupled with automated collection of experimental data and proved successful in identifying the correct physical models describing the relationship between shear stress and shear rate for both Newtonian and non-Newtonian fluids, as well as simple kinetic laws of reactions. Future work will focus on addressing the limitations of the presented formulation by extending it to larger, more complex physical models.
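A minimal sketch of the end goal rather than the MINLP formulation itself: fit a few candidate rheology models to noisy shear-rate/shear-stress data and select the one with the lowest AIC. The model forms, synthetic data, and noise level are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

newtonian = lambda g, mu: mu * g          # tau = mu * gamma_dot
power_law = lambda g, K, n: K * g ** n    # tau = K * gamma_dot ** n

rng = np.random.default_rng(0)
gamma = np.linspace(0.1, 10, 30)
tau = 2.0 * gamma ** 0.6 + rng.normal(0, 0.1, gamma.size)   # synthetic non-Newtonian data

def aic(model, p0):
    """Fit a candidate model and return its AIC and fitted parameters."""
    popt, _ = curve_fit(model, gamma, tau, p0=p0)
    rss = np.sum((tau - model(gamma, *popt)) ** 2)
    return gamma.size * np.log(rss / gamma.size) + 2 * len(popt), popt

for name, model, p0 in [("Newtonian", newtonian, [1.0]),
                        ("power law", power_law, [1.0, 1.0])]:
    score, params = aic(model, p0)
    print(f"{name}: AIC={score:.1f}, params={np.round(params, 3)}")
```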


METRON ◽  
2021 ◽  
Author(s):  
Giovanni Saraceno ◽  
Claudio Agostinelli ◽  
Luca Greco

Abstract A weighted likelihood technique for robust estimation of multivariate wrapped distributions of data points scattered on a p-dimensional torus is proposed. The occurrence of outliers in the sample at hand can badly compromise inference for standard techniques such as the maximum likelihood method. Therefore, there is a need to handle such model inadequacies in the fitting process with a robust technique and an effective downweighting of observations that do not follow the assumed model. Furthermore, the use of a robust method can help in situations of hidden and unexpected substructures in the data. Here, it is suggested to build a set of data-dependent weights based on the Pearson residuals and to solve the corresponding weighted likelihood estimating equations. In particular, robust estimation is carried out using a Classification EM algorithm whose M-step is enhanced by the computation of weights based on the current parameter values. The finite-sample behavior of the proposed method has been investigated by a Monte Carlo numerical study and real data examples.
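A heavily simplified sketch of the reweighting idea, on a univariate normal toy model rather than multivariate wrapped distributions: Pearson residuals compare a kernel density estimate of the data with the fitted model density, and observations with large residuals are downweighted before refitting. The weight function below is an illustrative choice, not the paper's.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

def weighted_fit(x, n_iter=10):
    """Iteratively reweighted fit of a normal model, downweighting poorly fitted points."""
    mu, sigma = np.mean(x), np.std(x)
    kde = gaussian_kde(x)
    for _ in range(n_iter):
        delta = kde(x) / norm.pdf(x, mu, sigma) - 1.0     # Pearson residuals
        w = np.minimum(1.0, 1.0 / (1.0 + np.abs(delta)))  # illustrative downweighting
        mu = np.average(x, weights=w)
        sigma = np.sqrt(np.average((x - mu) ** 2, weights=w))
    return mu, sigma

x = np.concatenate([np.random.default_rng(1).normal(0, 1, 200),
                    np.random.default_rng(2).normal(8, 0.5, 20)])   # ~10% outliers
print(weighted_fit(x))   # close to (0, 1) despite the contamination
```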


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Bingyin Hu ◽  
Anqi Lin ◽  
L. Catherine Brinson

Abstract The inconsistency of polymer indexing caused by the lack of uniformity in the expression of polymer names is a major challenge for the widespread use of polymer-related data resources and limits the application of materials informatics for innovation across broad classes of polymer science and polymer-based materials. The current solution of using a variety of different chemical identifiers has proven insufficient to address the challenge and is not intuitive for researchers. This work proposes a multi-algorithm-based mapping methodology, ChemProps, that is optimized to solve the polymer indexing issue with an easy-to-update design in both depth and width. A RESTful API is provided for lightweight data exchange and easy integration across data systems. A weight factor is assigned to each algorithm to generate scores for candidate chemical names and is optimized to maximize the minimum difference between the score of the ground-truth chemical name and those of the other candidates. Ten-fold validation is applied to the 160 training data points to prevent overfitting. The obtained set of weight factors achieves 100% test accuracy on the 54 test data points. The weight factors will evolve as ChemProps grows. With ChemProps, other polymer databases can remove duplicate entries and enable a more accurate “search by SMILES” function by using ChemProps as a common name-to-SMILES translator through API calls. ChemProps is also an excellent tool for auto-populating polymer properties thanks to its easy-to-update design.
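A minimal sketch of the margin idea behind the weight factors, with made-up per-algorithm scores (not ChemProps' actual algorithms): each mapping algorithm scores every candidate name, and the weights are chosen to maximize the smallest gap between the ground-truth candidate's combined score and every other candidate's, formulated as a small linear program.

```python
import numpy as np
from scipy.optimize import linprog

# scores[q][c] = per-algorithm score vector of candidate c for query q; index 0 is the ground truth
scores = [np.array([[0.9, 0.2], [0.7, 0.6], [0.1, 0.8]]),
          np.array([[0.6, 0.9], [0.5, 0.4]])]

k = scores[0].shape[1]                                  # number of algorithms
rows = []
for s in scores:
    for other in s[1:]:
        rows.append(np.append(-(s[0] - other), 1.0))    # -(w . margin) + t <= 0
A_ub, b_ub = np.array(rows), np.zeros(len(rows))
A_eq, b_eq = np.array([np.append(np.ones(k), 0.0)]), np.array([1.0])   # weights sum to one
c = np.append(np.zeros(k), -1.0)                        # maximize the minimum margin t
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * k + [(None, None)])
weights, margin = res.x[:k], res.x[-1]
print(weights, margin)
```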


2021 ◽  
Author(s):  
Faruk Alpak ◽  
Yixuan Wang ◽  
Guohua Gao ◽  
Vivek Jain

Abstract Recently, a novel distributed quasi-Newton (DQN) derivative-free optimization (DFO) method was developed for generic reservoir performance optimization problems, including well-location optimization (WLO) and well-control optimization (WCO). DQN is designed to effectively locate multiple local optima of highly nonlinear optimization problems. However, its performance has neither been validated on realistic applications nor compared to other DFO methods. We have integrated DQN into a versatile field-development optimization platform designed specifically for iterative workflows enabled through distributed-parallel flow simulations. DQN is benchmarked against alternative DFO techniques, namely the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method hybridized with Direct Pattern Search (BFGS-DPS), Mesh Adaptive Direct Search (MADS), Particle Swarm Optimization (PSO), and the Genetic Algorithm (GA). DQN is a multi-thread optimization method that distributes an ensemble of optimization tasks among multiple high-performance-computing nodes. Thus, it can locate multiple optima of the objective function in parallel within a single run. Simulation results computed from one DQN optimization thread are shared with the others by updating a unified set of training data points composed of the responses (implicit variables) of all successful simulation jobs. The sensitivity matrix at the current best solution of each optimization thread is approximated by a linear-interpolation technique using all or a subset of the training data points. The gradient of the objective function is analytically computed using the estimated sensitivities of the implicit variables with respect to the explicit variables. The Hessian matrix is then updated using the quasi-Newton method, and a new search point for each thread is obtained by solving a trust-region subproblem for the next iteration. In contrast, other DFO methods rely on a single-thread optimization paradigm that can only locate a single optimum; to locate multiple optima, the same optimization process must be repeated multiple times starting from different initial guesses, and simulation results generated from a single-thread optimization task cannot be shared with other tasks. Benchmarking results are presented for synthetic yet challenging WLO and WCO problems, and the DQN method is field-tested on two realistic applications. DQN identifies the global optimum with the fewest simulations and the shortest run time on a synthetic problem with a known solution. On the other benchmarking problems, without known solutions, DQN identified comparable local optima with reasonably fewer simulations than the alternative techniques. The field-testing results reinforce the auspicious computational attributes of DQN. Overall, the results indicate that DQN is a novel and effective parallel algorithm for field-scale development optimization problems.
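A minimal single-thread sketch of one DQN-style iteration under simplifying assumptions (the toy responses and objective are invented): the sensitivity of simulated responses to control variables is approximated by a least-squares linear fit over accumulated training points, the objective gradient follows by the chain rule, the Hessian is maintained with a standard BFGS update, and the step is a Newton direction clipped to a trust radius.

```python
import numpy as np

def estimate_sensitivity(X, R):
    """Linear fit R ~ X @ S + c over training points (X: controls, R: responses)."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, R, rcond=None)
    return coef[:-1]                      # d(responses)/d(controls)

def bfgs_update(H, s, y):
    """Standard BFGS Hessian update with step s and gradient change y."""
    sy = s @ y
    if sy <= 1e-12:
        return H
    Hs = H @ s
    return H - np.outer(Hs, Hs) / (s @ Hs) + np.outer(y, y) / sy

def trust_region_step(g, H, radius):
    """Newton step, shrunk to the trust radius (a crude subproblem solver)."""
    p = -np.linalg.solve(H + 1e-8 * np.eye(len(g)), g)
    n = np.linalg.norm(p)
    return p if n <= radius else p * radius / n

# toy usage: the objective J depends on controls x only through responses r(x)
X = np.random.default_rng(0).uniform(-1, 1, (20, 3))     # past simulation inputs
R = X @ np.array([[2.0], [-1.0], [0.5]]) + 0.3           # past simulated responses
S = estimate_sensitivity(X, R)
dJ_dr = np.array([1.0])                                  # sensitivity of J to the responses
g = S @ dJ_dr                                            # chain-rule gradient
H = np.eye(3)
print(trust_region_step(g, H, radius=0.5))
```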


2018 ◽  
Vol 11 (2) ◽  
pp. 53-67
Author(s):  
Ajay Kumar ◽  
Shishir Kumar

Several initial center selection algorithms have been proposed in the literature for numerical data, but because the values of categorical data are unordered, these methods are not applicable to categorical data sets. This article investigates the initial center selection process for categorical data and then presents a new support-based initial center selection algorithm. The proposed algorithm measures the weight of the unique data points of an attribute with the help of support and then integrates these weights along the rows to obtain the support of every row. Further, the data object with the largest support is chosen as the initial center, followed by finding the other centers at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu's method, and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.
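A minimal sketch of the described selection scheme under assumed details: the "support" of a value is taken as its relative frequency in its column, a row's support is the sum over its attributes, the highest-support row becomes the first center, and further centers are the rows farthest (by Hamming distance) from the centers chosen so far.

```python
import numpy as np

def support_based_centers(data, k):
    """Pick k initial centers for categorical data by support, then by max Hamming distance."""
    data = np.asarray(data, dtype=object)
    n, m = data.shape
    support = np.zeros(n)
    for j in range(m):
        values, counts = np.unique(data[:, j], return_counts=True)
        freq = dict(zip(values, counts / n))
        support += np.array([freq[v] for v in data[:, j]])
    centers = [int(np.argmax(support))]                   # row with the largest total support
    hamming = lambda a, b: np.sum(a != b)
    while len(centers) < k:
        dist = np.array([min(hamming(row, data[c]) for c in centers) for row in data])
        dist[centers] = -1                                # never re-pick an existing center
        centers.append(int(np.argmax(dist)))
    return data[centers]

X = [["a", "x", "p"], ["a", "y", "p"], ["b", "y", "q"], ["a", "x", "q"], ["c", "z", "r"]]
print(support_based_centers(X, k=2))
```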


2018 ◽  
Vol 8 (2) ◽  
pp. 377-406
Author(s):  
Almog Lahav ◽  
Ronen Talmon ◽  
Yuval Kluger

Abstract A fundamental question in data analysis, machine learning, and signal processing is how to compare data points. The choice of distance metric is especially challenging for high-dimensional data sets, where the problem of meaningfulness is more prominent (e.g., the Euclidean distance between images). In this paper, we propose to exploit a property of high-dimensional data that is usually ignored: the structure stemming from the relationships between the coordinates. Specifically, we show that organizing similar coordinates into clusters can be exploited for the construction of the Mahalanobis distance between samples. When the observable samples are generated by a nonlinear transformation of hidden variables, the Mahalanobis distance allows the recovery of the Euclidean distances in the hidden space. We illustrate the advantage of our approach on a synthetic example, where the discovery of clusters of correlated coordinates improves the estimation of the principal directions of the samples. Our method was also applied to real gene-expression data for lung adenocarcinomas (lung cancer). Using the proposed metric, we found a partition of subjects into risk groups with good separation between their Kaplan–Meier survival plots.
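A simplified illustration (not the paper's estimator): coordinates are grouped into clusters by correlation, the covariance is kept only within clusters (between-cluster entries are zeroed as a crude structural shrinkage), and the resulting inverse covariance defines a Mahalanobis distance between samples.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def clustered_mahalanobis(X, n_clusters=3):
    """Return a Mahalanobis distance built from a within-cluster covariance estimate."""
    corr = np.corrcoef(X, rowvar=False)
    dist = squareform(1 - np.abs(corr), checks=False)     # coordinate dissimilarity
    labels = fcluster(linkage(dist, method="average"), n_clusters, criterion="maxclust")
    cov = np.cov(X, rowvar=False)
    mask = labels[:, None] == labels[None, :]             # keep within-cluster blocks only
    cov_struct = np.where(mask, cov, 0.0) + 1e-6 * np.eye(X.shape[1])
    prec = np.linalg.inv(cov_struct)
    def d(a, b):
        diff = a - b
        return float(np.sqrt(diff @ prec @ diff))
    return d

X = np.random.default_rng(0).normal(size=(200, 6))
d = clustered_mahalanobis(X)
print(d(X[0], X[1]))
```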


2021 ◽  
Author(s):  
Louise Bloch ◽  
Christoph M. Friedrich

Abstract Background: The prediction of whether subjects with Mild Cognitive Impairment (MCI) will prospectively develop Alzheimer's Disease (AD) is important for the recruitment and monitoring of subjects in therapy studies. Machine Learning (ML) is well suited to improving early AD prediction. The etiology of AD is heterogeneous, which leads to noisy data sets. Additional noise is introduced by multicentric study designs and varying acquisition protocols. This article examines whether an automatic and fair data valuation method based on Shapley values can identify subjects with noisy data. Methods: An ML workflow was developed and trained for a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. The validation was executed for an independent ADNI test data set and for the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric Magnetic Resonance Imaging (MRI) feature extraction, subject sample selection using data Shapley, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. This model interpretation enables clinically relevant explanation of individual predictions. Results: The XGBoost models that excluded 116 of the 467 subjects from the training data set, based on their Logistic Regression (LR) data Shapley values, outperformed the models trained on the entire training data set, which reached a mean classification accuracy of 58.54 %, by 14.13 % (8.27 percentage points) on the independent ADNI test data set. The XGBoost models trained on the entire training data set reached a mean accuracy of 60.35 % for the AIBL data set. An improvement of 24.86 % (15.00 percentage points) could be reached for the XGBoost models if the 72 subjects with the smallest RF data Shapley values were excluded from the training data set. Conclusion: The data Shapley method was able to improve the classification accuracies for the test data sets. Noisy data was associated with the number of ApoE ϵ4 alleles and volumetric MRI measurements. Kernel SHAP showed that the black-box models learned biologically plausible associations.
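A minimal sketch of the sample-selection step, assuming the per-subject data Shapley values have already been computed (the valuation procedure itself is not shown, and random data stands in for the MRI features): drop the lowest-valued training subjects and retrain the classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_after_shapley_filter(X_train, y_train, shapley_values, n_drop):
    """Discard the n_drop lowest-valued samples, then retrain on the remainder."""
    keep = np.argsort(shapley_values)[n_drop:]
    model = RandomForestClassifier(n_estimators=300, random_state=0)
    model.fit(X_train[keep], y_train[keep])
    return model

# toy usage with random data and random stand-in valuations
rng = np.random.default_rng(0)
X, y = rng.normal(size=(467, 10)), rng.integers(0, 2, 467)
values = rng.normal(size=467)                 # stand-in for real data Shapley values
model = train_after_shapley_filter(X, y, values, n_drop=116)
```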


Author(s):  
Quang Thanh Tran ◽  
Li Jun Hao ◽  
Quang Khai Trinh

Wireless traffic prediction plays an important role in network planning and management, especially for real-time decision making and short-term prediction. Such systems require prediction methods with high accuracy, low cost, and low computational complexity. Although exponential smoothing is an effective method, it has seen little use with cellular networks, and there is little research on data traffic. The accuracy and suitability of this method need to be evaluated using several types of traffic. Thus, this study introduces the application of exponential smoothing as a method of adaptive forecasting of cellular network traffic for both voice (in Erlang) and data (in megabytes or gigabytes). Simple exponential smoothing and the Error, Trend, Seasonal (ETS) framework are used. By investigating the effect of their smoothing factors in describing cellular network traffic, the forecasting accuracy of each method is evaluated. This research takes a comprehensive analysis approach, using multiple case-study comparisons to determine the best-fitting model. Different exponential smoothing models are evaluated for various traffic types on different time scales. The experiments are implemented on real data from a commercial cellular network, which is divided into a training part for modeling and a test part for forecasting comparison. This study found that the ETS framework is not suitable for hourly voice traffic, but it provides nearly the same results as Holt–Winters' multiplicative seasonal (HWMS) smoothing for both daily voice and data traffic. HWMS is presumably encompassed by the ETS framework and shows good results for all traffic cases. Therefore, HWMS is recommended for cellular network traffic prediction due to its simplicity and high accuracy.
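A minimal sketch of the compared smoothers using statsmodels (the synthetic hourly series, seasonal period, and train/test split are assumptions): simple exponential smoothing, an additive-trend fit, and the Holt-Winters multiplicative-seasonal model recommended in the study.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, ExponentialSmoothing

rng = np.random.default_rng(0)
hours = pd.date_range("2021-01-01", periods=24 * 28, freq="H")
traffic = pd.Series(100 + 30 * np.sin(2 * np.pi * hours.hour / 24) + rng.normal(0, 5, len(hours)),
                    index=hours)                       # synthetic hourly traffic with a daily cycle
train, test = traffic[:-24], traffic[-24:]

simple = SimpleExpSmoothing(train).fit()               # single smoothing factor
ets_add = ExponentialSmoothing(train, trend="add").fit()                     # additive trend, no season
hwms = ExponentialSmoothing(train, trend="add", seasonal="mul",
                            seasonal_periods=24).fit() # Holt-Winters multiplicative seasonal

for name, model in [("simple", simple), ("additive trend", ets_add), ("HWMS", hwms)]:
    rmse = np.sqrt(np.mean((model.forecast(24) - test) ** 2))
    print(f"{name}: RMSE = {rmse:.2f}")
```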


2020 ◽  
Author(s):  
Alexis Oliva ◽  
Matías Llabrés

Different control charts, in combination with the process capability indices Cp, Cpm, and Cpk, were evaluated as part of the control strategy, since both are key elements in determining whether a method or process is reliable for its purpose. All these aspects were analyzed using real data from unitary processes and analytical methods. The traditional x-chart and moving-range chart confirmed that both the analytical method and the process are in control and stable; therefore, the process capability indices can be computed. We applied different criteria to establish the specification limits (i.e., analyst/customer requirements) for a fixed method or process performance (i.e., process or method requirements). The unitary process does not satisfy the minimum capability requirements for the Cp and Cpk indices when the specification limits and control limits are equal in breadth. Therefore, the process needs to be revised; in particular, greater control of the process variation is necessary. For the analytical method, the Cpm and Cpk indices were computed, and the obtained results were similar in both cases. For example, if the specification limits are set at ±3% of the target value, the method is considered “satisfactory” (1.22 < Cpm < 1.50) and no more stringent precision control is required.
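A short sketch of the capability indices discussed above, using their standard definitions; the specification limits, target, and sample data are illustrative.

```python
import numpy as np

def capability_indices(x, lsl, usl, target):
    """Cp, Cpk and Cpm from sample data and specification limits (standard formulas)."""
    mu, sigma = np.mean(x), np.std(x, ddof=1)
    cp = (usl - lsl) / (6 * sigma)
    cpk = min(usl - mu, mu - lsl) / (3 * sigma)
    cpm = (usl - lsl) / (6 * np.sqrt(sigma ** 2 + (mu - target) ** 2))
    return cp, cpk, cpm

# e.g. assay results with specification limits at +/-3% of a target of 100
x = np.random.default_rng(0).normal(100.2, 0.7, 30)
print(capability_indices(x, lsl=97.0, usl=103.0, target=100.0))
```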

