AN IMPRECISE BOOSTING-LIKE APPROACH TO CLASSIFICATION

Author(s):  
Lev V. Utkin

This paper proposes a new approach to ensemble construction that restricts the set of weights assigned to the training examples in order to avoid overfitting. The algorithm, called EPIBoost (Extreme Points Imprecise Boost), applies imprecise statistical models to restrict the set of weights, and the weights within the restricted set are updated by means of its extreme points. The approach allows various algorithms to be constructed by applying different imprecise statistical models to produce the restricted set. Numerical experiments on real data sets show that EPIBoost may outperform standard AdaBoost for some parameters of the imprecise statistical models.
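
As an illustration of the weight-restriction idea, the sketch below runs an AdaBoost-style loop and, after each exponential reweighting, pulls the example weights back into an ε-contamination neighbourhood of the uniform distribution, whose extreme points are (1-ε)/n + ε·e_i. This is a hedged sketch of the general mechanism, not the published EPIBoost update; the parameter `eps` and the contamination model are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def epsilon_restricted_boost(X, y, n_rounds=50, eps=0.1):
    """AdaBoost-style loop whose example weights are kept inside an
    epsilon-contamination neighbourhood of the uniform distribution.
    Sketch of the restriction idea, not the exact EPIBoost rule."""
    n = len(y)                      # y must be a numpy array in {-1, +1}
    w = np.full(n, 1.0 / n)         # start from uniform weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        # standard exponential reweighting ...
        w = w * np.exp(-alpha * y * pred)
        w /= w.sum()
        # ... then the restriction step: pull the weights back into the
        # set {(1 - eps) * uniform + eps * q}, whose extreme points are
        # (1 - eps)/n + eps * e_i; every weight stays >= (1 - eps)/n
        w = (1 - eps) / n + eps * w
        learners.append(stump)
        alphas.append(alpha)
    return learners, np.array(alphas)

def predict(learners, alphas, X):
    votes = sum(a * clf.predict(X) for clf, a in zip(learners, alphas))
    return np.sign(votes)
```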

Geophysics ◽  
2020 ◽  
Vol 85 (2) ◽  
pp. V223-V232 ◽  
Author(s):  
Zhicheng Geng ◽  
Xinming Wu ◽  
Sergey Fomel ◽  
Yangkang Chen

The seislet transform uses the wavelet-lifting scheme and local slopes to analyze seismic data. In its definition, the design of prediction operators tailored to seismic images and data is a key issue. We have developed a new formulation of the seislet transform based on the relative-time (RT) attribute, which uses the RT volume to construct multiscale prediction operators. With the new operators, the seislet transform is accelerated because distant traces can be predicted directly. Applications to synthetic and real data demonstrate that the new approach reduces computational cost and obtains excellent sparse representations on test data sets.
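
For readers unfamiliar with lifting, the sketch below shows one level of plain Haar-style lifting on a 1-D signal (split, predict, update); in a seislet transform the trivial neighbour prediction would be replaced by a slope- or RT-guided prediction operator. It illustrates the lifting skeleton only, not the RT-based operators themselves.

```python
import numpy as np

def lifting_forward(x):
    """One level of Haar-style lifting: split / predict / update.
    A seislet transform replaces the neighbour prediction below with a
    slope- or RT-guided prediction operator."""
    even, odd = x[0::2], x[1::2]
    detail = odd - even            # predict: odd samples from even ones
    coarse = even + 0.5 * detail   # update: preserve the running mean
    return coarse, detail

def lifting_inverse(coarse, detail):
    even = coarse - 0.5 * detail
    odd = detail + even
    x = np.empty(even.size + odd.size)
    x[0::2], x[1::2] = even, odd
    return x

x = np.sin(np.linspace(0, 8 * np.pi, 64))
c, d = lifting_forward(x)
assert np.allclose(lifting_inverse(c, d), x)   # perfect reconstruction
```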


2015 ◽  
Vol 14 (03) ◽  
pp. 521-533
Author(s):  
M. Sariyar ◽  
A. Borg

Deterministic record linkage (RL) is frequently regarded as a rival to more sophisticated strategies such as probabilistic RL. We investigate the effect of combining deterministic linkage with other linkage techniques. For this task, we use a simple deterministic linkage strategy as a preceding filter: a data pair is classified as 'match' if all values of the attributes considered agree exactly, and as 'non-match' otherwise. This strategy is separately combined with two probabilistic RL methods based on the Fellegi–Sunter model and with two classification-tree methods (CART and bagging). An empirical comparison was conducted on two real data sets, using four different partitions into training and test data to increase the validity of the results. In almost all cases, applying deterministic linkage as a preceding filter leads to better results than omitting such a pre-filter, and overall the classification trees exhibited the best results. On all data sets, probabilistic RL profited from deterministic linkage only when the underlying probabilities were estimated before the deterministic linkage was applied. When a pre-filter removes the definite cases, the underlying population of data pairs changes; it is crucial to take this into account in model-based probabilistic RL.
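
A minimal sketch of the exact-agreement pre-filter described above; the record layout and field names are illustrative assumptions.

```python
def exact_agreement_prefilter(pairs, fields):
    """Split candidate record pairs into definite matches (all compared
    attribute values agree exactly) and a residual set that is passed on
    to a probabilistic or tree-based classifier."""
    definite, residual = [], []
    for rec_a, rec_b in pairs:
        if all(rec_a[f] == rec_b[f] for f in fields):
            definite.append((rec_a, rec_b))     # classified 'match'
        else:
            residual.append((rec_a, rec_b))     # left for the main model
    return definite, residual

a = {"first": "Ann", "last": "Lee", "dob": "1980-02-01"}
b = {"first": "Ann", "last": "Lee", "dob": "1980-02-01"}
c = {"first": "Ann", "last": "Lea", "dob": "1980-02-01"}
matches, rest = exact_agreement_prefilter([(a, b), (a, c)],
                                          ["first", "last", "dob"])
```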


Geophysics ◽  
2016 ◽  
Vol 81 (6) ◽  
pp. D625-D641 ◽  
Author(s):  
Dario Grana

The estimation of rock and fluid properties from seismic attributes is an inverse problem. Rock-physics modeling provides physical relations that link elastic and petrophysical variables. Most of these models are nonlinear; therefore, the inversion generally requires complex iterative optimization algorithms to estimate the reservoir model of petrophysical properties. We have developed a new approach based on linearizing the rock-physics forward model with first-order Taylor-series approximations. The mathematical method adopted for the inversion is the Bayesian approach previously applied successfully to linearized amplitude-variation-with-offset inversion. We developed the analytical formulation of the linearized rock-physics relations for three different model types: empirical, granular-media, and inclusion models, and we derived the Bayesian rock-physics inversion under Gaussian assumptions for the prior distribution of the model. Applications of the inversion to real data sets deliver accurate results. The main advantage of the method is its small computational cost, owing to the analytical solution afforded by the linearization and the Gaussian Bayesian approach.
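
Under a linearized forward model d = Gm + e with a Gaussian prior m ~ N(μ_m, Σ_m) and Gaussian noise e ~ N(0, Σ_e), the posterior is Gaussian with an analytical mean and covariance, which is the source of the small computational cost noted above. The sketch below implements those textbook update equations; G stands for the first-order Taylor Jacobian of whichever rock-physics model is linearized.

```python
import numpy as np

def linear_gaussian_posterior(G, d, mu_m, cov_m, cov_e):
    """Analytical posterior for d = G m + e with Gaussian prior and
    noise.  G is the Jacobian of the linearized forward model."""
    S = G @ cov_m @ G.T + cov_e          # data-space covariance
    # gain K = cov_m G^T S^{-1}; S and cov_m are symmetric, so we can
    # solve S X = G cov_m and transpose instead of inverting S
    K = np.linalg.solve(S, G @ cov_m).T
    mu_post = mu_m + K @ (d - G @ mu_m)  # posterior mean
    cov_post = cov_m - K @ G @ cov_m     # posterior covariance
    return mu_post, cov_post
```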


2011 ◽  
Vol 250-253 ◽  
pp. 1757-1760 ◽  
Author(s):  
Machine Hsie ◽  
Chih Tsang Lin ◽  
Yueh Feng Ho

This study proposes the Genetic Operation Tree (GOT), which integrates a genetic algorithm (GA) with an operation tree (OT) to build a model of asphalt-pavement overlay cracking. The training data on pavement cracks were collected from a 15-year experiment conducted by the Texas Department of Transportation. Even without a presumed formula structure, the GOT can self-organize a formula and produce a very concise model for predicting the length of pavement cracking.
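
As a sketch of the operation-tree representation (the GA loop that evolves the trees is omitted), the snippet below encodes a formula as a tree whose internal nodes are operators and whose leaves are variables or constants; the variable names and the example formula are purely illustrative.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "/": lambda a, b: a / b if b else 1.0}   # protected division

def evaluate(node, env):
    """Recursively evaluate an operation tree.  A tree is a variable
    name, a numeric constant, or (op, left_subtree, right_subtree)."""
    if isinstance(node, tuple):
        op, left, right = node
        return OPS[op](evaluate(left, env), evaluate(right, env))
    if isinstance(node, str):
        return env[node]          # leaf: input variable
    return node                   # leaf: numeric constant

# illustrative only: crack_length ~ (age * traffic) / (1 + thickness)
tree = ("/", ("*", "age", "traffic"), ("+", 1.0, "thickness"))
print(evaluate(tree, {"age": 12.0, "traffic": 3.5, "thickness": 2.0}))
```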


2020 ◽  
Vol 36 (4) ◽  
pp. 803-825
Author(s):  
Marco Fortini

Record linkage addresses the problem of identifying pairs of records that come from different sources and refer to the same unit of interest. Fellegi and Sunter proposed an optimal statistical test for assigning the match status to candidate pairs, in which the needed parameters are obtained through an EM algorithm applied directly to the set of candidate pairs, without recourse to training data. However, this procedure has quadratic complexity as the two lists to be matched grow. In addition, the EM-estimated parameters can be severely biased in this setting, so the problem is usually tackled by reducing the set of candidate pairs through filtering methods such as blocking. Unfortunately, such methods give no assessment of the probability that excluded pairs are actually true matches. The present work proposes an efficient approach in which comparisons of records between lists are minimised, while the EM estimates are modified by modelling tables with structural zeros in order to obtain unbiased parameter estimates. The improvement achieved by the suggested method is shown by means of simulations and an application based on real data.
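
In the Fellegi–Sunter model each candidate pair yields a vector of binary agreement indicators, and EM fits a two-component mixture (matches versus non-matches) with per-field agreement probabilities. Below is a minimal sketch of that standard EM under conditional independence, without the structural-zero correction that the paper introduces.

```python
import numpy as np

def fs_em(gamma, p=0.1, n_iter=100):
    """EM for the Fellegi-Sunter mixture.  gamma: (n_pairs, n_fields)
    binary agreement matrix.  Returns per-field agreement probabilities
    for matches (m) and non-matches (u), and the match prevalence p."""
    n, K = gamma.shape
    m = np.full(K, 0.9)            # P(field agrees | match)
    u = np.full(K, 0.1)            # P(field agrees | non-match)
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match
        lm = p * np.prod(m**gamma * (1 - m)**(1 - gamma), axis=1)
        lu = (1 - p) * np.prod(u**gamma * (1 - u)**(1 - gamma), axis=1)
        g = lm / (lm + lu)
        # M-step: weighted agreement rates, kept away from 0 and 1
        m = np.clip((g[:, None] * gamma).sum(0) / g.sum(), 1e-6, 1 - 1e-6)
        u = np.clip(((1 - g)[:, None] * gamma).sum(0) / (1 - g).sum(),
                    1e-6, 1 - 1e-6)
        p = g.mean()
    return m, u, p
```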


2021 ◽  
Vol 3 (1) ◽  
pp. 1-7
Author(s):  
Yadgar Sirwan Abdulrahman

Clustering is one of the essential strategies in data analysis. Classical solutions assume that all features contribute equally to the clustering; in real data sets, of course, some features are more important than others, so the essential features have a greater impact on identifying the optimal clusters. In this article, a fuzzy clustering algorithm with local automatic feature weighting is presented. The proposed algorithm has several advantages: 1) the feature weights are local, meaning that each cluster has its own weight vector; 2) the distance between samples is computed with a non-Euclidean similarity criterion to reduce the effect of noise; 3) the feature weights are learned adaptively during the training process. Mathematical analyses are given to derive the cluster centers and the feature weights. Experiments on a range of data sets demonstrate the proposed algorithm's efficiency compared with other algorithms that use global or local feature weighting.
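
A minimal sketch of fuzzy c-means extended with a separate feature-weight vector per cluster. The inverse-dispersion weight update shown is one common local-weighting rule, and plain Euclidean distances are used; both are illustrative assumptions rather than the paper's exact criterion.

```python
import numpy as np

def local_weighted_fcm(X, c=3, m=2.0, t=2.0, n_iter=50, seed=None):
    """Fuzzy c-means with one feature-weight vector per cluster.
    m: fuzzifier, t: weight exponent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    U = rng.dirichlet(np.ones(c), size=n)        # memberships, rows sum to 1
    W = np.full((c, d), 1.0 / d)                 # per-cluster feature weights
    for _ in range(n_iter):
        Um = U**m
        V = (Um.T @ X) / Um.sum(0)[:, None]      # cluster centres
        # per-cluster, per-feature dispersions
        D = np.stack([(Um[:, k, None] * (X - V[k])**2).sum(0)
                      for k in range(c)]) + 1e-12
        W = (1.0 / D)**(1.0 / (t - 1.0))         # inverse-dispersion rule
        W /= W.sum(1, keepdims=True)             # weights sum to 1 per cluster
        # weighted squared distances to each centre
        dist = np.stack([((X - V[k])**2 * W[k]**t).sum(1)
                         for k in range(c)]).T + 1e-12
        U = 1.0 / dist**(1.0 / (m - 1.0))        # standard FCM membership
        U /= U.sum(1, keepdims=True)
    return U, V, W
```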


2011 ◽  
Vol 1 (1) ◽  
pp. 45-52 ◽  
Author(s):  
Hamada M. Zahera ◽  
Gamal F. El-Hady ◽  
W. F. Abd El-Wahed

As web content grows, search engines become ever more important, while at the same time user satisfaction decreases. Query recommendation is a new approach to improving search results on the web. In this paper, a method is proposed that, given a query submitted to a search engine, suggests a list of queries related to the user's input query. The related queries are drawn from previously issued queries and can be issued by the user to tune or redirect the search process. The proposed method is based on a clustering process that detects groups of semantically similar queries, using the content of historical user preferences recorded in the search engine's query log. This facility supplies queries related to the ones submitted by users in order to direct them toward the information they require. The method not only discovers the related queries but also ranks them according to a similarity measure. It has been evaluated on real data sets from a search engine query log.
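
A hedged sketch of the pipeline: cluster logged queries with TF-IDF and k-means, then recommend queries from the input query's cluster, ranked by cosine similarity. The toy query log and the specific vectorizer/clusterer choices are illustrative stand-ins for the paper's semantic clustering.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

log_queries = ["cheap flights paris", "paris flight deals",
               "python list sort", "sort a list in python",
               "paris hotels", "flights to paris"]

vec = TfidfVectorizer()
Q = vec.fit_transform(log_queries)               # TF-IDF query vectors
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Q)

def recommend(query, k=3):
    """Suggest logged queries from the input query's cluster,
    ranked by cosine similarity."""
    q = vec.transform([query])
    idx = np.where(km.labels_ == km.predict(q)[0])[0]
    sims = cosine_similarity(q, Q[idx]).ravel()
    return [log_queries[i] for i in idx[np.argsort(-sims)][:k]]

print(recommend("flight paris"))
```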


2015 ◽  
Vol 26 (4) ◽  
pp. 1867-1880
Author(s):  
Ilmari Ahonen ◽  
Denis Larocque ◽  
Jaakko Nevalainen

Outlier detection covers a wide range of methods that aim to identify observations considered unusual. Novelty detection, on the other hand, seeks observations among newly generated test data that are exceptional compared with previously observed training data. In many applications, the general existence of novelty is of more interest than identifying the individual novel observations. For instance, in high-throughput cancer-treatment screening experiments, it is meaningful to test whether any new treatment effects are seen compared with existing compounds. Here, we present hypothesis tests for such global-level novelty. The problem is approached through a set of very general assumptions, making the work innovative in relation to the current literature. We introduce test statistics capable of detecting novelty; they operate on local neighborhoods, and their null distribution is obtained by the permutation principle. We show that they are valid and able to detect different types of novelty, e.g., location and scale alternatives. The performance of the methods is assessed with simulations and with applications to real data sets.
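
A sketch of a permutation test for global novelty under illustrative assumptions: the statistic is the mean distance from each test point to its nearest training neighbour (a generic local-neighbourhood statistic, not necessarily the paper's), and its null distribution is obtained by permuting the pooled sample.

```python
import numpy as np
from scipy.spatial.distance import cdist

def novelty_pvalue(train, test, n_perm=999, seed=None):
    """Permutation test for global novelty.  Statistic: mean distance
    from each test point to its nearest training neighbour; large
    values indicate that the test data depart from the training data."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([train, test])
    n_tr = len(train)

    def stat(tr, te):
        return cdist(te, tr).min(axis=1).mean()

    observed = stat(train, test)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))      # relabel pooled points
        count += stat(pooled[perm[:n_tr]], pooled[perm[n_tr:]]) >= observed
    return (count + 1) / (n_perm + 1)            # valid permutation p-value
```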


2018 ◽  
Vol 26 (1) ◽  
pp. 43-68
Author(s):  
Irina Băncescu

We propose a new method of constructing statistical models that can be interpreted as the lifetime distributions of the series-parallel/parallel-series systems used to characterize coherent systems. An open problem regarding coherent systems is comparing their expected lifetimes. Using these models, we discuss and establish conditions for ordering the expected lifetimes of complex series-parallel/parallel-series systems. We also consider parameter estimation and analyze two real data sets, and we give formulae for the reliability, hazard-rate, and mean hazard-rate functions.
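
For independent components, a parallel block has reliability 1 − Π(1 − R_i) and a series connection Π R_i; a series-parallel system composes the two. The sketch below evaluates such a system at time t, assuming exponential component lifetimes purely for illustration.

```python
import numpy as np

def series_parallel_reliability(rates, t):
    """Reliability at time t of a series connection of parallel blocks,
    with independent exponential component lifetimes.
    rates: list of blocks, each a list of component failure rates."""
    r_system = 1.0
    for block in rates:
        r_components = np.exp(-np.asarray(block) * t)  # exp. survival
        r_block = 1.0 - np.prod(1.0 - r_components)    # parallel block
        r_system *= r_block                            # series of blocks
    return r_system

# two blocks in series: {2 components} -> {3 components}
print(series_parallel_reliability([[0.1, 0.2], [0.05, 0.1, 0.3]], t=5.0))
```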


2021 ◽  
Vol 25 (3) ◽  
pp. 687-710
Author(s):  
Mostafa Boskabadi ◽  
Mahdi Doostparast

Regression trees are powerful data-mining tools for analyzing data sets. Observations are divided into homogeneous groups, and statistical models for the responses are then derived in the terminal nodes. This paper proposes a new approach to regression trees that takes the dependency structure among covariates into account when splitting the observations. The mathematical properties of the proposed method are discussed in detail, and various criteria are defined to assess the accuracy of the proposed model. The performance of the new approach is assessed in a Monte Carlo simulation study, and two real data sets, involving classification and regression problems, are analyzed using the obtained results.
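
For reference, the baseline CART regression split chooses the cut that maximizes the reduction in the residual sum of squares; the paper's contribution is to augment this with the dependency structure among covariates, which the sketch below does not attempt to reproduce.

```python
import numpy as np

def best_split(x, y):
    """Exhaustive search for the single-feature cut that maximizes the
    reduction in residual sum of squares (standard CART criterion)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    sse = lambda v: ((v - v.mean())**2).sum() if len(v) else 0.0
    total = sse(ys)
    best_gain, best_cut = 0.0, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue               # no cut between identical values
        gain = total - sse(ys[:i]) - sse(ys[i:])
        if gain > best_gain:
            best_gain, best_cut = gain, (xs[i - 1] + xs[i]) / 2
    return best_cut, best_gain
```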

