Tabu Search for Variable Selection in Classification

Author(s): Silvia Casado Yusta, Joaquín Pacheco Bonrostro

Variable selection plays an important role in classification. When many variables are involved, only those that are really required should be selected before designing a classification method. There are many reasons for selecting only a subset of the variables instead of the whole set of candidates (Reunanen, 2003): (1) it is cheaper to measure only a reduced set of variables; (2) prediction accuracy may be improved by excluding redundant and irrelevant variables; (3) the predictor built from fewer input variables is usually simpler and potentially faster; and (4) knowing which variables are relevant gives insight into the nature of the prediction problem and allows a better understanding of the final classification model. The importance of variable selection before applying classification methods is also pointed out in recent works such as Cai et al. (2007) and Rao and Lakshminarayanan (2007).

The aim in the classification problem is to classify instances that are characterized by attributes or variables: based on a set of examples whose class is known, a set of rules is designed and generalised to classify the instances with the greatest precision possible. There are several methodologies for dealing with this problem: classic discriminant analysis, logistic regression, neural networks, decision trees, instance-based learning, etc. Linear discriminant analysis and logistic regression search for linear functions and then use them for classification purposes, and they remain interesting methodologies.

In this work a new "ad hoc" method for variable selection in classification, specifically in discriminant analysis and logistic regression, is analysed. The method is based on the tabu search metaheuristic and, as shown below, yields better results than the classic methods (stepwise, backward and forward) used by statistical packages such as SPSS or BMDP. The method is designed for two-class problems.
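The abstract does not reproduce the algorithm itself, but a minimal sketch of tabu search for variable selection in this spirit, wrapping a linear discriminant classifier and scoring subsets by cross-validated accuracy, might look as follows; the one-flip neighbourhood, tabu tenure, and scoring function are illustrative assumptions rather than the authors' exact design.

```python
# Minimal sketch of tabu search for variable selection, not the authors'
# exact algorithm: neighbourhood, tenure, and scoring are assumptions.
import numpy as np
from collections import deque
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def score(X, y, mask):
    """Cross-validated accuracy of LDA on the selected variables."""
    if not mask.any():
        return 0.0
    lda = LinearDiscriminantAnalysis()
    return cross_val_score(lda, X[:, mask], y, cv=5).mean()

def tabu_select(X, y, n_iter=100, tenure=7, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    current = rng.random(p) < 0.5            # random initial subset
    best, best_val = current.copy(), score(X, y, current)
    tabu = deque(maxlen=tenure)              # recently flipped variables
    for _ in range(n_iter):
        cand_val, cand_j = -np.inf, None
        for j in range(p):                   # neighbourhood: flip one variable
            trial = current.copy()
            trial[j] = ~trial[j]
            v = score(X, y, trial)
            # tabu moves are skipped unless they beat the best (aspiration)
            if j in tabu and v <= best_val:
                continue
            if v > cand_val:
                cand_val, cand_j = v, j
        if cand_j is None:                   # every move tabu, none aspires
            break
        current[cand_j] = ~current[cand_j]   # take the best admissible move
        tabu.append(cand_j)
        if cand_val > best_val:
            best, best_val = current.copy(), cand_val
    return best, best_val
```

Stepwise, backward and forward selection stop at the first local optimum they reach; the tabu list is what lets this search leave such optima, since the best non-tabu move is taken even when it worsens the current score.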

1978, Vol. 15 (1), pp. 103-112
Author(s): William R. Dillon, Matthew Goldstein, Leon G. Schiffman

Buyer usage behavior data are used to compare the relative performance of linear discriminant analysis and several multinomial classification methods. The potential shortcomings of each of the procedures investigated are cited, and a new method for determining the contribution of a variable to discrimination in the multinomial classification problem is also presented.


2014, Vol. 6 (22), pp. 9037-9044
Author(s): Meilan Ouyang, Zhimin Zhang, Chen Chen, Xinbo Liu, Yizeng Liang

A new method performs classification and variable selection simultaneously to analyze complicated metabolomics datasets.


2020, Vol. 30 (1)
Author(s): Michael O. Olusola, Sydney I. Onyeagu

This paper is centred on a binary classification problem in which a new object with multivariate features is to be assigned to one of two distinct populations, based on historical samples from the two populations. A linear discriminant analysis framework, called minimised sum of deviations by proportion (MSDP), is proposed to model the binary classification problem. In the MSDP formulation, the sum of the proportions of exterior deviations is minimised subject to the group separation constraints, the normalisation constraint, the upper-bound constraints on the proportions of exterior deviations, and the sign-unrestriction and non-negativity constraints. The two-phase method of linear programming is adopted as the solution technique for generating the discriminant function, and the decision rule for group-membership prediction is constructed using the apparent error rate. The performance of MSDP is compared with some existing linear discriminant models on a previously published dataset on road casualties; the MSDP model proved more promising and well suited to this imbalanced dataset.
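The abstract does not give the full MSDP constraint set, so the sketch below implements the classical minimise-sum-of-deviations (MSD) discriminant from the same linear-programming family: exterior deviations are penalised, a gap parameter separates the groups, and a normalisation constraint rules out the trivial zero solution. The proportional deviations and upper-bound constraints specific to MSDP are omitted.

```python
# Sketch of a classical MSD linear-programming discriminant, a simpler
# relative of MSDP; constraint choices here are illustrative assumptions.
import numpy as np
from scipy.optimize import linprog

def msd_discriminant(XA, XB, eps=1.0):
    """Fit w, c so that w.x <= c for group A and w.x >= c + eps for
    group B, minimising the total exterior deviation."""
    nA, p = XA.shape
    nB = XB.shape[0]
    n = nA + nB
    # decision vector: [w (p, free), c (free), d (n, >= 0)]
    cost = np.concatenate([np.zeros(p + 1), np.ones(n)])
    A_ub = np.zeros((n, p + 1 + n))
    b_ub = np.zeros(n)
    A_ub[:nA, :p] = XA        # group A:  w.x_i - c - d_i <= 0
    A_ub[:nA, p] = -1.0
    A_ub[nA:, :p] = -XB       # group B: -w.x_j + c - d_j <= -eps
    A_ub[nA:, p] = 1.0
    b_ub[nA:] = -eps
    A_ub[np.arange(n), p + 1 + np.arange(n)] = -1.0
    # normalisation w.(mean_B - mean_A) = 1 rules out the trivial solution
    A_eq = np.zeros((1, p + 1 + n))
    A_eq[0, :p] = XB.mean(axis=0) - XA.mean(axis=0)
    b_eq = np.array([1.0])
    bounds = [(None, None)] * (p + 1) + [(0, None)] * n
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p], res.x[p]   # weights w and cutoff c

# classify a new object x as group B if w @ x > c + eps / 2, else group A
```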


2019, Vol. 2019, pp. 1-13
Author(s): Yu Cheng, Xu Chen, Xiaohua Ding, Linting Zeng

Car-sharing is becoming an increasingly popular travel mode in China, and many companies, including vehicle enterprises and Internet companies, invest heavily in it. In the early development of their business, however, most of them site car-sharing stations from experience, or essentially at random wherever parking space is available. This results in many stations with low operational efficiency and causes capital loss. This study uses different data sources, together with statistical models and machine learning algorithms, to help car-sharing operators choose optimal locations for new stations and adjust the locations of existing ones. We select Chengdu, which has a huge car-sharing travel demand and several large car-sharing operators, as the research area, and two main operators as the research objects. Instead of focusing on buffers generated around stations, Chengdu is divided into 58,724 square grid cells of 0.5 km × 0.5 km each. We seek a model that estimates a potential travel demand value for each cell from three data sources: order data, population data, and Point of Interest (POI) data. The problem is transformed into a binary form and five methods are implemented: Logistic Regression, Logistic Regression with LASSO, Naive Bayes, Linear Discriminant Analysis, and Quadratic Discriminant Analysis. The optimal model, Logistic Regression with LASSO, is chosen to estimate the probability that demand exists in each cell. Using car-sharing order data from the different operators, an existing order heat value is also computed for each cell. We then analyse and classify all cells into four groups and, for each group, give different suggestions on the optimal location of stations. This study focuses on a more competitive market, identifies the factors that influence order numbers, and gives siting suggestions that take competitors into account. We hope that our research can help operators improve their business and make rational plans.
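As a rough illustration of the selected model, the sketch below fits an L1-penalised (LASSO) logistic regression to grid-level features and estimates a demand probability per cell; the synthetic features are hypothetical stand-ins for the paper's population and POI predictors, not the actual Chengdu data.

```python
# Sketch of the grid-level binary demand model: LASSO-penalised logistic
# regression. Features and labels below are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# X: one row per 0.5 km x 0.5 km grid cell (population, POI counts, ...);
# y: 1 if any car-sharing order originated in the cell, 0 otherwise
rng = np.random.default_rng(0)
X = rng.poisson(5.0, size=(1000, 8)).astype(float)     # placeholder features
y = (X[:, 0] + X[:, 1] + rng.normal(0, 2, 1000) > 10).astype(int)

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5),
)
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
model.fit(X, y)
# probability of demand for every grid cell, used to rank candidate stations
p_demand = model.predict_proba(X)[:, 1]
```

The L1 penalty drives the coefficients of uninformative features to exactly zero, which is what lets the fitted model double as a variable-selection step over the population and POI predictors.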


2018, Vol. 33 (3), pp. 799-811
Author(s): John A. Knaff, Charles R. Sampson, Kate D. Musgrave

Abstract: This work describes tropical cyclone rapid intensification forecast aids designed for the western North Pacific tropical cyclone basin and for use at the Joint Typhoon Warning Center. Two statistical methods, linear discriminant analysis and logistic regression, are used to create probabilistic forecasts for seven intensification thresholds: 25-, 30-, 35-, and 40-kt changes in 24 h, 45- and 55-kt changes in 36 h, and a 70-kt change in 48 h (1 kt = 0.514 m s⁻¹). These forecast probabilities are further combined into an equally weighted probability consensus, which triggers deterministic forecasts equal to the intensification thresholds once the consensus probability reaches 40%. These deterministic forecasts are incorporated as additional members into an operational intensity consensus forecast, resulting in an improved intensity consensus for these important and difficult-to-predict cases. The methods are developed on the 2000–15 typhoon seasons, and independent performance is assessed using the 2016 and 2017 typhoon seasons. In many cases the probabilities have skill relative to climatology, and adding the rapid intensification deterministic aids to the operational intensity consensus significantly reduces its negative forecast biases.
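The consensus logic is simple enough to state directly; the sketch below assumes only what the abstract gives, namely two member probabilities per threshold, an equal-weight average, and the 40% trigger, and the probability values in the example are illustrative placeholders rather than operational output.

```python
# Sketch of the equally weighted probability consensus and its 40%
# deterministic trigger; example probabilities are placeholders.
RI_THRESHOLDS = [(25, 24), (30, 24), (35, 24), (40, 24),
                 (45, 36), (55, 36), (70, 48)]  # (kt change, lead time in h)

def ri_consensus(p_lda, p_logistic):
    """Equal-weight average of the two statistical aids' probabilities."""
    return 0.5 * (p_lda + p_logistic)

def deterministic_member(p_cons, delta_kt, trigger=0.40):
    """Once the consensus probability reaches the trigger, emit a
    deterministic forecast equal to the intensification threshold;
    otherwise contribute nothing to the intensity consensus."""
    return delta_kt if p_cons >= trigger else None

# e.g. the 25-kt/24-h threshold with member probabilities 0.35 and 0.50
p = ri_consensus(0.35, 0.50)            # 0.425 >= 0.40
print(deterministic_member(p, 25))      # -> 25, added as a consensus member
```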


2014, Vol. 3 (3), pp. 186-193
Author(s): Mohamad Iman Jamnejad, Hamid Parvin, Hamid Alinejad-Rokny, Ali Heidarzadegan

2017, Vol. 6 (3), pp. 57-60
Author(s): Denis Krivoguz

Modern approaches to regional landslide susceptibility assessment are considered in this paper. Descriptions of the most widely used techniques for landslide susceptibility assessment are presented: logistic regression, the indicator validity method, linear discriminant analysis, and artificial neural networks. The advantages and disadvantages of these techniques are discussed, and the techniques best suited to various analysis conditions are identified. It is concluded that when a large amount of input data is available for the studied region, the most suitable techniques are logistic regression and the indicator validity method, which achieve the most accurate results. When information is lacking, it is more expedient to use linear discriminant analysis or artificial neural networks, which minimize potential analysis inaccuracies.

