Variable Selection in Multiple Linear Regression Using a Genetic Algorithm

Author(s):  
Javier Trejos ◽  
Mario A. Villalobos-Arias ◽  
Jose Luis Espinoza

This article studies the application of a genetic algorithm to the problem of variable selection for multiple linear regression, minimizing the least squares criterion. The algorithm is based on a chromosomal representation of the variables considered in the least squares model: a binary chromosome indicates the presence (1) or absence (0) of each variable in the model. The fitness function is based on the adjusted R-squared, and chromosomes are selected by roulette wheel selection with probability proportional to their fitness. The usual genetic operators, such as crossover and mutation, are implemented. Comparisons are performed on benchmark data sets, with satisfactory and promising results.
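The scheme described in the abstract above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the authors' implementation: the function names (`solve`, `adjusted_r2`, `roulette`, `ga_select`, `make_data`), the single-point crossover, the population size, the mutation rate and the synthetic data are all assumptions made for the example. The normal equations are solved directly, which assumes the selected columns are not collinear.

```python
import random

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def adjusted_r2(X, y, mask):
    """Adjusted R^2 of the OLS fit (with intercept) on columns where mask == 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    n, k = len(y), len(cols)
    if k == 0 or n - k - 1 <= 0:
        return 0.0  # empty or over-parameterized model gets zero fitness
    Z = [[1.0] + [row[j] for j in cols] for row in X]
    ZtZ = [[sum(z[a] * z[b] for z in Z) for b in range(k + 1)] for a in range(k + 1)]
    Zty = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(k + 1)]
    beta = solve(ZtZ, Zty)
    rss = sum((yi - sum(bj * zj for bj, zj in zip(beta, z))) ** 2 for z, yi in zip(Z, y))
    ybar = sum(y) / n
    tss = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - (rss / (n - k - 1)) / (tss / (n - 1))

def roulette(pop, fits, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    weights = [max(f, 1e-9) for f in fits]  # keep every weight positive
    return rng.choices(pop, weights=weights, k=1)[0]

def ga_select(X, y, pop_size=30, generations=40, p_mut=0.05, seed=0):
    """Return the best binary chromosome evaluated and its adjusted R^2."""
    rng = random.Random(seed)
    p = len(X[0])  # number of candidate variables (assumed >= 2)
    pop = [[rng.randint(0, 1) for _ in range(p)] for _ in range(pop_size)]
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        fits = [adjusted_r2(X, y, c) for c in pop]
        for c, f in zip(pop, fits):
            if f > best_fit:
                best, best_fit = c[:], f
        nxt = []
        while len(nxt) < pop_size:
            a, b = roulette(pop, fits, rng), roulette(pop, fits, rng)
            cut = rng.randint(1, p - 1)           # single-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]  # bit-flip mutation
            nxt.append(child)
        pop = nxt
    return best, best_fit

def make_data(n=60, p=6, seed=1):
    """Synthetic data (assumption): only columns 0 and 3 influence y."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [3 * r[0] - 2 * r[3] + rng.gauss(0, 0.5) for r in X]
    return X, y
```

Calling `ga_select(*make_data())` returns a 0/1 list marking the retained variables together with the adjusted R-squared of the corresponding model. Using the adjusted rather than the plain R-squared matters here, because the plain R-squared never decreases when a variable is added and would push the search toward the full model.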

Author(s):  
Leila Emami ◽  
Razieh Sabet ◽  
Amirhossein Sakhteman ◽  
Mehdi Khoshnevis Zade

Type 2 diabetes mellitus (T2DM) is a metabolic disorder, and DPP-4 inhibitors are a class of oral hypoglycemic agents that block the dipeptidyl peptidase-4 (DPP-4) enzyme. DPP-4 inhibitors reduce glucagon and blood glucose levels and do not have side effects such as hypoglycemia or weight gain. In this paper, a series of imidazolopyrimidine amide analogues acting as DPP-4 inhibitors were selected for quantitative structure-activity relationship (QSAR) analysis and docking studies. A collection of chemometric methods, namely multiple linear regression (MLR), factor analysis-based multiple linear regression (FA-MLR), principal component regression (PCR), genetic algorithm for variable selection-MLR (GA-MLR), and partial least squares combined with genetic algorithm for variable selection (GA-PLS), were used to relate structural features to the DPP-4 inhibitory activity of a variety of imidazolopyrimidine amide derivatives. GA-PLS gave superior results with high statistical quality (R2 = 0.94 and Q2 = 0.80) for predicting the activity of the compounds. Docking studies confirm that compounds 15, 18, 25, 26, and 28 are good candidates as DPP-4 inhibitors.


Author(s):  
Paola Gramatica

At the end of her academic career, the author summarizes the main aspects of QSAR modeling, giving comments and suggestions based on her 23 years' experience in QSAR research on environmental topics. The focus is mainly on Multiple Linear Regression, particularly Ordinary Least Squares, using a Genetic Algorithm for variable selection from various theoretical molecular descriptors, but the comments can also be useful for other QSAR methods. The need for rigorous validation, including external validation, and for an applicability domain check to guarantee the predictivity and reliability of QSAR models is particularly highlighted. The approach discussed is the “predictive” one, based on chemometrics, and is usefully applied to the prioritization of environmental pollutants. All the discussed points and the author's ideas are implemented in the software QSARINS, as a legacy to the QSAR community.


1998 ◽  
Vol 6 (1) ◽  
pp. 333-339 ◽  
Author(s):  
Renato Guchardi ◽  
Paulo Augusto da Costa Filho ◽  
Ronei J. Poppi ◽  
Celio Pasquini

This paper describes a near infrared spectroscopic method developed for determination of ethanol and methyl tert-butyl ether (MTBE) as additives in gasoline. The methodology employs data collected from a near infrared spectrophotometer whose monochromator is an Acousto-Optic Tunable Filter (AOTF) operating in the 1500–2400 nm range. Genetic Algorithm variable selection was used in the multiple linear regression (MLR) modelling. Seven wavelengths were selected by the algorithm and the results obtained by MLR revealed that the method produces improved results, when compared with the PLS regression method, as confirmed by the lower RMSEP obtained for ethanol and MTBE determination. Besides the improvement achieved in the analytical results, the variable selection allows a reduction in the time necessary for data acquisition. This fact has special importance when AOTFs are being used as the monochromator element. The AOTF's capability of random access to the selected wavelengths can be employed to access the necessary information very rapidly, enabling the methodology to be used for in-line monitoring of fuel additives.


2015 ◽  
Vol 76 (13) ◽  
Author(s):  
Khoo Li Peng ◽  
Robiah Adnan ◽  
Maizah Hura Ahmad

In this study, the Leverage Based Near Neighbour–Robust Weighted Least Squares (LBNN-RWLS) method is proposed in order to estimate the standard error accurately in the presence of heteroscedastic errors and outliers in multiple linear regression. The data sets used in this study are generated by Monte Carlo simulation. They contain heteroscedastic errors and different percentages of outliers, with different sample sizes. The study found that LBNN-RWLS produces smaller standard errors than Ordinary Least Squares (OLS), Least Trimmed Squares (LTS) and Weighted Least Squares (WLS). This shows that LBNN-RWLS can estimate the standard error accurately even when heteroscedastic errors and outliers are present in the data sets.
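For orientation, two of the baselines named in the abstract above, OLS and WLS, can be sketched for the single-predictor case. This is plain weighted least squares, not the LBNN-RWLS method, whose weighting scheme is not described in the abstract. The function names `wls_line` and `ols_line` are hypothetical, and the weights are assumed to be supplied by the caller (for example, inverse error variances).

```python
def wls_line(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, b).

    w holds one non-negative weight per observation (assumed known).
    """
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return ybar - b * xbar, b

def ols_line(x, y):
    """Ordinary least squares is the special case of equal weights."""
    return wls_line(x, y, [1.0] * len(x))
```

With `x = [0, 1, 2, 3]` and `y = [0, 1, 2, 30]`, the last point acts as an outlier: `ols_line` is pulled strongly toward it, while `wls_line` with weights `[1, 1, 1, 0]` ignores it and recovers the line through the first three points. Robust weighting schemes differ in how such weights are chosen from the data.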


2021 ◽  
Author(s):  
J PRINCE JEROME CHRISTOPHER ◽  
K LINGADURAI ◽  
G SHANKAR

Genetic algorithms are search algorithms based on the mechanics of natural selection and natural genetics. In this paper, we investigate a novel approach to the binary coded testing process based on a genetic algorithm. The paper consists of two parts. The first part addresses the problem in the traditional way, using the decimal number system to define the fitness function and studying the variations of counts and of probability against the fitness function. In the second part, the initial populations are defined using binary coded digits (genes). To evaluate the high fitness function values, three genetic operators, namely reproduction, crossover and mutation, are applied at random. The results show the importance of the mutation operator, which yields the peak values of the fitness function based on binary coded numbers.
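The three operators named in the abstract above can be illustrated with a small binary-coded genetic algorithm. This is a sketch under stated assumptions, not the procedure used in the paper: the fitness function f(x) = x^2 on a 5-bit integer is a common textbook choice picked here for illustration, and the function names, population size (assumed even) and mutation rate are hypothetical.

```python
import random

def decode(bits):
    """Convert a list of binary genes (most significant first) to a decimal integer."""
    value = 0
    for bit in bits:
        value = value * 2 + bit
    return value

def fitness(bits):
    """Illustrative fitness (assumption): f(x) = x^2 on the decoded integer."""
    return decode(bits) ** 2

def reproduce(pop, rng):
    """Reproduction: roulette-wheel sampling proportional to fitness."""
    weights = [fitness(c) + 1e-9 for c in pop]  # small offset avoids all-zero weights
    return [c[:] for c in rng.choices(pop, weights=weights, k=len(pop))]

def crossover(a, b, rng):
    """Single-point crossover: swap the tails of two parents."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(bits, p_mut, rng):
    """Mutation: flip each gene independently with probability p_mut."""
    return [1 - g if rng.random() < p_mut else g for g in bits]

def run_ga(n_bits=5, pop_size=8, generations=30, p_mut=0.02, seed=0):
    """Return the best chromosome seen, its decoded value and its fitness."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        mating = reproduce(pop, rng)
        pop = []
        for i in range(0, pop_size - 1, 2):   # pair up the mating pool
            c1, c2 = crossover(mating[i], mating[i + 1], rng)
            pop += [mutate(c1, p_mut, rng), mutate(c2, p_mut, rng)]
        best = max(pop + [best], key=fitness)
    return best, decode(best), fitness(best)
```

Reproduction and crossover can only recombine genes already present in the population, so a bit position that is 0 in every chromosome stays 0 unless mutation flips it. That is one reason the mutation operator can be decisive for reaching the peak fitness value.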


1982 ◽  
Vol 58 (5) ◽  
pp. 213-219 ◽  
Author(s):  
Jean Beaulieu ◽  
Yvan J. Hardy

This paper presents a method of analysis which differentiates between spruce budworm-caused mortality and regular mortality of balsam fir in the Gatineau region of Quebec. A first attempt was made using multiple linear regression and a uniform random number generator. In order to overcome the bias inherent in the least squares method when dealing with a binary (0,1) dependent variable, a probit analysis was also conducted. In this case, the parameters and their variance were estimated using the maximum likelihood method. These two approaches proved to be equivalent when percent budworm-caused mortality was compared within the 1958 to 1979 period covered by the data at hand, while the outbreak lasted from 1968 to 1975. In 1979, approximately 55% of the stems had been killed by the budworm, accounting for 53% of the volume. Maple-yellow birch associations were more affected than fir associations, although no significant difference was found. Fir mortality was delayed by aerial spraying of insecticides, but this advantage disappeared as soon as the spray operations came to an end.

