The L-Curve Criterion as a Model Selection Tool in PLS Regression

2019 ◽  
Vol 2019 ◽  
pp. 1-7
Author(s):  
Abdelmounaim Kerkri ◽  
Jelloul Allal ◽  
Zoubir Zarrouk

Partial least squares (PLS) regression is an alternative to ordinary least squares (OLS) regression used in the presence of multicollinearity. As with any other modelling method, PLS regression requires a reliable model selection tool. Cross validation (CV) is the most commonly used tool, with many advantages in both precision and accuracy, but it also has some drawbacks; therefore, we use the L-curve criterion as an alternative, given that it takes into consideration the shrinking nature of PLS. A theoretical justification for the use of the L-curve criterion is presented, as well as an application on both simulated and real data. The application shows how this criterion generally outperforms cross validation and generalized cross validation (GCV) in mean squared prediction error and computational efficiency.
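The selection procedure the abstract describes can be sketched in a few lines: fit PLS with an increasing number of components, trace the (residual norm, coefficient norm) pairs, and pick the corner of the resulting L-curve. The NIPALS PLS1 routine, the maximum-curvature corner rule, and the synthetic collinear data below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def pls1(X, y, n_components):
    """PLS1 via the NIPALS algorithm; returns coefficients for centered data."""
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    W, P, Q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)
        t = Xk @ w
        p = Xk.T @ t / (t @ t)
        q = (yk @ t) / (t @ t)
        Xk = Xk - np.outer(t, p)       # deflate X
        yk = yk - q * t                # deflate y
        W.append(w); P.append(p); Q.append(q)
    W, P = np.array(W).T, np.array(P).T
    return W @ np.linalg.solve(P.T @ W, np.array(Q))

def l_curve_corner(res_norms, sol_norms):
    """Index of maximum curvature of the L-curve in log-log coordinates."""
    lx, ly = np.log(res_norms), np.log(sol_norms)
    dx, dy = np.gradient(lx), np.gradient(ly)
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    kappa = np.abs(dx * ddy - dy * ddx) / (dx**2 + dy**2) ** 1.5
    return int(np.argmax(kappa))

rng = np.random.default_rng(0)
n, p = 60, 10
latent = rng.normal(size=(n, 3))
X = latent @ rng.normal(size=(3, p)) + 0.05 * rng.normal(size=(n, p))  # collinear
y = latent @ np.array([2.0, -1.0, 0.5]) + 0.5 * rng.normal(size=n)

ks = list(range(1, 9))
res_norms, sol_norms = [], []
for k in ks:
    B = pls1(X, y, k)
    resid = (y - y.mean()) - (X - X.mean(axis=0)) @ B
    res_norms.append(np.linalg.norm(resid))
    sol_norms.append(np.linalg.norm(B))
best_k = ks[l_curve_corner(np.array(res_norms), np.array(sol_norms))]
```

On training data the residual norm is non-increasing in the number of components while the coefficient norm grows, which is what produces the characteristic L shape.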

Geophysics ◽  
2018 ◽  
Vol 83 (6) ◽  
pp. V345-V357 ◽  
Author(s):  
Nasser Kazemi

Given noise-corrupted seismic recordings, blind deconvolution simultaneously solves for the reflectivity series and the wavelet. Blind deconvolution can be formulated as a fully perturbed linear regression model and solved by the total least-squares (TLS) algorithm. However, this algorithm performs poorly when the data matrix is structured and ill-conditioned. In blind deconvolution, the data matrix has a Toeplitz structure and is ill-conditioned. Accordingly, we develop a fully automatic single-channel blind-deconvolution algorithm to improve the performance of the TLS method. The proposed algorithm, called Toeplitz-structured sparse TLS, makes no assumptions about the phase of the wavelet; however, it assumes that the reflectivity series is sparse. In addition, to reduce the model space and the number of unknowns, the algorithm benefits from the structural constraints on the data matrix. Our algorithm is an alternating minimization method and uses a generalized cross validation function to define the optimum regularization parameter automatically. Because the generalized cross validation function does not require any prior information about the noise level of the data, our approach is suitable for real-world applications. We validate the proposed technique using synthetic examples. In noise-free data, we achieve a near-optimal recovery of the wavelet and the reflectivity series. For noise-corrupted data with a moderate signal-to-noise ratio (S/N), the algorithm successfully accounts for the noise in its model, resulting in satisfactory performance. However, the results deteriorate as the S/N and the sparsity level of the data decrease. We also successfully apply the algorithm to real data. The real-data examples come from 2D and 3D data sets of the Teapot Dome seismic survey.
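The GCV-based choice of regularization parameter that the algorithm relies on can be illustrated for a plain Tikhonov-regularized least-squares subproblem, where the score n·RSS/tr(I − H)² has a closed form via the SVD. This is a hedged sketch on a synthetic ill-conditioned matrix, not the authors' Toeplitz-structured solver:

```python
import numpy as np

def gcv_ridge(A, b, lambdas):
    """GCV score for Tikhonov-regularized LS, computed via the SVD of A."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    beta = U.T @ b
    n = len(b)
    out_of_span = b @ b - beta @ beta   # ||(I - U U^T) b||^2, unaffected by lambda
    scores = []
    for lam in lambdas:
        f = s**2 / (s**2 + lam)                        # Tikhonov filter factors
        rss = np.sum(((1 - f) * beta) ** 2) + out_of_span
        scores.append(n * rss / (n - np.sum(f)) ** 2)  # n*RSS / tr(I - H)^2
    scores = np.array(scores)
    return lambdas[int(np.argmin(scores))], scores

rng = np.random.default_rng(1)
A = rng.normal(size=(80, 40)) @ np.diag(np.logspace(0, -4, 40))  # ill-conditioned
x_true = rng.normal(size=40)
b = A @ x_true + 0.01 * rng.normal(size=80)
lam_best, scores = gcv_ridge(A, b, np.logspace(-8, 2, 50))
```

As the abstract notes, no estimate of the noise level enters the score, which is what makes GCV attractive for field data.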


2014 ◽  
Vol 70 (5) ◽  
Author(s):  
Nor Fazila Rasaruddin ◽  
Mas Ezatul Nadia Mohd Ruah ◽  
Mohamed Noor Hasan ◽  
Mohd Zuli Jaafar

This paper shows the determination of the iodine value (IV) of pure and frying palm oils using partial least squares (PLS) regression with variable selection. A total of 28 samples of pure and frying palm oils were acquired from markets; seven were considered high-priced palm oils while the rest were low-priced. PLS regression models were developed for the determination of IV using Fourier transform infrared (FTIR) spectra in absorbance mode in the range from 650 cm-1 to 4000 cm-1. A Savitzky-Golay derivative was applied before developing the prediction models. The models were constructed using wavelengths selected in the FTIR region by adopting the selectivity ratio (SR) plot and the correlation coefficient with the IV parameter. Each model was validated through the root mean square error of cross validation (RMSECV) and the cross validation correlation coefficient, R2cv. The best model using the SR plot was the model with mean centring for the pure samples and the model with a combination of row scaling and standardization for the frying samples. The best model using correlation-coefficient variable selection was the model with a combination of row scaling and standardization for the pure samples and the model with mean-centring pre-processing for the frying samples. It is not necessary to row-scale the variables when developing the model, since the effect of row scaling on model quality is insignificant.
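Correlation-coefficient variable selection of the kind used here can be sketched as ranking spectral variables by their absolute correlation with the response and keeping the strongest ones; the helper name and the synthetic "spectra" below are illustrative assumptions, not the study's code:

```python
import numpy as np

def select_by_correlation(X, y, top_k):
    """Rank spectral variables by |corr(x_j, y)| and keep the top_k."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(r))[:top_k]

rng = np.random.default_rng(2)
y = rng.normal(size=100)                      # stand-in for iodine values
X = rng.normal(size=(100, 50))                # stand-in for absorbance spectra
X[:, 7] = y + 0.05 * rng.normal(size=100)     # one informative "wavelength"
selected = select_by_correlation(X, y, top_k=5)
```

The selected column indices would then be passed to the PLS fit in place of the full spectral range.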


Author(s):  
Andrea Tri Rian Dani ◽  
Ludia Ni'matuzzahroh

The truncated spline estimator is a nonparametric regression approach that can be used when the pattern of the relationship between the response variable and the predictor variables is not known with certainty. The truncated spline estimator is highly flexible in the modelling process. This study aims to model the percentage of poor population in the regencies/municipalities of West Java Province using a nonparametric regression model with a truncated spline estimator. The estimation method used is ordinary least squares (OLS). The goodness-of-fit criterion used for the nonparametric regression model is generalized cross-validation (GCV). Based on the analysis, the best truncated spline nonparametric regression model is the model with 3 knots, with a minimum GCV value of 2.14. Based on the hypothesis tests, both simultaneous and partial, the predictor variables used in this study have a significant effect on the percentage of poor population, with a coefficient of determination of 95.33%.
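The GCV criterion used to choose the knots can be sketched for a degree-one truncated spline fitted by OLS; the basis construction and the toy kinked signal below are illustrative assumptions, not the study's code:

```python
import numpy as np

def truncated_basis(x, knots, degree=1):
    """Design matrix of a truncated spline: polynomial plus (x - k)_+^degree terms."""
    cols = [x ** d for d in range(degree + 1)]
    cols += [np.maximum(x - k, 0.0) ** degree for k in knots]
    return np.column_stack(cols)

def gcv_score(x, y, knots):
    """GCV = n * RSS / (n - tr(H))^2 for the OLS truncated-spline fit."""
    B = truncated_basis(x, knots)
    H = B @ np.linalg.pinv(B)          # hat matrix
    resid = y - H @ y
    n = len(y)
    return n * (resid @ resid) / (n - np.trace(H)) ** 2

x = np.linspace(0.0, 1.0, 50)
y = np.abs(x - 0.5)                    # piecewise-linear signal with a kink at 0.5
score_knot = gcv_score(x, y, [0.5])    # knot at the true kink
score_line = gcv_score(x, y, [])       # plain linear fit
```

Scanning candidate knot sets and keeping the one with minimum GCV is the selection step the abstract describes (there with 3 knots).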


2015 ◽  
Vol 2015 ◽  
pp. 1-6 ◽  
Author(s):  
Jie Yu Chen ◽  
Han Zhang ◽  
Jinkui Ma ◽  
Tomohiro Tuchiya ◽  
Yelian Miao

This rapid method for determining the degree of degradation of frying rapeseed oils uses Fourier-transform infrared (FTIR) spectroscopy combined with partial least-squares (PLS) regression. One hundred and fifty-six frying oil samples that were degraded to different degrees by frying potatoes were scanned by an FTIR spectrometer using attenuated total reflectance (ATR). PLS regression with full cross validation was used for the prediction of acid value (AV) and total polar compounds (TPC) based on raw, first-derivative, and second-derivative FTIR spectra (4000–650 cm−1). The precise calibration model based on the second-derivative FTIR spectra shows that the coefficient of determination for calibration (R2) and the standard error of cross validation (SECV) were 0.99 and 0.16 mg KOH/g for AV, and 0.98 and 1.17% for TPC, respectively. The accuracy of the calibration model, tested using the validation set, yielded standard errors of prediction (SEP) of 0.16 mg KOH/g and 1.10% for AV and TPC, respectively. Therefore, the degradation of frying oils can be accurately measured using FTIR spectroscopy combined with PLS regression.
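The full (leave-one-out) cross validation behind SECV admits a well-known shortcut for linear calibrations: the PRESS statistic computed from the hat-matrix diagonal, with SECV = sqrt(PRESS/n). The sketch below applies it to an ordinary least-squares calibration on synthetic data as a simplified stand-in for the PLS case:

```python
import numpy as np

def press(B, y):
    """Leave-one-out PRESS for an OLS fit of y on design matrix B,
    using the hat-matrix shortcut e_i / (1 - h_ii)."""
    H = B @ np.linalg.pinv(B)
    e = y - H @ y
    h = np.diag(H)
    return np.sum((e / (1.0 - h)) ** 2)

rng = np.random.default_rng(3)
B = np.column_stack([np.ones(40), rng.normal(size=(40, 3))])
y = B @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=40)
rss = np.sum((y - B @ np.linalg.pinv(B) @ y) ** 2)
p_stat = press(B, y)
secv = np.sqrt(p_stat / len(y))
```

PRESS is always at least the in-sample RSS, since each held-out residual e_i/(1 − h_ii) inflates the fitted residual.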


2021 ◽  
Author(s):  
Martin Emil Jakobsen ◽  
Jonas Peters

Abstract While causal models are robust in that they are prediction optimal under arbitrarily strong interventions, they may not be optimal when the interventions are bounded. We prove that the classical K-class estimator satisfies such optimality by establishing a connection between K-class estimators and anchor regression. This connection further motivates a novel estimator in instrumental variable settings that minimizes the mean squared prediction error subject to the constraint that the estimator lies in an asymptotically valid confidence region of the causal coefficient. We call this estimator PULSE (p-uncorrelated least squares estimator), relate it to work on invariance, show that it can be computed efficiently as a data-driven K-class estimator even though the underlying optimization problem is non-convex, and prove consistency. We evaluate the estimators on real data and perform simulation experiments illustrating that PULSE exhibits less variability. There are several settings, including weak-instrument settings, where it outperforms other estimators.
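The K-class family the abstract builds on is short to write down: beta(k) = (X'(I − k·M_Z)X)⁻¹ X'(I − k·M_Z)y with M_Z the annihilator of the instruments, so k = 0 recovers OLS and k = 1 recovers TSLS. The toy instrumental-variable design below is an illustrative sketch, not the authors' PULSE implementation:

```python
import numpy as np

def k_class(X, Z, y, k):
    """K-class estimator; uses I - k*M_Z = (1-k)*I + k*P_Z."""
    Pz = Z @ np.linalg.pinv(Z)                      # projection onto instruments
    W = (1.0 - k) * np.eye(len(y)) + k * Pz
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

rng = np.random.default_rng(4)
n = 200
Z = rng.normal(size=(n, 2))                         # instruments
X = Z @ np.array([[1.0], [0.5]]) + rng.normal(size=(n, 1))
y = X @ np.array([2.0]) + rng.normal(size=n)        # true coefficient 2

beta_ols = k_class(X, Z, y, 0.0)                    # k = 0: OLS
beta_tsls = k_class(X, Z, y, 1.0)                   # k = 1: TSLS
```

PULSE corresponds to choosing k from the data so that the estimate stays inside a confidence region for the causal coefficient; that selection step is omitted here.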


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Made Ayu Dwi Octavanny ◽  
I Nyoman Budiantara ◽  
Heri Kuswanto ◽  
Dyah Putri Rahmawati

We introduce a new method for estimating the nonparametric regression curve for longitudinal data. This method combines two estimators: truncated spline and Fourier series. This estimation is completed by minimizing the penalized weighted least squares and weighted least squares. This paper also provides the properties of the new mixed estimator, which are biased and linear in the observations. The best model is selected using the smallest value of generalized cross-validation. The performance of the new method is demonstrated by a simulation study with a variety of time points. Then, the proposed approach is applied to a stroke patient dataset. The results show that simulated data and real data yield consistent findings.
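The combined design behind such a mixed estimator can be sketched directly: truncated-spline columns for the trend component plus sine/cosine columns for the periodic component, fitted here by plain least squares. The knot location, number of harmonics, and toy signal are illustrative assumptions; the paper's penalized weighted fit is not reproduced:

```python
import numpy as np

def mixed_basis(t, knots, n_harmonics):
    """Columns: intercept, linear trend, truncated-spline terms, Fourier terms."""
    cols = [np.ones_like(t), t]
    cols += [np.maximum(t - k, 0.0) for k in knots]
    for j in range(1, n_harmonics + 1):
        cols += [np.cos(2 * np.pi * j * t), np.sin(2 * np.pi * j * t)]
    return np.column_stack(cols)

t = np.linspace(0.0, 1.0, 120)
f_true = 0.8 * np.maximum(t - 0.4, 0.0) + 0.3 * np.sin(2 * np.pi * t)
rng = np.random.default_rng(5)
y = f_true + 0.05 * rng.normal(size=t.size)

B = mixed_basis(t, knots=[0.4], n_harmonics=2)      # 1+1+1+4 = 7 columns
coef, *_ = np.linalg.lstsq(B, y, rcond=None)
fit = B @ coef
```

In the paper's setting the number of knots and harmonics would themselves be chosen by minimizing GCV over candidate combinations.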


2018 ◽  
Vol 7 (6) ◽  
pp. 33
Author(s):  
Morteza Marzjarani

Selecting a proper model for a data set is a challenging task. In this article, an attempt was made to find a suitable model for a given data set. A general linear model (GLM) was introduced along with three different methods for estimating the parameters of the model: ordinary least squares (OLS), generalized least squares (GLS), and feasible generalized least squares (FGLS). In the case of GLS, two different weights were selected for reducing the severity of heteroscedasticity, and the proper weight(s) were deployed. The third weight was selected through the application of FGLS. Analyses showed that only two of the three weights, including the FGLS weight, were effective in reducing the severity of heteroscedasticity. In addition, each data set was divided into training, validation, and testing sets, producing a more reliable set of estimates for the parameters in the model. Partitioning data is a relatively new approach in statistics borrowed from the field of machine learning. Stepwise and forward selection methods, along with a number of statistics including the average squared error (ASE) for testing, adjusted R-squared, AIC, AICc, and the ASE for validation, together with proper hierarchies, were deployed to select a more appropriate model(s) for a given data set. Furthermore, the response variable in both data files was transformed using the Box-Cox method to meet the assumption of normality; the logarithmic transformation solved this issue in a satisfactory manner. Since the issues of heteroscedasticity, model selection, and partitioning of data have not been addressed in fisheries, for introduction and demonstration purposes the 2015 and 2016 shrimp data from the Gulf of Mexico (GOM) were selected and the above methods were applied to these data sets. In conclusion, some variations of the GLM were identified as possible leading candidates for the above data sets.
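A minimal two-step FGLS of the kind compared here can be sketched as: fit OLS, regress the log squared residuals on the design to estimate a variance function, then reweight. The log-linear skedastic form and the simulated heteroscedastic data below are assumptions for illustration, not the article's shrimp-data analysis:

```python
import numpy as np

def fgls(X, y):
    """Two-step FGLS with a log-linear variance model for the OLS residuals."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    log_e2 = np.log((y - X @ beta_ols) ** 2 + 1e-12)   # guard against log(0)
    gamma, *_ = np.linalg.lstsq(X, log_e2, rcond=None)  # skedastic regression
    w = np.exp(-X @ gamma)                              # inverse estimated variances
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)          # weighted LS solve

rng = np.random.default_rng(6)
n = 2000
x = rng.uniform(0.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])
sigma = np.exp(0.8 * x)                                 # variance grows with x
y = 1.0 + 2.0 * x + sigma * rng.normal(size=n)
beta = fgls(X, y)
```

Under heteroscedasticity FGLS remains consistent like OLS but is asymptotically more efficient when the variance model is roughly right.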


2018 ◽  
Vol 7 (4) ◽  
pp. 339
Author(s):  
GEDE ARY PRABHA YOGESSWARA ◽  
EKA N. KENCANA ◽  
KOMANG GDE SUKARSA

Partial least squares (PLS) regression and the least absolute shrinkage and selection operator (LASSO) are regression techniques used to overcome problems that cannot be solved by ordinary least squares (OLS). The purpose of this research is to model and compare the performance of PLS regression and LASSO on diabetes mellitus study data, divided into 30 redundant data groups as an example of microarray data. The survival time of diabetes mellitus patients serves as the dependent variable, while age, sex, body mass index, blood pressure, and six blood serum measurements serve as independent variables. Using a paired-sample t-test on the adjusted R2 values, this research concluded that the mean adjusted R2 of PLS regression is smaller than that of LASSO; in other words, LASSO performs better than PLS regression.
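The comparison rests on two small pieces of machinery: adjusted R² per fitted model and a paired-sample t statistic over the 30 groups. Both fit in a few lines; the toy numbers below are illustrative, not the study's results:

```python
import numpy as np

def adjusted_r2(y, yhat, n_params):
    """Adjusted R^2 for a model with n_params predictors."""
    n = len(y)
    r2 = 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_params - 1)

def paired_t(a, b):
    """Paired-sample t statistic for H0: mean(a - b) = 0."""
    d = np.asarray(a) - np.asarray(b)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

# hypothetical adjusted-R^2 values for PLS vs LASSO on three data groups
t_stat = paired_t([0.70, 0.72, 0.68], [0.75, 0.78, 0.74])
```

A strongly negative t statistic here would indicate, as in the study, that the first method's mean adjusted R² is smaller than the second's.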


2021 ◽  
Vol 6 (11) ◽  
pp. 11850-11878
Author(s):  
SidAhmed Benchiha ◽  
◽  
Amer Ibrahim Al-Omari ◽  
Naif Alotaibi ◽  
Mansour Shrahili ◽  
...  

Recently, a new lifetime distribution known as the generalized Quasi Lindley distribution (GQLD) was suggested. In this paper, we modify the GQLD and suggest a two-parameter lifetime distribution called the weighted generalized Quasi Lindley distribution (WGQLD). The main mathematical properties of the WGQLD, including the moments, coefficient of variation, coefficient of skewness, coefficient of kurtosis, stochastic ordering, median deviation, harmonic mean, and reliability functions, are derived. The model parameters are estimated using the ordinary least squares, weighted least squares, maximum likelihood, maximum product of spacings, Anderson-Darling, and Cramer-von Mises methods. The performances of the proposed estimators are compared based on numerical calculations for various values of the distribution parameters and sample sizes in terms of the mean squared error (MSE) and estimated values (Es). To demonstrate the applicability of the new model, four real data sets, consisting of COVID-19 infection cases in Algeria and Saudi Arabia, carbon fibers, and rainfall, are analyzed for illustration. It turns out that the WGQLD fits empirically better than the other competing distributions considered in this study.
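Among the estimation methods listed, the ordinary least squares method fits distribution parameters by matching the model CDF to the plotting positions i/(n+1) at the order statistics. Since the WGQLD's CDF is not reproduced here, the sketch below uses an exponential distribution as a stand-in; the grid search and simulated data are illustrative assumptions:

```python
import numpy as np

def ols_distribution_fit(x, cdf, grid):
    """OLS parameter estimation: minimize the squared distance between the
    model CDF at the order statistics and the plotting positions i/(n+1)."""
    x = np.sort(x)
    n = len(x)
    p = np.arange(1, n + 1) / (n + 1)
    losses = [np.sum((cdf(x, theta) - p) ** 2) for theta in grid]
    return grid[int(np.argmin(losses))]

rng = np.random.default_rng(7)
sample = rng.exponential(scale=0.5, size=2000)        # true rate = 2
exp_cdf = lambda x, lam: 1.0 - np.exp(-lam * x)
rate_hat = ols_distribution_fit(sample, exp_cdf, np.linspace(0.1, 10.0, 500))
```

The weighted least squares variant simply reweights each squared deviation by the inverse variance of the corresponding order statistic of the uniform distribution.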

