Least Square Regression for Prediction Problems in Machine Learning using R

2018 ◽  
Vol 7 (3.12) ◽  
pp. 960
Author(s):  
Anila. M ◽  
G Pradeepini

Ordinary Least Squares (OLS) regression is the most commonly used prediction technique, with applications in statistics, finance, medicine, psychology, and economics. Many practitioners, especially data scientists, apply it without sufficient training and without checking why and when it can or cannot be applied. It is not easy to explain why least squares regression [1] has faced so much criticism when applied in practice. In this paper, we first review the fundamentals of linear regression and OLS regression, along with the reasons for the popularity of the least squares method; we then present our analysis of the difficulties and pitfalls that arise when the OLS method is applied; finally, we describe some techniques for overcoming these problems.
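As a minimal illustration of the fundamentals reviewed above (simulated data, not from the paper), an OLS fit can be computed with NumPy, together with the design-matrix condition number, one diagnostic for the numerical pitfalls the abstract alludes to:

```python
import numpy as np

# Illustrative sketch: fit y = b0 + b1*x by ordinary least squares on
# simulated data, then inspect the condition number of the design matrix,
# where large values signal near-collinearity problems.
rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 + 3.0 * x + rng.normal(0, 1, n)          # true intercept 2, slope 3

X = np.column_stack([np.ones(n), x])             # add intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares solution

cond = np.linalg.cond(X)                         # ill-conditioning diagnostic
```

With a well-conditioned design, `beta` recovers the simulated intercept and slope closely.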

Author(s):  
Chisimkwuo John ◽  
Chukwuemeka O. Omekara ◽  
Godwin Okwara

An indicative feature of a principal component analysis (PCA) variant for a multivariate data set is its ability to transform correlated, linearly dependent variables into linearly independent principal components. Back-transforming these components, with the samples and variables approximated on a single calibrated plot, gives rise to the PCA biplot. In this work, the predictive property of the PCA biplot was employed in the visualization of anthropometric measurements, namely weight (kg), height (cm), skinfold (cm), arm muscle circumference AMC (cm), and mid-upper arm circumference MUAC (cm), collected from students of the School of Nursing and Midwifery, Federal Medical Center (FMC), Umuahia, Nigeria. The adequacy and quality of the PCA biplot were calculated, and the predicted samples were then compared with ordinary least squares (OLS) regression predictions, since both predictions make use of an indicative minimization of the error sum of squares. The results suggest that PCA biplot prediction merits further consideration when handling correlated multivariate data sets, as its predictions, with a mean square error (MSE) of 0.00149, appear better than the OLS regression predictions, with an MSE of 29.452.
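A hypothetical sketch of the comparison described above (invented data, since the anthropometric measurements themselves are not public): approximate a correlated multivariate data set from its leading principal components, which is the basis of PCA-biplot prediction, and compare the reconstruction error with an OLS prediction of one column from the others:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Simulated stand-in for correlated anthropometric measurements: five
# variables driven by one latent factor plus small noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 1))
data = latent @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(200, 5))

# PCA-style prediction: reconstruct the data from 2 principal components.
pca = PCA(n_components=2).fit(data)
recon = pca.inverse_transform(pca.transform(data))
mse_pca = np.mean((data[:, 0] - recon[:, 0]) ** 2)

# OLS prediction of the first variable from the remaining four.
ols = LinearRegression().fit(data[:, 1:], data[:, 0])
mse_ols = np.mean((data[:, 0] - ols.predict(data[:, 1:])) ** 2)
```

Both error measures are computed the same way, so the comparison mirrors the one reported in the abstract, without reproducing its specific MSE values.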


Author(s):  
Hector Donaldo Mata ◽  
Mohammed Hadi ◽  
David Hale

Transportation agencies utilize key performance indicators (KPIs) to measure the performance of their traffic networks and business processes. To make effective decisions based on these KPIs, there is a need to align the KPIs at the strategic, tactical, and operational decision levels and to set targets for these KPIs. However, there has been no known effort to develop methods that ensure this alignment by producing a correlative model exploring the relationships between levels to support the derivation of KPI targets. Such development will lead to more realistic target setting and effective decisions based on these targets, ensuring that agency goals are met subject to the available resources. This paper presents a methodology in which the KPIs are represented in a tree-like structure that can be used to depict the association between metrics at the strategic, tactical, and operational levels. Utilizing a combination of business intelligence and machine learning tools, this paper demonstrates that it is possible not only to identify such relationships but also to quantify them. The proposed methodology compares the effectiveness and accuracy of multiple machine learning models, including ordinary least squares (OLS) regression, least absolute shrinkage and selection operator (LASSO), and ridge regression, for the identification and quantification of interlevel relationships. The output of the model allows the identification of which metrics have more influence on the upper-level KPI targets. The analysis can be performed at the system, facility, and segment levels, providing important insights on what investments are needed to improve system performance.
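The three model families the paper compares can be fitted side by side with scikit-learn; the sketch below uses invented KPI-style data (column meanings and penalty strengths are assumptions, not the paper's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error

# Synthetic data: a strategic-level target driven by two of six
# lower-level metrics, mimicking an interlevel KPI relationship.
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 6))                    # operational/tactical metrics
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=150)

models = {
    "OLS": LinearRegression(),
    "LASSO": Lasso(alpha=0.1),   # alpha values are illustrative choices
    "Ridge": Ridge(alpha=1.0),
}
mse = {name: mean_squared_error(y, m.fit(X, y).predict(X))
       for name, m in models.items()}
```

Inspecting the fitted coefficients (e.g. `models["LASSO"].coef_`) shows which metrics most influence the upper-level target, which is the use the paper makes of these models.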


2009 ◽  
Vol 2009 ◽  
pp. 1-8 ◽  
Author(s):  
Janet Myhre ◽  
Daniel R. Jeske ◽  
Michael Rennie ◽  
Yingtao Bi

A heteroscedastic linear regression model is developed from plausible assumptions that describe the time evolution of performance metrics for equipment. The inherent motivation for the related weighted least squares analysis of the model is an essential and attractive selling point to engineers with interest in equipment surveillance methodologies. A simple test for the significance of the heteroscedasticity suggested by a data set is derived, and a simulation study is used to evaluate the power of the test and compare it with several other applicable tests that were designed under different contexts. Tolerance intervals within the context of the model are derived, thus generalizing well-known tolerance intervals for ordinary least squares regression. Use of the model and its associated analyses is illustrated with an aerospace application where hundreds of electronic components are continuously monitored by an automated system that flags components suspected of unusual degradation patterns.
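A minimal weighted-least-squares sketch in the spirit of the abstract, under the assumption (for illustration only, not the paper's exact variance model) that the error standard deviation grows linearly with the regressor, so the weights are proportional to the inverse variance:

```python
import numpy as np

# Simulated heteroscedastic data: variance of the noise increases with x.
rng = np.random.default_rng(3)
n = 200
x = rng.uniform(1, 10, n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.3 * x)    # sd proportional to x

X = np.column_stack([np.ones(n), x])
w = 1.0 / x**2                                   # inverse-variance weights
W = np.diag(w)

# WLS normal equations: (X' W X) beta = X' W y
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```

With correctly specified weights, the WLS estimator regains the minimum-variance property that plain OLS loses under heteroscedasticity.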


Author(s):  
Enivaldo C. Rocha ◽  
Dalson Britto Figueiredo Filho ◽  
Ranulfo Paranhos ◽  
José Alexandre Silva Jr. ◽  
Denisson Silva

This paper presents an active classroom exercise focusing on the interpretation of ordinary least squares regression coefficients. Methodologically, undergraduate students analyze Brazilian soccer data and formulate and test a classical hypothesis regarding home-team advantage. Technically, our framework is easily adapted to other sports and has no implementation cost. In addition, the exercise is easily conducted by the instructor and highly enjoyable for the students. The intuitive approach also facilitates understanding of the practical application of linear regression.
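A hypothetical version of the exercise (simulated results, not the Brazilian soccer data used in the paper): regress goal difference on a home-field indicator, so the indicator's coefficient is directly interpretable as the average home advantage:

```python
import numpy as np

# Simulated matches: goal difference is on average 0.4 higher at home.
rng = np.random.default_rng(4)
home = rng.integers(0, 2, 300)                   # 1 = home game, 0 = away
goal_diff = 0.4 * home + rng.normal(scale=1.2, size=300)

X = np.column_stack([np.ones(300), home])
beta, *_ = np.linalg.lstsq(X, goal_diff, rcond=None)
home_advantage = beta[1]   # expected extra goal difference when at home
```

The coefficient on the dummy variable is the difference in mean outcomes between home and away games, which is the kind of plain-language interpretation the exercise targets.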


2021 ◽  
Vol 2107 (1) ◽  
pp. 012058
Author(s):  
Sukhairi Sudin ◽  
Azizi Naim Abdul Aziz ◽  
Fathinul Syahir Ahmad Saad ◽  
Nurul Syahirah Khalid ◽  
Ismail Ishaq Ibrahim

Abstract This project examined the influence of cadence, speed, heart rate, and power on cycling performance using a Garmin Edge 1000. Any change in cadence affects the speed, heart rate, and power of a novice cyclist, and the pattern of changes was observed through mobile devices running the Garmin Connect application. All results were recorded for the next task: analysis of the collected data using a machine learning algorithm, namely regression analysis. Regression analysis is a statistical method for modelling the connection between one or more independent variables and a dependent (target) variable, and it is required to answer this type of prediction problem in machine learning. Regression is a supervised learning technique that aids in the discovery of variable correlations and allows the prediction of a continuous output variable based on one or more predictor variables. A total of forty days' worth of events were captured in the dataset. Cadence acts as the dependent variable (y), while speed, heart rate, and power act as the independent variables (x) in the prediction of cycling performance. Linear regression with only one input variable (x) is defined as simple linear regression; when there are several input variables, it is referred to as multiple linear regression. The research uses a linear regression technique to predict cycling performance based on cadence analysis. The linear regression algorithm models a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name; because the relationship is linear, it determines how the value of the dependent variable changes as the value of an independent variable changes. This analysis uses the mean squared error (MSE) cost function for linear regression, which is the average of the squared errors between predicted and actual values. The value of R-squared was also recorded in this project.
A low R-squared value means that the independent variables do not describe much of the variation in the dependent variable: regardless of variable importance, it indicates that the chosen independent variables, although meaningful, are not responsible for much of the variance in the dependent variable's mean. Using multiple regression, the R-squared value in this project is considered acceptable because it exceeds 0.7; since the project is based on human behaviour, where R-squared values rarely exceed 0.3 when human factors are involved, this result is notable.
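The fit statistics described above can be sketched on simulated data (the Garmin data set itself is not public, so the variables and coefficients here are assumptions): fit a simple linear regression, then compute MSE and R-squared from the residuals:

```python
import numpy as np

# Simulated forty-day data set: speed modelled as a linear function of cadence.
rng = np.random.default_rng(5)
cadence = rng.uniform(60, 100, 40)               # rpm, one value per day
speed = 0.3 * cadence + rng.normal(scale=1.0, size=40)

X = np.column_stack([np.ones(40), cadence])
beta, *_ = np.linalg.lstsq(X, speed, rcond=None)
pred = X @ beta

mse = np.mean((speed - pred) ** 2)               # MSE cost function
r2 = 1 - np.sum((speed - pred) ** 2) / np.sum((speed - speed.mean()) ** 2)
```

R-squared is the share of variance in the dependent variable explained by the model, which is the quantity the abstract uses to judge whether the fit is acceptable.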


2014 ◽  
Vol 71 (1) ◽  
Author(s):  
Bello Abdulkadir Rasheed ◽  
Robiah Adnan ◽  
Seyed Ehsan Saffari ◽  
Kafi Dano Pati

In a linear regression model, the ordinary least squares (OLS) method is considered the best method to estimate the regression parameters if the assumptions are met. However, if the data do not satisfy the underlying assumptions, the results will be misleading. Violation of the constant-variance assumption in least squares regression is caused by the presence of outliers and heteroscedasticity in the data. This assumption of constant variance (homoscedasticity) is very important in linear regression, as it ensures the least squares estimators enjoy the property of minimum variance. Therefore, a robust regression method is required to handle the problem of outliers in the data. This research uses weighted least squares (WLS) techniques to estimate the regression coefficients when the error-variance assumption is violated. WLS estimation is equivalent to carrying out OLS on transformed variables; however, WLS can easily be affected by outliers. To remedy this, we suggest a robust technique for the estimation of regression parameters in the presence of heteroscedasticity and outliers. We apply robust M-estimation using iteratively reweighted least squares (IRWLS) with the Huber and Tukey bisquare functions, together with the resistant least trimmed squares regression estimator, to estimate the parameters of a model of state-wide crime in the United States in 1993. The results indicate that the estimators obtained from the M-estimation techniques and the least trimmed squares method are more effective than those obtained from OLS.
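A sketch of Huber M-estimation by IRWLS on simulated data with injected outliers (the crime data set and tuning details are the paper's; the constants and loop below are standard textbook choices, not taken from it):

```python
import numpy as np

def huber_weights(r, c=1.345):
    """Huber weight function: 1 inside the threshold c, c/|r| outside."""
    a = np.abs(r)
    return np.where(a <= c, 1.0, c / np.maximum(a, 1e-12))

# Simulated data with 5% gross outliers in the response.
rng = np.random.default_rng(6)
n = 100
x = rng.uniform(0, 5, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[:5] += 20.0                                    # inject outliers

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]      # OLS starting values
for _ in range(20):                              # IRWLS iterations
    resid = y - X @ beta
    s = np.median(np.abs(resid - np.median(resid))) / 0.6745  # robust scale
    w = huber_weights(resid / s)
    WX = X * w[:, None]
    beta = np.linalg.solve(X.T @ WX, X.T @ (w * y))
```

Unlike OLS, whose slope is dragged by the injected outliers, the downweighting in each IRWLS pass keeps the robust slope near the true value of 2.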


Author(s):  
Scott Alfeld ◽  
Ara Vartanian ◽  
Lucas Newman-Johnson ◽  
Benjamin I.P. Rubinstein

While machine learning systems are known to be vulnerable to data-manipulation attacks at both training and deployment time, little is known about how to adapt attacks when the defender transforms data prior to model estimation. We consider the setting where the defender Bob first transforms the data then learns a model from the result; Alice, the attacker, perturbs Bob’s input data prior to him transforming it. We develop a general-purpose “plug and play” framework for gradient-based attacks based on matrix differentials, focusing on ordinary least-squares linear regression. This allows learning algorithms and data transformations to be paired and composed arbitrarily: attacks can be adapted through the use of the chain rule—analogous to backpropagation on neural network parameters—to compositional learning maps. Best-response attacks can be computed through matrix multiplications from a library of attack matrices for transformations and learners. Our treatment of linear regression extends state-of-the-art attacks at training time, by permitting the attacker to affect both features and targets optimally and simultaneously. We explore several transformations broadly used across machine learning with a driving motivation for our work being autoregressive modeling. There, Bob transforms a univariate time series into a matrix of observations and vector of target values which can then be fed into standard learners. Under this learning reduction, a perturbation from Alice to a single value of the time series affects features of several data points along with target values.
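The learning reduction in the final sentences can be sketched concretely (the helper below is illustrative, not the paper's implementation): a univariate series becomes a lagged design matrix and target vector, so one perturbed value appears in several rows' features and in one target.

```python
import numpy as np

def make_lag_matrix(series, p):
    """Turn a univariate series into p-lag features X and targets y."""
    series = np.asarray(series, dtype=float)
    X = np.column_stack([series[i : len(series) - p + i] for i in range(p)])
    y = series[p:]
    return X, y

series = np.arange(10.0)
X, y = make_lag_matrix(series, p=3)
# X[0] = [0, 1, 2] and y[0] = 3; the single value series[3] is the target
# of row 0 and a feature of rows 1, 2, and 3 -- which is why one
# perturbation by Alice touches several data points at once.
```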


Author(s):  
Mohammad Bayu Moha ◽  
Anderson Guntur Kumenaung ◽  
Debby Christina Rotinsulu

Abstract Local Revenue (Pendapatan Asli Daerah, PAD) is one of the major revenue components of a local government in supporting its budget: the higher the level of income owned by a region, the higher its level of independence, and the better it can maximize budget allocation for the development of leading sectors. The Special Allocation Fund (Dana Alokasi Khusus, DAK) is a source of local revenue that can increase local assets and, in aggregate, increase revenue through growth of the economic resources owned. This study used ordinary least squares with multiple regression analysis; the t-test and F-test results showed that PAD has a positive and significant effect on capital expenditure, while DAK does not have a significant influence. The R-squared test yielded 82.7%, meaning that the combined influence of PAD and DAK on capital expenditure is 82.7 percent, while the rest is influenced by other variables.
Keywords: Local Revenue, the Special Allocation Fund, Capital Expenditure


Author(s):  
Changhyo Yi ◽  
Kijung Kim

This study aimed to ascertain the applicability of a machine learning approach to the description of residential mobility patterns of households in the Seoul metropolitan region (SMR). The spatial range and temporal scope of the empirical study were set to 2015 to review the most recent residential mobility patterns in the SMR. The analysis data used in this study involve the microdata of Internal Migration Statistics provided by the Microdata Integrated Service of Statistics Korea. We analysed the residential relocation distance of households in the SMR by using machine learning techniques such as ordinary least squares regression and decision tree regression. The results of this study showed that a decision tree model can be more advantageous than ordinary least squares regression in terms of explanatory power and the estimation of moving distance. A large number of residential movements are mainly related to accessibility to employment markets and some household characteristics. The shortest movements occur when households with two or more members move into densely populated districts. In contrast, job-based residential movements cover relatively longer distances. Furthermore, we derived knowledge on residential relocation distance, which can provide significant information on the urban management of metropolitan residential districts and the construction of reasonable housing policies.
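An illustrative comparison in the spirit of the study (simulated data, not the Korean migration microdata): a nonlinear moving-distance response is fitted by OLS and by a depth-limited decision tree, and the in-sample R-squared of each is compared:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Simulated households: moving distance drops sharply once a threshold in
# the first feature (e.g. job accessibility) is crossed -- a step pattern
# that a tree captures more naturally than a straight line.
rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(300, 2))
dist = np.where(X[:, 0] > 5, 2.0, 10.0) + rng.normal(scale=0.5, size=300)

ols_r2 = LinearRegression().fit(X, dist).score(X, dist)
tree_r2 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, dist).score(X, dist)
```

On this threshold-shaped response the tree attains higher explanatory power than OLS, mirroring the study's finding for relocation distance.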


Author(s):  
Warha, Abdulhamid Audu ◽  
Yusuf Abbakar Muhammad ◽  
Akeyede, Imam

Linear regression is the measure of the relationship between two or more variables known as dependent and independent variables. The classical least squares method for estimating regression models consists of minimising the sum of the squared residuals. Among the assumptions of the ordinary least squares (OLS) method is that there is no correlation (multicollinearity) among the independent variables. Violation of this assumption arises often in regression analysis and can lead to inefficiency of the least squares method. This study, therefore, determined the more efficient estimator between Least Absolute Deviation (LAD) and Weighted Least Squares (WLS) in multiple linear regression models at different levels of multicollinearity in the explanatory variables. Simulation techniques were conducted using R statistical software to investigate the performance of the two estimators under violation of the no-multicollinearity assumption. Their performances were compared at different sample sizes. Finite-sample criteria, namely mean absolute error, absolute bias, and mean squared error, were used for comparing the methods. The best estimator was selected based on the minimum value of these criteria at a specified level of multicollinearity and sample size. The results showed that LAD was the best at different levels of multicollinearity and was recommended as an alternative to OLS under this condition. The performance of both estimators decreased as the level of multicollinearity increased.
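A sketch of one cell of such a simulation (the paper's design ran in R; this stand-in uses NumPy, and the collinearity level and error distribution are assumptions): fit LAD by iteratively reweighted least squares on strongly collinear regressors with heavy-tailed noise:

```python
import numpy as np

# Simulated data: x2 is highly collinear with x1, and the noise is
# heavy-tailed (Student t with 2 df), the setting where LAD shines.
rng = np.random.default_rng(8)
n = 120
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)              # near-perfect collinearity
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.standard_t(df=2, size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]      # OLS start
for _ in range(50):                              # IRWLS approximation to LAD:
    w = 1.0 / np.maximum(np.abs(y - X @ beta), 1e-6)  # weights 1/|residual|
    WX = X * w[:, None]
    beta = np.linalg.solve(X.T @ WX, X.T @ (w * y))
mae = np.mean(np.abs(y - X @ beta))              # comparison criterion
```

Under collinearity the individual coefficients are unstable, but their sum remains well identified, which is why comparison criteria such as mean absolute error, rather than raw coefficients, are used to rank the estimators.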

