Identification of High Leverage Points in Linear Functional Relationship Model

Author(s):  
Abu Sayed Md. Al Mamun ◽  
A.H.M. R. Imon ◽  
A. G. Hussin ◽  
Y. Z. Zubairi ◽  
Sohel Rana

In a standard linear regression model the explanatory variables, X, are considered to be fixed and hence are assumed to be free from errors. In reality, however, they are variables and can consequently be subject to errors. The regression literature draws a clear distinction between outliers in the Y-space (outliers in the errors) and outliers in the X-space. The latter are popularly known as high leverage points: if the explanatory variables are subject to gross error or show any unusual pattern, we call the affected observations outliers in the X-space, or high leverage points. High leverage points often exert too much influence and consequently become responsible for misleading conclusions about the fit of a regression model, for multicollinearity problems, and for masking and/or swamping of outliers. Although a good deal of work has been done on the identification of high leverage points in the linear regression model, this is still a new and unsolved problem in the linear functional relationship model. In this paper, we suggest a procedure for the identification of high leverage points based on deletion of a group of observations. The usefulness of the proposed method for the detection of multiple high leverage points is studied using some well-known data sets and Monte Carlo simulations.
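
As a point of reference for the abstract above, the classical single-case leverage diagnostic can be sketched as follows. This is not the group-deletion procedure the authors propose; it is the textbook hat-matrix diagonal for a simple linear regression with intercept, and the 2p/n cut-off and the sample data are assumptions chosen for illustration.

```python
def leverages(x):
    """Hat-matrix diagonal h_ii for a simple linear regression with intercept."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]


def flag_high_leverage(x, p=2):
    """Indices whose leverage exceeds the 2p/n rule of thumb (an assumed cut-off)."""
    cutoff = 2.0 * p / len(x)
    return [i for i, h in enumerate(leverages(x)) if h > cutoff]


# Illustrative data: the last x value sits far from the rest.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
print(flag_high_leverage(x))  # prints [9]
```

With several high leverage points in the same direction, single-case measures like this one can suffer from the masking the abstract mentions, which is the motivation for deleting groups of observations.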

2021 ◽  
Vol 27 (127) ◽  
pp. 213-228
Author(s):  
Qasim Mohammed Saheb ◽  
Saja Mohammad Hussein

Linear regression is one of the most important statistical tools for studying the relationship between a response variable and one or more independent variables, and it is widely used in various fields of science. Heteroscedasticity is one of the problems of linear regression, and it leads to inaccurate conclusions. Heteroscedasticity may be accompanied by extreme outliers in the independent variables, known as high leverage points (HLPs); the presence of HLPs in a data set results in unrealistic estimates and misleading inferences. In this paper, we review some robust weighted estimation methods that combine robust and classical methods for detecting HLPs and determining weights. The methods are the Diagnostic Robust Generalized Potential based on the Minimum Volume Ellipsoid (DRGP (MVE)), the Diagnostic Robust Generalized Potential based on the Minimum Covariance Determinant (DRGP (MCD)), and the Diagnostic Robust Generalized Potential based on Index Set Equality (DRGP (ISE)). The methods were compared by the standard errors (SE) of the estimated parameters of the general linear regression model, for sample sizes n = 60, n = 100 and n = 160, with different degrees (severity) of heterogeneity and HLP contamination percentages of τ = 10% and τ = 30%. The comparison showed that weighted least squares estimation based on the weights of the DRGP (ISE) method is the best for estimating the parameters of the multiple linear regression model, because it has the lowest standard errors of the estimators compared with the other methods. Paper type: A case study
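
The weighted least squares step that this abstract relies on can be sketched for the one-regressor case. The weights below are hypothetical and simply down-weight a suspected high leverage point to zero; they are not produced by any of the DRGP variants, whose computation the abstract does not describe.

```python
def weighted_least_squares(x, y, w):
    """Fit y = b0 + b1 * x by minimizing sum(w_i * residual_i ** 2)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1


# Illustrative data: the last observation is a suspected bad leverage point.
x = [1.0, 2.0, 3.0, 4.0, 20.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
equal_weights = [1.0] * 5
robust_weights = [1.0, 1.0, 1.0, 1.0, 0.0]  # hypothetical weights, not DRGP output

print(weighted_least_squares(x, y, equal_weights))
print(weighted_least_squares(x, y, robust_weights))
```

Comparing the two fits shows how much a single high leverage point can pull the slope when every observation carries equal weight.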


1995 ◽  
Vol 3 (3) ◽  
pp. 133-142 ◽  
Author(s):  
M. Hana ◽  
W.F. McClure ◽  
T.B. Whitaker ◽  
M. White ◽  
D.R. Bahler

Two artificial neural network models were used to estimate the nicotine content of tobacco: (i) a back-propagation network and (ii) a linear network. The back-propagation network consisted of an input layer, an output layer and one hidden layer. The linear network consisted of an input layer and an output layer. Both networks used the generalised delta rule for learning. The performance of both networks was compared with the multiple linear regression (MLR) method of calibration. The nicotine content in tobacco samples was estimated for two different data sets. Data set A contained 110 near infrared (NIR) spectra, each consisting of reflected energy at eight wavelengths. Data set B consisted of 200 NIR spectra, each having 840 spectral data points. The fast Fourier transform was applied to data set B in order to compress each spectrum into 13 Fourier coefficients. For data set A, the linear regression model gave the best results, followed by the back-propagation network and then the linear network. The true performance of the linear regression model was better than that of the back-propagation and linear networks by 14.0% and 18.1%, respectively. For data set B, the back-propagation network gave the best result, followed by MLR and the linear network, which gave almost the same results. The true performance of the back-propagation network model was better than that of the MLR and linear network by 35.14%.
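
The compression step for data set B can be illustrated with a small sketch. The abstract does not say which 13 coefficients were kept or how they were encoded, so the code assumes the 13 lowest-frequency complex coefficients of a discrete Fourier transform; a direct O(n·k) sum is used instead of a fast transform to keep the example dependency-free, and the synthetic spectrum is invented for illustration.

```python
import cmath
import math


def first_fourier_coefficients(signal, k=13):
    """Return the k lowest-frequency DFT coefficients, scaled by 1/n."""
    n = len(signal)
    coeffs = []
    for m in range(k):
        total = 0j
        for t, value in enumerate(signal):
            total += value * cmath.exp(-2j * cmath.pi * m * t / n)
        coeffs.append(total / n)
    return coeffs


# Synthetic 840-point "spectrum": an offset plus one slow cosine component.
n_points = 840
spectrum = [1.0 + math.cos(2 * math.pi * 3 * t / n_points) for t in range(n_points)]
compressed = first_fourier_coefficients(spectrum, k=13)
print(len(spectrum), "->", len(compressed))
```

Each 840-point spectrum is reduced to 13 numbers, which then serve as inputs to the calibration model in place of the raw data points.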


2018 ◽  
Vol 7 (2) ◽  
pp. 146
Author(s):  
Silvi Qemo ◽  
Eahab Elsaid

The purpose of this study is to derive a multiple linear regression model of the CAPM, and more specifically to test for other potential explanatory variables that can be added to the basic linear regression model for the expected returns on Apple Inc. The following explanatory variables were examined: share volume, outstanding shares, closing bid/ask spread, high/low spread and average spread. Using daily returns of Apple Inc. stock from 2007 to 2014, we were able to create a multiple linear regression model of the CAPM that increases the R² value over the basic linear regression model and explains more of the variability in the returns on an asset. This is an important modification that can help better forecast returns on assets.
Keywords: CAPM; multiple linear regression model; average spread; variability in the returns


2021 ◽  
Vol 2 (1) ◽  
pp. 12-20
Author(s):  
Kayode Ayinde ◽  
Olusegun O. Alabi ◽  
Ugochinyere Ihuoma Nwosu

Multicollinearity has remained a major problem in regression analysis and should be sustainably addressed. Problems associated with multicollinearity are worse when it occurs at a high level among regressors. This review revealed that studies on the subject have focused on developing estimators regardless of the effect of differences in the levels of multicollinearity among regressors. Studies have considered single-estimator and combined-estimator approaches without a sustainable solution to multicollinearity problems. The possible influence of partitioning the regressors according to multicollinearity levels, and extracting from each group to develop estimators of the parameters of a linear regression model when multicollinearity occurs, is a new idea in econometrics and therefore requires attention. The results of new studies should be compared with existing methods, namely the principal components estimator, partial least squares estimator, ridge regression estimator and ordinary least squares estimator, using a wide range of criteria and ranking their performance at each level of the multicollinearity parameter and sample size. Based on a recent clue in the literature, it is possible to develop an innovative estimator that will sustainably solve the problem of multicollinearity through partitioning and extraction of explanatory variables, and to identify situations where the innovative estimator will produce the most efficient estimates of the model parameters. The new estimator should be applied to real data and popularized for use.
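
One common way to quantify the "level" of multicollinearity discussed above is the variance inflation factor (VIF). The sketch below covers only the two-regressor case, where VIF = 1 / (1 - r²) with r the Pearson correlation between the regressors. It is a standard diagnostic offered for orientation, not the partitioning approach the review calls for, and the data are invented.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def vif_two_regressors(x1, x2):
    """VIF shared by both regressors in a two-regressor model.

    Undefined (division by zero) when the regressors are perfectly collinear.
    """
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)


x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]  # nearly 2 * x1, so strongly collinear
print(vif_two_regressors(x1, x2))
```

A VIF near 1 indicates little collinearity; values above roughly 10 are often treated as a sign of serious multicollinearity, although that threshold is a convention rather than a rule.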


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Qingqi Zhang

In this paper, the author first analyzes the major factors affecting housing prices with Spearman correlation coefficient, selects significant factors influencing general housing prices, and conducts a combined analysis algorithm. Then, the author establishes a multiple linear regression model for housing price prediction and applies the data set of real estate prices in Boston to test the method. Through the data analysis and test in this paper, it can be summarized that the multiple linear regression model can effectively predict and analyze the housing price to some extent, while the algorithm can still be improved through more advanced machine learning methods.
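
The screening step described above, ranking candidate factors by their Spearman correlation with price, can be sketched as follows. The tie handling (average ranks) and the toy numbers are assumptions; the abstract does not give the author's exact procedure or selection threshold.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)


# Toy data: a monotone but nonlinear relation between a factor and price.
rooms = [3.0, 4.0, 5.0, 6.0, 7.0]
price = [10.0, 14.0, 21.0, 35.0, 60.0]
print(spearman(rooms, price))
```

Because it works on ranks, the coefficient captures monotone relationships even when they are not linear, which is why it is often used to screen variables before fitting a linear model.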


PeerJ ◽  
2019 ◽  
Vol 7 ◽  
pp. e8192 ◽  
Author(s):  
Gökhan Karakülah ◽  
Nazmiye Arslan ◽  
Cihangir Yandım ◽  
Aslı Suner

Introduction: Recent studies highlight the crucial regulatory roles of transposable elements (TEs) on proximal gene expression in distinct biological contexts such as disease and development. However, computational tools extracting potential TE-proximal gene expression associations from RNA-sequencing data are still missing. Implementation: Herein, we developed a novel R package, using a linear regression model, for studying the potential influence of TE species on proximal gene expression from a given RNA-sequencing data set. Our R package, namely TEffectR, makes use of publicly available RepeatMasker TE and Ensembl gene annotations as well as several functions of other R packages. It calculates total read counts of TEs from sorted and indexed genome-aligned BAM files provided by the user, and determines statistically significant relations between TE expression and the transcription of nearby genes under diverse biological conditions. Availability: TEffectR is freely available at https://github.com/karakulahg/TEffectR along with a handy tutorial as exemplified by the analysis of RNA-sequencing data including normal and tumour tissue specimens obtained from breast cancer patients.


2020 ◽  
Author(s):  
Lei Liu ◽  
Yizhao Ni ◽  
Andrew F Beck ◽  
Cole Brokamp ◽  
Ryan C Ramphul ◽  
...  

BACKGROUND Day-of-surgery cancellation (DoSC) represents a substantial wastage of hospital resources and can cause significant inconvenience to patients and families. Cancellation is reported to impact between 2% and 20% of the 50 million procedures performed annually in American hospitals. Up to 85% of cancellations may be amenable to the modification of patients’ and families’ behaviors. However, the factors underlying DoSC and the barriers experienced by families are not well understood. OBJECTIVE This study aims to conduct a geospatial analysis of patient-specific variables from electronic health records (EHRs) of Cincinnati Children’s Hospital Medical Center (CCHMC) and of Texas Children’s Hospital (TCH), as well as linked socioeconomic factors measured at the census tract level, to understand potential underlying contributors to disparities in DoSC rates across neighborhoods. METHODS The study population included pediatric patients who underwent scheduled surgeries at CCHMC and TCH. A 5-year data set was extracted from the CCHMC EHR, and addresses were geocoded. An equivalent set of data >5.7 years was extracted from the TCH EHR. Case-based data related to patients’ health care use were aggregated at the census tract level. Community-level variables were extracted from the American Community Survey as surrogates for patients’ socioeconomic and minority status as well as markers of the surrounding context. Leveraging the selected variables, we built spatial models to understand the variation in DoSC rates across census tracts. The findings were compared to those of the nonspatial regression and deep learning models. Model performance was evaluated from the root mean squared error (RMSE) using nested 10-fold cross-validation. Feature importance was evaluated by computing the increment of the RMSE when a single variable was shuffled within the data set. 
RESULTS Data collection yielded sets of 463 census tracts at CCHMC (DoSC rates 1.2%-12.5%) and 1024 census tracts at TCH (DoSC rates 3%-12.2%). For CCHMC, an L2-normalized generalized linear regression model achieved the best performance in predicting all-cause DoSC rate (RMSE 1.299%, 95% CI 1.21%-1.387%); however, its improvement over others was marginal. For TCH, an L2-normalized generalized linear regression model also performed best (RMSE 1.305%, 95% CI 1.257%-1.352%). All-cause DoSC rate at CCHMC was predicted most strongly by "previous no show". As for community-level data, the proportion of African American inhabitants per census tract was consistently an important predictor. In the Texas area, the proportion of overcrowded households was salient to DoSC rate. CONCLUSIONS Our findings suggest that geospatial analysis offers potential for use in targeting interventions for census tracts at a higher risk of cancellation. Our study also demonstrates the importance of home location, socioeconomic disadvantage, and racial minority status on the DoSC of children’s surgery. The success of future efforts to reduce cancellation may benefit from taking social, economic, and cultural issues into account.
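
The feature-importance measure named in the methods, the increase in RMSE when a single variable is shuffled, can be sketched in a few lines. The model, data and function names below are hypothetical stand-ins; the study's models and cross-validation setup are not reproduced.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, rows, y, column, seed=0):
    """RMSE increase after shuffling one column of the feature rows."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(r) for r in rows])
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    return rmse(y, [predict(r) for r in permuted]) - baseline


# Hypothetical fitted model that uses only the first feature.
def predict(row):
    return 3.0 * row[0]


rows = [[float(i), float(i % 3)] for i in range(20)]
y = [3.0 * r[0] for r in rows]

print(permutation_importance(predict, rows, y, column=0))  # large: feature is used
print(permutation_importance(predict, rows, y, column=1))  # zero: feature is ignored
```

A variable the model depends on produces a clear RMSE increase when shuffled, while a variable the model ignores produces none.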


MATEMATIKA ◽  
2017 ◽  
Vol 33 (2) ◽  
pp. 159
Author(s):  
Nurkhairany Amyra Mokhtar ◽  
Yong Zulina Zubairi ◽  
Abdul Ghapor Hussin ◽  
Rossita Mohamad Yunus

The replicated linear functional relationship model is often used to describe relationships between two circular variables where both variables have error terms and replicate observations are available. We derive the estimate of the rotation parameter of the model using the maximum likelihood method. The performance of the proposed method is studied through simulation, and the bias of the estimates is found to be small, which suggests that the method is suitable. Practical application of the method is illustrated using a real data set.


2016 ◽  
Vol 4 (1) ◽  
pp. 67-84 ◽  
Author(s):  
Katalin Vér Gáspár ◽  
Attila Madaras ◽  
József Varga

The education system in Hungary has been greatly criticized in recent decades regarding the standards and quality of education and its disregard for labour market demands. The present study focuses on factors affecting the quality of education. The first part of the research analyses the relationship between public education and competitiveness in Hungary. In the second part of the research, with the help of the linear regression model and of other statistical and mathematical tools, we tried to identify the explanatory variables that most influence and determine the quality of public education. The quality of education was chosen as the dependent variable of the model. Based on the data of competency measurements in Hungary, we were able to identify two explanatory variables that also give the linear regression model a high goodness of fit. The educational funding rate (GDP-proportionate educational spending rate) and the number of students learning English turned out to be the two significant explanatory variables. Results show that increasing the GDP-proportionate educational spending rate by only one per cent increases the average value of the competency measures by 10.9571 points, with all other variables unchanged. Likewise, increasing the number of English language learners by one person increases the average value by 0.000177253 points, with the other variables remaining the same.


Author(s):  
ARKADY BOLOTIN

Dichotomization of the outcome by a single cut-off point is an important part of medical studies. Usually the relationship between the resulting dichotomized dependent variable and the explanatory variables is analyzed with linear regression, probit or logistic regression. However, in many real-life situations the exact cut-off point is unknown and can be specified only approximately, i.e. it is surrounded by some (small) uncertainty. This means that, to have any practical meaning, the regression model must be robust to this uncertainty. In this paper, we test the robustness of the linear regression model and find that neither the beta in the model nor its significance level is robust to small variations in the dichotomization cut-off point. As an alternative robust approach to the problem of uncertain categories, we propose using the linear regression model with the fuzzy membership function as the dependent variable. We test the robustness of the linear regression model with such a fuzzy dependent variable and find that this model can be insensitive to the uncertainty in the cut-off point location. To demonstrate the theoretical conclusions, we present modelling results from a real study of low haemoglobin levels in infants.

