influential observations
Recently Published Documents


TOTAL DOCUMENTS

184
(FIVE YEARS 22)

H-INDEX

23
(FIVE YEARS 1)

2021 ◽  
Author(s):  
◽  
Nazrina Aziz

<p>This thesis investigates three research problems which arise in multivariate data and censored regression. The first is the identification of outliers in multivariate data. The second is a dissimilarity measure for clustering purposes. The third is the diagnostics analysis for the Buckley-James method in censored regression. Outliers can be defined simply as an observation (or a subset of observations) that is isolated from the other observations in the data set. There are two main reasons that motivate people to find outliers; the first is the researcher's intention. The second is the effects of an outlier on analyses, i.e. the existence of outliers will affect means, variances and regression coefficients; they will also cause a bias or distortion of estimates; likewise, they will inflate the sums of squares and hence, false conclusions are likely to be created. Sometimes, the identification of outliers is the main objective of the analysis, and whether to remove the outliers or for them to be down-weighted prior to fitting a non-robust model. This thesis does not differentiate between the various justifications for outlier detection. The aim is to advise the analyst of observations that are considerably different from the majority. Note that the techniques for identification of outliers introduce in this thesis is applicable to a wide variety of settings. Those techniques are performed on large and small data sets. In this thesis, observations that are located far away from the remaining data are considered to be outliers. Additionally, it is noted that some techniques for the identification of outliers are available for finding clusters. There are two major challenges in clustering. The first is identifying clusters in high-dimensional data sets is a difficult task because of the curse of dimensionality. The second is a new dissimilarity measure is needed as some traditional distance functions cannot capture the pattern dissimilarity among the objects. This thesis deals with the latter challenge. This thesis introduces Influence Angle Cluster Approach (iaca) that may be used as a dissimilarity matrix and the author has managed to show that iaca successfully develops a cluster when it is used in partitioning clustering, even if the data set has mixed variables, i.e. interval and categorical variables. The iaca is developed based on the influence eigenstructure. The first two problems in this thesis deal with a complete data set. It is also interesting to study about the incomplete data set, i.e. censored data set. The term 'censored' is mostly used in biological science areas such as a survival analysis. Nowadays, researchers are interested in comparing the survival distribution of two samples. Even though this can be done by using the logrank test, this method cannot examine the effects of more than one variable at a time. This difficulty can easily be overcome by using the survival regression model. Examples of the survival regression model are the Cox model, Miller's model, the Buckely James model and the Koul- Susarla-Van Ryzin model. The Buckley James model's performance is comparable with the Cox model and the former performs best when compared both to the Miller model and the Koul-Susarla-Van Ryzin model. Previous comparison studies proved that the Buckley-James estimator is more stable and easier to explain to non-statisticians than the Cox model. Today, researchers are interested in using the Cox model instead of the Buckley-James model. This is because of the lack of function of Buckley-James model in the computer software and choices of diagnostics analysis. Currently, there are only a few diagnostics analyses for Buckley James model that exist. Therefore, this thesis proposes two new diagnostics analyses for the Buckley-James model. The first proposed diagnostics analysis is called renovated Cook's distance. This method produces comparable results with the previous findings. Nevertheless, this method cannot identify influential observations from the censored group. It can only detect influential observations from the uncensored group. This issue needs further investigation because of the possibility of censored points becoming influential cases in censored regression. Secondly, the local influence approach for the Buckley-James model is proposed. This thesis presents the local influence diagnostics of the Buckley-James model which consist of variance perturbation, response variable perturbation, censoring status perturbation, and independent variables perturbation. The proposed diagnostics improves and also challenge findings of the previous ones by taking into account both censored and uncensored data to have a possibility to become an influential observation.</p>


2021 ◽  
Author(s):  
◽  
Nazrina Aziz

<p>This thesis investigates three research problems which arise in multivariate data and censored regression. The first is the identification of outliers in multivariate data. The second is a dissimilarity measure for clustering purposes. The third is the diagnostics analysis for the Buckley-James method in censored regression. Outliers can be defined simply as an observation (or a subset of observations) that is isolated from the other observations in the data set. There are two main reasons that motivate people to find outliers; the first is the researcher's intention. The second is the effects of an outlier on analyses, i.e. the existence of outliers will affect means, variances and regression coefficients; they will also cause a bias or distortion of estimates; likewise, they will inflate the sums of squares and hence, false conclusions are likely to be created. Sometimes, the identification of outliers is the main objective of the analysis, and whether to remove the outliers or for them to be down-weighted prior to fitting a non-robust model. This thesis does not differentiate between the various justifications for outlier detection. The aim is to advise the analyst of observations that are considerably different from the majority. Note that the techniques for identification of outliers introduce in this thesis is applicable to a wide variety of settings. Those techniques are performed on large and small data sets. In this thesis, observations that are located far away from the remaining data are considered to be outliers. Additionally, it is noted that some techniques for the identification of outliers are available for finding clusters. There are two major challenges in clustering. The first is identifying clusters in high-dimensional data sets is a difficult task because of the curse of dimensionality. The second is a new dissimilarity measure is needed as some traditional distance functions cannot capture the pattern dissimilarity among the objects. This thesis deals with the latter challenge. This thesis introduces Influence Angle Cluster Approach (iaca) that may be used as a dissimilarity matrix and the author has managed to show that iaca successfully develops a cluster when it is used in partitioning clustering, even if the data set has mixed variables, i.e. interval and categorical variables. The iaca is developed based on the influence eigenstructure. The first two problems in this thesis deal with a complete data set. It is also interesting to study about the incomplete data set, i.e. censored data set. The term 'censored' is mostly used in biological science areas such as a survival analysis. Nowadays, researchers are interested in comparing the survival distribution of two samples. Even though this can be done by using the logrank test, this method cannot examine the effects of more than one variable at a time. This difficulty can easily be overcome by using the survival regression model. Examples of the survival regression model are the Cox model, Miller's model, the Buckely James model and the Koul- Susarla-Van Ryzin model. The Buckley James model's performance is comparable with the Cox model and the former performs best when compared both to the Miller model and the Koul-Susarla-Van Ryzin model. Previous comparison studies proved that the Buckley-James estimator is more stable and easier to explain to non-statisticians than the Cox model. Today, researchers are interested in using the Cox model instead of the Buckley-James model. This is because of the lack of function of Buckley-James model in the computer software and choices of diagnostics analysis. Currently, there are only a few diagnostics analyses for Buckley James model that exist. Therefore, this thesis proposes two new diagnostics analyses for the Buckley-James model. The first proposed diagnostics analysis is called renovated Cook's distance. This method produces comparable results with the previous findings. Nevertheless, this method cannot identify influential observations from the censored group. It can only detect influential observations from the uncensored group. This issue needs further investigation because of the possibility of censored points becoming influential cases in censored regression. Secondly, the local influence approach for the Buckley-James model is proposed. This thesis presents the local influence diagnostics of the Buckley-James model which consist of variance perturbation, response variable perturbation, censoring status perturbation, and independent variables perturbation. The proposed diagnostics improves and also challenge findings of the previous ones by taking into account both censored and uncensored data to have a possibility to become an influential observation.</p>


Symmetry ◽  
2021 ◽  
Vol 13 (11) ◽  
pp. 2030
Author(s):  
Ali Mohammed Baba ◽  
Habshah Midi ◽  
Mohd Bakri Adam ◽  
Nur Haizum Abd Rahman

Influential observations (IOs), which are outliers in the x direction, y direction or both, remain a problem in the classical regression model fitting. Spatial regression models have a peculiar kind of outliers because they are local in nature. Spatial regression models are also not free from the effect of influential observations. Researchers have adapted some classical regression techniques to spatial models and obtained satisfactory results. However, masking or/and swamping remains a stumbling block for such methods. In this article, we obtain a measure of spatial Studentized prediction residuals that incorporate spatial information on the dependent variable and the residuals. We propose a robust spatial diagnostic plot to classify observations into regular observations, vertical outliers, good and bad leverage points using a classification based on spatial Studentized prediction residuals and spatial diagnostic potentials, which we refer to as and . Observations that fall into the vertical outliers and bad leverage points categories are referred to as IOs. Representations of some classical regression measures of diagnostic in general spatial models are presented. The commonly used diagnostic measure in spatial diagnostics, the Cook’s distance, is compared to some robust methods, (using robust and non-robust measures), and our proposed and plots. Results of our simulation study and applications to real data showed that the Cook’s distance, non-robust and robust were not very successful in detecting IOs. The suffered from the masking effect, and the robust suffered from swamping in general spatial models. Interestingly, the results showed that the proposed plot, followed by the plot, was very successful in classifying observations into the correct groups, hence correctly detecting the real IOs.


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Aamna Khan ◽  
Muhammad Amanullah ◽  
Muhammad Amin ◽  
Randa Alharbi ◽  
Abdisalam Hassan Muse ◽  
...  

There is a long history of interest in modeling Poisson regression in different fields of study. The focus of this work is on handling the issues that occur after modeling the count data. For the prediction and analysis of count data, it is valuable to study the factors that influence the performance of the model and the decision based on the analysis of that model. In regression analysis, multicollinearity and influential observations separately and jointly affect the model estimation and inferences. In this article, we focused on multicollinearity and influential observations simultaneously. To evaluate the reliability and quality of regression estimates and to overcome the problems in model fitting, we proposed new diagnostic methods based on Sherman–Morrison Woodbury (SMW) theorem to detect the influential observations using approximate deletion formulas for the Poisson regression model with the Liu estimator. A Monte Carlo method is done for the assessment of the proposed diagnostic methods. Real data are also considered for the evaluation of the proposed methods. Results show the superiority of the proposed diagnostic methods in detecting unusual observations in the presence of multicollinearity compared to the traditional maximum likelihood estimation method.


2021 ◽  
Vol 4 (4) ◽  
pp. 561-569
Author(s):  
Sani Muhammad ◽  
Suleiman Shamsuddeen ◽  
Ismail G. Baoku

Panel data estimators can strongly be biased and inconsistent in the presence of heteroscedasticity and anomalous observations called influential observations (IOs) in Random effect (RE) panel data model. The existing methods (LWS, WLSF, WLSDRGP) address only the problem of IO but fail to remedy the combine problem of heteroscedasticity and IOs.  Therefore, in this research we develop a method that will remedy the combine problem of heteroscedasticity and IOs based on robust heteroscedasticity consistent covariance matrix (RHCCM) estimator and fast improvised influential distance (FIID) weighting method denoted by WLSFIID. The simulation and numerical evidences show that our proposed estimation method is more efficient than the existing methods by providing smallest bias, and smallest standard error of HC4 and HC5.


Author(s):  
Ali Mohammed Baba ◽  
Habshah Midi ◽  
Mohd Bakri Adam ◽  
Nur Haizum Bint Abd Rahman

Influential Observations, which are outliers in x direction, y direction or both, remain a hitch in classical regression model fitting. Spatial regression model, with peculiar nature of outliers due to their local nature, is not free from the effect of such influential observations. Researchers have adapted some classical regression techniques to the spatial models and yielded satisfactory results. However, masking or/and swamping remain stumbling block to such methods. We obtained the spatial representation of the classical regression measures of diagnostic in general spatial model. Commonly used diagnostic measure in spatial diagnostic, the Cook's distance, is compared to some robust methods, Hi2 (using robust and non-robust measures), and classification based on generalized residuals and diagnostic generalized potentials, ISRs-Posi and ESRs-Posi, with the help of the obtained spatial prediction residuals and the spatial leverage term. Results of simulation and applications to real data have shown the advantage of the ISRs-Posi and ESRs-Posi due to classification of outliers over Cook's distance and non-robust Hsi12, which suffer from masking, and robust Hsi22 which suffer from swamping in general spatial model.


2021 ◽  
Vol 50 (7) ◽  
pp. 2085-2094
Author(s):  
Habshah Midi ◽  
Muhammad Sani ◽  
Shelan Saied Ismaeel ◽  
Jayanthi Arasan

Influential observations (IO) are those observations that are responsible for misleading conclusions about the fitting of a multiple linear regression model. The existing IO identification methods such as influential distance (ID) is not very successful in detecting IO. It is suspected that the ID employed inefficient method with long computational running time for the identification of the suspected IO at the initial step. Moreover, this method declares good leverage observations as IO, resulting in misleading conclusion. In this paper, we proposed fast improvised influential distance (FIID) that can successfully identify IO, good leverage observations, and regular observations with shorter computational running time. Monte Carlo simulation study and real data examples show that the FIID correctly identify genuine IO in multiple linear regression model with no masking and a negligible swamping rate.


Sign in / Sign up

Export Citation Format

Share Document