Analysis and Diagnostics for Censored Regression and Multivariate Data

2021 ◽  
Author(s):  
◽  
Nazrina Aziz

<p>This thesis investigates three research problems that arise in multivariate data and censored regression. The first is the identification of outliers in multivariate data; the second is a dissimilarity measure for clustering purposes; the third is diagnostic analysis for the Buckley-James method in censored regression. Outliers can be defined simply as an observation (or a subset of observations) that is isolated from the other observations in the data set. There are two main reasons to find outliers. The first is the researcher's intention. The second is their effect on analyses: outliers will affect means, variances and regression coefficients; they will bias or distort estimates; and they will inflate the sums of squares, so that false conclusions are likely to be drawn. Sometimes the identification of outliers is the main objective of the analysis, whether they are then removed or down-weighted prior to fitting a non-robust model. This thesis does not differentiate between the various justifications for outlier detection; the aim is to alert the analyst to observations that are considerably different from the majority. Note that the techniques for the identification of outliers introduced in this thesis are applicable to a wide variety of settings, and they are demonstrated on both large and small data sets. In this thesis, observations that are located far away from the remaining data are considered to be outliers. It is also noted that some techniques for the identification of outliers can be used for finding clusters. There are two major challenges in clustering. The first is that identifying clusters in high-dimensional data sets is difficult because of the curse of dimensionality. The second is that a new dissimilarity measure is needed, as some traditional distance functions cannot capture the pattern dissimilarity among objects. This thesis deals with the latter challenge. It introduces the Influence Angle Cluster Approach (iaca), which may be used as a dissimilarity measure, and shows that iaca successfully develops clusters when used in partitioning clustering, even when the data set has mixed variables, i.e. interval and categorical variables. The iaca is developed from the influence eigenstructure. The first two problems in this thesis deal with complete data sets; it is also of interest to study incomplete, i.e. censored, data sets. The term 'censored' is mostly used in the biological sciences, for example in survival analysis. Researchers are often interested in comparing the survival distributions of two samples. Although this can be done with the logrank test, that method cannot examine the effects of more than one variable at a time; the difficulty is overcome by using a survival regression model. Examples of survival regression models are the Cox model, Miller's model, the Buckley-James model and the Koul-Susarla-Van Ryzin model. The Buckley-James model's performance is comparable with the Cox model, and it performs best when compared with both the Miller model and the Koul-Susarla-Van Ryzin model. Previous comparison studies showed that the Buckley-James estimator is more stable and easier to explain to non-statisticians than the Cox model. Nevertheless, researchers tend to use the Cox model instead of the Buckley-James model, because of the lack of implementations of the Buckley-James model in statistical software and the limited choice of diagnostic analyses: currently, only a few diagnostic analyses for the Buckley-James model exist. This thesis therefore proposes two new diagnostic analyses for the Buckley-James model. The first is called the renovated Cook's distance. This method produces results comparable with previous findings; nevertheless, it can only detect influential observations from the uncensored group, not from the censored group. This issue needs further investigation because of the possibility of censored points becoming influential cases in censored regression. Secondly, a local influence approach for the Buckley-James model is proposed. This thesis presents local influence diagnostics of the Buckley-James model consisting of variance perturbation, response variable perturbation, censoring status perturbation, and independent variable perturbation. The proposed diagnostics improve on, and also challenge, previous findings by allowing both censored and uncensored observations to emerge as influential.</p>
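For reference, the classical (uncensored, OLS) Cook's distance that the "renovated" version builds on can be sketched as follows. This is a generic baseline, not the thesis's renovated statistic itself; names are illustrative:

```python
import numpy as np

def cooks_distance(X, y):
    """Classical Cook's distance for OLS:
    D_i = (e_i^2 / (p * s^2)) * h_i / (1 - h_i)^2,
    where h_i are hat-matrix leverages. Shown only as the baseline that a
    censored-regression 'renovation' would generalise."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    # Leverages: diagonal of the hat matrix X (X'X)^-1 X'
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)          # residual variance estimate
    return (resid ** 2 / (p * s2)) * h / (1 - h) ** 2
```

A perturbed point at high leverage should dominate the resulting distances.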



Sensors ◽  
2019 ◽  
Vol 19 (1) ◽  
pp. 166 ◽  
Author(s):  
Rahim Khan ◽  
Ihsan Ali ◽  
Saleh M. Altowaijri ◽  
Muhammad Zakarya ◽  
Atiq Ur Rahman ◽  
...  

Multivariate data sets are common in various application areas, such as wireless sensor networks (WSNs) and DNA analysis. A robust mechanism is required to compute their similarity indexes regardless of the environment and problem domain. This study describes the usefulness of a non-metric-based approach (i.e., longest common subsequence) in computing similarity indexes. Several non-metric-based algorithms are available in the literature; the most robust and reliable is the dynamic-programming-based technique. However, dynamic-programming-based techniques are considered inefficient, particularly in the context of multivariate data sets. Furthermore, the classical approaches are not powerful enough for multivariate data sets, sensor data, or scenarios where the similarity indexes are extremely high or low. To address this issue, we propose an efficient algorithm to measure the similarity indexes of multivariate data sets using a non-metric-based methodology. The proposed algorithm performs exceptionally well on numerous multivariate data sets compared with classical dynamic-programming-based algorithms. The performance of the algorithms is evaluated on several benchmark data sets and a dynamic multivariate data set obtained from a WSN deployed in the Ghulam Ishaq Khan (GIK) Institute of Engineering Sciences and Technology. Our evaluation suggests that the proposed algorithm can be approximately 39.9% more efficient than its counterparts for various data sets in terms of computational time.
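For reference, the classical dynamic-programming LCS baseline that such proposals are compared against looks like this (a minimal, generic sketch; the paper's own algorithm and its normalisation are not reproduced here):

```python
def lcs_length(a, b):
    """Classic O(len(a)*len(b)) dynamic-programming longest-common-subsequence
    length; works on strings or on sequences of sensor readings alike."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcs_similarity(a, b):
    """One common convention for a similarity index normalised to [0, 1]."""
    return lcs_length(a, b) / max(len(a), len(b))
```

The quadratic table is exactly the cost that makes this approach slow on long multivariate sequences.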


2021 ◽  
Vol 2 (2) ◽  
pp. 40-47
Author(s):  
Sunil Kumar ◽  
Vaibhav Bhatnagar

Machine learning is one of the active fields and technologies for realizing artificial intelligence (AI). The complexity of machine learning algorithms makes it difficult to predict which algorithm is best. Because there are many complex algorithms in machine learning (ML), determining the appropriate method for finding regression trends, and thereby establishing the correlation between variables, is very difficult; this paper therefore reviews the different types of regression used in machine learning. There are six main types of regression model: Linear, Logistic, Polynomial, Ridge, Bayesian Linear and Lasso. This paper gives an overview of the above-mentioned regression models and compares their suitability for machine learning. Data analysis is a prerequisite for establishing an association among the innumerable considerations in a data set, and association is essential for the forecasting and exploration of data. Regression analysis is such a procedure for establishing associations among data sets. The work in this paper focuses mainly on the diverse regression analysis models and how they come to be used in the context of different data sets in machine learning. Selecting the accurate model for exploration is the most challenging task, and hence these models are considered thoroughly in this study. By using these models in the right way and with an accurate data set, data exploration and forecasting in machine learning can provide the most exact outcomes.
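To make one of the listed models concrete: ridge regression has a closed-form solution that reduces to ordinary linear least squares when the penalty vanishes. A minimal numpy sketch (variable and function names are illustrative):

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: beta = (X'X + alpha*I)^-1 X'y.
    With alpha = 0 this reduces to ordinary (linear) least squares;
    larger alpha shrinks the coefficient vector towards zero."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)
```

The shrinkage behaviour (smaller coefficient norm as alpha grows) is what distinguishes ridge from plain linear regression in the comparison above.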


1999 ◽  
Vol 09 (03) ◽  
pp. 195-202 ◽  
Author(s):  
JOSÉ ALFREDO FERREIRA COSTA ◽  
MÁRCIO LUIZ DE ANDRADE NETTO

Determining the structure of data without prior knowledge of the number of clusters or any information about their composition is a problem of interest in many fields, such as image analysis, astrophysics and biology. Partitioning a set of n patterns in a p-dimensional feature space must be done such that those in a given cluster are more similar to each other than to the rest. As there are approximately [Formula: see text] possible ways of partitioning the patterns among K clusters, finding the best solution is very hard when n is large. The search space grows further when the number of partitions is not known a priori. Although the self-organizing feature map (SOM) can be used to visualize clusters, the automation of knowledge discovery by SOM is a difficult task. This paper proposes region-based image processing methods to post-process the U-matrix obtained after the unsupervised learning performed by SOM. Mathematical morphology is applied to identify regions of neurons that are similar. The number of regions and their labels are found automatically, and they correspond to the number of clusters in a multivariate data set. New data can be classified by labeling them according to the best-matching neuron. Simulations using data sets drawn from finite mixtures of p-variate normal densities are presented, together with the advantages and drawbacks of the method.
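The combinatorial explosion referred to above can be made concrete: the exact number of ways to partition n patterns into K non-empty clusters is the Stirling number of the second kind S(n, K), which behaves like K^n/K! for large n. A sketch of its standard recurrence:

```python
def stirling2(n, k):
    """Stirling number of the second kind S(n, k): the number of ways to
    partition n patterns into k non-empty clusters, via the recurrence
    S(n, k) = k * S(n-1, k) + S(n-1, k-1)."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    S = [[0] * (k + 1) for _ in range(n + 1)]
    S[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            S[i][j] = j * S[i - 1][j] + S[i - 1][j - 1]
    return S[n][k]
```

Even modest n and K produce astronomically many partitions, which is why exhaustive search is hopeless and heuristics like the SOM post-processing above are needed.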


2006 ◽  
Vol 5 (2) ◽  
pp. 125-136 ◽  
Author(s):  
Jimmy Johansson ◽  
Patric Ljung ◽  
Mikael Jern ◽  
Matthew Cooper

Parallel coordinates is a well-known technique used for the visualization of multivariate data. When the size of the data sets increases, the parallel coordinates display results in an image far too cluttered to perceive any structure. We tackle this problem by constructing high-precision textures to represent the data. By using transfer functions that operate on the high-precision textures, it is possible to highlight different aspects of the entire data set or clusters of the data. Our methods are implemented in both standard 2D parallel coordinates and 3D multi-relational parallel coordinates. Furthermore, when visualizing a larger number of clusters, a technique called ‘feature animation’ may be used as guidance by presenting various cluster statistics. A case study is also performed to illustrate the analysis process when analysing large multivariate data sets using our proposed techniques.


1999 ◽  
Vol 21 (s-1) ◽  
pp. 1-23 ◽  
Author(s):  
Peter J. Frischmann ◽  
Edward W. Frees

We empirically investigate the demand for income tax preparation services by examining factors that affect both the choice and level of utilization of service. We identify the demand factors as the taxpayer's (1) opportunity costs, (2) estimated tax savings when using a preparer, and (3) historical uncertainty in tax liability. Our panel data set allows us to measure individual-specific uncertainty, a new measure in assessing determinants of tax service demand. Consistent with prior research, choice is measured by whether a taxpayer uses a professional paid preparer. The preparer's fee is the measure of utilization level. Fee information is heavily censored in part because fees only need to be disclosed when taxpayers itemize deductions and have miscellaneous itemized deductions above the 2 percent limit. We develop a partially censored regression model to accommodate the censoring. Similar to Cragg (1971), we decouple the choice and level of utilization models; findings indicate differences between these models. Generally, taxpayers choose paid preparers for time savings and uncertainty protection. Fees, however, reflect the purchase of time and tax savings, not uncertainty protection. These results suggest that pricing structures for professional tax preparation services could be adjusted to more closely reflect the services provided.


2021 ◽  
Vol 4 (2) ◽  
pp. 225-240
Author(s):  
Pinki Sagar ◽  
◽  
Prinima Gupta ◽  
Rohit Tanwar ◽  
◽  
...  

Regression analysis is a statistical technique that is most commonly used for forecasting. Data sets are becoming very large due to continuous transactions in today's high-paced world, and the data is difficult to manage and interpret. Not all of the independent variables can be considered for prediction, because maintaining the full data set is costly. A novel algorithm for prediction is implemented in this paper. Its emphasis is on extracting efficient independent variables from the variables of the data set. The selection of variables is based on the Mean Square Error (MSE) as well as on the coefficient of determination (r²p); the final prediction equation for the algorithm is then framed on the basis of deviations from the actual mean. This statistics-based prediction algorithm is evaluated on four parameters: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and residuals. The algorithm has been implemented for a multivariate data set with low maintenance costs, low preprocessing costs, and lower root mean square error and residuals. The proposed prediction algorithm can also be used for one-dimensional, two-dimensional, frequent stream, time series and continuous data. Its impact is to enhance forecasting accuracy and minimize the average error rate.
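The four evaluation measures named above have standard definitions, sketched here for reference (the paper's own algorithm is not reproduced; MAPE assumes nonzero targets):

```python
import numpy as np

def residuals(y_true, y_pred):
    """Raw residuals: observed minus predicted."""
    return y_true - y_pred

def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error; undefined if any true value is zero."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)
```

Any candidate prediction equation can then be scored on all four measures against a held-out set.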


2018 ◽  
Vol 10 (3) ◽  
pp. 1473-1490 ◽  
Author(s):  
Birgit Hassler ◽  
Stefanie Kremser ◽  
Greg E. Bodeker ◽  
Jared Lewis ◽  
Kage Nesbit ◽  
...  

Abstract. An updated and improved version of a global, vertically resolved, monthly mean zonal mean ozone database has been calculated – hereafter referred to as the BSVertOzone (Bodeker Scientific Vertical Ozone) database. Like its predecessor, it combines measurements from several satellite-based instruments and ozone profile measurements from the global ozonesonde network. Monthly mean zonal mean ozone concentrations in mixing ratio and number density are provided in 5° latitude bins, spanning 70 altitude levels (1 to 70 km), or 70 pressure levels that are approximately 1 km apart (878.4 to 0.046 hPa). Different data sets or “tiers” are provided: Tier 0 is based only on the available measurements and therefore does not completely cover the whole globe or the full vertical range uniformly; the Tier 0.5 monthly mean zonal means are calculated as a filled version of the Tier 0 database where missing monthly mean zonal mean values are estimated from correlations against a total column ozone (TCO) database. The Tier 0.5 data set includes the full range of measurement variability and is created as an intermediate step for the calculation of the Tier 1 data where a least squares regression model is used to attribute variability to various known forcing factors for ozone. Regression model fit coefficients are expanded in Fourier series and Legendre polynomials (to account for seasonality and latitudinal structure, respectively). Four different combinations of contributions from selected regression model basis functions result in four different Tier 1 data sets that can be used for comparisons with chemistry–climate model (CCM) simulations that do not exhibit the same unforced variability as reality (unless they are nudged towards reanalyses). Compared to previous versions of the database, this update includes additional satellite data sources and ozonesonde measurements to extend the database period to 2016. 
Additional improvements over the previous version of the database include the following: (i) adjustments of measurements to account for biases and drifts between different data sources (using a chemistry-transport model, CTM, simulation as a transfer standard), (ii) a more objective way to determine the optimum number of Fourier and Legendre expansions for the basis function fit coefficients, and (iii) methodological and measurement uncertainties on each database value that are traced through all data modification steps. Comparisons with the ozone database from SWOOSH (Stratospheric Water and OzOne Satellite Homogenized data set) show good agreement in many regions of the globe. Minor differences are caused by the different bias adjustment procedures of the two databases. However, compared to SWOOSH, BSVertOzone additionally covers the troposphere. Version 1.0 of BSVertOzone is publicly available at https://doi.org/10.5281/zenodo.1217184.
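The basis expansion described (Fourier terms in month for seasonality, crossed with Legendre polynomials in latitude) can be sketched as a design-matrix builder. The expansion orders and the latitude mapping below are illustrative choices, not the database's actual configuration:

```python
import numpy as np

def design_matrix(month, lat, n_fourier=2, n_legendre=2):
    """Illustrative basis expansion: Fourier terms in month (seasonality)
    crossed with Legendre polynomials in sin(latitude) (latitudinal
    structure), producing columns for a least squares regression fit."""
    t = 2.0 * np.pi * month / 12.0
    x = np.sin(np.deg2rad(lat))                 # map latitude into [-1, 1]
    fourier = [np.ones_like(t)]
    for k in range(1, n_fourier + 1):
        fourier += [np.sin(k * t), np.cos(k * t)]
    cols = []
    for f in fourier:
        for d in range(n_legendre + 1):
            coeffs = np.zeros(d + 1)
            coeffs[d] = 1.0                     # select P_d alone
            cols.append(f * np.polynomial.legendre.legval(x, coeffs))
    return np.column_stack(cols)
```

Choosing the number of Fourier and Legendre terms objectively is exactly improvement (ii) listed above.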


2021 ◽  
Vol 4 (3) ◽  
pp. 47-63
Author(s):  
Owhondah P.S. ◽  
Enegesele D. ◽  
Biu O.E. ◽  
Wokoma D.S.A.

The study deals with discriminating between second-order models with and without interaction, centered on measures of central tendency, using the ordinary least squares (OLS) method for the estimation of the model parameters. The paper considered two different data sets, of small and large sample size. The small sample used data on the unemployment rate as the response and the inflation rate and exchange rate as predictors, from 2007 to 2018; the large sample was flow-rate data on hydrate formation for a Niger Delta deep offshore field. The R², AIC, SBC, and SSE were computed for both data sets to test the adequacy of the models. The results show that all three models are similar for the smaller data set, while for the large data set the second-order model centered on the median, with or without interaction, is the best based on the number of significant parameters. The model selection criterion values (R², AIC, SBC, and SSE) were found to be equal for the models centered on the median and mode for both the large and small data sets. However, the models centered on the median and mode, with or without interaction, were better than the model centered on the mean for the large data set. This study shows that second-order regression models centered on the median and mode, with or without interaction, are better than the model centered on the mean for the large data set, while they are similar for the smaller data set.
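A sketch of the comparison pipeline described above: build a second-order design centered at a chosen location estimate (mean, median, or mode), fit it by OLS, and compute SSE, R² and AIC. The AIC below uses the Gaussian log-likelihood form; SBC is analogous with ln(n)·p in place of 2p. Names and forms here are illustrative, not the paper's exact implementation:

```python
import numpy as np

def centered_quadratic_design(x1, x2, center, interaction=True):
    """Second-order design centered at `center` (mean, median, or mode):
    columns 1, z1, z2, z1^2, z2^2 and optionally z1*z2."""
    z1, z2 = x1 - center[0], x2 - center[1]
    cols = [np.ones_like(z1), z1, z2, z1 ** 2, z2 ** 2]
    if interaction:
        cols.append(z1 * z2)
    return np.column_stack(cols)

def ols_fit_stats(X, y):
    """OLS fit plus the selection criteria used above: SSE, R^2, AIC
    (Gaussian form n*ln(SSE/n) + 2p)."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sse = float(resid @ resid)
    sst = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - sse / sst
    aic = n * np.log(sse / n) + 2 * p
    return beta, sse, r2, aic
```

Running the same fit with the center swapped between mean, median and mode, and comparing the criteria, reproduces the kind of comparison the study performs.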


2016 ◽  
Vol 8 (1) ◽  
Author(s):  
József Kovács ◽  
Nikolett Bodnár ◽  
Ákos Török

Abstract. The paper presents the evaluation of engineering geological laboratory test results of core drillings along the new metro line (line 4) in Budapest using multivariate data analysis. A data set of 30 core drillings with a total coring length of over 1500 meters was studied. Of the eleven engineering geological parameters considered in this study, only the five most reliable (void ratio, dry bulk density, angle of internal friction, cohesion and compressive strength), representing 1260 data points, were used for multivariate (cluster and discriminant) analyses. Discriminant analysis was used to test the results of the cluster analysis. The results suggest that the use of multivariate analyses allows the identification of different groups of sediments even when the data sets overlap and contain several uncertainties. The tests also prove that the use of these methods on seemingly very scattered parameters is crucial in obtaining reliable engineering geological data for design.
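The cluster-then-validate workflow can be illustrated with a minimal k-means (standing in for the study's unspecified clustering method; a discriminant analysis would then be fit on the resulting labels to check their separability):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal Lloyd's k-means with Euclidean distances; a generic
    stand-in for the clustering step, not the study's actual method."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        # Recompute centers as cluster means (keep empty clusters in place)
        new = np.array([X[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

In the study's setting, the five laboratory parameters would form the columns of X, and the discriminant analysis would test how well the recovered groups separate.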

