Dissimilarity Measure: Latest Research Papers

Abstract. Investigating thermal energy demand is crucial for the development of sustainable cities and the efficient use of renewable sources. Despite the advances made in this field, analyzing the energy data provided by smart grids remains a demanding challenge. In this paper, we develop a clustering methodology based on a novel dissimilarity measure to analyze high temporal resolution panel data for district heating demand in the Italian city of Bozen-Bolzano. Starting from the characteristics of these data, we explore the usefulness of the Ali-Mikhail-Haq copula in defining a new dissimilarity measure to cluster variables in a hierarchical framework. We show that our proposal is particularly sensitive to small dissimilarities based on tiny differences in the dependence level. Therefore, the proposed measure distinguishes between objects with low dissimilarity better than classic rank-based dissimilarity measures. Moreover, our proposal is also defined in a spatial version that takes into account the spatial location of the compared objects. We investigate the proposed measure through Monte Carlo studies and compare it with the corresponding spatial Kendall's correlation-based dissimilarity measure. Finally, the application to real data makes it possible to find clusters of buildings that are homogeneous with respect to their main characteristics, such as energy efficiency and heating surface, to support the design, expansion and management of district heating systems.
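
The abstract compares its proposal against classic rank-based dissimilarities such as the one built on Kendall's correlation. As a rough illustration of that baseline only (not the Ali-Mikhail-Haq copula measure, whose formula the abstract does not give), the sketch below clusters variables hierarchically using the dissimilarity sqrt(2(1 - tau)). The transformation, the average linkage, the function names and the synthetic demand profiles are all assumptions made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import kendalltau


def kendall_dissimilarity(data):
    """Pairwise dissimilarity sqrt(2 * (1 - tau)) between the columns of data."""
    n_vars = data.shape[1]
    d = np.zeros((n_vars, n_vars))
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            tau, _ = kendalltau(data[:, i], data[:, j])
            d[i, j] = d[j, i] = np.sqrt(2.0 * (1.0 - tau))
    return d


def cluster_variables(data, n_clusters):
    """Average-linkage hierarchical clustering of the columns of data."""
    d = kendall_dissimilarity(data)
    # linkage expects the condensed (upper-triangular) form of the matrix.
    z = linkage(squareform(d, checks=False), method="average")
    return fcluster(z, t=n_clusters, criterion="maxclust")


# Synthetic stand-in for demand series: two pairs of strongly dependent columns.
rng = np.random.default_rng(0)
base_a = rng.normal(size=500)
base_b = rng.normal(size=500)
demand = np.column_stack([
    base_a + 0.1 * rng.normal(size=500),
    base_a + 0.1 * rng.normal(size=500),
    base_b + 0.1 * rng.normal(size=500),
    base_b + 0.1 * rng.normal(size=500),
])
labels = cluster_variables(demand, n_clusters=2)
print(labels)
```

Columns that share a base series should end up in the same cluster. A copula-based or spatial variant would replace only `kendall_dissimilarity`, leaving the clustering step unchanged.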

Improved generalized dissimilarity measure‐based VIKOR method for Pythagorean fuzzy sets

International Journal of Intelligent Systems ◽

10.1002/int.22757 ◽

2021 ◽

Author(s):

Muhammad Jabir Khan ◽

Muhammad Irfan Ali ◽

Poom Kumam ◽

Wiyada Kumam ◽

Muhammad Aslam ◽

...

Keyword(s):

Fuzzy Sets ◽

Dissimilarity Measure ◽

Vikor Method ◽

Pythagorean Fuzzy Sets

Analysis and Diagnostics for Censored Regression and Multivariate Data

10.26686/wgtn.16973998.v1 ◽

2021 ◽

Author(s):

Nazrina Aziz

Keyword(s):

Regression Model ◽

Cox Model ◽

Multivariate Data ◽

Dissimilarity Measure ◽

Local Influence ◽

Data Sets ◽

Influential Observations ◽

Censored Regression ◽

Data Set ◽

Survival Regression

This thesis investigates three research problems that arise in multivariate data and censored regression. The first is the identification of outliers in multivariate data. The second is a dissimilarity measure for clustering purposes. The third is diagnostic analysis for the Buckley-James method in censored regression. An outlier can be defined simply as an observation (or a subset of observations) that is isolated from the other observations in the data set. Two main reasons motivate the search for outliers. The first is the researcher's intention. The second is the effect of outliers on analyses: outliers affect means, variances and regression coefficients, bias or distort estimates, and inflate the sums of squares, so false conclusions are likely. Sometimes the identification of outliers is the main objective of the analysis, as is the decision whether to remove them or to down-weight them prior to fitting a non-robust model. This thesis does not differentiate between the various justifications for outlier detection. The aim is to alert the analyst to observations that are considerably different from the majority. The techniques for identifying outliers introduced in this thesis are applicable to a wide variety of settings and are applied to both large and small data sets. In this thesis, observations located far away from the remaining data are considered to be outliers. Additionally, some techniques for identifying outliers can also be used for finding clusters. There are two major challenges in clustering. The first is that identifying clusters in high-dimensional data sets is difficult because of the curse of dimensionality. The second is that a new dissimilarity measure is needed, as some traditional distance functions cannot capture the pattern dissimilarity among the objects.
This thesis deals with the latter challenge. It introduces the Influence Angle Cluster Approach (iaca), which may be used as a dissimilarity matrix, and shows that iaca successfully forms clusters when used in partitioning clustering, even if the data set has mixed variables, i.e. interval and categorical variables. The iaca is based on the influence eigenstructure. The first two problems in this thesis deal with complete data sets. It is also of interest to study incomplete data sets, i.e. censored data sets. The term 'censored' is mostly used in the biological sciences, for example in survival analysis. Researchers are often interested in comparing the survival distributions of two samples. Although this can be done with the logrank test, that method cannot examine the effects of more than one variable at a time. This difficulty can be overcome by using a survival regression model. Examples of survival regression models are the Cox model, Miller's model, the Buckley-James model and the Koul-Susarla-Van Ryzin model. The Buckley-James model's performance is comparable with that of the Cox model, and it performs best when compared with both the Miller model and the Koul-Susarla-Van Ryzin model. Previous comparison studies showed that the Buckley-James estimator is more stable and easier to explain to non-statisticians than the Cox model. Nevertheless, researchers today tend to use the Cox model instead of the Buckley-James model, because the Buckley-James model is poorly supported in statistical software and offers little choice of diagnostic analysis. Currently, only a few diagnostic analyses exist for the Buckley-James model. Therefore, this thesis proposes two new diagnostic analyses for the Buckley-James model. The first is called renovated Cook's distance. This method produces results comparable with previous findings.
Nevertheless, this method cannot identify influential observations from the censored group; it can only detect influential observations from the uncensored group. This issue needs further investigation because censored points may become influential cases in censored regression. Second, a local influence approach for the Buckley-James model is proposed. This thesis presents local influence diagnostics for the Buckley-James model, which consist of variance perturbation, response variable perturbation, censoring status perturbation, and independent variables perturbation. The proposed diagnostics improve on and also challenge previous findings by allowing both censored and uncensored observations to be identified as influential.
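
For readers unfamiliar with the diagnostic that the thesis adapts, the sketch below computes the classical Cook's distance for an ordinary least-squares fit on fully observed data. It is not the renovated Cook's distance or the Buckley-James estimator described above, neither of which is specified in the abstract. The function name and the synthetic data are assumptions for illustration.

```python
import numpy as np


def cooks_distance(x, y):
    """Classical Cook's distance for an OLS fit of y on x (intercept added)."""
    n = x.shape[0]
    design = np.column_stack([np.ones(n), x])
    p = design.shape[1]
    # Leverages are the diagonal of the hat matrix, obtained here from a QR
    # decomposition as the squared row norms of Q.
    q, _ = np.linalg.qr(design)
    leverage = np.sum(q**2, axis=1)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    s2 = resid @ resid / (n - p)  # residual variance estimate
    return resid**2 / (p * s2) * leverage / (1.0 - leverage) ** 2


# Synthetic, uncensored data with one planted influential observation.
rng = np.random.default_rng(1)
x = rng.normal(size=(50, 1))
y = 2.0 + 3.0 * x[:, 0] + 0.1 * rng.normal(size=50)
x[0, 0] = 6.0   # high leverage
y[0] = -10.0    # far from the linear trend
d = cooks_distance(x, y)
print(int(np.argmax(d)))
```

The planted observation combines high leverage with a large residual, so it should receive the largest distance. Handling censored responses, which is the point of the thesis, requires a different estimator and is outside this sketch.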

Improving structural variant clustering to reduce the negative effect of the breakpoint uncertainty problem

BMC Bioinformatics ◽

10.1186/s12859-021-04374-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Jan Geryk ◽

Alzbeta Zinkova ◽

Iveta Zedníková ◽

Halina Simková ◽

Vlastimil Stenzl ◽

...

Keyword(s):

Mendelian Inheritance ◽

Dissimilarity Measure ◽

Constrained Clustering ◽

Structural Variants ◽

Short Read Sequencing ◽

Population Analyses ◽

Heuristic Strategy ◽

Negative Effect ◽

Critical Problems ◽

Hardy Weinberg Equilibrium

Abstract. Background: Structural variants (SVs) represent an important source of genetic variation. One of the most critical problems in their detection is breakpoint uncertainty, associated with the inability to determine their exact genomic position. Breakpoint uncertainty is a characteristic issue of structural variants detected via short-read sequencing methods and complicates subsequent population analyses. The commonly used heuristic strategy reduces this issue by clustering/merging nearby structural variants of the same type before the data from individual samples are merged. Results: We compared the two most used dissimilarity measures for SV clustering in terms of Mendelian inheritance errors (MIE), kinship prediction, and deviation from Hardy–Weinberg equilibrium. We analyzed the occurrence of Mendelian-inconsistent SV clusters that can be collapsed into one Mendelian-consistent SV as a new measure of dataset consistency. We also developed a new method based on constrained clustering that explicitly identifies these types of clusters. Conclusions: We found that the dissimilarity measure based on the distance between SV breakpoints produces slightly better results than the measure based on SV overlap. This difference is evident in the trivial and corrected clustering strategies, but not in the constrained clustering strategy. However, the constrained clustering strategy provided the best results in all aspects, regardless of the dissimilarity measure used.
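
The two families of dissimilarity compared in this abstract can be sketched as follows. The abstract does not state the exact formulas, so both definitions here are assumptions: the breakpoint measure is taken as the larger of the two breakpoint shifts, and the overlap measure as one minus the reciprocal overlap. The `StructuralVariant` class and its half-open coordinates are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuralVariant:
    chrom: str
    start: int    # inclusive
    end: int      # exclusive
    sv_type: str  # e.g. "DEL", "DUP"


def comparable(a, b):
    """Only SVs of the same type on the same chromosome are merge candidates."""
    return a.chrom == b.chrom and a.sv_type == b.sv_type


def breakpoint_distance(a, b):
    """Largest shift between corresponding breakpoints (inf if not comparable)."""
    if not comparable(a, b):
        return float("inf")
    return float(max(abs(a.start - b.start), abs(a.end - b.end)))


def overlap_dissimilarity(a, b):
    """One minus reciprocal overlap: shared bases over the longer interval."""
    if not comparable(a, b):
        return 1.0
    overlap = max(0, min(a.end, b.end) - max(a.start, b.start))
    longest = max(a.end - a.start, b.end - b.start)
    return 1.0 - overlap / longest if longest > 0 else 1.0


sv_a = StructuralVariant("chr1", 1000, 2000, "DEL")
sv_b = StructuralVariant("chr1", 1050, 2100, "DEL")
sv_c = StructuralVariant("chr1", 1050, 2100, "DUP")

print(breakpoint_distance(sv_a, sv_b))
print(overlap_dissimilarity(sv_a, sv_b))
print(breakpoint_distance(sv_a, sv_c))
```

The breakpoint measure is expressed in base pairs and does not depend on SV length, whereas the overlap measure is relative to length, so the same breakpoint shift weighs more heavily on short variants.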

Context-Based Geodesic Dissimilarity Measure for Clustering Categorical Data

Applied Sciences ◽

10.3390/app11188416 ◽

2021 ◽

Vol 11 (18) ◽

pp. 8416

Author(s):

Changki Lee ◽

Uk Jung

Keyword(s):

Learning Outcomes ◽

Categorical Data ◽

Dissimilarity Measure ◽

Machine Learning Algorithms ◽

Distance Measures ◽

Categorical Variables ◽

Continuous Data ◽

Clustering Problem ◽

Data Clusters ◽

Categorical Data Clustering

Measuring the dissimilarity between two observations is the basis of many data mining and machine learning algorithms, and its effectiveness has a significant impact on learning outcomes. The dissimilarity or distance computation has been a manageable problem for continuous data because many numerical operations can be successfully applied. However, unlike continuous data, defining a dissimilarity between pairs of observations with categorical variables is not straightforward. This study proposes a new method to measure the dissimilarity between two categorical observations, called a context-based geodesic dissimilarity measure, for the categorical data clustering problem. The proposed method considers the relationships between categorical variables and discovers the implicit topological structures in categorical data. In other words, it can effectively reflect the nonlinear patterns of arbitrarily shaped categorical data clusters. Our experimental results confirm that the proposed measure that considers both nonlinear data patterns and relationships among the categorical variables yields better clustering performance than other distance measures.
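
The general idea of a geodesic dissimilarity can be illustrated with a simplified sketch: compute a base dissimilarity, connect each observation to its nearest neighbours, and take shortest-path lengths through that graph. This is not the authors' context-based measure; the base dissimilarity here is plain simple matching (the fraction of mismatching attributes), which ignores relationships between variables. The function names, the neighbourhood size `k`, and the integer-coded toy data are assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path


def simple_matching(data):
    """Fraction of mismatching attributes between every pair of rows."""
    return (data[:, None, :] != data[None, :, :]).mean(axis=2)


def geodesic_dissimilarity(data, k):
    """Shortest-path dissimilarity over a k-nearest-neighbour graph."""
    base = simple_matching(data)
    n = base.shape[0]
    # Dense csgraph input treats 0 as "no edge", so duplicate rows (distance 0)
    # would silently lose their edges; require distinct rows instead.
    if np.any(base[~np.eye(n, dtype=bool)] == 0):
        raise ValueError("rows must be distinct; collapse duplicates first")
    graph = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(base[i], kind="stable")
        neighbours = [j for j in order if j != i][:k]
        graph[i, neighbours] = base[i, neighbours]
    graph = np.maximum(graph, graph.T)  # make the neighbour graph symmetric
    return shortest_path(graph, method="D", directed=False)


# Integer-coded categorical observations forming a chain of single changes.
data = np.array([
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [2, 1, 0],
])
direct = simple_matching(data)
geodesic = geodesic_dissimilarity(data, k=1)
print(direct[0, 3], geodesic[0, 3])
```

In this toy data the first and last rows differ in two of three attributes, but the only path through the neighbour graph takes three single-attribute steps, so the geodesic value exceeds the direct one. Pairs in disconnected components receive an infinite dissimilarity, which a practical implementation would need to handle.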
