ASSESSMENT OF NORMALIZATION TECHNIQUES ON THE ACCURACY OF HYPERSPECTRAL DATA CLUSTERING

Partitioning clustering algorithms, such as k-means, are the most widely used clustering algorithms in the remote sensing community. They identify clusters within multidimensional data based on similarity measures (SMs). SMs give more weight to features with large ranges than to those with small ranges, so small-range features are suppressed by large-range features and cannot influence the clustering procedure. This problem worsens for high-dimensional data such as hyperspectral remotely sensed images. Feature normalization (FN) can be used to address it. However, because different FN methods perform differently, this study examined the effects of ten FN methods on hyperspectral data clustering. The methods were evaluated on both real and synthetic hyperspectral datasets. The evaluations demonstrated that applying FN led to better results than clustering without FN. More importantly, the results showed that rank-based FN, with improvements of 15.7% and 12.8% on the synthetic and real datasets, respectively, can be considered the best FN method for hyperspectral data clustering.
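
The abstract does not define the rank-based FN it evaluates. As a minimal sketch, assuming it means replacing each value by its rank within its feature (ties share an average rank) and scaling the ranks to [0, 1], the effect on a distance computation can be illustrated as follows. The function names and the toy pixel values are illustrative only and do not come from the paper.

```python
def rank_normalize(column):
    """Map each value to its average rank within the column, scaled to [0, 1]."""
    n = len(column)
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda i: column[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values so they share one average rank.
        while j + 1 < n and column[order[j + 1]] == column[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [r / (n - 1) for r in ranks]


def normalize_features(rows):
    """Rank-normalize every feature (column) of a row-major table."""
    columns = [list(col) for col in zip(*rows)]
    normalized = [rank_normalize(col) for col in columns]
    return [list(row) for row in zip(*normalized)]


def squared_euclidean(a, b):
    """Squared Euclidean distance, a common similarity measure for k-means."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical pixels with two features: the first spans thousands,
# the second spans less than one unit.
pixels = [[1000.0, 0.10], [1010.0, 0.90], [5000.0, 0.12], [5020.0, 0.95]]
scaled = normalize_features(pixels)

# Before normalization the first feature dominates the distance;
# after it, both features contribute on the same [0, 1] scale.
raw_distance = squared_euclidean(pixels[0], pixels[1])
scaled_distance = squared_euclidean(scaled[0], scaled[1])
```

In the raw data the distance between the first two pixels is driven almost entirely by the large-range feature, while after rank normalization the small-range feature contributes the larger share.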

Robust models and novel similarity measures for high-dimensional data clustering

10.32657/10356/48657 ◽

2012 ◽

Author(s):

Duc Thang Nguyen

Keyword(s):

Data Clustering ◽

High Dimensional Data ◽

Similarity Measures ◽

High Dimensional

Handling Big Data Scalability in Biological Domain Using Parallel and Distributed Processing: A Case of Three Biological Semantic Similarity Measures

BioMed Research International ◽

10.1155/2019/6750296 ◽

2019 ◽

Vol 2019 ◽

pp. 1-20 ◽

Cited By ~ 1

Author(s):

Ameera M. Almasoud ◽

Hend S. Al-Khalifa ◽

Abdulmalik S. Al-Salman

Keyword(s):

Big Data ◽

Semantic Similarity ◽

Data Clustering ◽

Input Data ◽

Distributed Processing ◽

Clustering Algorithms ◽

Similarity Measures ◽

Parallel And Distributed Processing ◽

Time Reduction ◽

Improved Performance

In the field of biology, researchers need to compare genes or gene products using semantic similarity measures (SSMs). Continuous data growth and diversity in data characteristics comprise what is called big data; current biological SSMs cannot handle big data. Therefore, these measures need the ability to control the size of big data. We used parallel and distributed processing by splitting data into multiple partitions and applying SSMs to each partition; this approach helped manage big data scalability and computational problems. Our solution involves three steps: split gene ontology (GO), data clustering, and semantic similarity calculation. To test this method, split GO and data clustering algorithms were defined and assessed for performance in the first two steps. Three of the best SSMs in biology [Resnik, Shortest Semantic Differentiation Distance (SSDD), and SORA] are enhanced by introducing threaded parallel processing, which is used in the third step. Our results demonstrate that introducing threads in SSMs reduced the time of calculating semantic similarity between gene pairs and improved the performance of the three SSMs. Average time was reduced by 24.51% for Resnik, 22.93% for SSDD, and 33.68% for SORA. Total time was reduced by 8.88% for Resnik, 23.14% for SSDD, and 39.27% for SORA. Using these threaded measures in the distributed system, combined with using split GO and data clustering algorithms to split input data based on their similarity, reduced the average time more than the approach of dividing input data equally. Time reduction increased with the number of splits. Time reduction percentage was 24.1%, 39.2%, and 66.6% for Threaded SSDD; 33.0%, 78.2%, and 93.1% for Threaded SORA in the case of 2, 3, and 4 slaves, respectively; and 92.04% for Threaded Resnik in the case of four slaves.
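
The partition-then-score idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the real measures (Resnik, SSDD, SORA) require the gene ontology graph, so `toy_similarity` below is a hypothetical stand-in (Jaccard overlap of annotation sets), and the gene names and `GO:` identifiers are placeholders, not real annotations. The paper's own splitting and distribution scheme is not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def toy_similarity(terms_a, terms_b):
    """Hypothetical stand-in for an SSM: Jaccard overlap of annotation sets."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def score_partition(partition):
    """Score every gene pair inside one partition (a dict: gene -> term set)."""
    genes = sorted(partition)
    return {
        (g1, g2): toy_similarity(partition[g1], partition[g2])
        for g1, g2 in combinations(genes, 2)
    }


def score_all(partitions, max_workers=4):
    """Score each partition on its own worker thread and merge the results."""
    scores = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for result in pool.map(score_partition, partitions):
            scores.update(result)
    return scores


# Placeholder input: genes already grouped into partitions by similarity.
partitions = [
    {"geneA": {"GO:1", "GO:2"}, "geneB": {"GO:2", "GO:3"}},
    {"geneC": {"GO:7"}, "geneD": {"GO:7"}, "geneE": {"GO:8"}},
]
scores = score_all(partitions)
```

Only pairs inside the same partition are compared, which is where the reduction in work comes from. Note that in standard CPython, threads mainly help when the per-pair work waits on I/O or releases the interpreter lock; for purely CPU-bound scoring a process pool may be the better fit.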

Cross Breed Clustering Algorithm for High Dimensional Data

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.a5313.119119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 5049-5052

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Data Clustering ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

High Dimensional Data ◽

High Dimensional ◽

Growing Domain ◽

Present World

Clustering plays a major role in machine learning and in data mining. Deep learning is a fast-growing domain, and adopting deep learning algorithms can improve the quality of clustering results. Many clustering algorithms process various datasets to obtain better results, but with existing algorithms, producing high-quality clusters for high-dimensional data remains an issue. In this paper, a cross breed clustering algorithm for high-dimensional data is used, and various datasets are used to obtain the results.

Hyperspectral Data Clustering Using Hellinger Divergence

Journal of Physics Conference Series ◽

10.1088/1742-6596/2096/1/012170 ◽

2021 ◽

Vol 2096 (1) ◽

pp. 012170

Author(s):

E Myasnikov

Keyword(s):

Image Processing ◽

Data Clustering ◽

Gradient Descent ◽

Hyperspectral Image ◽

Clustering Algorithms ◽

Hyperspectral Data ◽

Dissimilarity Measures ◽

Clustering Technique ◽

Hyperspectral Image Processing ◽

Hellinger Divergence

Clustering is an important task in hyperspectral image processing. Despite the existence of a large number of clustering algorithms, little attention has been paid to the use of non-Euclidean dissimilarity measures in the clustering of hyperspectral data. This paper proposes a clustering technique based on the Hellinger divergence as a dissimilarity measure. The proposed technique uses Lloyd's ideas from the k-means algorithm and a gradient descent-based procedure to update cluster centroids. The proposed technique is compared, in terms of clustering error and runtime, with an alternative fast k-medoid algorithm implemented using the same metric. Experiments carried out on an open hyperspectral scene have shown the advantages of the proposed technique.
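
The abstract does not give the exact form of the divergence or of the centroid update. As a hedged sketch, assuming spectra are non-negative and normalized to sum to one, and using the standard Hellinger distance H(p, q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2), the assignment step of a Lloyd-style iteration could look like this. The gradient-based centroid update from the paper is not reproduced, and all function names are illustrative.

```python
import math


def to_distribution(spectrum):
    """Scale a non-negative spectrum so that its entries sum to one."""
    total = sum(spectrum)
    if total <= 0:
        raise ValueError("spectrum must have a positive sum")
    return [value / total for value in spectrum]


def hellinger(p, q):
    """Standard Hellinger distance between two discrete distributions."""
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )


def assign_to_centroids(spectra, centroids):
    """Assignment step: index of the nearest centroid for every spectrum."""
    return [
        min(range(len(centroids)), key=lambda k: hellinger(s, centroids[k]))
        for s in spectra
    ]


# Toy three-band spectra and two centroids (illustrative values only).
spectra = [to_distribution([9, 1, 0]), to_distribution([0, 1, 9])]
centroids = [to_distribution([8, 2, 0]), to_distribution([0, 2, 8])]
labels = assign_to_centroids(spectra, centroids)
```

With this definition the distance is bounded between 0 (identical distributions) and 1 (distributions with disjoint support).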

Improved Text Clustering Using k-Mean Bayesian Vectoriser

Journal of Information & Knowledge Management ◽

10.1142/s0219649214500269 ◽

2014 ◽

Vol 13 (03) ◽

pp. 1450026 ◽

Cited By ~ 4

Author(s):

Hanan M. Alghamdi ◽

Ali Selamat ◽

Nor Shahriza Abdul Karim

Keyword(s):

Probability Distribution ◽

Euclidean Distance ◽

Clustering Algorithms ◽

High Dimensional Data ◽

Similarity Measures ◽

Text Clustering ◽

High Dimensional ◽

Document Representation ◽

Specific Category ◽

Squared Euclidean Distance

According to the literature, high-dimensional data reduces the efficiency of clustering algorithms and increases execution time. Therefore, in this paper, we propose an approach called BV-kmeans (Bayesian Vectorisation along with k-means) that aims to improve document representation models for text clustering. The approach integrates k-means document clustering with the Bayesian Vectoriser, which computes the probability distribution of the documents in the vector space, in order to overcome the problems of high-dimensional data and lower the running time. We used several similarity measures, namely K divergence, squared Euclidean distance, and squared χ² distance, to determine which metrics are effective for modelling the similarity between documents with the proposed approach. We evaluated the approach on a set of common newspaper websites that have high-dimensional data. Experimental results show that, compared with the standard k-means algorithm, the proposed approach can increase the degree to which a cluster encases documents from a specific category by 85% and lower the runtime by 95%.
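
The abstract names the three measures without defining them. The sketch below assumes the commonly used definitions for probability vectors: squared Euclidean sum_i (p_i - q_i)^2, squared χ² sum_i (p_i - q_i)^2 / (p_i + q_i), and K divergence sum_i p_i ln(2 p_i / (p_i + q_i)). Whether the paper uses exactly these forms is an assumption, and the two document vectors are invented for illustration.

```python
import math


def squared_euclidean(p, q):
    """Sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def squared_chi2(p, q):
    """Squared chi-squared distance; coordinates where both are zero are skipped."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


def k_divergence(p, q):
    """K divergence of p from q; terms with p_i == 0 contribute nothing."""
    return sum(a * math.log(2 * a / (a + b)) for a, b in zip(p, q) if a > 0)


# Two hypothetical documents as probability distributions over three terms.
doc_a = [0.5, 0.5, 0.0]
doc_b = [0.25, 0.25, 0.5]

distances = {
    "squared_euclidean": squared_euclidean(doc_a, doc_b),
    "squared_chi2": squared_chi2(doc_a, doc_b),
    "k_divergence": k_divergence(doc_a, doc_b),
}
```

Unlike the other two, the K divergence is not symmetric in its arguments, so the order in which a document and a centroid are passed matters.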

A robustness metric for biological data clustering algorithms

BMC Bioinformatics ◽

10.1186/s12859-019-3089-6 ◽

2019 ◽

Vol 20 (S15) ◽

Author(s):

Yuping Lu ◽

Charles A. Phillips ◽

Michael A. Langston

Keyword(s):

Data Clustering ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Similarity Measures ◽

Parameter Tuning ◽

Biological Data ◽

Algorithm Selection ◽

The Stability ◽

Microarray Datasets ◽

Cluster Quality

Background: Cluster analysis is a core task in modern data-centric computation. Algorithmic choice is driven by factors such as data size and heterogeneity, the similarity measures employed, and the type of clusters sought. Familiarity and mere preference often play a significant role as well. Comparisons between clustering algorithms tend to focus on cluster quality. Such comparisons are complicated by the fact that algorithms often have multiple settings that can affect the clusters produced. Such a setting may represent, for example, a preset variable, a parameter of interest, or various sorts of initial assignments. A question of interest then is this: to what degree do the clusters produced vary as setting values change?

Results: This work introduces a new metric, termed simply "robustness", designed to answer that question. Robustness is an easily interpretable measure of the propensity of a clustering algorithm to maintain output coherence over a range of settings. The robustness of eleven popular clustering algorithms is evaluated over some two dozen publicly available mRNA expression microarray datasets. Given their straightforwardness and predictability, hierarchical methods generally exhibited the highest robustness on most datasets. Of the more complex strategies, the paraclique algorithm yielded consistently higher robustness than other algorithms tested, approaching and even surpassing hierarchical methods on several datasets. Other techniques exhibited mixed robustness, with no clear distinction between them.

Conclusions: Robustness provides a simple and intuitive measure of the stability and predictability of a clustering algorithm. It can be a useful tool to aid both in algorithm selection and in deciding how much effort to devote to parameter tuning.

A General Framework for Mixed and Incomplete Data Clustering Based on Swarm Intelligence Algorithms

Mathematics ◽

10.3390/math9070786 ◽

2021 ◽

Vol 9 (7) ◽

pp. 786

Author(s):

Yenny Villuendas-Rey ◽

Eley Barroso-Cubas ◽

Oscar Camacho-Nieto ◽

Cornelio Yáñez-Márquez

Keyword(s):

Swarm Intelligence ◽

Data Clustering ◽

Incomplete Data ◽

Missing Values ◽

Clustering Algorithms ◽

Bat Algorithm ◽

Hybrid Features ◽

Bee Colony ◽

Learning Tasks ◽

Clustering Data

Swarm intelligence has emerged as an active field for solving numerous machine-learning tasks. In this paper, we address the problem of clustering data with missing values, where the patterns are described by mixed (or hybrid) features. We introduce a generic modification to three swarm intelligence algorithms (Artificial Bee Colony, Firefly Algorithm, and Novel Bat Algorithm). We experimentally determine adequate parameter values for these three modified algorithms, with the purpose of applying them to the clustering task. We also provide an unbiased comparison among several metaheuristic-based clustering algorithms, concluding that the clusters obtained by our proposals are highly representative of the "natural structure" of the data.

Development for modification of Torgerson projection method using cumulative curve analysis in outlier detection problem for high-dimensional data

Вычислительные технологии ◽

10.25743/ict.2020.25.3.013 ◽

2020 ◽

pp. 119-129

Author(s):

Никита Сергеевич Олейник ◽

Владислав Юрьевич Щеколдин

Keyword(s):

Multidimensional Scaling ◽

Outlier Detection ◽

High Dimensional Data ◽

Quality Data ◽

Multidimensional Data ◽

High Dimensional ◽

Detection Problem ◽

Gravity Center ◽

Largest Eigenvalues ◽

Cumulative Curves

The problem of detecting anomalous observations in high-dimensional data is considered on the basis of multidimensional scaling, taking into account the possibility of constructing a high-quality data visualization. An algorithm for a modified Torgerson principal projection method is proposed; it builds the projection subspace of the source data by changing how the matrix of scalar products is factorized, using cumulative curve analysis. The empirical distribution of the F-measure for different projection variants of the source data is constructed and analyzed.

Purpose. The paper aims to develop methods of multidimensional data presentation for solving classification problems based on cumulative curve analysis. It considers the outlier detection problem for high-dimensional data based on multidimensional scaling, in order to construct a high-quality data visualization. An abnormal observation (or outlier), according to D. Hawkins, is an observation that differs so much from the others that it may be assumed to have appeared in the sample in a fundamentally different way.

Methods. One of the conceptual approaches that allow classifying sample observations is multidimensional scaling, represented by the classical Orlochi method, the Torgerson principal projections, and others. The Torgerson method assumes that, to construct the most convenient classification, the origin must be placed at the gravity center of the analyzed data. The matrix of scalar products of the vectors originating at the gravity center is then calculated, the two largest eigenvalues and their eigenvectors are chosen, and the projection matrix is evaluated. Moreover, the method assumes that regular and anomalous observations are linearly separable, which is rarely the case. Therefore, it is logical to choose, among the possible projection axes, those that give more effective results for detecting outliers. A modified CC-ABOD (Cumulative Curves for Angle Based Outlier Detection) procedure has been applied to estimate the visualization quality. It is based on estimating the variance of the angles formed between a particular observation and the remaining observations in multidimensional space. Cumulative curve analysis is then applied, which separates groups of closely localized observations (according to the chosen metric) and forms classes of regular, intermediate, and anomalous observations.

Results. A modification of the Torgerson method is proposed. The F1-measure distribution is constructed and analyzed for different projection variants of the source data. An analysis of the empirical distribution showed that in a number of cases the best axes correspond to the second, third, or even fourth largest eigenvalues.

Findings. Multidimensional scaling methods for constructing visualizations of multidimensional data and solving outlier detection problems have been considered. It was found that the choice of projection is an ambiguous problem.
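
The classical Torgerson steps described under Methods (centering at the gravity center, forming the scalar product matrix, and projecting onto eigenvectors of the largest eigenvalues) can be sketched with NumPy as below. This is a minimal illustration of the unmodified method only: the `axes` parameter is a hypothetical addition that lets other eigenvalue pairs be selected, and the cumulative curve analysis and CC-ABOD scoring from the paper are not reproduced. The sample points are invented.

```python
import numpy as np


def torgerson_projection(X, axes=(0, 1)):
    """Project observations onto the chosen eigen-axes of the scalar product matrix.

    axes indexes eigenvalues in descending order, so (0, 1) selects the two largest.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)            # origin at the gravity center
    gram = centered @ centered.T             # matrix of scalar products
    eigvals, eigvecs = np.linalg.eigh(gram)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    idx = list(axes)
    # Clip tiny negative eigenvalues caused by floating-point round-off.
    scale = np.sqrt(np.clip(eigvals[idx], 0.0, None))
    return eigvecs[:, idx] * scale


# Hypothetical observations: four regular points and one distant point.
points = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [5.0, 5.0, 0.0],
]
embedding = torgerson_projection(points)                   # two largest eigenvalues
alternative = torgerson_projection(points, axes=(1, 2))    # second and third
```

Because these sample points lie in a plane, the default two-axis projection preserves all pairwise distances; projections onto other axis pairs, such as `alternative`, generally do not, which is why the choice of axes affects how well outliers separate.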
