A Comprehensive Study on the Importance of the Elbow and the Silhouette Metrics in Cluster Count Prediction for Partition Cluster Models

2021 · Vol 11 (4) · pp. 3792-3806
Author(s): A.A. Abdulnassar, Latha R. Nair

Proper selection of the cluster count gives better clustering results in partition models. Partition clustering methods are simple as well as efficient. K-means and its modified versions are very efficient cluster models, but their results are highly sensitive to the chosen K value. Partition clustering algorithms are most suitable in applications where the data are arranged in a uniform manner. This work evaluates the importance of the assigned cluster count value for improving the efficiency of partition clustering algorithms, using two well-known statistical methods: the Elbow method and the Silhouette method. The performance of the Silhouette method and the Elbow method is compared on different data sets from the UCI data repository. The values obtained using these methods are compared with the cluster performance results obtained on the selected data sets using the statistical analysis tool Weka. Performance was evaluated on cluster efficiency for small and large data sets by varying the cluster count values. Similar results were obtained from the three methods: the Elbow method, the Silhouette method, and clustering with Weka. It was also observed that clustering efficiency falls rapidly for small changes in the cluster count when the cluster count is small.
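
To make the two criteria concrete, here is a minimal sketch in Python (assuming scikit-learn; a synthetic data set stands in for the UCI data used in the study): the Elbow method looks for the "knee" in the within-cluster sum of squares as K grows, while the Silhouette method picks the K that maximises the mean silhouette coefficient.

```python
# Minimal sketch: choosing K with the Elbow and Silhouette methods.
# Assumes scikit-learn; the data set is synthetic, purely for illustration.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss = km.inertia_                      # Elbow: look for the "knee" in WCSS
    sil = silhouette_score(X, km.labels_)   # Silhouette: maximise the mean score
    print(f"K={k:2d}  WCSS={wcss:10.1f}  silhouette={sil:.3f}")
```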

Author(s): Raymond Greenlaw, Sanpawat Kantabutra

This chapter provides the reader with an introduction to clustering algorithms and applications. A number of important, well-known clustering methods are surveyed. The authors present a brief history of the development of the field of clustering, discuss various types of clustering, and mention some of the current research directions in the field. Algorithms are described for top-down and bottom-up hierarchical clustering, as are algorithms for K-Means and K-Medians clustering. The technique of representative points is also presented. Given the large data sets involved in clustering, the need to apply parallel computing arises, so the authors also discuss issues related to parallel clustering. Throughout the chapter, references are provided to works that contain a large number of experimental results. A comparison of the various clustering methods is given in tabular format. The chapter concludes with a summary and an extensive list of references.
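
As a pointer to one of the surveyed techniques, the sketch below shows bottom-up (agglomerative) hierarchical clustering in the generic SciPy formulation; it is an illustration of the method, not code from the chapter.

```python
# Bottom-up (agglomerative) hierarchical clustering sketch using SciPy.
# Each point starts as its own cluster; the closest pair of clusters is
# merged repeatedly ("average" linkage uses the mean inter-cluster distance).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

Z = linkage(X, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree at 2 clusters
print(labels)
```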


2019 · Vol 8 (2S11) · pp. 3687-3693

Clustering is a type of mining process in which a data set is categorized into various subclasses. Clustering is essential in classification, grouping, exploratory pattern analysis, image segmentation, and decision making. Big data refers to very large data sets that are examined computationally to reveal patterns and associations, often relating to human behaviour and interactions. Big data is essential for many organisations, but in some cases it is very complex to store and time-consuming to process. One way of overcoming these issues is to develop clustering methods for it, although these suffer from high complexity. Data mining is a technique for extracting useful information, but conventional data mining models cannot be used for big data because of its inherent complexity. The main scope of this paper is to introduce an overview of data clustering approaches for big data and to explain some of the related work. The survey concentrates on research into clustering algorithms that operate on the elements of big data, and gives a short overview of clustering algorithms grouped under partitioning, hierarchical, grid-based, and model-based approaches. Clustering is a major data mining technique for analysing big data; the paper also discusses the problems of applying clustering methods to big data and the new issues that big data raises.


2006 · Vol 39 (2) · pp. 262-266
Author(s): R. J. Davies

Synchrotron sources offer high-brilliance X-ray beams which are ideal for spatially and time-resolved studies. Large amounts of wide- and small-angle X-ray scattering data can now be generated rapidly, for example, during routine scanning experiments. Consequently, the analysis of the large data sets produced has become a complex and pressing issue. Even relatively simple analyses become difficult when a single data set can contain many thousands of individual diffraction patterns. This article reports on a new software application for the automated analysis of scattering intensity profiles. It is capable of batch-processing thousands of individual data files without user intervention. Diffraction data can be fitted using a combination of background functions and non-linear peak functions. To complement the batch-wise operation mode, the software includes several specialist algorithms to ensure that the results obtained are reliable. These include peak-tracking, artefact removal, function elimination and spread-estimate fitting. Furthermore, as well as non-linear fitting, the software can calculate integrated intensities and selected orientation parameters.
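
The core fitting step that such software automates can be illustrated generically: a diffraction peak modelled as a Gaussian on a linear background, fitted with SciPy's curve_fit. This is only a sketch of the underlying idea, not the article's software; the model and data below are invented for illustration.

```python
# Generic sketch: fit one diffraction peak (Gaussian + linear background).
import numpy as np
from scipy.optimize import curve_fit

def peak_model(x, amp, mu, sigma, slope, offset):
    """Gaussian peak plus linear background."""
    return amp * np.exp(-0.5 * ((x - mu) / sigma) ** 2) + slope * x + offset

# Synthetic intensity profile standing in for one of thousands of patterns.
x = np.linspace(0, 10, 200)
rng = np.random.default_rng(1)
y = peak_model(x, amp=50, mu=5, sigma=0.4, slope=-0.5, offset=20)
y += rng.normal(0, 1.0, x.size)

p0 = [y.max(), x[np.argmax(y)], 1.0, 0.0, y.min()]  # crude initial guess
params, cov = curve_fit(peak_model, x, y, p0=p0)
print("fitted amplitude, centre, width:", params[:3])
# Batch processing would loop this fit over every file, seeding each fit
# with the previous result (the idea behind peak-tracking).
```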


2021
Author(s): Sebastiaan Valkiers, Max Van Houcke, Kris Laukens, Pieter Meysman

The T-cell receptor (TCR) determines the specificity of a T-cell towards an epitope. As yet, the rules for antigen recognition remain largely undetermined. Current methods for grouping TCRs according to their epitope specificity remain limited in performance and scalability. Multiple methodologies have been developed, but all of them fail to efficiently cluster large data sets exceeding 1 million sequences. To address this limitation, we developed clusTCR, a rapid TCR clustering alternative that efficiently scales up to millions of CDR3 amino acid sequences. Benchmarking comparisons revealed that clusTCR achieves accuracy similar to other TCR clustering methods while offering a drastic improvement in clustering speed: through efficient similarity searching and sequence hashing, it clusters millions of TCR sequences in just a few minutes. clusTCR was written in Python 3. It is available as an anaconda package (https://anaconda.org/svalkiers/clustcr) and on github (https://github.com/svalkiers/clusTCR).
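
As a rough, hypothetical illustration of why hashing helps at this scale (this is not clusTCR's actual pipeline), the toy below buckets CDR3 sequences by a hash of their length and k-mer content, so that expensive pairwise comparison is confined to each bucket instead of all pairs.

```python
# Toy illustration (not clusTCR's implementation) of hashing-based
# pre-grouping of CDR3 amino acid sequences before finer comparison.
from collections import defaultdict

def bucket_key(cdr3: str, k: int = 2) -> tuple:
    """Hash a sequence by its length and the set of its k-mers."""
    kmers = frozenset(cdr3[i:i + k] for i in range(len(cdr3) - k + 1))
    return (len(cdr3), hash(kmers))

sequences = ["CASSLGQAYEQYF", "CASSLGQSYEQYF", "CATSRDTQYF"]
buckets = defaultdict(list)
for s in sequences:
    buckets[bucket_key(s)].append(s)

# Pairwise comparison is now restricted to each bucket, not all pairs.
for key, members in buckets.items():
    print(key, members)
```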


Author(s): B. K. Tripathy, Hari Seetha, M. N. Murty

Data clustering plays a very important role in data mining, machine learning, and image processing. As modern databases carry inherent uncertainty, many uncertainty-based clustering algorithms have been developed. These include fuzzy c-means, rough c-means, and intuitionistic fuzzy c-means, as well as algorithms based on hybrid models such as rough fuzzy c-means and rough intuitionistic fuzzy c-means. There are also many variants that improve these algorithms in different directions, such as their kernelised versions, possibilistic versions, and possibilistic kernelised versions. However, none of the above algorithms is effective on big data, for various reasons. Researchers have therefore been trying for the past few years to improve these algorithms so that they can be applied to cluster big data; such algorithms remain relatively few in comparison to those for data sets of reasonable size. The aim of this chapter is to present the uncertainty-based clustering algorithms developed so far and to propose a few new algorithms that can be developed further.
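
The base algorithm of this family, fuzzy c-means, can be sketched in a few lines of NumPy; the update rules below are the standard textbook ones, shown purely to fix ideas.

```python
# Minimal fuzzy c-means (FCM) sketch: U holds graded memberships, not
# hard labels, and m > 1 is the "fuzzifier" controlling overlap.
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)      # memberships sum to 1 per point
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        inv = np.fmax(d, 1e-10) ** (-2.0 / (m - 1.0))
        new_U = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(new_U - U).max() < tol:
            return centers, new_U
        U = new_U
    return centers, U

X = np.vstack([np.random.randn(60, 2), np.random.randn(60, 2) + 4])
centers, U = fuzzy_c_means(X, c=2)
print(centers.round(2))                    # two centres near (0,0) and (4,4)
```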


2020 · Vol 11 (3) · pp. 42-67
Author(s): Soumeya Zerabi, Souham Meshoul, Samia Chikhi Boucherkha

Cluster validation aims both to evaluate the results of clustering algorithms and to predict the number of clusters. It is usually achieved using several indexes. Traditional internal clustering validation indexes (CVIs) are mainly based on computing pairwise distances, which results in a quadratic complexity of the related algorithms. Existing CVIs cannot handle large data sets properly and need to be revisited to take account of the ever-increasing data set volume; parallel and distributed implementations of these indexes are therefore required. To cope with this issue, the authors propose two parallel and distributed models for internal CVIs, namely for the Silhouette and Dunn indexes, using the MapReduce framework under Hadoop. The proposed models, termed MR_Silhouette and MR_Dunn, have been tested both to evaluate clustering results and to identify the optimal number of clusters. The results of the experimental study are very promising and show that the proposed parallel and distributed models achieve the expected tasks successfully.
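
The quadratic cost that motivates the distributed models is easy to see in the textbook silhouette formulation, sketched below; this is the naive O(n²) computation that a model like MR_Silhouette parallelises, not the authors' MapReduce code.

```python
# Textbook O(n^2) silhouette; assumes every cluster has >= 2 points.
import numpy as np

def naive_silhouette(X, labels):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # all pairs
    scores = np.empty(n)
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, own].mean()                     # mean intra-cluster distance
        b = min(D[i, labels == k].mean()         # nearest other cluster
                for k in set(labels) if k != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 3])
labels = np.array([0] * 30 + [1] * 30)
print(naive_silhouette(X, labels))
```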


Author(s): Sung-Gi Lee, Deok-Kyun Yun

In this paper, we present a concept based on the similarity of categorical attribute values considering implicit relationships and propose a new and effective clustering procedure for mixed data. Our procedure obtains similarities between categorical values from careful analysis and maps the values in each categorical attribute into points in two-dimensional coordinate space using multidimensional scaling. These mapped values make it possible to interpret the relationships between attribute values and to directly apply categorical attributes to clustering algorithms using a Euclidean distance. After trivial modifications, our procedure for clustering mixed data uses the k-means algorithm, well known for its efficiency in clustering large data sets. We use the familiar soybean disease and adult data sets to demonstrate the performance of our clustering procedure. The satisfactory results that we have obtained demonstrate the effectiveness of our algorithm in discovering structure in data.
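
A rough scikit-learn approximation of this pipeline (not the authors' code) looks as follows; the dissimilarity matrix between the values of one categorical attribute, which the paper derives from careful co-occurrence analysis, is invented here for illustration.

```python
# Sketch: embed categorical-value dissimilarities in 2-D with MDS, then
# apply Euclidean k-means directly to the embedded points.
import numpy as np
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

# Hypothetical dissimilarities between four values of one attribute.
D = np.array([[0.0, 0.2, 0.9, 0.8],
              [0.2, 0.0, 0.8, 0.9],
              [0.9, 0.8, 0.0, 0.1],
              [0.8, 0.9, 0.1, 0.0]])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)          # each categorical value -> 2-D point

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)
print(coords.round(2), labels)
```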


2017 · Vol 14 (S339) · pp. 310-313
Author(s): R. Kgoadi, I. Whittingham, C. Engelbrecht

Clustering algorithms constitute a multi-disciplinary analytical tool commonly used to summarise large data sets. Astronomical classifications are based on similarity, where celestial objects are assigned to a specific class according to specific physical features. The aim of this project is to obtain relevant information from high-dimensional data (at least three input variables in a data-frame) derived from stellar light-curves using a number of clustering algorithms such as K-means and Expectation Maximisation. In addition to identifying the best performing algorithm, we also identify a subset of features that best define stellar groups. Three methodologies are applied to a sample of Kepler time series in the temperature range 6500–19,000 K. In that range, at least four classes of variable stars are expected to be found: δ Scuti, γ Doradus, Slowly Pulsating B (SPB), and (the still equivocal) Maia stars.
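
For illustration, the two algorithms named above can be run side by side on a feature table as follows; the features here are synthetic placeholders, not the project's Kepler-derived variables.

```python
# Sketch: K-means vs Expectation Maximisation (Gaussian mixture) on a
# stand-in stellar feature table (e.g. dominant frequency, amplitude, Teff).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, n_features=3, centers=4, random_state=7)
X = StandardScaler().fit_transform(X)     # put features on a common scale

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
em_labels = GaussianMixture(n_components=4, random_state=0).fit_predict(X)
print(np.bincount(kmeans_labels), np.bincount(em_labels))
```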


2019 · Vol 65 · pp. 04008
Author(s): Kateryna Gorbatiuk, Olha Mantalyuk, Oksana Proskurovych, Oleksandr Valkov

Disparities in the development of regions in any country affect the entire national economy. Detecting these disparities can help formulate proper economic policies for each region by taking action against the factors that slow down economic growth. This study was conducted with the aim of applying clustering methods to analyse regional disparities based on the economic development indicators of the regions of Ukraine. Fuzzy clustering methods were considered; these generalize partition clustering methods by allowing objects to be partially assigned to more than one cluster. The fuzzy clustering technique was applied, using R packages, to data sets of statistical indicators on economic activity in all administrative regions of Ukraine in 2017. Sets of development indicators for different sectors of economic activity, such as industry, agriculture, construction and services, were reviewed and analysed. The study showed that the regional cluster classification results depend strongly on the input development indicators and on the clustering technique used. Consideration of different partitions into fuzzy clusters opens up new opportunities for developing recommendations on how to differentiate economic policies in order to achieve maximum growth for the regions and the entire country.

