optimal number of clusters
Recently Published Documents



2022 ◽  
Author(s):  
Gabriel A. Vignolle ◽  
Robert L. Mach ◽  
Astrid R. Mach-Aigner ◽  
Christian Derntl

Coevolution is an important biological process that shapes interacting species and even interacting proteins, whether these are physically interacting proteins or consecutive enzymes in a metabolic pathway. The detection of co-evolved proteins will contribute to a better understanding of biological systems. Previously, we developed a semi-automated method, termed FunOrder, for the detection of co-evolved genes in an input gene or protein set. We demonstrated the usability and applicability of FunOrder by identifying essential genes in biosynthetic gene clusters from different ascomycetes. A major drawback of the original method was the need for manual assessment, which may introduce user bias and prevents high-throughput application. Here we present a fully automated version of this method, termed FunOrder 2.0. To fully automate the method, we used several mathematical indices to determine the optimal number of clusters in the FunOrder output, followed by k-means clustering based on the first three principal components of a principal component analysis of the FunOrder output. Further, we replaced BLAST with the DIAMOND tool, which increased speed and allows the future integration of larger proteome databases. The introduced changes slightly decreased the sensitivity of the method, but this is outweighed by the gains in overall speed and specificity. Additionally, the changes lay the foundation for future high-throughput applications of FunOrder 2.0 in different phyla to solve different biological problems.
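The clustering step described in this abstract (k-means on the first three principal components) can be sketched as follows. This is only an illustration and not the FunOrder 2.0 code: it assumes NumPy is available, the function names `pca_scores` and `kmeans` are hypothetical, the farthest-point initialisation is a choice made here to keep the run deterministic, and the index-based selection of the number of clusters is not shown (k is simply passed in).

```python
import numpy as np

def pca_scores(X, n_components=3):
    """Project the rows of X onto the first n_components principal components."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans(X, k, n_iter=100):
    """Plain Lloyd's k-means with a deterministic farthest-point start (an assumption)."""
    centers = [X[0]]
    while len(centers) < k:
        # Squared distance of every point to its closest already-chosen center.
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its members; keep empty clusters in place.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

A call such as `labels, centers = kmeans(pca_scores(X), k)` would then cluster a samples-by-features matrix `X` in the reduced three-dimensional space.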


2021 ◽  
Vol 10 (3) ◽  
pp. 359-366
Author(s):  
Hanik Malikhatin ◽  
Agus Rusgiyono ◽  
Di Asih I Maruddani

Prospective Indonesian migrant workers (TKI) who apply for passports at the Class I Non TPI Immigration Office in Pati have different destination countries and choose different PPTKIS agencies. Grouping prospective TKI by their characteristics is therefore needed, so that the results can serve as a reference for the government in improving the protection of TKI in destination countries and in supervising the PPTKIS agencies that manage TKI more strictly. The purpose of this research is to group prospective TKI by their characteristics using the optimal number of clusters. The method used is k-modes clustering with k = 2, 3, 4, and 5, which can cluster categorical data. The optimal number of clusters is determined using the Dunn index. To make grouping the data easier, a graphical user interface (GUI) application was built with RStudio. Based on the analysis, the optimal number of clusters is two, with a Dunn index value of 0.4. Cluster 1 consists mostly of male TKI (51.04%), aged ≥ 20 years (91.93%), with Malaysia as the destination country (47%), who chose PPTKIS Surya Jaya Utama Abadi (37.51%). Cluster 2 consists mostly of male TKI (94.10%), aged ≥ 20 years (82.31%), with South Korea as the destination country (77.95%), who chose PPTKIS BNP2TKI (99.78%).
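As a rough illustration of the procedure in this abstract, the sketch below clusters categorical records with a tiny k-modes loop and scores the partition with the Dunn index. It is not the authors' RStudio application: the function names, the Hamming (simple matching) dissimilarity, and the choice of the first k distinct records as starting modes are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(records, k, n_iter=20):
    """Tiny k-modes: assign to the nearest mode, then recompute per-attribute modes."""
    # Deterministic start (an assumption): the first k distinct records.
    modes = []
    for r in records:
        if r not in modes:
            modes.append(r)
        if len(modes) == k:
            break
    labels = []
    for _ in range(n_iter):
        new_labels = [
            min(range(len(modes)), key=lambda j: hamming(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(modes)):
            members = [r for r, l in zip(records, labels) if l == j]
            if members:
                # New mode: the most frequent category in each attribute column.
                modes[j] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
    return labels

def dunn_index(records, labels):
    """Smallest between-cluster distance divided by the largest within-cluster diameter."""
    min_between = float("inf")
    max_within = 0
    for (a, la), (b, lb) in combinations(zip(records, labels), 2):
        d = hamming(a, b)
        if la == lb:
            max_within = max(max_within, d)
        else:
            min_between = min(min_between, d)
    return min_between / max_within if max_within else float("inf")
```

Under these assumptions, one would run `k_modes` for each candidate k (2 to 5 in the abstract) and keep the k with the largest `dunn_index`, since a higher Dunn index indicates compact, well-separated clusters.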


Author(s):  
Afdelia Novianti ◽  
Irsyifa Mayzela Afnan ◽  
Rafi Ilmi Badri Utama ◽  
Edy Widodo

Poverty is an essential issue for every country, including Indonesia. It can be caused by the scarcity of basic necessities or by difficulty in accessing education and employment. In 2019, Papua Province had the highest poverty percentage of any province, at 27.53%. In light of this, the districts of Papua Province are grouped by similar characteristics to describe poverty conditions, using the K-medoids clustering algorithm on the variables Percentage of Poor Population, Gross Regional Domestic Product, Open Unemployment Rate, Life Expectancy, Literacy Rate, and Population Working in the Agricultural Sector. The results of this study indicate that the optimal number of clusters for describing poverty conditions in Papua Province is 4, with a variance of 0.012: the first cluster consists of 10 districts, the second of 5 districts, the third of 12 districts, and the fourth of 2 districts.
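For readers unfamiliar with K-medoids, a minimal sketch of one common variant (alternating assignment and medoid update) is given below. It is not the study's implementation: the function name, the Euclidean distance, and the start from the first k points are assumptions, and the variance criterion the abstract uses to pick 4 clusters is not reproduced. In practice each point would be a district's vector of (typically standardised) indicator values.

```python
import math

def k_medoids(points, k, n_iter=50):
    """Alternating k-medoids: assign to the nearest medoid, then re-pick each medoid."""
    medoids = list(range(k))  # deterministic start: indices of the first k points (assumption)
    labels = []
    for _ in range(n_iter):
        labels = [
            min(range(k), key=lambda j: math.dist(p, points[medoids[j]]))
            for p in points
        ]
        new_medoids = []
        for j in range(k):
            members = [i for i, l in enumerate(labels) if l == j]
            if not members:
                new_medoids.append(medoids[j])  # keep the old medoid for an empty cluster
                continue
            # The medoid is the member with the smallest total distance to its cluster.
            best = min(
                members,
                key=lambda i: sum(math.dist(points[i], points[m]) for m in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels, medoids
```

Unlike k-means centroids, the returned medoids are indices of actual observations, which makes each cluster representative directly interpretable as a real district.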


Author(s):  
Vadim Romanuke

In the field of technical diagnostics, many tasks are solved by using automated classification. For this, classifiers such as probabilistic neural networks fit best owing to their simplicity. To obtain a probabilistic neural network pattern matrix for technical diagnostics, expert estimations or measurements are commonly involved. The pattern matrix can be deduced straightforwardly by just averaging over those estimations. However, averaging is not always the best way to process expert estimations. The goal is to suggest a method of optimally deducing the pattern matrix for technical diagnostics based on expert estimations. The main optimality criterion is maximization of performance, which includes maximization of operation speed as a subcriterion. First, the maximal width of the pattern matrix is determined; the width does not exceed the number of experts. Then, for every state of an object, the expert estimations are clustered, for example with the k-means method or a similar one. The centroids of these clusters successively form the pattern matrix. The optimal number of clusters determines the optimality of the probabilistic neural network by maximizing its performance. In general, most error rate percentage results for probabilistic neural networks appear to decrease near-exponentially as the number of clustered expert estimations increases. Therefore, if the optimal number of clusters defines a pattern matrix so “wide” that operation is intolerably slow, performance maximization implies a tradeoff between the minimum error rate percentage and the maximally tolerable slowness of the probabilistic neural network. The optimal number of clusters is found at an asymptotically minimal error rate percentage, or at an acceptable error rate percentage that corresponds to the maximally tolerable slowness in operation speed. In practice, optimality refers to the simultaneous acceptability of error rate and operation speed.


2021 ◽  
pp. 1-16
Author(s):  
Aikaterini Karanikola ◽  
Charalampos M. Liapis ◽  
Sotiris Kotsiantis

In short, clustering is the process of partitioning a given set of objects into groups containing highly related instances. This relation is determined by a specific distance metric with which the intra-cluster similarity is estimated. Finding an optimal number of such partitions is usually the key step in the entire process, yet a rather difficult one, not least because the term “optimal” is quite ambiguous. Selecting an unsuitable number of clusters might lead to incorrect conclusions and, consequently, to wrong decisions. Furthermore, various inherent characteristics of the datasets, such as clusters that overlap or clusters containing subclusters, will most often increase the difficulty of the task. Thus, the methods used to detect similarities and the parameter selection of the partition algorithm have a major impact on the quality of the groups and the identification of their optimal number. Given that each dataset constitutes a rather distinct case, validity indices are indicators introduced to address the problem of selecting such an optimal number of clusters. In this work, an extensive set of well-known validity indices, based on the approach of the so-called relative criteria, are examined comparatively. A total of 26 cluster validation measures were investigated in two distinct case studies: one on real-world data and one on artificially generated data. To ensure a certain degree of difficulty, both the real-world and the generated data were selected to exhibit variations and inhomogeneity. Each index is deployed under 9 different clustering methods, which incorporate 5 different distance metrics. All results are presented in various explanatory forms.
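To make the idea of a relative criterion concrete, the sketch below scores several candidate partitions of the same data with one validity index and keeps the best-scoring number of clusters. The silhouette width is used purely as a familiar example of such an index; the abstract does not say which 26 measures were studied, and the function names and the Euclidean distance are assumptions of this illustration.

```python
import math

def silhouette(points, labels):
    """Mean silhouette width; requires at least two clusters. Higher is better."""
    clusters = {}
    for i, l in enumerate(labels):
        clusters.setdefault(l, []).append(i)
    total = 0.0
    for i, l in enumerate(labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # usual convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in other) / len(other)
            for m, other in clusters.items() if m != l
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def best_number_of_clusters(points, candidates):
    """candidates maps k -> labels; return the k whose partition scores highest."""
    return max(candidates, key=lambda k: silhouette(points, candidates[k]))
```

The candidate labelings would come from running any clustering method once per value of k; swapping `silhouette` for another index changes the criterion but not the selection scheme.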


Author(s):  
Ryoichi Kojima ◽  
Roberto Legaspi ◽  
Toshiaki Murofushi ◽  

Although assortativity is a significant network property that paves the way for the emergence of new structural types, surprisingly little research has been done on it. Assortative networks are perhaps among the most prominent examples of complex networks believed to be governed by common phenomena, thereby producing structures far from random. Further, certain vertices possess high centrality and can be regarded as significant and influential vertices that can become cluster centers, connecting with high membership to many of the surrounding vertices. We propose a fuzzy clustering method to meaningfully characterize assortative, as well as disassortative, networks by adapting Bonacich’s power centrality to select vertices with high degree centrality as cluster centers. Moreover, we leverage our novel modularity function to determine the optimal number of clusters, as well as the optimal membership among clusters. Because real-world assortative network datasets that come with ground truths are difficult to find, we evaluated our method on synthetic data generated by the Lancichinetti–Fortunato–Radicchi benchmark, which may resemble real-world network datasets. Our results show that our non-hierarchical method outperforms a known hierarchical fuzzy clustering method and also performs better than a well-known membership-based modularity function. Our method performed more than satisfactorily for both assortative and disassortative networks.


Author(s):  
Ali Kaveh ◽  
Mohammad Reza Seddighian ◽  
Pouya Hassani

In this paper, an automatic data clustering approach is presented using concepts from graph theory. Several Cluster Validity Indices (CVIs) are discussed, and the DB Index is defined as the objective function of the meta-heuristic algorithms. Six finite element meshes are decomposed, covering two- and three-dimensional types and both simple and complex meshes. Six meta-heuristic algorithms are used to determine the optimal number of clusters and to solve the decomposition problem by minimizing this objective. Finally, the corresponding statistical results are compared.
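Assuming that "DB Index" here refers to the Davies-Bouldin index (the abstract does not spell it out), the objective could look roughly like the sketch below. The function names and the Euclidean distance are assumptions of this example, and the meta-heuristic search itself is not shown: such a search would propose candidate labelings and keep the one with the lowest score.

```python
import math

def centroid(rows):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def davies_bouldin(points, labels):
    """Davies-Bouldin index of a partition with at least two clusters. Lower is better."""
    groups = {}
    for p, l in zip(points, labels):
        groups.setdefault(l, []).append(p)
    cents = {l: centroid(g) for l, g in groups.items()}
    # Scatter: mean distance of the members of a cluster to its centroid.
    scatter = {
        l: sum(math.dist(p, cents[l]) for p in g) / len(g)
        for l, g in groups.items()
    }
    keys = list(groups)
    # For each cluster, take its worst (largest) similarity ratio to any other cluster.
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in keys if j != i
        )
        for i in keys
    ]
    return sum(worst) / len(keys)
```

Because the index needs no fixed number of clusters, a search over labelings that minimizes it also determines the number of clusters implicitly.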


Author(s):  
Muhammed-Fatih Kaya ◽  
Mareike Schoop

The systematic processing of unstructured communication data, together with pattern recognition to determine communication groups in negotiations, poses many challenges in Machine Learning. In particular, the so-called curse of dimensionality makes the pattern recognition process demanding and requires further research in the negotiation environment. In this paper, selected well-known clustering approaches are evaluated with regard to their pattern recognition potential on high-dimensional negotiation communication data. A research approach is presented to evaluate the application potential of the selected methods via a holistic framework with three main evaluation milestones: the determination of the optimal number of clusters, the main clustering application, and the performance evaluation. Quantified Term Document Matrices are first pre-processed and then used as the underlying databases to investigate the pattern recognition potential of the clustering techniques, by considering the information regarding the optimal number of clusters and by measuring the respective internal and external performances. The overall research results show that, under this holistic evaluation approach, certain cluster separations are recommended by both internal and external performance measures, whereas three of the cluster separations are eliminated based on the evaluation results.


Energies ◽  
2021 ◽  
Vol 14 (18) ◽  
pp. 5902
Author(s):  
Fachrizal Aksan ◽  
Michał Jasiński ◽  
Tomasz Sikorski ◽  
Dominika Kaczorowska ◽  
Jacek Rezmer ◽  
...  

In this article, a case study is presented on applying cluster analysis techniques to evaluate the level of power quality (PQ) parameters of a virtual power plant. The research compares the K-means algorithm with the agglomerative algorithm on PQ data with different numbers of features. The study uses standardized datasets containing classical PQ parameters from two sub-studies. Moreover, the optimal number of clusters for both algorithms is discussed using the elbow method and a dendrogram. The experimental results show that the dendrogram method requires a long processing time but gives a consistent optimal number of clusters when there are additional parameters. In comparison, the elbow method is easy to compute but gives inconsistent results. According to the Calinski–Harabasz index and the silhouette coefficient, the K-means algorithm performs better than the agglomerative algorithm in clustering the data points when there are no additional features of PQ data. Finally, based on the standard EN 50160, the result of the cluster analysis from both algorithms shows that all PQ parameters for each cluster in the two study objects are still below the limit level and work under normal operating conditions.
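Two of the quantities named in this abstract can be sketched in a few lines: the within-cluster sum of squares that an elbow curve plots against k, and the Calinski-Harabasz index. This is an illustrative implementation under the standard textbook definitions, not the authors' code; the function names are assumptions, and the labels are expected to come from whichever algorithm (K-means or agglomerative) is being evaluated.

```python
def centroid(rows):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def group_by_label(points, labels):
    groups = {}
    for p, l in zip(points, labels):
        groups.setdefault(l, []).append(p)
    return groups

def wcss(points, labels):
    """Within-cluster sum of squares: the quantity plotted in an elbow curve."""
    total = 0.0
    for g in group_by_label(points, labels).values():
        c = centroid(g)
        total += sum(sq_dist(p, c) for p in g)
    return total

def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion (needs 2 <= k < n). Higher is better."""
    groups = group_by_label(points, labels)
    n, k = len(points), len(groups)
    overall = centroid(points)
    between = sum(len(g) * sq_dist(centroid(g), overall) for g in groups.values())
    return (between / (k - 1)) / (wcss(points, labels) / (n - k))
```

For the elbow method one would compute `wcss` for a range of k and look for the point where the decrease flattens, a visual judgement that may explain the inconsistency the abstract reports.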

