Benchmarking Studies Aimed at Clustering and Classification Tasks Using K-Means, Fuzzy C-Means and Evolutionary Neural Networks

2021 ◽  
Vol 3 (3) ◽  
pp. 695-719
Author(s):  
Adam Pickens ◽  
Saptarshi Sengupta

Clustering is a widely used unsupervised learning technique across data mining and machine learning applications, and it finds frequent use in fields as diverse as astronomy, medical imaging, search and optimization, geology, geophysics, and sentiment analysis. It is therefore important to verify the effectiveness of the clustering algorithm in question and to make reasonably strong arguments for accepting the end results generated by the validity indices that measure the compactness and separability of clusters. This work explores the successes and limitations of two popular clustering mechanisms by comparing their performance over publicly available benchmarking data sets that vary in data point distribution as well as in the number of attributes, especially from a computational point of view, incorporating techniques that alleviate some of the issues that plague these algorithms. Sensitivity to initialization conditions and stagnation in local minima are explored. Further, an implementation of a fully connected feedforward neural network trained by particle swarm optimization is introduced; this serves as a guided random search technique for optimizing the network weights. The algorithms used here are studied and compared, and their applications are explored. The study aims to provide a handy reference for practitioners to both learn about and verify benchmarking results on commonly used real-world data sets, from both a supervised and an unsupervised point of view, before applying them in more tailored, complex problems.
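The sensitivity to initialization mentioned in the abstract is easy to demonstrate. A minimal sketch of Lloyd's K-means in plain Python follows; the three-blob data set and the choice of k are invented for illustration, and a production implementation would use a library such as scikit-learn with multiple restarts:

```python
import random

def kmeans(points, k, seed, iters=50):
    """Plain Lloyd's algorithm on 2-D points; returns (centroids, within-cluster SSE)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization: the source of sensitivity
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # recompute centroids; keep the old one if a cluster went empty
        centroids = [(sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
                     if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    sse = sum(min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids)
              for p in points)
    return centroids, sse

# three well-separated blobs; k deliberately set to 2 so the seed can change the local minimum
data = [(x + dx, y + dy)
        for (x, y) in [(0, 0), (10, 0), (5, 9)]
        for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
for seed in (0, 1, 2):
    _, sse = kmeans(data, k=2, seed=seed)
    print(f"seed={seed}  SSE={sse:.1f}")
```

Running the loop over several seeds shows whether different initializations converge to different within-cluster sums of squared errors, which is exactly the stagnation-in-local-minima issue the study benchmarks.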


2012 ◽  
Vol 263-266 ◽  
pp. 2173-2178
Author(s):  
Xin Guang Li ◽  
Min Feng Yao ◽  
Li Rui Jian ◽  
Zhen Jiang Li

A probabilistic neural network (PNN) speech recognition model based on a partition clustering algorithm is proposed in this paper. The most important advantage of the PNN is that training is easy and instantaneous, so it is capable of real-time speech recognition. Moreover, the selection of the training data set is one of the most important factors affecting PNN performance, so this paper proposes using a partition clustering algorithm to select the data. The proposed model is tested on two data sets of spoken Arabic numbers, with promising results. Its performance is compared to a single back-propagation neural network and an integrated back-propagation neural network. The final comparison shows that the proposed model performs better than the other two networks, with an accuracy rate of 92.41%.
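The "easy and instantaneous" training of a PNN comes from the fact that it is essentially a Parzen-window classifier: each training sample contributes one Gaussian kernel, kernels are summed per class, and the largest sum wins. A minimal sketch (the toy feature vectors, labels, and σ are invented for illustration):

```python
import math

def pnn_classify(x, train, sigma=0.5):
    """PNN decision rule: sum one Gaussian kernel per training sample for each
    class, then predict the class with the largest summed activation."""
    scores = {}
    for xi, label in train:
        d2 = sum((a - b) ** 2 for a, b in zip(x, xi))  # squared Euclidean distance
        scores[label] = scores.get(label, 0.0) + math.exp(-d2 / (2 * sigma ** 2))
    return max(scores, key=scores.get)

# hypothetical 2-D feature vectors standing in for speech features
train = [((0.0, 0.0), "zero"), ((0.1, 0.2), "zero"),
         ((3.0, 3.0), "one"),  ((2.8, 3.1), "one")]
print(pnn_classify((0.2, 0.1), train))  # → zero
print(pnn_classify((3.1, 2.9), train))  # → one
```

There is no iterative weight fitting: "training" is just storing the samples, which is why the paper can target real-time recognition, and why pruning the stored set with a clustering algorithm matters.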



Author(s):  
T. G.B. Amaral ◽  
M. M. Crisostomo ◽  
V. Fernao Pires

This chapter describes the application of a general regression neural network (GRNN) to control the flight of a helicopter. The GRNN is an adaptive network that provides estimates of continuous variables; it is a one-pass learning algorithm with a highly parallel structure. Even with sparse data in a multidimensional measurement space, the algorithm provides smooth transitions from one observed value to another. An important reason for using the GRNN as a controller is its fast learning capability and non-iterative training process. The disadvantage of this neural network is the amount of computation required to produce an estimate, which can become large when many training instances are gathered. To overcome this problem, a clustering algorithm is described that produces representative exemplars from groups of training instances that are close to one another, reducing the amount of computation needed to obtain an estimate. Reducing the training data used by the GRNN also makes it possible to separate the obtained representative exemplars, for example, into two data sets for coarse and fine control. Experiments are performed to determine how the performance of the clustering algorithm degrades with less training data. In the flight control system, the training data are likewise reduced to obtain faster controllers while maintaining the desired performance.
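The GRNN estimate described above is a kernel-weighted average of the observed target values, which is why its cost grows with the number of stored training instances. A one-dimensional sketch (the sample pairs and σ are invented for illustration):

```python
import math

def grnn_estimate(x, samples, sigma=1.0):
    """GRNN output: Gaussian-kernel-weighted average of the stored target values.
    Every stored sample contributes, hence cost is linear in the training set size."""
    num = den = 0.0
    for xi, yi in samples:
        w = math.exp(-((x - xi) ** 2) / (2 * sigma ** 2))
        num += w * yi
        den += w
    return num / den

# hypothetical (input, output) observations
samples = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0), (3.0, 9.0)]
print(round(grnn_estimate(1.5, samples, sigma=0.3), 2))  # → 2.5
```

Replacing `samples` with a handful of cluster exemplars, as the chapter proposes, shrinks the loop body and hence the per-estimate computation, at the cost of some smoothing accuracy.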



Author(s):  
Junjie Wu ◽  
Jian Chen ◽  
Hui Xiong

Cluster analysis (Jain & Dubes, 1988) provides insight into data by dividing the objects into groups (clusters), such that objects in a cluster are more similar to each other than to objects in other clusters. Cluster analysis has long played an important role in a wide variety of fields, such as psychology, bioinformatics, pattern recognition, information retrieval, machine learning, and data mining. Many clustering algorithms, such as K-means and the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), are well-established. A recent research focus in cluster analysis is understanding the strengths and weaknesses of various clustering algorithms with respect to data factors. Indeed, researchers have identified some data characteristics that may strongly affect cluster analysis, including high dimensionality, sparseness, large size, noise, types of attributes and data sets, and scales of attributes (Tan, Steinbach, & Kumar, 2005). However, further investigation is needed to reveal whether and how data distributions can affect the performance of clustering algorithms. Along this line, we study clustering algorithms by answering three questions: 1. What are the systematic differences between the distributions of the clusters produced by different clustering algorithms? 2. How does the distribution of the "true" cluster sizes impact the performance of clustering algorithms? 3. How should one choose an appropriate clustering algorithm in practice? The answers to these questions can guide a better understanding and use of clustering methods. This is noteworthy because, 1) in theory, the strong relationship between clustering algorithms and cluster size distributions has seldom been recognized, and 2) in practice, choosing an appropriate clustering algorithm remains a challenging task, especially after the algorithm boom in the data mining area. This chapter is an initial attempt to fill this void.
To this end, we carefully select two widely used categories of clustering algorithms, i.e., K-means and Agglomerative Hierarchical Clustering (AHC), as representative algorithms for illustration. In the chapter, we first show that K-means tends to generate clusters with a relatively uniform distribution of cluster sizes. Then we demonstrate that UPGMA, one of the robust AHC methods, acts in the opposite way: it tends to generate clusters with high variation in cluster sizes. Indeed, the experimental results indicate that the variations of the resultant cluster sizes produced by K-means and UPGMA, measured by the Coefficient of Variation (CV), lie in the intervals [0.3, 1.0] and [1.0, 2.5], respectively. Finally, we compare K-means and UPGMA directly and propose some rules for the better choice of clustering scheme from the data-distribution point of view.
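The Coefficient of Variation used above is simply the standard deviation of the cluster sizes divided by their mean. A quick sketch (the two size vectors are invented to illustrate the reported [0.3, 1.0] and [1.0, 2.5] regimes; population rather than sample standard deviation is assumed here):

```python
def coefficient_of_variation(sizes):
    """CV = (population) standard deviation of cluster sizes / mean cluster size."""
    n = len(sizes)
    mean = sum(sizes) / n
    var = sum((s - mean) ** 2 for s in sizes) / n
    return var ** 0.5 / mean

# a near-uniform partition (K-means-like) vs. a highly skewed one (UPGMA-like)
cv_uniform = coefficient_of_variation([50, 150, 80, 120])
cv_skewed = coefficient_of_variation([380, 10, 5, 10])
print(round(cv_uniform, 2), round(cv_skewed, 2))  # → 0.38 1.59
```

The first partition lands in the K-means-typical interval [0.3, 1.0] and the second in the UPGMA-typical interval [1.0, 2.5], matching the qualitative contrast the chapter draws.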



Author(s):  
Ricardo Téllez ◽  
Cecilio Angulo

The concept of modularity is a main concern for the generation of artificially intelligent systems. Modularity is a ubiquitous organization principle found everywhere in natural and artificial complex systems (Callebaut, 2005). Evidence from biological and philosophical points of view (Caelli and Wen, 1999; Fodor, 1983) indicates that modularity is a requisite for complex intelligent behaviour. Besides, from an engineering point of view, modularity seems to be the only way to construct complex structures. Hence, if complex neural programs for complex agents are desired, modularity is required. This article introduces the concepts of modularity and module from a computational point of view and shows how they apply to the generation of neural programs based on modules. Two levels at which modularity can be implemented, strategic and tactical, are identified. We present how they work and how they can be combined to generate a completely modular controller for a neural-network-based agent.



2011 ◽  
Vol 268-270 ◽  
pp. 166-171
Author(s):  
Xue Song Yin ◽  
Qi Huang ◽  
Liang Ming Li

This paper presents a metric-based semi-supervised fuzzy c-means algorithm called MSFCM. By using side information and unlabeled data together, MSFCM can be applied to both clustering and classification tasks. The resulting algorithm has two advantages over standard semi-supervised clustering: first, membership degrees are used as side information to guide the clustering of the data; second, the learned metric can greatly improve clustering accuracy. Experimental results on a collection of real-world data sets demonstrate the effectiveness of the proposed algorithm.
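The membership degrees central to MSFCM come from the standard fuzzy c-means update, in which each point belongs to every cluster with a weight in [0, 1] that sums to one. A one-dimensional sketch of that update (the points, centers, and fuzzifier m below are invented; MSFCM additionally learns a metric and injects side information, which is not shown here):

```python
def fcm_memberships(points, centers, m=2.0):
    """Standard FCM membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1)).
    A point coinciding with a center gets crisp membership to avoid division by zero."""
    U = []
    for x in points:
        dists = [abs(x - c) for c in centers]
        if 0.0 in dists:
            row = [1.0 if d == 0.0 else 0.0 for d in dists]
        else:
            row = [1.0 / sum((di / dj) ** (2.0 / (m - 1.0)) for dj in dists)
                   for di in dists]
        U.append(row)
    return U

U = fcm_memberships([0.0, 1.0, 5.0], centers=[0.0, 5.0])
print([[round(u, 3) for u in row] for row in U])  # → [[1.0, 0.0], [0.941, 0.059], [0.0, 1.0]]
```

Each row sums to one, and the graded value for the middle point (0.941 vs. 0.059) is exactly the kind of soft label that MSFCM exploits as side information.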



2011 ◽  
Vol 2011 ◽  
pp. 1-14 ◽  
Author(s):  
Chunzhong Li ◽  
Zongben Xu

The structure of a data set is of critical importance in identifying clusters, especially the density-difference feature. In this paper, we present a clustering algorithm based on density consistency: a filtering process that identifies points sharing the same structural feature and assigns them to the same cluster. The method is not restricted by cluster shape or high dimensionality, and it is robust to noise and outliers. Extensive experiments on synthetic and real-world data sets validate the proposed clustering algorithm.



2007 ◽  
Vol 16 (01) ◽  
pp. 111-120 ◽  
Author(s):  
MANISH MANGAL ◽  
MANU PRATAP SINGH

This paper describes the application of two evolutionary algorithms to feedforward neural networks used in classification problems. In addition to the simple backpropagation feedforward algorithm, the paper considers a genetic algorithm (GA) and a random search algorithm. The objective is to analyze the performance of the evolutionary algorithms (EAs) against simple backpropagation in terms of accuracy and speed on this problem. The experiments considered feedforward neural networks trained with the genetic algorithm or the random search algorithm over 39 types of network structures and artificial data sets. In most cases, the evolutionary feedforward neural networks achieved better or equal accuracy compared with the original backpropagation feedforward neural network. We found few differences in the accuracy of the networks trained by the EAs, but ample differences in execution time. The results suggest that the evolutionary feedforward neural network with the random search algorithm might be the best algorithm on the data sets we tested.
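Training a feedforward network without gradients, as the EAs above do, reduces to searching weight space directly. A minimal sketch of a (1+1)-style random search on a tiny 1-2-1 tanh network (the architecture, target function, step size, and budget are all invented for illustration; the paper's GA and random search variants are more elaborate):

```python
import math
import random

def forward(w, x):
    """Tiny 1-2-1 feedforward net; w holds 7 weights: two hidden tanh units
    (weight + bias each) and a linear output (two weights + bias)."""
    h1 = math.tanh(w[0] * x + w[1])
    h2 = math.tanh(w[2] * x + w[3])
    return w[4] * h1 + w[5] * h2 + w[6]

def mse(w, data):
    return sum((forward(w, x) - y) ** 2 for x, y in data) / len(data)

def random_search(data, steps=2000, seed=0):
    """(1+1) evolutionary search: perturb all weights with Gaussian noise and
    keep the mutant only if it lowers the training error."""
    rng = random.Random(seed)
    best = [rng.uniform(-1, 1) for _ in range(7)]
    best_err = mse(best, data)
    for _ in range(steps):
        cand = [wi + rng.gauss(0, 0.2) for wi in best]
        err = mse(cand, data)
        if err < best_err:
            best, best_err = cand, err
    return best, best_err

data = [(x / 10, (x / 10) ** 2) for x in range(-10, 11)]  # fit y = x^2 on [-1, 1]
w, err = random_search(data)
print(f"final MSE: {err:.4f}")
```

No backpropagated gradients appear anywhere, which is what makes such searches easy to apply to arbitrary architectures; the trade-off, as the paper's timing results suggest, is the number of error evaluations consumed.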



2018 ◽  
Vol 20 (6) ◽  
pp. 2044-2054 ◽  
Author(s):  
Adam McDermaid ◽  
Brandon Monier ◽  
Jing Zhao ◽  
Bingqiang Liu ◽  
Qin Ma

Abstract Differential gene expression (DGE) analysis is one of the most common applications of RNA-sequencing (RNA-seq) data. This process allows for the elucidation of differentially expressed genes across two or more conditions and is widely used in many applications of RNA-seq data analysis. Interpretation of DGE results can be nonintuitive and time consuming because of the variety of output formats across tools and the numerous pieces of information provided in the results files. Here we review DGE results analysis from a functional point of view, covering various visualizations. We also provide an R/Bioconductor package, Visualization of Differential Gene Expression Results using R, which generates information-rich visualizations for the interpretation of DGE results from three widely used tools: Cuffdiff, DESeq2 and edgeR. The implemented functions are tested on five real-world data sets, consisting of one human, one Malus domestica and three Vitis riparia data sets.
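The core quantity behind most DGE visualizations (volcano plots, MA plots) is the per-gene log2 fold change between conditions. A language-neutral sketch of that computation in Python (the gene names, expression values, pseudocount, and the |log2FC| > 1 cutoff are illustrative conventions, not the package's actual interface; tools like DESeq2 and edgeR use moderated statistical estimates rather than this raw ratio):

```python
import math

def log2_fold_changes(cond_a, cond_b, pseudocount=1.0):
    """Per-gene log2 fold change between two mean-expression dictionaries.
    The pseudocount guards against division by zero for unexpressed genes."""
    return {g: math.log2((cond_b[g] + pseudocount) / (cond_a[g] + pseudocount))
            for g in cond_a}

# hypothetical mean expression values under two conditions
expr_a = {"geneX": 100.0, "geneY": 50.0, "geneZ": 5.0}
expr_b = {"geneX": 400.0, "geneY": 52.0, "geneZ": 45.0}
lfc = log2_fold_changes(expr_a, expr_b)
up = [g for g, v in lfc.items() if v > 1.0]  # a common |log2FC| > 1 cutoff
print(sorted(up))  # → ['geneX', 'geneZ']
```

Plotting these values against a significance measure is what turns the tools' heterogeneous results files into the information-rich visualizations the package standardizes.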



2014 ◽  
Vol 2014 ◽  
pp. 1-9 ◽  
Author(s):  
Iván Gómez ◽  
Sergio A. Cannas ◽  
Omar Osenda ◽  
José M. Jerez ◽  
Leonardo Franco

In this work we introduce an extension of the generalization complexity measure to continuous input data. The measure, originally defined in Boolean space, quantifies the complexity of data in relation to the prediction accuracy that can be expected when using a supervised classifier such as a neural network or an SVM. We first extend the original measure to continuous functions and then, using an approach based on the set of Walsh functions, consider the case of a finite number of data points (input/output pairs), which is the usual practical case. Using a set of trigonometric functions, we construct a model that relates the complexity of a data set to the size of the hidden layer of a neural network. Finally, we demonstrate the application of the introduced complexity measure, via the generated model, to the problem of estimating an adequate neural network architecture for real-world data sets.



Author(s):  
Daniel Gómez ◽  
Javier Castro ◽  
Inmaculada Gutiérrez García-Pardo ◽  
Rosa Espínola

In this paper we formally define the hierarchical clustering network problem (HCNP) as the problem of finding a good hierarchical partition of a network. This new problem focuses on the dynamic process of the clustering rather than on the final picture of the clustering process. To address it, we introduce a new hierarchical clustering algorithm for networks based on a new shortest-path betweenness measure. To calculate it, the communication between each pair of nodes is weighted by the importance of the nodes that establish this communication. The weights, or importance, associated with each pair of nodes are calculated as the Shapley value of a game called the linear modularity game. This new measure (the node-game shortest-path betweenness measure) is used to obtain a hierarchical partition of the network by eliminating the link with the highest value. To evaluate the performance of our algorithm, we introduce several criteria that allow us to compare different dendrograms of a network from two points of view: modularity and homogeneity. Finally, we propose a faster algorithm based on a simplification of the node-game shortest-path betweenness measure, whose complexity is quadratic on sparse networks. This fast version is computationally competitive with other fast hierarchical algorithms and, in general, provides better results.
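The paper's node-game measure generalizes classical unweighted shortest-path edge betweenness, in which removing the highest-betweenness link splits the network (the Girvan–Newman scheme). A plain-Python sketch of that classical baseline, not the Shapley-weighted variant; the six-node barbell graph is invented for illustration:

```python
from collections import deque
from itertools import combinations

def shortest_paths(adj, s, t):
    """All shortest s-t paths in an unweighted graph: BFS distances, then
    backtracking from t along edges that decrease the distance by one."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    paths = []
    def back(v, path):
        if v == s:
            paths.append(path[::-1])
            return
        for u in adj[v]:
            if dist.get(u, -1) == dist[v] - 1:
                back(u, path + [u])
    if t in dist:
        back(t, [t])
    return paths

def edge_betweenness(adj):
    """Each node pair spreads one unit of credit evenly over its shortest paths."""
    bc = {}
    for s, t in combinations(list(adj), 2):
        paths = shortest_paths(adj, s, t)
        for p in paths:
            for e in zip(p, p[1:]):
                e = tuple(sorted(e))
                bc[e] = bc.get(e, 0.0) + 1.0 / len(paths)
    return bc

# two triangles joined by a single bridge (2, 3): the bridge carries all cross-traffic
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
bc = edge_betweenness(adj)
bridge = max(bc, key=bc.get)
print(bridge)  # → (2, 3)
```

Eliminating that bridge separates the two triangles, which is the first split of the dendrogram; the paper's contribution is to reweight the paths by node importances (Shapley values of the linear modularity game) before this elimination step.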


