Dunn’s index for cluster tendency assessment of pharmacological data sets

2012 ◽  
Vol 90 (4) ◽  
pp. 425-433 ◽  
Author(s):  
Oscar Miguel Rivera-Borroto ◽  
Mónica Rabassa-Gutiérrez ◽  
Ricardo del Corazón Grau-Ábalo ◽  
Yovani Marrero-Ponce ◽  
José Manuel García-de la Vega

Cluster tendency assessment is an important stage in cluster analysis. A group of promising techniques for this task, named visual assessment of tendency (VAT), has emerged in the literature. The presence of clusters can be detected easily by directly observing a dark block structure along the main diagonal of the intensity image. Alternatively, a Dunn’s index greater than 1 for a single-linkage partition is a good indication of this blocklike structure. In this report, Dunn’s index is applied as a novel measure of tendency to 8 pharmacological data sets, represented by machine-learning-selected molecular descriptors. In all cases, the observed values are less than 1, indicating a weak tendency for the data to form compact clusters. Other results suggest an increasing relationship between Dunn’s index as a measure of cluster separability and the classification accuracy of the various cluster algorithms tested on the same data sets.
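As a rough illustration of the quantity discussed in this abstract, the sketch below computes Dunn's index in its standard form: the smallest distance between points in different clusters divided by the largest cluster diameter. The function name `dunn_index`, the Euclidean distance, and the toy points are assumptions made for illustration; the abstract does not give the authors' implementation, and the cluster labels are assumed to come from a partition (for example a single-linkage one) computed elsewhere.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Dunn's index: smallest between-cluster distance / largest cluster diameter."""
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("Dunn's index needs at least two clusters")

    # Largest pairwise distance inside any single cluster (cluster diameter).
    max_diameter = max(
        (math.dist(a, b)
         for members in clusters.values()
         for a, b in combinations(members, 2)),
        default=0.0,
    )
    # Smallest distance between two points that belong to different clusters.
    min_separation = min(
        math.dist(a, b)
        for c1, c2 in combinations(clusters.values(), 2)
        for a in c1
        for b in c2
    )
    if max_diameter == 0.0:
        return math.inf  # every cluster is a single point
    return min_separation / max_diameter


# Hypothetical toy data: two tight, distant groups versus two stretched, adjacent ones.
well_separated = dunn_index([(0, 0), (0, 1), (5, 0), (5, 1)], [0, 0, 1, 1])
poorly_separated = dunn_index([(0, 0), (3, 0), (4, 0), (7, 0)], [0, 0, 1, 1])
print(well_separated, poorly_separated)
```

In this toy case the first partition gives a value above 1 and the second a value below 1, which matches the reading in the abstract that values below 1 point to a weak tendency to form compact clusters.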

2011 ◽  
Vol 51 (12) ◽  
pp. 3036-3049 ◽  
Author(s):  
Oscar Miguel Rivera-Borroto ◽  
Yovani Marrero-Ponce ◽  
José Manuel García-de la Vega ◽  
Ricardo del Corazón Grau-Ábalo

2021 ◽  
Author(s):  
Xingang Jia ◽  
Qiuhong Han ◽  
Zuhong Lu

Abstract Background: Phages are the most abundant biological entities, but commonly used clustering techniques have difficulty separating them from other virus families and classifying the different phage families. Results: This work uses GI-clusters to separate phages from other virus families and to classify the different phage families. GI-clusters are constructed from GI-features, which are in turn constructed jointly from F-features, training data, and the MG-Euclidean and Icc-cluster algorithms. F-features are the frequencies of multiple-nucleotides generated from the genomes of viruses; the MG-Euclidean algorithm puts nearest neighbors in the same mini-groups, and the Icc-cluster algorithm puts distant samples in different mini-clusters. Viruses whose GI-features have their maximum element in the same location are put in the same GI-cluster. The families of viruses in the test data are identified by GI-clusters, and the families of GI-clusters are defined by the viruses of the training data. Conclusions: From the analysis of 4 data sets constructed from viruses of different families, we demonstrate that GI-clusters are able to separate phages from other virus families, correctly classify the different phage families, and correctly predict the families of unknown phages.


Author(s):  
Orhan Bölükbaş ◽  
Harun Uğuz

Artificial immune systems inspired by the natural immune system are used in problems such as classification, optimization, anomaly detection, and error detection. In these problems, the clonal selection algorithm, the artificial immune network algorithm, and the negative selection algorithm are generally used. This chapter aims to solve the problem of correct identification and classification of patients using the negative selection (NS) and variable detector negative selection (V-DET NS) algorithms. The authors examine the performance of the NSA and V-DET NSA using three medical data sets from Parkinson, carotid artery doppler, and epilepsy patients. According to the obtained results, NSA achieved 92.45%, 91.46%, and 92.21% detection accuracy and 92.46%, 93.40%, and 90.57% classification accuracy. V-DET NSA achieved 94.34%, 94.52%, and 91.51% classification accuracy and 94.23%, 94.40%, and 89.29% detection accuracy. As can be seen from these values, V-DET NSA yielded a better result. The artificial immune system emerges as an effective and promising system in terms of problem-solving performance.


Author(s):  
Amit Saxena ◽  
John Wang

This paper presents a two-phase scheme that selects a reduced number of features from a dataset using a Genetic Algorithm (GA) and then tests the classification accuracy (CA) of the dataset with the reduced feature set. In the first phase of the proposed work, an unsupervised approach is applied to select a subset of features. The GA stochastically selects a reduced number of features, with Sammon Error as the fitness function, and different subsets of features are obtained. In the second phase, each of the reduced feature sets is used to test the CA of the dataset. The CA of a data set is validated using the supervised k-nearest neighbor (k-nn) algorithm. The novelty of the proposed scheme is that each reduced feature set obtained in the first phase is investigated for CA using k-nn classification with different Minkowski metrics, i.e. non-Euclidean norms, instead of the conventional Euclidean norm (L2). Final results are presented in the paper with extensive simulations on seven real data sets and one synthetic data set. The investigation reveals that taking different norms produces better CA and hence offers scope for better feature subset selection.
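The second phase described above, k-nn classification under different Minkowski norms, can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names, the toy data, the choice of k = 3, and the norm orders p = 1, 2, 3 are all assumptions, and the GA feature-selection phase is not shown.

```python
from collections import Counter


def minkowski_distance(a, b, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_predict(train_points, train_labels, query, k, p):
    """Label a query point by majority vote among its k nearest training points."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: minkowski_distance(pair[0], query, p),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def classification_accuracy(train_points, train_labels, test_points, test_labels, k, p):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(
        knn_predict(train_points, train_labels, query, k, p) == truth
        for query, truth in zip(test_points, test_labels)
    )
    return hits / len(test_points)


# Hypothetical toy data standing in for one reduced feature set.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["a", "a", "a", "b", "b", "b"]
test_points = [(0.5, 0.5), (5.5, 5.5)]
test_labels = ["a", "b"]

# Compare the classification accuracy under several norm orders.
for p in (1, 2, 3):
    ca = classification_accuracy(train_points, train_labels, test_points, test_labels, k=3, p=p)
    print(f"p={p}: CA={ca:.2f}")
```

On real data the accuracy would generally differ between norm orders, which is the comparison the abstract describes; the toy data here is too cleanly separated to show that effect.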


Symmetry ◽  
2020 ◽  
Vol 12 (3) ◽  
pp. 434 ◽  
Author(s):  
Huilin Ge ◽  
Zhiyu Zhu ◽  
Kang Lou ◽  
Wei Wei ◽  
Runbang Liu ◽  
...  

Infrared image recognition technology can work day and night and has a long detection distance. However, infrared objects carry less prior information, and external factors in the real-world environment easily interfere with them. Therefore, infrared object classification is a very challenging research area. Manifold learning can be used to improve the classification accuracy of infrared images in the manifold space. In this article, we propose a novel manifold learning algorithm for infrared object detection and classification. First, a manifold space is constructed with each pixel of the infrared object image as a dimension. Infrared images are represented as data points in this constructed manifold space. Next, we simulate the probability distribution information of infrared data points with the Gaussian distribution in the manifold space. Then, based on the Gaussian distribution information in the manifold space, the distribution characteristics of the data points of the infrared image in the low-dimensional space are derived. The proposed algorithm uses the Kullback-Leibler (KL) divergence to minimize the loss function between two symmetrical distributions, and finally completes the classification in the low-dimensional manifold space. The efficiency of the algorithm is validated on two public infrared image data sets. The experiments show that the proposed method has a 97.46% classification accuracy and competitive speed on the analyzed data sets.
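For readers unfamiliar with the KL divergence mentioned above, the following minimal sketch computes it for two discrete probability distributions. It illustrates only the general definition KL(P || Q) = Σ p_i log(p_i / q_i); the function name and example distributions are assumptions, and the abstract does not specify the exact loss function the authors minimize.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # the term 0 * log(0 / q_i) is taken as 0
        if q_i == 0:
            return math.inf  # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical distributions: the divergence is zero only when they coincide.
same = kl_divergence([0.5, 0.5], [0.5, 0.5])
forward = kl_divergence([0.5, 0.5], [0.25, 0.75])
reverse = kl_divergence([0.25, 0.75], [0.5, 0.5])
print(same, forward, reverse)
```

The forward and reverse values differ, which shows that the KL divergence is not symmetric in its two arguments.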


2018 ◽  
Vol 8 (12) ◽  
pp. 2574 ◽  
Author(s):  
Qinghua Mao ◽  
Hongwei Ma ◽  
Xuhui Zhang ◽  
Guangming Zhang

The Skewness Decision Tree Support Vector Machine (SDTSVM) algorithm is widely known as a supervised learning model for multi-class classification problems. However, the classification accuracy of the SDTSVM algorithm depends on the selection of its parameters and on the classification order. Therefore, an improved SDTSVM (ISDTSVM) algorithm is proposed in order to improve the classification accuracy of steel cord conveyor belt defects. In the proposed model, the classification order is determined by the sum of the Euclidean distances between multi-class sample centers, and the parameters are optimized by the inertia weight Particle Swarm Optimization (PSO) algorithm. In order to verify the effectiveness of the ISDTSVM algorithm with different feature spaces, experiments were conducted on multiple UCI (University of California Irvine) data sets and on steel cord conveyor belt defects, using the proposed ISDTSVM algorithm and the conventional SDTSVM algorithm respectively. The average classification accuracies of five-fold cross-validation were obtained for two kinds of kernel functions. For the Vowel, Zoo, and Wine data sets of the UCI data sets, as well as the steel cord conveyor belt defects, the ISDTSVM algorithm improved the classification accuracy by 3%, 3%, 1% and 4% respectively, compared to the SDTSVM algorithm. The classification accuracy of the radial basis function kernel was higher than that of the polynomial kernel. The results indicated that the proposed ISDTSVM algorithm improved the classification accuracy significantly, compared to the conventional SDTSVM algorithm.


Author(s):  
Bruno Almeida Pimentel ◽  
Renata M. C. R. De Souza

Outliers may have many anomalous causes, for example, credit card fraud, cyberintrusion or breakdown of a system. Several research areas and application domains have investigated this problem. The popular fuzzy c-means algorithm is sensitive to noise and outlying data. In contrast, the possibilistic partitioning methods are used to solve these problems and other ones. The goal of this paper is to introduce cluster algorithms for partitioning a set of symbolic interval-type data using the possibilistic approach. In addition, a new way of measuring the membership value, according to each feature, is proposed. Experiments with artificial and real symbolic interval-type data sets are used to evaluate the methods. The results of the proposed methods are better than the traditional soft clustering ones.


1984 ◽  
Vol 35 (3) ◽  
pp. 423 ◽  
Author(s):  
BJ McGuirk ◽  
KD Atkins

Fleece rot was observed on hogget ewes and rams in unselected medium-wool Peppin Merino flocks over a 15-year period at Trangie Agricultural Research Centre, N.S.W. Each sheep was given a score from 0 (no fleece rot) to 5 (very severe fleece rot) following a subjective visual assessment along the backline prior to shearing. Data were analysed as two measures of susceptibility to fleece rot: a score as defined above; and incidence, where fleece rot was treated as an all-or-none trait. The average incidence of fleece rot over 23 flock x year combinations was 36%, but this varied widely, from 8 to 90%. Other environmental factors (sex, birth type and age of dam) generally had small and non-significant effects on fleece rot. After adjusting for significant environmental effects, the half-sib heritability estimates for sires with at least three progeny were 0.36 (±0.07) for score and 0.23 (±0.07) for incidence. Separate analyses were conducted for flock x year data sets of low, intermediate and high incidences. The highest heritability estimate for incidence was, as expected theoretically, obtained in the data set of intermediate incidence. Offspring-dam heritability estimates (± s.e.) for fleece rot score and incidence were respectively 0.21 (±0.05) and 0.14 (±0.04). Corresponding offspring-sire estimates for score and incidence were 0.20 (±0.06) and 0.17 (±0.05). It is concluded that a realistic estimate of the heritability of the underlying liability to fleece rot is of the order of 40%.


2009 ◽  
Vol 75 (18) ◽  
pp. 5863-5870 ◽  
Author(s):  
L. Zinger ◽  
E. Coissac ◽  
P. Choler ◽  
R. A. Geremia

ABSTRACT Understanding how microbial community structure and diversity respond to environmental conditions is one of the main challenges in environmental microbiology. However, there is often confusion between determining the phylogenetic structure of microbial communities and assessing the distribution and diversity of molecular operational taxonomic units (MOTUs) in these communities. This has led to the use of sequence analysis tools such as multiple alignments and hierarchical clustering that are not adapted to the analysis of large and diverse data sets and not always justified for characterization of MOTUs. Here, we developed an approach combining a pairwise alignment algorithm and graph partitioning by using MCL (Markov clustering) in order to generate discrete groups for nuclear large-subunit rRNA gene and internal transcribed spacer 1 sequence data sets obtained from a yearly monitoring study of two spatially close but ecologically contrasting alpine soils (namely, early and late snowmelt locations). We compared MCL with a classical single-linkage method (Ccomps) and showed that MCL reduced bias such as the chaining effect. Using MCL, we characterized fungal communities in early and late snowmelt locations. We found contrasting distributions of MOTUs in the two soils, suggesting that there is a high level of habitat filtering in the assembly of alpine soil fungal communities. However, few MOTUs were specific to one location.

