A review of clustering algorithms: Comparison of DBSCAN and K-mean with oversampling and t-SNE

Abstract: The two most widely used and easily implemented algorithms for clustering and classification-based analysis of data in the unsupervised learning domain are Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and K-means cluster analysis. These two techniques handle most cases effectively when the data are highly random and offer no clear set of values to use as a parameter, as linear or logistic regression algorithms require. However, few papers pit these two against each other in a controlled environment to observe which one performs better and under what conditions. In this paper, a renal adenocarcinoma dataset is analyzed, and both DBSCAN and K-means are then applied to the dataset and the results examined. The efficacy of the two techniques in this study is compared, and the merits and demerits observed are enumerated. Further, the interaction of t-SNE with the generated clusters is explored.


COMPARISON OF CLUSTER ANALYSIS ALGORITHMS IN OBJECT RECOGNITION

Collection of scientific works of the State University of Infrastructure and Technologies series Transport Systems and Technologies ◽

10.32703/2617-9040-2020-36-12 ◽

2020 ◽

pp. 112-120

Author(s):

M. Botvin ◽

A. Gertsiy

Keyword(s):

Image Processing ◽

Cluster Analysis ◽

Spatial Clustering ◽

Clustering Algorithms ◽

Mean Shift ◽

Comparative Modeling ◽

Data Sets ◽

Scale Parameters ◽

Mean Shift Clustering ◽

Synthetic Datasets

The article gives an overview of graphic image processing based on clustering algorithms. It analyzes the prospects for applying cluster analysis algorithms in digital image processing, in particular for segmentation and compression of graphic images and for image recognition in the transport sector. Comparative modeling of the cluster analysis algorithms K-means, Mean-Shift (mean shift clustering) and DBSCAN (density-based spatial clustering of applications with noise) is carried out on various types of data. The simulation was performed on synthetic datasets in a Jupyter Notebook environment using the Scikit-learn library. In particular, four data sets were generated in this environment, and the clustering algorithms were applied to them. The simulation results showed that the K-means algorithm can effectively describe relatively simple shapes. In contrast, mean shift requires no assumptions about the number of clusters or the shape of the distribution, but its performance depends on the choice of scale parameters. The DBSCAN algorithm can successfully detect more complex shapes, which highlights one of its strengths: clustering data of arbitrary shape. The disadvantages of the selected algorithms are also given, along with the types of images on which they work effectively and an estimate of their computational speed.
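The contrast between K-means and DBSCAN on non-convex shapes can be illustrated with a minimal sketch. It is not the authors' notebook: the article used Scikit-learn, whereas the code below reimplements both algorithms in plain Python so that it runs without dependencies. The two-ring dataset, the values `eps=0.6` and `min_samples=3`, the starting centroids, and all function names are assumptions chosen for illustration.

```python
import math

def make_rings(n=40, radii=(1.0, 3.0)):
    """Two noise-free concentric rings (toy data); returns points and ring ids."""
    pts, ring = [], []
    for rid, r in enumerate(radii):
        for i in range(n):
            a = 2 * math.pi * i / n
            pts.append((r * math.cos(a), r * math.sin(a)))
            ring.append(rid)
    return pts, ring

def dbscan(pts, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise.

    A point is a core point when at least `min_samples` points, itself
    included, lie within distance `eps` of it.
    """
    nbrs = [[j for j, q in enumerate(pts) if math.dist(p, q) <= eps] for p in pts]
    labels = [None] * len(pts)
    cluster = -1
    for i in range(len(pts)):
        if labels[i] is not None:
            continue
        if len(nbrs[i]) < min_samples:
            labels[i] = -1              # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(nbrs[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster     # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(nbrs[j]) >= min_samples:
                frontier.extend(nbrs[j])  # only core points expand the cluster
    return labels

def kmeans(pts, centroids, iters=50):
    """Lloyd's algorithm from the given starting centroids; returns labels."""
    centroids = list(centroids)
    for _ in range(iters):
        labels = [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
                  for p in pts]
        new = []
        for c in range(len(centroids)):
            members = [p for p, l in zip(pts, labels) if l == c]
            new.append(tuple(sum(v) / len(members) for v in zip(*members))
                       if members else centroids[c])
        if new == centroids:
            break
        centroids = new
    return labels

pts, ring = make_rings()
db_labels = dbscan(pts, eps=0.6, min_samples=3)
km_labels = kmeans(pts, [pts[0], pts[40]])
for name, labels in (("DBSCAN", db_labels), ("K-means", km_labels)):
    for c in sorted(set(labels)):
        rings_in_c = sorted({r for l, r in zip(labels, ring) if l == c})
        print(f"{name} cluster {c} contains ring(s) {rings_in_c}")
```

On this toy data DBSCAN assigns each ring to its own cluster, because points are linked through chains of dense neighbours. K-means with two centroids separates points by a straight line, so at least one of its clusters always mixes points from both rings.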


Current State-of-the-Art of Clustering Methods for Gene Expression Data with RNA-Seq

10.5772/intechopen.94069 ◽

2020 ◽

Author(s):

Ismail Jamail ◽

Ahmed Moussa

Keyword(s):

Gene Expression ◽

Cluster Analysis ◽

Data Analysis ◽

Gene Expression Data ◽

Expression Profile ◽

Clustering Algorithms ◽

Expression Data ◽

Rna Seq ◽

Clustering Methods ◽

Clustering And Classification

Recent developments in high-throughput cDNA sequencing (RNA-seq) have revolutionized gene expression profiling. This analysis aims to compare the expression levels of multiple genes between two or more samples, under specific circumstances or in a specific cell, to give a global picture of cellular function. Thanks to these advances, gene expression data are being generated at high throughput. One of the primary data analysis tasks in gene expression studies involves data-mining techniques such as clustering and classification. Clustering, an unsupervised learning technique, has been widely used as a computational tool to facilitate our understanding of gene functions and the regulation involved in a biological process. Cluster analysis aims to group the large number of genes present in a sample of gene expression profile data, such that similar or related genes fall in the same cluster and different or unrelated genes fall in distinct ones. Classification, on the other hand, can be used to group samples based on their expression profiles. Many clustering and classification algorithms can be applied in gene expression experiments; the most widely used are hierarchical clustering, k-means clustering, and model-based clustering, which relies on a model to determine the number of clusters. The clustering method must be chosen to fit the data structure. In this chapter, we present the state of the art of clustering algorithms and statistical approaches for grouping similar gene expression profiles that can be applied to RNA-seq data analysis, along with software tools dedicated to these methods. In addition, we discuss challenges in cluster analysis and compare the performance of eight commonly used clustering methods on four different public datasets from recount2.


Classification of Observations through Combination of the Dimension Reduction and the Cluster Analysis

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse.v7i8.13 ◽

2017 ◽

Vol 7 (8) ◽

pp. 30

Author(s):

Hyeuk Kim

Keyword(s):

Machine Learning ◽

Principal Component Analysis ◽

Cluster Analysis ◽

Unsupervised Learning ◽

Principal Component ◽

Component Analysis ◽

Baseball Players ◽

Partitioning Around Medoids ◽

Different Characteristics

Unsupervised learning in machine learning divides data into several groups. Observations in the same group have similar characteristics, and observations in different groups have different characteristics. In this paper, we classify data by partitioning around medoids, which has some advantages over k-means clustering. We apply it to baseball players in the Korea Baseball League. We also apply principal component analysis to the data and draw a graph using two components as axes. Through this procedure we interpret the meaning of the clustering graphically. The combination of partitioning around medoids and principal component analysis can be applied to any other data, and the approach makes it easy to identify the characteristics of each group.
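A minimal sketch of partitioning around medoids follows, to show how medoids differ from k-means centroids. The paper's baseball data and settings are not reproduced here: the toy points, the naive choice of the first `k` points as starting medoids, and the function names `total_cost` and `pam` are assumptions, and only a swap phase is shown, without the usual build phase that selects good starting medoids.

```python
import math
from itertools import product

def total_cost(pts, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, pts[m]) for m in medoids) for p in pts)

def pam(pts, k):
    """Swap-based k-medoids: returns (medoid indices, label per point)."""
    medoids = list(range(k))            # assumption: start from the first k points
    cost = total_cost(pts, medoids)
    while True:
        best_cost, best_medoids = cost, None
        # Try replacing each medoid with each non-medoid point.
        for pos, cand in product(range(k), range(len(pts))):
            if cand in medoids:
                continue
            trial = medoids[:pos] + [cand] + medoids[pos + 1:]
            c = total_cost(pts, trial)
            if c < best_cost:
                best_cost, best_medoids = c, trial
        if best_medoids is None:        # no swap lowers the cost: stop
            break
        cost, medoids = best_cost, best_medoids
    labels = [min(range(k), key=lambda c: math.dist(p, pts[medoids[c]]))
              for p in pts]
    return medoids, labels

# Toy data: two tight groups, the second with one mild outlier at (14, 14).
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11), (14, 14)]
medoids, labels = pam(pts, 2)
print("medoids:", [pts[m] for m in medoids])
print("labels:", labels)
```

Each medoid is one of the observations themselves, so a cluster can always be summarised by a real data point, here an actual row of the toy list. A k-means centroid is a coordinate-wise mean and generally does not coincide with any observation.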


DRSA: a non-hierarchical clustering algorithm using k-NN graph and its application in vegetation classification

Vegetation of Russia ◽

10.31111/vegrus/2015.27.125 ◽

2015 ◽

pp. 125-138 ◽

Cited By ~ 2

Author(s):

I. V. Goncharenko

Keyword(s):

Cluster Analysis ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

Clustering Algorithms ◽

Protein Structures ◽

Hierarchical Cluster ◽

Vegetation Classification ◽

K Nearest Neighbor ◽

Neighbor Graph ◽

Nearest Neighbor Graph

In this article we propose a new method of non-hierarchical cluster analysis using a k-nearest-neighbor graph and discuss it with respect to vegetation classification. The method of k-nearest neighbor (k-NN) classification was originally developed in 1951 (Fix, Hodges, 1951). Later, the term "k-NN graph" and a few k-NN clustering algorithms appeared (Cover, Hart, 1967; Brito et al., 1997). In biology, k-NN is used in the analysis of protein structures and genome sequences. Most k-NN clustering algorithms first build an "excessive" graph, a so-called hypergraph, and then truncate it into subgraphs by partitioning and coarsening the hypergraph. We developed a different strategy, "upward" clustering, which forms (assembles sequentially) one cluster after another. Until now, graph-based cluster analysis has not been considered for the classification of vegetation datasets.
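For readers unfamiliar with the underlying data structure, a k-NN graph can be sketched as follows. This is not the DRSA algorithm itself, whose steps the abstract does not detail: it only builds the graph, and the toy points, the Euclidean distance, and the function names `knn_graph` and `mutual_edges` are assumptions made for illustration.

```python
import math

def knn_graph(pts, k):
    """Map each point index to the indices of its k nearest neighbours."""
    graph = {}
    for i, p in enumerate(pts):
        others = sorted((j for j in range(len(pts)) if j != i),
                        key=lambda j: math.dist(p, pts[j]))
        graph[i] = others[:k]
    return graph

def mutual_edges(graph):
    """Undirected edges (i, j), i < j, where each point lists the other."""
    return sorted((i, j) for i in graph for j in graph[i]
                  if i < j and i in graph[j])

# Toy data: two small groups of three points each.
pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
graph = knn_graph(pts, k=2)
print(graph)
print(mutual_edges(graph))
```

With `k=2` on this toy data no neighbour link crosses between the two groups, which is the property that graph-based clustering methods exploit when they split or assemble clusters.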


An Enhanced Spectral Clustering Algorithm with S-Distance

Symmetry ◽

10.3390/sym13040596 ◽

2021 ◽

Vol 13 (4) ◽

pp. 596

Author(s):

Krishna Kumar Sharma ◽

Ayan Seal ◽

Enrique Herrera-Viedma ◽

Ondrej Krejcar

Keyword(s):

Spectral Clustering ◽

Clustering Algorithm ◽

Spatial Clustering ◽

Clustering Algorithms ◽

Rank Test ◽

Customer Churn ◽

Signed Rank ◽

Signed Rank Test ◽

Spectral Clustering Algorithm ◽

Industrial Databases

Calculating and monitoring customer churn metrics is important for companies that want to retain customers and increase profit. In this study, a churn prediction framework is developed using a modified spectral clustering (SC) algorithm. The similarity measure plays a central role in clustering when predicting churn accurately from industrial data, so the linear Euclidean distance of traditional SC is replaced by the non-linear S-distance (Sd). The Sd is derived from the concept of S-divergence (SD). Several characteristics of Sd are discussed in this work. Experiments are conducted to validate the proposed clustering algorithm on four synthetic, eight UCI, and two industrial databases and one telecommunications database related to customer churn. Three existing clustering algorithms (k-means, density-based spatial clustering of applications with noise, and conventional SC) are also run on these 15 databases. The empirical results show that the proposed clustering algorithm outperforms the three existing algorithms in terms of Jaccard index, F-score, recall, precision and accuracy. Finally, the significance of the clustering results is tested with Wilcoxon's signed-rank test, Wilcoxon's rank-sum test, and the sign test. The comparative study shows that the results of the proposed algorithm are noteworthy, especially for clusters of arbitrary shape.


Coping with a Cluttered Marketplace: Athlete Choice of Products to Support Training*

Journal of Sport Management ◽

10.1123/jsm.27.1.59 ◽

2013 ◽

Vol 27 (1) ◽

pp. 59-72 ◽

Cited By ~ 2

Author(s):

Brianna L. Newland ◽

Laurence Chalip ◽

John L. Ivy

Keyword(s):

Cluster Analysis ◽

Logistic Regression ◽

Conjoint Analysis ◽

Multinomial Logistic Regression ◽

Scientific Evidence ◽

Optimal Choice ◽

Market Segment ◽

Market Segments ◽

Fat Taste ◽

Postexercise Recovery

To determine whether athletes are confused about supplementation, this study examines the relative levels of adult runners’ and triathletes’ preferences for postexercise recovery drink attributes (price, fat, taste, scientific evidence, and endorsement by a celebrity athlete), and the ways those preferences segment. It then examines the effect of athlete characteristics on segment and drink choice. Only a plurality of athletes (40.6%) chose a carbohydrate-protein postexercise recovery drink (the optimal choice), despite the fact that they valued scientific evidence highly. Athletes disliked or were indifferent to endorsement by a celebrity athlete, moderately disliked fat, and slightly preferred better tasting products. Cluster analysis of part-worths from conjoint analysis identified six market segments, showing that athletes anchored on one or two product attributes when choosing among alternatives. Multinomial logistic regression revealed that media influence, hours trained, market segment, gender, and the athlete’s sport significantly predicted drink choice, and that segment partially mediated the effect of sport on drink choice. Findings demonstrate confusion among athletes when there are competing products that each claim to support their training.


Unsupervised Learning: Using Clustering Algorithms to Detect Peer to Peer Botnet Flows

Advances in Intelligent Systems and Computing - Security with Intelligent Computing and Big-Data Services 2019 ◽

10.1007/978-3-030-46828-6_26 ◽

2020 ◽

pp. 299-311

Author(s):

Andrea E. Medina Paredes ◽

Hung-Min Sun

Keyword(s):

Unsupervised Learning ◽

Clustering Algorithms ◽

Peer To Peer


Study of a Privacy Preserving Logistic Regression Algorithm (PPLRA) For Data Privacy in the Context of Big Data

Journal of Physics Conference Series ◽

10.1088/1742-6596/2083/3/032059 ◽

2021 ◽

Vol 2083 (3) ◽

pp. 032059

Author(s):

Qiang Chen ◽

Meiling Deng

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Privacy Protection ◽

Data Privacy ◽

Absolute Error ◽

Average Absolute Error ◽

Regression Algorithms ◽

Hadoop Platform ◽

Logistic Regression Algorithm ◽

Computing Speed

Regression algorithms are commonly used in machine learning. Based on encryption and privacy protection methods, the regression algorithm, a current key technology, and the same encryption technology are studied. This paper proposes a PPLRA-based algorithm. The correlation between data items is obtained with the logistic regression formula. The algorithm is distributed and parallelized on the Hadoop platform to improve the computing speed of the cluster while ensuring the average absolute error of the algorithm.


The impact of the number of visits and the level of satisfaction on the intention to recommend a tourist destination. The example of Gdańsk

Journal of Geography Politics and Society ◽

10.26881/jpgs.2021.1.05 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Tomasz Wiskulski

Keyword(s):

Cluster Analysis ◽

Logistic Regression ◽

Tourist Destination ◽

Entire Sample ◽

Level Of Satisfaction ◽

Number Of Visits ◽

The Impact

The article focuses on examining the intention to recommend Gdańsk as a tourist destination to family and friends. The study was based on the results of a survey (Bęben et al., 2018) conducted among 2,508 respondents visiting Gdańsk in 2017. The method of cluster analysis was applied, thanks to which it was possible to divide the respondents into three clusters. Then, logistic regression was used to analyze the variables influencing the intention to recommend a destination. The study shows that for the entire sample the level of satisfaction from a visit to Gdańsk remains the factor supporting the decision to recommend a destination. Importantly, the total number of visits to Gdańsk is negatively correlated with the intention to recommend the destination, which proves only partial loyalty.


Predicting Lung Cancer Survivability using SVM and Logistic Regression Algorithms

International Journal of Computer Applications ◽

10.5120/ijca2017915325 ◽

2017 ◽

Vol 174 (2) ◽

pp. 19-24 ◽

Cited By ~ 5

Author(s):

Animesh Hazra ◽

Nanigopal Bera ◽

Avijit Mandal

Keyword(s):

Lung Cancer ◽

Logistic Regression ◽

Regression Algorithms
