Ensemble Hybrid K- Means and DBSCAN Clustering Algorithm – HDKA for Cancer Dataset

Data Mining is the foremost vital space of analysis and is pragmatically utilized in totally different domains, It becomes a highly demanding field because huge amounts of data have been collected in various applications. The database can be clustered in more number of ways depending on the clustering algorithm used, parameter settings and other factors. Multiple clustering algorithms can be combined to get the final partitioning of data which provides better clustering results. In this paper, Ensemble hybrid KMeans and DBSCAN (HDKA) algorithm has been proposed to overcome the drawbacks of DBSCAN and KMeans clustering algorithms. The performance of the proposed algorithm improves the selection of centroid points through the centroid selection strategy.For experimental results we have used two dataset Colon and Leukemia from UCI machine learning repository.

Download Full-text

Improved minimum-minimum roughness algorithm for clustering categorical data

International Journal of ADVANCED AND APPLIED SCIENCES ◽

10.21833/ijaas.2021.10.006 ◽

2021 ◽

Vol 8 (10) ◽

pp. 43-50

Author(s):

Truong et al. ◽

Keyword(s):

Machine Learning ◽

Data Mining ◽

Hierarchical Clustering ◽

Categorical Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Experimental Results ◽

Data Sets ◽

Top Down ◽

Hierarchical Clustering Algorithm

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers are interested in the problem of clustering categorical data and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR) which is a top-down hierarchical clustering algorithm and can handle the uncertainty in clustering categorical data. However, MMR tends to choose the category with less value leaf node with more objects, leading to undesirable clustering results. To overcome such shortcomings, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on actual data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.

Download Full-text

Semi-Supervised Clustering Algorithm Based on Small Size of Labeled Data

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.121-126.4675 ◽

2011 ◽

Vol 121-126 ◽

pp. 4675-4679

Author(s):

Ming Wei Leng ◽

Xiao Yun Chen ◽

Jian Jun Cheng ◽

Long Jie Li

Keyword(s):

Data Mining ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Nearest Neighbors ◽

Experimental Results ◽

K Nearest Neighbors ◽

Supervised Clustering ◽

The Core ◽

Knn Classification ◽

Core Problem

In many data mining domains, labeled data is very expensive to generate, how to make the best use of labeled data to guide the process of unlabeled clustering is the core problem of semi-supervised clustering. Most of semi-supervised clustering algorithms require a certain amount of labeled data and need set the values of some parameters, different values maybe have different results. In view of this, a new algorithm, called semi-supervised clustering algorithm based on small size of labeled data, is presented, which can use the small size of labeled data to expand labeled dataset by labeling their k-nearest neighbors and only one parameter. We demonstrate our clustering algorithm with three UCI datasets, compared with SSDBSCAN[4] and KNN, the experimental results confirm that accuracy of our clustering algorithm is close to that of KNN classification algorithm.

Download Full-text

SVM Classifier on K-means Clustering Algorithm with Normalization in Data Mining for Prediction

International Journal on Recent and Innovation Trends in Computing and Communication ◽

10.17762/ijritcc.v7i6.5318 ◽

2019 ◽

Vol 7 (6) ◽

pp. 29-34

Author(s):

Vasu Deep ◽

Himanshu Sharma

Keyword(s):

Data Mining ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Research Work ◽

Initial Point ◽

Computational Time ◽

Svm Classifier ◽

Point Selection ◽

Cancer Dataset ◽

Number Of Iterations

This work is belonging to K-means clustering algorithms classifier is used with this algorithm to classified data and Min Max normalization technique also used is to enhance the results of this work over simply K- Means algorithm. K-means algorithm is a clustering algorithm and basically used for discovering the cluster within a dataset. Here cancer dataset is used for this research work and dataset is classified in two categories – Cancer and Non-Cancer, after execution of the implemented algorithm with SVM and Normalization technique. The initial point selection effects on the results of the algorithm, both in the number of clusters found and their centroids. In this work enhance the k-means clustering algorithm methods are discussed. This technique helps to improve efficiency, accuracy, performance and computational time. Some enhanced variations improve the efficiency and accuracy of algorithm. The main of all methods is to decrees the number of iterations which will less computational time. K-means algorithm in clustering is most popular technique which is widely used technique in data mining. Various enhancements done on K-mean are collected, so by using these enhancements one can build a new proposed algorithm which will be more efficient, accurate and less time consuming than the previous work. More focus of this studies is to decrease the number of iterations which is less time consuming and second one is to gain more accuracy using normalization technique overall belonging to improve time and accuracy than previous studies.

Download Full-text

A Customized Non-Exclusive Clustering Algorithm for News Recommendation Systems

Journal of University of Babylon for Pure and Applied Sciences ◽

10.29196/jubpas.v27i1.2192 ◽

2019 ◽

Vol 27 (1) ◽

pp. 368-379 ◽

Cited By ~ 1

Author(s):

Asghar Darvishy ◽

Hamidah Ibrahim ◽

Fatimah Sidi ◽

Aida Mustapha

Keyword(s):

Machine Learning ◽

Data Mining ◽

Clustering Algorithm ◽

Recommendation Systems ◽

Experimental Results ◽

News Item ◽

Real Dataset ◽

News Websites ◽

News Recommendation

Clustering is one of the main tasks in machine learning and data mining and is being utilized in many applications including news recommendation systems. In this paper, we propose a new non-exclusive clustering algorithm named Ordered Clustering (OC) with the aim is to increase the accuracy of news recommendation for online users. The basis of OC is a new initialization technique that groups news items into clusters based on the highest similarities between news items to accommodate news nature in which a news item can belong to different categories. Hence, in OC, multiple memberships in clusters are allowed. An experiment is carried out using a real dataset which is collected from the news websites. The experimental results demonstrated that the OC outperforms the k-means algorithm with respect to Precision, Recall, and F1-Score.

Download Full-text

Dimensionality Reduction of Production Data Using PCA and DBSCAN Techniques

Intelligent Systems and Computer Technology - Advances in Parallel Computing ◽

10.3233/apc200184 ◽

2020 ◽

Author(s):

Umadevi S ◽

NirmalaSugirthaRajini

Keyword(s):

Data Mining ◽

Data Analysis ◽

Dimensionality Reduction ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Production Data ◽

Related Data ◽

Dbscan Clustering ◽

Analysis Process ◽

Research Article

Now a day’s data mining concepts are applied in various fields like medical, agriculture, production, etc. Creation of cluster is one of the major problems in data analysis process. Various clustering algorithms are used for data analysis purpose which depends upon the applications. DBSCAN is the famous method to create cluster. This article describes DBSCAN clustering concept applied on production database. The main objective of this research article is to collect and group the related data from large amount of data and remove the unwanted data. This clustering algorithm removes the unwanted attributes and groups the related data based upon density value.

Download Full-text

K-MEANS CLUSTERING ALGORITHM FOR SERVICE DATA ANALYSIS BASED ON CUSTOMERS COMBINATION

Unes journal of Information System ◽

10.31933/ujis.3.1.001-007.2018 ◽

2018 ◽

Vol 3 (1) ◽

pp. 001

Author(s):

Zulhendra Zulhendra ◽

Gunadi Widi Nurcahyo ◽

Julius Santony

Keyword(s):

Data Mining ◽

Data Analysis ◽

Clustering Algorithm ◽

Customer Complaints ◽

Using Data ◽

Clustering Data ◽

Service Data ◽

Selection Of

In this study using Data Mining, namely K-Means Clustering. Data Mining can be used in searching for a large enough data analysis that aims to enable Indocomputer to know and classify service data based on customer complaints using Weka Software. In this study using the algorithm K-Means Clustering to predict or classify complaints about hardware damage on Payakumbuh Indocomputer. And can find out the data of Laptop brands most do service on Indocomputer Payakumbuh as one of the recommendations to consumers for the selection of Laptops.

Download Full-text

A UMDA-Based Discretization Method for Continuous Attributes

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.403-408.1834 ◽

2011 ◽

Vol 403-408 ◽

pp. 1834-1838

Author(s):

Jing Zhao ◽

Chong Zhao Han ◽

Bin Wei ◽

De Qiang Han

Keyword(s):

Machine Learning ◽

Data Mining ◽

Evolutionary Algorithms ◽

Marginal Distribution ◽

Convergence Speed ◽

Fast Convergence ◽

Experimental Results ◽

Discretization Method ◽

Bottom Up ◽

Global Dynamic

Discretization of continuous attributes have played an important role in machine learning and data mining. They can not only improve the performance of the classifier, but also reduce the space of the storage. Univariate Marginal Distribution Algorithm is a modified Evolutionary Algorithms, which has some advantages over classical Evolutionary Algorithms such as the fast convergence speed and few parameters need to be tuned. In this paper, we proposed a bottom-up, global, dynamic, and supervised discretization method on the basis of Univariate Marginal Distribution Algorithm.The experimental results showed that the proposed method could effectively improve the accuracy of classifier.

Download Full-text

An improved OPTICS clustering algorithm for discovering clusters with uneven densities

Intelligent Data Analysis ◽

10.3233/ida-205497 ◽

2021 ◽

Vol 25 (6) ◽

pp. 1453-1471

Author(s):

Chunhua Tang ◽

Han Wang ◽

Zhiwen Wang ◽

Xiangkun Zeng ◽

Huaran Yan ◽

...

Keyword(s):

Time Complexity ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

Clustering Algorithms ◽

Substantial Improvement ◽

Experimental Results ◽

High Time ◽

Parameter Setting ◽

K Nearest Neighbor ◽

Density Based Clustering

Most density-based clustering algorithms have the problems of difficult parameter setting, high time complexity, poor noise recognition, and weak clustering for datasets with uneven density. To solve these problems, this paper proposes FOP-OPTICS algorithm (Finding of the Ordering Peaks Based on OPTICS), which is a substantial improvement of OPTICS (Ordering Points To Identify the Clustering Structure). The proposed algorithm finds the demarcation point (DP) from the Augmented Cluster-Ordering generated by OPTICS and uses the reachability-distance of DP as the radius of neighborhood eps of its corresponding cluster. It overcomes the weakness of most algorithms in clustering datasets with uneven densities. By computing the distance of the k-nearest neighbor of each point, it reduces the time complexity of OPTICS; by calculating density-mutation points within the clusters, it can efficiently recognize noise. The experimental results show that FOP-OPTICS has the lowest time complexity, and outperforms other algorithms in parameter setting and noise recognition.

Download Full-text

Canonical PSO Based K-Means Clustering Approach for Real Datasets

International Scholarly Research Notices ◽

10.1155/2014/414013 ◽

2014 ◽

Vol 2014 ◽

pp. 1-11 ◽

Cited By ~ 1

Author(s):

Lopamudra Dey ◽

Sanjay Chakraborty

Keyword(s):

Data Mining ◽

Air Pollution ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Real Life ◽

Cluster Validity ◽

Validity Assessment ◽

Different Types ◽

Clustering Approach ◽

Validity Measure

“Clustering” the significance and application of this technique is spread over various fields. Clustering is an unsupervised process in data mining, that is why the proper evaluation of the results and measuring the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measure. Different types of indexes are used to solve different types of problems and indices selection depends on the kind of available data. This paper first proposes Canonical PSO based K-means clustering algorithm and also analyses some important clustering indices (intercluster, intracluster) and then evaluates the effects of those indices on real-time air pollution database, wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and Hierarchical clustering algorithms. This paper also describes the nature of the clusters and finally compares the performances of these clustering algorithms according to the validity assessment. It also defines which algorithm will be more desirable among all these algorithms to make proper compact clusters on this particular real life datasets. It actually deals with the behaviour of these clustering algorithms with respect to validation indexes and represents their results of evaluation in terms of mathematical and graphical forms.

Download Full-text

A Data Distribution View of Clustering Algorithms

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch059 ◽

2011 ◽

pp. 374-381 ◽

Cited By ~ 1

Author(s):

Junjie Wu ◽

Jian Chen ◽

Hui Xiong

Keyword(s):

Data Mining ◽

Cluster Analysis ◽

Clustering Analysis ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Data Distribution ◽

Point Of View ◽

Group Method ◽

Data Sets ◽

Distribution Point

Cluster analysis (Jain & Dubes, 1988) provides insight into the data by dividing the objects into groups (clusters), such that objects in a cluster are more similar to each other than objects in other clusters. Cluster analysis has long played an important role in a wide variety of fields, such as psychology, bioinformatics, pattern recognition, information retrieval, machine learning, and data mining. Many clustering algorithms, such as K-means and Unweighted Pair Group Method with Arithmetic Mean (UPGMA), have been wellestablished. A recent research focus on clustering analysis is to understand the strength and weakness of various clustering algorithms with respect to data factors. Indeed, people have identified some data characteristics that may strongly affect clustering analysis including high dimensionality and sparseness, the large size, noise, types of attributes and data sets, and scales of attributes (Tan, Steinbach, & Kumar, 2005). However, further investigation is expected to reveal whether and how the data distributions can have the impact on the performance of clustering algorithms. Along this line, we study clustering algorithms by answering three questions: 1. What are the systematic differences between the distributions of the resultant clusters by different clustering algorithms? 2. How can the distribution of the “true” cluster sizes make impact on the performances of clustering algorithms? 3. How to choose an appropriate clustering algorithm in practice? The answers to these questions can guide us for the better understanding and the use of clustering methods. This is noteworthy, since 1) in theory, people seldom realized that there are strong relationships between the clustering algorithms and the cluster size distributions, and 2) in practice, how to choose an appropriate clustering algorithm is still a challenging task, especially after an algorithm boom in data mining area. This chapter thus tries to fill this void initially. To this end, we carefully select two widely used categories of clustering algorithms, i.e., K-means and Agglomerative Hierarchical Clustering (AHC), as the representative algorithms for illustration. In the chapter, we first show that K-means tends to generate the clusters with a relatively uniform distribution on the cluster sizes. Then we demonstrate that UPGMA, one of the robust AHC methods, acts in an opposite way to K-means; that is, UPGMA tends to generate the clusters with high variation on the cluster sizes. Indeed, the experimental results indicate that the variations of the resultant cluster sizes by K-means and UPGMA, measured by the Coefficient of Variation (CV), are in the specific intervals, say [0.3, 1.0] and [1.0, 2.5] respectively. Finally, we put together K-means and UPGMA for a further comparison, and propose some rules for the better choice of the clustering schemes from the data distribution point of view.

Download Full-text