Develop a dynamic DBSCAN algorithm for solving initial parameter selection problem of the DBSCAN algorithm

The amount of data has been increasing exponentially in every sector such as banking securities, healthcare, education, manufacturing, consumer-trade, transportation, and energy. Most of these data are noise, different in shapes, and outliers. In such cases, it is challenging to find the desired data clusters using conventional clustering algorithms. DBSCAN is a popular clustering algorithm which is widely used for noisy, arbitrary shape, and outlier data. However, its performance highly depends on the proper selection of cluster radius (Eps) and the minimum number of points (MinPts) that are required for forming clusters for the given dataset. In the case of real-world clustering problems, it is a difficult task to select the exact value of Eps and (MinPts) to perform the clustering on unknown datasets. To address these, this paper proposes a dynamic DBSCAN algorithm that calculates the suitable value for (Eps) and (MinPts) dynamically by which the clustering quality of the given problem will be increased. This paper evaluates the performance of the dynamic DBSCAN algorithm over seven challenging datasets. The experimental results confirm the effectiveness of the dynamic DBSCAN algorithm over the well-known clustering algorithms.

Download Full-text

MR-BIRCH: A scalable MapReduce-based birch clustering algorithm

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-202079 ◽

2020 ◽

pp. 1-11

Author(s):

Yufeng Li ◽

HaiTian Jiang ◽

Jiyong Lu ◽

Xiaozhong Li ◽

Zhiwei Sun ◽

...

Keyword(s):

Big Data ◽

Real World ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Statistical Information ◽

Main Memory ◽

Acceptable Result ◽

Clustering Quality ◽

Synthetic Datasets

Many classical clustering algorithms have been fitted into MapReduce, which provides a novel solution for clustering big data. However, several iterations are required to reach an acceptable result in most of the algorithms. For each iteration, a new MapReduce job must be executed to load the dataset into main memory, which results in high I/O overhead and poor efficiency. BIRCH algorithm stores only the statistical information of objects with CF entries and CF tree to cluster big data, but with the increase of the tree nodes, the main memory will be insufficient to contain more objects. Hence, BIRCH has to reduce the tree, which will degrade the clustering quality and decelerate the whole execution efficiency. To deal with the problem, BIRCH was fitted into MapReduce called MR-BIRCH in this paper. In contrast to a great number of MapReduce-based algorithms, MR-BIRCH loads dataset only once, and the dataset is processed parallel in several machines. The complexity and scalability were analyzed to evaluate the quality of MR-BIRCH, and MR-BIRCH was compared with Python sklearn BIRCH and Apache Mahout k-means on real-world and synthetic datasets. Experimental results show, most of the time, MR-BIRCH was better or equal to sklearn BIRCH, and it was competitive to Mahout k-means.

Download Full-text

An improved ant algorithm with LDA-based representation for text document clustering

Journal of Information Science ◽

10.1177/0165551516638784 ◽

2016 ◽

Vol 43 (2) ◽

pp. 275-292 ◽

Cited By ~ 24

Author(s):

Aytug Onan ◽

Hasan Bulut ◽

Serdar Korukoglu

Keyword(s):

Clustering Algorithm ◽

Latent Dirichlet Allocation ◽

Clustering Algorithms ◽

Document Clustering ◽

Clustering Methods ◽

Initial Value ◽

Text Document ◽

Clustering Quality ◽

Text Features

Document clustering can be applied in document organisation and browsing, document summarisation and classification. The identification of an appropriate representation for textual documents is extremely important for the performance of clustering or classification algorithms. Textual documents suffer from the high dimensionality and irrelevancy of text features. Besides, conventional clustering algorithms suffer from several shortcomings, such as slow convergence and sensitivity to the initial value. To tackle the problems of conventional clustering algorithms, metaheuristic algorithms are frequently applied to clustering. In this paper, an improved ant clustering algorithm is presented, where two novel heuristic methods are proposed to enhance the clustering quality of ant-based clustering. In addition, the latent Dirichlet allocation (LDA) is used to represent textual documents in a compact and efficient way. The clustering quality of the proposed ant clustering algorithm is compared to the conventional clustering algorithms using 25 text benchmarks in terms of F-measure values. The experimental results indicate that the proposed clustering scheme outperforms the compared conventional and metaheuristic clustering methods for textual documents.

Download Full-text

A SELF-ORGANIZING MAP FOR MIXED CONTINUOUS AND CATEGORICAL DATA

International Journal of Computing ◽

10.47839/ijc.10.1.733 ◽

2011 ◽

pp. 24-32 ◽

Cited By ~ 1

Author(s):

Nicoleta Rogovschi ◽

Mustapha Lebbah ◽

Younès Bennani

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Mixed Data ◽

Categorical Variables ◽

Data Sets ◽

Self Organizing Map ◽

Data Set ◽

Public Data ◽

Self Organizing

Most traditional clustering algorithms are limited to handle data sets that contain either continuous or categorical variables. However data sets with mixed types of variables are commonly used in data mining field. In this paper we introduce a weighted self-organizing map for clustering, analysis and visualization mixed data (continuous/binary). The learning of weights and prototypes is done in a simultaneous manner assuring an optimized data clustering. More variables has a high weight, more the clustering algorithm will take into account the informations transmitted by these variables. The learning of these topological maps is combined with a weighting process of different variables by computing weights which influence the quality of clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, Zoo data set and other three mixed data sets. The results show a good quality of the topological ordering and homogenous clustering.

Download Full-text

Big Data Mining Based on Computational Intelligence and Fuzzy Clustering

Handbook of Research on Trends and Future Directions in Big Data and Web Intelligence - Advances in Data Mining and Database Management ◽

10.4018/978-1-4666-8505-5.ch007 ◽

2015 ◽

pp. 130-148 ◽

Cited By ~ 2

Author(s):

Usman Akhtar ◽

Mehdi Hassan

Keyword(s):

Big Data ◽

Computational Intelligence ◽

Clustering Algorithms ◽

Heterogeneous Data ◽

Computational Time ◽

Statistical Features ◽

Feature Sets ◽

Clustering Quality ◽

The Given ◽

Different Sources

The availability of a huge amount of heterogeneous data from different sources to the Internet has been termed as the problem of Big Data. Clustering is widely used as a knowledge discovery tool that separate the data into manageable parts. There is a need of clustering algorithms that scale on big databases. In this chapter we have explored various schemes that have been used to tackle the big databases. Statistical features have been extracted and most important and relevant features have been extracted from the given dataset. Reduce and irrelevant features have been eliminated and most important features have been selected by genetic algorithms (GA).Clustering with reduced feature sets requires lower computational time and resources. Experiments have been performed at standard datasets and results indicate that the proposed scheme based clustering offers high clustering accuracy. To check the clustering quality various quality measures have been computed and it has been observed that the proposed methodology results improved significantly. It has been observed that the proposed technique offers high quality clustering.

Download Full-text

Big Data Mining Based on Computational Intelligence and Fuzzy Clustering

Web Services ◽

10.4018/978-1-5225-7501-6.ch024 ◽

2019 ◽

pp. 413-430

Author(s):

Usman Akhtar ◽

Mehdi Hassan

Keyword(s):

Big Data ◽

Computational Intelligence ◽

Clustering Algorithms ◽

Heterogeneous Data ◽

Computational Time ◽

Statistical Features ◽

Feature Sets ◽

Clustering Quality ◽

The Given ◽

Different Sources

The availability of a huge amount of heterogeneous data from different sources to the Internet has been termed as the problem of Big Data. Clustering is widely used as a knowledge discovery tool that separate the data into manageable parts. There is a need of clustering algorithms that scale on big databases. In this chapter we have explored various schemes that have been used to tackle the big databases. Statistical features have been extracted and most important and relevant features have been extracted from the given dataset. Reduce and irrelevant features have been eliminated and most important features have been selected by genetic algorithms (GA). Clustering with reduced feature sets requires lower computational time and resources. Experiments have been performed at standard datasets and results indicate that the proposed scheme based clustering offers high clustering accuracy. To check the clustering quality various quality measures have been computed and it has been observed that the proposed methodology results improved significantly. It has been observed that the proposed technique offers high quality clustering.

Download Full-text

Enriching Documents by Linking Salient Entities and Lexical-Semantic Expansion

Journal of Intelligent Systems ◽

10.1515/jisys-2018-0098 ◽

2018 ◽

Vol 29 (1) ◽

pp. 1109-1121

Author(s):

Mohsen Pourvali ◽

Salvatore Orlando

Keyword(s):

Clustering Algorithms ◽

Ensemble Clustering ◽

British Broadcasting Corporation ◽

Text Documents ◽

Classical Text ◽

Text Corpora ◽

Clustering Quality ◽

Semantic Expansion ◽

Document Representations

Abstract This paper explores a multi-strategy technique that aims at enriching text documents for improving clustering quality. We use a combination of entity linking and document summarization in order to determine the identity of the most salient entities mentioned in texts. To effectively enrich documents without introducing noise, we limit ourselves to the text fragments mentioning the salient entities, in turn, belonging to a knowledge base like Wikipedia, while the actual enrichment of text fragments is carried out using WordNet. To feed clustering algorithms, we investigate different document representations obtained using several combinations of document enrichment and feature extraction. This allows us to exploit ensemble clustering, by combining multiple clustering results obtained using different document representations. Our experiments indicate that our novel enriching strategies, combined with ensemble clustering, can improve the quality of classical text clustering when applied to text corpora like The British Broadcasting Corporation (BBC) NEWS.

Download Full-text

Clustering Algorithms for Direct Current Track Coded Signals

2019 Joint Rail Conference ◽

10.1115/jrc2019-1300 ◽

2019 ◽

Author(s):

Song Qin ◽

Nenad Mijatovic ◽

Jeffrey Fries ◽

James Kiss

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

High Availability ◽

Digital Analysis ◽

Signaling Systems ◽

Track Circuits ◽

Service Conditions ◽

Track Circuit ◽

Fail Safe

Designed for detecting train presence on tracks, track circuits must maintain a level of high availability for railway signaling systems. Due to the fail-safe nature of these critical devices, any failures will result in a declaration of occupancy in a section of track which restricts train movements. It is possible to automatically diagnose and, in some cases, predict the failures of track circuits by performing analytics on the track signals. In order to perform these analytics, we need to study the coded signals transmitted to and received from the track. However, these signals consist of heterogeneous pulses that are noisy for data analysis. Thus, we need techniques which will automatically group homogeneous pulses into similar groups. In this paper, we present data cleansing techniques which will cluster pulses based on digital analysis and machine learning. We report the results of our evaluation of clustering algorithms that improve the quality of analytic data. The data were captured under revenue service conditions operated by Alstom. For clustering algorithm, we used the k-means algorithm to cluster heterogeneous pulses. By tailoring the parameters for this algorithm, we can control the pulses of the cluster, allowing for further analysis of the track circuit signals in order to gain insight regarding its performance.

Download Full-text

Fine-Tuning an Algorithm for Semantic Document Clustering Using a Similarity Graph

International Journal of Semantic Computing ◽

10.1142/s1793351x16400195 ◽

2016 ◽

Vol 10 (04) ◽

pp. 527-555

Author(s):

Lubomir Stanchev

Keyword(s):

English Language ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Document Clustering ◽

Fine Tuning ◽

Human Judgment ◽

Multiple Parameters ◽

Similarity Graph ◽

Multiple Metrics

In this article, we examine an algorithm for document clustering using a similarity graph. The graph stores words and common phrases from the English language as nodes and it can be used to compute the degree of semantic similarity between any two phrases. One application of the similarity graph is semantic document clustering, that is, grouping documents based on the meaning of the words in them. Since our algorithm for semantic document clustering relies on multiple parameters, we examine how fine-tuning these values affects the quality of the result. Specifically, we use the Reuters-21578 benchmark, which contains [Formula: see text] newswire stories that are grouped in 82 categories using human judgment. We apply the k-means clustering algorithm to group the documents using a similarity metric that is based on keywords matching and one that uses the similarity graph. We evaluate the results of the clustering algorithms using multiple metrics, such as precision, recall, f-score, entropy, and purity.

Download Full-text

Improved Clustering Algorithm for Design Structure Matrix

Volume 3: 38th Design Automation Conference, Parts A and B ◽

10.1115/detc2012-70076 ◽

2012 ◽

Cited By ~ 16

Author(s):

Fredrik Borjesson ◽

Katja Hölttä-Otto

Keyword(s):

Clustering Algorithm ◽

Hill Climbing ◽

Design Structure Matrix ◽

Computationally Efficient ◽

Vacuum Cleaner ◽

Local Optima ◽

Structure Matrix ◽

Design Structure ◽

Clustering Problems

For clustering a large Design Structure Matrix (DSM), computerized algorithms are necessary. A common algorithm by Thebeau uses stochastic hill-climbing to avoid local optima. The output of the algorithm is stochastic, and to be certain a very good clustering solution has been obtained, it may be necessary to run the algorithm thousands of times. To make this feasible in practice, the algorithm must be computationally efficient. Two algorithmic improvements are presented. Together they improve the quality of the results obtained and increase speed significantly for normal clustering problems. The proposed new algorithm is applied to a cordless handheld vacuum cleaner.

Download Full-text

Research on Parallel DBSCAN Algorithm Design Based on MapReduce

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.301-303.1133 ◽

2011 ◽

Vol 301-303 ◽

pp. 1133-1138 ◽

Cited By ~ 17

Author(s):

Yan Xiang Fu ◽

Wei Zhong Zhao ◽

Hui Fang Ma

Keyword(s):

Data Clustering ◽

Large Scale ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Algorithm Design ◽

Document Retrieval ◽

Commodity Hardware ◽

Dbscan Clustering ◽

Dbscan Algorithm ◽

Parallel Clustering

Data clustering has been received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The enlarging volumes of information emerging by the progress of technology, makes clustering of very large scale of data a challenging task. In order to deal with the problem, more researchers try to design efficient parallel clustering algorithms. In this paper, we propose a parallel DBSCAN clustering algorithm based on Hadoop, which is a simple yet powerful parallel programming platform. The experimental results demonstrate that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.

Download Full-text