Implementation of Clustering Algorithms for Real Time Large Datasets

Nowadays clustering plays a vital role in big data, where it is very difficult to analyze and cluster large volumes of data. Clustering is a procedure for grouping similar data objects of a data set, such that similarity is high within a cluster (intra-cluster) and low between clusters (inter-cluster). Clustering is used in statistical analysis, geographical maps, biological cell analysis, and Google Maps. The main approaches to clustering are grid-based clustering, density-based clustering, hierarchical methods, and partitioning approaches. This survey paper focuses on all of these algorithms for large data sets such as big data and reports a comparison among them, using time complexity as the main metric to differentiate the algorithms.
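To make the intra-/inter-cluster criterion concrete, here is a minimal Python sketch (illustrative only, not from any surveyed algorithm; the synthetic data and the choice of k=3 are assumptions) that partitions data with k-means and reports within-cluster cohesion and between-cluster separation:

```python
# Minimal sketch (illustrative, not from the surveyed papers): partition
# synthetic data with k-means, then measure cohesion and separation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

def mean_pairwise_dist(points):
    # Average Euclidean distance over all ordered pairs in one cluster.
    diffs = points[:, None, :] - points[None, :, :]
    d = np.sqrt((diffs ** 2).sum(-1))
    n = len(points)
    return d.sum() / (n * (n - 1)) if n > 1 else 0.0

for k in range(3):
    print("intra", k, mean_pairwise_dist(X[labels == k]))  # small = cohesive

centers = np.array([X[labels == k].mean(axis=0) for k in range(3)])
print("inter 0-1", np.linalg.norm(centers[0] - centers[1]))  # large = separated
```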

Author(s):  
Shapol M. Mohammed ◽  
Karwan Jacksi ◽  
Subhi R. M. Zeebaree

Semantic similarity is the process of identifying semantically relevant data. The traditional way of identifying document similarity uses synonymous keywords and syntax; semantic similarity, in contrast, finds similar data using the meaning of words and their semantics. Clustering groups objects that have the same features and properties into a cluster and separates them from objects that have different features and properties. In semantic document clustering, documents are clustered using semantic similarity techniques together with similarity measurements. One of the common techniques for clustering documents is the family of density-based clustering algorithms, which use the density of data points as the main strategy for measuring the similarity between them. In this paper, a state-of-the-art survey is presented that analyzes density-based algorithms for clustering documents. Furthermore, the similarity and evaluation measures used with the selected algorithms are investigated to identify the most common ones. The review reveals that the most used density-based algorithms in document clustering are DBSCAN and DPC, and that the most effective similarity measurement used with them is cosine similarity, with the F-measure for performance and accuracy evaluation.
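As a concrete illustration of that combination, here is a minimal Python sketch (not taken from any of the surveyed systems; the toy corpus and the eps value are assumptions) that clusters TF-IDF document vectors with DBSCAN under cosine distance:

```python
# Minimal sketch (illustrative; toy corpus, assumed eps): density-based
# document clustering with DBSCAN over cosine distance on TF-IDF vectors.
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "clustering groups similar documents together",
    "similar documents are grouped by clustering",
    "hash tables accelerate nearest neighbor search",
]
tfidf = TfidfVectorizer().fit_transform(docs)

# eps is a cosine-distance threshold (1 - cosine similarity); 0.8 is an
# assumption for this toy corpus, not a recommended setting.
labels = DBSCAN(eps=0.8, min_samples=2, metric="cosine").fit_predict(tfidf)
print(labels)  # -1 marks noise; equal labels mark one density cluster
```

Evaluating such a clustering with the F-measure additionally requires ground-truth classes and a matching between clusters and classes.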


2020 ◽  
Vol 10 (7) ◽  
pp. 2539 ◽  
Author(s):  
Toan Nguyen Mau ◽  
Yasushi Inoguchi

It is challenging to build a real-time information retrieval system, especially for systems with high-dimensional big data. To structure big data, many hashing algorithms have been proposed that map similar data items to the same bucket to speed up search. Locality-Sensitive Hashing (LSH) is a common approach for reducing the number of dimensions of a data set by using a family of hash functions and a hash table. The LSH hash table is an additional component that supports the indexing of hash values (keys) for the corresponding data items. We previously proposed the Dynamic Locality-Sensitive Hashing (DLSH) algorithm, with a dynamically structured hash table optimized for storage in main memory and General-Purpose computation on Graphics Processing Units (GPGPU) memory. This supports the handling of constantly updated data sets, such as song, image, or text databases. The DLSH algorithm works effectively with data sets that are updated at high frequency and is compatible with parallel processing. However, a single GPGPU device is inadequate for processing big data, due to its small memory capacity. When searching with multiple GPGPU devices, an effective search algorithm is needed to balance the jobs. In this paper, we propose an extension of DLSH to big data sets using multiple GPGPUs, in order to increase the capacity and performance of the information retrieval system. Different search strategies on multiple DLSH clusters are also proposed to adapt our parallelized system. With significant results in terms of performance and accuracy, we show that DLSH can be applied to real-life dynamic database systems.
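For readers unfamiliar with the underlying idea, the following minimal sketch (illustrative only; it is plain single-table random-hyperplane LSH, not the DLSH or multi-GPGPU implementation, and all sizes are assumptions) shows how a family of hash functions and a hash table narrow a search to one bucket:

```python
# Minimal single-table LSH sketch (illustrative; not the DLSH implementation).
# Random-hyperplane hashing maps similar vectors to the same bucket key.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_planes = 64, 16
planes = rng.normal(size=(n_planes, dim))  # one hash function per hyperplane

def hash_key(v):
    # Sign pattern of projections onto the random hyperplanes -> bucket key.
    return tuple((planes @ v > 0).astype(int))

table = defaultdict(list)
data = rng.normal(size=(1000, dim))
for i, v in enumerate(data):
    table[hash_key(v)].append(i)

query = data[0] + 0.01 * rng.normal(size=dim)  # near-duplicate of item 0
candidates = table[hash_key(query)]            # search only this bucket
print(0 in candidates, len(candidates))        # far fewer than 1000 items
```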


2013 ◽  
Vol 3 (4) ◽  
pp. 1-14 ◽  
Author(s):  
S. Sampath ◽  
B. Ramya

Cluster analysis is a branch of data mining which plays a vital role in bringing out hidden information in databases. Clustering algorithms help medical researchers identify the presence of natural subgroups in a data set. Different types of clustering algorithms are available in the literature, the most popular among them being k-means clustering. Even though k-means clustering is widely used, its application requires knowledge of the number of clusters present in the given data set. Several solutions are available in the literature to overcome this limitation. The k-means clustering method creates a disjoint and exhaustive partition of the data set; however, in some situations one can come across objects that belong to more than one cluster. In this paper, a clustering algorithm is proposed that is capable of producing rough clusters automatically, without requiring the user to give the number of clusters as input. The efficiency of the algorithm in detecting the number of clusters present in a data set has been studied with the help of some real-life data sets. Further, a nonparametric statistical analysis of the results of the experimental study has been carried out in order to analyze the efficiency of the proposed algorithm in automatically detecting the number of clusters, with the help of a rough version of the Davies-Bouldin index.
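The classical Davies-Bouldin index behind that evaluation can be sketched as follows (an illustration under stated assumptions: scikit-learn provides only the standard index, not the paper's rough version, and the synthetic data and k range are arbitrary):

```python
# Illustrative sketch: picking the number of clusters with the classical
# Davies-Bouldin index (lower is better); the paper's rough-set variant is
# not available in scikit-learn, so the standard index stands in here.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)

best_k = min(scores, key=scores.get)  # k with the lowest index
print(best_k, scores[best_k])
```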


2016 ◽  
Vol 58 (4) ◽  
Author(s):  
Marwan Hassani ◽  
Thomas Seidl

Traditional clustering algorithms considered only static data. In this article, novel methods for efficient subspace clustering of high-dimensional big data streams are presented. Approaches that efficiently combine the anytime clustering concept with the stream subspace clustering paradigm are discussed. Additionally, efficient and adaptive density-based clustering algorithms are presented for high-dimensional data streams. A novel open-source assessment framework and evaluation measures for subspace stream clustering are also presented.


2019 ◽  
Vol 04 (01) ◽  
pp. 1850017 ◽  
Author(s):  
Weiru Chen ◽  
Jared Oliverio ◽  
Jin Ho Kim ◽  
Jiayue Shen

Big Data is a popular cutting-edge technology nowadays, and its techniques and algorithms are expanding into different areas including engineering, biomedicine, and business. Due to the high volume and complexity of Big Data, it is necessary to apply data pre-processing methods before data mining. These methods include data cleaning, data integration, data reduction, and data transformation. Data clustering is the most important step of data reduction: with data clustering, mining on the reduced data set is more efficient yet still produces quality analytical results. This paper presents the different data clustering methods and related algorithms for data mining with Big Data. Data clustering can increase the efficiency and accuracy of data mining.
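A minimal sketch of clustering as a data-reduction step (illustrative; the data set, the choice of k, and the centroid-plus-weight representation are assumptions, not the paper's method) might look like this:

```python
# Minimal data-reduction sketch: replace the full data set with k-means
# centroids weighted by cluster size, then mine the much smaller set.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10000, centers=5, random_state=42)
km = KMeans(n_clusters=50, n_init=10, random_state=42).fit(X)

centroids = km.cluster_centers_                  # 50 representative points
weights = np.bincount(km.labels_, minlength=50)  # original points per centroid
print(X.shape, "->", centroids.shape, weights.sum())  # (10000, 2) -> (50, 2)
```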


It is essential to maintain a suitable methodology for data fragmentation in order to make good use of resources; thus, an accurate and efficient fragmentation methodology must be chosen to improve the authority of a distributed database system. This raises challenges in data reliability, stable storage space and costs, communication costs, and security. In a distributed database framework, query computation and data privacy play a vital role over partitioned distributed databases such as vertical, horizontal, and hybrid models. Since the privacy of any information is regarded as an essential issue nowadays, we show an approach by which privacy preservation can be applied between two parties that distribute their data horizontally or vertically. In this chapter, we present an approach in which hierarchical clustering is applied over a horizontally partitioned data set. We also explain the required algorithms, such as hierarchical clustering and algorithms for finding the minimum closest cluster. Furthermore, the chapter explores the performance of query computation over partitioned databases with an analysis of efficiency and privacy.
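Setting the privacy protocol aside, the clustering building block the chapter relies on can be sketched as plain agglomerative clustering; in this illustration (the data are assumptions, and single linkage is only one possible reading of the "minimum closest cluster" step) SciPy repeatedly merges the closest pair of clusters:

```python
# Illustrative sketch (not the chapter's privacy-preserving protocol):
# agglomerative clustering, repeatedly merging the closest cluster pair.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Single linkage defines cluster distance as the minimum point-to-point
# distance, i.e. each merge joins the two closest clusters.
Z = linkage(X, method="single")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree at 2 clusters
print(labels)
```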


2013 ◽  
Vol 462-463 ◽  
pp. 321-325 ◽  
Author(s):  
Kyung Mi Lee ◽  
Keon Myung Lee

The drastic increase in data volume strongly demands efficient techniques for finding data similar to queries. It is sometimes useful to specify the data of interest with fuzzy constraints. When data objects contain both numerical and categorical attributes, it is usually not easy to define commonly accepted distance measures between them. Without an efficient indexing structure, searching for specific data objects is costly because a linear search must be conducted over the whole data set. This paper proposes a method that uses the locality-sensitive hashing technique together with fuzzy constrained queries to search for interesting objects in big data. The method builds a locality-sensitive hashing-based indexing structure over the constituent continuous attributes only, collects a small number of candidate data objects against which the query is examined, and then evaluates their satisfaction degree with respect to the fuzzy constrained query so that the data objects satisfying the query are determined.
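The two-stage idea, hash-based candidate collection followed by fuzzy evaluation, can be sketched as follows (a toy illustration, not the paper's method: the one-dimensional bucketing, the triangular membership function, and all values are assumptions):

```python
# Toy sketch of the two-stage query: a bucket lookup on the continuous
# attribute narrows the candidates, then the fuzzy constraint
# "price about 100" is evaluated only on those candidates.
import numpy as np

def membership_about(x, center=100.0, spread=30.0):
    # Triangular fuzzy membership: 1 at the center, 0 beyond +/- spread.
    return max(0.0, 1.0 - abs(x - center) / spread)

prices = np.array([40.0, 95.0, 102.0, 110.0, 250.0])
buckets = (prices // 50).astype(int)  # crude 1-D locality-preserving hash

query_bucket = int(100.0 // 50)
# A real LSH would use several hash functions and probe nearby buckets too.
candidates = np.where(buckets == query_bucket)[0]
for i in candidates:
    print(i, prices[i], membership_about(prices[i]))  # satisfaction degree
```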


2021 ◽  
Author(s):  
R.S.M. Lakshmi Patibandla ◽  
Veeranjaneyulu N

The process of grouping similar data items is called data clustering: a data set is partitioned into groups based on the resemblance within each group, using various algorithms. The key idea of partition-based algorithms is to split the data points into partitions, each of which represents one cluster; the quality of a partition depends on certain objective functions. Evolutionary algorithms, inspired by the evolution of social behavior, are used to provide optimum solutions for huge optimization problems. In this paper, a survey of various partitioning and evolutionary algorithms is presented; they are implemented on a benchmark data set, and validation criteria such as Root-Mean-Square Standard Deviation (RMSSTD), R-squared, and SSD are proposed to be applied to algorithms such as Leader, ISODATA, SGO, and PSO.
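One of the named validation criteria, Root-Mean-Square Standard Deviation (RMSSTD), can be computed as in the following sketch (illustrative; the data set and k are assumptions, and k-means stands in for the surveyed partition-based algorithms):

```python
# Illustrative RMSSTD sketch: pooled within-cluster standard deviation of a
# partition, across all clusters and attributes (lower = more homogeneous).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=3)
km = KMeans(n_clusters=3, n_init=10, random_state=3).fit(X)

# Sum of squared deviations from each point's own cluster centroid,
# divided by the pooled degrees of freedom (n - k) * d.
sq_err = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
dof = (len(X) - km.n_clusters) * X.shape[1]
print(np.sqrt(sq_err / dof))
```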

