scholarly journals Maxmin Data Range Heuristic-based Initial Centroid Method of Partitional Clustering for Big Data Mining

2022 ◽  
Vol 12 (1) ◽  
pp. 0-0

The centroid-based clustering algorithm depends on the number of clusters, initial centroid, distance measures, and statistical approach of central tendencies. The initial centroid initialization algorithm defines convergence speed, computing efficiency, execution time, scalability, memory utilization, and performance issues for big data clustering. Nowadays various researchers have proposed the cluster initialization techniques, where some initialization techniques reduce the number of iterations with the lowest cluster quality, and some initialization techniques increase the cluster quality with high iterations. For these reasons, this study proposed the initial centroid initialization based Maxmin Data Range Heuristic (MDRH) method for K-Means (KM) clustering that reduces the execution times, iterations, and improves quality for big data clustering. The proposed MDRH method has compared against the classical KM and KM++ algorithms with four real datasets. The MDRH method has achieved better effectiveness and efficiency over RS, DB, CH, SC, IS, and CT quantitative measurements.

2018 ◽  
Vol 27 (04) ◽  
pp. 1860006
Author(s):  
Nikolaos Tsapanos ◽  
Anastasios Tefas ◽  
Nikolaos Nikolaidis ◽  
Ioannis Pitas

Data clustering is an unsupervised learning task that has found many applications in various scientific fields. The goal is to find subgroups of closely related data samples (clusters) in a set of unlabeled data. A classic clustering algorithm is the so-called k-Means. It is very popular, however, it is also unable to handle cases in which the clusters are not linearly separable. Kernel k-Means is a state of the art clustering algorithm, which employs the kernel trick, in order to perform clustering on a higher dimensionality space, thus overcoming the limitations of classic k-Means regarding the non-linear separability of the input data. With respect to the challenges of Big Data research, a field that has established itself in the last few years and involves performing tasks on extremely large amounts of data, several adaptations of the Kernel k-Means have been proposed, each of which has different requirements in processing power and running time, while also incurring different trade-offs in performance. In this paper, we present several issues and techniques involving the usage of Kernel k-Means for Big Data clustering and how the combination of each component in a clustering framework fares in terms of resources, time and performance. We use experimental results, in order to evaluate several combinations and provide a recommendation on how to approach a Big Data clustering problem.


Author(s):  
Hind Bangui ◽  
Mouzhi Ge ◽  
Barbora Buhnova

Due to the massive data increase in different Internet of Things (IoT) domains such as healthcare IoT and Smart City IoT, Big Data technologies have been emerged as critical analytics tools for analyzing the IoT data. Among the Big Data technologies, data clustering is one of the essential approaches to process the IoT data. However, how to select a suitable clustering algorithm for IoT data is still unclear. Furthermore, since Big Data technology are still in its initial stage for different IoT domains, it is thus valuable to propose and structure the research challenges between Big Data and IoT. Therefore, this article starts by reviewing and comparing the data clustering algorithms that can be applied in IoT datasets, and then extends the discussions to a broader IoT context such as IoT dynamics and IoT mobile networks. Finally, this article identifies a set of research challenges that harvest a research roadmap for the Big Data research in IoT domains. The proposed research roadmap aims at bridging the research gaps between Big Data and various IoT contexts.


Author(s):  
Zhanqiu Yu

To explore the Internet of things logistics system application, an Internet of things big data clustering analysis algorithm based on K-mans was discussed. First of all, according to the complex event relation and processing technology, the big data processing of Internet of things was transformed into the extraction and analysis of complex relational schema, so as to provide support for simplifying the processing complexity of big data in Internet of things (IOT). The traditional K-means algorithm was optimized and improved to make it fit the demand of big data RFID data network. Based on Hadoop cloud cluster platform, a K-means cluster analysis was achieved. In addition, based on the traditional clustering algorithm, a center point selection technology suitable for RFID IOT data clustering was selected. The results showed that the clustering efficiency was improved to some extent. As a result, an RFID Internet of things clustering analysis prototype system is designed and realized, which further tests the feasibility.


Author(s):  
Michele Ianni ◽  
Elio Masciari ◽  
Giuseppe M. Mazzeo ◽  
Carlo Zaniolo

Author(s):  
David Pfander ◽  
Gregor Daiß ◽  
Dirk Pflüger

Clustering is an important task in data mining that has become more challenging due to the ever-increasing size of available datasets. To cope with these big data scenarios, a high-performance clustering approach is required. Sparse grid clustering is a density-based clustering method that uses a sparse grid density estimation as its central building block. The underlying density estimation approach enables the detection of clusters with non-convex shapes and without a predetermined number of clusters. In this work, we introduce a new distributed and performance-portable variant of the sparse grid clustering algorithm that is suited for big data settings. Our compute kernels were implemented in OpenCL to enable portability across a wide range of architectures. For distributed environments, we added a manager-worker scheme that was implemented using MPI. In experiments on two supercomputers, Piz Daint and Hazel Hen, with up to 100 million data points in a 10-dimensional dataset, we show the performance and scalability of our approach. The dataset with 100 million data points was clustered in 1198s using 128 nodes of Piz Daint. This translates to an overall performance of 352TFLOPS. On the node-level, we provide results for two GPUs, Nvidia's Tesla P100 and the AMD FirePro W8100, and one processor-based platform that uses Intel Xeon E5-2680v3 processors. In these experiments, we achieved between 43% and 66% of the peak performance across all compute kernels and devices, demonstrating the performance portability of our approach.


2019 ◽  
Vol 8 (4) ◽  
pp. 1657-1664

In the area of information technology, a speedy sensational technology is big data. Big data brings tremendous challenges to extract valuable hidden knowledge. Data mining techniques can be used over big data to extract valuable knowledge for decision making. Big data results in high heterogeneity because it consists of various inter-related kinds of objects such as audios, texts, and images. In addition to this, the inter-related kinds of objects carry different information. So, in this paper clustering techniques are introduced to separate objects into several clusters. It also reduces the computational complexity of classifiers. A Possibilistic c-Means (PCM) algorithm was introduced to group the objects in big data. PCM replicated the characteristic of each object to different clusters effectively and it had capability to avoid the corruption of noise in the clustering process. However, PCM is not more efficient for big data and it cannot confine the complex correlation over multiple modalities of the heterogeneous data objects. So, a Parallel Semi-supervised Multi-Ant Colonies Clustering (PSMACC) is introduced for big data clustering. Initially, the PSMACC splits the data into number of partitions and each partition is processed in mappers. Each mapper generates a diverse collection of three clustering components using the semisupervised ant colony clustering algorithm with various moving speeds. Then, a hyper graph model was used to combine three clustering components. Finally, two constraints such as MustLink (ML) and Cannot-Link (CL) are included to form a consensus clustering. Finally, the intermediate results of each mapper are combined in the reducer. However, the overhead of iteration in PSMACC is overwhelming which affects the performance of PSMACC. So, a Parallel Semi-supervised MultiImperialist Competitive Algorithm (PSMICA) is proposed to cluster the big data. In PSMICA, each mapper processes the ICA where initial population is called countries. Some of the best countries in the population chosen as the imperialists and the remaining countries form the colonies of these imperialists. The colonies move towards the imperialists based on the distance between them. The intermediate results of each mapper are combined in reducer to get the final clustering result.


2021 ◽  
Vol 15 ◽  
Author(s):  
An Yan ◽  
Wei Wang ◽  
Yi Ren ◽  
HongWei Geng

The problems of data abnormalities and missing data are puzzling the traditional multi-modal heterogeneous big data clustering. In order to solve this issue, a multi-view heterogeneous big data clustering algorithm based on improved Kmeans clustering is established in this paper. At first, for the big data which involve heterogeneous data, based on multi view data analyzing, we propose an advanced Kmeans algorithm on the base of multi view heterogeneous system to determine the similarity detection metrics. Then, a BP neural network method is used to predict the missing attribute values, complete the missing data and restore the big data structure in heterogeneous state. Last, we ulteriorly propose a data denoising algorithm to denoise the abnormal data. Based on the above methods, we construct a framework namely BPK-means to resolve the problems of data abnormalities and missing data. Our solution approach is evaluated through rigorous performance evaluation study. Compared with the original algorithm, both theoretical verification and experimental results show that the accuracy of the proposed method is greatly improved.


Sign in / Sign up

Export Citation Format

Share Document