Incomplete Big Data Distributed Clustering

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.687-691.1496 ◽

2014 ◽

Vol 687-691 ◽

pp. 1496-1499

Author(s):

Yong Lin Leng

Keyword(s):

Big Data ◽

Incomplete Data ◽

Clustering Algorithm ◽

Design Method ◽

Complete Data ◽

Similarity Metrics ◽

Distributed Clustering ◽

Computing Technology ◽

Data Set ◽

Affinity Propagation Clustering

Attribute values that are partially missing or blurred leave data incomplete at collection time. Incomplete data are generally handled by imputation or by discarding records before clustering. This paper proposes a new similarity metric algorithm based on an incomplete information system. The algorithm first divides the data set into a complete data set and an incomplete data set. The complete data set is then clustered using the affinity propagation clustering algorithm, and each incomplete record is assigned to the corresponding cluster according to the designed similarity metric. To improve the efficiency of the algorithm, a distributed clustering algorithm based on cloud computing technology is designed. Experiments demonstrate that the proposed algorithm can cluster incomplete big data directly and effectively improve accuracy.
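The split-then-assign flow described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's method: the abstract does not give its similarity metric, so a partial-distance measure (distance over observed attributes only, rescaled to the full dimension) is swapped in as an assumption, and the exemplars are hard-coded stand-ins for what affinity propagation would return on the complete rows. All function names are hypothetical.

```python
from math import sqrt


def split_complete(rows):
    """Separate rows with no missing value (None) from rows with at least one."""
    complete = [r for r in rows if None not in r]
    incomplete = [r for r in rows if None in r]
    return complete, incomplete


def partial_distance(row, exemplar):
    """Euclidean distance over observed attributes, rescaled to the full dimension.

    Assumption: this stands in for the paper's similarity metric, which the
    abstract does not specify.
    """
    observed = [(x, e) for x, e in zip(row, exemplar) if x is not None]
    if not observed:
        raise ValueError("row has no observed attributes")
    squared = sum((x - e) ** 2 for x, e in observed)
    return sqrt(squared * len(row) / len(observed))


def assign_incomplete(incomplete, exemplars):
    """Give each incomplete row the index of its closest exemplar."""
    return [
        min(range(len(exemplars)), key=lambda k: partial_distance(r, exemplars[k]))
        for r in incomplete
    ]


rows = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.0), (7.9, 8.2), (None, 8.1), (1.1, None)]
complete, incomplete = split_complete(rows)

# Hypothetical exemplars; in the described pipeline these would come from
# affinity propagation run on `complete`.
exemplars = [(1.0, 1.0), (8.0, 8.0)]
labels = assign_incomplete(incomplete, exemplars)
print(labels)
```

Because the assignment step only reads the exemplars, each incomplete row can be processed independently, which is the property a distributed implementation would rely on.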


An Optimal Data Placement Strategy for Improving System Performance of Massive Data Applications Using Graph Clustering

International Journal of Ambient Computing and Intelligence ◽

10.4018/ijaci.2018070102 ◽

2018 ◽

Vol 9 (3) ◽

pp. 15-30 ◽

Cited By ~ 4

Author(s):

S. Vengadeswaran ◽

S. R. Balasundaram

Keyword(s):

Big Data ◽

Execution Time ◽

Clustering Algorithm ◽

Graph Clustering ◽

Data Placement ◽

Data Locality ◽

Query Execution ◽

Data Set ◽

Statistical Measures ◽

Default Data

This article describes how the time taken to execute a query and return the results increases exponentially as the data size increases, leading to longer waiting times for the user. Hadoop, with its distributed processing capability, is considered an efficient solution for processing such large data. Hadoop's Default Data Placement Strategy (HDDPS) allocates the data blocks randomly across the cluster of nodes without considering any of the execution parameters. As a result, the blocks required for execution are often not available on the local machine and the data has to be transferred across the network, leading to a data locality issue. Also, it is commonly observed that most data-intensive applications show grouping semantics, so only a part of the Big-Data set is utilized during query execution. Since such execution parameters and grouping behavior are not considered, the default placement does not perform well, resulting in several shortcomings such as decreased local map task execution, increased query execution time, and query latency. To overcome these issues, an Optimal Data Placement Strategy (ODPS) based on grouping semantics is proposed. Initially, the user history log is dynamically analyzed to identify the access pattern, which is depicted as a graph. Markov clustering, a graph clustering algorithm, is applied to identify groupings among the dataset. Then, an Optimal Data Placement Algorithm (ODPA) is proposed based on the statistical measures estimated from the clustered graph. This in turn re-organizes the default data layouts in HDFS to achieve improved performance for Big-Data sets in a heterogeneous distributed environment. The proposed strategy is tested on a 15-node cluster placed in a single-rack topology. The results show it to be more efficient for massive datasets, reducing query execution time by 26% and improving data locality by 38% compared to HDDPS.
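The Markov clustering step named in this abstract can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the access-pattern graph below is hypothetical (nodes are data blocks, an edge means two blocks were read by the same query), and the inflation value, tolerance, and function names are assumptions chosen for illustration.

```python
def normalize_columns(m):
    """Scale every column so it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation step: raise entries to a power, then renormalize columns."""
    return normalize_columns([[v ** power for v in row] for row in m])


def markov_cluster(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    n = len(adjacency)
    # Self-loops are added so every column has a non-zero sum.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Each row with non-zero entries lists the members of one cluster;
    # identical member sets are reported once.
    clusters = []
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return clusters


# Hypothetical co-access graph: blocks 0-2 are queried together, as are 3-4.
adjacency = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
clusters = markov_cluster(adjacency)
print([sorted(c) for c in clusters])
```

A placement strategy in the spirit of the abstract would then try to co-locate the blocks of each cluster; how ODPA does that is not described here and is not modelled in the sketch.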


Research on Clustering Algorithm of Heterogeneous Network Privacy Big Data Set Based on Cloud Computing

Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering - Advanced Hybrid Information Processing ◽

10.1007/978-3-030-67871-5_33 ◽

2021 ◽

pp. 367-376

Author(s):

Ming-hao Ding

Keyword(s):

Cloud Computing ◽

Big Data ◽

Heterogeneous Network ◽

Clustering Algorithm ◽

Data Set


Different Approaches for Missing Data Handling in Fuzzy Clustering: A Review

Recent Advances in Electrical & Electronic Engineering (Formerly Recent Patents on Electrical & Electronic Engineering) ◽

10.2174/2352096512666191127121710 ◽

2020 ◽

Vol 13 (6) ◽

pp. 833-846

Author(s):

Sonia Goel ◽

Meena Tushir

Keyword(s):

Missing Data ◽

Fuzzy Clustering ◽

Incomplete Data ◽

Clustering Algorithm ◽

Linear Interpolation ◽

Performance Criteria ◽

Data Sets ◽

Data Set ◽

Fcm Clustering ◽

Missing Attributes

Introduction: Incomplete data sets containing some missing attributes are a prevailing problem in many research areas. There may be several reasons for missing attributes: human error in tabulating or recording the data, machine failure, errors in data acquisition, or refusal of a patient or customer to answer a few questions in a questionnaire or survey. Clustering such data sets is therefore a challenge. Objective: In this paper, we present a critical review of various methodologies proposed for handling missing data in clustering. The focus of this paper is the comparison of various imputation-based FCM clustering techniques and the four clustering strategies proposed by Hathaway and Bezdek. Methods: We impute the missing values in incomplete data sets using various imputation and non-imputation techniques to complete the data set, and then apply the conventional fuzzy clustering algorithm to obtain the clustering results. Results: Experiments are carried out on various synthetic data sets and real data sets from the UCI repository. To evaluate the performance of the various imputation and non-imputation based FCM clustering algorithms, several performance criteria and statistical tests are considered. Experimental results on various data sets show that linear interpolation based FCM clustering performs significantly better than the other imputation and non-imputation techniques. Conclusion: Clustering performance is data specific; no clustering technique gives good results on all data sets. It depends on both the data type and the percentage of missing attributes in the data set. Through this study, we have shown that the linear interpolation based FCM clustering algorithm can be used effectively for clustering incomplete data sets.
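As a rough illustration of the linear interpolation imputation mentioned in this abstract, the sketch below fills missing entries (None) in one attribute column from the nearest observed neighbours on each side. The abstract does not state which interpolation variant the authors use, so the ordering of records and the edge rule (copy the nearest observed value) are assumptions, and the function name is hypothetical.

```python
def interpolate_missing(values):
    """Fill None entries by linear interpolation between observed neighbours.

    Assumption: entries before the first or after the last observed value
    are filled with that nearest observed value.
    """
    observed = [i for i, v in enumerate(values) if v is not None]
    if not observed:
        raise ValueError("no observed values to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((j for j in observed if j < i), default=None)
        right = min((j for j in observed if j > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            span = values[right] - values[left]
            filled[i] = values[left] + span * (i - left) / (right - left)
    return filled


column = [1.0, None, 3.0, None, None, 6.0]
print(interpolate_missing(column))
```

Once every attribute column is completed this way, a conventional fuzzy c-means run can be applied to the filled data set, which is the two-stage pattern the abstract describes.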


User online behavior based on big data distributed clustering algorithm

International Journal of Advanced Robotic Systems ◽

10.1177/1729881420917293 ◽

2020 ◽

Vol 17 (2) ◽

pp. 172988142091729 ◽

Cited By ~ 1

Author(s):

Yan Wang

Keyword(s):

Big Data ◽

Clustering Algorithm ◽

Utilization Rate ◽

Clear Understanding ◽

Campus Network ◽

Online Behavior ◽

Distributed Clustering ◽

Network Resources ◽

Network Bandwidth ◽

Network Usage

As big data technology matures, many colleges and universities have begun to use it to analyze their development work. The campus network is present throughout daily life, in class, study, and entertainment. The purpose of this article is to study the online behavior of users and to analyze students' use of the campus network, both to gain a clear understanding of students' online access and to provide feedback for the operation and maintenance of the campus network. Based on big data, this article uses a distributed clustering algorithm to study the online behavior of users. The online users of one college are selected as the research object, and their online behavior is studied and analyzed. The study found that network usage by second-year students is as high as 330,000, which is 60.98% more than that of seniors. In addition, the majority of student users spend most of their online time on the weekend, with little difference among the other days. Session durations are concentrated in three ranges: within 1 h, 1–2 h, and 2–3 h. Studying users' online behavior reveals the utilization rate of the campus network bandwidth and the distribution of network use, helps prevent students from indulging in the virtual network world, and helps ensure that users can access network resources reasonably while enjoying a better online experience on the campus network. The research provides a reference for network administrators to adjust network bandwidth and optimize the network.


Affinity propagation clustering algorithm based on large-scale data-set

International Journal of Computers and Applications ◽

10.1080/1206212x.2018.1425184 ◽

2018 ◽

Vol 40 (3) ◽

pp. 1-6 ◽

Cited By ~ 4

Author(s):

Limin Wang ◽

Kaiyue Zheng ◽

Xing Tao ◽

Xuming Han

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Affinity Propagation ◽

Data Set ◽

Large Scale Data ◽

Affinity Propagation Clustering ◽

Scale Data


An Extended Affinity Propagation Clustering Method Based on Different Data Density Types

Computational Intelligence and Neuroscience ◽

10.1155/2015/828057 ◽

2015 ◽

Vol 2015 ◽

pp. 1-8 ◽

Cited By ~ 1

Author(s):

XiuLi Zhao ◽

WeiXiang Xu

Keyword(s):

Clustering Algorithm ◽

Affinity Propagation ◽

Similar Degree ◽

Clustering Method ◽

Data Set ◽

Affinity Propagation Clustering ◽

Initial Cluster ◽

Data Density ◽

Data Points ◽

Ap Clustering

The affinity propagation (AP) algorithm, a novel clustering method, does not require users to specify the initial cluster centers in advance. It regards all data points equally as potential exemplars (cluster centers) and forms the clusters entirely from the degree of similarity among the data points. In many cases, however, a single data set contains areas of different density, which means that the data are not distributed homogeneously. In such situations the AP algorithm cannot group the data points into ideal clusters. In this paper, we propose an extended AP clustering algorithm to deal with this problem. Our method has two steps: first, the data set is partitioned into several data density types according to the nearest distances of each data point; then the AP clustering method is applied separately within each data density type to group the data points into clusters. Two experiments are carried out to evaluate the performance of our algorithm: one uses an artificial data set and the other uses a real seismic data set. The experimental results show that our algorithm obtains groups more accurately than OPTICS and the AP clustering algorithm itself.
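The first step described in this abstract, partitioning points into density types by their nearest-neighbour distance, can be sketched as below. The abstract does not state the partitioning rule, so this example assumes only two density types and splits them at the widest gap in the sorted nearest-neighbour distances; both that rule and the function names are hypothetical. The per-type AP clustering is not reproduced.

```python
from math import dist


def nearest_distances(points):
    """Distance from each point to its closest other point."""
    return [
        min(dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


def split_by_density(points):
    """Split point indices into a dense and a sparse group (needs >= 2 points).

    Assumption: the threshold sits in the middle of the widest gap between
    consecutive sorted nearest-neighbour distances.
    """
    nn = nearest_distances(points)
    ordered = sorted(nn)
    gaps = [(ordered[k + 1] - ordered[k], k) for k in range(len(ordered) - 1)]
    _, k = max(gaps)
    threshold = (ordered[k] + ordered[k + 1]) / 2
    dense = [i for i, d in enumerate(nn) if d <= threshold]
    sparse = [i for i, d in enumerate(nn) if d > threshold]
    return dense, sparse


points = [
    (0, 0), (0, 1), (1, 0), (1, 1),          # tightly packed region
    (10, 10), (10, 20), (20, 10), (20, 20),  # widely spaced region
]
dense, sparse = split_by_density(points)
print(dense, sparse)
```

In the two-step scheme from the abstract, a clustering run would then be applied to each group separately, so that a single similarity scale is not forced onto regions of very different density.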


Fuzzy c-Means and Cluster Ensemble with Random Projection for Big Data Clustering

Mathematical Problems in Engineering ◽

10.1155/2016/6529794 ◽

2016 ◽

Vol 2016 ◽

pp. 1-13 ◽

Cited By ~ 5

Author(s):

Mao Ye ◽

Wenfen Liu ◽

Jianghong Wei ◽

Xuexian Hu

Keyword(s):

Big Data ◽

Clustering Algorithm ◽

State Of The Art ◽

Random Projection ◽

Aggregation Method ◽

Data Set ◽

Cluster Ensemble ◽

Positive Effects ◽

Fcm Clustering ◽

Value Decomposition

Because of its positive effects in dealing with the curse of dimensionality in big data, random projection has recently become a popular method for dimensionality reduction. In this paper, a theoretical analysis of the influence of random projection on the variability of a data set and on the dependence among dimensions is presented. Together with the theoretical analysis, a new fuzzy c-means (FCM) clustering algorithm with random projection is presented. Empirical results verify that the new algorithm not only preserves the accuracy of the original FCM clustering, but is also more efficient than the original clustering and than clustering with singular value decomposition. A new cluster ensemble approach based on FCM clustering with random projection is also proposed. The new aggregation method can efficiently compute the spectral embedding of data using a cluster-center-based representation, which scales linearly with data size. Experimental results reveal the efficiency, effectiveness, and robustness of our algorithm compared to state-of-the-art methods.
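The dimensionality-reduction step in this abstract can be illustrated with a minimal Gaussian random projection in pure Python. This is a sketch under stated assumptions rather than the paper's construction: the abstract does not say which projection matrix is used, so entries are drawn from N(0, 1) and scaled by 1/sqrt(k), and the function names and sizes are hypothetical.

```python
import random
from math import sqrt


def random_projection_matrix(d, k, seed=0):
    """Return a d x k matrix of N(0, 1) entries scaled by 1/sqrt(k)."""
    rng = random.Random(seed)
    scale = 1.0 / sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(d)]


def project(points, matrix):
    """Map each d-dimensional point to k dimensions via point @ matrix."""
    k = len(matrix[0])
    return [
        [sum(x * matrix[i][j] for i, x in enumerate(p)) for j in range(k)]
        for p in points
    ]


d, k = 50, 5
data_rng = random.Random(1)
points = [[data_rng.random() for _ in range(d)] for _ in range(20)]

matrix = random_projection_matrix(d, k)
reduced = project(points, matrix)
print(len(reduced), len(reduced[0]))
```

A clustering algorithm such as FCM would then run on `reduced` instead of `points`, so each distance computation costs k rather than d operations.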


A New semi-supervised clustering for incomplete data

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189744 ◽

2021 ◽

pp. 1-13

Author(s):

Sonia Goel ◽

Meena Tushir

Keyword(s):

Incomplete Data ◽

Missing Values ◽

Clustering Algorithms ◽

Complete Data ◽

Unlabeled Data ◽

Misclassification Rate ◽

Data Sets ◽

Clustering Methods ◽

Data Set ◽

Supervised Clustering

Semi-supervised clustering partitions unlabeled data using prior knowledge from labeled data. Most semi-supervised clustering algorithms exist only for complete data, i.e., data sets with no missing features. In this paper, an effort has been made to check the effectiveness of semi-supervised clustering when applied to incomplete data sets. The novelty of this approach is that it considers the missing features along with the available knowledge (labels) of the data set. The linear interpolation imputation technique initially imputes the missing features of the data set, thus completing the data set. Semi-supervised clustering is then applied to this completed data set, and the missing features are regularly updated within the clustering process. In the proposed work, the labeled percentages used are 30, 40, 50, and 60% of the total data. The data are further altered by arbitrarily eliminating certain features of their components, which makes the data incomplete with partial labeling. The proposed algorithm utilizes both labeled and unlabeled data, along with certain missing values in the data. It is evaluated using three performance indices, namely the misclassification rate, random index metric, and error rate. Despite the additional missing features, the proposed algorithm has been successfully implemented on real data sets and showed better or competitive results compared with well-known standard semi-supervised clustering methods.


Distributed Clustering Algorithm to Explore Selection Diversity in Wireless Sensor Networks

IEICE Transactions on Communications ◽

10.1587/transcom.e93.b.1232 ◽

2010 ◽

Vol E93-B (5) ◽

pp. 1232-1239

Author(s):

Hyung-Yun KONG ◽

ASADUZZAMAN

Keyword(s):

Wireless Sensor Networks ◽

Sensor Networks ◽

Clustering Algorithm ◽

Wireless Sensor ◽

Distributed Clustering ◽

Selection Diversity


An approach for document retrieval using cluster-based inverted indexing

Journal of Information Science ◽

10.1177/01655515211018401 ◽

2021 ◽

pp. 016555152110184

Author(s):

Gunjan Chandwani ◽

Anil Ahlawat ◽

Gaurav Dubey

Keyword(s):

High Performance ◽

Clustering Algorithm ◽

Pearson Correlation ◽

Relevant Information ◽

Document Retrieval ◽

Bhattacharyya Distance ◽

Data Set ◽

Query Matching ◽

Inverted Indexing ◽

Query Optimisation

Document retrieval plays an important role in knowledge management, as it helps us discover relevant information in existing data. This article proposes a cluster-based inverted indexing algorithm for document retrieval. First, pre-processing removes unnecessary and redundant words from the documents. Then, the documents are indexed by the cluster-based inverted indexing algorithm, which is developed by integrating the piecewise fuzzy C-means (piFCM) clustering algorithm with inverted indexing. Once the documents are indexed, user queries are matched using the Bhattacharyya distance. Finally, query optimisation is done with the Pearson correlation coefficient, and the relevant documents are retrieved. The performance of the proposed algorithm is analysed on the WebKB data set and the Twenty Newsgroups data set. The analysis shows that the proposed algorithm offers high performance, with a precision of 1, a recall of 0.70 and an F-measure of 0.8235. The proposed document retrieval system retrieves the most relevant documents and speeds up the storing and retrieval of information.
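Two quantities named in this abstract have standard closed forms that are easy to check: the Bhattacharyya distance between two discrete distributions, and the F-measure as the harmonic mean of precision and recall. The sketch below uses those textbook definitions; how the paper builds its term distributions is not stated, so the example vectors and function names are hypothetical.

```python
from math import inf, log, sqrt


def bhattacharyya_distance(p, q):
    """-ln of the Bhattacharyya coefficient sum(sqrt(p_i * q_i)).

    Both inputs are assumed to be probability vectors of equal length.
    Returns infinity when the distributions share no support.
    """
    coefficient = sum(sqrt(a * b) for a, b in zip(p, q))
    return inf if coefficient == 0 else -log(coefficient)


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical normalized term distributions for a query and a document.
query = [0.5, 0.3, 0.2]
document = [0.4, 0.4, 0.2]
print(bhattacharyya_distance(query, document))

# Precision and recall values reported in the abstract.
print(round(f_measure(1.0, 0.70), 4))
```

With precision 1 and recall 0.70, the harmonic mean is 1.4 / 1.7, which rounds to the 0.8235 reported in the abstract.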
