Fuzzy c-Means and Cluster Ensemble with Random Projection for Big Data Clustering

2016 ◽  
Vol 2016 ◽  
pp. 1-13 ◽  
Author(s):  
Mao Ye ◽  
Wenfen Liu ◽  
Jianghong Wei ◽  
Xuexian Hu

Because it helps mitigate the curse of dimensionality in big data, random projection has recently become a popular method for dimensionality reduction. This paper presents a theoretical analysis of how random projection affects the variability of a data set and the dependence between dimensions. Building on this analysis, a new fuzzy c-means (FCM) clustering algorithm with random projection is presented. Empirical results verify that the new algorithm not only preserves the accuracy of the original FCM clustering but is also more efficient than both the original clustering and clustering with singular value decomposition. In addition, a new cluster ensemble approach based on FCM clustering with random projection is proposed. The new aggregation method efficiently computes the spectral embedding of the data using a representation based on cluster centers, which scales linearly with data size. Experimental results demonstrate the efficiency, effectiveness, and robustness of the algorithm compared with state-of-the-art methods.
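The combination described above can be illustrated with a minimal sketch. It assumes a Gaussian random projection matrix and the textbook FCM update rules; the function names `random_projection` and `fcm`, the fuzzifier default `m=2.0`, and the stopping rule are illustrative assumptions, not the authors' exact algorithm.

```python
import numpy as np

def random_projection(X, k, rng):
    # Assumption: Gaussian projection matrix, scaled by 1/sqrt(k) so that
    # squared distances are preserved in expectation.
    R = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)
    return X @ R

def fcm(X, c, m=2.0, n_iter=100, tol=1e-6, rng=None):
    # Textbook fuzzy c-means: alternate center and membership updates.
    rng = np.random.default_rng() if rng is None else rng
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)          # each row sums to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Squared Euclidean distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)             # avoid division by zero
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, centers
```

Clustering the projected data `random_projection(X, k, rng)` instead of `X` reduces the per-iteration cost of the distance computation from the original dimension to `k`; the returned centers live in the projected space.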

2014 ◽  
Vol 687-691 ◽  
pp. 1496-1499
Author(s):  
Yong Lin Leng

During data collection, partially missing or blurred attribute values leave the data incomplete. Incomplete data are generally handled by imputation or discarding before clustering. This paper proposes a new similarity metric algorithm based on an incomplete information system. The algorithm first divides the data set into a complete subset and an incomplete subset. The complete subset is then clustered using the affinity propagation clustering algorithm, and each incomplete record is assigned to the corresponding cluster according to the designed similarity metric. To improve efficiency, a distributed version of the clustering algorithm is designed using cloud computing technology. Experiments demonstrate that the proposed algorithm can cluster incomplete big data directly and improves accuracy effectively.


2020 ◽  
Vol 37 (6) ◽  
pp. 1093-1101
Author(s):  
Divakar Yadav ◽  
Akanksha ◽  
Arun Kumar Yadav

Plants play a major role in sustaining biodiversity. Plant products are in demand not only for agriculture but also for the manufacture of medical products, cosmetics, and more. The apple is a fruit known for its excellent nutritional properties and is therefore recommended for daily intake. However, diseases in apple plants cause farmers heavy losses. These diseases not only severely affect the health of the fruit but also reduce overall productivity, quantity, and quality. This paper proposes a novel convolutional neural network (CNN) based model for the recognition and classification of apple leaf diseases. The proposed model applies a contrast stretching based pre-processing technique and the fuzzy c-means (FCM) clustering algorithm to identify plant diseases. These techniques help improve the accuracy of the CNN model even with a smaller dataset. A total of 400 image samples of apple leaves (200 healthy, 200 diseased) were used to train and validate the proposed model, which achieved an accuracy of 98%. It reaches this accuracy with a smaller dataset than other existing models, without compromising performance, which is made possible by combining contrast stretching pre-processing with the FCM clustering algorithm.
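As a rough illustration of the pre-processing step, the sketch below shows one common form of contrast stretching: a linear rescaling between two intensity percentiles. The function name `contrast_stretch` and the percentile defaults are assumptions for illustration; the abstract does not specify which variant the authors use.

```python
import numpy as np

def contrast_stretch(img, low_pct=2, high_pct=98):
    # Linearly map the [low_pct, high_pct] intensity percentiles to [0, 255].
    # Values outside that range are clipped.
    img = np.asarray(img, dtype=np.float64)
    lo, hi = np.percentile(img, [low_pct, high_pct])
    if hi <= lo:
        # Flat image: nothing to stretch.
        return np.zeros(img.shape, dtype=np.uint8)
    out = np.clip((img - lo) / (hi - lo), 0.0, 1.0)
    return (out * 255).round().astype(np.uint8)
```

Setting `low_pct=0` and `high_pct=100` gives plain min-max stretching; narrower percentiles make the result less sensitive to a few very dark or very bright pixels.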


2018 ◽  
Vol 9 (3) ◽  
pp. 15-30 ◽  
Author(s):  
S. Vengadeswaran ◽  
S. R. Balasundaram

This article describes how the time taken to execute a query and return results increases exponentially as data size grows, leading to longer waiting times for the user. Hadoop, with its distributed processing capability, is considered an efficient solution for processing such large data. Hadoop's Default Data Placement Strategy (HDDPS) allocates data blocks randomly across the cluster of nodes without considering any execution parameters. As a result, the blocks required for execution are often unavailable on the local machine, so data must be transferred across the network, which leads to a data locality issue. It is also commonly observed that most data-intensive applications show grouping semantics, so only part of the Big-Data set is used during query execution. Because such execution parameters and grouping behavior are not considered, the default placement performs poorly, with shortcomings such as fewer local map task executions, increased query execution time, and higher query latency. To overcome these issues, an Optimal Data Placement Strategy (ODPS) based on grouping semantics is proposed. First, the user history log is dynamically analyzed to identify the access pattern, which is represented as a graph. Markov clustering, a graph clustering algorithm, is applied to identify groupings within the dataset. Then, an Optimal Data Placement Algorithm (ODPA) is proposed based on statistical measures estimated from the clustered graph. This in turn reorganizes the default data layouts in HDFS to achieve improved performance for Big-Data sets in a heterogeneous distributed environment. The proposed strategy is tested on a 15-node cluster placed in a single-rack topology. The results show that it is more efficient for massive datasets, reducing query execution time by 26% and improving data locality by 38% compared to HDDPS.


2018 ◽  
Vol 8 (3) ◽  
pp. 445-469 ◽  
Author(s):  
Yariv Aizenbud ◽  
Amir Averbuch

Abstract In recent years, several algorithms which approximate matrix decomposition have been developed. These algorithms are based on metric conservation features for linear spaces of random projection types. We present a new algorithm, which achieves with high probability a rank-$r$ singular value decomposition (SVD) approximation of an $n \times n$ matrix and derive an error bound that does not depend on the first $r$ singular values. Although the algorithm has an asymptotic complexity similar to state-of-the-art algorithms and the proven error bound is not as tight as the state-of-the-art bound, experiments show that the proposed algorithm is faster in practice while providing the same error rates as those of the state-of-the-art algorithms. We also show that an i.i.d. sub-Gaussian matrix with large probability of having null entries is metric conserving. This result is used in the SVD approximation algorithm, as well as to improve the performance of a previously proposed approximated LU decomposition algorithm.
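For context, the general idea of a random-projection-based rank-$r$ SVD approximation can be sketched as follows. This is the widely used generic scheme with a dense Gaussian test matrix, not the authors' algorithm (which relies on sparse sub-Gaussian matrices); the function name `randomized_svd` and the `oversample` parameter are illustrative assumptions.

```python
import numpy as np

def randomized_svd(A, r, oversample=10, rng=None):
    # Generic randomized range finder followed by a small exact SVD.
    rng = np.random.default_rng() if rng is None else rng
    # Project the columns of A onto r + oversample random directions.
    Omega = rng.normal(size=(A.shape[1], r + oversample))
    # Orthonormal basis for the sampled range of A.
    Q, _ = np.linalg.qr(A @ Omega)
    # SVD of the small matrix B = Q^T A, then lift back with Q.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :r], s[:r], Vt[:r]
```

The expensive SVD is applied only to a matrix with `r + oversample` rows, which is where the speed-up over a full decomposition comes from.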


Author(s):  
Makoto Yasuda ◽  

This paper considers a fuzzy c-means (FCM) clustering algorithm in combination with deterministic annealing and Tsallis entropy maximization. The Tsallis entropy is a $q$-parameter extension of the Shannon entropy. By maximizing the Tsallis entropy within the framework of FCM, statistical mechanical membership functions can be derived. One of the major considerations when using this method is how to determine appropriate values for $q$ and the highest annealing temperature, $T_{\mathrm{high}}$, for a given data set. Accordingly, this paper presents a method for determining these values simultaneously without introducing any additional parameters, where the membership function is approximated using a series expansion method. The experimental results indicate that the proposed method is effective and that both $q$ and $T_{\mathrm{high}}$ can be determined automatically and algebraically from a given data set.
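To make the relationship between the two entropies concrete, the sketch below evaluates the standard Tsallis entropy $S_q = (1 - \sum_i p_i^q)/(q - 1)$ and falls back to the Shannon entropy at $q = 1$, which is its limit. The function name `tsallis_entropy` is an illustrative assumption, and the sketch does not reproduce the paper's membership functions or its method for choosing $q$ and $T_{\mathrm{high}}$.

```python
import numpy as np

def tsallis_entropy(p, q):
    # Standard Tsallis entropy of a discrete distribution p.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # terms with p_i = 0 contribute nothing
    if np.isclose(q, 1.0):
        # Limit q -> 1 recovers the Shannon entropy (natural logarithm).
        return float(-(p * np.log(p)).sum())
    return float((1.0 - (p ** q).sum()) / (q - 1.0))
```

For a uniform distribution over four outcomes, `tsallis_entropy([0.25] * 4, 2)` evaluates to `0.75`, while values of `q` close to 1 approach `log(4)`.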


Author(s):  
Sonia Goel ◽  
Meena Tushir

Introduction: Incomplete data sets with missing attributes are a prevalent problem in many research areas. Attributes may be missing for several reasons: human error in tabulating or recording the data, machine failure, errors in data acquisition, or the refusal of a patient or customer to answer some questions in a questionnaire or survey. Clustering such data sets is therefore a challenge. Objective: This paper presents a critical review of the methodologies proposed for handling missing data in clustering. The focus is a comparison of FCM clustering based on various imputation techniques with the four clustering strategies proposed by Hathaway and Bezdek. Methods: The missing values in incomplete data sets are filled in using various imputation and non-imputation techniques to complete the data set, and a conventional fuzzy clustering algorithm is then applied to obtain the clustering results. Results: Experiments are carried out on various synthetic data sets and real data sets from the UCI repository. Several performance criteria and statistical tests are used to evaluate the imputation and non-imputation based FCM clustering algorithms. Experimental results on various data sets show that linear interpolation based FCM clustering performs significantly better than the other imputation and non-imputation techniques. Conclusion: The performance of a clustering algorithm is data specific; no clustering technique gives good results on all data sets. It depends on both the data type and the percentage of missing attributes in the data set. This study shows that the linear interpolation based FCM clustering algorithm can be used effectively for clustering incomplete data sets.
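A minimal sketch of linear interpolation imputation is given below. It assumes that missing values are encoded as NaN and that each attribute is interpolated along the row order of the data set; both choices, and the function name `impute_linear`, are assumptions, since the abstract does not state how the interpolation is set up. The completed matrix could then be passed to any conventional FCM implementation.

```python
import numpy as np

def impute_linear(X):
    # Fill NaN entries column by column using linear interpolation
    # over the row index. Assumption: row order is meaningful.
    X = np.array(X, dtype=float)      # copy, so the input is not modified
    idx = np.arange(X.shape[0])
    for j in range(X.shape[1]):
        col = X[:, j]                 # view into X; edits write through
        known = ~np.isnan(col)
        if known.any():
            # np.interp holds the nearest known value beyond the ends.
            col[~known] = np.interp(idx[~known], idx[known], col[known])
    return X
```

Columns with no observed values are left as NaN, because there is nothing to interpolate from.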


2019 ◽  
Vol 20 (S19) ◽  
Author(s):  
Thomas A. Geddes ◽  
Taiyun Kim ◽  
Lihao Nan ◽  
James G. Burchfield ◽  
Jean Y. H. Yang ◽  
...  

Abstract Background Single-cell RNA-sequencing (scRNA-seq) is a transformative technology, allowing global transcriptomes of individual cells to be profiled with high accuracy. An essential task in scRNA-seq data analysis is the identification of cell types from complex samples or tissues profiled in an experiment. To this end, clustering has become a key computational technique for grouping cells based on their transcriptome profiles, enabling subsequent cell type identification from each cluster of cells. Due to the high feature-dimensionality of the transcriptome (i.e. the large number of measured genes in each cell) and because only a small fraction of genes are cell type-specific and therefore informative for generating cell type-specific clusters, clustering directly on the original feature/gene dimension may lead to uninformative clusters and hinder correct cell type identification. Results Here, we propose an autoencoder-based cluster ensemble framework in which we first take random subspace projections from the data, then compress each random projection to a low-dimensional space using an autoencoder artificial neural network, and finally apply ensemble clustering across all encoded datasets to generate clusters of cells. We employ four evaluation metrics to benchmark clustering performance and our experiments demonstrate that the proposed autoencoder-based cluster ensemble can lead to substantially improved cell type-specific clusters when applied with both the standard k-means clustering algorithm and a state-of-the-art kernel-based clustering algorithm (SIMLR) designed specifically for scRNA-seq data. Compared to directly using these clustering algorithms on the original datasets, the performance improvement in some cases is up to 100%, depending on the evaluation metric used. Conclusions Our results suggest that the proposed framework can facilitate more accurate cell type identification as well as other downstream analyses. 
The code for creating the proposed autoencoder-based cluster ensemble framework is freely available from https://github.com/gedcom/scCCESS


2021 ◽  
Author(s):  
Moritz Heusinger ◽  
Christoph Raab ◽  
Frank-Michael Schleif

Abstract In recent years, social media has become an important part of everyday life for many people. A big challenge of social media is finding posts that are interesting to the user. Many social networks, such as Twitter, handle this problem with so-called hashtags. A user can label their own Tweet (post) with a hashtag, while other users can search for posts containing a specified hashtag. But what about finding posts that are not labeled by the creator? We provide a way of completing hashtags for unlabeled posts using classification on a novel real-world Twitter data stream. New posts are created every second, so this context is well suited to non-stationary data analysis. Our goal is to show how labels (hashtags) of social media posts can be predicted by stream classifiers. In particular, we employ random projection (RP) as a preprocessing step in calculating streaming models. We also provide a novel real-world data set for streaming analysis called NSDQ, with a comprehensive data description. We show that this data set is a real challenge for state-of-the-art stream classifiers. While RP has been widely used and evaluated in stationary data analysis scenarios, non-stationary environments are not well analyzed. In this paper, we provide a use case of RP on real-world streaming data, in particular on the NSDQ data set. We discuss why RP can be used in this scenario and how it can handle stream-specific situations such as concept drift. We also provide experiments with RP on streaming data, using state-of-the-art stream classifiers such as adaptive random forest and concept drift detectors. Additionally, we experimentally evaluate an online principal component analysis (PCA) approach in the same fashion as for RP. To obtain higher dimensional synthetic streams, we use random Fourier features (RFF) in an online manner, which allows us to increase the number of dimensions of low dimensional streams.
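The last step, lifting a low dimensional stream to more dimensions with random Fourier features, can be sketched as follows. The sketch assumes the standard RFF construction for the RBF kernel $k(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$; the class name `RandomFourierFeatures`, the method `transform_one`, and the parameter `gamma` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class RandomFourierFeatures:
    # Maps one sample at a time, so it can be applied to a stream.
    def __init__(self, in_dim, out_dim, gamma=1.0, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        # For exp(-gamma * ||x - y||^2), frequencies are drawn from
        # a Gaussian with variance 2 * gamma in each coordinate.
        self.W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(in_dim, out_dim))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=out_dim)
        self.scale = np.sqrt(2.0 / out_dim)

    def transform_one(self, x):
        # Inner products of transformed samples approximate the kernel.
        x = np.asarray(x, dtype=float)
        return self.scale * np.cos(x @ self.W + self.b)
```

Because `W` and `b` are drawn once and then fixed, every arriving sample is mapped with the same transformation and no past data needs to be stored.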


2013 ◽  
Vol 392 ◽  
pp. 803-807 ◽  
Author(s):  
Xue Bo Feng ◽  
Fang Yao ◽  
Zhi Gang Li ◽  
Xiao Jing Yang

The fuzzy c-means (FCM) clustering algorithm clusters a data set according to the number of cluster centers, the initial cluster centers, the fuzzy factor, the number of iterations, and the threshold. FCM therefore faces the problem of initializing the clustering prototypes. First, the article combines the maximum and minimum distance algorithm with the K-means algorithm to determine the number of clusters and the initial cluster centers. Second, it determines the optimal number of clusters using the Silhouette index. Finally, it improves the convergence rate of FCM by repeatedly revising the memberships. The improved FCM has a good clustering effect, enhances optimization capability, and improves the efficiency and effectiveness of clustering. Compared with the traditional FCM clustering method, it achieves better within-class compactness, between-class separation, and cluster stability, as well as a faster convergence rate.
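The initialization idea can be illustrated with a minimal sketch of the maximum and minimum distance rule: each new center is the point whose distance to its nearest already chosen center is largest. The sketch assumes a fixed number of centers `c` and a caller-chosen starting point; the function name `maxmin_init` is hypothetical, and the article's combination with K-means and the Silhouette index is not reproduced here.

```python
import numpy as np

def maxmin_init(X, c, first=0):
    # Pick c initial centers that are spread far apart.
    X = np.asarray(X, dtype=float)
    chosen = [first]
    # Squared distance from each point to its nearest chosen center.
    min_d2 = ((X - X[first]) ** 2).sum(axis=1)
    for _ in range(c - 1):
        nxt = int(min_d2.argmax())    # farthest point from all chosen centers
        chosen.append(nxt)
        min_d2 = np.minimum(min_d2, ((X - X[nxt]) ** 2).sum(axis=1))
    return X[chosen]
```

The returned rows can be used as the starting centers of an FCM run in place of a random initialization.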

