Fuzzy c-Means and Cluster Ensemble with Random Projection for Big Data Clustering

Because of its positive effects in dealing with the curse of dimensionality in big data, random projection has recently become a popular method for dimensionality reduction. In this paper, a theoretical analysis of the influence of random projection on the variability of a data set and on the dependence among dimensions is presented. Together with the theoretical analysis, a new fuzzy c-means (FCM) clustering algorithm with random projection is presented. Empirical results verify that the new algorithm not only preserves the accuracy of the original FCM clustering but is also more efficient than the original clustering and clustering with singular value decomposition. In addition, a new cluster ensemble approach based on FCM clustering with random projection is proposed. The new aggregation method can efficiently compute the spectral embedding of data with a cluster-center-based representation, which scales linearly with data size. Experimental results reveal the efficiency, effectiveness, and robustness of our algorithm compared to state-of-the-art methods.
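
The pipeline described in this abstract (reduce dimensionality with a random projection, then run FCM on the reduced data) can be sketched roughly as follows. This is a minimal illustration, not the authors' algorithm: the Gaussian projection matrix, the fuzzifier m = 2, the target dimension k = 20, and the synthetic two-blob data are all assumptions chosen for the example, and the function names are hypothetical.

```python
import numpy as np


def random_projection(X, k, rng):
    """Map n x d data to n x k using a Gaussian matrix with N(0, 1/k) entries."""
    d = X.shape[1]
    R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))
    return X @ R


def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-6, rng=None):
    """Textbook FCM. Returns (centers, U); U is n x c and each row sums to 1."""
    rng = np.random.default_rng() if rng is None else rng
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Centers are membership-weighted means of the data.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Squared Euclidean distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


# Synthetic data (assumed for illustration): two blobs in 500 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 500)),
    rng.normal(3.0, 1.0, size=(50, 500)),
])
Z = random_projection(X, k=20, rng=rng)      # 500 -> 20 dimensions
centers, U = fuzzy_c_means(Z, c=2, rng=rng)  # cluster in the reduced space
labels = U.argmax(axis=1)                    # harden memberships for inspection
```

Each FCM iteration here costs time proportional to n · c · k rather than n · c · d, which is where the saving from projecting first comes from in this sketch.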

Incomplete Big Data Distributed Clustering

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.687-691.1496 ◽

2014 ◽

Vol 687-691 ◽

pp. 1496-1499

Author(s):

Yong Lin Leng

Keyword(s):

Big Data ◽

Incomplete Data ◽

Clustering Algorithm ◽

Design Method ◽

Complete Data ◽

Similarity Metrics ◽

Distributed Clustering ◽

Computing Technology ◽

Data Set ◽

Affinity Propagation Clustering

Partially missing or blurred attribute values make data incomplete during data collection. Generally, imputation or discarding is used to deal with incomplete data before clustering. In this paper we propose a new similarity metric algorithm based on an incomplete information system. The algorithm first divides the data set into a complete data set and an incomplete data set. The complete data set is then clustered using the affinity propagation clustering algorithm, and the incomplete data are assigned to the corresponding clusters according to the designed similarity metric. To improve the efficiency of the algorithm, a distributed clustering algorithm based on cloud computing technology is designed. Experiments demonstrate that the proposed algorithm can cluster incomplete big data directly and improve accuracy and effectiveness.

A Novel Convolutional Neural Network Based Model for Recognition and Classification of Apple Leaf Diseases

Traitement du signal ◽

10.18280/ts.370622 ◽

2020 ◽

Vol 37 (6) ◽

pp. 1093-1101

Author(s):

Divakar Yadav ◽

Akanksha ◽

Arun Kumar Yadav

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Clustering Algorithm ◽

Daily Intake ◽

Plant Diseases ◽

Data Set ◽

Proposed Model ◽

Fcm Clustering ◽

Contrast Stretching

Plants play a major role in sustaining biodiversity. They are in demand not only for agricultural productivity but also for the manufacture of medical products, cosmetics, and more. Apple is one of the fruits known for its excellent nutritional properties and is therefore recommended for daily intake. However, various diseases in apple plants cause farmers huge losses. These diseases not only severely affect the fruit's health but also reduce overall productivity, quantity, and quality. A novel convolutional neural network (CNN) based model for recognition and classification of apple leaf diseases is proposed in this paper. The proposed model applies a contrast stretching based pre-processing technique and the fuzzy c-means (FCM) clustering algorithm for the identification of plant diseases. These techniques help to improve the accuracy of the CNN model even with a smaller dataset. 400 image samples (200 healthy, 200 diseased) of apple leaves were used to train and validate the proposed model. The proposed model achieved an accuracy of 98%. It reaches this accuracy with a smaller dataset than other existing models, without compromising performance, which is made possible by combining contrast stretching pre-processing with the FCM clustering algorithm.

An Optimal Data Placement Strategy for Improving System Performance of Massive Data Applications Using Graph Clustering

International Journal of Ambient Computing and Intelligence ◽

10.4018/ijaci.2018070102 ◽

2018 ◽

Vol 9 (3) ◽

pp. 15-30 ◽

Cited By ~ 4

Author(s):

S. Vengadeswaran ◽

S. R. Balasundaram

Keyword(s):

Big Data ◽

Execution Time ◽

Clustering Algorithm ◽

Graph Clustering ◽

Data Placement ◽

Data Locality ◽

Query Execution ◽

Data Set ◽

Statistical Measures ◽

Default Data

This article describes how the time taken to execute a query and return the results increases exponentially as the data size increases, leading to longer waiting times for the user. Hadoop, with its distributed processing capability, is considered an efficient solution for processing such large data. Hadoop's Default Data Placement Strategy (HDDPS) allocates data blocks randomly across the cluster of nodes without considering any of the execution parameters. As a result, the blocks required for execution are often unavailable on the local machine, so the data has to be transferred across the network, leading to a data locality issue. It is also commonly observed that most data-intensive applications show grouping semantics; hence, during query execution, only a part of the Big-Data set is utilized. Since such execution parameters and grouping behavior are not considered, the default placement does not perform well, resulting in several shortcomings such as decreased local map task execution, increased query execution time, and query latency. To overcome these issues, an Optimal Data Placement Strategy (ODPS) based on grouping semantics is proposed. Initially, the user history log is dynamically analyzed to identify the access pattern, which is depicted as a graph. Markov clustering, a graph clustering algorithm, is applied to identify groupings among the dataset. Then, an Optimal Data Placement Algorithm (ODPA) is proposed based on the statistical measures estimated from the clustered graph. This in turn reorganizes the default data layouts in HDFS to achieve improved performance for Big-Data sets in a heterogeneous distributed environment. The proposed strategy is tested on a 15-node cluster placed in a single-rack topology. The results show that it is more efficient for massive datasets, reducing query execution time by 26% and improving data locality by 38% compared to HDDPS.

Matrix decompositions using sub-Gaussian random matrices

Information and Inference A Journal of the IMA ◽

10.1093/imaiai/iay017 ◽

2018 ◽

Vol 8 (3) ◽

pp. 445-469 ◽

Cited By ~ 2

Author(s):

Yariv Aizenbud ◽

Amir Averbuch

Keyword(s):

Error Bound ◽

State Of The Art ◽

Matrix Decomposition ◽

Singular Values ◽

Random Projection ◽

Error Rates ◽

The State ◽

Matrix Decompositions ◽

Asymptotic Complexity ◽

Value Decomposition

In recent years, several algorithms that approximate matrix decompositions have been developed. These algorithms are based on metric conservation features for linear spaces of random projection types. We present a new algorithm, which achieves with high probability a rank-$r$ singular value decomposition (SVD) approximation of an $n \times n$ matrix, and derive an error bound that does not depend on the first $r$ singular values. Although the algorithm has an asymptotic complexity similar to state-of-the-art algorithms and the proven error bound is not as tight as the state-of-the-art bound, experiments show that the proposed algorithm is faster in practice while providing the same error rates as those of the state-of-the-art algorithms. We also show that an i.i.d. sub-Gaussian matrix with large probability of having null entries is metric conserving. This result is used in the SVD approximation algorithm, as well as to improve the performance of a previously proposed approximated LU decomposition algorithm.
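
To make the idea of a random-projection-based rank-$r$ SVD concrete, here is a minimal sketch of the generic randomized range-finder scheme, using a sparse random sign matrix (an i.i.d. sub-Gaussian matrix in which most entries are zero) as the test matrix. It is not the algorithm of the paper: the `density` and `oversample` parameters, the function names, and the exactly rank-5 test matrix are assumptions made for illustration.

```python
import numpy as np


def sparse_sign_matrix(n, k, density, rng):
    """i.i.d. entries: 0 with probability 1 - density, otherwise +1 or -1."""
    signs = rng.choice([-1.0, 1.0], size=(n, k))
    mask = rng.random((n, k)) < density
    # Scaling keeps the expected squared column norm at n / k.
    return signs * mask / np.sqrt(density * k)


def randomized_svd(A, r, oversample=10, density=0.3, rng=None):
    """Approximate rank-r SVD of A via a random sketch of its range."""
    rng = np.random.default_rng() if rng is None else rng
    Omega = sparse_sign_matrix(A.shape[1], r + oversample, density, rng)
    Y = A @ Omega                 # sample the column space of A
    Q, _ = np.linalg.qr(Y)        # orthonormal basis for the sample
    B = Q.T @ A                   # small (r + oversample) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub                    # lift left singular vectors back
    return U[:, :r], s[:r], Vt[:r, :]


# Assumed test case: a 200 x 200 matrix of exact rank 5.
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 200))
U, s, Vt = randomized_svd(A, r=5, rng=rng)
rel_err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
```

Because the test matrix has exact rank 5, the sketch captures its whole range and `rel_err` sits at the level of floating-point round-off; for matrices with slowly decaying singular values the error would instead depend on the tail of the spectrum.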

Research on Clustering Algorithm of Heterogeneous Network Privacy Big Data Set Based on Cloud Computing

Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering - Advanced Hybrid Information Processing ◽

10.1007/978-3-030-67871-5_33 ◽

2021 ◽

pp. 367-376

Author(s):

Ming-hao Ding

Keyword(s):

Cloud Computing ◽

Big Data ◽

Heterogeneous Network ◽

Clustering Algorithm ◽

Data Set

Approximate Determination of q-Parameter for FCM with Tsallis Entropy Maximization

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽

10.20965/jaciii.2017.p1152 ◽

2017 ◽

Vol 21 (7) ◽

pp. 1152-1160

Author(s):

Makoto Yasuda ◽

Keyword(s):

Clustering Algorithm ◽

Annealing Temperature ◽

Expansion Method ◽

Tsallis Entropy ◽

Approximate Determination ◽

Data Set ◽

Entropy Maximization ◽

Fcm Clustering ◽

Statistical Mechanical ◽

Series Expansion Method

This paper considers a fuzzy c-means (FCM) clustering algorithm in combination with deterministic annealing and the Tsallis entropy maximization. The Tsallis entropy is a $q$-parameter extension of the Shannon entropy. By maximizing the Tsallis entropy within the framework of FCM, statistical mechanical membership functions can be derived. One of the major considerations when using this method is how to determine appropriate values for $q$ and the highest annealing temperature, $T_{\mathrm{high}}$, for a given data set. Accordingly, in this paper, a method for determining these values simultaneously without introducing any additional parameters is presented, where the membership function is approximated using a series expansion method. The results of experiments indicate that the proposed method is effective, and both $q$ and $T_{\mathrm{high}}$ can be determined automatically and algebraically from a given data set.
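
As a small illustration of the statement that the Tsallis entropy extends the Shannon entropy, the sketch below evaluates the standard definition $S_q = (1 - \sum_i p_i^q)/(q - 1)$ and compares it with the Shannon entropy as $q$ approaches 1. Only the entropy itself is shown; the membership functions and the procedure for choosing $q$ and $T_{\mathrm{high}}$ from the paper are not reproduced, and the example distribution is an assumption.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in nats: -sum p_i ln p_i."""
    return -sum(x * math.log(x) for x in p if x > 0)


def tsallis_entropy(p, q):
    """Tsallis entropy S_q = (1 - sum p_i^q) / (q - 1); Shannon in the limit q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return shannon_entropy(p)
    return (1.0 - sum(x ** q for x in p if x > 0)) / (q - 1.0)


p = [0.5, 0.3, 0.2]                     # example distribution (assumed)
near_one = tsallis_entropy(p, 1.0001)   # close to the Shannon value
shannon = shannon_entropy(p)
uniform_q2 = tsallis_entropy([0.25] * 4, 2.0)  # (1 - 4/16) / 1 = 0.75
```

For $q \neq 1$ the entropy is non-additive, which is what gives the resulting membership functions a different shape from the Shannon-entropy case.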

Different Approaches for Missing Data Handling in Fuzzy Clustering: A Review

Recent Advances in Electrical & Electronic Engineering (Formerly Recent Patents on Electrical & Electronic Engineering) ◽

10.2174/2352096512666191127121710 ◽

2020 ◽

Vol 13 (6) ◽

pp. 833-846

Author(s):

Sonia Goel ◽

Meena Tushir

Keyword(s):

Missing Data ◽

Fuzzy Clustering ◽

Incomplete Data ◽

Clustering Algorithm ◽

Linear Interpolation ◽

Performance Criteria ◽

Data Sets ◽

Data Set ◽

Fcm Clustering ◽

Missing Attributes

Introduction: Incomplete data sets containing missing attributes are a prevailing problem in many research areas. Attributes may be missing for several reasons: human error in tabulating or recording the data, machine failure, errors in data acquisition, or refusal of a patient or customer to answer some questions in a questionnaire or survey. Clustering such data sets is therefore a challenge. Objective: In this paper, we present a critical review of various methodologies proposed for handling missing data in clustering. The focus of this paper is the comparison of FCM clustering based on various imputation techniques with the four clustering strategies proposed by Hathaway and Bezdek. Methods: We impute the missing values in incomplete data sets using various imputation and non-imputation techniques to complete the data set, and then apply the conventional fuzzy clustering algorithm to obtain the clustering results. Results: Experiments are carried out on various synthetic data sets and real data sets from the UCI repository. To evaluate the performance of the various imputation and non-imputation based FCM clustering algorithms, several performance criteria and statistical tests are considered. Experimental results on various data sets show that linear interpolation based FCM clustering performs significantly better than the other imputation and non-imputation techniques. Conclusion: Clustering performance is data specific; no clustering technique gives good results on all data sets. It depends on both the data type and the percentage of missing attributes in the data set. Through this study, we have shown that the linear interpolation based FCM clustering algorithm can be used effectively for clustering incomplete data sets.
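
A linear-interpolation imputation step of the kind mentioned above might look like the following sketch. The abstract does not say how interpolation is applied, so this assumes it runs along each attribute column in record order, with missing values marked as `None` and edge gaps filled by the nearest observed value; the function names are hypothetical.

```python
def interpolate_column(values):
    """Fill None entries by linear interpolation between observed neighbours.

    Gaps at the start or end copy the nearest observed value (an assumption).
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("column has no observed values")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            t = (i - left) / (right - left)
            filled[i] = values[left] + t * (values[right] - values[left])
    return filled


def impute(rows):
    """Apply column-wise interpolation to a list of equal-length records."""
    columns = [interpolate_column(list(col)) for col in zip(*rows)]
    return [list(record) for record in zip(*columns)]


data = [[1.0, 10.0], [None, 20.0], [3.0, None]]
completed = impute(data)  # the completed rows can then be passed to FCM
```

Interpolating in record order only makes sense when neighbouring records are related (for example, time-ordered measurements); for unordered data a different imputation rule would be needed.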

Autoencoder-based cluster ensembles for single-cell RNA-seq data analysis

BMC Bioinformatics ◽

10.1186/s12859-019-3179-5 ◽

2019 ◽

Vol 20 (S19) ◽

Cited By ~ 5

Author(s):

Thomas A. Geddes ◽

Taiyun Kim ◽

Lihao Nan ◽

James G. Burchfield ◽

Jean Y. H. Yang ◽

...

Keyword(s):

Data Analysis ◽

Single Cell ◽

Clustering Algorithm ◽

Dimensional Space ◽

Clustering Algorithms ◽

Random Projection ◽

Computational Technique ◽

Cell Type ◽

Cluster Ensemble ◽

Cell Type Specific

Background: Single-cell RNA-sequencing (scRNA-seq) is a transformative technology, allowing global transcriptomes of individual cells to be profiled with high accuracy. An essential task in scRNA-seq data analysis is the identification of cell types from complex samples or tissues profiled in an experiment. To this end, clustering has become a key computational technique for grouping cells based on their transcriptome profiles, enabling subsequent cell type identification from each cluster of cells. Due to the high feature-dimensionality of the transcriptome (i.e. the large number of measured genes in each cell) and because only a small fraction of genes are cell type-specific and therefore informative for generating cell type-specific clusters, clustering directly on the original feature/gene dimension may lead to uninformative clusters and hinder correct cell type identification. Results: Here, we propose an autoencoder-based cluster ensemble framework in which we first take random subspace projections from the data, then compress each random projection to a low-dimensional space using an autoencoder artificial neural network, and finally apply ensemble clustering across all encoded datasets to generate clusters of cells. We employ four evaluation metrics to benchmark clustering performance and our experiments demonstrate that the proposed autoencoder-based cluster ensemble can lead to substantially improved cell type-specific clusters when applied with both the standard k-means clustering algorithm and a state-of-the-art kernel-based clustering algorithm (SIMLR) designed specifically for scRNA-seq data. Compared to directly using these clustering algorithms on the original datasets, the performance improvement in some cases is up to 100%, depending on the evaluation metric used. Conclusions: Our results suggest that the proposed framework can facilitate more accurate cell type identification as well as other downstream analyses. The code for creating the proposed autoencoder-based cluster ensemble framework is freely available from https://github.com/gedcom/scCCESS

Dimensionality reduction in the context of dynamic social media data streams

Evolving Systems ◽

10.1007/s12530-021-09396-z ◽

2021 ◽

Author(s):

Moritz Heusinger ◽

Christoph Raab ◽

Frank-Michael Schleif

Keyword(s):

Social Media ◽

Data Analysis ◽

Real World ◽

State Of The Art ◽

Concept Drift ◽

Principal Component ◽

Random Projection ◽

Streaming Data ◽

Real World Data ◽

Data Set

In recent years, social media has become an important part of everyday life for many people. A big challenge of social media is finding posts that are interesting to the user. Many social networks, such as Twitter, handle this problem with so-called hashtags. Users can label their own Tweets (posts) with a hashtag, while other users can search for posts containing a specified hashtag. But what about finding posts that are not labeled by the creator? We provide a way of completing hashtags for unlabeled posts using classification on a novel real-world Twitter data stream. New posts are created every second, so this context is well suited to non-stationary data analysis. Our goal is to show how labels (hashtags) of social media posts can be predicted by stream classifiers. In particular, we employ random projection (RP) as a preprocessing step in calculating streaming models. We also provide a novel real-world data set for streaming analysis, called NSDQ, with a comprehensive data description. We show that this data set is a real challenge for state-of-the-art stream classifiers. While RP has been widely used and evaluated in stationary data analysis scenarios, non-stationary environments are not well analyzed. In this paper, we provide a use case of RP on real-world streaming data, especially on the NSDQ data set. We discuss why RP can be used in this scenario and how it can handle stream-specific situations like concept drift. We also provide experiments with RP on streaming data, using state-of-the-art stream classifiers like adaptive random forest and concept drift detectors. Additionally, we experimentally evaluate an online principal component analysis (PCA) approach in the same fashion as we do for RP. To obtain higher-dimensional synthetic streams, we use random Fourier features (RFF) in an online manner, which allows us to increase the number of dimensions of low-dimensional streams.

Improved Fuzzy C-Means Based on the Optimal Number of Clusters

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.392.803 ◽

2013 ◽

Vol 392 ◽

pp. 803-807 ◽

Cited By ~ 1

Author(s):

Xue Bo Feng ◽

Fang Yao ◽

Zhi Gang Li ◽

Xiao Jing Yang

Keyword(s):

Convergence Rate ◽

Clustering Algorithm ◽

Optimal Number ◽

Data Set ◽

Number Of Clusters ◽

Fuzzy C Means ◽

Initial Cluster ◽

Fuzzy C Means Clustering ◽

Fcm Clustering ◽

Optimal Number Of Clusters

The fuzzy c-means (FCM) clustering algorithm clusters a data set according to the number of cluster centers, the initial cluster centers, the fuzzy factor, the number of iterations, and the threshold, so it is sensitive to how the clustering prototypes are initialized. First, the article combines the maximum and minimum distance algorithm with the K-means algorithm to determine the number of clusters and the initial cluster centers. Second, it determines the optimal number of clusters using the Silhouette index. Finally, it improves the convergence rate of FCM by repeatedly revising the memberships. The improved FCM clusters well, enhances optimization capability, and improves the efficiency and effectiveness of clustering. Compared with the traditional FCM clustering method, it achieves better within-class compactness, between-class separation, and cluster stability, as well as a faster convergence rate.
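
The maximum and minimum distance step used for initialization can be sketched as below. This shows only the classic max-min seeding rule, not the article's combined procedure with K-means and the Silhouette index; the stopping ratio `theta`, the choice of the first point as the first center, and the toy data are assumptions.

```python
import math


def max_min_centers(points, theta=0.3):
    """Pick initial centers with the max-min distance rule.

    Start from the first point, add the point farthest from it, then keep
    adding the point whose distance to its nearest center is largest, as long
    as that distance exceeds theta times the distance between the first two
    centers.
    """
    centers = [points[0]]
    second = max(points, key=lambda p: math.dist(p, centers[0]))
    base = math.dist(second, centers[0])
    if base == 0.0:
        return centers  # all points coincide
    centers.append(second)
    while True:
        candidate = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        gap = min(math.dist(candidate, c) for c in centers)
        if gap <= theta * base:
            break
        centers.append(candidate)
    return centers


# Toy data (assumed): three tight groups along a line.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0), (10.0, 0.0), (10.0, 0.1)]
centers = max_min_centers(points, theta=0.3)
```

The number of centers returned doubles as an estimate of the number of clusters, and the centers themselves can seed K-means or FCM; a larger `theta` yields fewer centers.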