Improved K-MEANS Algorithm Based on Samples

Background: RNA methylome has been discovered as an important layer of gene regulation and can be profiled directly with count-based measurements from high-throughput sequencing data. Although the detailed regulatory circuit of the epitranscriptome remains uncharted, clustering effect in methylation status among different RNA methylation sites can be identified from transcriptome-wide RNA methylation profiles and may reflect the epitranscriptomic regulation. Count-based RNA methylation sequencing data has unique features, such as low reads coverage, which calls for novel clustering approaches. Objective: Besides the low reads coverage, it is also necessary to keep the integer property to approach clustering analysis of count-based RNA methylation sequencing data. Method: We proposed a nonparametric generative model together with its Gibbs sampling solution for clustering analysis. The proposed approach implements a beta-binomial mixture model to capture the clustering effect in methylation level with the original count-based measurements rather than an estimated continuous methylation level. Besides, it adopts a nonparametric Dirichlet process to automatically determine an optimal number of clusters so as to avoid the common model selection problem in clustering analysis. Results: When tested on the simulated system, the method demonstrated improved clustering performance over hierarchical clustering, K-means, MClust, NMF and EMclust. It also revealed on real dataset two novel RNA N6-methyladenosine (m6A) co-methylation patterns that may be induced directly by METTL14 and WTAP, which are two known regulatory components of the RNA m6A methyltransferase complex. Conclusion: Our proposed DPBBM method not only properly handles the count-based measurements of RNA methylation data from sites of very low reads coverage, but also learns an optimal number of clusters adaptively from the data analyzed. Availability: The source code and documents of DPBBM R package are freely available through the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/DPBBM/.

Download Full-text

Precise clustering analysis of Internet financial credit reporting dependent on multidimensional attribute sparse large data

International Journal of Electrical Engineering Education ◽

10.1177/00207209211002086 ◽

2021 ◽

pp. 002072092110020

Author(s):

Lingling Chen ◽

Yuanyuan Zhang ◽

Min Zeng

Keyword(s):

Clustering Analysis ◽

Large Data ◽

Distance Matrix ◽

The Internet ◽

Relationship Matrix ◽

Clustering Methods ◽

Correlation Relationship ◽

Credit Reporting ◽

Financial Credit ◽

Approximate Distance

Given that the traditional methods cannot perform clustering analysis on the Internet financial credit reporting directly and effectively, a kind of precise clustering analysis of internet financial credit reporting dependent on multidimensional attribute sparse large data is proposed. By measuring the overall distance between Internet financial credit reporting through the sparse large data with multidimensional attributes, the multidimensional attribute sparse large data are used to perform clustering analysis on the overall distance matrix and the component approximate distance matrix between the data, respectively. The correlation relationship between the Internet financial credit reporting under these two perspectives is taken into comprehensive consideration. Multidimensional attribute sparse large data pairs are used to reflect the comprehensive relationship matrix of the original Internet financial credit reporting to achieve clustering with relatively high quality. Numerical experiments show that compared with the traditional clustering methods, the method proposed in this paper can not only reflect the overall data features effectively, but also improve the clustering effect of the original Internet financial credit reporting data through the analysis of the correlation relationship between the important component attribute sequences.

Download Full-text

An Optimizing Method of Competitive Neural Network

Key Engineering Materials ◽

10.4028/www.scientific.net/kem.467-469.894 ◽

2011 ◽

Vol 467-469 ◽

pp. 894-899

Author(s):

Hong Men ◽

Hai Yan Liu ◽

Lei Wang ◽

Yun Peng Pan

Keyword(s):

Neural Network ◽

Network Structure ◽

Clustering Analysis ◽

Learning Rate ◽

Optimum Number ◽

Number Of Clusters ◽

Competitive Neural Network ◽

Output Neurons ◽

Simulation Results ◽

Strong Ability

This paper presents an optimizing method of competitive neural network(CNN):During clustering analysis fixed on the optimum number of output neurons according to the change of DB value，and then adjusted connected weight including increasing ,dividing , delete. Each neuron had the different variety trend of learning rate according with the change of the probability of neurons. The optimizing method made classification more accurate. Simulation results showed that optimized network structure had a strong ability to adjust the number of clusters dynamically and good results of classification.

Download Full-text

Multi-Aspect Incremental Tensor Decomposition Based on Distributed In-Memory Big Data Systems

Journal of Data and Information Science ◽

10.2478/jdis-2020-0010 ◽

2020 ◽

Vol 5 (2) ◽

pp. 13-32

Author(s):

Hye-Kyung Yang ◽

Hwan-Seung Yong

Keyword(s):

Execution Time ◽

Three Dimensional ◽

Large Data ◽

Tensor Decomposition ◽

Decomposition Methods ◽

Apache Spark ◽

Decomposition Algorithms ◽

Data Systems ◽

Big Data Systems ◽

Spark Framework

AbstractPurposeWe propose InParTen2, a multi-aspect parallel factor analysis three-dimensional tensor decomposition algorithm based on the Apache Spark framework. The proposed method reduces re-decomposition cost and can handle large tensors.Design/methodology/approachConsidering that tensor addition increases the size of a given tensor along all axes, the proposed method decomposes incoming tensors using existing decomposition results without generating sub-tensors. Additionally, InParTen2 avoids the calculation of Khari–Rao products and minimizes shuffling by using the Apache Spark platform.FindingsThe performance of InParTen2 is evaluated by comparing its execution time and accuracy with those of existing distributed tensor decomposition methods on various datasets. The results confirm that InParTen2 can process large tensors and reduce the re-calculation cost of tensor decomposition. Consequently, the proposed method is faster than existing tensor decomposition algorithms and can significantly reduce re-decomposition cost.Research limitationsThere are several Hadoop-based distributed tensor decomposition algorithms as well as MATLAB-based decomposition methods. However, the former require longer iteration time, and therefore their execution time cannot be compared with that of Spark-based algorithms, whereas the latter run on a single machine, thus limiting their ability to handle large data.Practical implicationsThe proposed algorithm can reduce re-decomposition cost when tensors are added to a given tensor by decomposing them based on existing decomposition results without re-decomposing the entire tensor.Originality/valueThe proposed method can handle large tensors and is fast within the limited-memory framework of Apache Spark. Moreover, InParTen2 can handle static as well as incremental tensor decomposition.

Download Full-text

Effective Utilization of Resources Through Optimal Allocation and Opportunistic Migration of Virtual Machines in Cloud Environment

International Journal of Cloud Applications and Computing ◽

10.4018/ijcac.2021070105 ◽

2021 ◽

Vol 11 (3) ◽

pp. 72-91

Author(s):

Priyanka H. ◽

Mary Cherian

Keyword(s):

Data Center ◽

Execution Time ◽

Data Centers ◽

Optimal Allocation ◽

Virtual Machines ◽

Large Data ◽

Migration Time ◽

Resource Usage ◽

Total Execution Time ◽

Effective Utilization

Cloud computing has become more prominent, and it is used in large data centers. Distribution of well-organized resources (bandwidth, CPU, and memory) is the major problem in the data centers. The genetically enhanced shuffling frog leaping algorithm (GESFLA) framework is proposed to select the optimal virtual machines to schedule the tasks and allocate them in physical machines (PMs). The proposed GESFLA-based resource allocation technique is useful in minimizing the wastage of resource usage and also minimizes the power consumption of the data center. The proposed GESFL algorithm is compared with task-based particle swarm optimization (TBPSO) for efficiency. The experimental results show the excellence of GESFLA over TBPSO in terms of resource usage ratio, migration time, and total execution time. The proposed GESFLA framework reduces the energy consumption of data center up to 79%, migration time by 67%, and CPU utilization is improved by 9% for Planet Lab workload traces. For the random workload, the execution time is minimized by 71%, transfer time is reduced up to 99%, and the CPU consumption is improved by 17% when compared to TBPSO.

Download Full-text

Automatic Partitioning of Large Scale Simulation in Grid Computing for Run Time Reduction

Innovations in Information Systems for Business Functionality and Operations Management ◽

10.4018/978-1-4666-0933-4.ch014 ◽

2012 ◽

pp. 225-252

Author(s):

Nurcin Celik ◽

Esfandyar Mazhari ◽

John Canby ◽

Omid Kazemi ◽

Parag Sarfare ◽

...

Keyword(s):

Execution Time ◽

Large Scale ◽

Time Synchronization ◽

Computational Grid ◽

Experimental Results ◽

Time Interval ◽

Computational Power ◽

Large Scale Systems ◽

Large Scale Simulations ◽

Reduce Execution Time

Simulating large-scale systems usually entails exhaustive computational powers and lengthy execution times. The goal of this research is to reduce execution time of large-scale simulations without sacrificing their accuracy by partitioning a monolithic model into multiple pieces automatically and executing them in a distributed computing environment. While this partitioning allows us to distribute required computational power to multiple computers, it creates a new challenge of synchronizing the partitioned models. In this article, a partitioning methodology based on a modified Prim’s algorithm is proposed to minimize the overall simulation execution time considering 1) internal computation in each of the partitioned models and 2) time synchronization between them. In addition, the authors seek to find the most advantageous number of partitioned models from the monolithic model by evaluating the tradeoff between reduced computations vs. increased time synchronization requirements. In this article, epoch- based synchronization is employed to synchronize logical times of the partitioned simulations, where an appropriate time interval is determined based on the off-line simulation analyses. A computational grid framework is employed for execution of the simulations partitioned by the proposed methodology. The experimental results reveal that the proposed approach reduces simulation execution time significantly while maintaining the accuracy as compared with the monolithic simulation execution approach.

Download Full-text

Models for Internal Clustering Validation Indexes Based on Hadoop-MapReduce

International Journal of Distributed Systems and Technologies ◽

10.4018/ijdst.2020070103 ◽

2020 ◽

Vol 11 (3) ◽

pp. 42-67

Author(s):

Soumeya Zerabi ◽

Souham Meshoul ◽

Samia Chikhi Boucherkha

Keyword(s):

Clustering Algorithms ◽

Large Data ◽

Optimal Number ◽

Data Sets ◽

Data Set ◽

Number Of Clusters ◽

Distributed Models ◽

Hadoop Mapreduce ◽

Distributed Solutions ◽

Clustering Validation

Cluster validation aims to both evaluate the results of clustering algorithms and predict the number of clusters. It is usually achieved using several indexes. Traditional internal clustering validation indexes (CVIs) are mainly based in computing pairwise distances which results in a quadratic complexity of the related algorithms. The existing CVIs cannot handle large data sets properly and need to be revisited to take account of the ever-increasing data set volume. Therefore, design of parallel and distributed solutions to implement these indexes is required. To cope with this issue, the authors propose two parallel and distributed models for internal CVIs namely for Silhouette and Dunn indexes using MapReduce framework under Hadoop. The proposed models termed as MR_Silhouette and MR_Dunn have been tested to solve both the issue of evaluating the clustering results and identifying the optimal number of clusters. The results of experimental study are very promising and show that the proposed parallel and distributed models achieve the expected tasks successfully.

Download Full-text

Automatic Partitioning of Large Scale Simulation in Grid Computing for Run Time Reduction

International Journal of Operations Research and Information Systems ◽

10.4018/joris.2010040105 ◽

2010 ◽

Vol 1 (2) ◽

pp. 64-90 ◽

Cited By ~ 6

Author(s):

Nurcin Celik ◽

Esfandyar Mazhari ◽

John Canby ◽

Omid Kazemi ◽

Parag Sarfare ◽

...

Keyword(s):

Execution Time ◽

Large Scale ◽

Time Synchronization ◽

Computational Grid ◽

Time Interval ◽

Computational Power ◽

Computing Environment ◽

Large Scale Systems ◽

Large Scale Simulations ◽

Reduce Execution Time

Simulating large-scale systems usually entails exhaustive computational powers and lengthy execution times. The goal of this research is to reduce execution time of large-scale simulations without sacrificing their accuracy by partitioning a monolithic model into multiple pieces automatically and executing them in a distributed computing environment. While this partitioning allows us to distribute required computational power to multiple computers, it creates a new challenge of synchronizing the partitioned models. In this article, a partitioning methodology based on a modified Prim’s algorithm is proposed to minimize the overall simulation execution time considering 1) internal computation in each of the partitioned models and 2) time synchronization between them. In addition, the authors seek to find the most advantageous number of partitioned models from the monolithic model by evaluating the tradeoff between reduced computations vs. increased time synchronization requirements. In this article, epoch- based synchronization is employed to synchronize logical times of the partitioned simulations, where an appropriate time interval is determined based on the off-line simulation analyses. A computational grid framework is employed for execution of the simulations partitioned by the proposed methodology. The experimental results reveal that the proposed approach reduces simulation execution time significantly while maintaining the accuracy as compared with the monolithic simulation execution approach.

Download Full-text