HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks

Community structure detection is one of the major research areas of network science, and it is particularly useful for applications on large real networks. This work presents an in-depth study of the most widely discussed modularity-based algorithms for community detection: Newman’s spectral method with a fine-tuning stage, and the method of Clauset, Newman, and Moore (CNM) with its variants. The computational complexity of the algorithms is analysed to develop high-performance code that accelerates their execution without compromising the quality of the results, as measured by modularity. The implemented code generates partitions with modularity values consistent with the literature, and it handles networks of more than 1 million nodes with Newman’s spectral method. The code was applied to a wide range of real networks, and the performance of the algorithms is evaluated.
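
As a minimal sketch of the modularity measure that both algorithm families above optimize, the standard Newman-Girvan definition can be written per community as Q = Σ_c [L_c/m − (d_c/2m)²], where m is the number of edges, L_c the number of edges inside community c, and d_c the summed degree of its nodes. The function and argument names below are illustrative and do not come from the paper's code.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman-Girvan modularity of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = 0
    intra = defaultdict(int)   # edges with both endpoints in community c
    degree = defaultdict(int)  # summed node degree per community
    for u, v in edges:
        m += 1
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    if m == 0:
        return 0.0
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(edges, split))  # 5/14, about 0.357
```

Placing every node in a single community gives Q = 0, which is why modularity-based methods look for partitions that score above that baseline.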

Parallel Cleaning Algorithm for Similar Duplicate Chinese Data Based on BERT

Scientific Programming ◽

10.1155/2021/5916748 ◽

2021 ◽

Vol 2021 ◽

pp. 1-11

Author(s):

Biqiu Li ◽

Jiabin Wang ◽

Xueli Liu

Keyword(s):

Data Mining ◽

Large Scale ◽

Clustering Algorithm ◽

Parallel Implementation ◽

Data Cleaning ◽

Position Vector ◽

Data Sets ◽

Implementation Scheme ◽

Mining Work ◽

Context Features

Data is an important source of knowledge discovery, but similar duplicate data not only increases database redundancy but also affects subsequent data mining work. Cleaning similar duplicate data helps improve work efficiency. Given the complexity of the Chinese language and the performance bottleneck of single-machine systems in large-scale data computing, this paper proposes a Chinese data cleaning method that combines the BERT model with a k-means clustering algorithm and gives a parallel implementation scheme for the algorithm. In the text-to-vector step, a position vector is introduced to obtain the context features of words, and the vector is dynamically adjusted according to the semantics so that polysemous words obtain different vector representations in different contexts. The parallel implementation of this process is designed on Hadoop. The k-means clustering algorithm is then used to cluster similar duplicate data and thereby clean it. Experimental results on a variety of data sets show that the proposed parallel cleaning algorithm not only has good speedup and scalability but also improves the precision and recall of similar duplicate data cleaning, which is of great significance for subsequent data mining.

Meditation-induced effects on whole-brain structural and effective connectivity

10.1101/2021.06.10.447903 ◽

2021 ◽

Author(s):

Eleonora De Filippi ◽

Anira Escrichs ◽

Matthieu Gilson ◽

Marti Sanchez-Fibla ◽

Estela Camara ◽

...

Keyword(s):

High Performance ◽

Large Scale ◽

Effective Connectivity ◽

Temporal Structure ◽

Functional Domain ◽

Anatomical Connectivity ◽

Whole Brain ◽

Spatio Temporal ◽

Large Scale Networks ◽

And Function

In the past decades, there has been a growing scientific interest in characterizing neural correlates of meditation training. Nonetheless, the mechanisms underlying meditation remain elusive. In the present work, we investigated meditation-related changes in structural and functional connectivities (SC and FC, respectively). For this purpose, we scanned experienced meditators and control (naive) subjects using magnetic resonance imaging (MRI) to acquire structural and functional data during two conditions, resting-state and meditation (focused attention on breathing). In this way, we aimed to characterize and distinguish both short-term and long-term modifications in the brain's structure and function. First, we performed a network-based analysis of anatomical connectivity. Then, to analyze the fMRI data, we calculated whole-brain effective connectivity (EC) estimates, relying on a dynamical network model to replicate BOLD signals' spatio-temporal structure, akin to FC with lagged correlations. We compared the estimated EC, FC, and SC links as features to train classifiers to predict behavioral conditions and group identity. The whole-brain SC analysis revealed strengthened anatomical connectivity across large-scale networks for meditators compared to controls. We found that differences in SC were reflected in the functional domain as well. We demonstrated through a machine-learning approach that EC features were more informative than FC or SC alone. Using EC features, we reached high performance for the condition-based classification within each group and moderately high accuracies when comparing the two groups in each condition. Moreover, we showed that the most informative EC links that discriminated between meditators and controls involved the same large-scale networks previously found to have increased anatomical connectivity. Overall, the results of our whole-brain model-based approach revealed a mechanism underlying meditation by providing causal relationships at the structure-function level.

Clustering Algorithm for Community Detection in Complex Network: A Comprehensive Review

Recent Advances in Computer Science and Communications ◽

10.2174/2213275912666190710183635 ◽

2020 ◽

Vol 13 (4) ◽

pp. 542-549

Author(s):

Smita Agrawal ◽

Atul Patel

Keyword(s):

Complex Network ◽

Community Detection ◽

Large Scale ◽

Clustering Algorithm ◽

Detection Algorithm ◽

Connected Subgraph ◽

Comprehensive Review ◽

Detection Algorithms ◽

Community Detection Algorithm ◽

Large Scale Networks

Many real-world social networks take the form of complex networks, which include very large-scale networks with structured or unstructured data and sets of graphs. Such complex networks appear as brain graphs, protein structures, food webs, transportation systems, and the World Wide Web; these networks are sparsely connected overall, while most of their subgraphs are densely connected. The scale of large graphs, efficient graph generation, complexity, the dynamic nature of graphs, and community detection are all challenging tasks. This paper discusses various community detection algorithms that use clustering techniques to find densely connected subgraphs in large-scale complex networks. We discuss the taxonomy of community detection algorithms such as the Structural Clustering Algorithm for Networks (SCAN), Structural-Attribute based Cluster (SA-cluster), and Community Detection based on Hierarchical Clustering (CDHC). In this comprehensive review, we classify community detection algorithms by their approach, the datasets used in their experimental studies, and the measures used to evaluate them. Finally, insights into the future scope and research opportunities for community detection are discussed.

Parallel PPI Prediction Performance Study on HPC Platforms

Journal of Circuits System and Computers ◽

10.1142/s0218126615500747 ◽

2015 ◽

Vol 24 (05) ◽

pp. 1550074 ◽

Cited By ~ 1

Author(s):

Ali A. El-Moursy ◽

Wael S. Afifi ◽

Fadi N. Sibai ◽

Salwa M. Nassar

Keyword(s):

Protein Interactions ◽

Execution Time ◽

High Performance ◽

Large Scale ◽

Parallel Implementation ◽

Prediction Method ◽

Protein Protein Interactions ◽

Performance Study ◽

Ppi Prediction ◽

Performance Computing

STRIKE is an algorithm that predicts protein–protein interactions (PPIs) by determining that proteins interact if they contain similar substrings of amino acids. STRIKE achieves a reasonable improvement over existing PPI prediction methods. Despite its high accuracy as a PPI prediction method, STRIKE has a long execution time and is therefore considered a compute-intensive application. In this paper, we develop and implement a parallel STRIKE algorithm for high-performance computing (HPC) systems. Using a large-scale cluster, the execution time of the parallel implementation of this bioinformatics algorithm was reduced from about a week on a serial uniprocessor machine to about 16.5 h on 16 computing nodes, and down to about 2 h on 128 parallel nodes. Communication overheads between nodes are thoroughly studied.
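
The reported timings can be turned into rough speedup and parallel-efficiency figures. The sketch below assumes that "about a week" means 168 hours, which the abstract does not state exactly, so the results are approximate; the function name is illustrative.

```python
def speedup_and_efficiency(serial_hours, parallel_hours, nodes):
    """Return (speedup, efficiency) relative to the serial run."""
    speedup = serial_hours / parallel_hours
    return speedup, speedup / nodes


SERIAL_HOURS = 7 * 24  # assumption: "about a week" taken as 168 h

for nodes, hours in [(16, 16.5), (128, 2.0)]:
    s, e = speedup_and_efficiency(SERIAL_HOURS, hours, nodes)
    print(f"{nodes} nodes: speedup ~{s:.1f}x, efficiency ~{e:.0%}")
```

Under that assumption the speedup is roughly 10x on 16 nodes and 84x on 128 nodes, with efficiency staying near 65% in both cases.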

Parallel Implementation of Improved K-Means Based on a Cloud Platform

Information Technology And Control ◽

10.5755/j01.itc.48.4.23881 ◽

2019 ◽

Vol 48 (4) ◽

pp. 673-681

Author(s):

Shufen Zhang ◽

Zhiyu Liu ◽

Xuebin Chen ◽

Changyin Luo

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Programming Model ◽

Parallel Implementation ◽

Clustering Algorithms ◽

Data Set ◽

Large Scale Data ◽

Sample Density ◽

Scale Data ◽

Selection Of

To address the difficulty that the traditional K-Means clustering algorithm has with large-scale data sets, a Hadoop K-Means (referred to as HKM) clustering algorithm is proposed. First, the algorithm uses sample density to eliminate the effects of noise points in the data set. Second, it optimizes the selection of the initial center points using the max-min distance idea. Finally, it uses the MapReduce programming model to parallelize the computation. Experimental results show that the proposed algorithm not only has high accuracy and stability in its clustering results, but can also solve the scalability problems encountered by traditional clustering algorithms when dealing with large-scale data.
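
The max-min distance idea for choosing initial centers is commonly implemented as farthest-first selection: each new center is the point whose distance to its nearest already-chosen center is largest. The sketch below follows that common reading and is not the paper's HKM code. Seeding with the first point is an assumption, the function name is hypothetical, and the density-based noise filtering and MapReduce steps are omitted.

```python
import math


def max_min_centers(points, k):
    """Pick k initial centers by max-min (farthest-first) selection."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    centers = [points[0]]  # assumption: seed with the first point
    # min_dist[i] is the distance from points[i] to its nearest chosen center.
    min_dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # The next center is the point farthest from all current centers.
        idx = max(range(len(points)), key=min_dist.__getitem__)
        centers.append(points[idx])
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[idx]))
    return centers


points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 8)]
print(max_min_centers(points, 3))  # [(0, 0), (10, 1), (5, 8)]
```

Spreading the initial centers this way avoids starting K-Means with several centers inside the same dense region, which is one plausible source of the stability the abstract reports.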

GPLEXUS: enabling genome-scale gene association network reconstruction and analysis for very large-scale expression data

Nucleic Acids Research ◽

10.1093/nar/gkt983 ◽

2013 ◽

Vol 42 (5) ◽

pp. e32-e32 ◽

Cited By ~ 6

Author(s):

Jun Li ◽

Hairong Wei ◽

Tingsong Liu ◽

Patrick Xuechun Zhao

Keyword(s):

High Performance ◽

Large Scale ◽

Clustering Algorithm ◽

Biotic And Abiotic Stress ◽

Experimental Conditions ◽

Gene Association ◽

Genome Wide ◽

Genome Level ◽

Novel Algorithms ◽

Computational Resources

The accurate construction and interpretation of gene association networks (GANs) is challenging, but crucial, to the understanding of gene function, interaction and cellular behavior at the genome level. Most current state-of-the-art computational methods for genome-wide GAN reconstruction require high-performance computational resources. However, even high-performance computing cannot fully address the complexity involved in constructing GANs from very large-scale expression profile datasets, especially for organisms with medium to large genomes, such as most plant species. Here, we present a new approach, GPLEXUS (http://plantgrn.noble.org/GPLEXUS/), which integrates a series of novel algorithms in a parallel-computing environment to construct and analyze genome-wide GANs. GPLEXUS adopts an ultra-fast estimation for pairwise mutual information computing that is similar in accuracy and sensitivity to the Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) method and runs ∼1000 times faster. GPLEXUS integrates the Markov Clustering Algorithm to effectively identify functional subnetworks. Furthermore, GPLEXUS includes a novel ‘condition-removing’ method to identify the major experimental conditions in which each subnetwork operates from very large-scale gene expression datasets across several experimental conditions, which allows users to annotate the various subnetworks with experiment-specific conditions. We demonstrate GPLEXUS’s capabilities by constructing global GANs and analyzing subnetworks related to defense against biotic and abiotic stress, cell cycle growth and division in Arabidopsis thaliana.
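
For readers unfamiliar with Markov clustering, the following is a minimal dense-matrix sketch of the textbook MCL loop: add self-loops, make the matrix column-stochastic, then alternate expansion (matrix squaring) and inflation (element-wise power followed by renormalization) until the matrix stops changing. It is not GPLEXUS's or HipMCL's implementation. The default parameters, the 1e-6 membership threshold, and all function names are assumptions, and practical implementations use sparse matrices with pruning.

```python
def normalize_columns(matrix):
    """Scale each column so that it sums to 1 (column-stochastic)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Dense square matrix product; adequate for small illustrative graphs."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster a graph given as a square adjacency matrix (list of lists)."""
    n = len(adjacency)
    if n == 0:
        return []
    # Self-loops are a standard MCL preprocessing step.
    m = normalize_columns(
        [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(max_iter):
        expanded = matmul(m, m)  # expansion: two-step random walks
        inflated = normalize_columns(
            [[x ** inflation for x in row] for row in expanded])  # inflation
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row that keeps mass is an attractor; its nonzero columns are the
    # nodes of one cluster. Identical rows collapse into a single cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Two disconnected triangles: each component becomes its own cluster.
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(mcl(two_triangles))  # [[0, 1, 2], [3, 4, 5]]
```

A larger inflation value generally yields finer-grained clusters, which is the main tuning knob when applying MCL to association networks.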
