HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks

2018 ◽  
Vol 46 (6) ◽  
pp. e33-e33 ◽  
Author(s):  
Ariful Azad ◽  
Georgios A Pavlopoulos ◽  
Christos A Ouzounis ◽  
Nikos C Kyrpides ◽  
Aydin Buluç


2014 ◽  
Vol 2014 ◽  
pp. 1-15 ◽  
Author(s):  
Vinícius da Fonseca Vieira ◽  
Carolina Ribeiro Xavier ◽  
Nelson Francisco Favilla Ebecken ◽  
Alexandre Gonçalves Evsukoff

Community structure detection is one of the major research areas of network science and is particularly useful for applications involving large real networks. This work presents an in-depth study of the most discussed algorithms for community detection based on the modularity measure: Newman’s spectral method with a fine-tuning stage and the method of Clauset, Newman, and Moore (CNM) with its variants. The computational complexity of the algorithms is analysed to develop high-performance code that accelerates their execution without compromising the quality of the results, as judged by the modularity measure. The implemented code generates partitions with modularity values consistent with the literature and handles networks of more than 1 million nodes with Newman’s spectral method. The code was applied to a wide range of real networks, and the performance of the algorithms is evaluated.
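
As a minimal illustration of the modularity measure that both families of algorithms optimize, the sketch below computes Newman's modularity Q for a given partition of an undirected, unweighted graph. The function name `modularity` and the toy graph are assumptions made for this example; the sketch is not the paper's high-performance code, and it implements neither the spectral method nor CNM.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each edge listed once.
    community: dict mapping node -> community id.
    """
    m = 0
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        m += 1
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    # Q = sum over communities of (fraction of internal edges)
    #     minus (expected fraction under the configuration model).
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Toy graph (hypothetical): two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
q_split = modularity(edges, {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"})
q_single = modularity(edges, {n: "a" for n in range(6)})
print(q_split, q_single)
```

Splitting the toy graph into its two triangles scores higher than placing every node in one community, which always yields Q = 0.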


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Biqiu Li ◽  
Jiabin Wang ◽  
Xueli Liu

Data is an important source for knowledge discovery, but similar duplicate records not only increase database redundancy but also hamper subsequent data mining. Cleaning similar duplicate data therefore improves work efficiency. Motivated by the complexity of the Chinese language and by the performance bottleneck of single-machine systems on large-scale data, this paper proposes a Chinese data cleaning method that combines the BERT model with the k-means clustering algorithm and gives a parallel implementation scheme for it. When converting text to vectors, a position vector is introduced to capture the context features of words, and the vector is dynamically adjusted according to the semantics so that polysemous words obtain different vector representations in different contexts. The parallel implementation of this process is designed on Hadoop. The k-means clustering algorithm is then used to cluster similar duplicate data for cleaning. Experimental results on a variety of data sets show that the proposed parallel cleaning algorithm not only has good speedup and scalability but also improves the precision and recall of similar duplicate data cleaning, which is of great significance for subsequent data mining.
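
The clustering step can be sketched as plain (serial) Lloyd's k-means over record vectors. In the sketch below, the two-dimensional tuples are hypothetical stand-ins for BERT embeddings, and the function name `kmeans` and the choice of initial centers are assumptions for illustration; the Hadoop parallelization described in the abstract is not shown.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Plain Lloyd's k-means; returns (centers, labels)."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

# Hypothetical stand-ins for text embeddings of five records.
vectors = [(0.9, 0.1), (1.0, 0.0), (0.95, 0.05), (0.0, 1.0), (0.1, 0.9)]
centers, labels = kmeans(vectors, centers=[vectors[0], vectors[3]])
print(labels)
```

Records that share a label are candidates for being similar duplicates; a real pipeline would still need a rule for deciding which record in each cluster to keep.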


2021 ◽  
Author(s):  
Eleonora De Filippi ◽  
Anira Escrichs ◽  
Matthieu Gilson ◽  
Marti Sanchez-Fibla ◽  
Estela Camara ◽  
...  

In the past decades, there has been a growing scientific interest in characterizing neural correlates of meditation training. Nonetheless, the mechanisms underlying meditation remain elusive. In the present work, we investigated meditation-related changes in structural and functional connectivities (SC and FC, respectively). For this purpose, we scanned experienced meditators and control (naive) subjects using magnetic resonance imaging (MRI) to acquire structural and functional data during two conditions, resting-state and meditation (focused attention on breathing). In this way, we aimed to characterize and distinguish both short-term and long-term modifications in the brain's structure and function. First, we performed a network-based analysis of anatomical connectivity. Then, to analyze the fMRI data, we calculated whole-brain effective connectivity (EC) estimates, relying on a dynamical network model to replicate BOLD signals' spatio-temporal structure, akin to FC with lagged correlations. We compared the estimated EC, FC, and SC links as features to train classifiers to predict behavioral conditions and group identity. The whole-brain SC analysis revealed strengthened anatomical connectivity across large-scale networks for meditators compared to controls. We found that differences in SC were reflected in the functional domain as well. We demonstrated through a machine-learning approach that EC features were more informative than FC or SC features alone. Using EC features, we reached high performance for the condition-based classification within each group and moderately high accuracies when comparing the two groups in each condition. Moreover, we showed that the most informative EC links that discriminated between meditators and controls involved the same large-scale networks previously found to have increased anatomical connectivity. Overall, the results of our whole-brain model-based approach revealed a mechanism underlying meditation by providing causal relationships at the structure-function level.


2020 ◽  
Vol 13 (4) ◽  
pp. 542-549
Author(s):  
Smita Agrawal ◽  
Atul Patel

Many real-world social networks take the form of complex networks, including very large-scale networks with structured or unstructured data and sets of graphs. Complex networks appear as brain graphs, protein structures, food webs, transportation systems, and the World Wide Web; they are sparsely connected overall, while most of their subgraphs are densely connected. For large-scale graphs, efficient graph generation, complexity, the dynamic nature of graphs, and community detection are challenging tasks. This paper discusses various community detection algorithms that use clustering techniques to find densely connected subgraphs in a large-scale complex network. We present a taxonomy of community detection algorithms such as the Structural Clustering Algorithm for Networks (SCAN), Structural-Attribute based Cluster (SA-cluster), and Community Detection based on Hierarchical Clustering (CDHC). In this comprehensive review, we classify community detection algorithms by their approach, the datasets used in their experimental studies, and the measures used to evaluate them. Finally, we discuss insights into the future scope and research opportunities for community detection.
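
To make one of the surveyed approaches concrete, the sketch below computes the structural similarity on which SCAN is commonly described as being built: the number of shared closed neighbors of two nodes, normalized by the geometric mean of their closed-neighborhood sizes. The abstract itself gives no formula, so this definition, the function name `structural_similarity`, and the toy graph are assumptions for illustration; the full SCAN procedure (core detection with thresholds, hubs, and outliers) is not shown.

```python
import math

def structural_similarity(adj, u, v):
    """Shared closed neighbors of u and v, normalized by the geometric
    mean of the two closed-neighborhood sizes."""
    gu = adj[u] | {u}  # closed neighborhood: neighbors plus the node itself
    gv = adj[v] | {v}
    return len(gu & gv) / math.sqrt(len(gu) * len(gv))

# Toy graph (hypothetical): two triangles joined by the edge (2, 3).
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
inside = structural_similarity(adj, 0, 1)  # edge within a triangle
bridge = structural_similarity(adj, 2, 3)  # edge between the triangles
print(inside, bridge)
```

Edges inside a dense subgraph score close to 1, while the bridging edge scores lower, which is the signal a similarity threshold would use to separate the two groups.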


2015 ◽  
Vol 24 (05) ◽  
pp. 1550074 ◽  
Author(s):  
Ali A. El-Moursy ◽  
Wael S. Afifi ◽  
Fadi N. Sibai ◽  
Salwa M. Nassar

STRIKE is an algorithm that predicts protein–protein interactions (PPIs) by determining that proteins interact if they contain similar substrings of amino acids. STRIKE achieves a reasonable improvement over existing PPI prediction methods. Despite its high accuracy as a PPI prediction method, STRIKE has a long execution time and is therefore considered a compute-intensive application. In this paper, we develop and implement a parallel STRIKE algorithm for high-performance computing (HPC) systems. On a large-scale cluster, the execution time of the parallel implementation of this bioinformatics algorithm was reduced from about a week on a serial uniprocessor machine to about 16.5 h on 16 computing nodes and to about 2 h on 128 parallel nodes. Communication overheads between nodes are thoroughly studied.
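
The reported timings can be turned into rough speedup and parallel-efficiency figures. The sketch below assumes that "about a week" means exactly 168 hours, which the abstract does not state, so the results are indicative only; the function name `speedup_and_efficiency` is likewise an assumption for this example.

```python
def speedup_and_efficiency(serial_hours, parallel_hours, nodes):
    """Speedup = T_serial / T_parallel; efficiency = speedup / nodes."""
    speedup = serial_hours / parallel_hours
    return speedup, speedup / nodes

SERIAL_HOURS = 7 * 24  # assumption: "about a week" taken as exactly 168 h

s16, e16 = speedup_and_efficiency(SERIAL_HOURS, 16.5, 16)
s128, e128 = speedup_and_efficiency(SERIAL_HOURS, 2.0, 128)
print(f"16 nodes:  speedup {s16:.1f}x, efficiency {e16:.0%}")
print(f"128 nodes: speedup {s128:.1f}x, efficiency {e128:.0%}")
```

Under that assumption the speedup is roughly 10x on 16 nodes and 84x on 128 nodes, with efficiency staying near two thirds in both cases.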


2019 ◽  
Vol 48 (4) ◽  
pp. 673-681
Author(s):  
Shufen Zhang ◽  
Zhiyu Liu ◽  
Xuebin Chen ◽  
Changyin Luo

To address the difficulty the traditional K-Means clustering algorithm has with large-scale data sets, a Hadoop K-Means (referred to as HKM) clustering algorithm is proposed. First, the algorithm uses sample density to eliminate the effect of noise points in the data set. Second, it optimizes the selection of the initial center points using the max-min distance principle. Finally, it uses the MapReduce programming model for parallelization. Experimental results show that the proposed algorithm not only achieves high accuracy and stability in its clustering results but also solves the scalability problems that traditional clustering algorithms encounter with large-scale data.
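
The max-min distance idea for choosing initial centers can be sketched as follows: each new center is the point whose distance to its nearest already-chosen center is largest. This is a generic version of the technique under stated assumptions (the first point is used as the first center, and the name `max_min_centers` is hypothetical); the abstract does not specify HKM's exact variant, and the density-based noise removal and MapReduce steps are not shown.

```python
import math

def max_min_centers(points, k):
    """Pick k initial centers by the max-min distance rule."""
    centers = [points[0]]  # assumption: start from the first point
    while len(centers) < k:
        # Choose the point that is farthest from its nearest chosen center.
        next_center = max(
            points,
            key=lambda p: min(math.dist(p, c) for c in centers),
        )
        centers.append(next_center)
    return centers

# Toy data (hypothetical): three loose groups in the plane.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (0.0, 9.0)]
initial = max_min_centers(points, 3)
print(initial)
```

Because the rule favors far-apart points, it is sensitive to outliers, which may be why the abstract describes removing noise points before selecting the centers.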


2013 ◽  
Vol 42 (5) ◽  
pp. e32-e32 ◽  
Author(s):  
Jun Li ◽  
Hairong Wei ◽  
Tingsong Liu ◽  
Patrick Xuechun Zhao

The accurate construction and interpretation of gene association networks (GANs) is challenging, but crucial, to the understanding of gene function, interaction and cellular behavior at the genome level. Most current state-of-the-art computational methods for genome-wide GAN reconstruction require high-performance computational resources. However, even high-performance computing cannot fully address the complexity involved with constructing GANs from very large-scale expression profile datasets, especially for organisms with medium to large genomes, such as those of most plant species. Here, we present a new approach, GPLEXUS (http://plantgrn.noble.org/GPLEXUS/), which integrates a series of novel algorithms in a parallel-computing environment to construct and analyze genome-wide GANs. GPLEXUS adopts an ultra-fast estimation for pairwise mutual information computing that is similar in accuracy and sensitivity to the Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) method and runs ∼1000 times faster. GPLEXUS integrates the Markov Clustering Algorithm to effectively identify functional subnetworks. Furthermore, GPLEXUS includes a novel ‘condition-removing’ method to identify the major experimental conditions in which each subnetwork operates from very large-scale gene expression datasets across several experimental conditions, which allows users to annotate the various subnetworks with experiment-specific conditions. We demonstrate GPLEXUS’s capabilities by constructing global GANs and analyzing subnetworks related to defense against biotic and abiotic stress, cell cycle growth and division in Arabidopsis thaliana.
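
The Markov Clustering Algorithm mentioned here alternates two operations on a column-stochastic matrix: expansion (squaring the matrix) and inflation (raising entries to a power and renormalizing columns). The sketch below is a small dense, serial version for illustration only; the helper names, the default inflation of 2.0, the self-loop weight, and the toy graph are assumptions, and it is neither the GPLEXUS integration nor the sparse distributed implementation in HipMCL.

```python
def normalize_columns(m):
    """Scale each column so that it sums to 1 (column-stochastic)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mcl(adjacency, inflation=2.0, iterations=50, tol=1e-9):
    """Dense Markov clustering sketch; returns a set of frozensets of nodes."""
    n = len(adjacency)
    # Add self-loops (weight 1 is an assumption) and make columns stochastic.
    m = normalize_columns(
        [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(iterations):
        expanded = matmul(m, m)  # expansion: two-step random walks
        inflated = normalize_columns(  # inflation: strengthen strong flows
            [[x ** inflation for x in row] for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    # Each row with non-negligible entries lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters

# Toy graph (hypothetical): two triangles joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = [[0.0] * 6 for _ in range(6)]
for u, v in edges:
    A[u][v] = A[v][u] = 1.0
clusters = mcl(A)
print(sorted(sorted(c) for c in clusters))
```

Higher inflation values generally produce finer clusters. The dense matrix product here costs cubic time per iteration, which is why large networks call for sparse and parallel implementations.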

