A heterogeneous parallel implementation of the Markov clustering algorithm for large-scale biological networks on distributed CPU–GPU clusters

HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks

Nucleic Acids Research ◽

10.1093/nar/gkx1313 ◽

2018 ◽

Vol 46 (6) ◽

pp. e33-e33 ◽

Cited By ~ 28

Author(s):

Ariful Azad ◽

Georgios A Pavlopoulos ◽

Christos A Ouzounis ◽

Nikos C Kyrpides ◽

Aydin Buluç

Keyword(s):

High Performance ◽

Large Scale ◽

Clustering Algorithm ◽

Parallel Implementation ◽

Markov Clustering ◽

Large Scale Networks

Parallel Cleaning Algorithm for Similar Duplicate Chinese Data Based on BERT

Scientific Programming ◽

10.1155/2021/5916748 ◽

2021 ◽

Vol 2021 ◽

pp. 1-11

Author(s):

Biqiu Li ◽

Jiabin Wang ◽

Xueli Liu

Keyword(s):

Data Mining ◽

Large Scale ◽

Clustering Algorithm ◽

Parallel Implementation ◽

Data Cleaning ◽

Position Vector ◽

Data Sets ◽

Implementation Scheme ◽

Mining Work ◽

Context Features

Data is an important source of knowledge discovery, but similar duplicate data not only increases database redundancy but also impairs subsequent data mining work. Cleaning similar duplicate data helps improve work efficiency. Given the complexity of the Chinese language and the performance bottleneck of single-machine systems in large-scale data computation, this paper proposes a Chinese data cleaning method that combines the BERT model with a k-means clustering algorithm and gives a parallel implementation scheme for the algorithm. In the text-to-vector step, a position vector is introduced to obtain the context features of words, and the vector is dynamically adjusted according to the semantics so that polysemous words obtain different vector representations in different contexts. The parallel implementation of this process is designed on Hadoop. The k-means clustering algorithm is then used to cluster similar duplicate data and thereby clean it. Experimental results on a variety of data sets show that the proposed parallel cleaning algorithm not only has good speedup and scalability but also improves the precision and recall of similar duplicate data cleaning, which will be of great significance for subsequent data mining.

Parallel Implementation of Improved K-Means Based on a Cloud Platform

Information Technology And Control ◽

10.5755/j01.itc.48.4.23881 ◽

2019 ◽

Vol 48 (4) ◽

pp. 673-681

Author(s):

Shufen Zhang ◽

Zhiyu Liu ◽

Xuebin Chen ◽

Changyin Luo

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Programming Model ◽

Parallel Implementation ◽

Clustering Algorithms ◽

Data Set ◽

Large Scale Data ◽

Sample Density ◽

Scale Data ◽

Selection Of

To address the difficulty the traditional K-Means clustering algorithm has with large-scale data sets, a Hadoop K-Means (referred to as HKM) clustering algorithm is proposed. First, the algorithm uses the sample density to eliminate the effects of noise points in the data set. Second, it optimizes the selection of the initial center points using the max-min distance idea. Finally, it uses the MapReduce programming model to parallelize the computation. Experimental results show that the proposed algorithm not only achieves high accuracy and stability in its clustering results but also solves the scalability problems that traditional clustering algorithms encounter with large-scale data.
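The max-min distance idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the HKM implementation: the abstract does not say how the first center is chosen, so the `first` parameter is an assumption, and the function name `max_min_init` is hypothetical. The density-based noise removal and the MapReduce parallelization are not shown.

```python
import numpy as np

def max_min_init(points, k, first=0):
    """Pick k initial centers with the max-min distance rule.

    Assumption: the first center is the point at index `first`.
    Each further center is the point whose distance to its nearest
    already-chosen center is largest.
    """
    points = np.asarray(points, dtype=float)
    chosen = [first]
    # Distance from every point to its nearest chosen center so far.
    nearest = np.linalg.norm(points - points[first], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(nearest))
        chosen.append(nxt)
        dist_to_new = np.linalg.norm(points - points[nxt], axis=1)
        nearest = np.minimum(nearest, dist_to_new)
    return points[chosen]

# Toy data: three well-separated groups in the plane.
data = [[0, 0], [1, 0], [10, 0], [10, 1], [0, 9]]
centers = max_min_init(data, k=3)
```

Spreading the initial centers this way tends to avoid starting two centers inside the same dense group, which is a common cause of poor K-Means results with random initialization.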

The Implementation of Regularized Markov Clustering with Pigeon Inspired Optimization Algorithm in Analyzing the SARS-CoV-2 (COVID-19) Protein Interaction Network

Desimal Jurnal Matematika ◽

10.24042/djm.v3i3.6822 ◽

2020 ◽

Vol 3 (3) ◽

pp. 191-200

Author(s):

M. Syamsuddin Wisnubroto ◽

Marsudi Siburian ◽

Febri Dwi Irawati

Keyword(s):

Protein Interaction ◽

Optimization Algorithm ◽

Large Scale ◽

Clustering Algorithm ◽

Drug Research ◽

Interaction Network ◽

Protein Interaction Networks ◽

Interaction Networks ◽

Clustering Methods ◽

Markov Clustering

Proteins interact with other proteins, DNA, and other molecules, forming large-scale protein interaction networks; clustering methods are needed to make these networks easy to analyze. The regularized Markov clustering algorithm (RMCL) is an improvement of MCL in which the expansion operation is replaced by a new operation that updates the flow distribution of each node. To reduce the weaknesses of the RMCL optimization, the Pigeon Inspired Optimization algorithm (PIO) is used to replace the inflation parameters. Simulation with the IPC SARS-CoV-2 (COVID-19) inflation parameters yields 42 proteins as cluster centers and 8 pairs of proteins that interact with each other. The COVID-19 proteins that interact with 20 or more proteins are ORF8, NSP13, NSP7, M, N, ORF9C, NSP8, and NSP1. Their interactions might be used as targets for drug research.
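For orientation, the expansion and inflation steps of plain MCL, which this abstract takes as its starting point, can be sketched as follows. This is a minimal dense-matrix illustration, not the paper's method: the regularized expansion step and the PIO-based parameter selection are not shown, and the function names, default parameters, and convergence test are assumptions made for the example.

```python
import numpy as np

def normalize_columns(m):
    """Scale each column so that it sums to 1 (column-stochastic)."""
    return m / m.sum(axis=0, keepdims=True)

def mcl(adjacency, inflation=2.0, expansion=2, max_iter=100, tol=1e-6):
    """Plain MCL on a small dense adjacency matrix (illustrative only)."""
    m = np.asarray(adjacency, dtype=float)
    m = normalize_columns(m + np.eye(len(m)))  # add self-loops, normalize
    for _ in range(max_iter):
        prev = m
        m = np.linalg.matrix_power(m, expansion)       # expansion: spread flow
        m = normalize_columns(np.power(m, inflation))  # inflation: sharpen flow
        if np.allclose(m, prev, atol=tol):
            break
    return m

def extract_clusters(m, eps=1e-6):
    """Read clusters off the limit matrix: the support of each nonzero row."""
    found = set()
    for row in m:
        members = tuple(int(i) for i in np.flatnonzero(row > eps))
        if members:
            found.add(members)
    return sorted(found)

# Toy graph: two disconnected triangles (nodes 0-2 and nodes 3-5).
triangle = np.ones((3, 3)) - np.eye(3)
graph = np.block([[triangle, np.zeros((3, 3))],
                  [np.zeros((3, 3)), triangle]])
result = extract_clusters(mcl(graph))
```

The inflation exponent controls cluster granularity: larger values sharpen the flow more aggressively and tend to produce more, smaller clusters.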

Multiscale Hemodynamics Using GPU Clusters

Communications in Computational Physics ◽

10.4208/cicp.210910.250311a ◽

2012 ◽

Vol 11 (1) ◽

pp. 48-64 ◽

Cited By ~ 12

Author(s):

Mauro Bisson ◽

Massimo Bernaschi ◽

Simone Melchionna ◽

Sauro Succi ◽

Efthimios Kaxiras

Keyword(s):

High Performance ◽

Large Scale ◽

Parallel Implementation ◽

Particle Dynamics ◽

Performance Tests ◽

Gpu Clusters ◽

Speed Up ◽

Parallel Code ◽

Realistic Geometries ◽

Hemodynamic Simulations

The parallel implementation of MUPHY, a concurrent multiscale code for large-scale hemodynamic simulations in anatomically realistic geometries, for multi-GPU platforms is presented. Performance tests show excellent results, with a nearly linear parallel speed-up on up to 32 GPUs and a more than tenfold GPU/CPU acceleration, all across the range of GPUs. The basic MUPHY scheme combines a hydrokinetic (Lattice Boltzmann) representation of the blood plasma, with a Particle Dynamics treatment of suspended biological bodies, such as red blood cells. To the best of our knowledge, this represents the first effort in the direction of laying down general design principles for multiscale/physics parallel Particle Dynamics applications in non-ideal geometries. This configures the present multi-GPU version of MUPHY as one of the first examples of a high-performance parallel code for multiscale/physics biofluidic applications in realistically complex geometries.

A Fast Clustering Algorithm for Large-scale and High Dimensional Data

ACTA AUTOMATICA SINICA ◽

10.3724/sp.j.1004.2009.00859 ◽

2009 ◽

Vol 35 (7) ◽

pp. 859-866

Author(s):

Ming LIU ◽

Xiao-Long WANG ◽

Yuan-Chao LIU

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

High Dimensional Data ◽

High Dimensional

A Novel Unsupervised Classification Method for Sandy Land Using Fully Polarimetric SAR Data

Remote Sensing ◽

10.3390/rs13030355 ◽

2021 ◽

Vol 13 (3) ◽

pp. 355

Author(s):

Weixian Tan ◽

Borong Sun ◽

Chenyu Xiao ◽

Pingping Huang ◽

Wei Xu ◽

...

Keyword(s):

Spectral Clustering ◽

Large Scale ◽

Clustering Algorithm ◽

Feature Vector ◽

Unsupervised Classification ◽

Classification Method ◽

Sandy Land ◽

Classification Methods ◽

The Many ◽

Representative Points

Classification based on polarimetric synthetic aperture radar (PolSAR) images is an emerging technology, and recent years have seen the introduction of various classification methods that have proven effective at identifying typical features of many terrain types. Among the many study regions, the Hunshandake Sandy Land in Inner Mongolia, China stands out for its vast area of sandy land, variety of ground objects, and intricate structure, with more irregular characteristics than conventional land cover. Accounting for the particular surface features of the Hunshandake Sandy Land, an unsupervised classification method based on new decomposition and large-scale spectral clustering with superpixels (ND-LSC) is proposed in this study. First, the polarization scattering parameters are extracted through a new decomposition, rather than other decomposition approaches, which gives a more accurate feature vector estimate. Second, large-scale spectral clustering is applied to handle the massive land area and complex terrain. More specifically, this involves an initial sub-step of superpixel generation via the Adaptive Simple Linear Iterative Clustering (ASLIC) algorithm, with the feature vector combined with the spatial coordinate information as input, then a sub-step of representative point selection and bipartite graph formation, followed by the spectral clustering algorithm to complete the classification task. Finally, testing and analysis are conducted on the RADARSAT-2 fully PolSAR dataset acquired over the Hunshandake Sandy Land in 2016. Both qualitative and quantitative experiments, compared against several classification methods, show that the proposed method can significantly improve classification performance.

A Parallel Unmixing-Based Content Retrieval System for Distributed Hyperspectral Imagery Repository on Cloud Computing Platforms

Remote Sensing ◽

10.3390/rs13020176 ◽

2021 ◽

Vol 13 (2) ◽

pp. 176

Author(s):

Peng Zheng ◽

Zebin Wu ◽

Jin Sun ◽

Yi Zhang ◽

Yaoqin Zhu ◽

...

Keyword(s):

Cloud Computing ◽

Large Scale ◽

Retrieval System ◽

Hyperspectral Image ◽

Parallel Implementation ◽

Remotely Sensed Data ◽

Web Interfaces ◽

Content Retrieval ◽

Service Mode ◽

Computing Platforms

As the volume of remotely sensed data grows significantly, content-based image retrieval (CBIR) becomes increasingly important, especially for cloud computing platforms that facilitate processing and storing big data in a parallel and distributed way. This paper proposes a novel parallel CBIR system for hyperspectral image (HSI) repositories on cloud computing platforms, guided by unmixed spectral information, i.e., endmembers and their associated fractional abundances, to retrieve hyperspectral scenes. However, existing unmixing methods suffer from an extremely high computational burden when extracting meta-data from large-scale HSI data. To address this limitation, we implement a distributed and parallel unmixing method that runs on cloud computing platforms to accelerate the unmixing processing flow. In addition, we implement a global standard distributed HSI repository equipped with a large spectral library in a software-as-a-service mode, providing users with HSI storage, management, and retrieval services through web interfaces. Furthermore, the parallel implementation of unmixing processing is incorporated into the CBIR system to establish the parallel unmixing-based content retrieval system. The performance of our proposed parallel CBIR system was verified in terms of both unmixing efficiency and accuracy.
