PecanPy: a fast, efficient, and parallelized Python implementation of node2vec

Bioinformatics ◽

10.1093/bioinformatics/btab202 ◽

2021 ◽

Author(s):

Renming Liu ◽

Arjun Krishnan

Keyword(s):

Data Structures ◽

Biological Networks ◽

Supplementary Information ◽

Network Density ◽

High Quality ◽

Large Graphs ◽

Graph Data ◽

Compact Graph ◽

Low Dimensional ◽

Node Embeddings

Summary: Learning low-dimensional representations (embeddings) of nodes in large graphs is key to applying machine learning on massive biological networks. Node2vec is the most widely used method for node embedding. However, its original Python and C++ implementations scale poorly with network density, failing for dense biological networks with hundreds of millions of edges. We have developed PecanPy, a new Python implementation of node2vec that uses cache-optimized compact graph data structures and precomputing/parallelization to result in fast, high-quality node embeddings for biological networks of all sizes and densities. Availability: PecanPy software is freely available at https://github.com/krishnanlab/PecanPy. Supplementary information: Supplementary data are available at Bioinformatics online.
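
As a rough illustration of the kind of computation this abstract refers to, the sketch below stores a small undirected graph in compressed sparse row (CSR) arrays and computes node2vec's second-order transition probabilities for one walk step. The return parameter p and in-out parameter q come from the node2vec method; the choice of CSR as the compact layout, the function names, and the toy graph are assumptions made for illustration and are not PecanPy's API.

```python
def build_csr(num_nodes, edges):
    """Pack an undirected weighted edge list into CSR arrays (hypothetical helper)."""
    adj = [[] for _ in range(num_nodes)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    indptr, indices, weights = [0], [], []
    for nbrs in adj:
        for v, w in sorted(nbrs):
            indices.append(v)
            weights.append(w)
        indptr.append(len(indices))  # end offset of this node's neighbor slice
    return indptr, indices, weights


def transition_probs(indptr, indices, weights, prev, cur, p, q):
    """Normalized node2vec probabilities for leaving `cur`, having arrived from `prev`."""
    start, end = indptr[cur], indptr[cur + 1]
    prev_nbrs = set(indices[indptr[prev]:indptr[prev + 1]])
    biased = []
    for x, w in zip(indices[start:end], weights[start:end]):
        if x == prev:
            biased.append(w / p)   # step back to the previous node
        elif x in prev_nbrs:
            biased.append(w)       # stay at distance 1 from the previous node
        else:
            biased.append(w / q)   # move outward, to distance 2
    total = sum(biased)
    return indices[start:end], [b / total for b in biased]


# Toy graph (assumed): a triangle 0-1-2 with a pendant node 3 attached to 2.
indptr, indices, weights = build_csr(
    4, [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0), (2, 3, 1.0)]
)
nbrs, probs = transition_probs(indptr, indices, weights, prev=0, cur=2, p=2.0, q=0.5)
print(nbrs, probs)
```

With p = 2 and q = 0.5 the walk at node 2 is discouraged from returning to node 0 and encouraged to move outward to node 3. Keeping each node's neighbors in one contiguous slice is what makes this lookup a single array scan.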


PecanPy: a fast, efficient, and parallelized Python implementation of node2vec

10.1101/2020.07.23.218487 ◽

2020 ◽

Author(s):

Renming Liu ◽

Arjun Krishnan

Keyword(s):

Machine Learning ◽

Data Structures ◽

Biological Networks ◽

Network Density ◽

High Quality ◽

Large Graphs ◽

Graph Data ◽

Compact Graph ◽

Low Dimensional ◽

Node Embeddings

Abstract: Learning low-dimensional representations (embeddings) of nodes in large graphs is key to applying machine learning on massive biological networks. Node2vec is the most widely used method for node embedding. However, its original Python and C++ implementations scale poorly with network density, failing for dense biological networks with hundreds of millions of edges. We have developed PecanPy, a new Python implementation of node2vec that uses cache-optimized compact graph data structures and precomputing/parallelization to result in fast, high-quality node embeddings for biological networks of all sizes and densities. PecanPy software and documentation are available at https://github.com/krishnanlab/pecanpy.


Link Prediction in Complex Networks using Embedding Techniques and Similarity Measures

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.e2762.039520 ◽

2020 ◽

Vol 9 (5) ◽

pp. 1690-1696

Keyword(s):

Biological Networks ◽

Link Prediction ◽

Preferential Attachment ◽

Similarity Measures ◽

Telecommunication Networks ◽

Different Dimensions ◽

Low Dimensional ◽

Embedding Methods ◽

Node Embeddings ◽

Interacting Components

Networks have proved to be very helpful in modelling complex systems with interacting components. Many problems across domains involve systems that can be modelled as a network with links between interacting components. The problem of link prediction deals with predicting missing links in a given network. Applications of link prediction range across disciplines, including biological networks, transportation networks, social networks, and telecommunication networks. In this paper, we use node embedding methods to encode the nodes into low-dimensional embeddings and predict links based on edge embeddings computed by taking the Hadamard product of the embeddings of the participating nodes. We further compare the accuracy of models trained on embeddings of different dimensions. We also study how adding further features changes the accuracy at each embedding dimension. The additional features include overlap measures such as Jaccard similarity, Adamic-Adar score and the dot product between node embeddings, as well as heuristic features, i.e. common neighbors, resource allocation, preferential attachment and friend tns score.
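
The feature construction described in this abstract can be sketched as follows. This is a minimal illustration under assumptions: the embeddings are given as plain lists, the graph is an undirected adjacency dictionary of sets, and the function names and feature ordering are hypothetical rather than taken from the paper. The friend tns score is omitted because the abstract does not define it.

```python
import math


def hadamard(u_vec, v_vec):
    """Element-wise product of two node embeddings, used as the edge embedding."""
    return [a * b for a, b in zip(u_vec, v_vec)]


def edge_features(emb, adj, u, v):
    """Edge embedding followed by similarity and heuristic features for the pair (u, v)."""
    common = adj[u] & adj[v]
    union = adj[u] | adj[v]
    edge_emb = hadamard(emb[u], emb[v])
    dot = sum(edge_emb)  # dot product equals the sum of the Hadamard product
    jaccard = len(common) / len(union) if union else 0.0
    # Skip degree-1 nodes so that log(1) = 0 never appears in a denominator.
    adamic_adar = sum(1.0 / math.log(len(adj[z])) for z in common if len(adj[z]) > 1)
    resource_allocation = sum(1.0 / len(adj[z]) for z in common)
    preferential_attachment = len(adj[u]) * len(adj[v])
    return edge_emb + [
        dot,
        jaccard,
        adamic_adar,
        float(len(common)),
        resource_allocation,
        float(preferential_attachment),
    ]


# Toy inputs (assumed): a triangle 0-1-2 with node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
emb = {0: [1.0, 2.0], 1: [0.5, -1.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
print(edge_features(emb, adj, 0, 1))
```

The resulting vector would be passed to any binary classifier trained on known links and sampled non-links; the classifier itself is outside the scope of this sketch.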


Tiered Sampling

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3441299 ◽

2021 ◽

Vol 15 (5) ◽

pp. 1-52

Author(s):

Lorenzo De Stefani ◽

Erisa Terolli ◽

Eli Upfal

Keyword(s):

Large Scale ◽

Analysis Of Algorithms ◽

Base Layer ◽

Single Edge ◽

Real World Data ◽

High Quality ◽

Large Graphs ◽

Massive Graphs ◽

Variance Estimate ◽

Low Probability

We introduce Tiered Sampling, a novel technique for estimating the count of sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass on the data and uses a memory of fixed size M, which can be orders of magnitude smaller than the number of edges. Our methods address the challenging task of counting sparse motifs (sub-graph patterns) that have a low probability of appearing in a sample of M edges in the graph, which is the maximum amount of data available to the algorithms in each step. To obtain an unbiased and low-variance estimate of the count, we partition the available memory into tiers (layers) of reservoir samples. While the base layer is a standard reservoir sample of edges, other layers are reservoir samples of sub-structures of the desired motif. By storing more frequent sub-structures of the motif, we increase the probability of detecting an occurrence of the sparse motif we are counting, thus decreasing the variance and error of the estimate. While we focus on the design and analysis of algorithms for counting 4-cliques, we present a method which allows generalizing Tiered Sampling to obtain high-quality estimates for the number of occurrences of any sub-graph of interest, while reducing the analysis effort due to specific properties of the pattern of interest. We present a complete analytical analysis and extensive experimental evaluation of our proposed method using both synthetic and real-world data. Our results demonstrate the advantage of our method in obtaining high-quality approximations for the number of 4- and 5-cliques for large graphs using a very limited amount of memory, significantly outperforming the single edge sample approach for counting sparse motifs in large-scale graphs.
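
The base layer mentioned in this abstract is a standard reservoir sample of edges. The sketch below shows only that building block, using the classic single-pass reservoir scheme with fixed memory M; it does not implement the additional tiers or the motif-count estimators from the paper, and the function name and toy stream are assumptions.

```python
import random


def reservoir_sample(edge_stream, M, rng=None):
    """Keep a uniform random sample of at most M edges from a one-pass stream."""
    if rng is None:
        rng = random.Random()
    sample = []
    for t, edge in enumerate(edge_stream, start=1):
        if len(sample) < M:
            sample.append(edge)      # memory not yet full: always keep
        else:
            j = rng.randrange(t)     # uniform integer in [0, t)
            if j < M:                # happens with probability M / t
                sample[j] = edge     # evict the edge in slot j
    return sample


# Toy stream (assumed): 100 edges of a path graph, memory for 10 edges.
stream = [(i, i + 1) for i in range(100)]
print(reservoir_sample(stream, M=10, rng=random.Random(0)))
```

After t edges have been seen, every edge is in the sample with probability M / t, which is the property that lets a count observed in the sample be rescaled into an estimate for the full graph.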


Synthesis of high quality inorganic fullerene-like BN hollow spheres via a simple chemical route. Electronic supplementary information (ESI) available: XPS spectrum of as-prepared h-BN. See http://www.rsc.org/suppdata/cc/b3/b308264d/

Chemical Communications ◽

10.1039/b308264d ◽

2003 ◽

pp. 2688 ◽

Cited By ~ 43

Author(s):

Xinjun Wang ◽

Yi Xie ◽

Qixun Guo

Keyword(s):

Hollow Spheres ◽

Supplementary Information ◽

High Quality ◽

Inorganic Fullerene ◽

Simple Chemical


Scaling High-Quality Pairwise Link-Based Similarity Retrieval on Billion-Edge Graphs

ACM Transactions on Information Systems ◽

10.1145/3495209 ◽

2022 ◽

Vol 40 (4) ◽

pp. 1-45

Author(s):

Weiren Yu ◽

Julie McCann ◽

Chengyuan Zhang ◽

Hakan Ferhatosmanoglu

Keyword(s):

Web Search ◽

Similarity Score ◽

High Quality ◽

Deterministic Method ◽

Large Graphs ◽

Guaranteed Accuracy ◽

Semantic Difference ◽

Speed Up ◽

Novel Method ◽

Search Quality

SimRank is an attractive link-based similarity measure used in fertile fields of Web search and sociometry. However, the existing deterministic method by Kusumoto et al. [24] for retrieving SimRank does not always produce high-quality similarity results, as it fails to accurately obtain the diagonal correction matrix D. Moreover, SimRank has a “connectivity trait” problem: increasing the number of paths between a pair of nodes would decrease its similarity score. The best-known remedy, SimRank++ [1], cannot completely fix this problem, since its score would still be zero if there are no common in-neighbors between two nodes. In this article, we study fast high-quality link-based similarity search on billion-scale graphs. (1) We first devise a “varied-D” method to accurately compute SimRank in linear memory. We also aggregate duplicate computations, which reduces the time of [24] from quadratic to linear in the number of iterations. (2) We propose a novel “cosine-based” SimRank model to circumvent the “connectivity trait” problem. (3) To substantially speed up the partial-pairs “cosine-based” SimRank search on large graphs, we devise an efficient dimensionality reduction algorithm, PSR#, with guaranteed accuracy. (4) We give mathematical insights to the semantic difference between SimRank and its variant, and correct an argument in [24] that “if D is replaced by a scaled identity matrix (1-γ)I, their top-K rankings will not be affected much”. (5) We propose a novel method that can accurately convert from Li et al. SimRank S̃ to Jeh and Widom’s SimRank S. (6) We propose GSR#, a generalisation of our “cosine-based” SimRank model, to quantify pairwise similarities across two distinct graphs, unlike SimRank that would assess nodes across two graphs as completely dissimilar. Extensive experiments on various datasets demonstrate the superiority of our proposed approaches in terms of high search quality, computational efficiency, accuracy, and scalability on billion-edge graphs.
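
For orientation, the baseline that this abstract builds on can be sketched as the naive fixed-point iteration for Jeh and Widom's SimRank. This is not the article's varied-D, cosine-based, PSR# or GSR# method. The decay factor name C, its value, the iteration count, and the toy graph are assumptions chosen for illustration.

```python
def simrank(in_nbrs, C=0.6, iterations=5):
    """Naive all-pairs SimRank: s(a, b) is the decayed average of s over in-neighbor pairs."""
    nodes = list(in_nbrs)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0   # every node is fully similar to itself
                elif not in_nbrs[a] or not in_nbrs[b]:
                    new[a][b] = 0.0   # no in-links on one side: similarity is zero
                else:
                    total = sum(sim[i][j] for i in in_nbrs[a] for j in in_nbrs[b])
                    new[a][b] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        sim = new
    return sim


# Toy directed graph (assumed): u points to both a and b.
in_nbrs = {"u": [], "a": ["u"], "b": ["u"]}
scores = simrank(in_nbrs)
print(scores["a"]["b"], scores["u"]["a"])
```

The nested sums make each iteration quadratic in the number of nodes before the in-neighbor products are even counted, which is why this form is only practical on very small graphs.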


Graph representation learning: a survey

APSIPA Transactions on Signal and Information Processing ◽

10.1017/atsip.2020.13 ◽

2020 ◽

Vol 9 ◽

Author(s):

Fenxiao Chen ◽

Yun-Cheng Wang ◽

Bin Wang ◽

C.-C. Jay Kuo

Keyword(s):

Graph Embedding ◽

Large Data ◽

Representation Learning ◽

Graph Representation ◽

Data Sets ◽

Graph Data ◽

Graph Properties ◽

Wide Range ◽

Regular Lattices ◽

Low Dimensional

Abstract: Research on graph representation learning has received great attention in recent years since most data in real-world applications come in the form of graphs. High-dimensional graph data are often in irregular forms. They are more difficult to analyze than image/video/audio data defined on regular lattices. Various graph embedding techniques have been developed to convert the raw graph data into a low-dimensional vector representation while preserving the intrinsic graph properties. In this review, we first explain the graph embedding task and its challenges. Next, we review a wide range of graph embedding techniques with insights. Then, we evaluate several state-of-the-art methods against small and large data sets and compare their performance. Finally, potential applications and future directions are presented.


OpenBioLink: a benchmarking framework for large-scale biomedical link prediction

Bioinformatics ◽

10.1093/bioinformatics/btaa274 ◽

2020 ◽

Vol 36 (13) ◽

pp. 4097-4098 ◽

Cited By ~ 3

Author(s):

Anna Breit ◽

Simon Ott ◽

Asan Agibetov ◽

Matthias Samwald

Keyword(s):

Link Prediction ◽

Large Scale ◽

Source Code ◽

Machine Learning Algorithms ◽

Knowledge Networks ◽

Supplementary Information ◽

Supplementary Data ◽

Biomedical Knowledge ◽

High Quality ◽

Baseline Evaluation

Summary: Recently, novel machine-learning algorithms have shown potential for predicting undiscovered links in biomedical knowledge networks. However, dedicated benchmarks for measuring algorithmic progress have not yet emerged. With OpenBioLink, we introduce a large-scale, high-quality and highly challenging biomedical link prediction benchmark to transparently and reproducibly evaluate such algorithms. Furthermore, we present preliminary baseline evaluation results. Availability and implementation: Source code and data are openly available at https://github.com/OpenBioLink/OpenBioLink. Supplementary information: Supplementary data are available at Bioinformatics online.


Discovering a sparse set of pairwise discriminating features in high-dimensional data

Bioinformatics ◽

10.1093/bioinformatics/btaa690 ◽

2020 ◽

Author(s):

Samuel Melton ◽

Sharad Ramanathan

Keyword(s):

Single Cell ◽

Dimensional Space ◽

Cell Types ◽

Dimensional Subspace ◽

Supplementary Information ◽

High Dimensional ◽

Technological Advances ◽

Data Points ◽

Low Dimensional ◽

Sparse Set

Motivation: Recent technological advances produce a wealth of high-dimensional descriptions of biological processes, yet extracting meaningful insight and mechanistic understanding from these data remains challenging. For example, in developmental biology, the dynamics of differentiation can now be mapped quantitatively using single-cell RNA sequencing, yet it is difficult to infer molecular regulators of developmental transitions. Here, we show that discovering informative features in the data is crucial for statistical analysis as well as making experimental predictions. Results: We identify features based on their ability to discriminate between clusters of the data points. We define a class of problems in which linear separability of clusters is hidden in a low-dimensional space. We propose an unsupervised method to identify the subset of features that define a low-dimensional subspace in which clustering can be conducted. This is achieved by averaging over discriminators trained on an ensemble of proposed cluster configurations. We then apply our method to single-cell RNA-seq data from mouse gastrulation, and identify 27 key transcription factors (out of 409 total), 18 of which are known to define cell states through their expression levels. In this inferred subspace, we find clear signatures of known cell types that eluded classification prior to discovery of the correct low-dimensional subspace. Availability and implementation: https://github.com/smelton/SMD. Supplementary information: Supplementary data are available at Bioinformatics online.


Classification in biological networks with hypergraphlet kernels

Bioinformatics ◽

10.1093/bioinformatics/btaa768 ◽

2020 ◽

Cited By ~ 1

Author(s):

Jose Lugo-Martinez ◽

Daniel Zeiberg ◽

Thomas Gaudelet ◽

Noël Malod-Dognin ◽

Natasa Przulj ◽

...

Keyword(s):

Biological Networks ◽

Kernel Method ◽

Information Loss ◽

Cellular Systems ◽

Supplementary Information ◽

Edge Classification ◽

Vertex Classification ◽

Prediction Problems ◽

Potential Use ◽

Modeling Physical Systems

Motivation: Biological and cellular systems are often modeled as graphs in which vertices represent objects of interest (genes, proteins and drugs) and edges represent relational ties between these objects (binds-to, interacts-with and regulates). This approach has been highly successful owing to the theory, methodology and software that support analysis and learning on graphs. Graphs, however, suffer from information loss when modeling physical systems due to their inability to accurately represent multiobject relationships. Hypergraphs, a generalization of graphs, provide a framework to mitigate information loss and unify disparate graph-based methodologies. Results: We present a hypergraph-based approach for modeling biological systems and formulate vertex classification, edge classification and link prediction problems on (hyper)graphs as instances of vertex classification on (extended, dual) hypergraphs. We then introduce a novel kernel method on vertex- and edge-labeled (colored) hypergraphs for analysis and learning. The method is based on exact and inexact (via hypergraph edit distances) enumeration of hypergraphlets, i.e. small hypergraphs rooted at a vertex of interest. We empirically evaluate this method on fifteen biological networks and show its potential use in a positive-unlabeled setting to estimate the interactome sizes in various species. Availability and implementation: https://github.com/jlugomar/hypergraphlet-kernels. Supplementary information: Supplementary data are available at Bioinformatics online.


Top-Down Garbage Collector: a tool for selecting high-quality top-down proteomics mass spectra

Bioinformatics ◽

10.1093/bioinformatics/btz085 ◽

2019 ◽

Vol 35 (18) ◽

pp. 3489-3490 ◽

Cited By ~ 1

Author(s):

Diogo B Lima ◽

André R F Silva ◽

Mathieu Dupré ◽

Marlon D M Santos ◽

Milan A Clasen ◽

...

Keyword(s):

Quality Control ◽

Mass Spectra ◽

Rate Increase ◽

Supplementary Information ◽

Supplementary Data ◽

Top Down ◽

High Quality ◽

Garbage Collector ◽

E Coli ◽

Spectral Libraries

Motivation: We present the first tool for unbiased quality control of top-down proteomics datasets. Our tool can select high-quality top-down proteomics spectra, serve as a gateway for building top-down spectral libraries and, ultimately, improve identification rates. Results: We demonstrate that a twofold rate increase for two E. coli top-down proteomics datasets may be achievable. Availability and implementation: http://patternlabforproteomics.org/tdgc, freely available for academic use. Supplementary information: Supplementary data are available at Bioinformatics online.
