RoleSim*: Scaling axiomatic role-based similarity ranking on large graphs

RoleSim and SimRank are among the popular graph-theoretic similarity measures, with many applications in, e.g., web search, collaborative filtering, and sociometry. While RoleSim addresses the automorphic (role) equivalence of pairwise similarity, which SimRank lacks, it ignores the neighboring similarity information outside the automorphically equivalent set. Consequently, two pairs of nodes that are not automorphically equivalent by nature cannot be well distinguished by RoleSim if the averages of their neighboring similarities over the automorphically equivalent set are the same. To alleviate this problem: 1) We propose a novel similarity model, namely RoleSim*, which accurately evaluates pairwise role similarities in a more comprehensive manner. RoleSim* not only guarantees the automorphic equivalence that SimRank lacks, but also takes into account the neighboring similarity information outside the automorphically equivalent sets that is overlooked by RoleSim. 2) We prove the existence and uniqueness of the RoleSim* solution, and show its three axiomatic properties (i.e., symmetry, boundedness, and non-increasing monotonicity). 3) We provide a concise bound for iteratively computing the RoleSim* formula, and estimate the number of iterations required to attain a desired accuracy. 4) We induce a distance metric based on RoleSim* similarity, and show that the RoleSim* metric fulfills the triangular inequality, which implies the sum-transitivity of its similarity scores. 5) We present a threshold-based RoleSim* model that further reduces the computational time with a provable accuracy guarantee. 6) We propose a single-source RoleSim* model, which scales well for sizable graphs. 7) We also devise methods to scale RoleSim*-based search by incorporating its triangular inequality property with partitioning techniques. 
Our experimental results on real datasets demonstrate that RoleSim* achieves higher accuracy than its competitors while scaling well on sizable graphs with billions of edges.

Scaling High-Quality Pairwise Link-Based Similarity Retrieval on Billion-Edge Graphs

ACM Transactions on Information Systems ◽

10.1145/3495209 ◽

2022 ◽

Vol 40 (4) ◽

pp. 1-45

Author(s):

Weiren Yu ◽

Julie McCann ◽

Chengyuan Zhang ◽

Hakan Ferhatosmanoglu

Keyword(s):

Web Search ◽

Similarity Score ◽

High Quality ◽

Deterministic Method ◽

Large Graphs ◽

Guaranteed Accuracy ◽

Semantic Difference ◽

Speed Up ◽

Novel Method ◽

Search Quality

SimRank is an attractive link-based similarity measure used in the fertile fields of Web search and sociometry. However, the existing deterministic method by Kusumoto et al. [24] for retrieving SimRank does not always produce high-quality similarity results, as it fails to accurately obtain the diagonal correction matrix D. Moreover, SimRank has a “connectivity trait” problem: increasing the number of paths between a pair of nodes would decrease its similarity score. The best-known remedy, SimRank++ [1], cannot completely fix this problem, since its score would still be zero if there are no common in-neighbors between two nodes. In this article, we study fast high-quality link-based similarity search on billion-scale graphs. (1) We first devise a “varied-D” method to accurately compute SimRank in linear memory. We also aggregate duplicate computations, which reduces the time of [24] from quadratic to linear in the number of iterations. (2) We propose a novel “cosine-based” SimRank model to circumvent the “connectivity trait” problem. (3) To substantially speed up the partial-pairs “cosine-based” SimRank search on large graphs, we devise an efficient dimensionality reduction algorithm, PSR#, with guaranteed accuracy. (4) We give mathematical insights into the semantic difference between SimRank and its variant, and correct an argument in [24] that “if D is replaced by a scaled identity matrix (1-γ)I, their top-K rankings will not be affected much”. (5) We propose a novel method that can accurately convert from Li et al.’s SimRank S̃ to Jeh and Widom’s SimRank S. (6) We propose GSR#, a generalisation of our “cosine-based” SimRank model, to quantify pairwise similarities across two distinct graphs, unlike SimRank, which would assess nodes across two graphs as completely dissimilar. 
Extensive experiments on various datasets demonstrate the superiority of our proposed approaches in terms of high search quality, computational efficiency, accuracy, and scalability on billion-edge graphs.
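
For orientation, the baseline that this abstract builds on can be sketched as follows. This is a minimal, naive iteration of the original Jeh and Widom SimRank recurrence, not the "varied-D", "cosine-based", PSR#, or GSR# methods of the article. The function name `simrank`, the decay value 0.8, and the toy graph are assumptions made for illustration only.

```python
def simrank(in_neighbors, decay=0.8, iterations=10):
    """Naive Jeh-Widom SimRank on a graph given as {node: [in-neighbors]}.

    Illustrative only: cost is quadratic in the number of nodes per
    iteration, so this is not suitable for large graphs.
    """
    nodes = list(in_neighbors)
    # Start from the identity: each node is fully similar to itself only.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    # A node without in-neighbors is similar to no other node.
                    new[a][b] = 0.0
                    continue
                # Average similarity over all pairs of in-neighbors.
                total = sum(sim[i][j] for i in ia for j in ib)
                new[a][b] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim


# Toy graph (hypothetical): u points to both a and b.
graph = {"u": [], "a": ["u"], "b": ["u"]}
scores = simrank(graph)
print(scores["a"]["b"])
```

In this toy graph, `a` and `b` share their single in-neighbor `u`, so their score equals the decay factor, while `u`, which has no in-neighbors, scores zero against every other node.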

Similarity Learning via Kernel Preserving Embedding

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33014057 ◽

2019 ◽

Vol 33 ◽

pp. 4057-4064 ◽

Cited By ~ 4

Author(s):

Zhao Kang ◽

Yiwei Lu ◽

Yuanzhang Su ◽

Changsheng Li ◽

Zenglin Xu

Keyword(s):

Fundamental Problem ◽

Dimensional Space ◽

Similarity Measures ◽

Original Data ◽

Subspace Learning ◽

Reconstruction Error ◽

Semisupervised Learning ◽

Low Rank ◽

Similarity Learning ◽

Similarity Information

Data similarity is a key concept in many data-driven applications. Many algorithms are sensitive to similarity measures. To tackle this fundamental problem, automatic learning of similarity information from data via self-expression has been developed and successfully applied in various models, such as low-rank representation, sparse subspace learning, and semisupervised learning. However, self-expression only tries to reconstruct the original data, and some valuable information, e.g., the manifold structure, is largely ignored. In this paper, we argue that it is beneficial to preserve the overall relations when we extract similarity information. Specifically, we propose a novel similarity learning framework that minimizes the reconstruction error of kernel matrices, rather than the reconstruction error of the original data adopted by existing work. Taking the clustering task as an example to evaluate our method, we observe considerable improvements compared to other state-of-the-art methods. More importantly, our proposed framework is very general and provides a novel and fundamental building block for many other similarity-based tasks. Besides, our proposed kernel-preserving approach opens up a large number of possibilities to embed high-dimensional data into low-dimensional space.

Large-scale DCMs for resting-state fMRI

Network Neuroscience ◽

10.1162/netn_a_00015 ◽

2017 ◽

Vol 1 (3) ◽

pp. 222-241 ◽

Cited By ~ 63

Author(s):

Adeel Razi ◽

Mohamed L. Seghier ◽

Yuan Zhou ◽

Peter McColgan ◽

Peter Zeidman ◽

...

Keyword(s):

Functional Connectivity ◽

Resting State ◽

Large Scale ◽

Effective Connectivity ◽

Directed Graphs ◽

Resting State Fmri ◽

Causal Modeling ◽

Large Graphs ◽

Model Inversion ◽

Graph Theoretic

This paper considers the identification of large directed graphs for resting-state brain networks based on biophysical models of distributed neuronal activity, that is, effective connectivity. This identification can be contrasted with functional connectivity methods based on symmetric correlations that are ubiquitous in resting-state functional MRI (fMRI). We use spectral dynamic causal modeling (DCM) to invert large graphs comprising dozens of nodes or regions. The ensuing graphs are directed and weighted, hence providing a neurobiologically plausible characterization of connectivity in terms of excitatory and inhibitory coupling. Furthermore, we show that the use of Bayesian model reduction to discover the most likely sparse graph (or model) from a parent (e.g., fully connected) graph eschews the arbitrary thresholding often applied to large symmetric (functional connectivity) graphs. Using empirical fMRI data, we show that spectral DCM furnishes connectivity estimates on large graphs that correlate strongly with the estimates provided by stochastic DCM. Furthermore, we increase the efficiency of model inversion using functional connectivity modes to place prior constraints on effective connectivity. In other words, we use a small number of modes to finesse the potentially redundant parameterization of large DCMs. We show that spectral DCM—with functional connectivity priors—is ideally suited for directed graph theoretic analyses of resting-state fMRI. We envision that directed graphs will prove useful in understanding the psychopathology and pathophysiology of neurodegenerative and neurodevelopmental disorders. We will demonstrate the utility of large directed graphs in clinical populations in subsequent reports, using the procedures described in this paper.

SNP Variable Selection by Generalized Graph Domination

10.1101/396085 ◽

2018 ◽

Author(s):

Shuzhen Sun ◽

Zhuqi Miao ◽

Blaise Ratcliffe ◽

Polly Campbell ◽

Bret Pasch ◽

...

Keyword(s):

Variable Selection ◽

High Throughput Sequencing ◽

Dominating Set ◽

Similarity Measures ◽

Correlation Coefficients ◽

Biological Research ◽

Pairwise Linkage Disequilibrium ◽

Graph Theoretic ◽

Large Numbers ◽

Highly Correlated

High-throughput sequencing technology has revolutionized both medical and biological research by generating exceedingly large numbers of genetic variants. The resulting datasets share a number of common characteristics that might lead to poor generalization capacity. Concerns include noise accumulated due to the large number of predictors, sparse information regarding the p ≫ n problem, and overfitting and model mis-identification resulting from spurious collinearity. Additionally, complex correlation patterns are present among variables. As a consequence, reliable variable selection techniques play a pivotal role in predictive analysis, generalization capability, and robustness in clustering, as well as interpretability of the derived models. K-dominating set, a parameterized graph-theoretic generalization model, was used to model SNP (single nucleotide polymorphism) data as a similarity network and to search for representative SNP variables. In particular, each SNP was represented as a vertex in the graph, and (dis)similarity measures such as correlation coefficients or pairwise linkage disequilibrium were estimated to describe the relationship between each pair of SNPs; a pair of vertices are adjacent, i.e. joined by an edge, if the pairwise similarity measure exceeds a user-specified threshold. A minimum k-dominating set in the SNP graph was then defined as the smallest subset such that every SNP excluded from the subset has at least k neighbors in the selected ones. The strength of k-dominating set selection in identifying independent variables, and in culling representative variables that are highly correlated with others, was demonstrated on a simulated dataset. The advantages of k-dominating set variable selection were also illustrated in two applications: pedigree reconstruction using SNP profiles of 1,372 Douglas-fir trees, and species delineation for 226 grasshopper mouse samples. 
C++ source code that implements SNP-SELECT and uses the Gurobi™ optimization solver for the k-dominating set variable selection is available (https://github.com/transgenomicsosu/SNP-SELECT).
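
The graph construction and the k-dominating condition described above can be illustrated with a small sketch. It is not SNP-SELECT: the abstract describes an exact minimum found with an optimization solver, whereas the code below swaps in a simple greedy heuristic that returns a valid but not necessarily minimum set. All function names, the threshold, and the toy similarity matrix are hypothetical.

```python
def threshold_graph(sim, threshold):
    """Join vertices i and j when their similarity exceeds the threshold."""
    n = len(sim)
    return {i: {j for j in range(n) if j != i and sim[i][j] > threshold}
            for i in range(n)}


def is_k_dominating(adj, selected, k):
    """Every vertex outside `selected` needs at least k selected neighbors."""
    return all(len(adj[v] & selected) >= k for v in adj if v not in selected)


def greedy_k_dominating(adj, k):
    """Greedy heuristic: valid k-dominating set, not guaranteed minimum."""
    selected = set()

    def deficient(v):
        # Unselected vertex that still lacks k selected neighbors.
        return v not in selected and len(adj[v] & selected) < k

    def gain(v):
        # Selecting v settles v itself and helps each deficient neighbor.
        return int(deficient(v)) + sum(deficient(u) for u in adj[v])

    while not is_k_dominating(adj, selected, k):
        candidates = [v for v in sorted(adj) if v not in selected]
        selected.add(max(candidates, key=gain))
    return selected


# Toy similarity matrix (hypothetical): variable 0 is highly correlated
# with variables 1, 2, and 3, which are weakly correlated with each other.
sim = [
    [1.0, 0.9, 0.9, 0.9],
    [0.9, 1.0, 0.1, 0.1],
    [0.9, 0.1, 1.0, 0.1],
    [0.9, 0.1, 0.1, 1.0],
]
adj = threshold_graph(sim, threshold=0.5)
representatives = greedy_k_dominating(adj, k=1)
print(representatives)
```

With k = 1, the single hub variable represents the other three. With k = 2 on the same graph, the greedy choice still yields a valid set but includes the hub unnecessarily, which is why an exact solver is preferable when the minimum matters.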

Learning Stochastic Equivalence based on Discrete Ricci Curvature

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/201 ◽

2021 ◽

Author(s):

Xuan Guo ◽

Qiang Tian ◽

Wang Zhang ◽

Wenjun Wang ◽

Pengfei Jiao

Keyword(s):

Ricci Curvature ◽

Structure Identification ◽

Network Embedding ◽

Discrimination Ability ◽

Role Identification ◽

Stochastic Equivalence ◽

Regular Equivalence ◽

Role Based ◽

Low Dimensional ◽

Automorphic Equivalence

Role-based network embedding methods aim to preserve node-centric connectivity patterns, which are expressions of node roles, into low-dimensional vectors. However, almost all the existing methods are designed for capturing a relaxation of automorphic equivalence or regular equivalence. They may be good at structure identification but could show poorer performance on role identification, because automorphic equivalence and regular equivalence strictly tie the role of a node to the identities of all its neighbors. To mitigate this problem, we construct a framework called Curvature-based Network Embedding with Stochastic Equivalence (CNESE) to embed stochastic equivalence. More specifically, we estimate the role distribution of nodes based on discrete Ricci curvature for its excellent ability to concisely represent local topology. We use a Variational Auto-Encoder to generate embeddings, while a degree-guided regularizer and a contrastive learning regularizer are leveraged to improve both its robustness and discrimination ability. The effectiveness of our proposed CNESE is demonstrated by extensive experiments on real-world networks.

Using Parallel Distributed Processing to Reduce the Computational Time of Digital Media Similarity Measures

10.1007/978-3-030-88381-2_4 ◽

2021 ◽

pp. 65-87

Author(s):

Myeong Lim ◽

James Jones

Keyword(s):

Digital Media ◽

Distributed Processing ◽

Similarity Measures ◽

Computational Time ◽

Parallel Distributed Processing

Similarity-based approaches to virtual screening

Biochemical Society Transactions ◽

10.1042/bst0310603 ◽

2003 ◽

Vol 31 (3) ◽

pp. 603-606 ◽

Cited By ~ 73

Author(s):

P. Willett

Keyword(s):

Virtual Screening ◽

Similarity Measure ◽

Similarity Measures ◽

Similarity Coefficients ◽

Molecular Fingerprints ◽

Tanimoto Coefficient ◽

Graph Theoretic

Current similarity measures for virtual screening are based on the use of molecular fingerprints and the Tanimoto coefficient. This paper describes two ways in which one can increase the effectiveness of similarity-based virtual screening: using similarity coefficients other than the Tanimoto coefficient for the comparison of molecular fingerprints; and using a graph-theoretic similarity measure based on the largest substructure common to a pair of molecules.
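
As a concrete reference point for the baseline mentioned above, the Tanimoto coefficient for binary fingerprints is the number of bits set in both fingerprints divided by the number of bits set in either. The sketch below assumes each fingerprint is represented as a set of "on" bit positions. The function name, the toy fingerprints, and the convention of returning 0.0 for two empty fingerprints are assumptions made for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        # Assumed convention: two empty fingerprints score 0.0.
        return 0.0
    return len(a & b) / union


# Toy fingerprints (hypothetical bit positions).
query = {1, 2, 3, 4}
candidate = {3, 4, 5}
score = tanimoto(query, candidate)  # 2 shared bits out of 5 distinct bits
print(score)
```

The score ranges from 0 (no shared bits) to 1 (identical fingerprints), so a virtual screen can rank database molecules by their score against a query molecule.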
