Scaling High-Quality Pairwise Link-Based Similarity Retrieval on Billion-Edge Graphs

2022 ◽  
Vol 40 (4) ◽  
pp. 1-45
Author(s):  
Weiren Yu ◽  
Julie McCann ◽  
Chengyuan Zhang ◽  
Hakan Ferhatosmanoglu

SimRank is an attractive link-based similarity measure used in fertile fields of Web search and sociometry. However, the existing deterministic method by Kusumoto et al. [24] for retrieving SimRank does not always produce high-quality similarity results, as it fails to accurately obtain the diagonal correction matrix D. Moreover, SimRank has a "connectivity trait" problem: increasing the number of paths between a pair of nodes would decrease its similarity score. The best-known remedy, SimRank++ [1], cannot completely fix this problem, since its score would still be zero if there are no common in-neighbors between two nodes. In this article, we study fast high-quality link-based similarity search on billion-scale graphs. (1) We first devise a "varied-D" method to accurately compute SimRank in linear memory. We also aggregate duplicate computations, which reduces the time of [24] from quadratic to linear in the number of iterations. (2) We propose a novel "cosine-based" SimRank model to circumvent the "connectivity trait" problem. (3) To substantially speed up partial-pairs "cosine-based" SimRank search on large graphs, we devise an efficient dimensionality reduction algorithm, PSR#, with guaranteed accuracy. (4) We give mathematical insights into the semantic difference between SimRank and its variant, and correct an argument in [24] that "if D is replaced by a scaled identity matrix (1−γ)I, their top-K rankings will not be affected much". (5) We propose a novel method that can accurately convert from Li et al.'s SimRank S̃ to Jeh and Widom's SimRank S. (6) We propose GSR#, a generalisation of our "cosine-based" SimRank model, to quantify pairwise similarities across two distinct graphs, unlike SimRank, which would assess nodes across two graphs as completely dissimilar.
Extensive experiments on various datasets demonstrate the superiority of our proposed approaches in terms of high search quality, computational efficiency, accuracy, and scalability on billion-edge graphs.
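For reference, the similarity measure being scaled here can be sketched with the textbook Jeh–Widom SimRank iteration (this is a naive illustration, not the authors' varied-D or PSR# algorithms): the score of a node pair is the decayed average similarity of its in-neighbor pairs.

```python
# Minimal textbook SimRank iteration (naive O(n^2) per pair, for illustration
# only). in_neighbors maps each node to its list of in-neighbors; c is the
# decay factor.
def simrank(in_neighbors, c=0.6, iters=10):
    nodes = list(in_neighbors)
    s = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        nxt = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    nxt[(a, b)] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    # the zero-score case that SimRank++ only partly remedies
                    nxt[(a, b)] = 0.0
                    continue
                total = sum(s[(i, j)] for i in ia for j in ib)
                nxt[(a, b)] = c * total / (len(ia) * len(ib))
        s = nxt
    return s
```

On a toy graph where nodes b and c share the single in-neighbor a, their score converges to the decay factor c, while pairs involving a node with no in-neighbors score zero.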

2021 ◽  
Vol 15 (5) ◽  
pp. 1-52
Author(s):  
Lorenzo De Stefani ◽  
Erisa Terolli ◽  
Eli Upfal

We introduce Tiered Sampling, a novel technique for estimating the count of sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass on the data and uses a memory of fixed size M, which can be orders of magnitude smaller than the number of edges. Our methods address the challenging task of counting sparse motifs—sub-graph patterns—that have a low probability of appearing in a sample of M edges in the graph, which is the maximum amount of data available to the algorithms in each step. To obtain an unbiased and low variance estimate of the count, we partition the available memory into tiers (layers) of reservoir samples. While the base layer is a standard reservoir sample of edges, other layers are reservoir samples of sub-structures of the desired motif. By storing more frequent sub-structures of the motif, we increase the probability of detecting an occurrence of the sparse motif we are counting, thus decreasing the variance and error of the estimate. While we focus on the design and analysis of algorithms for counting 4-cliques, we present a method which allows generalizing Tiered Sampling to obtain high-quality estimates for the number of occurrences of any sub-graph of interest, while reducing the analysis effort due to specific properties of the pattern of interest. We present a complete theoretical analysis and extensive experimental evaluation of our proposed method using both synthetic and real-world data. Our results demonstrate the advantage of our method in obtaining high-quality approximations for the number of 4- and 5-cliques for large graphs using a very limited amount of memory, significantly outperforming the single edge sample approach for counting sparse motifs in large scale graphs.
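The base layer described above is a classic reservoir sample of edges; a minimal sketch of that one tier (not the full multi-tier estimator) looks like this:

```python
import random

# Standard reservoir sampling: a single pass over an edge stream keeps a
# uniform sample of M edges. Tiered Sampling layers further reservoirs of
# motif sub-structures on top of this base tier; only the base is shown.
def reservoir_sample(edge_stream, M, seed=0):
    rng = random.Random(seed)
    sample = []
    for t, edge in enumerate(edge_stream):
        if t < M:
            sample.append(edge)          # fill the reservoir first
        else:
            j = rng.randrange(t + 1)     # keep the new edge with prob M/(t+1)
            if j < M:
                sample[j] = edge
    return sample
```

Each edge in the stream ends up in the sample with equal probability M/T after T edges, which is what makes the downstream count estimates unbiased.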


2008 ◽  
Vol 2008 ◽  
pp. 1-6 ◽  
Author(s):  
Tng C. H. John ◽  
Edmond C. Prakash ◽  
Narendra S. Chaudhari

This paper proposes a novel method to generate strategic team AI pathfinding plans for computer games and simulations using probabilistic pathfinding. This method is inspired by genetic algorithms (Russell and Norvig, 2002), in that a fitness function is used to test the quality of the path plans. The method generates high-quality path plans by eliminating the low-quality ones. The path plans are generated by probabilistic pathfinding, and the elimination is done by a fitness test of the path plans. This path plan generation method has the ability to generate varied high-quality paths, which is desirable for games to increase replay value. This work is an extension of our earlier work on team AI: probabilistic pathfinding (John et al., 2006). We explore ways to combine probabilistic pathfinding and genetic algorithms to create a new method to generate strategic team AI pathfinding plans.
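The generate-then-filter idea can be sketched as follows; the probabilistic step rule and the fitness function (shorter is fitter) are simplified placeholders, not the paper's strategic-team criteria:

```python
import random

# Generate many probabilistic grid paths, then keep only the fittest ones.
# Fitness here is simply path length; a game would use a richer function.
def random_path(start, goal, rng, max_steps=50):
    x, y = start
    path = [(x, y)]
    for _ in range(max_steps):
        if (x, y) == goal:
            return path
        dx = (goal[0] > x) - (goal[0] < x)
        dy = (goal[1] > y) - (goal[1] < y)
        # probabilistic step: move toward the goal along x or y at random
        x, y = (x + dx, y) if rng.random() < 0.5 else (x, y + dy)
        path.append((x, y))
    return None  # failed to reach the goal within max_steps

def best_paths(start, goal, n=100, keep=5, seed=0):
    rng = random.Random(seed)
    paths = [p for p in (random_path(start, goal, rng) for _ in range(n)) if p]
    return sorted(paths, key=len)[:keep]  # fitness test: eliminate long paths
```

Because the generator is stochastic, repeated calls with different seeds yield varied high-quality paths rather than one fixed route.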


2020 ◽  
Vol 2020 ◽  
pp. 1-12
Author(s):  
Lei Si ◽  
Xiangxiang Xiong ◽  
Zhongbin Wang ◽  
Chao Tan

Accurate identification of the coal seam distribution is a prerequisite for realizing intelligent mining with a shearer. This paper presents a novel method for identifying coal and rock based on a deep convolutional neural network (CNN). Three regularization methods are introduced to address the overfitting problem of the CNN and speed up convergence: dropout, weight regularization, and batch normalization. The coal-rock image data are then enriched by means of data augmentation, which significantly improves performance. A shearer coal-rock cutting experiment system is designed to collect more real coal-rock images, and several experiments are conducted. The experimental results indicate that the proposed network performs better at identifying coal-rock images.
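Two of the three regularizers named above can be illustrated in a few lines of numpy; these are the standard textbook formulations, assumed here rather than taken from the paper's network:

```python
import numpy as np

# Batch normalization: normalize each feature over the batch, then apply a
# learnable scale (gamma) and shift (beta).
def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

# (Inverted) dropout: randomly zero activations at train time and rescale
# the survivors so the expected activation is unchanged.
def dropout(x, rate, rng):
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)
```

Weight regularization, the third method, simply adds a penalty such as λ‖W‖² to the training loss rather than transforming activations.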


2016 ◽  
Vol 49 (6) ◽  
pp. 2106-2115 ◽  
Author(s):  
Janusz Wolny ◽  
Ireneusz Buganski ◽  
Pawel Kuczera ◽  
Radoslaw Strzalka

A very serious concern of scientists dealing with crystal structure refinement, including theoretical research, pertains to the characteristic bias in calculated versus measured diffraction intensities, observed particularly in the weak reflection regime. This bias is here attributed to corrective factors for phonons and, even more distinctly, phasons, and credible proof supporting this assumption is given. The lack of a consistent theory of phasons in quasicrystals significantly contributes to this characteristic bias. It is shown that the most commonly used exponential Debye–Waller factor for phasons fails in the case of quasicrystals, and a novel method of calculating the correction factor within a statistical approach is proposed. The results obtained for model quasiperiodic systems show that phasonic perturbations can be successfully described and refinement fits of high quality are achievable. The standard Debye–Waller factor for phonons works equally well for periodic and quasiperiodic crystals, and it is only in the last steps of a refinement that different correction functions need to be applied to improve the fit quality.
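For context, the exponential Debye–Waller correction referred to above, in its standard textbook phonon form, attenuates the intensity of a reflection with scattering vector k according to the mean-square atomic displacement u:

```latex
% Standard exponential Debye-Waller attenuation of a measured intensity
% (textbook phonon form, not the paper's statistical correction):
I(\mathbf{k}) = I_0(\mathbf{k})\, e^{-\langle (\mathbf{k}\cdot\mathbf{u})^2 \rangle}
```

The paper's claim is that while this exponential form is adequate for phonons, transplanting it to phasonic disorder in quasicrystals fails, motivating the statistically derived correction factor proposed instead.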


2013 ◽  
Vol 136 (1) ◽  
Author(s):  
M. Kashkoush ◽  
H. ElMaraghy

A new automatic design retrieval method that identifies the legacy product design most similar to a new one is proposed. Matching phylogenetic trees has been utilized in biological science for decades and is referred to as "tree reconciliation." A new application of this approach in manufacturing is presented where legacy designs are retrieved based on reconciliation of trees representing a product's bill of materials (BOM). A product BOM is a structured tree which represents its components and their hierarchical relationships; hence, it captures the contents and structure of assembled products. Making use of data associated with the retrieved designs also helps speed up other downstream planning activities such as process planning, hence improving planning efficiency. A chemical processing centrifugal pump is used as a case study for illustration. The results obtained using the proposed method are compared with those recently published on BOM tree matching for further analysis and verification. This novel method is less computationally complex than available state-of-the-art algorithms.
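The retrieval task itself can be illustrated with a deliberately simple stand-in: score legacy BOM trees against a query tree by shared component sets. This is a naive baseline for exposition, not the paper's tree reconciliation algorithm, and the BOM encoding (nested dicts) is an assumption:

```python
# A BOM tree encoded as nested dicts: {part: {subpart: {...}, ...}}.
def components(bom):
    # flatten a BOM tree into the set of all part names it contains
    out = set()
    for part, subtree in bom.items():
        out.add(part)
        out |= components(subtree)
    return out

def most_similar(query_bom, legacy_boms):
    # retrieve the legacy design whose component set best overlaps the query
    def jaccard(a, b):
        return len(a & b) / len(a | b)
    q = components(query_bom)
    return max(legacy_boms,
               key=lambda name: jaccard(q, components(legacy_boms[name])))
```

Note this flattening discards the hierarchy; the point of reconciliation-based matching is precisely that it also compares tree structure, not just component content.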


2013 ◽  
Vol 750-752 ◽  
pp. 1987-1991
Author(s):  
Jian Hua Yang ◽  
Xing Jian Ma ◽  
Lu Fei Rong ◽  
Bo Liu ◽  
Jia Yan Yu ◽  
...  

The characterization of the surface wear resistance of materials usually relies on the measurement of slight wear, and the conventional weighing method has two obvious shortcomings. In order to improve measurement accuracy, a more intuitive and reliable method for the quantitative measurement of slight wear, the interference microscope method, is presented. Higher accuracy (on the order of a micrometer) can be achieved using interferometry to measure slight wear. The results show that the masking processing technology ensures that all samples for wear testing and other analyses are obtained under the same pretreatment and vacuum processing conditions and speeds up the commercialization of the processing technology, and that comparing the cross-sectional areas of wear scars is a valid way to characterize the wear resistance of different zones.


2018 ◽  
Author(s):  
Huilong Du ◽  
Chengzhi Liang

Due to the large number of repetitive sequences in complex eukaryotic genomes, fragmented and incompletely assembled genomes lose value as reference sequences, often due to short contigs that cannot be anchored onto chromosomes or are mispositioned. Here we report a novel method, Highly Efficient Repeat Assembly (HERA), which includes a new concept called a connection graph as well as algorithms for constructing the graph. HERA resolves repeats at high efficiency with single-molecule sequencing data, and enables the assembly of chromosome-scale contigs by further integrating genome maps and Hi-C data. We tested HERA with the genomes of rice R498, maize B73, human HX1 and Tartary buckwheat Pinku1. HERA can correctly assemble most of the tandemly repetitive sequences in rice using single-molecule sequencing data only. Using the same maize and human sequencing data published by Jiao et al. (2017) and Shi et al. (2016), respectively, we dramatically improved on the sequence contiguity compared with the published assemblies, increasing the contig N50 from 1.3 Mb to 61.2 Mb in the maize B73 assembly and from 8.3 Mb to 54.4 Mb in the human HX1 assembly with HERA. We provided a high-quality maize reference genome with 96.9% of the gaps filled (only 76 gaps left) and several incorrectly positioned sequences fixed compared with the B73 RefGen_v4 assembly. Comparisons between the HERA assembly of HX1 and the human GRCh38 reference genome showed that many gaps in GRCh38 could be filled, and that GRCh38 contained some potential errors that could be fixed. We assembled the Pinku1 genome into 12 scaffolds with a contig N50 size of 27.85 Mb. HERA serves as a new genome assembly/phasing method to generate high quality sequences for complex genomes and as a curation tool to improve the contiguity and completeness of existing reference genomes, including the correction of assembly errors in repetitive regions.
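The contig N50 metric quoted throughout is computed in the standard way: the contig length at which contigs of that length or longer cover at least half of the total assembled bases. A minimal implementation:

```python
# Standard N50: sort contig lengths in descending order and walk down until
# the running sum reaches half of the total assembly length; return that
# contig's length.
def n50(contig_lengths):
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
```

A jump from an N50 of 1.3 Mb to 61.2 Mb therefore means half the assembled maize genome now lies in contigs of at least 61.2 Mb.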


2016 ◽  
Vol 41 (1) ◽  
pp. 29-43
Author(s):  
Tomohiro Sonobe

This paper describes a novel Monte Carlo based random walk method to compute the PageRanks of nodes in a large graph on a single PC. The target graphs of this paper are those whose size exceeds the physical memory. In such an environment, memory management is a difficult task when simulating the random walk among the nodes. We propose a novel method that partitions the graph into subgraphs in order to make them fit into the physical memory, and conducts the random walk for each subgraph. By evaluating the walks lazily, we can conduct the walks only in a subgraph and approximate the random walk by rotating the subgraphs. In computational experiments, the proposed method exhibits good performance on existing large graphs with only several passes over the graph data.
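The underlying Monte Carlo PageRank estimator can be sketched for an in-memory graph; the paper's contribution is making this work when the graph exceeds memory via lazy walks over subgraph partitions, which this sketch omits:

```python
import random
from collections import Counter

# Monte Carlo PageRank: start several walks from every node; at each step the
# walk continues to a random out-neighbor with probability `damping`,
# otherwise it terminates. Visit frequencies approximate PageRank.
def mc_pagerank(out_edges, walks_per_node=100, damping=0.85, seed=0):
    rng = random.Random(seed)
    visits = Counter()
    for start in out_edges:
        for _ in range(walks_per_node):
            node = start
            while True:
                visits[node] += 1
                nbrs = out_edges[node]
                if not nbrs or rng.random() > damping:
                    break  # dangling node or teleport: end this walk
                node = rng.choice(nbrs)
    total = sum(visits.values())
    return {n: visits[n] / total for n in out_edges}
```

The estimate needs no linear algebra and touches only one node's adjacency list per step, which is what makes the subgraph-at-a-time strategy in the paper possible.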


2018 ◽  
Author(s):  
A. A. Popov ◽  
S.-C. Lee ◽  
P. P. Kuksa ◽  
J. D. Glickson ◽  
A. A. Shestov

Chemical kinetic simulations are usually based on the law of mass action, which applies to the behavior of particles in solution. Molecular interactions in a crowded medium such as a cell, however, are not easily described by such conventional mathematical treatment. Fractal kinetics is emerging as a novel method for simulating kinetic reactions in such an environment. To date, there has not been a fast, efficient, and, more importantly, parallel algorithm for such computations. Here, we present an algorithm with several novel features for simulating large (with respect to size and time scale) fractal kinetic models. We applied the fractal kinetic technique and our algorithm to a canonical substrate-enzyme model with explicit phase separation in the product, and achieved a speed-up of up to 8 times over previous results with reasonably tight bounds on the accuracy of the simulation. We anticipate that this technique and algorithm will have important applications to the simulation of intracellular biochemical reactions with complex dynamic behavior.
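The core departure from mass-action kinetics can be illustrated with the generic Kopelman form of fractal kinetics, where the rate "constant" decays with time as k(t) = k0·t^(−h); setting h = 0 recovers classical kinetics. This is a hedged single-substrate sketch, not the authors' parallel substrate-enzyme algorithm:

```python
# Forward-Euler simulation of ds/dt = -k(t) * s with a fractal,
# time-dependent rate coefficient k(t) = k0 * t**(-h), 0 <= h < 1.
def simulate_fractal_decay(s0, k0, h, dt=0.01, steps=1000):
    s = s0
    trace = [s]
    for step in range(1, steps + 1):
        t = step * dt
        k = k0 * t ** (-h)      # rate coefficient shrinks as the reaction ages
        s = max(s - k * s * dt, 0.0)
        trace.append(s)
    return trace
```

In a crowded medium the effective reaction rate slows over time because reactants become locally depleted faster than diffusion can remix them, which is exactly what the decaying coefficient models.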

