Top-k Similarity Matching in Large Graphs with Attributes

Author(s):  
Xiaofeng Ding ◽  
Jianhong Jia ◽  
Jiuyong Li ◽  
Jixue Liu ◽  
Hai Jin
Author(s):  
Renato Vizuete ◽  
Federica Garin ◽  
Paolo Frasca

2021 ◽  
Vol 15 (4) ◽  
Author(s):  
Yun Peng ◽  
Xin Lin ◽  
Byron Choi ◽  
Bingsheng He

2021 ◽  
Vol 15 (6) ◽  
pp. 1-27
Author(s):  
Marco Bressan ◽  
Stefano Leucci ◽  
Alessandro Panconesi

We address the problem of computing the distribution of induced connected subgraphs, also known as graphlets or motifs, in large graphs. The current state-of-the-art algorithms estimate the motif counts via uniform sampling by leveraging the color coding technique of Alon, Yuster, and Zwick. In this work, we extend the applicability of this approach by introducing a set of algorithmic optimizations and techniques that reduce the running time and space usage of color coding and improve the accuracy of the counts. To this end, we first show how to optimize color coding to efficiently build a compact table of a representative subsample of all graphlets in the input graph. For 8-node motifs, we can build such a table in one hour for a graph with 65M nodes and 1.8B edges, which is times larger than the state of the art. We then introduce a novel adaptive sampling scheme that breaks the “additive error barrier” of uniform sampling, guaranteeing multiplicative approximations instead of just additive ones. This allows us to count not only the most frequent motifs, but also extremely rare ones. For instance, on one graph we accurately count nearly 10,000 distinct 8-node motifs whose relative frequency is so small that uniform sampling would take centuries to find them. Our results show that color coding is still the most promising approach to scalable motif counting.
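
As a rough illustration of the color coding idea mentioned in the abstract above, the sketch below counts simple paths on k nodes, a much simpler pattern than the induced graphlets the paper targets. It is not the authors' algorithm: the function names `count_colorful_paths` and `estimate_k_paths`, the adjacency-dict input format, and the restriction to paths with k >= 2 are all assumptions made for this example. Each trial colors the nodes at random with k colors, counts the "colorful" paths (all colors distinct) by dynamic programming over color sets, and rescales by the probability k!/k^k that a fixed path is colorful.

```python
import random
from math import factorial


def count_colorful_paths(adj, colors, k):
    """Count undirected simple paths on k >= 2 nodes whose node colors are all distinct.

    adj: dict mapping node -> list of neighbors (undirected graph, no self-loops).
    colors: dict mapping node -> color in range(k).
    """
    # table[v][mask] = number of colorful paths ending at v that use exactly
    # the colors in the bitmask `mask`.
    table = {v: {1 << colors[v]: 1} for v in adj}
    for _ in range(k - 1):
        new_table = {v: {} for v in adj}
        for v in adj:
            bit = 1 << colors[v]
            for u in adj[v]:
                for mask, cnt in table[u].items():
                    if not mask & bit:  # color of v not used yet, so v is a new node
                        key = mask | bit
                        new_table[v][key] = new_table[v].get(key, 0) + cnt
        table = new_table
    total = sum(sum(d.values()) for d in table.values())
    return total // 2  # every undirected path is reached from both endpoints


def estimate_k_paths(adj, k, trials=200, seed=0):
    """Estimate the number of k-node paths by averaging over random colorings."""
    rng = random.Random(seed)
    p_colorful = factorial(k) / k**k  # chance that a fixed k-node path is colorful
    total = 0
    for _ in range(trials):
        colors = {v: rng.randrange(k) for v in adj}
        total += count_colorful_paths(adj, colors, k)
    return total / trials / p_colorful
```

Because distinct colors imply distinct nodes, the dynamic program never has to track which nodes a partial path has visited, only which colors; this is what keeps the table size a function of k rather than of the graph size.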


2021 ◽  
Vol 15 (5) ◽  
pp. 1-52
Author(s):  
Lorenzo De Stefani ◽  
Erisa Terolli ◽  
Eli Upfal

We introduce Tiered Sampling, a novel technique for estimating the count of sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass over the data and uses a memory of fixed size M, which can be orders of magnitude smaller than the number of edges. Our methods address the challenging task of counting sparse motifs (sub-graph patterns) that have a low probability of appearing in a sample of M edges of the graph, which is the maximum amount of data available to the algorithms at each step. To obtain an unbiased and low-variance estimate of the count, we partition the available memory into tiers (layers) of reservoir samples. While the base layer is a standard reservoir sample of edges, other layers are reservoir samples of sub-structures of the desired motif. By storing more frequent sub-structures of the motif, we increase the probability of detecting an occurrence of the sparse motif we are counting, thus decreasing the variance and error of the estimate. While we focus on the design and analysis of algorithms for counting 4-cliques, we present a method that generalizes Tiered Sampling to obtain high-quality estimates of the number of occurrences of any sub-graph of interest, while reducing the analysis effort due to specific properties of the pattern of interest. We present a complete analysis and an extensive experimental evaluation of our proposed method using both synthetic and real-world data. Our results demonstrate the advantage of our method in obtaining high-quality approximations of the number of 4- and 5-cliques in large graphs using a very limited amount of memory, significantly outperforming the single-edge-sample approach for counting sparse motifs in large-scale graphs.
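
The layered memory described in the abstract above can be pictured with a small sketch. It shows only the bookkeeping of two reservoirs, a base tier of edges and a second tier of triangles found through the edge sample, and deliberately omits the weights needed for an unbiased 4-clique estimate. The class `Reservoir`, the function `two_tier_sample`, and the choice of triangles as the second tier's sub-structure are assumptions for illustration, not the authors' implementation.

```python
import random


class Reservoir:
    """Fixed-capacity uniform sample of a stream (classic reservoir sampling)."""

    def __init__(self, capacity, rng):
        self.capacity = capacity
        self.rng = rng
        self.items = []
        self.seen = 0  # number of items offered so far

    def offer(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Keep the new item with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item


def two_tier_sample(edge_stream, edge_capacity, triangle_capacity, seed=0):
    """Single pass over the stream, maintaining an edge tier and a triangle tier."""
    rng = random.Random(seed)
    edges = Reservoir(edge_capacity, rng)
    triangles = Reservoir(triangle_capacity, rng)
    for u, v in edge_stream:
        # Neighbors of u and v that are visible in the current edge sample.
        nbrs_u = {b if a == u else a for a, b in edges.items if u in (a, b)}
        nbrs_v = {b if a == v else a for a, b in edges.items if v in (a, b)}
        # Each common neighbor closes a triangle with the arriving edge.
        for w in sorted(nbrs_u & nbrs_v):
            triangles.offer(tuple(sorted((u, v, w))))
        edges.offer((u, v))
    return edges, triangles
```

Storing triangles in their own tier means a later edge can complete a 4-clique against a sampled triangle plus sampled edges, which is more likely than finding all required edges in a single edge sample.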


Agronomy ◽  
2021 ◽  
Vol 11 (7) ◽  
pp. 1307
Author(s):  
Haoriqin Wang ◽  
Huaji Zhu ◽  
Huarui Wu ◽  
Xiaomin Wang ◽  
Xiao Han ◽  
...  

In the question-and-answer (Q&A) communities of the “China Agricultural Technology Extension Information Platform”, thousands of rice-related Chinese questions are added every day. The rapid detection of semantically identical questions is the key to the success of a rice-related intelligent Q&A system. To allow the fast and automatic detection of semantically identical rice-related questions, we propose a new method based on the Coattention-DenseGRU (Gated Recurrent Unit). According to the characteristics of rice-related questions, we applied Word2vec with the TF-IDF (Term Frequency–Inverse Document Frequency) method to process and analyze the text data and compared it with the Word2vec, GloVe, and TF-IDF methods. Combined with the agricultural word segmentation dictionary, Word2vec with the TF-IDF method effectively solved the problem of high-dimensional and sparse data in the rice-related text. Each network layer employed the connection information of features and the hidden features of all previous recurrent layers. To alleviate the growth in feature vector size caused by dense concatenation, an autoencoder was used after dense concatenation. The experimental results show that rice-related question similarity matching based on Coattention-DenseGRU can improve the utilization of text features, reduce the loss of features, and achieve fast and accurate similarity matching on the rice-related question dataset. The precision and F1 values of the proposed model were 96.3% and 96.9%, respectively. Compared with seven other question similarity matching models, the proposed method achieves state-of-the-art results on our rice-related question dataset.
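
One common way to combine word embeddings with TF-IDF, which may or may not match the paper's exact scheme, is to average the word vectors of a question using TF-IDF weights and compare questions by cosine similarity. The sketch below assumes this weighting; the helper names, the smoothed IDF formula, and the toy two-dimensional vectors used in place of trained Word2vec embeddings are all hypothetical, and the Coattention-DenseGRU network itself is not shown.

```python
import math
from collections import Counter


def idf_table(docs):
    """Smoothed inverse document frequency for every token in tokenized docs."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}


def sentence_vector(tokens, vectors, idf):
    """TF-IDF-weighted average of the word vectors of one tokenized question."""
    counts = Counter(t for t in tokens if t in vectors and t in idf)
    dim = len(next(iter(vectors.values())))
    acc = [0.0] * dim
    total = 0.0
    for tok, c in counts.items():
        weight = (c / len(tokens)) * idf[tok]  # term frequency times IDF
        total += weight
        for i, x in enumerate(vectors[tok]):
            acc[i] += weight * x
    return [x / total for x in acc] if total else acc


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```

With pretrained embeddings, `vectors` would be a token-to-vector mapping loaded from the trained model, and `docs` the segmented question corpus.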


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Michał Ławniczak ◽  
Adam Sawicki ◽  
Małgorzata Białous ◽  
Leszek Sirko

We identify and investigate isoscattering strings of concatenating quantum graphs possessing n units and 2n infinite external leads. We give an insight into the principles of designing large graphs and networks for which the isoscattering properties are preserved for $n \rightarrow \infty$. The theoretical predictions are confirmed experimentally using $n=2$ units, four-lead microwave networks. In both its experimental and mathematical approach, our work goes beyond prior results by demonstrating that, using a trace function, one can address the hitherto unsettled problem of whether the scattering properties of open complex graphs and networks with many external leads are uniquely connected to their shapes. The application of the trace function reduces the number of required entries of the $2n \times 2n$ scattering matrices $\hat{S}$ of the systems to 2n diagonal elements, while the old measures of isoscattering require all $(2n)^2$ entries. The studied problem generalizes a famous question of Mark Kac, “Can one hear the shape of a drum?”, originally posed in the case of isospectral dissipationless systems, to the case of infinite strings of open graphs and networks.


2005 ◽  
Vol 11 (4) ◽  
pp. 457-468 ◽  
Author(s):  
E.R. Gansner ◽  
Y. Koren ◽  
S.C. North

2015 ◽  
Vol 8 (3) ◽  
pp. 183-202 ◽  
Author(s):  
Danai Koutra ◽  
U Kang ◽  
Jilles Vreeken ◽  
Christos Faloutsos
