Effective and efficient graph augmentation in large graphs

We address the problem of computing the distribution of induced connected subgraphs, also known as graphlets or motifs, in large graphs. The current state-of-the-art algorithms estimate the motif counts via uniform sampling by leveraging the color coding technique by Alon, Yuster, and Zwick. In this work, we extend the applicability of this approach by introducing a set of algorithmic optimizations and techniques that reduce the running time and space usage of color coding and improve the accuracy of the counts. To this end, we first show how to optimize color coding to efficiently build a compact table of a representative subsample of all graphlets in the input graph. For 8-node motifs, we can build such a table in one hour for a graph with 65M nodes and 1.8B edges, which is times larger than the state of the art. We then introduce a novel adaptive sampling scheme that breaks the “additive error barrier” of uniform sampling, guaranteeing multiplicative approximations instead of just additive ones. This allows us to count not only the most frequent motifs, but also extremely rare ones. For instance, on one graph we accurately count nearly 10,000 distinct 8-node motifs whose relative frequency is so small that uniform sampling would literally take centuries to find them. Our results show that color coding is still the most promising approach to scalable motif counting.
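
As a rough illustration of the color coding idea mentioned in the abstract, the sketch below counts simple paths on k nodes rather than general motifs. It is not the authors' algorithm: the function names, the adjacency-dict graph representation, and the restriction to paths are all assumptions made for this example. Nodes receive random colors from k colors, a dynamic program counts "colorful" paths (all colors distinct), and the count is rescaled by the probability k!/k^k that a fixed path is colorful.

```python
import random
from math import factorial


def count_colorful_paths(adj, colors, k):
    """Count undirected simple paths on k >= 2 nodes with pairwise distinct colors.

    adj: dict mapping each node to an iterable of its neighbours (undirected).
    colors: dict mapping each node to a color in range(k).
    """
    # table[v][mask] = number of colorful paths ending at v that use
    # exactly the colors in the bitmask `mask`.
    table = {v: {1 << colors[v]: 1} for v in adj}
    for _ in range(k - 1):
        nxt = {v: {} for v in adj}
        for u in adj:
            for mask, cnt in table[u].items():
                for v in adj[u]:
                    bit = 1 << colors[v]
                    if mask & bit:
                        continue  # color already used, path would not be colorful
                    nxt[v][mask | bit] = nxt[v].get(mask | bit, 0) + cnt
        table = nxt
    total = sum(cnt for v in adj for cnt in table[v].values())
    return total // 2  # each undirected path is reached from both endpoints


def estimate_k_paths(adj, k, trials, seed=0):
    """Estimate the number of simple k-node paths by averaging over random colorings."""
    rng = random.Random(seed)
    p_colorful = factorial(k) / k**k  # chance that a fixed k-node path is colorful
    total = 0
    for _ in range(trials):
        colors = {v: rng.randrange(k) for v in adj}
        total += count_colorful_paths(adj, colors, k)
    return total / (trials * p_colorful)
```

A call such as `estimate_k_paths({0: [1, 2], 1: [0, 2], 2: [0, 1]}, 3, trials=2000)` estimates the number of 3-node paths in a triangle. The dynamic program only ever stores (node, color set) pairs, which is what keeps color coding tractable compared with enumerating paths directly.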

Tiered Sampling

ACM Transactions on Knowledge Discovery from Data, doi:10.1145/3441299, 2021, Vol 15 (5), pp. 1-52

Author(s): Lorenzo De Stefani, Erisa Terolli, Eli Upfal

Keyword(s): Large Scale, Analysis Of Algorithms, Base Layer, Single Edge, Real World Data, High Quality, Large Graphs, Massive Graphs, Variance Estimate, Low Probability

We introduce Tiered Sampling, a novel technique for estimating the count of sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass over the data and uses a memory of fixed size M, which can be orders of magnitude smaller than the number of edges. Our methods address the challenging task of counting sparse motifs (sub-graph patterns) that have a low probability of appearing in a sample of M edges of the graph, which is the maximum amount of data available to the algorithms in each step. To obtain an unbiased and low-variance estimate of the count, we partition the available memory into tiers (layers) of reservoir samples. While the base layer is a standard reservoir sample of edges, the other layers are reservoir samples of sub-structures of the desired motif. By storing more frequent sub-structures of the motif, we increase the probability of detecting an occurrence of the sparse motif we are counting, thus decreasing the variance and error of the estimate. While we focus on the design and analysis of algorithms for counting 4-cliques, we present a method that generalizes Tiered Sampling to obtain high-quality estimates of the number of occurrences of any sub-graph of interest, while reducing the analysis effort due to specific properties of the pattern of interest. We present a complete analysis and an extensive experimental evaluation of our proposed method using both synthetic and real-world data. Our results demonstrate the advantage of our method in obtaining high-quality approximations of the number of 4- and 5-cliques in large graphs using a very limited amount of memory, significantly outperforming the single edge sample approach for counting sparse motifs in large-scale graphs.
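
The base layer described in the abstract is a standard reservoir sample of edges. A minimal sketch of that single layer is given below, assuming the classic one-pass reservoir scheme with a fixed capacity M. The class name `EdgeReservoir` and its methods are hypothetical, and the higher tiers (samples of motif sub-structures) and the unbiased count estimator from the paper are deliberately left out.

```python
import random


class EdgeReservoir:
    """Uniform sample of at most `capacity` edges from a stream, in one pass."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity  # the fixed memory size M
        self.seen = 0             # number of edges observed so far
        self.sample = []          # the edges currently stored
        self._rng = random.Random(seed)

    def offer(self, edge):
        """Process one streamed edge; return True if it was stored."""
        self.seen += 1
        if len(self.sample) < self.capacity:
            # The reservoir is not yet full, so every edge is kept.
            self.sample.append(edge)
            return True
        # Keep the new edge with probability capacity / seen,
        # evicting a uniformly chosen stored edge.
        slot = self._rng.randrange(self.seen)
        if slot < self.capacity:
            self.sample[slot] = edge
            return True
        return False
```

After t edges have been offered, each of them is in `sample` with probability min(1, M/t). This also shows why a motif with several edges is rarely fully contained in the sample when M is much smaller than t, which is the situation the additional tiers are meant to improve.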

Isoscattering strings of concatenating graphs and networks

Scientific Reports, doi:10.1038/s41598-020-80950-6, 2021, Vol 11 (1)

Author(s): Michał Ławniczak, Adam Sawicki, Małgorzata Białous, Leszek Sirko

Keyword(s): Trace Function, Quantum Graphs, Mathematical Approach, Large Graphs, Graphs And Networks, Scattering Matrices, Theoretical Predictions, Infinite Strings, Microwave Networks, Insight Into

We identify and investigate isoscattering strings of concatenating quantum graphs possessing n units and 2n infinite external leads. We give an insight into the principles of designing large graphs and networks for which the isoscattering properties are preserved for $n \rightarrow \infty$. The theoretical predictions are confirmed experimentally using microwave networks with $n=2$ units and four leads. In both its experimental and mathematical approach, our work goes beyond prior results by demonstrating that, using a trace function, one can address the hitherto unsettled problem of whether the scattering properties of open complex graphs and networks with many external leads are uniquely connected to their shapes. The application of the trace function reduces the number of required entries of the $2n \times 2n$ scattering matrices $\hat{S}$ of the systems to the 2n diagonal elements, whereas the old measures of isoscattering require all $(2n)^2$ entries. The studied problem generalizes a famous question of Mark Kac, “Can one hear the shape of a drum?”, originally posed in the case of isospectral dissipationless systems, to the case of infinite strings of open graphs and networks.

Topological Fisheye Views for Visualizing Large Graphs

IEEE Transactions on Visualization and Computer Graphics, doi:10.1109/tvcg.2005.66, 2005, Vol 11 (4), pp. 457-468

Cited By ~ 72

Author(s): E.R. Gansner, Y. Koren, S.C. North

Keyword(s): Large Graphs

Summarizing and understanding large graphs

Statistical Analysis and Data Mining: The ASA Data Science Journal, doi:10.1002/sam.11267, 2015, Vol 8 (3), pp. 183-202

Cited By ~ 26

Author(s): Danai Koutra, U Kang, Jilles Vreeken, Christos Faloutsos

Keyword(s): Large Graphs

A Distributed Graph Partitioning Algorithm for Processing Large Graphs

2016 IEEE Symposium on Service-Oriented System Engineering (SOSE), doi:10.1109/sose.2016.48, 2016

Cited By ~ 3

Author(s): Tefeng Chen, Bo Li

Keyword(s): Graph Partitioning, Large Graphs, Partitioning Algorithm

Privacy-Preserving Spectral Analysis of Large Graphs in Public Clouds

Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security - ASIA CCS '16, doi:10.1145/2897845.2897857, 2016

Cited By ~ 4

Author(s): Sagar Sharma, James Powers, Keke Chen

Keyword(s): Spectral Analysis, Privacy Preserving, Large Graphs

Population-specific genome graphs improve high-throughput sequencing data analysis: A case study on the Pan-African genome

doi:10.1101/2021.03.19.436173, 2021

Author(s): H. Serhat Tetikol, Kubra Narci, Deniz Turgut, Gungor Budak, Ozem Kalay, ...

Keyword(s): High Throughput Sequencing, Information Overload, African Ancestry, Sample Selection, Variant Calling, Population Diversity, Human Populations, Sequencing Data, Bioinformatics Pipeline, Graph Augmentation

Graph-based genome reference representations have seen significant development, motivated by the inadequacy of the current human genome reference for capturing the diverse genetic information from different human populations and its inability to maintain the same level of accuracy for non-European ancestries. While there have been many efforts to develop computationally efficient graph-based bioinformatics toolkits, how to curate genomic variants and subsequently construct genome graphs remains an understudied problem that inevitably determines the effectiveness of the end-to-end bioinformatics pipeline. In this study, we discuss major obstacles encountered during graph construction and propose methods for sample selection based on population diversity, graph augmentation with structural variants, and resolution of graph reference ambiguity caused by information overload. Moreover, we present the case for iteratively augmenting tailored genome graphs for targeted populations and test the proposed approach on whole-genome samples of African ancestry. Our results show that, as more representative alternatives to linear or generic graph references, population-specific graphs can achieve significantly lower read mapping errors and increased variant calling sensitivity, and can provide the improvements of joint variant calling without the need for computationally intensive post-processing steps.
