Graph representation learning: a survey

Author(s):  
Fenxiao Chen ◽  
Yun-Cheng Wang ◽  
Bin Wang ◽  
C.-C. Jay Kuo

Abstract Research on graph representation learning has received great attention in recent years since most data in real-world applications come in the form of graphs. High-dimensional graph data are often in irregular forms and are more difficult to analyze than image/video/audio data defined on regular lattices. Various graph embedding techniques have been developed to convert the raw graph data into a low-dimensional vector representation while preserving the intrinsic graph properties. In this review, we first explain the graph embedding task and its challenges. Next, we review a wide range of graph embedding techniques with insights. Then, we evaluate several state-of-the-art methods on small and large data sets and compare their performance. Finally, potential applications and future directions are presented.
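To make the embedding task concrete, the sketch below maps each node of a small graph to a low-dimensional vector with a Laplacian eigenmap. It is only an illustrative example of the generic graph embedding problem described above, not a method from the survey; the graph and the target dimension are arbitrary choices.

```python
# Minimal sketch of graph embedding: nodes -> low-dimensional vectors that
# preserve structure, here via eigenvectors of the normalized graph Laplacian.
import numpy as np
import networkx as nx

def spectral_embedding(G: nx.Graph, dim: int = 2) -> dict:
    """Embed nodes using the smallest non-trivial Laplacian eigenvectors."""
    nodes = list(G.nodes())
    L = nx.normalized_laplacian_matrix(G, nodelist=nodes).toarray()
    eigvals, eigvecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    coords = eigvecs[:, 1:dim + 1]                 # skip the trivial first eigenvector
    return {n: coords[i] for i, n in enumerate(nodes)}

if __name__ == "__main__":
    G = nx.karate_club_graph()                     # small benchmark graph
    emb = spectral_embedding(G, dim=2)
    print(emb[0])                                  # 2-D vector for node 0
```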

2015 ◽  
Vol 4 (2) ◽  
pp. 336
Author(s):  
Alaa Najim

Using the idea of dimensionality reduction to visualize graph data sets can preserve the properties of the original space and reveal the underlying information shared among data points. Continuity Trustworthy Graph Embedding (CTGE) is a new method introduced in this paper to improve the faithfulness of graph visualization. We apply CTGE to graphs to obtain a new, understandable representation that is easier to analyze and study. Several experiments on real graph data sets test the effectiveness and efficiency of the proposed method and show that CTGE generates a highly faithful graph representation compared with other methods.
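CTGE itself is not detailed in the abstract. As a rough illustration of the same goal, the sketch below embeds a graph into two dimensions with classical MDS on shortest-path distances and measures the trustworthiness of the layout (continuity is the analogous measure with the roles of the two spaces swapped). All choices here are assumptions, not the CTGE algorithm.

```python
# Generic graph visualization sketch: shortest-path distances -> 2-D layout,
# plus a trustworthiness score as a faithfulness measure.
import networkx as nx
import numpy as np
from sklearn.manifold import MDS, trustworthiness

G = nx.karate_club_graph()
nodes = list(G.nodes())
# All-pairs shortest-path distances serve as the "original space".
D = np.array([[nx.shortest_path_length(G, u, v) for v in nodes] for u in nodes])

# Embed into 2-D with metric MDS on the precomputed distances.
coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D)

# Faithfulness of the layout: do 2-D neighbours match graph neighbours?
print("trustworthiness:", trustworthiness(D, coords, n_neighbors=5, metric="precomputed"))
```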


2020 ◽  
Vol 34 (04) ◽  
pp. 4132-4139
Author(s):  
Huiting Hong ◽  
Hantao Guo ◽  
Yucheng Lin ◽  
Xiaoqing Yang ◽  
Zang Li ◽  
...  

In this paper, we focus on graph representation learning of heterogeneous information networks (HINs), in which various types of vertices are connected by various types of relations. Most existing methods for HINs revise homogeneous graph embedding models via meta-paths to learn a low-dimensional vector space of the HIN. In this paper, we propose a novel Heterogeneous Graph Structural Attention Neural Network (HetSANN) to directly encode the structural information of an HIN without meta-paths and achieve more informative representations. With this method, domain experts are no longer needed to design meta-path schemes, and the heterogeneous information can be processed automatically by our proposed model. Specifically, we implicitly represent heterogeneous information in two ways: 1) we model the transformation between heterogeneous vertices through a projection into low-dimensional entity spaces; 2) we then apply a graph neural network to aggregate multi-relational information from the projected neighborhood by means of an attention mechanism. We also present three extensions of HetSANN, i.e., voices-sharing product attention for the pairwise relationships in an HIN, a cycle-consistency loss to retain the transformation between heterogeneous entity spaces, and multi-task learning with full use of information. Experiments conducted on three public datasets demonstrate that our proposed models achieve significant and consistent improvements compared to state-of-the-art solutions.
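The sketch below is not the authors' HetSANN code; it only illustrates, in PyTorch, the two ingredients named in the abstract: a type-specific projection into a shared low-dimensional space, followed by attention-based aggregation over the projected neighborhood. The vertex types, dimensions, and additive attention form are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TypeAwareAttention(nn.Module):
    """One layer: type-specific projection, then attention aggregation."""
    def __init__(self, type_dims: dict, out_dim: int):
        super().__init__()
        # 1) one projection per vertex type (heterogeneous -> shared space)
        self.proj = nn.ModuleDict({t: nn.Linear(d, out_dim) for t, d in type_dims.items()})
        # 2) additive attention over pairs of projected vectors
        self.att = nn.Linear(2 * out_dim, 1)

    def forward(self, x: list, node_types: list, adj: list):
        # x[i]: feature vector of node i; node_types[i]: its type string;
        # adj[i]: neighbour indices of node i (assumed to include i itself).
        h = [self.proj[node_types[i]](x[i]) for i in range(len(x))]   # shared space
        out = []
        for i, nbrs in enumerate(adj):
            scores = torch.stack([self.att(torch.cat([h[i], h[j]])) for j in nbrs])
            alpha = F.softmax(scores, dim=0)                          # attention weights
            out.append(sum(a * h[j] for a, j in zip(alpha, nbrs)))
        return torch.stack(out)

# Toy usage: two vertex types with different feature sizes.
layer = TypeAwareAttention({"author": 4, "paper": 6}, out_dim=8)
x = [torch.randn(4), torch.randn(6), torch.randn(4)]
out = layer(x, ["author", "paper", "author"], adj=[[0, 1], [0, 1, 2], [1, 2]])
print(out.shape)   # torch.Size([3, 8])
```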


2020 ◽  
Author(s):  
Robert L. Peach ◽  
Alexis Arnaudon ◽  
Julia A. Schmidt ◽  
Henry A. Palasciano ◽  
Nathan R. Bernier ◽  
...  

Abstract Networks are widely used as mathematical models of complex systems across many scientific disciplines, not only in biology and medicine but also in the social sciences, physics, computing and engineering. Decades of work have produced a vast corpus of research characterising the topological, combinatorial, statistical and spectral properties of graphs. Each graph property can be thought of as a feature that captures important (and sometimes overlapping) characteristics of a network. In the analysis of real-world graphs, it is crucial to integrate systematically a large number of diverse graph features in order to characterise and classify networks, as well as to aid network-based scientific discovery. In this paper, we introduce HCGA, a framework for highly comparative analysis of graph data sets that computes several thousand graph features from any given network. HCGA also offers a suite of statistical learning and data analysis tools for automated identification and selection of important and interpretable features underpinning the characterisation of graph data sets. We show that HCGA outperforms other methodologies on supervised classification tasks on benchmark data sets whilst retaining the interpretability of network features. We also illustrate how HCGA can be used for network-based discovery through two examples where data is naturally represented as graphs: the clustering of a data set of images of neuronal morphologies, and a regression problem to predict charge transfer in organic semiconductors based on their structure. HCGA is an open platform that can be expanded to include further graph properties and statistical learning tools to allow researchers to leverage the wide breadth of graph-theoretical research to quantitatively analyse and draw insights from network data.
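HCGA computes several thousand features; the toy sketch below only illustrates the same workflow (many interpretable graph features fed to a supervised classifier, with feature importances for interpretability) using a handful of hand-picked descriptors. It is not the HCGA API, and the feature choices and synthetic data set are assumptions.

```python
# Tiny illustration of feature-based graph classification.
import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def graph_features(G: nx.Graph) -> np.ndarray:
    """A handful of classical graph descriptors for one network."""
    degs = [d for _, d in G.degree()]
    return np.array([
        G.number_of_nodes(),
        G.number_of_edges(),
        nx.density(G),
        nx.average_clustering(G),
        np.mean(degs),
        np.max(degs),
        np.nan_to_num(nx.degree_assortativity_coefficient(G)),
    ])

# Toy data set: distinguish Erdos-Renyi graphs from small-world graphs.
graphs = [nx.gnp_random_graph(50, 0.1, seed=i) for i in range(20)] + \
         [nx.watts_strogatz_graph(50, 4, 0.1, seed=i) for i in range(20)]
labels = [0] * 20 + [1] * 20

X = np.vstack([graph_features(G) for G in graphs])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
# Feature importances keep the interpretability emphasised in the abstract.
print(dict(zip(["n", "m", "density", "clustering", "mean_deg", "max_deg", "assortativity"],
               clf.feature_importances_.round(3))))
```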


2021 ◽  
Author(s):  
Chen Qiao ◽  
Yuanhua Huang

RNA velocity is a promising technique to reveal transient cellular dynamics among a heterogeneous cell population and quantify their transitions from single-cell transcriptome experiments. However, the cell transitions estimated from high dimensional RNA velocity are often unstable or inaccurate, partly due to the high technical noise and less informative projection. Here, we present VeloAE, a tailored representation learning method to learn a low-dimensional representation of RNA velocity on which cell transitions can be robustly estimated. From various experimental datasets, we show that VeloAE can both accurately identify stimulation dynamics in time-series designs and effectively capture the expected cellular differentiation in different biological systems. VeloAE therefore enhances the usefulness of RNA velocity for studying a wide range of biological processes.
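VeloAE's architecture is not specified in the abstract; the sketch below only illustrates the general idea of learning a low-dimensional representation of expression with an autoencoder and carrying the high-dimensional velocity vectors into the same latent space. The linear encoder, dimensions, and placeholder data are simplifying assumptions, not the authors' implementation.

```python
# Sketch: autoencoder latent space shared by expression and velocity.
import torch
import torch.nn as nn

n_cells, n_genes, latent = 500, 2000, 32
expression = torch.randn(n_cells, n_genes)     # placeholder spliced counts
velocity = torch.randn(n_cells, n_genes)       # placeholder high-dimensional RNA velocity

class VeloAutoencoder(nn.Module):
    def __init__(self, n_genes, latent):
        super().__init__()
        self.encoder = nn.Linear(n_genes, latent)
        self.decoder = nn.Linear(latent, n_genes)

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

model = VeloAutoencoder(n_genes, latent)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):                            # reconstruction-only training loop
    z, recon = model(expression)
    loss = nn.functional.mse_loss(recon, expression)
    opt.zero_grad(); loss.backward(); opt.step()

# With a linear encoder the velocity direction can be mapped into the latent
# space (subtracting the bias term), where cell transitions are then estimated.
z = model.encoder(expression)
z_velocity = model.encoder(velocity) - model.encoder(torch.zeros_like(velocity))
print(z.shape, z_velocity.shape)               # torch.Size([500, 32]) each
```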


2020 ◽  
Vol 10 (8) ◽  
pp. 2651
Author(s):  
Su Jeong Choi ◽  
Hyun-Je Song ◽  
Seong-Bae Park

Knowledge bases such as Freebase, YAGO, DBPedia, and Nell contain a number of facts with various entities and relations. Since they store many facts, they are regarded as core resources for many natural language processing tasks. Nevertheless, they are not normally complete and have many missing facts. Such missing facts keep them from being used in diverse applications in spite of their usefulness. Therefore, it is important to complete knowledge bases. Knowledge graph embedding is one of the promising approaches to completing a knowledge base, and thus many variants of knowledge graph embedding have been proposed. It maps all entities and relations in a knowledge base onto a low-dimensional vector space. Then, candidate facts that are plausible in the space are determined as missing facts. However, any single knowledge graph embedding is insufficient to complete a knowledge base. As a solution to this problem, this paper defines knowledge base completion as a ranking task and proposes a committee-based knowledge graph embedding model for improving the performance of knowledge base completion. Since each knowledge graph embedding has its own idiosyncrasy, we make up a committee of various knowledge graph embeddings to reflect various perspectives. After ranking all candidate facts according to their plausibility computed by the committee, the top-k facts are chosen as missing facts. Our experimental results on two data sets show that the proposed model achieves higher performance than any single knowledge graph embedding and shows robust performance regardless of k. These results prove that the proposed model considers various perspectives in measuring the plausibility of candidate facts.
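A minimal sketch of the committee idea follows: several embedding models each score a candidate fact, their ranks are combined, and the top-k candidates are returned as missing facts. The two scoring functions are generic illustrative stand-ins (TransE- and DistMult-style) and the embeddings are random placeholders, not the paper's trained committee.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE-style plausibility: small ||h + r - t|| means plausible."""
    return -np.linalg.norm(h + r - t)

def distmult_score(h, r, t):
    """DistMult-style plausibility: large trilinear product means plausible."""
    return float(np.sum(h * r * t))

def committee_rank(candidates, embeddings, k=3):
    """Rank candidate (head, relation, tail) facts by the committee's mean rank."""
    members = [transe_score, distmult_score]
    per_member_ranks = []
    for score in members:
        scores = [score(*[embeddings[x] for x in c]) for c in candidates]
        order = np.argsort(np.argsort(scores)[::-1])   # rank 0 = most plausible
        per_member_ranks.append(order)
    mean_rank = np.mean(per_member_ranks, axis=0)      # lower is better
    return [candidates[i] for i in np.argsort(mean_rank)[:k]]

# Toy usage with random entity/relation vectors.
rng = np.random.default_rng(0)
embeddings = {name: rng.normal(size=16) for name in ["paris", "france", "berlin", "capital_of"]}
candidates = [("paris", "capital_of", "france"), ("berlin", "capital_of", "france")]
print(committee_rank(candidates, embeddings, k=1))
```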


SLEEP ◽  
2020 ◽  
Vol 43 (Supplement_1) ◽  
pp. A24-A26
Author(s):  
J Hammarlund ◽  
R Anafi

Abstract Introduction We recently used unsupervised machine learning to order genome scale data along a circadian cycle. CYCLOPS (Anafi et al PNAS 2017) encodes high dimensional genomic data onto an ellipse and offers the potential to identify circadian patterns in large data sets. This approach requires many samples from a wide range of circadian phases. Individual data sets often lack sufficient samples. Composite expression repositories vastly increase the available data. However, these agglomerated datasets also introduce technical (e.g. processing site) and biological (e.g. age or disease) confounders that may hamper circadian ordering. Methods Using the Flux machine learning library we expanded the CYCLOPS network. We incorporated additional encoding and decoding layers that model the influence of labeled confounding variables. These layers feed into a fully connected autoencoder with a circular bottleneck, encoding the estimated phase of each sample. The expanded network simultaneously estimates the influence of confounding variables along with circadian phase. We compared the performance of the original and expanded networks using both real and simulated expression data. In a first test, we used time-labeled data from a single center describing human cortical samples obtained at autopsy. To generate a second, idealized processing center, we introduced gene specific biases in expression along with a bias in sample collection time. In a second test, we combined human lung biopsy data from two medical centers. Results The performance of the original CYCLOPS network degraded with the introduction of increasing, non-circadian confounds. The expanded network was able to more accurately assess circadian phase over a wider range of confounding influences. Conclusion The addition of labeled confounding variables into the network architecture improves circadian data ordering. The use of the expanded network should facilitate the application of CYCLOPS to multi-center data and expand the data available for circadian analysis. Support This work was supported by the National Cancer Institute (1R01CA227485-01)
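The authors built their network with the Flux library in Julia; the PyTorch sketch below only illustrates the two architectural ideas described in the abstract: a circular bottleneck that encodes each sample's phase, and extra layers that absorb a labeled confounder (modelled here, as a simplifying assumption, by an additive batch-specific offset).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CircularAutoencoder(nn.Module):
    def __init__(self, n_genes, n_batches):
        super().__init__()
        self.batch_offset = nn.Embedding(n_batches, n_genes)  # confounder layer
        self.encode = nn.Linear(n_genes, 2)                   # to a 2-D bottleneck
        self.decode = nn.Linear(2, n_genes)

    def forward(self, x, batch):
        x_adj = x - self.batch_offset(batch)                  # remove confounder estimate
        z = F.normalize(self.encode(x_adj), dim=1)            # circular bottleneck: unit circle
        recon = self.decode(z) + self.batch_offset(batch)     # add confounder back
        phase = torch.atan2(z[:, 1], z[:, 0])                 # estimated circadian phase
        return recon, phase

# Toy usage: 200 samples, 500 genes, 2 processing centres.
model = CircularAutoencoder(n_genes=500, n_batches=2)
x = torch.randn(200, 500)
batch = torch.randint(0, 2, (200,))
recon, phase = model(x, batch)
loss = F.mse_loss(recon, x)   # trained jointly, so phase and confounders are estimated together
print(phase.shape)            # torch.Size([200])
```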


2006 ◽  
Vol 2 (14) ◽  
pp. 592-592
Author(s):  
Paresh Prema ◽  
Nicholas A. Walton ◽  
Richard G. McMahon

Observational astronomy is entering an exciting new era with large surveys delivering deep multi-wavelength data over a wide range of the electromagnetic spectrum. The last ten years have seen a growth in the study of high redshift galaxies discovered with the method pioneered by Steidel et al. (1995) to identify galaxies at z>1. The technique is designed to take advantage of the multi-wavelength data now available to astronomers, which can extend from X-rays to radio wavelengths. The technique is fast becoming a useful way to study large samples of objects at these high redshifts, and we are currently designing and implementing an automated technique to study these samples of objects. However, large surveys produce large data sets that have now reached terabytes in size (e.g. the Sloan Digital Sky Survey, <http://www.sdss.org>) and will reach petabytes over the next 10 years (e.g., LSST, <http://www.lsst.org>). The Virtual Observatory is now providing a means to deal with this issue, and users are now able to access many data sets in a quicker, more useful form.
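At its core, the selection pioneered by Steidel et al. amounts to automated colour cuts on multi-band photometry. The sketch below shows the shape of such a selection only; the band names and threshold values are purely illustrative placeholders, not the actual survey criteria.

```python
# Illustrative dropout-style colour selection on a toy photometric catalogue.
import numpy as np

def dropout_candidates(u_mag, g_mag, r_mag, ug_min=1.5, gr_max=1.2):
    """Flag U-dropout candidates: a very red U-G colour (flux drop blueward of
    the break) combined with a relatively flat G-R colour redward of it."""
    u_g = u_mag - g_mag
    g_r = g_mag - r_mag
    return (u_g > ug_min) & (g_r < gr_max)

# Toy catalogue of three sources (magnitudes are made-up).
u = np.array([24.8, 22.1, 25.5])
g = np.array([22.9, 21.8, 23.4])
r = np.array([22.6, 21.5, 22.9])
print(dropout_candidates(u, g, r))   # boolean mask of candidate high-z galaxies
```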


Author(s):  
Yuhan Wang ◽  
Weidong Xiao ◽  
Zhen Tan ◽  
Xiang Zhao

Abstract Knowledge graphs are typical multi-relational structures consisting of many entities and relations. Nonetheless, existing knowledge graphs are still sparse and far from being complete. To refine knowledge graphs, representation learning is utilized to embed entities and relations into low-dimensional spaces. Many existing knowledge graph embedding models focus on learning latent features under the closed-world assumption but overlook that each knowledge graph is changeable. In this paper, we propose a knowledge graph representation learning model, called Caps-OWKG, which leverages the capsule network to capture both known and unknown triplet features in an open-world knowledge graph. It combines descriptive text and the knowledge graph to obtain a descriptive embedding and a structural embedding simultaneously. Then, both embeddings are used to calculate the probability that a triplet is authentic. We verify the performance of Caps-OWKG on the link prediction task with two common datasets, FB15k-237-OWE and DBPedia50k. The experimental results are better than other baselines and achieve state-of-the-art performance.
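Caps-OWKG's capsule network is not reproduced here; the sketch below only illustrates the general scheme of combining a structural embedding with a descriptive-text embedding and scoring a triplet's plausibility. The simple bag-of-words text encoder, the averaging, and the TransE-style score are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class TripletScorer(nn.Module):
    def __init__(self, n_entities, n_relations, vocab_size, dim=64):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)        # structural embedding
        self.rel = nn.Embedding(n_relations, dim)
        self.text = nn.EmbeddingBag(vocab_size, dim)    # bag-of-words stand-in for the text encoder

    def entity_vec(self, ent_id, desc_tokens):
        # Combine structural and descriptive embeddings (here: simple averaging).
        return 0.5 * (self.ent(ent_id) + self.text(desc_tokens))

    def forward(self, h_id, h_desc, r_id, t_id, t_desc):
        h = self.entity_vec(h_id, h_desc)
        r = self.rel(r_id)
        t = self.entity_vec(t_id, t_desc)
        # TransE-style plausibility mapped to a probability with a sigmoid.
        return torch.sigmoid(-torch.norm(h + r - t, dim=1))

model = TripletScorer(n_entities=100, n_relations=10, vocab_size=5000)
prob = model(torch.tensor([3]), torch.tensor([[11, 42, 7]]),
             torch.tensor([1]), torch.tensor([8]), torch.tensor([[19, 4, 4]]))
print(prob)   # plausibility of one candidate triplet
```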


2019 ◽  
Author(s):  
A. Viehweger ◽  
S. Krautwurst ◽  
D. H. Parks ◽  
B. König ◽  
M. Marz

Abstract An ever-growing number of metagenomes can be used for biomining and the study of microbial functions. The use of learning algorithms in this context has been hindered, because they often need input in the form of low-dimensional, dense vectors of numbers. We propose such a representation for genomes, called nanotext, that scales to very large data sets. The underlying model is learned from a corpus of nearly 150 thousand genomes spanning 750 million protein domains. We treat the protein domains in a genome like words in a document, assuming that protein domains in a similar context have similar "meaning". This meaning can be distributed by a neural net over a vector of numbers. The resulting vectors efficiently encode function, preserve known phylogeny, capture subtle functional relationships and are robust against genome incompleteness. The "functional" distance between two vectors complements nucleotide-based distance, so that genomes can be identified as similar even though their nucleotide identity is low. nanotext can thus encode (meta)genomes for direct use in downstream machine learning tasks. We show this by predicting plausible culture media for metagenome assembled genomes (MAGs) from the Tara Oceans Expedition using their genome content only. nanotext is freely released under a BSD licence (https://github.com/phiweger/nanotext).
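nanotext's released model lives at the URL above; the sketch below only illustrates the underlying idea of treating protein domains as words and genomes as documents, using gensim's Doc2Vec. The domain identifiers and the tiny corpus are made-up placeholders.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each "document" is a genome, each "word" a protein-domain identifier (e.g. Pfam).
genomes = {
    "genome_A": ["PF00001", "PF00005", "PF00072", "PF00005"],
    "genome_B": ["PF00005", "PF00072", "PF00106"],
    "genome_C": ["PF00001", "PF00106", "PF00271"],
}
corpus = [TaggedDocument(words=domains, tags=[name]) for name, domains in genomes.items()]

model = Doc2Vec(corpus, vector_size=32, window=5, min_count=1, epochs=50)

# Dense "functional" vectors per genome; their similarity complements
# nucleotide-based distance between genomes.
print(model.dv["genome_A"][:5])
print(model.dv.most_similar("genome_A"))
```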


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Zhichao Hu ◽  
Likun Liu ◽  
Haining Yu ◽  
Xiangzhan Yu

Cybersecurity has become an important part of our daily lives. Accordingly, there has been much research on intrusion detection based on host system calls in recent years. Compared to sentences, a sequence of system calls has unique characteristics. It contains implicit pattern relationships that are less sensitive to the order of occurrence and that have less impact on the classification results when the frequency of system calls varies slightly. There are also various properties such as resource consumption, execution time, predefined rules, and empirical weights of system calls. Commonly used word embedding methods, such as BoW, TF-IDF, N-Gram, and Word2Vec, do not fully exploit such relationships in sequences, nor do they conveniently support attribute expansion. To solve these problems, we introduce Graph Representation based Intrusion Detection (GRID), an intrusion detection framework based on graph representation learning. It captures the potential relationships between system calls to learn better features, and it is applicable to a wide range of back-end classifiers. GRID utilizes a new sequence embedding method, Graph Random State Embedding (GRSE), which uses graph structures to model a finite number of sequence items and represent the structural association relationships between them. A more efficient representation of sequence embeddings is generated by random walks, word embeddings, and graph pooling. Moreover, it can be easily extended to sequences with attributes. Our experimental results on the ADFA-LD dataset show that GRID achieves an average improvement of 2% using the GRSE embedding method compared to other embedding methods.
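GRID/GRSE's implementation is not reproduced here; the sketch below only illustrates the pipeline named in the abstract: build a transition graph over the system calls of a trace, generate random walks, learn node embeddings with Word2Vec, and mean-pool them into one sequence vector. The parameters and the toy trace are illustrative assumptions.

```python
import random
import networkx as nx
import numpy as np
from gensim.models import Word2Vec

def sequence_to_graph(calls):
    """Directed graph whose nodes are system calls and edges are observed transitions."""
    G = nx.DiGraph()
    for a, b in zip(calls, calls[1:]):
        G.add_edge(a, b)
    return G

def random_walks(G, num_walks=10, walk_len=8, seed=0):
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in G.nodes():
            walk = [start]
            for _ in range(walk_len - 1):
                nbrs = list(G.successors(walk[-1]))
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks

def embed_sequence(calls, dim=16):
    G = sequence_to_graph(calls)
    walks = random_walks(G)
    w2v = Word2Vec(sentences=walks, vector_size=dim, window=3, min_count=1, epochs=20)
    return np.mean([w2v.wv[c] for c in G.nodes()], axis=0)   # graph pooling: mean of node vectors

trace = ["open", "read", "read", "mmap", "read", "write", "close"]
print(embed_sequence(trace).shape)   # (16,)
```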

