Fusion of text and graph information for machine learning problems on networks

2021 ◽  
Vol 7 ◽  
pp. e526
Author(s):  
Ilya Makarov ◽  
Mikhail Makarov ◽  
Dmitrii Kiselev

Today, increased attention is drawn towards network representation learning, a technique that maps the nodes of a network into vectors in a low-dimensional embedding space. A network embedding constructed this way aims to preserve node similarity and other specific network properties. The embedding vectors can later be used for downstream machine learning problems, such as node classification, link prediction and network visualization. Naturally, some networks have text information associated with them. For instance, in a citation network, each node is a scientific paper associated with its abstract or title; in a social network, users may be viewed as nodes and each user's posts as textual attributes. In this work, we explore how combining existing methods of text and network embeddings can increase accuracy for downstream tasks, and we propose modifications to popular architectures to better capture textual information in network embedding and fusion frameworks.
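
A minimal late-fusion sketch of the general idea, assuming a structural node embedding (e.g., from a node2vec-style method) and a text embedding of each node's attached document are already available; the arrays `structural_emb` and `text_emb` below are synthetic stand-ins, and concatenation is only one of several possible fusion strategies rather than the paper's specific architecture.

```python
# Late-fusion baseline: concatenate structural and text embeddings per node,
# then feed the fused vectors to a downstream node classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_nodes, d_graph, d_text = 100, 64, 32

# Stand-ins for embeddings produced by, e.g., node2vec and a sentence encoder.
structural_emb = rng.normal(size=(n_nodes, d_graph))
text_emb = rng.normal(size=(n_nodes, d_text))
labels = rng.integers(0, 3, size=n_nodes)

# Fuse by concatenation; the fused vectors serve as node features.
fused = np.hstack([structural_emb, text_emb])

clf = LogisticRegression(max_iter=1000).fit(fused[:80], labels[:80])
print("node classification accuracy:", clf.score(fused[80:], labels[80:]))
```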

Author(s):  
Yu Li ◽  
Ying Wang ◽  
Tingting Zhang ◽  
Jiawei Zhang ◽  
Yi Chang

Network embedding is an effective approach to learning low-dimensional representations of vertices in networks, aiming to capture and preserve the structure and inherent properties of the networks. The vast majority of existing network embedding methods focus exclusively on vertex proximity and ignore the network's internal community structure. However, the homophily principle indicates that vertices within the same community are more similar to each other than those from different communities, and thus vertices within the same community should have similar representations. Motivated by this, we propose NECS, a novel framework that learns Network Embedding with Community Structural information, preserving the high-order proximity and incorporating the community structure in vertex representation learning. We formulate the problem as a principled optimization framework and provide an effective alternating algorithm to solve it. Extensive experimental results on several benchmark network datasets demonstrate the effectiveness of the proposed framework in various network analysis tasks including network reconstruction, link prediction and vertex classification.
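
As a rough illustration of how a community-regularized objective differs from plain proximity factorization (this is not the NECS algorithm itself), the toy sketch below factorizes a high-order proximity matrix while adding a penalty that pulls same-community vertices together; the community labels come from a standard modularity-based detector, and the dimensions, learning rate, and penalty weight `lam` are arbitrary.

```python
# Toy community-aware embedding: fit a proximity matrix with U U^T while
# penalizing distance between vertices detected to share a community.
import numpy as np
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)

# High-order proximity: row-normalized adjacency plus its square, symmetrized.
P = A / A.sum(axis=1, keepdims=True)
S = P + P @ P
S = (S + S.T) / 2

# Community labels from a standard detector (a stand-in for a joint step).
comms = community.greedy_modularity_communities(G)
label = np.zeros(len(G), dtype=int)
for c, nodes in enumerate(comms):
    for v in nodes:
        label[v] = c
same = (label[:, None] == label[None, :]).astype(float)
L = np.diag(same.sum(axis=1)) - same      # Laplacian of the "same community" graph

d, lam, lr = 8, 0.1, 0.005
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(len(G), d))

for _ in range(200):
    # Gradient of ||S - U U^T||_F^2 + lam * tr(U^T L U): fit proximity while
    # pulling same-community vertices toward each other.
    grad = -4 * (S - U @ U.T) @ U + 2 * lam * L @ U
    U -= lr * grad

print("embedding shape:", U.shape)
```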


2021 ◽  
Vol 25 (3) ◽  
pp. 711-738
Author(s):  
Phu Pham ◽  
Phuc Do

Link prediction on heterogeneous information networks (HINs) is considered a challenging problem due to the complexity and diversity of node and link types. Several challenges remain for meta-path-based link prediction in HINs. Previous work on link prediction in HINs via network embedding has mainly focused on exploiting node features rather than the relations between nodes that exist in the form of meta-paths. As a result, predictions of new links between previously unlinked nodes can be unconvincing. Moreover, recent HIN-based embedding models also lack thorough evaluation of the topic similarity between text-based nodes along given meta-paths. To tackle these challenges, in this paper we propose a novel topic-driven multiple meta-path-based HIN representation learning framework, named W-MMP2Vec. Our model improves the quality of node representations by combining multiple meta-paths and by calculating a topic similarity weight for each meta-path during network embedding learning in content-based HINs. To validate our approach, we apply the W-TMP2Vec model to several link prediction tasks in both content-based and non-content-based HINs (DBLP, IMDB and BlogCatalog). The experimental results demonstrate the effectiveness of the proposed model, which outperforms recent state-of-the-art HIN representation learning models.
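
For readers unfamiliar with meta-path-guided embedding, the sketch below shows the generic mechanism such models build on: random walks constrained to follow a node-type pattern (here a hypothetical A-P-A meta-path on a toy author-paper graph), fed to a Skip-Gram model. It does not implement the topic-similarity weighting or multi-meta-path combination that distinguish the proposed framework.

```python
# Meta-path-guided random walks on a small heterogeneous graph, embedded
# with Skip-Gram (gensim Word2Vec).
import random
import networkx as nx
from gensim.models import Word2Vec

random.seed(0)
G = nx.Graph()
G.add_nodes_from(["a1", "a2", "a3"], ntype="A")   # authors
G.add_nodes_from(["p1", "p2"], ntype="P")         # papers
G.add_edges_from([("a1", "p1"), ("a2", "p1"), ("a2", "p2"), ("a3", "p2")])

def metapath_walk(G, start, pattern, length):
    """Random walk that only steps to neighbours whose type matches the meta-path."""
    walk, cur = [start], start
    for i in range(1, length):
        want = pattern[i % len(pattern)]
        nbrs = [v for v in G[cur] if G.nodes[v]["ntype"] == want]
        if not nbrs:
            break
        cur = random.choice(nbrs)
        walk.append(cur)
    return walk

# Walks following the alternating author-paper pattern of the A-P-A meta-path.
walks = [metapath_walk(G, a, "AP", 9) for a in ["a1", "a2", "a3"] for _ in range(20)]
model = Word2Vec(walks, vector_size=16, window=3, min_count=1, sg=1, seed=0)
print("embedding of a1:", model.wv["a1"][:4])
```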


2021 ◽  
Vol 15 (4) ◽  
pp. 1-23
Author(s):  
Guojie Song ◽  
Yun Wang ◽  
Lun Du ◽  
Yi Li ◽  
Junshan Wang

Network embedding is a method of learning a low-dimensional vector representation of network vertices while preserving different types of network properties. Previous studies mainly focus on preserving the structural information of vertices at a particular scale, such as neighbor information or community information, but cannot preserve the hierarchical community structure, which would enable the network to be easily analyzed at various scales. Inspired by the hierarchical structure of galaxies, we propose the Galaxy Network Embedding (GNE) model, which formulates network embedding that preserves the hierarchical community structure as an optimization problem with spherical constraints. More specifically, we present an approach for embedding communities onto a low-dimensional spherical surface whose center represents the parent community they belong to. Our experiments reveal that the representations from GNE preserve the hierarchical community structure and show advantages in several applications such as vertex multi-class classification, network visualization, and link prediction. The source code of GNE is available online.
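
A toy sketch of the geometric picture behind GNE, not its optimization: each community's children are spread over a sphere (a circle in this 2-D illustration) centered at the parent community's embedding, so coarse communities constrain the placement of finer ones. The radii and the two-level hierarchy are arbitrary choices made for illustration.

```python
# Place sub-communities on circles around their parent community's center,
# mimicking the hierarchical, sphere-constrained layout idea.
import numpy as np

def place_on_sphere(center, radius, n_children):
    """Spread n_children evenly on a circle of given radius around center."""
    angles = np.linspace(0, 2 * np.pi, n_children, endpoint=False)
    return center + radius * np.c_[np.cos(angles), np.sin(angles)]

root = np.zeros(2)
level1 = place_on_sphere(root, 1.0, 3)            # three top-level communities
level2 = {i: place_on_sphere(c, 0.3, 4)           # four sub-communities each
          for i, c in enumerate(level1)}
for i, pts in level2.items():
    print(f"community {i} children:\n{pts.round(2)}")
```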


2021 ◽  
Author(s):  
Rogini Runghen ◽  
Daniel B Stouffer ◽  
Giulio Valentino Dalla Riva

Collecting network interaction data is difficult. Non-exhaustive sampling and complex hidden processes often result in an incomplete data set. Thus, identifying potentially present but unobserved interactions is crucial both for understanding the structure of large-scale data and for predicting how previously unseen elements will interact. Recent studies in network analysis have shown that accounting for metadata (such as node attributes) can improve both our understanding of how nodes interact with one another and the accuracy of link prediction. However, the dimension of the object we need to learn in order to predict interactions in a network grows quickly with the number of nodes, so the task becomes computationally and conceptually challenging for large networks. Here, we present a new predictive procedure combining a graph embedding method with machine learning techniques to predict interactions on the basis of node metadata. Graph embedding methods project the nodes of a network onto a low-dimensional latent feature space. The positions of the nodes in the latent feature space can then be used to predict interactions between nodes. Learning a mapping from the nodes' metadata to their positions in the latent feature space corresponds to a classic, low-dimensional machine learning problem. In our current study we used the Random Dot Product Graph model to estimate the embedding of an observed network, and we tested different neural network architectures to predict the positions of nodes in the latent feature space. Flexible machine learning techniques for mapping nodes onto their latent positions allow us to account for multivariate and possibly complex node metadata. To illustrate the utility of the proposed procedure, we apply it to a large dataset of tourist visits to destinations across New Zealand. We found that our procedure accurately predicts interactions for both existing nodes and nodes newly added to the network, while being computationally feasible even for very large networks. Overall, our study highlights that by exploiting the properties of a well-understood statistical model for complex networks and combining it with standard machine learning techniques, we can simplify the link prediction problem when incorporating multivariate node metadata. Our procedure can be immediately applied to different types of networks and to a wide variety of data from different systems. As such, from both a network science and a data science perspective, our work offers a flexible and generalisable procedure for link prediction.
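
A hedged sketch of the two-stage pipeline described above: estimate RDPG latent positions via an adjacency spectral embedding, then learn a map from node metadata to those positions, so that new nodes can be embedded from metadata alone and edge scores recovered as dot products. The karate-club graph, the toy metadata, and the small MLPRegressor are stand-ins for the tourism network and the neural architectures used in the study.

```python
# RDPG-style adjacency spectral embedding + metadata-to-latent-position regressor.
import numpy as np
import networkx as nx
from sklearn.neural_network import MLPRegressor

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)

# Adjacency spectral embedding: scale the top-d eigenvectors of A.
d = 4
vals, vecs = np.linalg.eigh(A)
idx = np.argsort(np.abs(vals))[::-1][:d]
latent = vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

# Toy metadata: degree and club membership; in practice, any node attributes.
meta = np.c_[[G.degree(v) for v in G],
             [1.0 if G.nodes[v]["club"] == "Officer" else 0.0 for v in G]]

# Learn metadata -> latent position.
reg = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
reg.fit(meta, latent)

# Predicted edge score between two (possibly unseen) nodes is a dot product.
pos = reg.predict(meta)
print("score(0, 33):", float(pos[0] @ pos[33]))
```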


Complexity ◽  
2020 ◽  
Vol 2020 ◽  
pp. 1-14
Author(s):  
Dong Liu ◽  
Yan Ru ◽  
Qinpeng Li ◽  
Shibin Wang ◽  
Jianwei Niu

Network embedding aims to learn low-dimensional representations of nodes in networks. It preserves the structure and internal attributes of the networks while representing nodes as low-dimensional dense real-valued vectors. These vectors are used as inputs to machine learning algorithms for network analysis tasks such as node clustering, classification, link prediction, and network visualization. Network embedding algorithms that consider the community structure impose a higher-level constraint on node similarity and make the learned node embeddings more discriminative. However, existing network representation learning algorithms are mostly unsupervised models; the pairwise constraint information, which represents community membership, is not effectively utilized to obtain node embeddings that are more consistent with prior knowledge. This paper proposes SMNMF, a semisupervised modularized nonnegative matrix factorization model that preserves the community structure for network embedding; the pairwise constraint (must-link and cannot-link) information is effectively fused with the adjacency matrix and the node similarity matrix of the network so that the node representations learned by the model are more interpretable. Experimental results on eight real network datasets show that, compared with representative network embedding methods, the node representations learned after incorporating the pairwise constraints achieve higher accuracy in the node clustering task, and the results on link prediction and network visualization tasks indicate that the semisupervised SMNMF model is more discriminative than unsupervised ones.
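
The sketch below illustrates only the constraint-fusion step in isolation: must-link pairs strengthen, and cannot-link pairs suppress, entries of the node similarity matrix before a plain nonnegative matrix factorization. It is a simplified stand-in for SMNMF, which additionally incorporates modularity and solves a joint semisupervised objective; the example pairs are chosen arbitrarily.

```python
# Fuse must-link / cannot-link constraints into a similarity matrix, then factorize.
import numpy as np
import networkx as nx
from sklearn.decomposition import NMF

G = nx.karate_club_graph()
S = nx.to_numpy_array(G) + np.eye(len(G))   # adjacency plus self-similarity

must_link = [(0, 1), (33, 32)]
cannot_link = [(0, 33)]
for i, j in must_link:
    S[i, j] = S[j, i] = S[i, j] + 1.0       # strengthen same-community pairs
for i, j in cannot_link:
    S[i, j] = S[j, i] = 0.0                 # suppress cross-community similarity

model = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
U = model.fit_transform(S)                  # node embeddings / soft memberships
print("predicted community of node 0:", int(U[0].argmax()))
```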


2020 ◽  
Author(s):  
Jing Qian ◽  
Gangmin Li ◽  
Katie Atkinson ◽  
Yong Yue

Knowledge representation learning (KRL) aims at encoding the components of a knowledge graph (KG) into a low-dimensional continuous space, and it has brought considerable success in applying deep learning to graph embedding. Most well-known KGs contain only positive instances, for space efficiency. Typical KRL techniques, especially translational distance-based models, are trained by discriminating between positive and negative samples. Thus, negative sampling is unquestionably a non-trivial step in KG embedding. The quality of generated negative samples can directly influence the performance of the final knowledge representations in downstream tasks, such as link prediction and triple classification. This review summarizes current negative sampling methods in KRL and categorizes them into three groups: fixed-distribution-based, generative adversarial network (GAN)-based, and cluster sampling. Based on this categorization, we discuss the most prevalent existing approaches and their characteristics.
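
As a concrete instance of the simplest fixed-distribution strategy, the sketch below corrupts a positive triple (h, r, t) by replacing its head or tail with a uniformly sampled entity, filtering out accidental positives. The toy entities and triples are invented for illustration; GAN-based or cluster-based samplers would replace the uniform choice with learned or structured proposals.

```python
# Uniform negative sampling by head/tail corruption of knowledge-graph triples.
import random

entities = ["paris", "france", "berlin", "germany", "tokyo", "japan"]
positives = {("paris", "capital_of", "france"),
             ("berlin", "capital_of", "germany"),
             ("tokyo", "capital_of", "japan")}

def corrupt(triple, entities, positives):
    h, r, t = triple
    while True:
        if random.random() < 0.5:
            cand = (random.choice(entities), r, t)   # corrupt the head
        else:
            cand = (h, r, random.choice(entities))   # corrupt the tail
        if cand not in positives:                    # keep only true negatives
            return cand

random.seed(0)
for pos in positives:
    print(pos, "->", corrupt(pos, entities, positives))
```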


2020 ◽  
Vol 34 (04) ◽  
pp. 3357-3364
Author(s):  
Abdulkadir Celikkanat ◽  
Fragkiskos D. Malliaros

Representing networks in a low-dimensional latent space is a crucial task with many interesting applications in graph learning problems, such as link prediction and node classification. A widely applied network representation learning paradigm is based on the combination of random walks for sampling context nodes and the traditional Skip-Gram model to capture center-context node relationships. In this paper, we focus on exponential family distributions to capture rich interaction patterns between nodes in random walk sequences. We introduce a generic exponential family graph embedding model that generalizes random walk-based network representation learning techniques to exponential family conditional distributions. We study three particular instances of this model, analyzing their properties and showing their relationship to existing unsupervised learning models. Our experimental evaluation on real-world datasets demonstrates that the proposed techniques outperform well-known baseline methods in two downstream machine learning tasks.
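
To make the center-context idea concrete, the toy sketch below collects co-occurrence counts from plain random walks and fits them with one possible exponential family conditional, a Poisson rate exp(u_i · v_j), using a few gradient steps. This is only an illustration of the paradigm, not one of the specific instances analyzed in the paper, and the walk, window, and learning-rate settings are arbitrary.

```python
# Poisson-rate fit of center-context co-occurrence counts from random walks.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
G = nx.karate_club_graph()
n, d, window, walk_len = len(G), 8, 2, 40

# Center-context co-occurrence counts from uniform random walks.
C = np.zeros((n, n))
for start in G:
    walk, cur = [start], start
    for _ in range(walk_len - 1):
        cur = int(rng.choice(list(G[cur])))
        walk.append(cur)
    for i, u in enumerate(walk):
        for v in walk[max(0, i - window): i]:
            C[u, v] += 1
            C[v, u] += 1

U = rng.normal(scale=0.1, size=(n, d))   # center embeddings
V = rng.normal(scale=0.1, size=(n, d))   # context embeddings
lr = 0.001
for _ in range(300):
    rate = np.exp(np.clip(U @ V.T, -10, 10))  # Poisson rate per (center, context)
    grad = C - rate                            # gradient of the log-likelihood w.r.t. U V^T
    U, V = U + lr * grad @ V, V + lr * grad.T @ U
print("embedding of node 0:", U[0].round(3))
```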


2019 ◽  
Author(s):  
Min Oh ◽  
Liqing Zhang

Human microbiota plays a key role in human health, and growing evidence supports the potential use of the microbiome as a predictor of various diseases. However, the high dimensionality of microbiome data, often on the order of hundreds of thousands of features, combined with low sample sizes, poses a great challenge for machine learning-based prediction algorithms. This imbalance makes the data highly sparse and prevents the learning of a better prediction model. Also, there has been little work on applying deep learning to microbiome data with a rigorous evaluation scheme. To address these challenges, we propose DeepMicro, a deep representation learning framework allowing for an effective representation of microbiome profiles. DeepMicro successfully transforms high-dimensional microbiome data into a robust low-dimensional representation using various autoencoders and applies machine learning classification algorithms on the learned representation. In disease prediction, DeepMicro outperforms the current best approaches based on the strain-level marker profile in five different datasets. In addition, by significantly reducing the dimensionality of the marker profile, DeepMicro accelerates the model training and hyperparameter optimization procedure with an 8X-30X speedup over the basic approach. DeepMicro is freely available at https://github.com/minoh0201/DeepMicro.
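
A hedged sketch of the pipeline shape, assuming synthetic data: compress high-dimensional profiles with a small autoencoder, then train an ordinary classifier on the learned codes. The layer sizes, the plain MLP autoencoder, and the logistic-regression classifier are illustrative choices; DeepMicro evaluates several autoencoder variants and classifiers on real marker profiles with a rigorous evaluation scheme.

```python
# Autoencoder-based dimensionality reduction followed by a standard classifier.
import numpy as np
import torch
from torch import nn
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 1000)).astype(np.float32)   # stand-in high-dimensional profiles
y = rng.integers(0, 2, size=200)                  # stand-in disease labels

encoder = nn.Sequential(nn.Linear(1000, 64), nn.ReLU(), nn.Linear(64, 16))
decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1000))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
xt = torch.from_numpy(X)

for _ in range(100):                              # reconstruction training
    opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(encoder(xt)), xt)
    loss.backward()
    opt.step()

codes = encoder(xt).detach().numpy()              # learned low-dimensional representation
clf = LogisticRegression(max_iter=1000).fit(codes[:150], y[:150])
print("held-out accuracy:", clf.score(codes[150:], y[150:]))
```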


2020 ◽  
Author(s):  
Mustafa Coşkun ◽  
Mehmet Koyutürk

Motivation: Link prediction is an important and well-studied problem in computational biology, with a broad range of applications including disease gene prioritization, drug-disease associations, and drug response in cancer. The general principle in link prediction is to use the topological characteristics and the attributes of the nodes in the network, if available, to predict new links that are likely to emerge or disappear. Recently, graph representation learning methods, which aim to learn a low-dimensional representation of the topological characteristics and attributes of the nodes, have drawn increasing attention for solving the link prediction problem via learned low-dimensional features. Most prominently, Graph Convolution Network (GCN)-based network embedding methods have demonstrated great promise in link prediction due to their ability to capture non-linear information of the network. To date, GCN-based network embedding algorithms have used a Laplacian matrix as the convolution matrix in their convolution layers, and the effect of the convolution matrix on algorithm performance has not been comprehensively characterized in the context of link prediction in biomedical networks. On the other hand, for a variety of biomedical link prediction tasks, traditional node similarity measures such as Common Neighbors, Adamic-Adar, and others have shown promising results, and hence there is a need to systematically evaluate node similarity measures as convolution matrices in terms of their usability and potential to further the state of the art. Results: We select 8 representative node similarity measures as convolution matrices within the single-layered GCN graph embedding method and conduct a systematic comparison on 3 important biomedical link prediction tasks: drug-disease association (DDA) prediction, drug-drug interaction (DDI) prediction, and protein-protein interaction (PPI) prediction. Our experimental results demonstrate that node similarity-based convolution matrices significantly improve GCN-based embedding algorithms and deserve more attention in future biomedical link prediction. Availability: Our method is implemented as a python library and is available at [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.
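
A minimal sketch of the comparison's core idea, assuming a featureless toy graph: in a single-layer GCN-style encoder Z = act(C X W), swap the usual normalized adjacency for a node similarity matrix (Adamic-Adar here) and score candidate links with inner products of the resulting embeddings. The random, untrained weight matrix W is only there to show the plumbing; the actual study trains the layer end-to-end and evaluates eight similarity measures.

```python
# Single-layer GCN-style embedding with an Adamic-Adar convolution matrix.
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
n = len(G)

# Adamic-Adar similarity matrix as the convolution matrix C.
C = np.zeros((n, n))
pairs = [(u, v) for u in G for v in G if u < v]
for u, v, s in nx.adamic_adar_index(G, pairs):
    C[u, v] = C[v, u] = s
C = C / (C.sum(axis=1, keepdims=True) + 1e-9)    # row-normalize

X = np.eye(n)                                     # featureless: identity input
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n, 16))           # layer weights (untrained)
Z = np.tanh(C @ X @ W)                            # one graph convolution

score = Z @ Z.T                                   # link scores for all node pairs
print("score(0, 33) vs score(0, 1):", round(score[0, 33], 3), round(score[0, 1], 3))
```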


Author(s):  
Qixiang Wang ◽  
Shanfeng Wang ◽  
Maoguo Gong ◽  
Yue Wu

The goal of network representation learning is to embed nodes so as to encode the proximity structure of a graph into a continuous low-dimensional feature space. In this paper, we propose a novel algorithm called node2hash, based on feature hashing, for generating node embeddings. The approach follows the encoder-decoder framework, with two main mapping functions. The first is an encoder that maps each node into a high-dimensional vector. The second is a decoder that hashes these vectors into a lower-dimensional feature space. More specifically, we first derive a proximity measure called expected distance, which combines the position distribution and co-occurrence statistics of nodes over random walks, to build a proximity matrix; we then introduce a set of T different hash functions into feature hashing to generate uniformly distributed vector representations of nodes from the proximity matrix. Compared with existing state-of-the-art network representation learning approaches, node2hash shows competitive performance on multi-class node classification and link prediction tasks on three real-world networks from various domains.
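
A toy sketch of the hashing (decoder) step, assuming raw random-walk co-occurrence counts as each node's high-dimensional proximity profile; a single scikit-learn FeatureHasher stands in for the set of T hash functions, and the expected-distance proximity measure defined in the paper is not reproduced here.

```python
# Hash high-dimensional node proximity profiles down to fixed-size embeddings.
import random
import networkx as nx
from sklearn.feature_extraction import FeatureHasher

random.seed(0)
G = nx.karate_club_graph()

# High-dimensional proximity profiles from simple random walks.
profiles = {v: {} for v in G}
for start in G:
    cur = start
    for _ in range(40):
        cur = random.choice(list(G[cur]))
        key = f"n{cur}"
        profiles[start][key] = profiles[start].get(key, 0) + 1

hasher = FeatureHasher(n_features=16, input_type="dict")
emb = hasher.transform([profiles[v] for v in G]).toarray()   # 34 x 16 embeddings
print("embedding of node 0:", emb[0])
```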

