Survey on graph embeddings and their applications to machine learning problems on graphs

2021 ◽  
Vol 7 ◽  
pp. e357
Author(s):  
Ilya Makarov ◽  
Dmitrii Kiselev ◽  
Nikita Nikitinsky ◽  
Lovro Subelj

Dealing with relational data has always required significant computational resources, domain expertise and task-dependent feature engineering to incorporate structural information into a predictive model. Nowadays, a family of automated graph feature engineering techniques has been proposed in different streams of literature. So-called graph embeddings provide a powerful tool to construct vectorized feature spaces for graphs and their components, such as nodes, edges and subgraphs, while preserving inner graph properties. Using the constructed feature spaces, many machine learning problems on graphs can be solved via standard frameworks suitable for vectorized feature representations. Our survey aims to describe the core concepts of graph embeddings and provide several taxonomies for their description. First, we start with the methodological approach and extract three types of graph embedding models, based on matrix factorization, random walks and deep learning. Next, we describe how different types of networks impact the ability of models to incorporate structural and attributed data into a unified embedding. Going further, we perform a thorough evaluation of graph embedding applications to machine learning problems on graphs, among which are node classification, link prediction, clustering, visualization, compression, and a family of whole-graph embedding algorithms suitable for graph classification, similarity and alignment problems. Finally, we overview the existing applications of graph embeddings to computer science domains, formulate open problems and provide experimental results, explaining how different network properties affect graph embedding quality in four classic machine learning problems on graphs: node classification, link prediction, clustering and graph visualization. As a result, our survey covers the new, rapidly growing field of network feature engineering, presents an in-depth analysis of models based on network types, and overviews a wide range of applications to machine learning problems on graphs.
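As an illustration of the matrix-factorization family of embedding models mentioned in this survey, the following minimal sketch builds a toy embedding from a truncated SVD of an adjacency matrix and reuses it for node classification; the graph, labels and embedding dimension are assumptions for illustration and do not come from the paper.

```python
# A matrix-factorization graph embedding: truncated SVD of the adjacency matrix,
# whose rows then serve as node features for a standard classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Symmetric adjacency matrix of a small undirected toy graph (assumed example).
n = 8
A = np.zeros((n, n))
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3), (6, 7), (5, 6)]
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Keep the top-d singular vectors of A as a d-dimensional embedding per node.
d = 3
U, S, _ = np.linalg.svd(A)
embedding = U[:, :d] * np.sqrt(S[:d])

# Node classification on the vectorized feature space (toy labels assumed).
labels = np.array([0, 0, 0, 1, 1, 1, 1, 1])
train_idx = np.array([0, 1, 3, 4, 6])
clf = LogisticRegression().fit(embedding[train_idx], labels[train_idx])
print(clf.predict(embedding))   # predictions for all nodes
```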

2021 ◽  
Author(s):  
Rogini Runghen ◽  
Daniel B Stouffer ◽  
Giulio Valentino Dalla Riva

Collecting network interaction data is difficult. Non-exhaustive sampling and complex hidden processes often result in an incomplete data set. Thus, identifying potentially present but unobserved interactions is crucial both for understanding the structure of large-scale data and for predicting how previously unseen elements will interact. Recent studies in network analysis have shown that accounting for metadata (such as node attributes) can improve both our understanding of how nodes interact with one another and the accuracy of link prediction. However, the dimension of the object we need to learn in order to predict interactions in a network grows quickly with the number of nodes, so the task becomes computationally and conceptually challenging for large networks. Here, we present a new predictive procedure combining a graph embedding method with machine learning techniques to predict interactions on the basis of nodes' metadata. Graph embedding methods project the nodes of a network onto a low-dimensional latent feature space. The positions of the nodes in the latent feature space can then be used to predict interactions between nodes. Learning a mapping from the nodes' metadata to their positions in the latent feature space corresponds to a classic, low-dimensional machine learning problem. In our current study we used the Random Dot Product Graph model to estimate the embedding of an observed network, and we tested different neural network architectures to predict the positions of nodes in the latent feature space. Flexible machine learning techniques for mapping the nodes onto their latent positions make it possible to account for multivariate and possibly complex node metadata. To illustrate the utility of the proposed procedure, we apply it to a large dataset of tourist visits to destinations across New Zealand. We found that our procedure accurately predicts interactions for both existing nodes and nodes newly added to the network, while remaining computationally feasible even for very large networks. Overall, our study highlights that by exploiting the properties of a well-understood statistical model for complex networks and combining it with standard machine learning techniques, we can simplify the link prediction problem when incorporating multivariate node metadata. Our procedure can be immediately applied to different types of networks and to a wide variety of data from different systems. As such, both from a network science and a data science perspective, our work offers a flexible and generalisable procedure for link prediction.
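The following minimal sketch illustrates the two-step procedure described above: (1) an adjacency spectral embedding as an estimator of Random Dot Product Graph latent positions, and (2) a regressor (here scikit-learn's MLPRegressor, standing in for the neural network architectures tested in the paper) that maps node metadata to those positions so that new nodes can be placed and scored. The toy graph, metadata and dimensions are assumptions for illustration only.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
n, d = 30, 2

# Toy data: node metadata and an observed adjacency matrix whose structure is
# loosely driven by one metadata attribute (assumed example).
metadata = rng.normal(size=(n, 3))
group = (metadata[:, 0] > 0).astype(int)
P = np.where(group[:, None] == group[None, :], 0.4, 0.05)
A = (rng.uniform(size=(n, n)) < P).astype(float)
A = np.triu(A, 1); A = A + A.T

# Step 1: estimate RDPG latent positions by truncated SVD of A.
U, S, _ = np.linalg.svd(A)
latent = U[:, :d] * np.sqrt(S[:d])

# Step 2: learn metadata -> latent position, so unseen nodes can be placed too.
reg = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
reg.fit(metadata, latent)

# Predicted interaction propensity between a new node and the existing nodes:
new_meta = rng.normal(size=(1, 3))
new_pos = reg.predict(new_meta)
scores = latent @ new_pos.ravel()   # dot products = RDPG link propensities
print(scores[:5])
```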


Author(s):  
Jun Yuan ◽  
Neng Gao ◽  
Ji Xiang

Embedding knowledge graphs (KGs) into continuous vector spaces is an essential problem in knowledge extraction. Current models keep improving embeddings by focusing on discriminating relation-specific information from entities with increasingly complex feature engineering. We note that they ignore the inherent relevance between relations and try to learn a unique discriminative parameter set for each relation. Thus, these models potentially suffer from high time complexity and large numbers of parameters, preventing them from being applied efficiently to real-world KGs. In this paper, we follow the idea of parameter sharing to simultaneously learn more expressive features, reduce the number of parameters and avoid complex feature engineering. Based on the gate structure of the LSTM, we propose a novel model, TransGate, and develop a shared discriminative mechanism, resulting in almost the same space complexity as indiscriminate models. Furthermore, to develop a more effective and scalable model, we reconstruct the gate with weight vectors, giving our method time complexity comparable to that of indiscriminate models. We conduct extensive experiments on link prediction and triplet classification. The experiments show that TransGate not only outperforms state-of-the-art baselines, but also greatly reduces the number of parameters. For example, TransGate outperforms ConvE and RGCN with 6x and 17x fewer parameters, respectively. These results indicate that parameter sharing is a superior way to further optimize embeddings and that TransGate finds a better trade-off between complexity and expressivity.
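The sketch below illustrates a TransGate-style scoring function following the description above: entity embeddings are filtered through a shared, LSTM-like gate parameterized with weight vectors (element-wise) rather than per-relation matrices, and the gated entities are scored translationally (h + r close to t). The exact parameterization and training objective in the paper may differ; dimensions, initialization and the example triple are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 50

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Shared gate parameters as weight vectors, keeping space cost near TransE's.
v_e, v_r, b = rng.normal(size=dim), rng.normal(size=dim), np.zeros(dim)

def gate(entity, relation):
    """Element-wise gate selecting the entity features relevant to the relation."""
    return sigmoid(v_e * entity + v_r * relation + b)

def score(h, r, t):
    """Higher (less negative) score means the triple (h, r, t) is more plausible."""
    h_r = h * gate(h, r)   # relation-specific view of the head
    t_r = t * gate(t, r)   # relation-specific view of the tail
    return -np.linalg.norm(h_r + r - t_r, ord=1)

# Example with randomly initialized (untrained) embeddings for one triple.
h, r, t = rng.normal(size=(3, dim))
print(score(h, r, t))
```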


Author(s):  
Alessandro Muscoloni ◽  
Umberto Michieli ◽  
Carlo Vittorio Cannistraci

Many complex networks have a connectivity that might be only partially detected or that tends to grow over time, hence the prediction of non-observed links is a fundamental problem in network science. The aim of topological link prediction is to forecast these non-observed links by exploiting only features intrinsic to the network topology. It has a wide range of real applications, like suggesting friendships in social networks or predicting interactions in biological networks. The Cannistraci-Hebb theory is a recent achievement in network science that includes a theoretical framework to understand local-based link prediction on paths of length n. In this study we introduce two innovations: a theory of modelling (science) and a theory of realization (engineering). For the theory of modelling, we first recall a definition of network automata as a general framework for modelling the growth of connectivity in complex networks. We then show that several previously developed deterministic models fall within this framework, and we introduce novel network automata following the Cannistraci-Hebb rule. For the theory of realization, we present how to build adaptive network automata for link prediction, which incorporate multiple deterministic models of self-organization and automatically choose the rule that best explains the patterns of connectivity in the network under investigation. We compare the Cannistraci-Hebb adaptive (CHA) network automaton against state-of-the-art link prediction methods such as the structural perturbation method (SPM), stochastic block models (SBM) and artificial intelligence algorithms for graph embedding. CHA displays an overall higher link prediction performance across different evaluation frameworks on 1386 networks. Finally, we highlight that CHA offers the key advantage of explicitly explaining the mechanistic rule of self-organization that leads to the link prediction performance, whereas SPM and graph embedding do not. In comparison to CHA, SBM unfortunately shows irrelevant and unsatisfactory performance, demonstrating that SBM modelling is not adequate for link prediction in real networks.
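To make the notion of local, path-based topological link prediction concrete, the following sketch scores unconnected node pairs by a simple degree-normalized count of paths of length 3. This is not the Cannistraci-Hebb adaptive automaton itself, only an illustrative local predictor on L3 paths in the same spirit; the toy graph is an assumption for illustration.

```python
import numpy as np

# Adjacency matrix of a small undirected toy graph.
n = 6
A = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (1, 4), (0, 2)]:
    A[i, j] = A[j, i] = 1.0
deg = A.sum(axis=1)

def l3_score(u, v):
    """Sum over length-3 paths u-i-j-v, down-weighting high-degree intermediates."""
    total = 0.0
    for i in range(n):
        for j in range(n):
            if A[u, i] and A[i, j] and A[j, v] and i != v and j != u:
                total += 1.0 / np.sqrt(deg[i] * deg[j])
    return total

# Score all non-observed node pairs and rank them as link candidates.
candidates = [(u, v) for u in range(n) for v in range(u + 1, n) if not A[u, v]]
ranked = sorted(candidates, key=lambda p: l3_score(*p), reverse=True)
print(ranked[:3])
```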


Drones ◽  
2021 ◽  
Vol 5 (4) ◽  
pp. 104
Author(s):  
Zaide Duran ◽  
Kubra Ozcan ◽  
Muhammed Enes Atik

With the development of photogrammetry technologies, point clouds have found a wide range of uses in academic and commercial areas. This situation has made it essential to extract information from point clouds. In particular, artificial intelligence applications have been used to extract information from point clouds of complex structures. Point cloud classification is one of the leading areas where these applications are used. In this study, the classification of point clouds obtained by aerial photogrammetry and Light Detection and Ranging (LiDAR) technology belonging to the same region is performed by using machine learning. For this purpose, nine popular machine learning methods have been used. Geometric features obtained from the point clouds were used to build the feature spaces for classification; colour information was additionally included for the photogrammetric point cloud. For the LiDAR point cloud, the highest overall accuracy was 0.96, obtained with the Multilayer Perceptron (MLP) method, and the lowest was 0.50, obtained with the AdaBoost method. For the photogrammetric point cloud, the highest overall accuracy was again achieved with MLP (0.90), while the lowest was 0.25, obtained with the Gaussian Naive Bayes (GNB) method.
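The sketch below illustrates the kind of pipeline described above: eigenvalue-based geometric features are computed from each point's local neighbourhood and fed to a Multilayer Perceptron classifier. The synthetic points, labels, neighbourhood size and feature set are assumptions for illustration and may differ from those used in the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)

# Synthetic point cloud: a flat "ground" patch and a noisy "vegetation" blob.
ground = np.c_[rng.uniform(0, 10, 200), rng.uniform(0, 10, 200), rng.normal(0, 0.02, 200)]
veg = rng.normal([5, 5, 2], [0.5, 0.5, 0.5], size=(200, 3))
points = np.vstack([ground, veg])
labels = np.r_[np.zeros(200, dtype=int), np.ones(200, dtype=int)]

# Geometric features from the eigenvalues of each local neighbourhood's covariance.
nn = NearestNeighbors(n_neighbors=15).fit(points)
_, idx = nn.kneighbors(points)
features = []
for neighborhood in points[idx]:
    w = np.sort(np.linalg.eigvalsh(np.cov(neighborhood.T)))[::-1] + 1e-12
    linearity = (w[0] - w[1]) / w[0]
    planarity = (w[1] - w[2]) / w[0]
    sphericity = w[2] / w[0]
    features.append([linearity, planarity, sphericity])
features = np.asarray(features)

# MLP classification on the geometric feature space.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
clf.fit(features, labels)
print("training accuracy:", clf.score(features, labels))
```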


Mathematics ◽  
2020 ◽  
Vol 8 (5) ◽  
pp. 662 ◽  
Author(s):  
Husein Perez ◽  
Joseph H. M. Tah

In the field of supervised machine learning, the quality of a classifier model is directly correlated with the quality of the data used to train it. The presence of unwanted outliers in the data could significantly reduce the accuracy of a model or, even worse, result in a biased model leading to inaccurate classification. Identifying the presence of outliers and eliminating them is, therefore, crucial for building good-quality training datasets. Pre-processing procedures for dealing with missing and outlier data, commonly known as feature engineering, are standard practice in machine learning problems. They help to make better assumptions about the data and also prepare datasets in a way that best exposes the underlying problem to the machine learning algorithms. In this work, we propose a multistage method for detecting and removing outliers in high-dimensional data. Our proposed method is based on utilising a technique called t-distributed stochastic neighbour embedding (t-SNE) to reduce a high-dimensional map of features into a lower, two-dimensional, probability density distribution, and then using a simple descriptive statistical method called the interquartile range (IQR) to identify any outlier values in the density distribution of the features. t-SNE is a machine learning algorithm and a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualisation in a low-dimensional space of two or three dimensions. We applied this method to a dataset containing images for training a convolutional neural network model (ConvNet) for an image classification problem. The dataset contains four different classes of images: three classes contain defects in construction (mould, stain, and paint deterioration) and one is a no-defect class (normal). We used the transfer learning technique to modify a pre-trained VGG-16 model, which we used both as a feature extractor and as a benchmark to evaluate our method. We have shown that, using this method, we can identify and remove the outlier images in the dataset. After removing the outlier images from the dataset and re-training the VGG-16 model, the results also show that the classification accuracy significantly improved and the number of misclassified cases dropped. While many feature engineering techniques for handling missing and outlier data are common in predictive machine learning problems involving numerical or categorical data, there is little work on techniques for handling outliers in high-dimensional data, which could improve the quality of machine learning problems involving images, such as ConvNet models for image classification and object detection.
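The following sketch shows the core of the outlier-removal idea described above: high-dimensional features (here random vectors standing in for the VGG-16 features of each image) are embedded into two dimensions with t-SNE, and points falling outside the interquartile-range fences on either embedded axis are flagged as outliers. The synthetic features and the 1.5 x IQR fences are assumptions for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(4)

# Stand-in for per-image deep features: one tight cluster plus a few stray samples.
features = np.vstack([rng.normal(0, 1, size=(200, 512)),
                      rng.normal(8, 1, size=(5, 512))])   # 5 likely outliers

# Reduce to a 2-D embedding with t-SNE.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)

# IQR fences per embedded dimension; anything outside either fence is an outlier.
q1, q3 = np.percentile(emb, 25, axis=0), np.percentile(emb, 75, axis=0)
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outlier_mask = np.any((emb < lo) | (emb > hi), axis=1)

print("flagged as outliers:", np.where(outlier_mask)[0])
clean_features = features[~outlier_mask]   # retrain the classifier on these
```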


2021 ◽  
Vol 54 (2) ◽  
pp. 1-36
Author(s):  
Bo Liu ◽  
Ming Ding ◽  
Sina Shaham ◽  
Wenny Rahayu ◽  
Farhad Farokhi ◽  
...  

The newly emerged machine learning (e.g., deep learning) methods have become a strong driving force revolutionizing a wide range of industries, such as smart healthcare, financial technology, and surveillance systems. Meanwhile, privacy has emerged as a big concern in this machine learning-based artificial intelligence era. It is important to note that the problem of privacy preservation in the context of machine learning is quite different from that in traditional data privacy protection, as machine learning can act as both friend and foe. Currently, work on the preservation of privacy and machine learning is still at an infancy stage, as most existing solutions only focus on privacy problems during the machine learning process. Therefore, a comprehensive study of privacy preservation problems and machine learning is required. This article surveys the state of the art in privacy issues and solutions for machine learning. The survey covers three categories of interactions between privacy and machine learning: (i) private machine learning, (ii) machine learning-aided privacy protection, and (iii) machine learning-based privacy attacks and the corresponding protection schemes. The current research progress in each category is reviewed and the key challenges are identified. Finally, based on our in-depth analysis of the area of privacy and machine learning, we point out future research directions in this field.


Semantic Web ◽  
2022 ◽  
pp. 1-24
Author(s):  
Jan Portisch ◽  
Nicolas Heist ◽  
Heiko Paulheim

Knowledge Graph Embeddings, i.e., projections of entities and relations to lower-dimensional spaces, have been proposed for two purposes: (1) providing an encoding for data mining tasks, and (2) predicting links in a knowledge graph. Both lines of research have so far been pursued largely in isolation from each other, each with its own benchmarks and evaluation methodologies. In this paper, we argue that both tasks are actually related, and we show that the first family of approaches can also be used for the second task and vice versa. In two series of experiments, we provide a comparison of both families of approaches on both tasks, which, to the best of our knowledge, has not been done so far. Furthermore, we discuss the differences in the similarity functions evoked by the different embedding approaches.
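The following minimal sketch illustrates the point that one and the same set of knowledge graph embeddings can serve both purposes. Randomly initialized (untrained) entity and relation vectors stand in for the output of any embedding model; a TransE-style distance ranks candidate links, while the same entity vectors double as features for a downstream classifier. Names, labels and dimensions are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
dim, n_entities = 16, 100
entity_emb = rng.normal(size=(n_entities, dim))
relation_emb = rng.normal(size=(10, dim))

# Purpose (2): link prediction -- rank candidate tails for (head, relation, ?).
head, relation = 0, 3
distances = np.linalg.norm(entity_emb[head] + relation_emb[relation] - entity_emb, axis=1)
print("top candidate tails:", np.argsort(distances)[:5])

# Purpose (1): data mining -- reuse the same entity vectors as classifier features.
labels = rng.integers(0, 2, size=n_entities)   # assumed external labels
clf = LogisticRegression().fit(entity_emb, labels)
print("training accuracy:", clf.score(entity_emb, labels))
```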


Author(s):  
Jing Qian ◽  
Gangmin Li ◽  
Katie Atkinson ◽  
Yong Yue

Knowledge graph embedding (KGE) aims to project the entities and relations of a knowledge graph (KG) into a low-dimensional vector space, and it has made steady progress in recent years. Conventional KGE methods, especially translational distance-based models, are trained by discriminating positive samples from negative ones. Most KGs store only positive samples for space efficiency, so negative sampling plays a crucial role in encoding the triples of a KG. The quality of the generated negative samples has a direct impact on the performance of the learnt knowledge representation in a myriad of downstream tasks, such as recommendation, link prediction and node classification. We summarize current negative sampling approaches in KGE into three categories: static distribution-based, dynamic distribution-based and custom cluster-based. Based on this categorization, we discuss the most prevalent existing approaches and their characteristics. We hope that this review can provide some guidelines for new thinking about negative sampling in KGE.
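The sketch below shows the simplest, static distribution-based strategy from the categorization above: for each positive triple, the head or the tail is replaced by an entity drawn uniformly at random, and candidates that collide with known positives are filtered out. The toy triples are assumptions for illustration; dynamic and cluster-based strategies replace the uniform draw with more informed distributions.

```python
import random

entities = ["alice", "bob", "carol", "dave", "paris", "london"]
positives = {("alice", "born_in", "paris"),
             ("bob", "born_in", "london"),
             ("carol", "knows", "dave")}

def negative_sample(triple, corrupt_head_prob=0.5):
    """Corrupt the head or tail of a positive triple with a uniformly random entity."""
    h, r, t = triple
    while True:
        e = random.choice(entities)
        candidate = (e, r, t) if random.random() < corrupt_head_prob else (h, r, e)
        if candidate not in positives:   # filtered setting: avoid false negatives
            return candidate

random.seed(0)
for pos in positives:
    print(pos, "->", negative_sample(pos))
```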


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Khushnood Abbas ◽  
Alireza Abbasi ◽  
Shi Dong ◽  
Ling Niu ◽  
Laihang Yu ◽  
...  

Background Technological and research advances have produced large volumes of biomedical data. When represented as a network (graph), these data become useful for modeling entities and interactions in biological and similar complex systems. In the field of network biology and network medicine, there is a particular interest in predicting results from drug–drug, drug–disease, and protein–protein interactions to advance the speed of drug discovery. Existing data and modern computational methods make it possible to identify potentially beneficial and harmful interactions and, therefore, to narrow drug trials ahead of actual clinical trials. Such automated data-driven investigation relies on machine learning techniques. However, traditional machine learning approaches require extensive preprocessing of the data, which makes them impractical for large datasets. This study presents a wide range of machine learning methods for predicting outcomes from biomedical interactions and evaluates the performance of the traditional methods against more recent network-based approaches. Results We applied 32 different network-based machine learning models to five commonly available biomedical datasets and evaluated their performance based on three important evaluation metrics, namely AUROC, AUPR, and F1-score. We achieved this by casting the link prediction problem as a binary classification problem, treating the existing links as positive examples and randomly sampling negative examples from the set of non-existent links. After experimental evaluation we found that Prone, ACT and LRW_5 are the top three performers on all five datasets. Conclusions This work presents a comparative evaluation of well-known network-based machine learning algorithms for predicting network links, with applications in the prediction of drug–target and drug–drug interactions. Our work can help guide researchers in the appropriate selection of machine learning methods for pharmaceutical tasks.
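The following sketch illustrates the evaluation setup described above: existing links are treated as positive examples, an equal number of non-existent pairs is sampled as negatives, every pair is scored (a simple common-neighbours heuristic stands in for the 32 network-based models compared in the paper), and AUROC, AUPR and F1-score are reported. The toy graph is an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

rng = np.random.default_rng(6)

# Toy undirected graph as an adjacency matrix.
n = 30
A = np.zeros((n, n))
for _ in range(60):
    i, j = rng.integers(0, n, size=2)
    if i != j:
        A[i, j] = A[j, i] = 1.0

pos = [(i, j) for i in range(n) for j in range(i + 1, n) if A[i, j]]
neg_pool = [(i, j) for i in range(n) for j in range(i + 1, n) if not A[i, j]]
neg = [neg_pool[k] for k in rng.choice(len(neg_pool), size=len(pos), replace=False)]

pairs = pos + neg
y_true = np.r_[np.ones(len(pos)), np.zeros(len(neg))]

# Score each pair; common neighbours used purely as a stand-in predictor.
scores = np.array([A[i] @ A[j] for i, j in pairs])

print("AUROC:", roc_auc_score(y_true, scores))
print("AUPR :", average_precision_score(y_true, scores))
print("F1   :", f1_score(y_true, scores > np.median(scores)))
```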

