Effective Representing of Information Network by Variational Autoencoder

Network representation is the basis of many applications and of extensive interest in various fields, such as information retrieval, social network analysis, and recommendation systems. Most previous methods for network representation only consider the incomplete aspects of a problem, including link structure, node information, and partial integration. The present study proposes a deep network representation model that seamlessly integrates the text information and structure of a network. Our model captures highly non-linear relationships between nodes and complex features of a network by exploiting the variational autoencoder (VAE), which is a deep unsupervised generation algorithm. We also merge the representation learned with a paragraph vector model and that learned with the VAE to obtain the network representation that preserves both structure and text information. We conduct comprehensive empirical experiments on benchmark datasets and find our model performs better than state-of-the-art techniques by a large margin.

Unsupervised Outlier Detection in Multidimensional Data

10.21203/rs.3.rs-250665/v1 ◽ 2021

Author(s): Atiq Rehman ◽ Samir Brahim Belhaouari

Keyword(s): State Of The Art ◽ Machine Learning Algorithms ◽ Multidimensional Data ◽ High Dimensions ◽ Comprehensive Performance ◽ Benchmark Datasets ◽ Distance Vector ◽ Detection Schemes ◽ Unsupervised Outlier Detection ◽ Better Than

Detection and removal of outliers in a dataset is a fundamental preprocessing task without which the analysis of the data can be misleading. Furthermore, the existence of anomalies in the data can heavily degrade the performance of machine learning algorithms. In order to detect the anomalies in a dataset in an unsupervised manner, some novel statistical techniques are proposed in this paper. The proposed techniques are based on statistical methods considering data compactness and other properties. The newly proposed ideas are found efficient in terms of performance, ease of implementation, and computational complexity. Furthermore, two proposed techniques presented in this paper use only a single dimensional distance vector to detect the outliers, so irrespective of the data’s high dimensions, the techniques remain computationally inexpensive and feasible. Comprehensive performance analysis of the proposed anomaly detection schemes is presented in the paper, and the newly proposed schemes are found better than the state-of-the-art methods when tested on several benchmark datasets.
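
The abstract does not spell out the proposed techniques, so the following is only a rough sketch of the general idea of working on a single-dimensional distance vector. It reduces each point to its Euclidean distance from a centre and thresholds those distances. The function name `distance_outliers`, the coordinate-wise median centre, and the cutoff of median plus three scaled MADs are all assumptions made for illustration, not the authors' method.

```python
import math
import statistics


def distance_outliers(points, k=3.0):
    """Return indices of points whose distance from the centre is unusually large.

    Assumptions (not from the paper): the centre is the coordinate-wise median,
    and the threshold is median + k * 1.4826 * MAD of the distances.
    """
    dims = range(len(points[0]))
    centre = [statistics.median(p[d] for p in points) for d in dims]
    # Collapse every d-dimensional point to one number: its distance to the centre.
    dists = [math.dist(p, centre) for p in points]
    med = statistics.median(dists)
    mad = statistics.median(abs(d - med) for d in dists)
    threshold = med + k * 1.4826 * mad
    return [i for i, d in enumerate(dists) if d > threshold]


# Hypothetical 2-D data with one obvious outlier at index 7.
data = [(1, 0), (0, 2), (-3, 0), (0, -1.5),
        (2, 2), (-1, -1), (0.5, 0.5), (20, 20)]
print(distance_outliers(data))  # prints [7]
```

Whatever the dimensionality of the input, the thresholding step only ever sees one number per point, which is why this style of method stays cheap in high dimensions.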

Category Trees - Classifiers that Branch on Category

International Journal of Artificial Intelligence & Applications ◽ 10.5121/ijaia.2021.12606 ◽ 2021 ◽ Vol 12 (06) ◽ pp. 65-76

Author(s): Kieran Greer

Keyword(s): State Of The Art ◽ The State ◽ Biological Analogy ◽ Category Type ◽ Benchmark Datasets ◽ Batch Input ◽ Distinguishing Features ◽ Incorrect Data ◽ Better Than

This paper presents a batch classifier that splits a dataset into tree branches depending on the category type. It improves on the earlier version and fixes a mistake in the earlier paper. Two important changes have been made. The first is to represent each category with a separate classifier. Each classifier then classifies its own subset of data rows, using batch input values to create the centroid and also represent the category itself. If the classifier contains data from more than one category, however, it needs to create new classifiers for the incorrect data. The second change, therefore, is to allow the classifier to branch to new layers when there is a split in the data, and to create new classifiers there for the data rows that are incorrectly classified. Each layer can therefore branch like a tree, not for distinguishing features but for distinguishing categories. The paper then suggests a further innovation, which is to represent some data columns with fixed value ranges, or bands. When considering features, it is shown that some of the data can be classified directly through fixed value ranges, while the rest must be classified using a classifier technique. The idea also allows the paper to discuss a biological analogy with neurons and neuron links. Tests show that the method can classify a diverse set of benchmark datasets better than the state of the art.

Non-I.I.D. Multi-Instance Learning for Predicting Instance and Bag Labels with Variational Auto-Encoder

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽ 10.24963/ijcai.2021/465 ◽ 2021

Author(s): Weijia Zhang

Keyword(s): Medical Imaging ◽ Supervised Learning ◽ Real World ◽ State Of The Art ◽ Experimental Results ◽ Weakly Supervised Learning ◽ Variational Autoencoder ◽ Label Prediction ◽ Weakly Supervised ◽ Better Than

Multi-instance learning is a type of weakly supervised learning. It deals with tasks where the data is a set of bags and each bag is a set of instances. Only the bag labels are observed whereas the labels for the instances are unknown. An important advantage of multi-instance learning is that by representing objects as a bag of instances, it is able to preserve the inherent dependencies among parts of the objects. Unfortunately, most existing algorithms assume all instances to be identically and independently distributed, which violates real-world scenarios since the instances within a bag are rarely independent. In this work, we propose the Multi-Instance Variational Autoencoder (MIVAE) algorithm which explicitly models the dependencies among the instances for predicting both bag labels and instance labels. Experimental results on several multi-instance benchmarks and end-to-end medical imaging datasets demonstrate that MIVAE performs better than state-of-the-art algorithms for both instance label and bag label prediction tasks.
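
To make the bag and instance setup concrete, here is a minimal sketch of the data layout. It uses the classical multi-instance assumption (a bag is positive if at least one of its instances is positive), which is a common convention and not the MIVAE model itself. The `Bag` class, the helper name, and the feature values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bag:
    instances: List[List[float]]  # feature vectors; instance labels are NOT observed
    label: int                    # observed bag label (0 or 1)


def bag_label_from_instances(instance_labels: List[int]) -> int:
    """Classical MIL assumption (not MIVAE): a bag is positive if any instance is."""
    return int(any(instance_labels))


# Hidden instance labels, used here only to show how the bag labels arise.
hidden = [[0, 0, 1], [0, 0]]
bags = [
    Bag(instances=[[0.1, 0.2], [0.0, 0.3], [0.9, 0.8]],
        label=bag_label_from_instances(hidden[0])),
    Bag(instances=[[0.2, 0.1], [0.1, 0.1]],
        label=bag_label_from_instances(hidden[1])),
]
print([b.label for b in bags])  # prints [1, 0]
```

A learner only sees `instances` and `label` for each bag; recovering something like `hidden` is the instance label prediction task the abstract refers to.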

Allowing mutations in maximal matches boosts genome compression performance

Bioinformatics ◽ 10.1093/bioinformatics/btaa572 ◽ 2020 ◽ Vol 36 (18) ◽ pp. 4675-4681 ◽ Cited By ~ 1

Author(s): Yuansheng Liu ◽ Limsoon Wong ◽ Jinyan Li

Keyword(s): Data Storage ◽ State Of The Art ◽ Dna Bases ◽ Supplementary Information ◽ Maximal Match ◽ Compression Performance ◽ Genome Data ◽ Compression Speed ◽ Benchmark Datasets ◽ Better Than

Motivation: A maximal match between two genomes is a contiguous non-extendable sub-sequence common in the two genomes. DNA bases mutate very often from the genome of one individual to another. When a mutation occurs in a maximal match, it breaks the maximal match into shorter match segments. The coding cost using these broken segments for reference-based genome compression is much higher than that of using the maximal match which is allowed to contain mutations. Results: We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches (MCMs) for genome encoding. MemRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme; the method then extends these matches to cover mismatches (mutations) and their neighbouring maximal matches to form long MCMs. Experiments reveal that memRGC boosts the compression performance by an average of 27% in reference-based genome compression. MemRGC is also better than the best state-of-the-art methods on all of the benchmark datasets, sometimes better by 50%. Moreover, memRGC uses much less memory and de-compression resources, while providing comparable compression speed. These advantages are of significant benefit to genome data storage and transmission. Availability and implementation: https://github.com/yuansliu/memRGC. Supplementary information: Supplementary data are available at Bioinformatics online.
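
The definition of a maximal match, and the way a single mutation splits it, can be illustrated with a brute-force search. This is only a sketch for intuition: it is quadratic in the sequence lengths and does not reproduce memRGC's coprime double-window k-mer sampling scheme. The function name `maximal_matches`, the `min_len` parameter, and the toy sequences are hypothetical.

```python
def maximal_matches(ref, tgt, min_len=4):
    """Brute-force maximal exact matches as (ref_start, tgt_start, length) tuples."""
    matches = []
    for i in range(len(ref)):
        for j in range(len(tgt)):
            if ref[i] != tgt[j]:
                continue
            # Skip start positions that could still be extended to the left.
            if i > 0 and j > 0 and ref[i - 1] == tgt[j - 1]:
                continue
            # Extend to the right for as long as the bases agree.
            length = 0
            while (i + length < len(ref) and j + length < len(tgt)
                   and ref[i + length] == tgt[j + length]):
                length += 1
            if length >= min_len:
                matches.append((i, j, length))
    return matches


ref = "ACGTACGGTCAATG"
tgt = "ACGTACGATCAATG"  # one substitution at index 7 (G -> A)
print(maximal_matches(ref, tgt))  # prints [(0, 0, 7), (8, 8, 6)]
```

Without the substitution the two sequences would share one match of length 14. With it, an exact-match encoder has to describe two shorter segments, whereas a mutation-containing match in the sense of the abstract could span both segments and record the single mismatch.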

Regularizing Variational Autoencoder with Diversity and Uncertainty Awareness

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽ 10.24963/ijcai.2021/408 ◽ 2021

Author(s): Dazhong Shen ◽ Chuan Qin ◽ Chao Wang ◽ Hengshu Zhu ◽ Enhong Chen ◽ ...

Keyword(s): Latent Variables ◽ State Of The Art ◽ Likelihood Estimation ◽ Generative Models ◽ Latent Space ◽ Variational Autoencoder ◽ Benchmark Datasets ◽ Classification Tasks ◽ Latent Representations ◽ Low Uncertainty

As one of the most popular generative models, Variational Autoencoder (VAE) approximates the posterior of latent variables based on amortized variational inference. However, when the decoder network is sufficiently expressive, VAE may lead to posterior collapse; that is, uninformative latent representations may be learned. To this end, in this paper, we propose an alternative model, DU-VAE, for learning a more Diverse and less Uncertain latent space, and thus the representation can be learned in a meaningful and compact manner. Specifically, we first theoretically demonstrate that it will result in better latent space with high diversity and low uncertainty awareness by controlling the distribution of posterior’s parameters across the whole data accordingly. Then, without the introduction of new loss terms or modifying training strategies, we propose to exploit Dropout on the variances and Batch-Normalization on the means simultaneously to regularize their distributions implicitly. Furthermore, to evaluate the generalization effect, we also exploit DU-VAE for inverse autoregressive flow based-VAE (VAE-IAF) empirically. Finally, extensive experiments on three benchmark datasets clearly show that our approach can outperform state-of-the-art baselines on both likelihood estimation and underlying classification tasks.

A New End-to-End Multi-Dimensional CNN Framework for Land Cover/Land Use Change Detection in Multi-Source Remote Sensing Datasets

Remote Sensing ◽ 10.3390/rs12122010 ◽ 2020 ◽ Vol 12 (12) ◽ pp. 2010 ◽ Cited By ~ 5

Author(s): Seyd Teymoor Seydi ◽ Mahdi Hasanlou ◽ Meisam Amani

Keyword(s): Remote Sensing ◽ Change Detection ◽ State Of The Art ◽ Polarimetric Synthetic Aperture Radar ◽ Different Types ◽ Benchmark Datasets ◽ Convolution Kernels ◽ 2D And 3D ◽ Accuracy Indices ◽ Better Than

The diversity of change detection (CD) methods and the limitations in generalizing these techniques using different types of remote sensing datasets over various study areas have been a challenge for CD applications. Additionally, most CD methods have been implemented in two intensive and time-consuming steps: (a) predicting change areas, and (b) deciding on predicted areas. In this study, a novel CD framework based on the convolutional neural network (CNN) is proposed to not only address the aforementioned problems but also to considerably improve the level of accuracy. The proposed CNN-based CD network contains three parallel channels: the first and second channels, respectively, extract deep features on the original first- and second-time imagery and the third channel focuses on the extraction of change deep features based on differencing and stacking deep features. Additionally, each channel includes three types of convolution kernels: 1D-, 2D-, and 3D-dilated-convolution. The effectiveness and reliability of the proposed CD method are evaluated using three different types of remote sensing benchmark datasets (i.e., multispectral, hyperspectral, and Polarimetric Synthetic Aperture RADAR (PolSAR)). The results of the CD maps are also evaluated both visually and statistically by calculating nine different accuracy indices. Moreover, the results of the CD using the proposed method are compared to those of several state-of-the-art CD algorithms. All the results prove that the proposed method outperforms the other remote sensing CD techniques. For instance, considering different scenarios, the Overall Accuracies (OAs) and Kappa Coefficients (KCs) of the proposed CD method are better than 95.89% and 0.805, respectively, and the Miss Detection (MD) and the False Alarm (FA) rates are lower than 12% and 3%, respectively.
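
For readers unfamiliar with two of the accuracy indices named above, the sketch below computes overall accuracy and Cohen's kappa coefficient from a confusion matrix using their standard definitions. The function name and the counts are hypothetical and are not taken from the paper.

```python
def overall_accuracy_and_kappa(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(n_classes)) / total
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n_classes)
    ) / (total * total)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical pixel counts for a two-class change map.
cm = [[90, 10],  # true "no change"
      [5, 95]]   # true "change"
oa, kappa = overall_accuracy_and_kappa(cm)
print(round(oa, 3), round(kappa, 3))  # prints 0.925 0.85
```

Kappa is lower than overall accuracy here because it discounts the 50% agreement that would be expected by chance given these marginals.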

Improving Variational Autoencoder based Out-of-Distribution Detection for Embedded Real-time Applications

ACM Transactions on Embedded Computing Systems ◽ 10.1145/3477026 ◽ 2021 ◽ Vol 20 (5s) ◽ pp. 1-26

Author(s): Yeli Feng ◽ Daniel Jun Xian Ng ◽ Arvind Easwaran

Keyword(s): Machine Learning ◽ Real Time ◽ State Of The Art ◽ Autonomous Driving ◽ The State ◽ Theory And Practice ◽ Data Sets ◽ Variational Autoencoder ◽ Distribution Shifts ◽ Better Than

Uncertainties in machine learning are a significant roadblock for its application in safety-critical cyber-physical systems (CPS). One source of uncertainty arises from distribution shifts in the input data between training and test scenarios. Detecting such distribution shifts in real-time is an emerging approach to address the challenge. The high dimensional input space in CPS applications involving imaging adds extra difficulty to the task. Generative learning models are widely adopted for the task, namely out-of-distribution (OoD) detection. To improve the state-of-the-art, we studied existing proposals from both machine learning and CPS fields. In the latter, safety monitoring in real-time for autonomous driving agents has been a focus. Exploiting the spatiotemporal correlation of motion in videos, we can robustly detect hazardous motion around autonomous driving agents. Inspired by the latest advances in the Variational Autoencoder (VAE) theory and practice, we tapped into the prior knowledge in data to further boost OoD detection’s robustness. Comparison studies over nuScenes and Synthia data sets show our methods significantly improve detection capabilities of OoD factors unique to driving scenarios, 42% better than state-of-the-art approaches. Our model also generalized near-perfectly, 97% better than the state-of-the-art across the real-world and simulation driving data sets experimented. Finally, we customized one proposed method into a twin-encoder model that can be deployed to resource limited embedded devices for real-time OoD detection. Its execution time was reduced over four times in low-precision 8-bit integer inference, while detection capability is comparable to its corresponding floating-point model.

Theme-weighted Ranking of Keywords from Text Documents using Phrase Embeddings

10.31219/osf.io/tkvap ◽ 2018 ◽ Cited By ~ 2

Author(s): Debanjan Mahata ◽ John Kuriakose ◽ Rajiv Ratn Shah ◽ Roger Zimmermann ◽ John R. Talburt

Keyword(s): State Of The Art ◽ Keyword Extraction ◽ Personalized Pagerank ◽ Text Documents ◽ Pagerank Algorithm ◽ Scientific Papers ◽ Benchmark Datasets ◽ Evaluation Dataset ◽ And Training ◽ Better Than

Keyword extraction is a fundamental task in natural language processing that facilitates mapping of documents to a concise set of representative single and multi-word phrases. Keywords from text documents are primarily extracted using supervised and unsupervised approaches. In this paper, we present an unsupervised technique that uses a combination of theme-weighted personalized PageRank algorithm and neural phrase embeddings for extracting and ranking keywords. We also introduce an efficient way of processing text documents and training phrase embeddings using existing techniques. We share an evaluation dataset derived from an existing dataset that is used for choosing the underlying embedding model. The evaluations for ranked keyword extraction are performed on two benchmark datasets comprising short abstracts (Inspec) and long scientific papers (SemEval 2010), and are shown to produce results better than the state-of-the-art systems.
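
As a rough illustration of the personalized PageRank component, the sketch below runs power iteration on a small undirected phrase graph, with the teleport distribution biased by per-phrase weights. In the paper those weights would come from theme similarity computed with phrase embeddings; here they are hypothetical numbers supplied by hand, and the function name, graph, and parameter defaults are assumptions for illustration.

```python
def personalized_pagerank(graph, weights, damping=0.85, iterations=50):
    """Power iteration on an undirected graph.

    graph:   dict mapping node -> list of distinct neighbours (none empty)
    weights: dict mapping node -> non-negative weight (the teleport bias)
    """
    total = sum(weights[n] for n in graph)
    teleport = {n: weights[n] / total for n in graph}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {}
        for n in graph:
            # Each neighbour m passes an equal share of its rank to n.
            incoming = sum(rank[m] / len(graph[m]) for m in graph if n in graph[m])
            new_rank[n] = (1 - damping) * teleport[n] + damping * incoming
        rank = new_rank
    return rank


# Hypothetical co-occurrence graph of candidate phrases and theme weights.
graph = {
    "keyword extraction": ["pagerank", "phrase embeddings"],
    "pagerank": ["keyword extraction", "phrase embeddings"],
    "phrase embeddings": ["keyword extraction", "pagerank"],
}
weights = {"keyword extraction": 3.0, "pagerank": 1.0, "phrase embeddings": 1.0}
rank = personalized_pagerank(graph, weights)
ranked = sorted(rank, key=rank.get, reverse=True)
print(ranked[0])  # prints keyword extraction
```

Because the graph is symmetric, the only thing separating the three phrases is the teleport bias, so the phrase with the largest weight ends up ranked first.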

Unsupervised outlier detection in multidimensional data

Journal Of Big Data ◽ 10.1186/s40537-021-00469-z ◽ 2021 ◽ Vol 8 (1)

Author(s): Atiq ur Rehman ◽ Samir Brahim Belhaouari

Keyword(s): State Of The Art ◽ Machine Learning Algorithms ◽ Multidimensional Data ◽ Statistical Techniques ◽ High Dimensions ◽ Comprehensive Performance ◽ Benchmark Datasets ◽ Detection Schemes ◽ Unsupervised Outlier Detection ◽ Better Than

Detection and removal of outliers in a dataset is a fundamental preprocessing task without which the analysis of the data can be misleading. Furthermore, the existence of anomalies in the data can heavily degrade the performance of machine learning algorithms. In order to detect the anomalies in a dataset in an unsupervised manner, some novel statistical techniques are proposed in this paper. The proposed techniques are based on statistical methods considering data compactness and other properties. The newly proposed ideas are found efficient in terms of performance, ease of implementation, and computational complexity. Furthermore, two proposed techniques presented in this paper use transformation of data to a unidimensional distance space to detect the outliers, so irrespective of the data’s high dimensions, the techniques remain computationally inexpensive and feasible. Comprehensive performance analysis of the proposed anomaly detection schemes is presented in the paper, and the newly proposed schemes are found better than the state-of-the-art methods when tested on several benchmark datasets.

SMPLIP-Score: predicting ligand binding affinity from simple and interpretable on-the-fly interaction fingerprint pattern descriptors

Journal of Cheminformatics ◽ 10.1186/s13321-021-00507-1 ◽ 2021 ◽ Vol 13 (1)

Author(s): Surendra Kumar ◽ Mi-hyun Kim

Keyword(s): Ligand Binding ◽ Binding Affinity ◽ Scoring Functions ◽ Binding Affinities ◽ Ligand Interaction ◽ Fingerprint Pattern ◽ Comparable Performance ◽ Direct Interpretation ◽ Benchmark Datasets ◽ Complex Features

In drug discovery, rapid and accurate prediction of protein–ligand binding affinities is a pivotal task for lead optimization with acceptable on-target potency as well as pharmacological efficacy. Furthermore, researchers hope for a high correlation between docking score and pose with key interactive residues, although scoring functions as free energy surrogates of protein–ligand complexes have failed to provide collinearity. Recently, various machine learning or deep learning methods have been proposed to overcome the drawbacks of scoring functions. Despite being highly accurate, their featurization process is complex and the meaning of the embedded features cannot directly be interpreted by human recognition without an additional feature analysis. Here, we propose SMPLIP-Score (Substructural Molecular and Protein–Ligand Interaction Pattern Score), a directly interpretable predictor of absolute binding affinity. Our simple featurization embeds the interaction fingerprint pattern on the ligand-binding site environment and molecular fragments of ligands into an input vectorized matrix for learning layers (random forest or deep neural network). Despite using less complex features than other state-of-the-art models, SMPLIP-Score achieved comparable performance, a Pearson’s correlation coefficient up to 0.80, and a root mean square error up to 1.18 in pK units with several benchmark datasets (PDBbind v.2015, Astex Diverse Set, CSAR NRC HiQ, FEP, PDBbind NMR, and CASF-2016). For this model, generality, predictive power, ranking power, and robustness were examined using direct interpretation of feature matrices for specific targets.
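
The two headline metrics in this abstract, Pearson's correlation coefficient and root mean square error, can be computed as in the sketch below. The formulas are the standard ones; the affinity values are hypothetical and are not results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rmse(xs, ys):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Hypothetical experimental vs. predicted affinities in pK units.
experimental = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.5, 7.5, 7.5]
print(round(pearson_r(experimental, predicted), 3),
      round(rmse(experimental, predicted), 3))  # prints 0.894 0.5
```

The two metrics answer different questions: correlation measures whether predictions rank and scale with the experimental values, while RMSE measures how far off they are in absolute pK units.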
