Dictionary learning allows model-free pseudotime estimation of transcriptomic data

Abstract

Background: Pseudotime estimation from dynamic single-cell transcriptomic data enables characterisation and understanding of the underlying processes, for example developmental processes. Various pseudotime estimation methods have been proposed in recent years. Typically, these methods start with a dimension reduction step because the low-dimensional representation is usually easier to analyse. Approaches such as PCA, ICA or t-SNE are among the most widely used dimension reduction methods in pseudotime estimation. However, these methods usually make assumptions about the derived dimensions, which can result in important dataset properties being missed. In this paper, we propose a new dictionary learning based approach, dynDLT, for dimension reduction and pseudotime estimation of dynamic transcriptomic data. Dictionary learning is a matrix factorisation approach that does not restrict the dependence of the derived dimensions. To evaluate the performance, we conduct a large simulation study and analyse eight real-world datasets.

Results: The simulation studies reveal, firstly, that dynDLT preserves the simulated patterns in the low-dimensional representation and that the pseudotimes can be derived from this representation. Secondly, the results show that dynDLT is suitable for detecting genes that exhibit the simulated dynamic patterns, which facilitates the interpretation of the compressed representation and thus of the dynamic processes. For the real-world data analysis, we select datasets with samples taken at different time points throughout an experiment. The pseudotimes found by dynDLT correlate highly with the experimental times. We compare the results to other approaches that are used in pseudotime estimation or that are methodologically close to dictionary learning: ICA, NMF, PCA, t-SNE, and UMAP. dynDLT has the best overall performance on the simulated and real-world datasets.

Conclusions: We introduce dynDLT, a method that is suitable for pseudotime estimation. Its main advantages are: (1) it is a model-free approach, meaning that it does not restrict the dependence of the derived dimensions; (2) genes that are relevant in the detected dynamic processes can be identified from the dictionary matrix; (3) restricting the dictionary entries to positive values makes the dictionary atoms highly interpretable.
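
The abstract evaluates pseudotimes by their correlation with the experimental sampling times. A minimal sketch of such a check follows, assuming a Spearman rank correlation; the abstract does not say which correlation measure the authors used, and the function names and toy values below are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: two cells sampled at each of four time points (hours),
# and an estimated pseudotime per cell (assumed values, not from the paper).
experimental_time = [0, 0, 12, 12, 24, 24, 48, 48]
pseudotime = [0.05, 0.10, 0.30, 0.28, 0.55, 0.61, 0.90, 0.97]
rho = spearman(pseudotime, experimental_time)
```

A rank-based measure is used here because pseudotime is only defined up to a monotone transformation, so the ordering of cells matters more than the spacing between them.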

Dimension Reduction for Objects Composed of Vector Sets

International Journal of Applied Mathematics and Computer Science ◽ 10.1515/amcs-2017-0012 ◽ 2017 ◽ Vol 27 (1) ◽ pp. 169-180 ◽ Cited By ~ 1

Author(s): Marton Szemenyei ◽ Ferenc Vajda

Keyword(s): Machine Learning ◽ Data Mining ◽ Feature Selection ◽ Discriminant Analysis ◽ Probability Distribution ◽ Dimension Reduction ◽ Pose Estimation ◽ Real World ◽ Single Object ◽ Real World Datasets

Abstract: Dimension reduction and feature selection are fundamental tools for machine learning and data mining. Most existing methods, however, assume that objects are represented by a single vectorial descriptor. In reality, some description methods assign unordered sets or graphs of vectors to a single object, where each vector is assumed to have the same number of dimensions but is drawn from a different probability distribution. Moreover, some applications (such as pose estimation) may require the recognition of individual vectors (nodes) of an object. In such cases it is essential that the nodes within a single object remain distinguishable after dimension reduction. In this paper we propose new discriminant analysis methods that satisfy two criteria at the same time: separating classes from one another and separating the nodes of an object instance from one another. We analyze and evaluate our methods on several synthetic and real-world datasets.

Relation Structure-Aware Heterogeneous Information Network Embedding

Proceedings of the AAAI Conference on Artificial Intelligence ◽ 10.1609/aaai.v33i01.33014456 ◽ 2019 ◽ Vol 33 ◽ pp. 4456-4463 ◽ Cited By ~ 8

Author(s): Yuanfu Lu ◽ Chuan Shi ◽ Linmei Hu ◽ Zhiyuan Liu

Keyword(s): Real World ◽ Dimensional Space ◽ Structural Characteristics ◽ Information Network ◽ Network Embedding ◽ Heterogeneous Information Network ◽ Heterogeneous Information ◽ Real World Datasets ◽ Low Dimensional ◽ Embedding Methods

Heterogeneous information network (HIN) embedding aims to embed multiple types of nodes into a low-dimensional space. Although most existing HIN embedding methods consider heterogeneous relations in HINs, they usually employ a single model for all relations without distinction, which inevitably restricts the capability of network embedding. In this paper, we take the structural characteristics of heterogeneous relations into consideration and propose a novel Relation structure-aware Heterogeneous Information Network Embedding model (RHINE). By exploring real-world networks with thorough mathematical analysis, we present two structure-related measures that consistently separate heterogeneous relations into two categories: Affiliation Relations (ARs) and Interaction Relations (IRs). To respect the distinctive characteristics of these relations, RHINE uses different models specifically tailored to ARs and IRs, which better capture the structures and semantics of the networks. Finally, we combine and optimize these models in a unified and elegant manner. Extensive experiments on three real-world datasets demonstrate that our model significantly outperforms state-of-the-art methods in various tasks, including node clustering, link prediction, and node classification.

Combined Reinforcement Learning via Abstract Representations

Proceedings of the AAAI Conference on Artificial Intelligence ◽ 10.1609/aaai.v33i01.33013582 ◽ 2019 ◽ Vol 33 ◽ pp. 3582-3589 ◽ Cited By ~ 4

Author(s): Vincent Francois-Lavet ◽ Yoshua Bengio ◽ Doina Precup ◽ Joelle Pineau

Keyword(s): Reinforcement Learning ◽ State Space ◽ Transfer Learning ◽ Computationally Efficient ◽ Dimensional Representation ◽ Learning Methods ◽ Model Free ◽ Abstract Representations ◽ Low Dimensional ◽ New Strategies

In the quest for efficient and robust reinforcement learning methods, both model-free and model-based approaches offer advantages. In this paper we propose a new way of explicitly bridging both approaches via a shared low-dimensional learned encoding of the environment, meant to capture summarizing abstractions. We show that the modularity brought by this approach leads to good generalization while being computationally efficient, with planning happening in a smaller latent state space. In addition, this approach recovers a sufficient low-dimensional representation of the environment, which opens up new strategies for interpretable AI, exploration and transfer learning.

Recent Advances in Supervised Dimension Reduction: A Survey

Machine Learning and Knowledge Extraction ◽ 10.3390/make1010020 ◽ 2019 ◽ Vol 1 (1) ◽ pp. 341-358 ◽ Cited By ~ 12

Author(s): Guoqing Chao ◽ Yuan Luo ◽ Weiping Ding

Keyword(s): Dimension Reduction ◽ Dimensional Representation ◽ Open Problems ◽ Advantages And Disadvantages ◽ Reduction Problem ◽ Reduction Methods ◽ Low Dimensional ◽ Supervised Dimension Reduction ◽ Effective Representation ◽ Non Negative Matrix Factorization

Recently, we have witnessed an explosive growth in both the quantity and dimension of data generated, which aggravates the high-dimensionality challenge in tasks such as predictive modeling and decision support. To date, a large number of unsupervised dimension reduction methods have been proposed and studied. However, there is no review focusing specifically on the supervised dimension reduction problem. Most studies perform classification or regression after an unsupervised dimension reduction step. We recognize the following advantages of learning the low-dimensional representation and the classification/regression model simultaneously: high accuracy and effective representation. Considering classification or regression as the main goal of dimension reduction, the purpose of this paper is to summarize and organize the current developments in the field into three main classes: PCA-based, Non-negative Matrix Factorization (NMF)-based, and manifold-based supervised dimension reduction methods, and to provide detailed discussions of their advantages and disadvantages. Moreover, we outline a dozen open problems that can be further explored to advance the development of this topic.

Large-Scale Heterogeneous Feature Embedding

Proceedings of the AAAI Conference on Artificial Intelligence ◽ 10.1609/aaai.v33i01.33013878 ◽ 2019 ◽ Vol 33 ◽ pp. 3878-3885 ◽ Cited By ~ 5

Author(s): Xiao Huang ◽ Qingquan Song ◽ Fan Yang ◽ Xia Hu

Keyword(s): Real World ◽ Large Scale ◽ Single Type ◽ Heterogeneous Information ◽ Multiview Learning ◽ Efficiency And Effectiveness ◽ Joint Embedding ◽ Real World Datasets ◽ Low Dimensional ◽ Vector Representations

Feature embedding aims to learn a low-dimensional vector representation for each instance that preserves the information in its features. These representations can benefit various off-the-shelf learning algorithms. While embedding models for a single type of feature have been well studied, real-world instances often contain multiple types of correlated features, or even information in a different modality such as networks. Existing studies such as multiview learning show that it is promising to learn unified vector representations from all sources. However, the high computational cost of incorporating heterogeneous information limits the applications of existing algorithms, because the number of instances and the dimensions of features are often large in practice. To bridge the gap, we propose a scalable framework, FeatWalk, which can model and incorporate instance similarities in terms of different types of features into a unified embedding representation. To enable scalability, FeatWalk does not directly calculate any similarity measure, but provides an alternative way to simulate similarity-based random walks among instances, extracting the local instance proximity and preserving it in a set of instance index sequences. These sequences are homogeneous with each other. A scalable word embedding algorithm is applied to them to learn a joint embedding representation of instances. Experiments on four real-world datasets demonstrate the efficiency and effectiveness of FeatWalk.
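
One way to picture random walks that reflect feature similarity without ever computing a pairwise similarity is to hop from an instance to one of its features and from that feature to an instance that shares it. The sketch below illustrates that idea only; the abstract does not give FeatWalk's actual sampling scheme, so the function `feature_walk`, the weighting of each hop by feature value, and the toy data are all assumptions.

```python
import random


def feature_walk(features, start, length, rng):
    """Walk instance -> feature -> instance, weighting each hop by feature value.

    `features[i]` maps feature names to positive weights for instance i.
    No pairwise instance similarity is computed at any point.
    """
    # Inverted index: feature name -> list of (instance, weight) pairs.
    holders = {}
    for inst, feats in enumerate(features):
        for name, weight in feats.items():
            holders.setdefault(name, []).append((inst, weight))

    walk = [start]
    current = start
    for _ in range(length - 1):
        # Hop 1: pick one of the current instance's features.
        names = list(features[current])
        name = rng.choices(names, weights=[features[current][n] for n in names])[0]
        # Hop 2: pick an instance that carries the chosen feature.
        candidates = holders[name]
        current = rng.choices(
            [inst for inst, _ in candidates],
            weights=[weight for _, weight in candidates],
        )[0]
        walk.append(current)
    return walk


# Hypothetical toy instances; two feature types share one dictionary per instance.
features = [
    {"word:gene": 2.0, "tag:bio": 1.0},
    {"word:gene": 1.0, "word:cell": 1.0},
    {"word:cell": 3.0, "tag:bio": 1.0},
    {"word:graph": 2.0},
]
rng = random.Random(0)
sequences = [feature_walk(features, start, 6, rng) for start in range(len(features))]
```

Each walk is a plain sequence of instance indices, so sequences produced from different feature types have the same form and can be passed together to a word embedding algorithm, with instance indices playing the role of words.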

Bootstrapping Entity Alignment with Knowledge Graph Embedding

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽ 10.24963/ijcai.2018/611 ◽ 2018 ◽ Cited By ~ 35

Author(s): Zequn Sun ◽ Wei Hu ◽ Qingheng Zhang ◽ Yuzhong Qu

Keyword(s): Performance Improvement ◽ Real World ◽ State Of The Art ◽ Graph Embedding ◽ Training Data ◽ Knowledge Graph ◽ Error Accumulation ◽ Knowledge Graphs ◽ Real World Datasets ◽ Low Dimensional

Embedding-based entity alignment represents different knowledge graphs (KGs) as low-dimensional embeddings and finds entity alignment by measuring the similarities between entity embeddings. Existing approaches have achieved promising results; however, they are still challenged by the lack of sufficient prior alignment to serve as labeled training data. In this paper, we propose a bootstrapping approach to embedding-based entity alignment. It iteratively labels likely entity alignment as training data for learning alignment-oriented KG embeddings. Furthermore, it employs an alignment editing method to reduce error accumulation during iterations. Our experiments on real-world datasets showed that the proposed approach significantly outperformed state-of-the-art embedding-based approaches to entity alignment. The proposed alignment-oriented KG embedding, bootstrapping process and alignment editing method all contributed to the performance improvement.
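
The loop described in the abstract (label likely alignments, then edit earlier labels to limit error accumulation) can be illustrated with a small sketch. It is a greedy stand-in under stated assumptions, not the paper's procedure: the function `bootstrap_alignment`, the similarity threshold, and the conflict rule "the more similar pair wins" are hypothetical, and the similarity matrices are made-up values standing in for embedding similarities after each training round.

```python
def bootstrap_alignment(similarity, threshold, labelled=None):
    """One bootstrapping round: label confident pairs, then edit conflicts.

    `similarity[i][j]` is the similarity between entity i of the first KG
    and entity j of the second. `labelled` maps i -> j from earlier rounds.
    """
    labelled = dict(labelled or {})
    for i, row in enumerate(similarity):
        j = max(range(len(row)), key=row.__getitem__)  # best candidate for i
        if row[j] < threshold:
            continue  # not confident enough to label
        # Editing, case 1: relabel i only if the new match is more similar.
        if i in labelled and similarity[i][labelled[i]] >= row[j]:
            continue
        # Editing, case 2: keep labels one-to-one; the stronger claim on j wins.
        owner = next((a for a, b in labelled.items() if b == j and a != i), None)
        if owner is not None:
            if similarity[owner][j] >= row[j]:
                continue
            del labelled[owner]
        labelled[i] = j
    return labelled


# Round 1: entities 0 and 1 both prefer target 0; the more similar pair is kept.
round1 = [
    [0.90, 0.20, 0.10],
    [0.80, 0.30, 0.10],
    [0.10, 0.20, 0.95],
]
labels = bootstrap_alignment(round1, threshold=0.7)

# Round 2: hypothetical similarities after retraining on the new labels.
round2 = [
    [0.90, 0.20, 0.10],
    [0.30, 0.85, 0.10],
    [0.10, 0.20, 0.95],
]
labels = bootstrap_alignment(round2, threshold=0.7, labelled=labels)
```

In a full pipeline the embeddings would be retrained between rounds using the newly labelled pairs as additional training data; that step is omitted here.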

Context Attention Heterogeneous Network Embedding

Computational Intelligence and Neuroscience ◽ 10.1155/2019/8106073 ◽ 2019 ◽ Vol 2019 ◽ pp. 1-15

Author(s): Wei Zhuo ◽ Qianyi Zhan ◽ Yuan Liu ◽ Zhenping Xie ◽ Jing Lu

Keyword(s): Real World ◽ Online Social Networks ◽ Heterogeneous Network ◽ Network Embedding ◽ Node Importance ◽ Unweighted Network ◽ Real World Datasets ◽ Low Dimensional ◽ Types Of Information ◽ The Impact

Network embedding (NE), which maps nodes into a low-dimensional latent Euclidean space to represent effective features of each node in the network, has attracted considerable attention in recent years. Many popular NE methods, such as DeepWalk, Node2vec, and LINE, are capable of handling homogeneous networks. However, nodes in real-world networks are accompanied by heterogeneous information (e.g., text descriptions, node properties, and hashtags), and because of this heterogeneity it remains a great challenge to jointly project the topological structure and the different types of information into a fixed-dimensional embedding space. Besides, accurately quantifying the strength of edges (the tightness of connections between nodes) in an unweighted network is also a difficulty for existing methods. To bridge the gap, we propose CAHNE (context attention heterogeneous network embedding), a novel network embedding method intended to yield more accurate learning results. Specifically, we propose the concept of node importance to measure the strength of edges, which better preserves the context relations of a node in unweighted networks. Moreover, text information is a ubiquitous feature in real-world networks, e.g., online social networks and citation networks. To account for the sophisticated interactions between the network structure and the text features of nodes, CAHNE learns context embeddings for nodes by introducing the context node sequence, and an attention mechanism is integrated into the model to better reflect the impact of context nodes on the current node. To corroborate the efficacy of CAHNE, we apply our method and various baseline methods to several real-world datasets. The experimental results show that CAHNE achieves higher quality than a number of state-of-the-art network embedding methods on the tasks of network reconstruction, link prediction, node classification, and visualization.

SpHMC: Spectral Hamiltonian Monte Carlo

Proceedings of the AAAI Conference on Artificial Intelligence ◽ 10.1609/aaai.v33i01.33015516 ◽ 2019 ◽ Vol 33 ◽ pp. 5516-5524

Author(s): Haoyi Xiong ◽ Kafeng Wang ◽ Jiang Bian ◽ Zhanxing Zhu ◽ Cheng-Zhong Xu ◽ ...

Keyword(s): Monte Carlo ◽ Probability Distribution ◽ Real World ◽ Dimensional Space ◽ Probability Distributions ◽ Superior Performance ◽ High Dimensional ◽ Hamiltonian Monte Carlo ◽ Real World Datasets ◽ Low Dimensional

Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) methods have been widely used to sample from certain probability distributions, incorporating (kernel) density derivatives and/or given datasets. Instead of exploring new samples from kernel spaces, this work proposes a novel SGHMC sampler, Spectral Hamiltonian Monte Carlo (SpHMC), that produces high-dimensional sparse representations of given datasets through sparse sensing and SGHMC. Inspired by compressed sensing, we assume that all given samples are low-dimensional measurements of certain high-dimensional sparse vectors, and that a continuous probability distribution exists in this high-dimensional space. Specifically, given a dictionary for sparse coding, SpHMC first derives a novel likelihood evaluator of the probability distribution from the loss function of LASSO, then samples from the high-dimensional distribution using stochastic Langevin dynamics with derivatives of the log-likelihood and Metropolis–Hastings sampling. In addition, new samples in the low-dimensional measurement space can be regenerated from the sampled high-dimensional vectors and the dictionary. Extensive experiments have been conducted to evaluate the proposed algorithm using real-world datasets. Performance comparisons on three real-world applications demonstrate the superior performance of SpHMC over baseline methods.
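
The chain "LASSO loss, then likelihood, then Langevin sampling, then regeneration" can be written out in a generic form. This is one plausible reading rather than the paper's exact formulation: the symbols $x$ (a low-dimensional sample), $D$ (the dictionary), $z$ (the high-dimensional sparse vector), $\lambda$ (the sparsity weight) and $\epsilon_t$ (the step size) are all assumed notation, and the $\ell_1$ term is only subdifferentiable, so $\nabla_z$ stands for a subgradient there.

```latex
% LASSO loss for one sample x under dictionary D (assumed notation)
\mathcal{L}(z) = \tfrac{1}{2}\,\lVert x - D z \rVert_2^2 + \lambda\,\lVert z \rVert_1

% Likelihood evaluator derived from the loss (unnormalised)
p(z \mid x, D) \;\propto\; \exp\bigl(-\mathcal{L}(z)\bigr)

% One Langevin step on the log-likelihood, before any Metropolis-Hastings correction
z_{t+1} = z_t - \frac{\epsilon_t}{2}\,\nabla_z \mathcal{L}(z_t) + \eta_t,
\qquad \eta_t \sim \mathcal{N}(0,\ \epsilon_t I)

% Regenerating a low-dimensional sample from a sampled sparse vector
\tilde{x} = D\, z_{t+1}
```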

Low Dimensional Representation of Space Structure and Clustering of Categorical Data

2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom) ◽ 10.1109/bdcloud.2018.00161 ◽ 2018

Author(s): Jianjun Cao ◽ Qibin Zheng ◽ Nianfeng Weng ◽ Xingchun Diao

Keyword(s): Categorical Data ◽ Space Structure ◽ Dimensional Representation ◽ Representation Of Space ◽ Low Dimensional

Time-Efficient Ensemble Learning with Sample Exchange for Edge Computing

ACM Transactions on Internet Technology ◽ 10.1145/3409265 ◽ 2021 ◽ Vol 21 (3) ◽ pp. 1-17

Author(s): Wu Chen ◽ Yong Yu ◽ Keke Gai ◽ Jiamou Liu ◽ Kim-Kwang Raymond Choo

Keyword(s): Ensemble Learning ◽ Real World ◽ Interaction Mechanism ◽ Training Model ◽ Edge Computing ◽ Learning Techniques ◽ Multi Agent ◽ Real World Datasets ◽ Entire Dataset ◽ Exchange Data

In existing ensemble learning algorithms (e.g., random forest), each base learner’s model needs the entire dataset for sampling and training. However, this may not be practical in many real-world applications, and it incurs additional computational costs. To achieve better efficiency, we propose a decentralized framework: Multi-Agent Ensemble. The framework leverages edge computing to facilitate ensemble learning techniques by focusing on the balancing of access restrictions (small sub-dataset) and accuracy enhancement. Specifically, network edge nodes (learners) are utilized to model classifications and predictions in our framework. Data is then distributed to multiple base learners who exchange data via an interaction mechanism to achieve improved prediction. The proposed approach relies on a training model rather than conventional centralized learning. Findings from the experimental evaluations using 20 real-world datasets suggest that Multi-Agent Ensemble outperforms other ensemble approaches in terms of accuracy even though the base learners require fewer samples (i.e., significant reduction in computation costs).
