Large-Scale Multi-View Subspace Clustering in Linear Time

2020
Vol 34 (04)
pp. 4412-4419
Author(s):
Zhao Kang
Wangtao Zhou
Zhitong Zhao
Junming Shao
Meng Han
...  

A plethora of multi-view subspace clustering (MVSC) methods have been proposed over the past few years, with researchers boosting clustering accuracy from different points of view. However, many state-of-the-art MVSC algorithms typically have quadratic or even cubic complexity, making them inefficient and inherently difficult to apply at large scale. In the era of big data, this computational issue becomes critical. To fill the gap, we propose a large-scale MVSC (LMVSC) algorithm with linear-order complexity. Inspired by the idea of anchor graphs, we first learn a smaller graph for each view. Then, a novel approach is designed to integrate those graphs so that spectral clustering can be performed on a smaller graph. Interestingly, it turns out that our model also applies to the single-view scenario. Extensive experiments on various large-scale benchmark data sets validate the effectiveness and efficiency of our approach with respect to state-of-the-art clustering methods.
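The anchor idea is simple enough to sketch. Below is a minimal Python illustration, assuming k-means centroids as anchors and a Gaussian kernel for the sample-anchor weights; the function names and the plain averaging of per-view graphs are illustrative stand-ins, not the authors' learned integration step.

```python
import numpy as np
from sklearn.cluster import KMeans

def anchor_graph(X, m=50, k=5, sigma=1.0):
    """Build an n x m anchor graph: k-means centroids serve as anchors,
    and each sample keeps Gaussian weights to its k nearest anchors."""
    anchors = KMeans(n_clusters=m, n_init=10).fit(X).cluster_centers_
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)  # n x m distances
    Z = np.zeros_like(d2)
    rows = np.arange(X.shape[0])[:, None]
    nearest = np.argsort(d2, axis=1)[:, :k]
    Z[rows, nearest] = np.exp(-d2[rows, nearest] / (2 * sigma ** 2))
    return Z / Z.sum(axis=1, keepdims=True)

def cluster_from_anchor_graph(Z, n_clusters):
    """Spectral step on the small graph: the SVD of the n x m matrix Z
    costs O(n m^2), i.e., linear in the number of samples n."""
    Zn = Z / np.sqrt(Z.sum(axis=0, keepdims=True) + 1e-12)
    U, _, _ = np.linalg.svd(Zn, full_matrices=False)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U[:, :n_clusters])

# Plain averaging as a stand-in for the paper's graph integration:
# Z = np.mean([anchor_graph(Xv) for Xv in views], axis=0)
```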

2018
Vol 30 (4)
pp. 1080-1103
Author(s):
Kun Zhan
Jinhui Shi
Jing Wang
Haibo Wang
Yuange Xie

Most existing multiview clustering methods require that graph matrices in different views are computed beforehand and that each graph is obtained independently. However, this requirement ignores the correlation between multiple views. In this letter, we tackle the problem of multiview clustering by jointly optimizing the graph matrix to make full use of the data correlation between views. Using this inter-view correlation, a concept factorization-based multiview clustering method is developed for data integration, and the adaptive method correlates the affinity weights of all views. The method differs from nonnegative matrix factorization-based clustering methods in that it is applicable to data sets containing negative values. Experiments demonstrate the effectiveness of the proposed method in comparison with state-of-the-art approaches in terms of accuracy, normalized mutual information, and purity.
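As a point of reference, single-view concept factorization can be written as X ≈ XWV^T with nonnegative W and V, which keeps the bases in the span of the data itself and therefore tolerates negative entries. The sketch below uses plain projected gradient descent and omits the paper's joint graph optimization and adaptive view weighting; all names and step sizes are illustrative.

```python
import numpy as np

def concept_factorization(X, k, iters=1000, lr=1e-4, seed=0):
    """Minimize ||X - X W V^T||_F^2 over W, V >= 0 (X is d x n).
    Because the bases X W are combinations of the data itself,
    X may contain negative values, unlike in plain NMF."""
    n = X.shape[1]
    rng = np.random.default_rng(seed)
    W, V = rng.random((n, k)), rng.random((n, k))
    K = X.T @ X                                   # n x n kernel matrix
    for _ in range(iters):
        gW = 2 * (K @ W @ (V.T @ V) - K @ V)      # gradient w.r.t. W
        gV = 2 * (V @ (W.T @ K @ W) - K @ W)      # gradient w.r.t. V
        W = np.maximum(W - lr * gW, 0.0)          # projected gradient step
        V = np.maximum(V - lr * gV, 0.0)
    return W, V        # cluster of sample j: np.argmax(V[j])
```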


Author(s):  
Jun Huang
Linchuan Xu
Jing Wang
Lei Feng
Kenji Yamanishi

Existing multi-label learning (MLL) approaches mainly assume that all labels are observed and construct classification models with a fixed set of target labels (known labels). However, in some real applications, multiple latent labels may exist outside this set and hide in the data, especially for large-scale data sets. Discovering and exploring the latent labels hidden in the data may not only uncover interesting knowledge but also help build a more robust learning model. In this paper, a novel approach named DLCL (Discovering Latent Class Labels for MLL) is proposed which can not only discover the latent labels in the training data but also predict new instances with the latent and known labels simultaneously. Extensive experiments show a competitive performance of DLCL against other state-of-the-art MLL approaches.


Identifying communities has always been a fundamental task in the analysis of complex networks. Currently used algorithms that identify community structures in large-scale real-world networks either require a priori information, such as the number and sizes of communities, or are computationally expensive. Among them, the label propagation algorithm (LPA) brings great scalability, but its accuracy suffers because of its randomness. In this paper, we study the equivalence properties of nodes on social network graphs according to labeling criteria in order to shorten the graphs, and we develop label propagation algorithms on the shortened graphs that discover effective communities without requiring optimization of an objective function or advance information about the communities. Test results on sample data sets show that the proposed algorithm's execution time is significantly reduced compared to published algorithms. The proposed algorithm takes almost linear time and improves the overall quality of the identified communities in complex networks with a clear community structure.
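For context, the classic LPA that this work builds on fits in a few lines; the random node order and tie-breaking below are exactly the source of instability the authors address. This is the textbook algorithm, not the shortened-graph variant proposed here.

```python
import random
from collections import Counter

def label_propagation(adj, max_iters=100, seed=0):
    """Classic LPA on an adjacency dict {node: set(neighbors)}:
    each node repeatedly adopts the most frequent label among its
    neighbors until no label changes."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}       # every node starts in its own community
    for _ in range(max_iters):
        changed = False
        order = list(adj)
        rng.shuffle(order)             # random order -> run-to-run variation
        for v in order:
            if not adj[v]:
                continue
            counts = Counter(labels[u] for u in adj[v])
            top = max(counts.values())
            pick = rng.choice([l for l, c in counts.items() if c == top])
            if pick != labels[v]:
                labels[v], changed = pick, True
        if not changed:
            break
    return labels

# Two triangles bridged by one edge resolve into two communities:
# adj = {0:{1,2}, 1:{0,2}, 2:{0,1,3}, 3:{2,4,5}, 4:{3,5}, 5:{3,4}}
```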


2020
Author(s):  
Zeyu Jiao
Yinglei Lai
Jujiao Kang
Weikang Gong
Liang Ma
...  

High-throughput technologies, such as magnetic resonance imaging (MRI) and DNA/RNA sequencing (DNA-seq/RNA-seq), have been increasingly used in large-scale association studies, and important biomedical research findings have been generated with them. The reproducibility of these findings, especially from structural MRI (sMRI) and functional MRI (fMRI) association studies, has recently been questioned. There is an urgent demand for a reliable overall reproducibility assessment for large-scale high-throughput association studies. It is also desirable to understand the relationship between study reproducibility and sample size in an experimental design. In this study, we developed a novel approach, the mixture model reproducibility index (M2RI), for assessing the study reproducibility of large-scale association studies. With M2RI, we performed reproducibility analyses for several recent large sMRI/fMRI data sets. The advantages of our approach, especially compared to the Dice coefficient (DC), were clearly demonstrated, as were the sample size requirements for different phenotypes. We also applied M2RI to compare two MRI or RNA sequencing data sets, and the reproducibility assessment results were consistent with our expectations. In summary, M2RI is a novel and useful approach for assessing study reproducibility, calculating sample sizes, and evaluating the similarity between two closely related studies.
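The Dice coefficient used as the baseline has a simple form: twice the overlap of two significant-feature sets over the sum of their sizes. A minimal sketch follows (the feature IDs and thresholding are assumptions for illustration; M2RI itself fits a mixture model to the test statistics and is specified in the paper):

```python
def dice_coefficient(sig_a, sig_b):
    """DC between two studies' sets of significant features
    (e.g., voxels or genes passing the same threshold)."""
    a, b = set(sig_a), set(sig_b)
    if not a and not b:
        return 1.0            # nothing found in either study
    return 2 * len(a & b) / (len(a) + len(b))

# dice_coefficient([1, 2, 3, 4], [3, 4, 5]) -> 4/7 ≈ 0.57
```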


2021
Vol 8
Author(s):
Shuqin Wang
Yongyong Chen
Fangying Zheng

Multi-view clustering has been deeply explored since it can capture the compatible and complementary information among views. Recently, low-rank tensor representation-based methods have effectively improved clustering performance by exploring high-order correlations between multiple views. However, most of them express the low-rank structure of the self-representative tensor by the sum of unfolded matrix nuclear norms, which may cause a loss of information in the tensor structure. In addition, the amount of effective information differs across views, so it is unreasonable to treat their contributions to clustering equally. To address these issues, we propose a novel weighted low-rank tensor representation (WLRTR) method for multi-view subspace clustering, which encodes the low-rank structure of the representation tensor through Tucker decomposition and weights the core tensor to retain the main information of the views. Under the augmented Lagrangian method framework, an iterative algorithm is designed to solve the WLRTR method. Numerical studies on four real databases demonstrate that WLRTR is superior to eight state-of-the-art clustering methods.
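A Tucker decomposition can be obtained via the truncated higher-order SVD, which is the natural starting point for the encoding described above. The sketch below is plain HOSVD in NumPy; WLRTR's contribution, weighting the core tensor entries, would act on the core G before reconstruction, and the weighting scheme itself is the paper's (not reproduced here).

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move the given mode to the front and flatten."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T, ranks):
    """Truncated higher-order SVD, one standard route to a Tucker
    decomposition T ≈ G x_1 U1 x_2 U2 ... with a small core G."""
    U = [np.linalg.svd(unfold(T, n), full_matrices=False)[0][:, :r]
         for n, r in enumerate(ranks)]
    G = T
    for Un in U:                                # contracting mode 0 each time
        G = np.tensordot(G, Un, axes=(0, 0))    # cycles the modes back in order
    return G, U

# T = np.random.rand(20, 20, 5); G, U = hosvd(T, (4, 4, 3))
# A WLRTR-style weighting would shrink or keep entries of G here.
```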


Author(s):  
Lei Zhou
Xiao Bai
Dong Wang
Xianglong Liu
Jun Zhou
...  

Subspace clustering is a useful technique for many computer vision applications in which the intrinsic dimension of high-dimensional data is smaller than the ambient dimension. Traditional subspace clustering methods often rely on the self-expressiveness property, which has proven effective for linear subspace clustering. However, they perform unsatisfactorily on real data with complex nonlinear subspaces. More recently, deep autoencoder-based subspace clustering methods have achieved success owing to the more powerful representations extracted by the autoencoder network. Unfortunately, these methods consider only the reconstruction of the original input data and so can hardly guarantee a latent representation suited to data distributed in subspaces, which inevitably limits performance in practice. In this paper, we propose a novel deep subspace clustering method based on a latent distribution-preserving autoencoder, which introduces a distribution consistency loss to guide the learning of a distribution-preserving latent representation and consequently provides a strong capacity for characterizing real-world data for subspace clustering. Experimental results on several public databases show that our method achieves significant improvement compared with state-of-the-art subspace clustering methods.
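To make the setup concrete, here is a minimal PyTorch sketch of a deep subspace clustering model with a self-expressive layer (Z ≈ CZ) on the latent codes; the paper's distribution consistency loss would enter as an additional term on Z and is left as a placeholder, since its exact form is defined in the paper. Layer sizes and penalty weights are illustrative.

```python
import torch
import torch.nn as nn

class SelfExpressiveAE(nn.Module):
    """Autoencoder with a learned self-expression matrix C over the
    n training samples; clusters come from spectral clustering on the
    affinity |C| + |C|^T after training."""
    def __init__(self, dim_in, dim_z, n_samples):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim_in, 128), nn.ReLU(),
                                 nn.Linear(128, dim_z))
        self.dec = nn.Sequential(nn.Linear(dim_z, 128), nn.ReLU(),
                                 nn.Linear(128, dim_in))
        self.C = nn.Parameter(1e-4 * torch.randn(n_samples, n_samples))

    def forward(self, X):
        Z = self.enc(X)
        C = self.C - torch.diag(torch.diag(self.C))  # forbid trivial C = I
        recon = ((self.dec(Z) - X) ** 2).mean()      # reconstruction loss
        self_expr = ((C @ Z - Z) ** 2).mean()        # Z should lie in subspaces
        reg = C.abs().mean()                         # sparsity on C
        # + distribution consistency loss on Z (see the paper)
        return recon + self_expr + 1e-3 * reg, C
```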


Author(s):  
Hong Lu
Xiangyang Xue

With the amount of video data increasing rapidly, automatic methods are needed to deal with large-scale video data sets in various applications. In content-based video analysis, a common and fundamental preprocessing step for these applications is video segmentation. Based on the segmentation results, video has a hierarchical representation structure of frames, shots, and scenes from the low level to the high level. Due to the huge number of video frames, it is not appropriate to represent video content at the frame level. Within this structure, a shot is defined as an unbroken sequence of frames from one camera; however, the content of a single shot is trivial and can hardly convey valuable semantic information. A scene, on the other hand, is a group of consecutive shots that focuses on an object or objects of interest, and it can serve as a semantic unit for further processing such as story extraction, video summarization, etc. In this chapter, we survey methods for video scene segmentation. Specifically, there are two kinds of scene definitions. The first considers only the visual similarity of video shots and applies clustering methods to group them into scenes. The second considers both the visual similarity and the temporal constraints of video shots, i.e., shots with similar content that do not lie too far apart in temporal order. We also present our proposed methods for scene clustering and scene segmentation using Gaussian mixture models, graph theory, sequential change detection, and spectral methods.
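The second kind of scene definition (visual similarity plus a temporal constraint) can be illustrated with a short greedy pass; the threshold, window size, and use of cosine similarity are illustrative assumptions, not the chapter's GMM, graph, or spectral formulations.

```python
import numpy as np

def group_shots_into_scenes(shot_feats, sim_thresh=0.8, window=4):
    """Greedy temporally constrained grouping: shot i joins the current
    scene if it is visually similar (cosine) to any of the last `window`
    shots in that scene; otherwise a new scene begins."""
    F = shot_feats / np.linalg.norm(shot_feats, axis=1, keepdims=True)
    scenes, current = [], [0]
    for i in range(1, len(F)):
        if max(float(F[i] @ F[j]) for j in current[-window:]) >= sim_thresh:
            current.append(i)
        else:
            scenes.append(current)
            current = [i]
    scenes.append(current)
    return scenes        # list of scenes, each a list of shot indices
```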


2020
Vol 37 (10)
pp. 3047-3060
Author(s):  
Xiang Ji
Zhenyu Zhang
Andrew Holbrook
Akihiko Nishimura
Guy Baele
...  

Calculation of the log-likelihood stands as the computational bottleneck for many statistical phylogenetic algorithms. Even worse is its gradient evaluation, often used to target regions of high probability. O(N)-dimensional gradient calculations based on the standard pruning algorithm require O(N²) operations, where N is the number of sampled molecular sequences. With the advent of high-throughput sequencing, recent phylogenetic studies have analyzed hundreds to thousands of sequences, with an apparent trend toward even larger data sets as a result of advancing technology. Such large-scale analyses challenge phylogenetic reconstruction by requiring inference on larger sets of process parameters to model the increasing data heterogeneity. To make these analyses tractable, we present a linear-time algorithm for O(N)-dimensional gradient evaluation and apply it to general continuous-time Markov processes of sequence substitution on a phylogenetic tree without a need to assume either stationarity or reversibility. We apply this approach to learn the branch-specific evolutionary rates of three pathogenic viruses: West Nile virus, Dengue virus, and Lassa virus. Our proposed algorithm significantly improves inference efficiency with a 126- to 234-fold increase in maximum-likelihood optimization and a 16- to 33-fold computational performance increase in a Bayesian framework.
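The core trick behind such linear-time gradients is to pair the usual post-order (pruning) pass with a pre-order pass, after which every branch's derivative is a constant-size contraction. The toy below does this for a fixed five-node tree under a two-state symmetric model on a single site; the tree, model, and hardcoded traversal order are illustrative simplifications of the general CTMC setting treated in the paper.

```python
import numpy as np

# Two-state symmetric CTMC: P(t) = expm(Q t), and dP/dt = Q P(t).
Q = np.array([[-1.0, 1.0], [1.0, -1.0]])
def P(t):
    e = np.exp(-2.0 * t)
    return 0.5 * np.array([[1 + e, 1 - e], [1 - e, 1 + e]])
def dP(t):
    return Q @ P(t)

# Toy rooted tree: root 4 -> {2, 3}, internal 2 -> {0, 1}; leaves 0, 1, 3.
children = {4: [2, 3], 2: [0, 1], 3: [], 0: [], 1: []}
branch = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}   # length above each non-root node
leaf_state = {0: 0, 1: 1, 3: 0}             # one observed site
pi = np.array([0.5, 0.5])                   # root state prior

L, U, grad = {}, {}, {}

def postorder(v):
    """Pruning pass: L[v][s] = P(data below v | state s at v)."""
    if not children[v]:
        L[v] = np.eye(2)[leaf_state[v]]
        return
    for c in children[v]:
        postorder(c)
    L[v] = np.prod([P(branch[c]) @ L[c] for c in children[v]], axis=0)

postorder(4)
U[4] = pi
for v in [4, 2]:                            # any top-down order over internals
    for c in children[v]:
        others = [P(branch[w]) @ L[w] for w in children[v] if w != c]
        M = U[v] * (np.prod(others, axis=0) if others else np.ones(2))
        grad[c] = M @ dP(branch[c]) @ L[c]  # d(likelihood)/d(branch length)
        U[c] = M @ P(branch[c])             # pre-order partial at c

likelihood = pi @ L[4]
# Both passes are O(N), so all N branch gradients cost O(N) in total,
# versus O(N^2) when each gradient reruns the pruning algorithm.
```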


2017
Vol 15 (06)
pp. 1740006
Author(s):
Mohammad Arifur Rahman
Nathan LaPierre
Huzefa Rangwala
Daniel Barbara

Metagenomics is the collective sequencing of co-existing microbial communities, which are ubiquitous across various clinical and ecological environments. Due to the large volume of random short sequences (reads) obtained from community sequencing, analyzing the diversity, abundance, and functions of different organisms within these communities is a challenging task. We present a fast and scalable clustering algorithm for analyzing large-scale metagenome sequence data. Our approach achieves efficiency by partitioning the large number of sequence reads into groups (called canopies) using hashing. These canopies are then refined using state-of-the-art sequence clustering algorithms. This canopy clustering (CC) algorithm can be used as a pre-processing phase for computationally expensive clustering algorithms. We use and compare three hashing schemes for canopy construction with five popular and state-of-the-art sequence clustering methods. We evaluate our clustering algorithm on synthetic and real-world 16S and whole-metagenome benchmarks. We demonstrate the ability of our proposed approach to determine meaningful Operational Taxonomic Units (OTUs) and observe significant speedups in run time when compared to different clustering algorithms. We also make our source code publicly available on GitHub.
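The hashing-based canopy step can be illustrated with a MinHash over k-mer sets, one plausible scheme (the paper compares three; this salted-hash construction, the value of k, and the number of hash functions are assumptions for illustration). Reads that share a minimum hash value land in the same coarse canopy, which a slower clustering tool then refines.

```python
from collections import defaultdict
import random

def minhash_canopies(reads, k=8, n_hash=4, seed=0):
    """Toy canopy construction: for each of n_hash salted hash functions,
    a read's signature is the minimum hash over its k-mers; reads sharing
    a signature fall into the same (possibly overlapping) canopy."""
    rng = random.Random(seed)
    salts = [rng.randrange(1, 2**31) for _ in range(n_hash)]
    canopies = defaultdict(list)
    for read_id, seq in enumerate(reads):
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        for salt in salts:
            sig = (salt, min(hash((salt, kmer)) for kmer in kmers))
            canopies[sig].append(read_id)
    return canopies   # each canopy is then refined by a sequence clusterer

# reads = ["ACGTACGTGGAT...", ...]; canopies = minhash_canopies(reads)
```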


2015
Author(s):
Stinus Lindgreen
Karen L Adair
Paul Gardner

Metagenome studies are becoming increasingly widespread, yielding important insights into microbial communities covering diverse environments, from terrestrial and aquatic ecosystems to the human skin and gut. With the advent of high-throughput sequencing platforms, the use of large-scale shotgun sequencing approaches is now commonplace. However, a thorough independent benchmark comparing state-of-the-art metagenome analysis tools has been lacking. Here, we present a benchmark in which the most widely used tools are tested on complex, realistic data sets. Our results clearly show that the most widely used tools are not necessarily the most accurate, that the most accurate tool is not necessarily the most time consuming, and that there is a high degree of variability between available tools. These findings are important because the conclusions of any metagenomics study are affected by errors in the predicted community composition. Data sets and results are freely available from http://www.ucbioinformatics.org/metabenchmark.html

