Orthogonal and Nonnegative Graph Reconstruction for Large Scale Clustering

Author(s):  
Junwei Han ◽  
Kai Xiong ◽  
Feiping Nie

Spectral clustering has been widely used in recent years due to its simplicity for solving the graph clustering problem. However, it suffers from high computational cost as the data grow in scale, and it is limited by the performance of its post-processing step. To address these two problems simultaneously, in this paper we propose a novel approach, denoted orthogonal and nonnegative graph reconstruction (ONGR), that scales linearly with the data size. To the relaxed objective of Normalized Cut we add a nonnegativity constraint. Owing to this nonnegativity, ONGR offers interpretability: the final cluster labels can be obtained directly without post-processing. Extensive experiments on clustering tasks demonstrate the effectiveness of the proposed method.
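A minimal sketch (not the authors' ONGR implementation) of why nonnegativity removes the post-processing step: if the relaxed cluster-indicator matrix F is nonnegative with roughly orthogonal columns, each row is dominated by a single entry, so labels follow from a row-wise argmax instead of an extra k-means run. The matrix F below is purely illustrative.

```python
import numpy as np

def labels_from_indicator(F):
    """Read cluster labels from a nonnegative relaxed indicator matrix.

    F is an (n_samples, n_clusters) nonnegative matrix with roughly
    orthogonal columns; each row is dominated by one entry, so a
    row-wise argmax yields the cluster assignment directly.
    """
    return np.argmax(F, axis=1)

# Toy near-indicator matrix for 4 samples and 2 clusters.
F = np.array([[0.9, 0.0],
              [0.8, 0.1],
              [0.0, 1.0],
              [0.1, 0.7]])
print(labels_from_indicator(F))  # -> [0 0 1 1]
```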

2006 ◽  
Vol 04 (03) ◽  
pp. 639-647 ◽  
Author(s):  
ELEAZAR ESKIN ◽  
RODED SHARAN ◽  
ERAN HALPERIN

The common approaches to haplotype inference from genotype data are targeted toward phasing short genomic regions; longer regions are often tackled heuristically because of the high computational cost. Here, we describe a novel approach for phasing genotypes over long regions that combines information from local predictions on short, overlapping regions. The phasing is done in a way that maximizes a natural maximum-likelihood criterion, which, among other things, takes into account the physical distance between neighboring single nucleotide polymorphisms. The approach is very efficient; it has been applied to several large-scale datasets and shown to be successful in two recent benchmarking studies (Zaitlen et al., in press; Marchini et al., in preparation). Our method is publicly available via a webserver at .
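A toy illustration, under simplifying assumptions, of the chaining idea described above: each short window contributes a local phasing of its heterozygous SNPs, and overlapping windows are stitched together by choosing, for each new window, the orientation that best agrees with the solution built so far. This is not the authors' algorithm; in particular, their maximum-likelihood criterion also weights evidence by the physical distance between neighboring SNPs, which is omitted here.

```python
def chain_windows(window_phasings):
    """Toy chaining of local phasings from overlapping windows.

    Each element of window_phasings is a dict {snp_index: 0 or 1} giving
    the local phase of the heterozygous SNPs in one window; adjacent
    windows share some SNPs.  Each new window is flipped, or not, so that
    it agrees with the chained solution on as many shared SNPs as possible.
    (A full likelihood criterion would also weight by inter-SNP distance.)
    """
    chained = dict(window_phasings[0])
    for win in window_phasings[1:]:
        shared = [s for s in win if s in chained]
        agree = sum(win[s] == chained[s] for s in shared)
        flip = agree < len(shared) - agree          # majority vote over the overlap
        for snp, phase in win.items():
            chained.setdefault(snp, (1 - phase) if flip else phase)
    return chained

# Example: three overlapping windows over SNPs 0..5
windows = [{0: 0, 1: 1, 2: 1}, {2: 0, 3: 0, 4: 1}, {4: 1, 5: 0}]
print(chain_windows(windows))
```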


Author(s):  
Jun Li ◽  
Liu Hongfu ◽  
Handong Zhao ◽  
Yun Fu

Low-rank subspace clustering (LRSC) has been considered the state-of-the-art method on small datasets. LRSC constructs a desired similarity graph by low-rank representation (LRR) and employs spectral clustering to segment the data samples. However, effectively applying LRSC to clustering big data is a challenge, because both LRR and spectral clustering suffer from high computational cost. To address this challenge, we create a projective low-rank subspace clustering (PLrSC) scheme for the large-scale clustering problem. First, a small dataset is randomly sampled from the big dataset. Second, our proposed predictive low-rank decomposition (PLD) is applied to train a deep encoder on the small dataset, and the deep encoder is then used to quickly compute the low-rank representations of all data samples. Third, fast spectral clustering is employed to segment the representations. As a non-trivial contribution, we theoretically prove that the deep encoder can universally approximate the exact (or bounded) recovery of the row space. Experiments verify that our scheme outperforms the related methods on large-scale datasets in a small amount of time. We achieve a state-of-the-art clustering accuracy of 95.8% on MNIST using scattering convolution features.
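A rough sketch of the sample-train-encode-cluster pipeline described above. PCA stands in for the paper's predictive low-rank decomposition and deep encoder, and scikit-learn's standard spectral clustering stands in for the fast spectral clustering step; the data, component counts, and sample size are all placeholder choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)

# Toy "big" dataset: two Gaussian blobs in 50 dimensions.
X = np.vstack([rng.normal(0, 1, (500, 50)),
               rng.normal(4, 1, (500, 50))])

# Step 1: randomly sample a small subset of the data.
sample = X[rng.choice(len(X), size=100, replace=False)]

# Step 2: fit an encoder on the small sample only (PCA here is a cheap
# stand-in for the paper's PLD-trained deep encoder), then encode ALL
# samples with it, which is cheap per sample.
encoder = PCA(n_components=5).fit(sample)
Z = encoder.transform(X)

# Step 3: spectral clustering on the low-dimensional representations.
# (The paper uses a fast, scalable spectral clustering variant instead.)
labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                            random_state=0).fit_predict(Z)
print(np.bincount(labels))
```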


2019 ◽  
Vol 34 (1) ◽  
pp. 101-123 ◽  
Author(s):  
Taito Lee ◽  
Shin Matsushima ◽  
Kenji Yamanishi

Abstract We consider the class of linear predictors over all logical conjunctions of binary attributes, which we refer to as the class of combinatorial binary models (CBMs) in this paper. CBMs offer high knowledge interpretability, but naïve learning of them from labeled data incurs a computational cost that is exponential in the length of the conjunctions. On the other hand, for large-scale datasets, long conjunctions are effective for learning predictors. To overcome this computational difficulty, we propose an algorithm, GRAfting for Binary datasets (GRAB), which efficiently learns CBMs within the L1-regularized loss minimization framework. The key idea of GRAB is to adopt weighted frequent itemset mining for the most time-consuming step of the grafting algorithm, which is designed to solve large-scale L1-RERM problems iteratively. Furthermore, we experimentally show that linear predictors of CBMs are effective in terms of prediction accuracy and knowledge discovery.
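A toy grafting loop for CBMs, under stated simplifications: candidate conjunctions (up to length 2) are enumerated by brute force and scored by the magnitude of the logistic-loss gradient, and the best one is grafted in if it exceeds the L1 threshold. GRAB's contribution is to replace exactly this brute-force candidate search with weighted frequent itemset mining; the function names, data, and threshold below are illustrative, not from the paper.

```python
import itertools
import numpy as np
from sklearn.linear_model import LogisticRegression

def conj_feature(X, conj):
    """Binary feature that is 1 iff all attributes in the conjunction are 1."""
    return np.all(X[:, list(conj)], axis=1).astype(float)

def grafting_cbm(X, y, max_len=2, lam=0.1, max_features=10):
    """Toy grafting loop for combinatorial binary models."""
    n, d = X.shape
    candidates = [c for r in range(1, max_len + 1)
                  for c in itertools.combinations(range(d), r)]
    active, model = [], None
    for _ in range(max_features):
        if active:
            Z = np.column_stack([conj_feature(X, c) for c in active])
            p = model.predict_proba(Z)[:, 1]
        else:
            p = np.full(n, 0.5)
        residual = p - y                          # gradient of the logistic loss w.r.t. scores
        scores = {c: abs(residual @ conj_feature(X, c)) / n
                  for c in candidates if c not in active}
        best = max(scores, key=scores.get)
        if scores[best] <= lam:                   # no candidate violates the KKT condition
            break
        active.append(best)                       # graft the best conjunction into the model
        Z = np.column_stack([conj_feature(X, c) for c in active])
        model = LogisticRegression(penalty="l1", C=1.0 / (lam * n),
                                   solver="liblinear").fit(Z, y)
    return active, model

# Toy data: the label is the conjunction x0 AND x2.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6))
y = (X[:, 0] & X[:, 2]).astype(float)
print(grafting_cbm(X, y, lam=0.05)[0])   # the conjunction (0, 2) should appear in the active set
```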


F1000Research ◽  
2017 ◽  
Vol 5 ◽  
pp. 1987 ◽  
Author(s):  
Jasper J. Koehorst ◽  
Edoardo Saccenti ◽  
Peter J. Schaap ◽  
Vitor A. P. Martins dos Santos ◽  
Maria Suarez-Diez

A functional comparative genome analysis is essential to understand the mechanisms underlying bacterial evolution and adaptation. Detection of functional orthologs using standard global sequence similarity methods faces several problems: the need to define arbitrary acceptance thresholds for similarity and alignment length, lateral gene acquisition, and the high computational cost of finding bi-directional best matches at a large scale. We investigated the use of protein domain architectures for large-scale functional comparative analysis as an alternative method. The performance of both approaches was assessed through functional comparison of 446 bacterial genomes sampled at different taxonomic levels. We show that protein domain architectures provide a fast and efficient alternative to sequence-similarity-based methods for identifying groups of functionally equivalent proteins within and across taxonomic boundaries, and that the approach is suitable for large-scale comparative analysis. Running both methods in parallel pinpoints potential functional adaptations that may add to bacterial fitness.
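A minimal sketch of the core idea, not the authors' pipeline: proteins annotated with the same ordered set of domains are grouped as candidate functional equivalents by a single hashing pass, so the cost grows linearly with the number of proteins rather than quadratically as in all-vs-all bi-directional best-hit searches. The protein identifiers below are hypothetical.

```python
from collections import defaultdict

def group_by_domain_architecture(proteins):
    """Group proteins that share an identical domain architecture.

    proteins: dict mapping protein ID -> ordered tuple of domain IDs
    (e.g. Pfam accessions).  Proteins with the same architecture are
    treated as candidate functional equivalents; grouping is a single
    hashing pass over all proteins.
    """
    groups = defaultdict(list)
    for protein_id, architecture in proteins.items():
        groups[architecture].append(protein_id)
    return dict(groups)

# Toy example with hypothetical protein identifiers.
proteins = {
    "orgA_0001": ("PF00005", "PF00664"),
    "orgB_0412": ("PF00005", "PF00664"),
    "orgC_0099": ("PF00072",),
}
print(group_by_domain_architecture(proteins))
```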


F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 1987 ◽  
Author(s):  
Jasper J. Koehorst ◽  
Edoardo Saccenti ◽  
Peter J. Schaap ◽  
Vitor A. P. Martins dos Santos ◽  
Maria Suarez-Diez

A functional comparative genome analysis is essential to understand the mechanisms underlying bacterial evolution and adaptation. Detection of functional orthologs using standard global sequence similarity methods faces several problems: the need to define arbitrary acceptance thresholds for similarity and alignment length, lateral gene acquisition, and the high computational cost of finding bi-directional best matches at a large scale. We investigated the use of protein domain architectures for large-scale functional comparative analysis as an alternative method. The performance of both approaches was assessed through functional comparison of 446 bacterial genomes sampled at different taxonomic levels. We show that protein domain architectures provide a fast and efficient alternative to sequence-similarity-based methods for identifying groups of functionally equivalent proteins within and across taxonomic boundaries. As the computational cost scales linearly rather than quadratically with the number of genomes, the approach is suitable for large-scale comparative analysis. Running both methods in parallel pinpoints potential functional adaptations that may add to bacterial fitness.


2020 ◽  
Vol 9 (11) ◽  
pp. 656
Author(s):  
Muhammad Hamid Chaudhry ◽  
Anuar Ahmad ◽  
Qudsia Gulzar

Unmanned Aerial Vehicles (UAVs) as a surveying tool are mainly characterized by large amounts of data and high computational cost. This research investigates the use of a smaller amount of data, at lower computational cost, to obtain more accurate three-dimensional (3D) photogrammetric products by manipulating UAV surveying parameters such as the flight-line pattern and the image overlap percentages. Sixteen photogrammetric projects with perpendicular flight plans and side and forward overlaps varying from 55% to 85% were processed in Pix4DMapper. For UAV data georeferencing and accuracy assessment, 10 Ground Control Points (GCPs) and 18 Check Points (CPs) were used. A comparative analysis was carried out using the median number of tie points, the number of 3D point-cloud points, the horizontal/vertical Root Mean Square Error (RMSE), and large-scale topographic variations. The results show that increasing the forward overlap also increases the median number of tie points, and that increasing both the side and forward overlap increases the number of points in the point cloud. The horizontal accuracy of the 16 projects varies from ±0.13 m to ±0.17 m, whereas the vertical accuracy varies from ±0.09 m to ±0.32 m. However, the lowest vertical RMSE was not obtained at the highest overlap percentage. A suitable trade-off among UAV surveying parameters can therefore yield high-accuracy products at lower computational cost.
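For reference, horizontal and vertical RMSE over check points are computed as below; this is a generic accuracy-assessment sketch, not the authors' workflow, and the coordinate values are hypothetical.

```python
import numpy as np

def rmse_horizontal_vertical(measured, reference):
    """Horizontal and vertical RMSE over check points.

    measured, reference: arrays of shape (n_points, 3) holding E, N, H
    coordinates (metres) from the photogrammetric model and from the
    surveyed check points, respectively.
    """
    diff = np.asarray(measured) - np.asarray(reference)
    rmse_h = np.sqrt(np.mean(diff[:, 0] ** 2 + diff[:, 1] ** 2))  # planimetric
    rmse_v = np.sqrt(np.mean(diff[:, 2] ** 2))                     # height
    return rmse_h, rmse_v

# Hypothetical coordinates for three check points (metres).
measured  = np.array([[100.12, 200.05, 50.10],
                      [150.08, 240.02, 52.31],
                      [180.03, 260.11, 49.87]])
reference = np.array([[100.00, 200.00, 50.00],
                      [150.00, 240.00, 52.20],
                      [180.00, 260.00, 50.00]])
print(rmse_horizontal_vertical(measured, reference))
```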


2019 ◽  
Vol 35 (21) ◽  
pp. 4229-4238 ◽  
Author(s):  
Ziye Wang ◽  
Zhengyang Wang ◽  
Yang Young Lu ◽  
Fengzhu Sun ◽  
Shanfeng Zhu

Abstract Motivation Metagenomic contig binning is an important computational problem in metagenomic research, which aims to cluster contigs from the same genome into the same group. Unlike the classical clustering problem, contig binning can utilize known relationships among some of the contigs, or the taxonomic identity of some contigs. However, current state-of-the-art contig binning methods do not make full use of additional biological information beyond the coverage and sequence composition of the contigs. Results We developed a novel contig binning method, Semi-supervised Spectral Normalized Cut for Binning (SolidBin), based on semi-supervised spectral clustering. Using sequence-feature similarity and/or additional biological information, such as reliable taxonomy assignments of some contigs, SolidBin constructs two types of prior information: must-link and cannot-link constraints. A must-link constraint means that a pair of contigs should be clustered into the same group, while a cannot-link constraint means that a pair of contigs should be clustered into different groups. These constraints are then integrated into a classical spectral clustering approach, normalized cut, for improved contig binning. The performance of SolidBin is compared with five state-of-the-art genome binners, CONCOCT, COCACOLA, MaxBin, MetaBAT and BMC3C, on five next-generation sequencing benchmark datasets, including simulated multi- and single-sample datasets and real multi-sample datasets. The experimental results show that SolidBin achieves the best performance in terms of F-score, Adjusted Rand Index and Normalized Mutual Information, especially on the real datasets and the single-sample dataset. Availability and implementation https://github.com/sufforest/SolidBin. Supplementary information Supplementary data are available at Bioinformatics online.
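A minimal sketch of one common way to fold must-link/cannot-link constraints into a spectral clustering pipeline, offered as an illustration rather than SolidBin's exact formulation: the constraints directly reshape the affinity matrix before the normalized-cut style clustering. The function name and toy feature matrix are assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import rbf_kernel

def constrained_spectral_binning(features, must_links, cannot_links, n_bins):
    """Spectral clustering of contigs with pairwise constraints.

    features: (n_contigs, d) matrix of coverage/composition features.
    must_links / cannot_links: lists of (i, j) contig index pairs.
    Must-link pairs get maximal affinity, cannot-link pairs get zero
    affinity, and the resulting graph is clustered with a normalized-cut
    style spectral method.
    """
    W = rbf_kernel(features)                 # sequence-feature similarity
    for i, j in must_links:
        W[i, j] = W[j, i] = 1.0
    for i, j in cannot_links:
        W[i, j] = W[j, i] = 0.0
    np.fill_diagonal(W, 0.0)
    return SpectralClustering(n_clusters=n_bins, affinity="precomputed",
                              random_state=0).fit_predict(W)

# Toy example: 20 "contigs" drawn from two feature clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 4)), rng.normal(2, 0.3, (10, 4))])
print(constrained_spectral_binning(X, must_links=[(0, 1)],
                                   cannot_links=[(0, 10)], n_bins=2))
```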


2012 ◽  
Vol 09 ◽  
pp. 480-487 ◽  
Author(s):  
UMMUL KHAIR SALMA DIN ◽  
FUDZIAH ISMAIL ◽  
ZANARIAH ABDUL MAJID ◽  
ROKIAH ROZITA AHMAD

The Medical Akzo Nobel problem (MEDAKZO) is known for its tendency to incur high computational cost. Originating from the penetration of radio-labeled antibodies into tissue infected by a tumor, the problem is derived from one-dimensional partial differential equations into a two-dimensional system of ordinary differential equations, thus generating a large-scale problem to be solved. This paper presents the performance of a new 4(3) diagonally implicit Runge-Kutta (DIRK) method that is well suited to solving the MEDAKZO problem, which is stiff in nature. The sparsity pattern designed into the method enables the function evaluations to be computed simultaneously on two processors. This functional load balancing can be particularly profitable when solving large problems.
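To illustrate the class of integrators involved, here is one step of a generic two-stage, second-order, L-stable DIRK scheme (Alexander's method), not the 4(3) pair from the paper: the lower-triangular Butcher matrix with a constant diagonal means each stage needs only one implicit solve, and the stages are computed one after another. The stiff test problem is an arbitrary toy.

```python
import numpy as np
from scipy.optimize import fsolve

def dirk2_step(f, t, y, h):
    """One step of a two-stage, second-order, L-stable DIRK method
    (Alexander's method, gamma = 1 - 1/sqrt(2)).

    The Butcher matrix is lower triangular with a constant diagonal, so
    each stage requires a single small nonlinear solve and the stages
    are solved sequentially.
    """
    g = 1.0 - 1.0 / np.sqrt(2.0)
    # Stage 1: k1 = f(t + g*h, y + h*g*k1)
    k1 = fsolve(lambda k: k - f(t + g * h, y + h * g * k), f(t, y))
    # Stage 2: k2 = f(t + h, y + h*((1-g)*k1 + g*k2))
    k2 = fsolve(lambda k: k - f(t + h, y + h * ((1 - g) * k1 + g * k)), k1)
    return y + h * ((1 - g) * k1 + g * k2)

# Stiff scalar test problem: y' = -50*(y - cos(t)), y(0) = 0.
f = lambda t, y: -50.0 * (y - np.cos(t))
t, y, h = 0.0, np.array([0.0]), 0.1
for _ in range(10):
    y = dirk2_step(f, t, y, h)
    t += h
print(t, y)   # should track cos(t) closely despite the large step
```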


2020 ◽  
Vol 499 (2) ◽  
pp. 2685-2700
Author(s):  
Enrico Garaldi ◽  
Matteo Nori ◽  
Marco Baldi

ABSTRACT The advent of a new generation of large-scale galaxy surveys is pushing cosmological numerical simulations into uncharted territory. The simultaneous requirements of high resolution and very large volume pose serious technical challenges, due to their computational and data-storage demands. In this paper, we present a novel approach, dubbed dynamic zoom simulations (dzs), developed to tackle these issues. Our method is tailored to the production of light-cone outputs from N-body numerical simulations, which allow for more efficient storage and post-processing than standard comoving snapshots and more directly mimic the format of survey data. In dzs, the resolution of the simulation is dynamically decreased outside the light-cone surface, reducing the computational workload, while the accuracy inside the light-cone and the large-scale gravitational field are preserved. We show that our approach achieves virtually identical results to traditional simulations at half the computational cost for our largest box, and we forecast this speedup to increase up to a factor of 5 for larger and/or higher-resolution simulations. We assess the accuracy of the numerical integration by comparing pairs of identical simulations run with and without dzs: deviations in the light-cone halo mass function, in the sky-projected light-cone, and in the 3D matter light-cone always remain below 0.1 per cent. In summary, our results indicate that the dzs technique may provide a highly valuable tool for addressing the technical challenges that will characterize the next generation of large-scale cosmological simulations.
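A toy illustration of the bookkeeping behind lowering resolution outside the light-cone, under simplifying assumptions and without any of the paper's machinery: a particle whose comoving distance from the observer already exceeds the comoving radius of the past light-cone at the current simulation time will not enter the light-cone later (that radius only shrinks as the run approaches the observer's time), so it can be flagged for merging into lower-resolution tracers; a real implementation would add safety margins. The function name and values are illustrative.

```python
import numpy as np

def flag_low_resolution(positions, observer, r_lightcone):
    """Flag particles that can be represented at lower resolution.

    positions: (n, 3) comoving particle positions; observer: (3,) comoving
    observer position; r_lightcone: comoving radius of the past light cone
    at the current simulation time.  Particles beyond that radius are
    candidates for merging into heavier, lower-resolution particles, while
    the region inside the light cone keeps full resolution.
    """
    distance = np.linalg.norm(positions - observer, axis=1)
    return distance > r_lightcone

# Toy example: 5 particles, observer at the origin, light-cone radius 100 Mpc/h.
pos = np.array([[10.0, 0.0, 0.0],
                [80.0, 50.0, 0.0],
                [120.0, 0.0, 0.0],
                [0.0, 99.0, 0.0],
                [200.0, 200.0, 10.0]])
print(flag_low_resolution(pos, np.zeros(3), 100.0))   # -> [False False True False True]
```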

