A Clustering System for Dynamic Data Streams Based on Metaheuristic Optimisation

Mathematics ◽  
2019 ◽  
Vol 7 (12) ◽  
pp. 1229 ◽  
Author(s):  
Jia Ming Yeoh ◽  
Fabio Caraffini ◽  
Elmina Homapour ◽  
Valentino Santucci ◽  
Alfredo Milani

This article presents the Optimised Stream clustering algorithm (OpStream), a novel approach to clustering dynamic data streams. The proposed system displays desirable features, such as a low number of parameters and good scalability to both high-dimensional data and large numbers of clusters, and it is based on a hybrid structure that combines deterministic clustering methods with stochastic optimisation approaches to optimally centre the clusters. Like other state-of-the-art methods available in the literature, it uses “microclusters” and other established techniques, such as density-based clustering. Unlike other methods, it employs metaheuristic optimisation to maximise performance during the initialisation phase, which precedes the classic online phase. Experimental results show that OpStream outperforms the state-of-the-art methods in several cases and is always competitive against the comparison algorithms regardless of the chosen optimisation method. Three variants of OpStream, each equipped with a different optimisation algorithm, are presented in this study. A thorough sensitivity analysis is performed on the best variant to demonstrate OpStream’s robustness to noise and resilience to parameter changes.
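
As a loose illustration of the initialisation idea described above, the sketch below places k centroids over an initial batch by minimising the within-cluster sum of squared distances with a metaheuristic. SciPy's differential evolution, the fitness function, and the bounds are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of metaheuristic centroid initialisation in the spirit of
# OpStream's offline phase; all optimiser choices here are assumptions.
import numpy as np
from scipy.optimize import differential_evolution

def init_centroids(batch: np.ndarray, k: int) -> np.ndarray:
    """Optimise k centroid positions over an initial data batch."""
    n, d = batch.shape
    lo, hi = batch.min(axis=0), batch.max(axis=0)

    def fitness(flat_centroids: np.ndarray) -> float:
        # Within-cluster sum of squared distances to the nearest centroid.
        c = flat_centroids.reshape(k, d)
        dists = ((batch[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return dists.min(axis=1).sum()

    bounds = [(lo[j], hi[j]) for _ in range(k) for j in range(d)]
    result = differential_evolution(fitness, bounds, maxiter=100, seed=0)
    return result.x.reshape(k, d)

# Usage: centre 3 clusters on a toy 2-D batch before the online phase begins.
rng = np.random.default_rng(0)
batch = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in (0, 3, 6)])
print(init_centroids(batch, k=3))
```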

2020 ◽  
Vol 34 (04) ◽  
pp. 4412-4419 ◽  
Author(s):  
Zhao Kang ◽  
Wangtao Zhou ◽  
Zhitong Zhao ◽  
Junming Shao ◽  
Meng Han ◽  
...  

A plethora of multi-view subspace clustering (MVSC) methods have been proposed over the past few years, with researchers boosting clustering accuracy from different points of view. However, many state-of-the-art MVSC algorithms have quadratic or even cubic complexity, making them inefficient and inherently difficult to apply at large scales. In the era of big data, this computational issue becomes critical. To fill this gap, we propose a large-scale MVSC (LMVSC) algorithm with linear-order complexity. Inspired by the idea of the anchor graph, we first learn a smaller graph for each view. Then, a novel approach is designed to integrate those graphs so that spectral clustering can be implemented on a smaller graph. Interestingly, it turns out that our model also applies to the single-view scenario. Extensive experiments on various large-scale benchmark data sets validate the effectiveness and efficiency of our approach with respect to state-of-the-art clustering methods.
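
A rough sketch of the anchor-graph idea follows: each view gets an n-by-m anchor similarity matrix (m much smaller than n), and clustering runs on the small graph instead of a full n-by-n affinity. The k-means anchor selection, Gaussian kernel, and averaging-based integration are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def anchor_graph(view: np.ndarray, m: int = 50) -> np.ndarray:
    # Pick m anchors per view and build an n x m similarity matrix.
    anchors = KMeans(n_clusters=m, n_init=5, random_state=0).fit(view).cluster_centers_
    z = rbf_kernel(view, anchors)
    return z / z.sum(axis=1, keepdims=True)  # row-normalise

def lmvsc_like(views: list[np.ndarray], k: int, m: int = 50) -> np.ndarray:
    # Integrate the per-view anchor graphs by simple averaging (the paper's
    # integration step is more elaborate; this keeps the sketch linear in n).
    z = np.mean([anchor_graph(v, m) for v in views], axis=0)
    # A spectral embedding from the thin SVD of Z costs O(n m^2), not O(n^2).
    u, _, _ = np.linalg.svd(z, full_matrices=False)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(u[:, :k])
```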


Author(s):  
Gaetano Rossiello ◽  
Alfio Gliozzo ◽  
Michael Glass

We propose a novel approach to learning representations of relations expressed by their textual mentions. Our assumption is that if two pairs of entities belong to the same relation, then those two pairs are analogous. We collect a large set of analogous pairs by matching triples in knowledge bases with web-scale corpora through distant supervision. This dataset is used to train a hierarchical siamese network that learns entity-entity embeddings encoding relational information across the different linguistic paraphrases that express the same relation. The model can be used to generate pre-trained embeddings which provide a valuable signal when integrated into an existing neural-based model, outperforming the state-of-the-art methods on a relation extraction task.
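
To make the siamese training objective concrete, here is a minimal sketch: two entity-pair mentions that express the same relation should map to nearby embeddings. The encoder architecture, input dimensions, and contrastive loss below are generic stand-ins for the paper's hierarchical model.

```python
import torch
import torch.nn as nn

class PairEncoder(nn.Module):
    def __init__(self, in_dim: int = 300, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, emb_dim))

    def forward(self, x):  # x: a vector describing one entity-pair mention
        return nn.functional.normalize(self.net(x), dim=-1)

encoder = PairEncoder()
loss_fn = nn.CosineEmbeddingLoss(margin=0.5)

# Analogous pairs get label +1, non-analogous pairs -1 (distant supervision).
a, b = torch.randn(32, 300), torch.randn(32, 300)
labels = torch.randint(0, 2, (32,)) * 2 - 1
loss = loss_fn(encoder(a), encoder(b), labels.float())
loss.backward()
```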


2015 ◽  
Vol 77 (18) ◽  
Author(s):  
Maryam Mousavi ◽  
Azuraliza Abu Bakar

In recent years, clustering methods have attracted increasing attention for analysing and monitoring data streams. Density-based techniques are a notable category of clustering techniques, able to detect clusters of arbitrary shape in the presence of noise. However, finding clusters with locally varying densities is a difficult task. To handle this problem, this paper proposes a new density-based clustering algorithm for data streams. The algorithm improves the offline phase of density-based clustering through the MinPts parameter. The experimental results show that the proposed technique can improve clustering quality in data streams with varying densities.
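
The snippet below illustrates why a single MinPts setting struggles with varying densities, which is the problem the paper targets: the same DBSCAN parameters either absorb the sparse cluster into noise or over-segment the dense one. The data and parameter values are illustrative only, not the paper's algorithm.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
dense = rng.normal(0.0, 0.05, size=(200, 2))   # tight cluster
sparse = rng.normal(3.0, 0.60, size=(200, 2))  # loose cluster
batch = np.vstack([dense, sparse])

for min_pts in (5, 25):
    labels = DBSCAN(eps=0.3, min_samples=min_pts).fit_predict(batch)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    print(f"MinPts={min_pts}: {n_clusters} clusters, {n_noise} noise points")
```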


2020 ◽  
Vol 10 (13) ◽  
pp. 4458
Author(s):  
Nam Thanh Pham ◽  
Jong-Weon Lee ◽  
Chun-Su Park

In image forgery problems, previous works have chiefly been designed around only one of the two forgery types: copy-move or splicing. In this paper, we propose a scheme that handles both copy-move and splicing forgery by concurrently classifying the image forgery type and localizing the forged regions. The structural correlations between images are employed in a forgery clustering algorithm to assemble relevant images into clusters. Then, we search for matching image regions inside each cluster to classify and localize tampered images. Comprehensive experiments on three datasets (MICC-600, GRIP, and CASIA 2) demonstrate the better performance of the proposed method in forgery classification and localization compared with state-of-the-art methods. Further, in copy-move localization, the source and target regions are explicitly specified.
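
As a hedged sketch of the region-matching step for the copy-move case: matching an image's keypoints against themselves and keeping non-trivial pairs exposes duplicated regions. ORB features and the distance thresholds below are illustrative choices, not the paper's exact pipeline.

```python
import cv2
import numpy as np

def copy_move_matches(image_path: str, min_offset: float = 40.0):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    orb = cv2.ORB_create(nfeatures=2000)
    kp, desc = orb.detectAndCompute(img, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    pairs = []
    for m_list in matcher.knnMatch(desc, desc, k=2):
        if len(m_list) < 2:
            continue
        # Skip the trivial self-match (distance 0) and keep the second hit.
        m = m_list[1]
        p, q = np.array(kp[m.queryIdx].pt), np.array(kp[m.trainIdx].pt)
        # Matching points far apart in the same image suggest a cloned region.
        if np.linalg.norm(p - q) > min_offset and m.distance < 40:
            pairs.append((tuple(p), tuple(q)))
    return pairs
```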


2017 ◽  
Vol 15 (06) ◽  
pp. 1740006 ◽  
Author(s):  
Mohammad Arifur Rahman ◽  
Nathan LaPierre ◽  
Huzefa Rangwala ◽  
Daniel Barbara

Metagenomics is the collective sequencing of co-existing microbial communities, which are ubiquitous across various clinical and ecological environments. Due to the large volume of random short sequences (reads) obtained from community sequencing, analysing the diversity, abundance and functions of the different organisms within these communities is a challenging task. We present a fast and scalable clustering algorithm for analyzing large-scale metagenome sequence data. Our approach achieves efficiency by partitioning the large number of sequence reads into groups (called canopies) using hashing. These canopies are then refined using state-of-the-art sequence clustering algorithms. This canopy-clustering (CC) algorithm can be used as a pre-processing phase for computationally expensive clustering algorithms. We use and compare three hashing schemes for canopy construction with five popular and state-of-the-art sequence clustering methods. We evaluate our clustering algorithm on synthetic and real-world 16S and whole-metagenome benchmarks. We demonstrate the ability of our approach to determine meaningful Operational Taxonomic Units (OTUs) and observe a significant speedup in run time compared with different clustering algorithms. We also make our source code publicly available on GitHub.
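
A simplified sketch of the canopy-construction step follows: reads whose hashed k-mer signatures collide land in the same canopy, which a more expensive clusterer can then refine. The k-mer size and the single-hash min-hash signature are illustrative; the paper compares several hashing schemes.

```python
from collections import defaultdict

def kmers(read: str, k: int = 8):
    return (read[i:i + k] for i in range(len(read) - k + 1))

def canopy_key(read: str, k: int = 8) -> int:
    # MinHash with one hash function: the smallest k-mer hash value. Python's
    # str hash is salted per process but consistent within a single run.
    return min(hash(km) % (1 << 20) for km in kmers(read, k))

def build_canopies(reads: list[str]) -> dict[int, list[str]]:
    canopies = defaultdict(list)
    for read in reads:
        canopies[canopy_key(read)].append(read)
    return canopies

reads = ["ACGTACGTAGGCTA", "ACGTACGTAGGCTT", "TTTTGGGGCCCCAAAA"]
for key, members in build_canopies(reads).items():
    print(key, members)  # similar reads tend to share a canopy
```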


2010 ◽  
Vol 26-28 ◽  
pp. 108-112
Author(s):  
Mei Zheng ◽  
Chun Hua Ju ◽  
Zhang Rui

Research on data stream clustering has become a focus in the field of data stream mining. Because the volume of streaming data is very large while the computer's memory and CPU time are limited, it is difficult to carry out clustering quickly and effectively. To address this problem, we design an improved clustering algorithm for dynamic data streams based on principal component analysis and density. The PDStream algorithm effectively overcomes the shortcomings of the STREAM algorithm, which is dominated by historical data, and of the CluStream algorithm, which has difficulty describing non-spherical clusters and phasing out old data, resulting in a huge amount of retained data. In our experiments, compared with the STREAM algorithm, the PDStream algorithm shows superiority in handling massive data and delivers high-quality clustering.
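
A loose sketch of the two PDStream ingredients: reduce each arriving chunk with incremental PCA, then density-cluster in the reduced space. The chunk size, component count, and DBSCAN settings are illustrative assumptions, not the published algorithm.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
ipca = IncrementalPCA(n_components=2)

for chunk_id in range(3):  # stand-in for an unbounded stream of chunks
    chunk = rng.normal(size=(500, 10))
    ipca.partial_fit(chunk)            # update the projection as data arrives
    reduced = ipca.transform(chunk)
    labels = DBSCAN(eps=0.8, min_samples=10).fit_predict(reduced)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"chunk {chunk_id}: {n_clusters} clusters")
```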


Author(s):  
Rajesh Birok, et al.

Electrocardiogram (ECG) is a record of the electrical activity of the heart. It is used to identify a number of cardiac faults such as arrhythmias, atrial fibrillation (AF), etc. Quite often, the ECG gets corrupted by various kinds of artifacts; thus, in order to extract correct information, the signal must first be denoised. This paper presents a novel approach to filtering low-frequency artifacts from ECG signals using Complete Ensemble Empirical Mode Decomposition (CEEMD) and neural networks, which removes most of the constituent noise while ensuring no loss of information in terms of the morphology of the ECG signal. The contribution of the method lies in the fact that it combines the advantages of both CEEMD and ANN. The use of CEEMD ensures that the neural network does not get overfitted and significantly helps in building better predictors at individual frequency levels. The proposed method is compared with other state-of-the-art methods in terms of Mean Square Error (MSE), Signal-to-Noise Ratio (SNR) and Correlation Coefficient. The results show that the proposed method performs better than other state-of-the-art methods for low-frequency artifact removal from ECG.
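
A hedged sketch of the decomposition stage: CEEMDAN splits the signal into intrinsic mode functions (IMFs), the last of which carry the low-frequency baseline wander. Here the paper's per-IMF neural predictors are replaced by simply discarding the final IMFs, and the PyEMD package (installed as EMD-signal) is an assumed implementation choice, not the authors'.

```python
import numpy as np
from PyEMD import CEEMDAN

fs = 360                                      # sampling rate, Hz
t = np.arange(0, 10, 1 / fs)
ecg_like = np.sin(2 * np.pi * 1.2 * t)        # toy cardiac rhythm
wander = 0.5 * np.sin(2 * np.pi * 0.15 * t)   # low-frequency artifact
noisy = ecg_like + wander

imfs = CEEMDAN()(noisy)                       # shape: (n_imfs, len(signal))
# Keep all but the two lowest-frequency modes; a trained ANN would instead
# decide per IMF what to keep, which is what guards against overfitting.
denoised = imfs[:-2].sum(axis=0)
print(f"{imfs.shape[0]} IMFs; residual error: "
      f"{np.mean((denoised - ecg_like) ** 2):.4f}")
```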


Author(s):  
Wanshun Gao ◽  
Xi Zhao ◽  
Jun An ◽  
Jianhua Zou

In this paper, we propose a novel approach to 3D face reconstruction from multiple facial images. Given original pose-variant images, coarse 3D face templates are initialized and a refined 3D face mesh is reconstructed in an iterative manner. Then, we warp the original facial images to the 2D meshes projected from 3D using Sparse Mesh Affine Warp (SMAW). Finally, we weight the face patches in each view and map the patch with the higher weight to a canonical UV space. For facial images with arbitrary pose, invisible regions are filled with the corresponding UV patches. Poisson editing is applied to blend the different patches seamlessly. We evaluate the proposed method on the LFW dataset in terms of texture refinement and face recognition. The results demonstrate competitive performance compared with state-of-the-art methods.
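
The final blending step relies on Poisson editing; OpenCV's seamlessClone is one readily available implementation of it. The file names, full-patch mask, and centred placement below are placeholders for the UV-space patches described in the abstract.

```python
import cv2
import numpy as np

dst = cv2.imread("uv_texture.png")         # canonical UV texture so far
src = cv2.imread("patch_high_weight.png")  # patch from the best-weighted view

# Blend the whole patch; a real pipeline would mask only the visible region.
mask = 255 * np.ones(src.shape[:2], dtype=np.uint8)
center = (dst.shape[1] // 2, dst.shape[0] // 2)
blended = cv2.seamlessClone(src, dst, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("uv_texture_blended.png", blended)
```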


2020 ◽  
Author(s):  
Forrest C Koch ◽  
Gavin J Sutton ◽  
Irina Voineagu ◽  
Fatemeh Vafaee

A typical single-cell RNA sequencing (scRNA-seq) experiment will measure on the order of 20,000 transcripts and thousands, if not millions, of cells. The high dimensionality of such data presents serious complications for traditional data analysis methods and, as such, methods to reduce dimensionality play an integral role in many analysis pipelines. However, few studies benchmark the performance of these methods on scRNA-seq data, with existing comparisons assessing performance via downstream analysis accuracy measures which may confound the interpretation of their results. Here, we present the most comprehensive benchmark of dimensionality reduction methods in scRNA-seq data to date, utilizing over 300,000 compute hours to assess the performance of over 25,000 low-dimension embeddings across 33 dimensionality reduction methods and 55 scRNA-seq datasets (ranging from 66 to 27,500 cells). We employ a simple-yet-novel approach which does not rely on the results of downstream analyses. Internal validation measures (IVMs), traditionally used as an unsupervised method to assess clustering performance, are repurposed to measure how well-formed biological clusters are after dimensionality reduction. Performance was further evaluated using nearly 200,000,000 iterations of DBSCAN, a density-based clustering algorithm, showing that hyperparameter optimization using IVMs as the objective function leads to near-optimal clustering. Methods were also assessed on the extent to which they preserve the global structure of the data, and on their computational memory and time requirements across a large range of sample sizes. Our comprehensive benchmarking analysis provides a valuable resource for researchers, aims to guide best practice for dimensionality reduction in scRNA-seq analyses, and highlights LDA (Latent Dirichlet Allocation) and PHATE (Potential of Heat-diffusion for Affinity-based Transition Embedding) as high-performing algorithms.
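
A compact sketch of the benchmarking idea described above: score an embedding with an internal validation measure (silhouette here) and select DBSCAN's eps by maximising that score, with no reference to downstream labels. The parameter grid and the choice of silhouette as the IVM are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def tune_dbscan(embedding: np.ndarray, eps_grid=np.linspace(0.1, 2.0, 20)):
    """Pick eps by maximising an IVM (silhouette) on the reduced embedding."""
    best_eps, best_score = None, -1.0
    for eps in eps_grid:
        labels = DBSCAN(eps=eps, min_samples=10).fit_predict(embedding)
        core = labels != -1                  # ignore noise points
        if len(set(labels[core])) < 2:
            continue                         # silhouette needs >= 2 clusters
        score = silhouette_score(embedding[core], labels[core])
        if score > best_score:
            best_eps, best_score = eps, score
    return best_eps, best_score
```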

