Unified Embedding and Clustering

10.36227/techrxiv.16926754 ◽

2021 ◽

Author(s):

Mebarka Allaoui ◽

Mohammed Lamine Kherfi ◽

Abdelhakim Cheriet ◽

Abdelhamid Bouchachia

Keyword(s):

Loss Functions ◽

High Dimensional ◽

Original Structure ◽

Original Formulation ◽

Clustering Methods ◽

Manifold Embedding ◽

Data Points ◽

Real World Datasets ◽

State Of Art ◽

Novel Algorithm

In this paper, we introduce a novel algorithm that unifies manifold embedding and clustering (UEC) and efficiently predicts clustering assignments of high-dimensional data points in a new embedding space. The algorithm is based on a bi-objective optimisation problem combining embedding and clustering loss functions. This formulation simultaneously preserves the original structure of the data in the embedding space and produces better clustering assignments. Experimental results on a number of real-world datasets show that UEC is competitive with state-of-the-art clustering methods.

A Novel High-Dimensional Trajectories Construction Network based on Multi-Clustering Algorithm

10.21203/rs.3.rs-1060086/v1 ◽

2021 ◽

Author(s):

Feiyang Ren ◽

Yi Han ◽

Shaohan Wang ◽

He Jiang

Keyword(s):

Economic Analysis ◽

Clustering Algorithm ◽

Transportation Network ◽

High Dimensional ◽

Clustering Methods ◽

Marine Transportation ◽

Network Construction ◽

National Economic ◽

Multi Level ◽

State Of Art

A novel marine transportation network based on high-dimensional AIS data with a multi-level clustering algorithm is proposed to discover important waypoints in trajectories based on selected navigation features. The network has two parts: the calculation of major nodes with the CLIQUE and BIRCH clustering methods, and navigation network construction with edge construction theory. Unlike state-of-the-art work on navigation clustering, which uses only ship coordinates, the proposed method includes more high-dimensional features such as drafting, weather, and fuel consumption. In a comparison on historical AIS data, more than 220,133 lines of data covering 30 days were used to extract 440 major nodal points in less than 4 minutes on an ordinary PC (i5 processor). The proposed method can be applied to higher-dimensional data for better ship path planning or even national economic analysis. The current work shows good performance in distinguishing complex ship trajectories and great potential for future analytical predictions of the shipping transportation market.

A Preview on Subspace Clustering of High Dimensional Data

INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY ◽

10.24297/ijct.v6i3.4466 ◽

2013 ◽

Vol 6 (3) ◽

pp. 441-448 ◽

Cited By ~ 1

Author(s):

Sajid Nagi ◽

Dhruba Kumar Bhattacharyya ◽

Jugal K. Kalita

Keyword(s):

Search Strategy ◽

High Dimensional Data ◽

Subspace Clustering ◽

High Dimensional ◽

Expression Data ◽

Clustering Methods ◽

Top Down ◽

Data Points ◽

Low Dimensional ◽

Entire Dataset

When clustering high dimensional data, traditional clustering methods are found to be lacking since they consider all of the dimensions of the dataset in discovering clusters whereas only some of the dimensions are relevant. This may give rise to subspaces within the dataset where clusters may be found. Using feature selection, we can remove irrelevant and redundant dimensions by analyzing the entire dataset. The problem of automatically identifying clusters that exist in multiple and maybe overlapping subspaces of high dimensional data, allowing better clustering of the data points, is known as Subspace Clustering. There are two major approaches to subspace clustering based on search strategy. Top-down algorithms find an initial clustering in the full set of dimensions and evaluate the subspaces of each cluster, iteratively improving the results. Bottom-up approaches start from finding low dimensional dense regions, and then use them to form clusters. Based on a survey on subspace clustering, we identify the challenges and issues involved with clustering gene expression data.
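The bottom-up strategy described above can be illustrated with a minimal sketch of its first step: finding dense one-dimensional units by binning each dimension separately. This is only an illustration in the spirit of grid-based methods, not an algorithm taken from the survey; the function name `dense_units_1d` and the parameters `n_bins` and `min_count` are hypothetical, and all coordinates are assumed to lie in [0, 1).

```python
from collections import Counter


def dense_units_1d(points, n_bins, min_count):
    """Return {dimension: sorted bin indices holding at least min_count points}.

    Assumes every coordinate lies in [0, 1). n_bins and min_count are
    illustrative parameters, not values from the survey.
    """
    n_dims = len(points[0])
    dense = {}
    for d in range(n_dims):
        # Map each coordinate to an equal-width bin along dimension d.
        counts = Counter(min(int(p[d] * n_bins), n_bins - 1) for p in points)
        dense[d] = sorted(b for b, c in counts.items() if c >= min_count)
    return dense


# Dimension 0 is tightly grouped; dimension 1 is spread out.
points = [(0.11, 0.05), (0.12, 0.30), (0.13, 0.55), (0.14, 0.80), (0.15, 0.95)]
units = dense_units_1d(points, n_bins=5, min_count=3)
print(units)
```

Here only dimension 0 yields a dense unit, which hints that a cluster may exist in the subspace spanned by that dimension alone. A full bottom-up method would go on to combine dense units across dimensions into higher-dimensional candidates.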

EMM-CLODS: An Effective Microcluster and Minimal Pruning CLustering-Based Technique for Detecting Outliers in Data Streams

Complexity ◽

10.1155/2021/9178461 ◽

2021 ◽

Vol 2021 ◽

pp. 1-20

Author(s):

Mohamed Jaward Bah ◽

Hongzhi Wang ◽

Li-Hui Zhao ◽

Ji Zhang ◽

Jie Xiao

Keyword(s):

Outlier Detection ◽

Data Streams ◽

Experimental Studies ◽

Streaming Data ◽

Detection Accuracy ◽

Clustering Methods ◽

Time Consumption ◽

Data Points ◽

Evolving Data ◽

Real World Datasets

Detecting outliers in data streams is a challenging problem since, in a data stream scenario, scanning the data multiple times is infeasible, and the incoming streaming data keep evolving. Over the years, a common approach to outlier detection has been to use clustering-based methods, but these methods have inherent challenges and drawbacks. These include the difficulty of effectively clustering sparse data points (which depends on the quality of the clustering method), dealing with continuous fast-incoming data streams, high memory and time consumption, and insufficient outlier detection accuracy. This paper proposes an effective clustering-based approach to detect outliers in evolving data streams. We propose a new method called Effective Microcluster and Minimal pruning CLustering-based method for Outlier detection in Data Streams (EMM-CLODS). It is a clustering-based outlier detection approach that detects outliers in evolving data streams by first applying a microclustering technique to cluster dense data points and effectively handle objects within a sliding window according to the relevance of their status to their respective neighbors or position. The analysis from our experimental studies on both synthetic and real-world datasets shows that the technique performs well with minimal memory and time consumption when compared to the other baseline algorithms, making it a very promising technique for outlier detection problems in data streams.

The Sparse MinMax k-Means Algorithm for High-Dimensional Clustering

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/291 ◽

2020 ◽

Author(s):

Sayak Dey ◽

Swagatam Das ◽

Rammohan Mallipeddi

Keyword(s):

Real World ◽

Dimensional Space ◽

Sum Of Squares ◽

High Dimensional ◽

Clustering Methods ◽

High Dimensional Space ◽

Sparse Regularization ◽

Clustering Approach ◽

Real World Datasets ◽

High Dimensional Clustering

Classical clustering methods usually face tough challenges when we have a larger set of features compared to the number of items to be partitioned. We propose a Sparse MinMax k-Means Clustering approach by reformulating the objective of the MinMax k-Means algorithm (a variation of classical k-Means that minimizes the maximum intra-cluster variance instead of the sum of intra-cluster variances), into a new weighted between-cluster sum of squares (BCSS) form. We impose sparse regularization on these weights to make it suitable for high-dimensional clustering. We seek to use the advantages of the MinMax k-Means algorithm in the high-dimensional space to generate good quality clusters. The efficacy of the proposal is showcased through comparison against a few representative clustering methods over several real world datasets.
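The difference between the two objectives mentioned in the abstract can be sketched for a fixed cluster assignment. The snippet below is an illustration only: it uses the within-cluster sum of squared distances to the centroid as the per-cluster spread, and it does not implement the paper's weighted BCSS reformulation or the sparse regularization. The helper name `cluster_spreads` is hypothetical.

```python
def cluster_spreads(points, labels):
    """Sum of squared distances to the centroid, per cluster label."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    result = {}
    for lab, members in groups.items():
        dim = len(members[0])
        centroid = [sum(m[i] for m in members) / len(members) for i in range(dim)]
        result[lab] = sum(
            sum((m[i] - centroid[i]) ** 2 for i in range(dim)) for m in members
        )
    return result


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (10.0, 4.0)]
labels = [0, 0, 1, 1]
per_cluster = cluster_spreads(points, labels)

kmeans_objective = sum(per_cluster.values())  # classical k-means: total spread
minmax_objective = max(per_cluster.values())  # MinMax k-means: worst cluster only
print(per_cluster, kmeans_objective, minmax_objective)
```

Minimizing the maximum rather than the sum penalizes assignments that leave one cluster much more spread out than the others, even when the total is small.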

Partitioning subjects based on high-dimensional fMRI data: comparison of several clustering methods and studying the influence of ICA data reduction in big data

Behaviormetrika ◽

10.1007/s41237-019-00086-4 ◽

2019 ◽

Vol 46 (2) ◽

pp. 271-311 ◽

Cited By ~ 2

Author(s):

Jeffrey Durieux ◽

Tom F. Wilderjans

Keyword(s):

Big Data ◽

Data Reduction ◽

fMRI Data ◽

High Dimensional ◽

Clustering Methods ◽

Data Comparison

Mahalanobis distance informed by clustering

Information and Inference: A Journal of the IMA ◽

10.1093/imaiai/iay011 ◽

2018 ◽

Vol 8 (2) ◽

pp. 377-406

Author(s):

Almog Lahav ◽

Ronen Talmon ◽

Yuval Kluger

Keyword(s):

Mahalanobis Distance ◽

High Dimensional Data ◽

Hidden Variables ◽

Real Data ◽

Risk Groups ◽

High Dimensional ◽

Data Sets ◽

Kaplan Meier ◽

Data Points ◽

Survival Plot

A fundamental question in data analysis, machine learning and signal processing is how to compare data points. The choice of the distance metric is specifically challenging for high-dimensional data sets, where the problem of meaningfulness is more prominent (e.g. the Euclidean distance between images). In this paper, we propose to exploit a property of high-dimensional data that is usually ignored, which is the structure stemming from the relationships between the coordinates. Specifically, we show that organizing similar coordinates in clusters can be exploited for the construction of the Mahalanobis distance between samples. When the observable samples are generated by a nonlinear transformation of hidden variables, the Mahalanobis distance allows the recovery of the Euclidean distances in the hidden space. We illustrate the advantage of our approach on a synthetic example where the discovery of clusters of correlated coordinates improves the estimation of the principal directions of the samples. Our method was applied to real gene expression data for lung adenocarcinomas (lung cancer). Using the proposed metric, we found a partition of subjects into risk groups with good separation between their Kaplan–Meier survival plots.
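For readers unfamiliar with the metric, the following is a minimal sketch of the standard Mahalanobis distance in two dimensions, given a known covariance matrix. It does not reproduce the paper's clustering-informed construction of that matrix; the function name `mahalanobis_2d` and the example covariance are assumptions made for illustration.

```python
import math


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points x and y for a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - y[0], x[1] - y[1])
    # Quadratic form dx^T * inv * dx.
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(quad)


# Illustrative covariance: variance 4 along the first axis, 1 along the second.
cov = ((4.0, 0.0), (0.0, 1.0))
d_wide = mahalanobis_2d((0.0, 0.0), (2.0, 0.0), cov)    # along the high-variance axis
d_narrow = mahalanobis_2d((0.0, 0.0), (0.0, 2.0), cov)  # along the low-variance axis
print(d_wide, d_narrow)
```

The same Euclidean displacement counts for less along the high-variance axis, which is the rescaling the Mahalanobis distance provides.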

A Novel Density-based Technique for Outlier Detection of High Dimensional Data Utilizing Full Feature Space

Information Technology And Control ◽

10.5755/j01.itc.50.1.25588 ◽

2021 ◽

Vol 50 (1) ◽

pp. 138-152

Author(s):

Mujeeb Ur Rehman ◽

Dost Muhammad Khan

Keyword(s):

Data Mining ◽

Outlier Detection ◽

High Dimensional Data ◽

Research Work ◽

Feature Space ◽

High Dimensional ◽

Data Set ◽

Data Points ◽

Low Dimensional ◽

Intrinsic Feature

Recently, anomaly detection has attracted considerable attention from data mining researchers as its popularity has grown steadily in practical domains such as product marketing, fraud detection, medical diagnosis, fault detection and many other fields. High dimensional data subjected to outlier detection poses exceptional challenges for data mining experts because of the inherent problems of the curse of dimensionality and the resemblance of distant and adjoining points. Traditional algorithms and techniques were tested on the full feature space for outlier detection. Customary methodologies concentrate largely on low dimensional data and hence are ineffective at discovering anomalies in a data set with a high number of dimensions. Digging out anomalies in a high dimensional data set becomes very difficult and tedious when all subsets of projections need to be explored. All data points in high dimensional data behave like similar observations because of an intrinsic feature of such data, i.e., the distance between observations approaches zero as the number of dimensions tends towards infinity. This research work proposes a novel technique that explores the deviation among all data points and embeds its findings inside well-established density-based techniques. It is a state-of-the-art technique as it opens a new line of research towards resolving inherent problems of high dimensional data where outliers reside within clusters having different densities. A high dimensional dataset from the UCI Machine Learning Repository is chosen to test the proposed technique, and its results are compared with those of density-based techniques to evaluate its efficiency.
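The resemblance of distant and adjoining points mentioned above is often described as distance concentration: the gap between the nearest and farthest neighbor shrinks relative to the nearest distance as dimensionality grows. The sketch below measures that relative gap for uniformly random points. It is an illustration under assumed settings (uniform data, 200 points, a fixed seed), not the technique proposed in the paper, and `relative_contrast` is a hypothetical helper name.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest distance from one query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, p))  # Euclidean distance (Python 3.8+)
    return (max(dists) - min(dists)) / min(dists)


low = relative_contrast(200, 2)      # low-dimensional data
high = relative_contrast(200, 1000)  # high-dimensional data
print(low, high)
```

In this setup the contrast in 1000 dimensions comes out far smaller than in 2 dimensions, which is why distance-based outlier scores computed on the full feature space tend to lose discriminating power.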

A Practical Guide to Sparse k-Means Clustering for Studying Molecular Development of the Human Brain

Frontiers in Neuroscience ◽

10.3389/fnins.2021.668293 ◽

2021 ◽

Vol 15 ◽

Author(s):

Justin L. Balsor ◽

Keon Arbabi ◽

Desmond Singh ◽

Rachel Kwan ◽

Jonathan Zaslavsky ◽

...

Keyword(s):

Visual Cortex ◽

Human Brain ◽

Small Sample ◽

Postmortem Brain ◽

High Dimensional ◽

Clustering Methods ◽

Sample Sizes ◽

Human Visual Cortex ◽

Age Related ◽

Postmortem Brain Tissue

Studying the molecular development of the human brain presents unique challenges for selecting a data analysis approach. The rare and valuable nature of human postmortem brain tissue, especially for developmental studies, means the sample sizes are small (n), but the use of high throughput genomic and proteomic methods measure the expression levels for hundreds or thousands of variables [e.g., genes or proteins (p)] for each sample. This leads to a data structure that is high dimensional (p ≫ n) and introduces the curse of dimensionality, which poses a challenge for traditional statistical approaches. In contrast, high dimensional analyses, especially cluster analyses developed for sparse data, have worked well for analyzing genomic datasets where p ≫ n. Here we explore applying a lasso-based clustering method developed for high dimensional genomic data with small sample sizes. Using protein and gene data from the developing human visual cortex, we compared clustering methods. We identified an application of sparse k-means clustering [robust sparse k-means clustering (RSKC)] that partitioned samples into age-related clusters that reflect lifespan stages from birth to aging. RSKC adaptively selects a subset of the genes or proteins contributing to partitioning samples into age-related clusters that progress across the lifespan. This approach addresses a problem in current studies that could not identify multiple postnatal clusters. Moreover, clusters encompassed a range of ages like a series of overlapping waves illustrating that chronological- and brain-age have a complex relationship. In addition, a recently developed workflow to create plasticity phenotypes (Balsor et al., 2020) was applied to the clusters and revealed neurobiologically relevant features that identified how the human visual cortex changes across the lifespan. These methods can help address the growing demand for multimodal integration, from molecular machinery to brain imaging signals, to understand the human brain’s development.

APMFT: Anamoly Prediction Model for Financial Transactions Using Learning Methods in Machine Learning and Deep Learning

10.3233/apc210101 ◽

2021 ◽

Author(s):

R. Priyadarshini ◽

K. Anuratha ◽

N. Rajendran ◽

S. Sujeetha

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Prediction Models ◽

General Pattern ◽

High Dimensional ◽

Learning Methods ◽

Data Points ◽

The Times ◽

Financial Transactions ◽

Journal Entries

An anomaly is an uncommon observation that represents an outlier, i.e., a nonconforming case. According to the Oxford Dictionary of Mathematics, an anomaly is defined as an unusual and erroneous observation that usually does not follow the general pattern of the drawn population. Detecting anomalies is a data mining process that aims at finding the data points or patterns that do not conform to the overall pattern of the data. The study of anomalous behavior and its impact has been carried out in areas such as network security, finance, healthcare and earth sciences. The proper detection and prediction of anomalies are of great importance, as these rare observations may carry significant information. In today’s financial world, enterprise data is digitized and stored in the cloud, so there is a significant need to detect anomalies in financial data, which will help enterprises deal with the huge amount of auditing. Corporations and enterprises conduct audits on large numbers of ledgers and journal entries. The monitoring of such audits is performed manually most of the time. There should be proper anomaly detection in the high dimensional data published in ledger format for auditing purposes. This work aims at analyzing and predicting unusual, fraudulent financial transactions by employing a few machine learning and deep learning methods. If any anomaly, such as manipulation or tampering of data, is detected, such anomalies and errors can be identified and marked with proper proof with the help of machine-learning-based algorithms. The accuracy of the prediction is increased by 7% by implementing the proposed prediction models.
